public inbox for gcc-patches@gcc.gnu.org
* [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
@ 2023-09-14 12:43 Di Zhao OS
  2023-10-06  9:33 ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread
From: Di Zhao OS @ 2023-09-14 12:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3758 bytes --]

This is a new version of the patch on "nested FMA".
Sorry for updating this after so long, I've been studying and
writing micro cases to sort out the cause of the regression.

First, following previous discussion:
(https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)

1. From testing more altered cases, I don't think the
problem is that reassociation works locally, in that:

  1) On the example with multiplications:
	
        tmp1 = a + c * c + d * d + x * y;
        tmp2 = x * tmp1;
        result += (a + c + d + tmp2);

  Given "result" rewritten by width=2, the performance is
  worse if we rewrite "tmp1" with width=2. In contrast, if we
  remove the multiplications from the example (and make "tmp1"
  not single-used), and still rewrite "result" by width=2, then
  rewriting "tmp1" with width=2 is better. (Makes sense because
  the tree's depth at "result" is still smaller if we rewrite
  "tmp1".)

  2) I tried to modify the assembly code of the example without
  FMA, so the width of "result" is 4. On Ampere1 there's no
  obvious improvement. So although this is an interesting
  problem, it doesn't seem like the cause of the regression.
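
For reference, "rewritten by width=2" above means splitting the
serial chain of adds into two independent partial sums; a minimal
sketch with made-up variables, not the actual test case:

  /* Serial form: the dependence height is 4 adds.  */
  double
  serial (double r, double a, double b, double c, double d)
  {
    return (((r + a) + b) + c) + d;
  }

  /* width=2 form: s0 and s1 are independent, so the height drops
     to 3 adds.  */
  double
  width2 (double r, double a, double b, double c, double d)
  {
    double s0 = r + a;
    double s1 = b + c;
    s0 = s0 + d;
    return s0 + s1;
  }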

2. From assembly code of the case with FMA, one problem is
that rewriting "tmp1" to parallel didn't decrease the
minimum CPU cycles (taking MULT_EXPRs into account) but
increased code size, so the overhead is increased.

   a) When "tmp1" is not re-written to parallel:
        fmadd d31, d2, d2, d30
        fmadd d31, d3, d3, d31
        fmadd d31, d4, d5, d31	//"tmp1"                
        fmadd d31, d31, d4, d3

   b) When "tmp1" is re-written to parallel:
        fmul  d31, d4, d5      
        fmadd d27, d2, d2, d30 
        fmadd d31, d3, d3, d31 
        fadd  d31, d31, d27    	//"tmp1"
        fmadd d31, d31, d4, d3

For version a), there are 3 dependent FMAs to calculate "tmp1".
For version b), there are also 3 dependent instructions in the
longer path: the 1st, 3rd and 4th.

So it seems to me the current get_reassociation_width algorithm
isn't optimal in the presence of FMA. I therefore modified the patch
to improve get_reassociation_width, rather than check for code
patterns. (There could be some other complicated factors that
make the regression more obvious when there's "nested FMA",
but with this patch that should be avoided or reduced.)

With this patch the 508.namd_r 1-copy run has a 7% improvement on
Ampere1, and about 3% on Intel Xeon. While I'm still
collecting data on other CPUs, I'd like to know what you
think of this.

About changes in the patch:

1. When the op list forms a complete FMA chain, try to search
for a smaller width considering the benefit of using FMA. With
a smaller width, the increment of code size is smaller when
breaking the chain.

2. To avoid regressions, included the other patch
(https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
on this tracker again. This is because more FMA will be kept
with 1., so we need to rule out the loop dependent
FMA chains when param_avoid_fma_max_bits is set.
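
To illustrate the width search in 1. with concrete numbers, here is a
standalone sketch. The CEIL + log2 cost model is copied from the
patch below; the initial cycles_best is a stand-in for what
get_required_cycles would return for this op list:

  #include <stdio.h>

  #define CEIL(a, b) (((a) + (b) - 1) / (b))

  static int
  ceil_log2 (int x)
  {
    int l = 0;
    while ((1 << l) < x)
      l++;
    return l;
  }

  int
  main (void)
  {
    /* A complete FMA chain: 9 ops, 8 of them defined by MULT_EXPRs.  */
    int mult_num = 8, width = 4, width_min = 1;
    int cycles_best = 4; /* stand-in for get_required_cycles (9, 4) */

    while (width > width_min)
      {
        int width_mid = (width + width_min) / 2;
        int attempt_cycles
          = CEIL (mult_num, width_mid) + ceil_log2 (width_mid);
        /* cycles_best doesn't count the cycle of the multiplications,
           so compare with cycles_best + 1.  */
        if (cycles_best + 1 >= attempt_cycles)
          {
            width = width_mid;
            cycles_best = attempt_cycles - 1;
          }
        else if (width_min < width_mid)
          width_min = width_mid;
        else
          break;
      }
    printf ("chosen width: %d\n", width); /* prints 2 */
    return 0;
  }

So for a chain of 8 FMAs starting from width=4, the search settles on
width=2: two partitions already reach the minimal depth once the extra
multiply cycle is counted, and fewer partitions means less extra code
when breaking the chain.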

Thanks,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
        New function to check whether ranking the ops results in
        better parallelism.
        (get_reassociation_width): Add new parameters. Search for
        smaller width considering the benefit of FMA.
        (rank_ops_for_fma): Change return value to be number of
        MULT_EXPRs.
        (reassociate_bb): For 3 ops, refine the condition to call
        swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279.c: New test.

[-- Attachment #2: 0001-Consider-FMA-in-get_reassociation_width.patch --]
[-- Type: application/octet-stream, Size: 9254 bytes --]

From 35309fea033413977a4e5b927a26db7b4c1442e8 Mon Sep 17 00:00:00 2001
From: "dzhao.ampere" <di.zhao@amperecomputing.com>
Date: Thu, 14 Sep 2023 16:48:20 +0800
Subject: [PATCH] Consider FMA in get_reassociation_width

---
 gcc/testsuite/gcc.dg/pr110279.c |  62 ++++++++++++++
 gcc/tree-ssa-reassoc.cc         | 147 ++++++++++++++++++++++++++++----
 2 files changed, 194 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279.c

diff --git a/gcc/testsuite/gcc.dg/pr110279.c b/gcc/testsuite/gcc.dg/pr110279.c
new file mode 100644
index 00000000000..9dc72658bff
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279.c
@@ -0,0 +1,62 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
+/* { dg-additional-options "-march=armv8.2-a" } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+/* Check that FMAs with backedge dependency are avoided.  Otherwise no FMA
+   will be generated with "--param avoid-fma-max-bits=512".  */
+
+data_e foo1 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += (a * b + c * d);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e foo2 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += a * b + result + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e foo3 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += result + a * b + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+/* { dg-final { scan-tree-dump-times "Generated FMA" 3 "widening_mul"} } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..94db11edd4b 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5427,17 +5427,96 @@ get_required_cycles (int ops_num, int cpu_width)
   return res;
 }
 
+/* Given that LHS is the result SSA_NAME of OPS, returns whether ranking the ops
+   results in better parallelism.  */
+static bool
+rank_ops_for_better_parallelism_p (vec<operand_entry *> *ops, tree lhs)
+{
+  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
+     uarchs can execute results like:
+
+	_1 = a * b;
+	_2 = .FMA (c, d, _1);
+	acc_1 = acc_0 + _2;
+
+     in parallel, while turning it into
+
+	_1 = .FMA(a, b, acc_0);
+	acc_1 = .FMA(c, d, _1);
+
+     hinders that, because then the first FMA depends on the result of the
+     preceding iteration.  */
+  if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+		param_avoid_fma_max_bits))
+    {
+      /* Look for cross backedge dependency:
+	1. LHS is a phi argument in the same basic block where it is defined.
+	2. The result of the phi node is used in OPS.  */
+      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
+	  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+	    {
+	      tree op = PHI_ARG_DEF (phi, i);
+	      if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+		continue;
+	      tree phi_result = gimple_phi_result (phi);
+	      operand_entry *oe;
+	      unsigned int j;
+	      FOR_EACH_VEC_ELT (*ops, j, oe)
+		{
+		  if (TREE_CODE (oe->op) != SSA_NAME)
+		    continue;
+
+		  /* Result of phi is operand of PLUS_EXPR.  */
+		  if (oe->op == phi_result)
+		    return true;
+
+		  /* Check if result of phi is operand of MULT_EXPR.  */
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+		  if (is_gimple_assign (def_stmt)
+		      && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR)
+		    {
+		      tree rhs = gimple_assign_rhs1 (def_stmt);
+		      if (TREE_CODE (rhs) == SSA_NAME)
+			{
+			  if (rhs == phi_result)
+			    return true;
+			  def_stmt = SSA_NAME_DEF_STMT (rhs);
+			}
+		    }
+		  if (is_gimple_assign (def_stmt)
+		      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		    {
+		      if (gimple_assign_rhs1 (def_stmt) == phi_result
+			  || gimple_assign_rhs2 (def_stmt) == phi_result)
+			return true;
+		    }
+		}
+	    }
+	}
+    }
+
+  return false;
+}
+
 /* Returns an optimal number of registers to use for computation of
-   given statements.  */
+   given statements.
+
+   MULT_NUM is the number of MULT_EXPRs in OPS.  LHS is the result SSA_NAME of
+   the operators.  */
 
 static int
-get_reassociation_width (int ops_num, enum tree_code opc,
-			 machine_mode mode)
+get_reassociation_width (vec<operand_entry *> *ops, int mult_num, tree lhs,
+			 enum tree_code opc, machine_mode mode)
 {
   int param_width = param_tree_reassoc_width;
   int width;
   int width_min;
   int cycles_best;
+  int ops_num = ops->length ();
 
   if (param_width > 0)
     width = param_width;
@@ -5468,6 +5547,37 @@ get_reassociation_width (int ops_num, enum tree_code opc,
 	break;
     }
 
+  /* For a complete FMA chain, rewriting to parallel reduces the number of FMAs,
+     so the code size increases.  Check whether fewer partitions result in a
+     better (or the same) cycle count.  */
+  if (mult_num >= ops_num - 1 && width > 1)
+    {
+      width_min = 1;
+      while (width > width_min)
+	{
+	  int width_mid = (width + width_min) / 2;
+	  int elog = exact_log2 (width_mid);
+	  elog = elog >= 0 ? elog : floor_log2 (width_mid) + 1;
+	  int attempt_cycles = CEIL (mult_num, width_mid) + elog;
+	  /* Since CYCLES_BEST doesn't count the cycle of multiplications,
+	     compare with CYCLES_BEST + 1.  */
+	  if (cycles_best + 1 >= attempt_cycles)
+	    {
+	      width = width_mid;
+	      cycles_best = attempt_cycles - 1;
+	    }
+	  else if (width_min < width_mid)
+	    width_min = width_mid;
+	  else
+	    break;
+	}
+    }
+
+  /* If there's a loop-dependent FMA result, rewrite to avoid that.  This is
+     better than skipping the FMA candidates in widening_mul.  */
+  if (width == 1 && mult_num && rank_ops_for_better_parallelism_p (ops, lhs))
+    return 2;
+
   return width;
 }
 
@@ -6780,8 +6890,10 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
    Rearrange ops to -> e + a * b + c * d generates:
 
    _4  = .FMA (c_7(D), d_8(D), _3);
-   _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
+   _11 = .FMA (a_5(D), b_6(D), _4);
+
+   Return the number of MULT_EXPRs in the chain.  */
+static unsigned
 rank_ops_for_fma (vec<operand_entry *> *ops)
 {
   operand_entry *oe;
@@ -6813,7 +6925,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
      Putting ops that not def from mult in front can generate more FMAs.
 
      2. If all ops are defined with mult, we don't need to rearrange them.  */
-  if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
+  unsigned mult_num = ops_mult.length ();
+  if (mult_num >= 2 && mult_num != ops_length)
     {
       /* Put no-mult ops and mult ops alternately at the end of the
 	 queue, which is conducive to generating more FMA and reducing the
@@ -6829,9 +6942,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
     }
-  return false;
+  return mult_num;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -6995,9 +7107,10 @@ reassociate_bb (basic_block bb)
 	      else
 		{
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
-		  int ops_num = ops.length ();
+		  unsigned ops_num = ops.length ();
 		  int width;
-		  bool has_fma = false;
+		  /* Number of MULT_EXPRs in the op list.  */
+		  unsigned mult_num = 0;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7020,16 +7133,18 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      mult_num = rank_ops_for_fma (&ops);
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
+		  bool has_fma = mult_num >= 2 && mult_num != ops_num;
 		  if (!reassoc_insert_powi_p
-		      && ops.length () > 3
-		      && (width = get_reassociation_width (ops_num, rhs_code,
-							   mode)) > 1)
+		      && ops_num > 3
+		      && (width = get_reassociation_width (&ops, mult_num, lhs,
+							   rhs_code, mode))
+			   > 1)
 		    {
 		      if (dump_file && (dump_flags & TDF_DETAILS))
 			fprintf (dump_file,
@@ -7046,7 +7161,9 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3
+			  && (!has_fma
+			      || rank_ops_for_better_parallelism_p (&ops, lhs)))
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1



* Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-09-14 12:43 [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width Di Zhao OS
@ 2023-10-06  9:33 ` Richard Biener
  2023-10-08 16:39   ` Di Zhao OS
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-10-06  9:33 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: gcc-patches

On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
<dizhao@os.amperecomputing.com> wrote:
>
> This is a new version of the patch on "nested FMA".
> Sorry for updating this after so long, I've been studying and
> writing micro cases to sort out the cause of the regression.

Sorry for taking so long to reply.

> First, following previous discussion:
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
>
> 1. From testing more altered cases, I don't think the
> problem is that reassociation works locally, in that:
>
>   1) On the example with multiplications:
>
>         tmp1 = a + c * c + d * d + x * y;
>         tmp2 = x * tmp1;
>         result += (a + c + d + tmp2);
>
>   Given "result" rewritten by width=2, the performance is
>   worse if we rewrite "tmp1" with width=2. In contrast, if we
>   remove the multiplications from the example (and make "tmp1"
>   not single-used), and still rewrite "result" by width=2, then
>   rewriting "tmp1" with width=2 is better. (Makes sense because
>   the tree's depth at "result" is still smaller if we rewrite
>   "tmp1".)
>
>   2) I tried to modify the assembly code of the example without
>   FMA, so the width of "result" is 4. On Ampere1 there's no
>   obvious improvement. So although this is an interesting
>   problem, it doesn't seem like the cause of the regression.

OK, I see.

> 2. From assembly code of the case with FMA, one problem is
> that rewriting "tmp1" to parallel didn't decrease the
> minimum CPU cycles (taking MULT_EXPRs into account) but
> increased code size, so the overhead is increased.
>
>    a) When "tmp1" is not re-written to parallel:
>         fmadd d31, d2, d2, d30
>         fmadd d31, d3, d3, d31
>         fmadd d31, d4, d5, d31  //"tmp1"
>         fmadd d31, d31, d4, d3
>
>    b) When "tmp1" is re-written to parallel:
>         fmul  d31, d4, d5
>         fmadd d27, d2, d2, d30
>         fmadd d31, d3, d3, d31
>         fadd  d31, d31, d27     //"tmp1"
>         fmadd d31, d31, d4, d3
>
> For version a), there are 3 dependent FMAs to calculate "tmp1".
> For version b), there are also 3 dependent instructions in the
> longer path: the 1st, 3rd and 4th.

Yes, it doesn't really change anything.  The patch has

+  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
+     uarchs can execute results like:
+
+       _1 = a * b;
+       _2 = .FMA (c, d, _1);
+       acc_1 = acc_0 + _2;
+
+     in parallel, while turning it into
+
+       _1 = .FMA(a, b, acc_0);
+       acc_1 = .FMA(c, d, _1);
+
+     hinders that, because then the first FMA depends on the result of the
+     preceding iteration.  */

I can't see what can be run in parallel for the first case.  The .FMA
depends on the multiplication a * b.  Iff the uarch somehow decomposes
.FMA into multiply + add then the c * d multiply could run in parallel
with the a * b multiply which _might_ be able to hide some of the
latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
cycles but a multiply only 3.  But I never got confirmation from any
of the CPU designers that .FMAs are issued when the multiply
operands are ready and the add operand can be forwarded.

I also wonder why the multiplications of the two-FMA sequence
then cannot be executed at the same time?  So I have some doubt
of the theory above.

Iff this really is the reason for the sequence to execute with lower
overall latency and we want to attack this on GIMPLE then I think
we need a target hook telling us this fact (I also wonder if such
behavior can be modeled in the scheduler pipeline description at all?)

> So it seems to me the current get_reassociation_width algorithm
> isn't optimal in the presence of FMA. I therefore modified the patch
> to improve get_reassociation_width, rather than check for code
> patterns. (There could be some other complicated factors that
> make the regression more obvious when there's "nested FMA",
> but with this patch that should be avoided or reduced.)
>
> With this patch the 508.namd_r 1-copy run has a 7% improvement on
> Ampere1, and about 3% on Intel Xeon. While I'm still
> collecting data on other CPUs, I'd like to know what you
> think of this.
>
> About changes in the patch:
>
> 1. When the op list forms a complete FMA chain, try to search
> for a smaller width considering the benefit of using FMA. With
> a smaller width, the increment of code size is smaller when
> breaking the chain.

But this is all highly target specific (code size even more so).

How I understand your approach to fixing the issue leads me to
the suggestion to prioritize parallel rewriting, thus alter rank_ops_for_fma,
taking the reassoc width into account (the computed width should be
unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
rewriting of FMAs (well, they are not yet formed of course).
get_reassociation_width has 'get_required_cycles', the above theory
could be verified with a very simple toy pipeline model.  We'd have
to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
even FMA_EXPRs).

Taking the width of FMAs into account when computing the reassoc width
might be another way to attack this.
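
For illustration, a minimal sketch of such a toy model (only dataflow
is modeled, no issue width or ports; the latencies are made-up round
numbers in the spirit of the Zen figures above):

  #include <stdio.h>

  /* Assumed latencies, for illustration only.  */
  enum { FMUL = 3, FADD = 2, FMA_LAT = 4 };

  int
  main (void)
  {
    /* Form 1: _1 = a * b; _2 = .FMA (c, d, _1); acc_1 = acc_0 + _2;
       Only the final add reads the previous iteration's acc.  */
    int lat1 = FMUL + FMA_LAT + FADD; /* straight-line latency */
    int rec1 = FADD;                  /* loop-carried recurrence */

    /* Form 2: _1 = .FMA (a, b, acc_0); acc_1 = .FMA (c, d, _1);
       Both FMAs sit on the acc recurrence.  */
    int lat2 = FMA_LAT + FMA_LAT;
    int rec2 = FMA_LAT + FMA_LAT;

    printf ("form 1: latency %d, recurrence %d\n", lat1, rec1);
    printf ("form 2: latency %d, recurrence %d\n", lat2, rec2);
    return 0;
  }

With these numbers the two-FMA form wins on straight-line latency
(8 vs 9) but loses on the recurrence through acc (8 vs 2), so which
form is faster in a loop depends on whether the recurrence dominates.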

> 2. To avoid regressions, included the other patch
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> on this tracker again. This is because more FMA will be kept
> with 1., so we need to rule out the loop dependent
> FMA chains when param_avoid_fma_max_bits is set.

Sorry again for taking so long to reply.

I'll note we have an odd case on x86 Zen2(?) as well which we don't really
understand from a CPU behavior perspective.

Thanks,
Richard.

> Thanks,
> Di Zhao
>
> ----
>
>         PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
>         New function to check whether ranking the ops results in
>         better parallelism.
>         (get_reassociation_width): Add new parameters. Search for
>         smaller width considering the benefit of FMA.
>         (rank_ops_for_fma): Change return value to be number of
>         MULT_EXPRs.
>         (reassociate_bb): For 3 ops, refine the condition to call
>         swap_ops_for_binary_stmt.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279.c: New test.


* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-10-06  9:33 ` Richard Biener
@ 2023-10-08 16:39   ` Di Zhao OS
  2023-10-23  3:49     ` [PING][PATCH " Di Zhao OS
  2023-10-31 13:47     ` [PATCH " Richard Biener
  0 siblings, 2 replies; 18+ messages in thread
From: Di Zhao OS @ 2023-10-08 16:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 9162 bytes --]

Attached is a new version of the patch.

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Friday, October 6, 2023 5:33 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > This is a new version of the patch on "nested FMA".
> > Sorry for updating this after so long, I've been studying and
> > writing micro cases to sort out the cause of the regression.
> 
> Sorry for taking so long to reply.
> 
> > First, following previous discussion:
> > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> >
> > 1. From testing more altered cases, I don't think the
> > problem is that reassociation works locally, in that:
> >
> >   1) On the example with multiplications:
> >
> >         tmp1 = a + c * c + d * d + x * y;
> >         tmp2 = x * tmp1;
> >         result += (a + c + d + tmp2);
> >
> >   Given "result" rewritten by width=2, the performance is
> >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> >   remove the multiplications from the example (and make "tmp1"
> >   not single-used), and still rewrite "result" by width=2, then
> >   rewriting "tmp1" with width=2 is better. (Makes sense because
> >   the tree's depth at "result" is still smaller if we rewrite
> >   "tmp1".)
> >
> >   2) I tried to modify the assembly code of the example without
> >   FMA, so the width of "result" is 4. On Ampere1 there's no
> >   obvious improvement. So although this is an interesting
> >   problem, it doesn't seem like the cause of the regression.
> 
> OK, I see.
> 
> > 2. From assembly code of the case with FMA, one problem is
> > that rewriting "tmp1" to parallel didn't decrease the
> > minimum CPU cycles (taking MULT_EXPRs into account) but
> > increased code size, so the overhead is increased.
> >
> >    a) When "tmp1" is not re-written to parallel:
> >         fmadd d31, d2, d2, d30
> >         fmadd d31, d3, d3, d31
> >         fmadd d31, d4, d5, d31  //"tmp1"
> >         fmadd d31, d31, d4, d3
> >
> >    b) When "tmp1" is re-written to parallel:
> >         fmul  d31, d4, d5
> >         fmadd d27, d2, d2, d30
> >         fmadd d31, d3, d3, d31
> >         fadd  d31, d31, d27     //"tmp1"
> >         fmadd d31, d31, d4, d3
> >
> > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > For version b), there are also 3 dependent instructions in the
> > longer path: the 1st, 3rd and 4th.
> 
> Yes, it doesn't really change anything.  The patch has
> 
> +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
> +     uarchs can execute results like:
> +
> +       _1 = a * b;
> +       _2 = .FMA (c, d, _1);
> +       acc_1 = acc_0 + _2;
> +
> +     in parallel, while turning it into
> +
> +       _1 = .FMA(a, b, acc_0);
> +       acc_1 = .FMA(c, d, _1);
> +
> +     hinders that, because then the first FMA depends on the result of the
> +     preceding iteration.  */
> 
> I can't see what can be run in parallel for the first case.  The .FMA
> depends on the multiplication a * b.  Iff the uarch somehow decomposes
> .FMA into multiply + add then the c * d multiply could run in parallel
> with the a * b multiply which _might_ be able to hide some of the
> latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> cycles but a multiply only 3.  But I never got confirmation from any
> of the CPU designers that .FMAs are issued when the multiply
> operands are ready and the add operand can be forwarded.
> 
> I also wonder why the multiplications of the two-FMA sequence
> then cannot be executed at the same time?  So I have some doubt
> of the theory above.

The parallel execution for the code snippet above was the other
issue (previously discussed here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
Sorry it's a bit confusing to include that here, but these 2 fixes
need to be combined to avoid new regressions. Since considering
FMA in get_reassociation_width produces more results of width=1,
there would be more loop-dependent FMA chains.

> Iff this really is the reason for the sequence to execute with lower
> overall latency and we want to attack this on GIMPLE then I think
> we need a target hook telling us this fact (I also wonder if such
> behavior can be modeled in the scheduler pipeline description at all?)
> 
> > So it seems to me the current get_reassociation_width algorithm
> > isn't optimal in the presence of FMA. I therefore modified the patch
> > to improve get_reassociation_width, rather than check for code
> > patterns. (There could be some other complicated factors that
> > make the regression more obvious when there's "nested FMA",
> > but with this patch that should be avoided or reduced.)
> >
> > With this patch the 508.namd_r 1-copy run has a 7% improvement on
> > Ampere1, and about 3% on Intel Xeon. While I'm still
> > collecting data on other CPUs, I'd like to know what you
> > think of this.
> >
> > About changes in the patch:
> >
> > 1. When the op list forms a complete FMA chain, try to search
> > for a smaller width considering the benefit of using FMA. With
> > a smaller width, the increment of code size is smaller when
> > breaking the chain.
> 
> But this is all highly target specific (code size even more so).
>
> How I understand your approach to fixing the issue leads me to
> the suggestion to prioritize parallel rewriting, thus alter rank_ops_for_fma,
> taking the reassoc width into account (the computed width should be
> unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> rewriting of FMAs (well, they are not yet formed of course).
> get_reassociation_width has 'get_required_cycles', the above theory
> could be verified with a very simple toy pipeline model.  We'd have
> to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
> even FMA_EXPRs).
> 
> Taking the width of FMAs into account when computing the reassoc width
> might be another way to attack this.

Previously I tried to solve this generally, on the assumption that
FMA (smaller code size) is preferred. Now I agree it's difficult
since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
be different. 2) From my test results on the different machines we
have, it seems simply adding the cycles together is not a good way
to estimate the latency of consecutive FMAs.

I think an easier way to fix this is to add a parameter to suggest
the length of the complete FMA chain to keep. (It can be set by
target-specific tuning then.) And we can break longer FMA chains for
better parallelism. Attached is the new implementation. With
max-fma-chain-len=8, there's about 7% improvement in spec2017
508.namd_r on ampere1, and the overall improvement on fprate is
about 1%.

Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
others, I left it before get_reassociation_width so the number of
MULT_EXPRs can be used.
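
To illustrate the arithmetic with the numbers from pr110279-2.c in the
attached patch, here is a standalone sketch of the heuristic (the CEIL
bound is the one from the patch; the wrapper and the numbers in main
are scaffolding):

  #include <stdio.h>

  #define CEIL(a, b) (((a) + (b) - 1) / (b))

  /* Each FMA chain can absorb one non-mult op as its start value, and
     joining N chains takes N - 1 more adds, hence the bound
     num_fma_chain + (num_fma_chain - 1) >= num_others.  */
  static int
  fma_chain_width (int ops_num, int mult_num, int width,
                   int max_fma_chain_len)
  {
    if (width > 1 && mult_num >= 2 && max_fma_chain_len)
      {
        int num_others = ops_num - mult_num;
        int num_fma_chain = CEIL (num_others + 1, 2);

        if (num_fma_chain < width
            && CEIL (mult_num, num_fma_chain) <= max_fma_chain_len)
          width = num_fma_chain;
      }
    return width;
  }

  int
  main (void)
  {
    /* pr110279-2.c: 10 ops, 8 defined by MULT_EXPRs, initial width 4,
       --param max-fma-chain-len=8: two chains of four FMAs.  */
    printf ("%d\n", fma_chain_width (10, 8, 4, 8)); /* prints 2 */
    return 0;
  }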

> 
> > 2. To avoid regressions, included the other patch
> > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > on this tracker again. This is because more FMA will be kept
> > with 1., so we need to rule out the loop dependent
> > FMA chains when param_avoid_fma_max_bits is set.
> 
> Sorry again for taking so long to reply.
> 
> I'll note we have an odd case on x86 Zen2(?) as well which we don't really
> understand from a CPU behavior perspective.
> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Di Zhao
> >
> > ----
> >
> >         PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> >         New function to check whether ranking the ops results in
> >         better parallelism.
> >         (get_reassociation_width): Add new parameters. Search for
> >         smaller width considering the benefit of FMA.
> >         (rank_ops_for_fma): Change return value to be number of
> >         MULT_EXPRs.
> >         (reassociate_bb): For 3 ops, refine the condition to call
> >         swap_ops_for_binary_stmt.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/pr110279.c: New test.

Thanks,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * doc/invoke.texi: Description of param_max_fma_chain_len.
        * params.opt: New parameter param_max_fma_chain_len.
        * tree-ssa-reassoc.cc (get_reassociation_width):
        Support param_max_fma_chain_len; check for loop dependent
        FMAs.
        (rank_ops_for_fma): Return the number of MULT_EXPRs.
        (reassociate_bb): For 3 ops, refine the condition to call
        swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: New test.
        * gcc.dg/pr110279-2.c: New test.
        * gcc.dg/pr110279-3.c: New test.

[-- Attachment #2: 0001-Keep-FMA-chains-in-reassoc-based-on-new-parameter.patch --]
[-- Type: application/octet-stream, Size: 14002 bytes --]

From 4890f1a78a85e9b731b34bfded9fa79b782fc0a6 Mon Sep 17 00:00:00 2001
From: "dzhao.ampere" <di.zhao@amperecomputing.com>
Date: Sat, 7 Oct 2023 19:27:22 +0800
Subject: [PATCH] Keep FMA chains in reassoc based on new parameter

Add a new parameter param_max_fma_chain_len, to suggest the
maximum length of FMA chain to be kept in reassoc2.
---
 gcc/doc/invoke.texi               |   3 +
 gcc/params.opt                    |   4 +
 gcc/testsuite/gcc.dg/pr110279-1.c |  47 +++++++++++
 gcc/testsuite/gcc.dg/pr110279-2.c |  50 ++++++++++++
 gcc/testsuite/gcc.dg/pr110279-3.c |  65 +++++++++++++++
 gcc/tree-ssa-reassoc.cc           | 131 ++++++++++++++++++++++++++----
 6 files changed, 284 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-2.c
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-3.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 23083467d47..c928f4bfb0e 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16204,6 +16204,9 @@ Emit instrumentation calls to __tsan_func_entry() and __tsan_func_exit().
 Maximum number of instructions to copy when duplicating blocks on a
 finite state automaton jump thread path.
 
+@item max-fma-chain-len
+The maximum number of consecutive FMAs that we'd like not to break.
+
 @item threader-debug
 threader-debug=[none|all] Enables verbose dumping of the threader solver.
 
diff --git a/gcc/params.opt b/gcc/params.opt
index fffa8b1bc64..c22dbda1e9f 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -502,6 +502,10 @@ The maximum number of nested indirect inlining performed by early inliner.
 Common Joined UInteger Var(param_max_fields_for_field_sensitive) Param
 Maximum number of fields in a structure before pointer analysis treats the structure as a single variable.
 
+-param=max-fma-chain-len=
+Common Joined UInteger Var(param_max_fma_chain_len) IntegerRange(0, 512) Param Optimization
+The maximum number of consecutive FMAs that we'd like not to break.
+
 -param=max-fsm-thread-path-insns=
 Common Joined UInteger Var(param_max_fsm_thread_path_insns) Init(100) IntegerRange(1, 999999) Param Optimization
 Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path.
diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c b/gcc/testsuite/gcc.dg/pr110279-1.c
new file mode 100644
index 00000000000..51cccd5d8b7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-1.c
@@ -0,0 +1,47 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 --param max-fma-chain-len=8 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+__attribute_noinline__ data_e
+foo (data_e in)
+{
+  data_e a1, a2, a3, a4;
+  data_e tmp, result = 0;
+  a1 = in + 0.1;
+  a2 = in * 0.1;
+  a3 = in + 0.01;
+  a4 = in * 0.59;
+
+  data_e result2 = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* Test that a complete FMA chain with length=4 is not broken.  */
+      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
+      result += tmp - ic;
+      result2 = result2 / 2 - tmp;
+
+      a1 += 0.91;
+      a2 += 0.1;
+      a3 -= 0.01;
+      a4 -= 0.89;
+
+    }
+
+  return result + result2;
+}
+
+int
+main (int argc, char **argv)
+{
+  printf ("%f\n", foo (-1.2));
+}
+
+/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
+/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
new file mode 100644
index 00000000000..423e65d2d7f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -0,0 +1,50 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 --param max-fma-chain-len=8 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+data_e
+foo (data_e in)
+{
+  data_e a1, a2, a3, a4, a5, a6, a7, a8, a9, a10;
+  data_e tmp, result, result2 = 0;
+  a1 = in + 0.1;
+  a2 = in * 0.1;
+  a3 = in + 0.01;
+  a4 = in * 0.59;
+  a5 = in;
+  a6 = in * 2;
+  a7 = in - 0.1;
+  a8 = in * 0.09;
+  a9 = in * 2 - 2;
+  a10 = in / 2 + 0.7;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* op_num=10, mult_exprs_num=8. Test that the op list is broken into 2
+         complete FMA chains.  */
+      tmp = a1 * a1 + a2 * a2 + a3 * a3 + a4 + a5 * a5 + a6 + a7 * a7 + a8 * a8
+	    + a9 * a9 + a10 * a10;
+      result += tmp - ic;
+
+      a1 += 0.91;
+      a2 += 0.1;
+      a3 -= 0.01;
+      a4 -= 1.0;
+      a6 += 0.09;
+      a7 -= 1.9;
+      a8 = a1 + a2;
+      a9 += a2;
+      a10 -= a4;
+
+      result2 = result2 - a1 - tmp * 0.2;
+    }
+
+  return result * result2;
+}
+
+/* { dg-final { scan-tree-dump-times "Width = 2 was chosen for reassociation" 1 "reassoc2"} } */
+/* { dg-final { scan-tree-dump-times {\.FMA } 9 "optimized"} } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/pr110279-3.c b/gcc/testsuite/gcc.dg/pr110279-3.c
new file mode 100644
index 00000000000..d94474cc42d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-3.c
@@ -0,0 +1,65 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+/* Check that FMAs with backedge dependency are avoided.  Otherwise no FMA
+   will be generated with "--param avoid-fma-max-bits=512".  */
+
+data_e
+foo1 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += (a * b + c * d);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e
+foo2 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result = a * b + result + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e
+foo3 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result = result + a * b + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+/* { dg-final { scan-tree-dump-times "Generated FMA" 3 "widening_mul"} } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 41ee36413b5..babd58aa88a 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5431,16 +5431,20 @@ get_required_cycles (int ops_num, int cpu_width)
 }
 
 /* Returns an optimal number of registers to use for computation of
-   given statements.  */
+   given statements.
+
+   LHS is the result SSA_NAME of OPS.  MULT_NUM is the number of MULT_EXPRs in
+   OPS.  */
 
 static int
-get_reassociation_width (int ops_num, enum tree_code opc,
-			 machine_mode mode)
+get_reassociation_width (vec<operand_entry *> *ops, int mult_num, tree lhs,
+			 enum tree_code opc, machine_mode mode)
 {
   int param_width = param_tree_reassoc_width;
   int width;
   int width_min;
   int cycles_best;
+  int ops_num = ops->length ();
 
   if (param_width > 0)
     width = param_width;
@@ -5471,6 +5475,74 @@ get_reassociation_width (int ops_num, enum tree_code opc,
 	break;
     }
 
+  /* Check if keeping complete FMA chains is preferred.  */
+  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
+    {
+      /* num_fma_chain + (num_fma_chain - 1) >= num_others.  */
+      int num_others = ops_num - mult_num;
+      int num_fma_chain = CEIL (num_others + 1, 2);
+
+      if (num_fma_chain < width
+	  && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
+	width = num_fma_chain;
+    }
+
+  /* If there's a loop-dependent FMA result, return width=2 to avoid it.
+     This is
+     better than skipping these FMA candidates in widening_mul.  */
+  if (width == 1 && mult_num
+      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+		   param_avoid_fma_max_bits))
+    {
+      /* Look for cross backedge dependency:
+	1. LHS is a phi argument in the same basic block where it is defined.
+	2. The result of the phi node is used in OPS.  */
+      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
+	  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+	    {
+	      tree op = PHI_ARG_DEF (phi, i);
+	      if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+		continue;
+	      tree phi_result = gimple_phi_result (phi);
+	      operand_entry *oe;
+	      unsigned int j;
+	      FOR_EACH_VEC_ELT (*ops, j, oe)
+		{
+		  if (TREE_CODE (oe->op) != SSA_NAME)
+		    continue;
+
+		  /* Result of phi is operand of PLUS_EXPR.  */
+		  if (oe->op == phi_result)
+		    return 2;
+
+		  /* Check if result of phi is operand of MULT_EXPR.  */
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+		  if (is_gimple_assign (def_stmt)
+		      && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR)
+		    {
+		      tree rhs = gimple_assign_rhs1 (def_stmt);
+		      if (TREE_CODE (rhs) == SSA_NAME)
+			{
+			  if (rhs == phi_result)
+			    return 2;
+			  def_stmt = SSA_NAME_DEF_STMT (rhs);
+			}
+		    }
+		  if (is_gimple_assign (def_stmt)
+		      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		    {
+		      if (gimple_assign_rhs1 (def_stmt) == phi_result
+			  || gimple_assign_rhs2 (def_stmt) == phi_result)
+			return 2;
+		    }
+		}
+	    }
+	}
+    }
+
   return width;
 }
 
@@ -6783,8 +6855,10 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
    Rearrange ops to -> e + a * b + c * d generates:
 
    _4  = .FMA (c_7(D), d_8(D), _3);
-   _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
+   _11 = .FMA (a_5(D), b_6(D), _4);
+
+   Return the number of MULT_EXPRs in the chain.  */
+static int
 rank_ops_for_fma (vec<operand_entry *> *ops)
 {
   operand_entry *oe;
@@ -6798,9 +6872,26 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
       if (TREE_CODE (oe->op) == SSA_NAME)
 	{
 	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
-	  if (is_gimple_assign (def_stmt)
-	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
-	    ops_mult.safe_push (oe);
+	  if (is_gimple_assign (def_stmt))
+	    {
+	      if (gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		ops_mult.safe_push (oe);
+	      /* A negate on the multiplication leads to FNMA.  */
+	      else if (gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
+		       && TREE_CODE (gimple_assign_rhs1 (def_stmt)) == SSA_NAME)
+		{
+		  gimple *neg_def_stmt
+		    = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (def_stmt));
+		  if (is_gimple_assign (neg_def_stmt)
+		      && gimple_bb (neg_def_stmt) == gimple_bb (def_stmt)
+		      && gimple_assign_rhs_code (neg_def_stmt) == MULT_EXPR)
+		    ops_mult.safe_push (oe);
+		  else
+		    ops_others.safe_push (oe);
+		}
+	      else
+		ops_others.safe_push (oe);
+	    }
 	  else
 	    ops_others.safe_push (oe);
 	}
@@ -6816,7 +6907,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
      Putting ops that not def from mult in front can generate more FMAs.
 
      2. If all ops are defined with mult, we don't need to rearrange them.  */
-  if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
+  unsigned mult_num = ops_mult.length ();
+  if (mult_num >= 2 && mult_num != ops_length)
     {
       /* Put no-mult ops and mult ops alternately at the end of the
 	 queue, which is conducive to generating more FMA and reducing the
@@ -6832,9 +6924,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
     }
-  return false;
+  return mult_num;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -7000,7 +7091,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
-		  bool has_fma = false;
+		  int mult_num = 0;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7023,16 +7114,18 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      mult_num = rank_ops_for_fma (&ops);
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
+		  bool has_fma = mult_num >= 2 && mult_num != ops_num;
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
-		      && (width = get_reassociation_width (ops_num, rhs_code,
-							   mode)) > 1)
+		      && (width = get_reassociation_width (&ops, mult_num, lhs,
+							   rhs_code, mode))
+			   > 1)
 		    {
 		      if (dump_file && (dump_flags & TDF_DETAILS))
 			fprintf (dump_file,
@@ -7049,7 +7142,13 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3
+			  && (!has_fma
+			      /* width > 1 means ranking ops results in better
+				 parallelism.  */
+			      || get_reassociation_width (&ops, mult_num, lhs,
+							  rhs_code, mode)
+				   > 1))
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1



* [PING][PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-10-08 16:39   ` Di Zhao OS
@ 2023-10-23  3:49     ` Di Zhao OS
  2023-10-31 13:47     ` [PATCH " Richard Biener
  1 sibling, 0 replies; 18+ messages in thread
From: Di Zhao OS @ 2023-10-23  3:49 UTC (permalink / raw)
  To: Di Zhao OS, Richard Biener; +Cc: gcc-patches

Hello and Ping,

Thanks,
Di

> -----Original Message-----
> From: Di Zhao OS <dizhao@os.amperecomputing.com>
> Sent: Monday, October 9, 2023 12:40 AM
> To: Richard Biener <richard.guenther@gmail.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> Attached is a new version of the patch.
> 
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Friday, October 6, 2023 5:33 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > <dizhao@os.amperecomputing.com> wrote:
> > >
> > > This is a new version of the patch on "nested FMA".
> > > Sorry for updating this after so long, I've been studying and
> > > writing micro cases to sort out the cause of the regression.
> >
> > Sorry for taking so long to reply.
> >
> > > First, following previous discussion:
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> > >
> > > 1. From testing more altered cases, I don't think the
> > > problem is that reassociation works locally, in that:
> > >
> > >   1) On the example with multiplications:
> > >
> > >         tmp1 = a + c * c + d * d + x * y;
> > >         tmp2 = x * tmp1;
> > >         result += (a + c + d + tmp2);
> > >
> > >   Given "result" rewritten by width=2, the performance is
> > >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> > >   remove the multiplications from the example (and make "tmp1"
> > >   not single-used), and still rewrite "result" by width=2, then
> > >   rewriting "tmp1" with width=2 is better. (Makes sense because
> > >   the tree's depth at "result" is still smaller if we rewrite
> > >   "tmp1".)
> > >
> > >   2) I tried to modify the assembly code of the example without
> > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > >   obvious improvement. So although this is an interesting
> > >   problem, it doesn't seem like the cause of the regression.
> >
> > OK, I see.
> >
> > > 2. From assembly code of the case with FMA, one problem is
> > > that rewriting "tmp1" to parallel didn't decrease the
> > > minimum CPU cycles (taking MULT_EXPRs into account) but
> > > increased code size, so the overhead is increased.
> > >
> > >    a) When "tmp1" is not re-written to parallel:
> > >         fmadd d31, d2, d2, d30
> > >         fmadd d31, d3, d3, d31
> > >         fmadd d31, d4, d5, d31  //"tmp1"
> > >         fmadd d31, d31, d4, d3
> > >
> > >    b) When "tmp1" is re-written to parallel:
> > >         fmul  d31, d4, d5
> > >         fmadd d27, d2, d2, d30
> > >         fmadd d31, d3, d3, d31
> > >         fadd  d31, d31, d27     //"tmp1"
> > >         fmadd d31, d31, d4, d3
> > >
> > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > For version b), there are also 3 dependent instructions in the
> > > longer path: the 1st, 3rd and 4th.
> >
> > Yes, it doesn't really change anything.  The patch has
> >
> > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
> > +     uarchs can execute results like:
> > +
> > +       _1 = a * b;
> > +       _2 = .FMA (c, d, _1);
> > +       acc_1 = acc_0 + _2;
> > +
> > +     in parallel, while turning it into
> > +
> > +       _1 = .FMA(a, b, acc_0);
> > +       acc_1 = .FMA(c, d, _1);
> > +
> > +     hinders that, because then the first FMA depends on the result of the
> > +     preceding iteration.  */
> >
> > I can't see what can be run in parallel for the first case.  The .FMA
> > depends on the multiplication a * b.  Iff the uarch somehow decomposes
> > .FMA into multiply + add then the c * d multiply could run in parallel
> > with the a * b multiply which _might_ be able to hide some of the
> > latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> > cycles but a multiply only 3.  But I never got confirmation from any
> > of the CPU designers that .FMAs are issued when the multiply
> > operands are ready and the add operand can be forwarded.
> >
> > I also wonder why the multiplications of the two-FMA sequence
> > then cannot be executed at the same time?  So I have some doubt
> > of the theory above.
> 
> The parallel execution for the code snippet above was the other
> issue (previously discussed here:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> Sorry it's a bit confusing to include that here, but these 2 fixes
> need to be combined to avoid new regressions. Since considering
> FMA in get_reassociation_width produces more results of width=1,
> there would be more loop-dependent FMA chains.
> 
> > Iff this really is the reason for the sequence to execute with lower
> > overall latency and we want to attack this on GIMPLE then I think
> > we need a target hook telling us this fact (I also wonder if such
> > behavior can be modeled in the scheduler pipeline description at all?)
> >
> > > So it seems to me the current get_reassociation_width algorithm
> > > isn't optimal in the presence of FMA. I therefore modified the patch
> > > to improve get_reassociation_width, rather than check for code
> > > patterns. (There could be some other complicated factors that
> > > make the regression more obvious when there's "nested FMA",
> > > but with this patch that should be avoided or reduced.)
> > >
> > > With this patch the 508.namd_r 1-copy run has a 7% improvement on
> > > Ampere1, and about 3% on Intel Xeon. While I'm still
> > > collecting data on other CPUs, I'd like to know what you
> > > think of this.
> > >
> > > About changes in the patch:
> > >
> > > 1. When the op list forms a complete FMA chain, try to search
> > > for a smaller width considering the benefit of using FMA. With
> > > a smaller width, the increment of code size is smaller when
> > > breaking the chain.
> >
> > But this is all highly target specific (code size even more so).
> >
> > How I understand your approach to fixing the issue leads me to
> > the suggestion to prioritize parallel rewriting, thus alter rank_ops_for_fma,
> > taking the reassoc width into account (the computed width should be
> > unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> > rewriting of FMAs (well, they are not yet formed of course).
> > get_reassociation_width has 'get_required_cycles', the above theory
> > could be verified with a very simple toy pipeline model.  We'd have
> > to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
> > even FMA_EXPRs).
> >
> > Taking the width of FMAs into account when computing the reassoc width
> > might be another way to attack this.
> 
> Previously I tried to solve this generally, on the assumption that
> FMA (smaller code size) is preferred. Now I agree it's difficult
> since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> be different. 2) From my test results on the different machines we
> have, it seems simply adding the cycles together is not a good way
> to estimate the latency of consecutive FMAs.
> 
> I think an easier way to fix this is to add a parameter to suggest
> the length of the complete FMA chain to keep. (It can be set by
> target-specific tuning then.) And we can break longer FMA chains for
> better parallelism. Attached is the new implementation. With
> max-fma-chain-len=8, there's about 7% improvement in spec2017
> 508.namd_r on ampere1, and the overall improvement on fprate is
> about 1%.
> 
> Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> others, I left it before get_reassociation_width so the number of
> MULT_EXPRs can be used.
> 
> >
> > > 2. To avoid regressions, included the other patch
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > on this tracker again. This is because more FMA will be kept
> > > with 1., so we need to rule out the loop dependent
> > > FMA chains when param_avoid_fma_max_bits is set.
> >
> > Sorry again for taking so long to reply.
> >
> > I'll note we have an odd case on x86 Zen2(?) as well which we don't really
> > understand from a CPU behavior perspective.
> >
> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Di Zhao
> > >
> > > ----
> > >
> > >         PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > >         New function to check whether ranking the ops results in
> > >         better parallelism.
> > >         (get_reassociation_width): Add new parameters. Search for
> > >         smaller width considering the benefit of FMA.
> > >         (rank_ops_for_fma): Change return value to be number of
> > >         MULT_EXPRs.
> > >         (reassociate_bb): For 3 ops, refine the condition to call
> > >         swap_ops_for_binary_stmt.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279.c: New test.
> 
> Thanks,
> Di Zhao
> 
> ----
> 
>         PR tree-optimization/110279
> 
> gcc/ChangeLog:
> 
>         * doc/invoke.texi: Description of param_max_fma_chain_len.
>         * params.opt: New parameter param_max_fma_chain_len.
>         * tree-ssa-reassoc.cc (get_reassociation_width):
>         Support param_max_fma_chain_len; check for loop dependent
>         FMAs.
>         (rank_ops_for_fma): Return the number of MULT_EXPRs.
>         (reassociate_bb): For 3 ops, refine the condition to call
>         swap_ops_for_binary_stmt.
> 
> gcc/testsuite/ChangeLog:
> 
>         * gcc.dg/pr110279-1.c: New test.
>         * gcc.dg/pr110279-2.c: New test.
>         * gcc.dg/pr110279-3.c: New test.


* Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-10-08 16:39   ` Di Zhao OS
  2023-10-23  3:49     ` [PING][PATCH " Di Zhao OS
@ 2023-10-31 13:47     ` Richard Biener
  2023-11-09 17:53       ` Di Zhao OS
  1 sibling, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-10-31 13:47 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: gcc-patches

On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS <dizhao@os.amperecomputing.com> wrote:
>
> Attached is a new version of the patch.
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Friday, October 6, 2023 5:33 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > <dizhao@os.amperecomputing.com> wrote:
> > >
> > > This is a new version of the patch on "nested FMA".
> > > Sorry for updating this after so long, I've been studying and
> > > writing micro cases to sort out the cause of the regression.
> >
> > Sorry for taking so long to reply.
> >
> > > First, following previous discussion:
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> > >
> > > 1. From testing more altered cases, I don't think the
> > > problem is that reassociation works locally, in that:
> > >
> > >   1) On the example with multiplications:
> > >
> > >         tmp1 = a + c * c + d * d + x * y;
> > >         tmp2 = x * tmp1;
> > >         result += (a + c + d + tmp2);
> > >
> > >   Given "result" rewritten by width=2, the performance is
> > >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> > >   remove the multiplications from the example (and make "tmp1"
> > >   not single-used), and still rewrite "result" by width=2, then
> > >   rewriting "tmp1" with width=2 is better. (Makes sense because
> > >   the tree's depth at "result" is still smaller if we rewrite
> > >   "tmp1".)
> > >
> > >   2) I tried to modify the assembly code of the example without
> > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > >   obvious improvement. So although this is an interesting
> > >   problem, it doesn't seem like the cause of the regression.
> >
> > OK, I see.
> >
> > > 2. From assembly code of the case with FMA, one problem is
> > > that rewriting "tmp1" to parallel didn't decrease the
> > > minimum CPU cycles (taking MULT_EXPRs into account) but
> > > increased code size, so the overhead is increased.
> > >
> > >    a) When "tmp1" is not re-written to parallel:
> > >         fmadd d31, d2, d2, d30
> > >         fmadd d31, d3, d3, d31
> > >         fmadd d31, d4, d5, d31  //"tmp1"
> > >         fmadd d31, d31, d4, d3
> > >
> > >    b) When "tmp1" is re-written to parallel:
> > >         fmul  d31, d4, d5
> > >         fmadd d27, d2, d2, d30
> > >         fmadd d31, d3, d3, d31
> > >         fadd  d31, d31, d27     //"tmp1"
> > >         fmadd d31, d31, d4, d3
> > >
> > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > For version b), there are also 3 dependent instructions in the
> > > longer path: the 1st, 3rd and 4th.
> >
> > Yes, it doesn't really change anything.  The patch has
> >
> > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
> > +     uarchs can execute results like:
> > +
> > +       _1 = a * b;
> > +       _2 = .FMA (c, d, _1);
> > +       acc_1 = acc_0 + _2;
> > +
> > +     in parallel, while turning it into
> > +
> > +       _1 = .FMA(a, b, acc_0);
> > +       acc_1 = .FMA(c, d, _1);
> > +
> > +     hinders that, because then the first FMA depends on the result
> > of preceding
> > +     iteration.  */
> >
> > I can't see what can be run in parallel for the first case.  The .FMA
> > depends on the multiplication a * b.  Iff the uarch somehow decomposes
> > .FMA into multiply + add then the c * d multiply could run in parallel
> > with the a * b multiply which _might_ be able to hide some of the
> > latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> > cycles but a multiply only 3.  But I never got confirmation from any
> > of the CPU designers that .FMAs are issued when the multiply
> > operands are ready and the add operand can be forwarded.
> >
> > I also wonder why the multiplications of the two-FMA sequence
> > then cannot be executed at the same time?  So I have some doubt
> > of the theory above.
>
> The parallel execution for the code snippet above was the other
> issue (previously discussed here:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> Sorry it's a bit confusing to include that here, but these 2 fixes
> needs to be combined to avoid new regressions. Since considering
> FMA in get_reassociation_width produces more results of width=1,
> so there would be more loop depending FMA chains.
>
> > Iff this really is the reason for the sequence to execute with lower
> > overall latency and we want to attack this on GIMPLE then I think
> > we need a target hook telling us this fact (I also wonder if such
> > behavior can be modeled in the scheduler pipeline description at all?)
> >
> > > So it seems to me the current get_reassociation_width algorithm
> > > isn't optimal in the presence of FMA. So I modified the patch to
> > > improve get_reassociation_width, rather than check for code
> > > patterns. (Although there could be some other complicated
> > > factors so the regression is more obvious when there's "nested
> > > FMA". But with this patch that should be avoided or reduced.)
> > >
> > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > collecting data on other CPUs, I'd like to know how do you
> > > think of this.
> > >
> > > About changes in the patch:
> > >
> > > 1. When the op list forms a complete FMA chain, try to search
> > > for a smaller width considering the benefit of using FMA. With
> > > a smaller width, the increment of code size is smaller when
> > > breaking the chain.
> >
> > But this is all highly target specific (code size even more so).
> >
> > How I understand your approach to fixing the issue leads me to
> > the suggestion to prioritize parallel rewriting, thus alter rank_ops_for_fma,
> > taking the reassoc width into account (the computed width should be
> > unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> > rewriting of FMAs (well, they are not yet formed of course).
> > get_reassociation_width has 'get_required_cycles', the above theory
> > could be verified with a very simple toy pipeline model.  We'd have
> > to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
> > even FMA_EXPRs).
> >
> > Taking the width of FMAs into account when computing the reassoc width
> > might be another way to attack this.
>
> Previously I tried to solve this generally, on the assumption that
> FMA (smaller code size) is preferred. Now I agree it's difficult
> since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> be different. 2) From my test result on different machines we
> have, it seems simply adding the cycles together is not a good way
> to estimate the latency of consecutive FMA.
>
> I think an easier way to fix this is to add a parameter to suggest
> the length of complete FMA chain to keep. (It can be set by target
> specific tuning then.) And we can break longer FMA chains for
> better parallelism. Attached is the new implementation. With
> max-fma-chain-len=8, there's about 7% improvement in spec2017
> 508.namd_r on ampere1, and the overall improvement on fprate is
> about 1%.
>
> Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> others, I left it before get_reassociation_width so the number of
> MULT_EXPRs can be used.

Sorry again for the delay in replying.

+  /* Check if keeping complete FMA chains is preferred.  */
+  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
+    {
+      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
+      int num_others = ops_num - mult_num;
+      int num_fma_chain = CEIL (num_others + 1, 2);
+
+      if (num_fma_chain < width
+         && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
+       width = num_fma_chain;
+    }

so here 'mult_num' serves as a heuristic value for how many
FMAs we could build.  If that were close to ops_num - 1 then
we'd have a chain of FMAs.  Not sure how you get at
num_others / 2 here.  Maybe we need to elaborate on what an
FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e), f, g)
where each (b,c) pair is really just one operand in the ops array,
one of the 'mult's.  Thus an FMA chain is _not_
FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?

Forming an FMA chain effectively reduces the reassociation width
of the participating multiplies.  If we were not to form FMAs all
the multiplies could execute in parallel.
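
For illustration (a sketch, not from the patch): given an op list
"acc_0 + a * b + c * d + e * f", not forming FMAs keeps the multiplies
independent:

  _1 = a * b;
  _2 = c * d;
  _3 = e * f;          /* all three multiplies can issue in parallel */
  _4 = _1 + _2;
  _5 = _4 + _3;
  acc_1 = _5 + acc_0;

while the full FMA chain serializes each add part through the
accumulator:

  _1 = .FMA (a, b, acc_0);
  _2 = .FMA (c, d, _1);
  acc_1 = .FMA (e, f, _2);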

So what does the above do, in terms of adjusting the reassociation
width for the _adds_, and what's the ripple-down effect on later
FMA forming?

The change still feels like playing whack-a-mole rather than understanding
the fundamental issue on the targets.

+  /* If there's loop dependent FMA result, return width=2 to avoid it.  This is
+     better than skipping these FMA candidates in widening_mul.  */

better than skipping, but you don't touch it there?  I suppose width == 2
will bypass the skipping, right?  This heuristic only comes in when the above
change made width == 1, since otherwise we have an earlier

  if (width == 1)
    return width;

which also guarantees width == 2 was allowed by the hook/param, right?

+  if (width == 1 && mult_num
+      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+                  param_avoid_fma_max_bits))
+    {
+      /* Look for cross backedge dependency:
+       1. LHS is a phi argument in the same basic block it is defined.
+       2. And the result of the phi node is used in OPS.  */
+      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+       {
+         gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
+         for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+           {
+             tree op = PHI_ARG_DEF (phi, i);
+             if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+               continue;

I think it's easier to iterate over the immediate uses of LHS like

  FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
     if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
       {
          if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
(use_p))->src != bb)
            continue;
...
       }

otherwise I think _this_ part of the patch looks reasonable.

As you say heuristically they might go together but I think we should split the
patch - the cross-loop part can probably stand independently.  Can you adjust
and re-post?

As for the first part I still don't understand very well and am still hoping we
can get away without yet another knob to tune.

Richard.

> >
> > > 2. To avoid regressions, included the other patch
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > on this tracker again. This is because more FMA will be kept
> > > with 1., so we need to rule out the loop dependent
> > > FMA chains when param_avoid_fma_max_bits is set.
> >
> > Sorry again for taking so long to reply.
> >
> > I'll note we have an odd case on x86 Zen2(?) as well which we don't really
> > understand from a CPU behavior perspective.
> >
> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Di Zhao
> > >
> > > ----
> > >
> > >         PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > >         New function to check whether ranking the ops results in
> > >         better parallelism.
> > >         (get_reassociation_width): Add new parameters. Search for
> > >         smaller width considering the benefit of FMA.
> > >         (rank_ops_for_fma): Change return value to be number of
> > >         MULT_EXPRs.
> > >         (reassociate_bb): For 3 ops, refine the condition to call
> > >         swap_ops_for_binary_stmt.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279.c: New test.
>
> Thanks,
> Di Zhao
>
> ----
>
>         PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * doc/invoke.texi: Description of param_max_fma_chain_len.
>         * params.opt: New parameter param_max_fma_chain_len.
>         * tree-ssa-reassoc.cc (get_reassociation_width):
>         Support param_max_fma_chain_len; check for loop dependent
>         FMAs.
>         (rank_ops_for_fma): Return the number of MULT_EXPRs.
>         (reassociate_bb): For 3 ops, refine the condition to call
>         swap_ops_for_binary_stmt.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-1.c: New test.
>         * gcc.dg/pr110279-2.c: New test.
>         * gcc.dg/pr110279-3.c: New test.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-10-31 13:47     ` [PATCH " Richard Biener
@ 2023-11-09 17:53       ` Di Zhao OS
  2023-11-21 13:01         ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread
From: Di Zhao OS @ 2023-11-09 17:53 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 16117 bytes --]

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Tuesday, October 31, 2023 9:48 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS <dizhao@os.amperecomputing.com>
> wrote:
> >
> > Attached is a new version of the patch.
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Friday, October 6, 2023 5:33 PM
> > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > Cc: gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > get_reassociation_width
> > >
> > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > > <dizhao@os.amperecomputing.com> wrote:
> > > >
> > > > This is a new version of the patch on "nested FMA".
> > > > Sorry for updating this after so long, I've been studying and
> > > > writing micro cases to sort out the cause of the regression.
> > >
> > > Sorry for taking so long to reply.
> > >
> > > > First, following previous discussion:
> > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> > > >
> > > > 1. From testing more altered cases, I don't think the
> > > > problem is that reassociation works locally. In that:
> > > >
> > > >   1) On the example with multiplications:
> > > >
> > > >         tmp1 = a + c * c + d * d + x * y;
> > > >         tmp2 = x * tmp1;
> > > >         result += (a + c + d + tmp2);
> > > >
> > > >   Given "result" rewritten by width=2, the performance is
> > > >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > >   remove the multiplications from the example (and make "tmp1"
> > > >   not singe used), and still rewrite "result" by width=2, then
> > > >   rewriting "tmp1" with width=2 is better. (Make sense because
> > > >   the tree's depth at "result" is still smaller if we rewrite
> > > >   "tmp1".)
> > > >
> > > >   2) I tried to modify the assembly code of the example without
> > > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > > >   obvious improvement. So although this is an interesting
> > > >   problem, it doesn't seem like the cause of the regression.
> > >
> > > OK, I see.
> > >
> > > > 2. From assembly code of the case with FMA, one problem is
> > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > increased code size, so the overhead is increased.
> > > >
> > > >    a) When "tmp1" is not re-written to parallel:
> > > >         fmadd d31, d2, d2, d30
> > > >         fmadd d31, d3, d3, d31
> > > >         fmadd d31, d4, d5, d31  //"tmp1"
> > > >         fmadd d31, d31, d4, d3
> > > >
> > > >    b) When "tmp1" is re-written to parallel:
> > > >         fmul  d31, d4, d5
> > > >         fmadd d27, d2, d2, d30
> > > >         fmadd d31, d3, d3, d31
> > > >         fadd  d31, d31, d27     //"tmp1"
> > > >         fmadd d31, d31, d4, d3
> > > >
> > > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > > For version b), there are also 3 dependent instructions in the
> > > > longer path: the 1st, 3rd and 4th.
> > >
> > > Yes, it doesn't really change anything.  The patch has
> > >
> > > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop,
> some
> > > +     uarchs can execute results like:
> > > +
> > > +       _1 = a * b;
> > > +       _2 = .FMA (c, d, _1);
> > > +       acc_1 = acc_0 + _2;
> > > +
> > > +     in parallel, while turning it into
> > > +
> > > +       _1 = .FMA(a, b, acc_0);
> > > +       acc_1 = .FMA(c, d, _1);
> > > +
> > > +     hinders that, because then the first FMA depends on the result
> > > of preceding
> > > +     iteration.  */
> > >
> > > I can't see what can be run in parallel for the first case.  The .FMA
> > > depends on the multiplication a * b.  Iff the uarch somehow decomposes
> > > .FMA into multiply + add then the c * d multiply could run in parallel
> > > with the a * b multiply which _might_ be able to hide some of the
> > > latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> > > cycles but a multiply only 3.  But I never got confirmation from any
> > > of the CPU designers that .FMAs are issued when the multiply
> > > operands are ready and the add operand can be forwarded.
> > >
> > > I also wonder why the multiplications of the two-FMA sequence
> > > then cannot be executed at the same time?  So I have some doubt
> > > of the theory above.
> >
> > The parallel execution for the code snippet above was the other
> > issue (previously discussed here:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> > Sorry it's a bit confusing to include that here, but these 2 fixes
> > needs to be combined to avoid new regressions. Since considering
> > FMA in get_reassociation_width produces more results of width=1,
> > so there would be more loop depending FMA chains.
> >
> > > Iff this really is the reason for the sequence to execute with lower
> > > overall latency and we want to attack this on GIMPLE then I think
> > > we need a target hook telling us this fact (I also wonder if such
> > > behavior can be modeled in the scheduler pipeline description at all?)
> > >
> > > > So it seems to me the current get_reassociation_width algorithm
> > > > isn't optimal in the presence of FMA. So I modified the patch to
> > > > improve get_reassociation_width, rather than check for code
> > > > patterns. (Although there could be some other complicated
> > > > factors so the regression is more obvious when there's "nested
> > > > FMA". But with this patch that should be avoided or reduced.)
> > > >
> > > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > > collecting data on other CPUs, I'd like to know how do you
> > > > think of this.
> > > >
> > > > About changes in the patch:
> > > >
> > > > 1. When the op list forms a complete FMA chain, try to search
> > > > for a smaller width considering the benefit of using FMA. With
> > > > a smaller width, the increment of code size is smaller when
> > > > breaking the chain.
> > >
> > > But this is all highly target specific (code size even more so).
> > >
> > > How I understand your approach to fixing the issue leads me to
> > > the suggestion to prioritize parallel rewriting, thus alter
> rank_ops_for_fma,
> > > taking the reassoc width into account (the computed width should be
> > > unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> > > rewriting of FMAs (well, they are not yet formed of course).
> > > get_reassociation_width has 'get_required_cycles', the above theory
> > > could be verified with a very simple toy pipeline model.  We'd have
> > > to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
> > > even FMA_EXPRs).
> > >
> > > Taking the width of FMAs into account when computing the reassoc width
> > > might be another way to attack this.
> >
> > Previously I tried to solve this generally, on the assumption that
> > FMA (smaller code size) is preferred. Now I agree it's difficult
> > since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> > be different. 2) From my test result on different machines we
> > have, it seems simply adding the cycles together is not a good way
> > to estimate the latency of consecutive FMA.
> >
> > I think an easier way to fix this is to add a parameter to suggest
> > the length of complete FMA chain to keep. (It can be set by target
> > specific tuning then.) And we can break longer FMA chains for
> > better parallelism. Attached is the new implementation. With
> > max-fma-chain-len=8, there's about 7% improvement in spec2017
> > 508.namd_r on ampere1, and the overall improvement on fprate is
> > about 1%.
> >
> > Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> > others, I left it before get_reassociation_width so the number of
> > MULT_EXPRs can be used.
> 
> Sorry again for the delay in replying.
> 
> +  /* Check if keeping complete FMA chains is preferred.  */
> +  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> +    {
> +      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
> +      int num_others = ops_num - mult_num;
> +      int num_fma_chain = CEIL (num_others + 1, 2);
> +
> +      if (num_fma_chain < width
> +         && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
> +       width = num_fma_chain;
> +    }
> 
> > so here 'mult_num' serves as a heuristic value for how many
> FMAs we could build.  If that were close to ops_num - 1 then
> we'd have a chain of FMAs.  Not sure how you get at
> num_others / 2 here.  Maybe we need to elaborate on what an
> FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e), f, g)
> where each (b,c) pair is really just one operand in the ops array,
> > one of the 'mult's.  Thus an FMA chain is _not_
> FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?

The "FMA chain" here refers to consecutive FMAs, each taking 
The previous one's result as the third operator, i.e. 
... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So original op
list looks like "r + a * b + c * d + e * f + ...". These FMAs
will end up using the same accumulate register.

When num_others=2 or 3, there can be 2 complete chains, e.g.
	FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
or
	FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
And so on, that's where the "CEIL (num_others + 1, 2)" comes from.
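
As a worked instance of the formula (reusing the second example above):
an op list "c + h + i + a * b + d * e + f * g" has ops_num = 6 and
mult_num = 3, so num_others = 3 and num_fma_chain = CEIL (3 + 1, 2) = 2,
with at most CEIL (3, 2) = 2 FMAs per chain.  If that is within
param_max_fma_chain_len, width becomes 2, matching
	FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .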

> 
> Forming an FMA chain effectively reduces the reassociation width
> of the participating multiplies.  If we were not to form FMAs all
> the multiplies could execute in parallel.
> 
> So what does the above do, in terms of adjusting the reassociation
> width for the _adds_, and what's the ripple-down effect on later
> FMA forming?
> 

The above code calculates the number of such FMA chains in the op
list. And if the length of each chain doesn't exceed
param_max_fma_chain_len, then width is set to the number of chains,
so we won't break them (because rewrite_expr_tree_parallel handles
this well).
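
(As a sketch of the intent rather than the exact rewrite: with width = 2
the two chains above stay intact and only their results are combined,
roughly

	_t1 = c + a * b + d * e;   /* later becomes FMA (d, e, FMA (a, b, c)) */
	_t2 = h + f * g + i;       /* later becomes FMA (f, g, h) + i */
	result = _t1 + _t2;

so no chain is broken when widening_mul forms the FMAs.)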

> The change still feels like playing whack-a-mole rather than understanding
> the fundamental issue on the targets.

I think the complexity is in how the instructions are pipelined.
Some Arm CPUs such as Neoverse V2 support "late-forwarding":
"FP multiply-accumulate pipelines support late-forwarding of
accumulate operands from similar μOPs, allowing a typical
sequence of multiply-accumulate μOPs to issue one every N
cycles". ("N" is smaller than the latency of a single FMA
instruction.) So keeping such FMA chains can exploit this
feature and use fewer FP units. I guess the case is similar on
some recent x86 CPUs.

If we try to compute the minimum cycles of each option, I think
we'll at least need to know whether the target has such a
feature, and the latency of each uop. Using an empirical
length of beneficial FMA chain could be a shortcut.
(Maybe allowing different lengths for different data widths is
better.)
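
For example (cycle numbers purely illustrative, assuming lat (FMA) = 4
and a late-forwarding interval N = 2):

	acc_1 = .FMA (a, b, acc_0);   /* issues at cycle 0 */
	acc_2 = .FMA (c, d, acc_1);   /* cycle 2, accumulator forwarded */
	acc_3 = .FMA (e, f, acc_2);   /* cycle 4 */

The chain then completes around cycle 4 + lat (FMA) = 8 on a single FP
pipe, instead of 3 * lat (FMA) = 12 if each FMA had to wait out the full
latency of its predecessor.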

> 
> +  /* If there's loop dependent FMA result, return width=2 to avoid it.  This
> is
> +     better than skipping these FMA candidates in widening_mul.  */
> 
> better than skipping, but you don't touch it there?  I suppose width == 2
> will bypass the skipping, right?  This heuristic only comes in when the above
> change made width == 1, since otherwise we have an earlier
> 
>   if (width == 1)
>     return width;
> 
> which also guarantees width == 2 was allowed by the hook/param, right?

Yes, that's right.

> 
> +  if (width == 1 && mult_num
> +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
> +                  param_avoid_fma_max_bits))
> +    {
> +      /* Look for cross backedge dependency:
> +       1. LHS is a phi argument in the same basic block it is defined.
> +       2. And the result of the phi node is used in OPS.  */
> +      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> +      gimple_stmt_iterator gsi;
> +      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +       {
> +         gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
> +         for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> +           {
> +             tree op = PHI_ARG_DEF (phi, i);
> +             if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
> +               continue;
> 
> I think it's easier to iterate over the immediate uses of LHS like
> 
>   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
>      if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
>        {
>           if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> (use_p))->src != bb)
>             continue;
> ...
>        }
> 
> otherwise I think _this_ part of the patch looks reasonable.
> 
> As you say heuristically they might go together but I think we should split
> the
> patch - the cross-loop part can probably stand independently.  Can you adjust
> and re-post?

Attached is the separated part for cross-loop FMA. Thank you for the correction.

> 
> As for the first part I still don't understand very well and am still hoping
> we
> can get away without yet another knob to tune.
> 
> Richard.
> 
> > >
> > > > 2. To avoid regressions, included the other patch
> > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > > on this tracker again. This is because more FMA will be kept
> > > > with 1., so we need to rule out the loop dependent
> > > > FMA chains when param_avoid_fma_max_bits is set.
> > >
> > > Sorry again for taking so long to reply.
> > >
> > > I'll note we have an odd case on x86 Zen2(?) as well which we don't really
> > > understand from a CPU behavior perspective.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks,
> > > > Di Zhao
> > > >
> > > > ----
> > > >
> > > >         PR tree-optimization/110279
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > > >         New function to check whether ranking the ops results in
> > > >         better parallelism.
> > > >         (get_reassociation_width): Add new parameters. Search for
> > > >         smaller width considering the benefit of FMA.
> > > >         (rank_ops_for_fma): Change return value to be number of
> > > >         MULT_EXPRs.
> > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > >         swap_ops_for_binary_stmt.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.dg/pr110279.c: New test.
> >
> > Thanks,
> > Di Zhao
> >
> > ----
> >
> >         PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> >         * doc/invoke.texi: Description of param_max_fma_chain_len.
> >         * params.opt: New parameter param_max_fma_chain_len.
> >         * tree-ssa-reassoc.cc (get_reassociation_width):
> >         Support param_max_fma_chain_len; check for loop dependent
> >         FMAs.
> >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> >         (reassociate_bb): For 3 ops, refine the condition to call
> >         swap_ops_for_binary_stmt.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/pr110279-1.c: New test.
> >         * gcc.dg/pr110279-2.c: New test.
> >         * gcc.dg/pr110279-3.c: New test.

---

        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-reassoc.cc (get_reassociation_width): Check
        for loop dependent FMAs.
        (reassociate_bb): For 3 ops, refine the condition to call
        swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: New test.

[-- Attachment #2: 0001-swap-ops-in-reassoc-to-reduce-cross-backedge-FMA.patch --]
[-- Type: application/octet-stream, Size: 6294 bytes --]

From 66ec486c850e71a754d5e0f1a6703ecf66e7d048 Mon Sep 17 00:00:00 2001
From: "Di Zhao" <dizhao@os.amperecomputing.com>
Date: Thu, 9 Nov 2023 15:06:37 +0800
Subject: [PATCH] swap ops in reassoc to reduce cross backedge FMA

Previously, for ops.length >= 3, when FMA is present we don't
rank the operands, so that more FMAs can be preserved. But this
brings more FMAs with loop dependency, which leads to worse
performance on some targets.

Rank the operands (set width=2) when:
1. avoid_fma_max_bits is set, and
2. a loop dependent FMA sequence is found.

In this way, we don't have to discard all the FMA candidates
in the badly shaped sequence in widening_mul; instead we can
keep fewer FMAs, ones without loop dependency.

With this patch, there's about 2% improvement in 510.parest_r
1-copy run on ampere1 (with "-Ofast -mcpu=ampere1 -flto
--param avoid-fma-max-bits=512").
---
 gcc/testsuite/gcc.dg/pr110279-1.c | 65 ++++++++++++++++++++++++++
 gcc/tree-ssa-reassoc.cc           | 77 ++++++++++++++++++++++++++++---
 2 files changed, 136 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-1.c

diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c b/gcc/testsuite/gcc.dg/pr110279-1.c
new file mode 100644
index 00000000000..f25b6aec967
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-1.c
@@ -0,0 +1,65 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+/* Check that FMAs with backedge dependency are avoided.  Otherwise no FMA
+   will be generated with "--param avoid-fma-max-bits=512".   */
+
+data_e
+foo1 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += (a * b + c * d);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e
+foo2 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result = a * b + result + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e
+foo3 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result = result + a * b + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+/* { dg-final { scan-tree-dump-times "Generated FMA" 3 "widening_mul"} } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 26321aa4fc5..07fc8e2f1f9 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5431,16 +5431,19 @@ get_required_cycles (int ops_num, int cpu_width)
 }
 
 /* Returns an optimal number of registers to use for computation of
-   given statements.  */
+   given statements.
+
+   LHS is the result ssa name of OPS.  */
 
 static int
-get_reassociation_width (int ops_num, enum tree_code opc,
-			 machine_mode mode)
+get_reassociation_width (vec<operand_entry *> *ops, tree lhs,
+			 enum tree_code opc, machine_mode mode)
 {
   int param_width = param_tree_reassoc_width;
   int width;
   int width_min;
   int cycles_best;
+  int ops_num = ops->length ();
 
   if (param_width > 0)
     width = param_width;
@@ -5471,6 +5474,61 @@ get_reassociation_width (int ops_num, enum tree_code opc,
 	break;
     }
 
+  /* If there's loop dependent FMA result, return width=2 to avoid it.  This is
+     better than skipping these FMA candidates in widening_mul.  */
+  if (width == 1
+      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+		   param_avoid_fma_max_bits))
+    {
+      /* Look for cross backedge dependency:
+	1. LHS is a phi argument in the same basic block it is defined.
+	2. And the result of the phi node is used in OPS.  */
+      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+
+      use_operand_p use_p;
+      imm_use_iterator iter;
+      FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
+	if (gphi *phi = dyn_cast<gphi *> (USE_STMT (use_p)))
+	  {
+	    if (gimple_phi_arg_edge (phi, phi_arg_index_from_use (use_p))->src
+		!= bb)
+	      continue;
+	    tree phi_result = gimple_phi_result (phi);
+	    operand_entry *oe;
+	    unsigned int j;
+	    FOR_EACH_VEC_ELT (*ops, j, oe)
+	      {
+		if (TREE_CODE (oe->op) != SSA_NAME)
+		  continue;
+
+		/* Result of phi is operand of PLUS_EXPR.  */
+		if (oe->op == phi_result)
+		  return 2;
+
+		/* Check if result of phi is operand of MULT_EXPR.  */
+		gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+		if (is_gimple_assign (def_stmt)
+		    && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR)
+		  {
+		    tree rhs = gimple_assign_rhs1 (def_stmt);
+		    if (TREE_CODE (rhs) == SSA_NAME)
+		      {
+			if (rhs == phi_result)
+			  return 2;
+			def_stmt = SSA_NAME_DEF_STMT (rhs);
+		      }
+		  }
+		if (is_gimple_assign (def_stmt)
+		    && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		  {
+		    if (gimple_assign_rhs1 (def_stmt) == phi_result
+			|| gimple_assign_rhs2 (def_stmt) == phi_result)
+		      return 2;
+		  }
+	      }
+	  }
+    }
+
   return width;
 }
 
@@ -7031,8 +7089,9 @@ reassociate_bb (basic_block bb)
 		     with initial linearization.  */
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
-		      && (width = get_reassociation_width (ops_num, rhs_code,
-							   mode)) > 1)
+		      && (width
+			  = get_reassociation_width (&ops, lhs, rhs_code, mode))
+			   > 1)
 		    {
 		      if (dump_file && (dump_flags & TDF_DETAILS))
 			fprintf (dump_file,
@@ -7049,7 +7108,13 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3
+			  && (!has_fma
+			      /* width > 1 means ranking ops results in better
+				 parallelism.  */
+			      || get_reassociation_width (&ops, lhs, rhs_code,
+							  mode)
+				   > 1))
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-11-09 17:53       ` Di Zhao OS
@ 2023-11-21 13:01         ` Richard Biener
  2023-11-29 14:35           ` Di Zhao OS
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-11-21 13:01 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: gcc-patches

On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS <dizhao@os.amperecomputing.com> wrote:
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Tuesday, October 31, 2023 9:48 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS <dizhao@os.amperecomputing.com>
> > wrote:
> > >
> > > Attached is a new version of the patch.
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > Cc: gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > get_reassociation_width
> > > >
> > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > > > <dizhao@os.amperecomputing.com> wrote:
> > > > >
> > > > > This is a new version of the patch on "nested FMA".
> > > > > Sorry for updating this after so long, I've been studying and
> > > > > writing micro cases to sort out the cause of the regression.
> > > >
> > > > Sorry for taking so long to reply.
> > > >
> > > > > First, following previous discussion:
> > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> > > > >
> > > > > 1. From testing more altered cases, I don't think the
> > > > > problem is that reassociation works locally. In that:
> > > > >
> > > > >   1) On the example with multiplications:
> > > > >
> > > > >         tmp1 = a + c * c + d * d + x * y;
> > > > >         tmp2 = x * tmp1;
> > > > >         result += (a + c + d + tmp2);
> > > > >
> > > > >   Given "result" rewritten by width=2, the performance is
> > > > >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > > >   remove the multiplications from the example (and make "tmp1"
> > > > >   not singe used), and still rewrite "result" by width=2, then
> > > > >   rewriting "tmp1" with width=2 is better. (Make sense because
> > > > >   the tree's depth at "result" is still smaller if we rewrite
> > > > >   "tmp1".)
> > > > >
> > > > >   2) I tried to modify the assembly code of the example without
> > > > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > > > >   obvious improvement. So although this is an interesting
> > > > >   problem, it doesn't seem like the cause of the regression.
> > > >
> > > > OK, I see.
> > > >
> > > > > 2. From assembly code of the case with FMA, one problem is
> > > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > increased code size, so the overhead is increased.
> > > > >
> > > > >    a) When "tmp1" is not re-written to parallel:
> > > > >         fmadd d31, d2, d2, d30
> > > > >         fmadd d31, d3, d3, d31
> > > > >         fmadd d31, d4, d5, d31  //"tmp1"
> > > > >         fmadd d31, d31, d4, d3
> > > > >
> > > > >    b) When "tmp1" is re-written to parallel:
> > > > >         fmul  d31, d4, d5
> > > > >         fmadd d27, d2, d2, d30
> > > > >         fmadd d31, d3, d3, d31
> > > > >         fadd  d31, d31, d27     //"tmp1"
> > > > >         fmadd d31, d31, d4, d3
> > > > >
> > > > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > > > For version b), there are also 3 dependent instructions in the
> > > > > longer path: the 1st, 3rd and 4th.
> > > >
> > > > Yes, it doesn't really change anything.  The patch has
> > > >
> > > > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop,
> > some
> > > > +     uarchs can execute results like:
> > > > +
> > > > +       _1 = a * b;
> > > > +       _2 = .FMA (c, d, _1);
> > > > +       acc_1 = acc_0 + _2;
> > > > +
> > > > +     in parallel, while turning it into
> > > > +
> > > > +       _1 = .FMA(a, b, acc_0);
> > > > +       acc_1 = .FMA(c, d, _1);
> > > > +
> > > > +     hinders that, because then the first FMA depends on the result
> > > > of preceding
> > > > +     iteration.  */
> > > >
> > > > I can't see what can be run in parallel for the first case.  The .FMA
> > > > depends on the multiplication a * b.  Iff the uarch somehow decomposes
> > > > .FMA into multiply + add then the c * d multiply could run in parallel
> > > > with the a * b multiply which _might_ be able to hide some of the
> > > > latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> > > > cycles but a multiply only 3.  But I never got confirmation from any
> > > > of the CPU designers that .FMAs are issued when the multiply
> > > > operands are ready and the add operand can be forwarded.
> > > >
> > > > I also wonder why the multiplications of the two-FMA sequence
> > > > then cannot be executed at the same time?  So I have some doubt
> > > > of the theory above.
> > >
> > > The parallel execution for the code snippet above was the other
> > > issue (previously discussed here:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> > > Sorry it's a bit confusing to include that here, but these 2 fixes
> > > needs to be combined to avoid new regressions. Since considering
> > > FMA in get_reassociation_width produces more results of width=1,
> > > so there would be more loop depending FMA chains.
> > >
> > > > Iff this really is the reason for the sequence to execute with lower
> > > > overall latency and we want to attack this on GIMPLE then I think
> > > > we need a target hook telling us this fact (I also wonder if such
> > > > behavior can be modeled in the scheduler pipeline description at all?)
> > > >
> > > > > So it seems to me the current get_reassociation_width algorithm
> > > > > isn't optimal in the presence of FMA. So I modified the patch to
> > > > > improve get_reassociation_width, rather than check for code
> > > > > patterns. (Although there could be some other complicated
> > > > > factors so the regression is more obvious when there's "nested
> > > > > FMA". But with this patch that should be avoided or reduced.)
> > > > >
> > > > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > > > collecting data on other CPUs, I'd like to know how do you
> > > > > think of this.
> > > > >
> > > > > About changes in the patch:
> > > > >
> > > > > 1. When the op list forms a complete FMA chain, try to search
> > > > > for a smaller width considering the benefit of using FMA. With
> > > > > a smaller width, the increment of code size is smaller when
> > > > > breaking the chain.
> > > >
> > > > But this is all highly target specific (code size even more so).
> > > >
> > > > How I understand your approach to fixing the issue leads me to
> > > > the suggestion to prioritize parallel rewriting, thus alter
> > rank_ops_for_fma,
> > > > taking the reassoc width into account (the computed width should be
> > > > unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> > > > rewriting of FMAs (well, they are not yet formed of course).
> > > > get_reassociation_width has 'get_required_cycles', the above theory
> > > > could be verified with a very simple toy pipeline model.  We'd have
> > > > to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
> > > > even FMA_EXPRs).
> > > >
> > > > Taking the width of FMAs into account when computing the reassoc width
> > > > might be another way to attack this.
> > >
> > > Previously I tried to solve this generally, on the assumption that
> > > FMA (smaller code size) is preferred. Now I agree it's difficult
> > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> > > be different. 2) From my test result on different machines we
> > > have, it seems simply adding the cycles together is not a good way
> > > to estimate the latency of consecutive FMA.
> > >
> > > I think an easier way to fix this is to add a parameter to suggest
> > > the length of complete FMA chain to keep. (It can be set by target
> > > specific tuning then.) And we can break longer FMA chains for
> > > better parallelism. Attached is the new implementation. With
> > > max-fma-chain-len=8, there's about 7% improvement in spec2017
> > > 508.namd_r on ampere1, and the overall improvement on fprate is
> > > about 1%.
> > >
> > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> > > others, I left it before get_reassociation_width so the number of
> > > MULT_EXPRs can be used.
> >
> > Sorry again for the delay in replying.
> >
> > +  /* Check if keeping complete FMA chains is preferred.  */
> > +  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> > +    {
> > +      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
> > +      int num_others = ops_num - mult_num;
> > +      int num_fma_chain = CEIL (num_others + 1, 2);
> > +
> > +      if (num_fma_chain < width
> > +         && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
> > +       width = num_fma_chain;
> > +    }
> >
> > so here 'mult_num' serves as a heuristic value for how many
> > FMAs we could build.  If that were close to ops_num - 1 then
> > we'd have a chain of FMAs.  Not sure how you get at
> > num_others / 2 here.  Maybe we need to elaborate on what an
> > FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e), f, g)
> > where each (b,c) pair is really just one operand in the ops array,
> > one of the 'mult's.  Thus an FMA chain is _not_
> > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
>
> The "FMA chain" here refers to consecutive FMAs, each taking
> The previous one's result as the third operator, i.e.
> ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So original op
> list looks like "r + a * b + c * d + e * f + ...". These FMAs
> will end up using the same accumulate register.
>
> When num_others=2 or 3, there can be 2 complete chains, e.g.
>         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> or
>         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> And so on, that's where the "CEIL (num_others + 1, 2)" comes from.
>
> >
> > Forming an FMA chain effectively reduces the reassociation width
> > of the participating multiplies.  If we were not to form FMAs all
> > the multiplies could execute in parallel.
> >
> > So what does the above do, in terms of adjusting the reassociation
> > width for the _adds_, and what's the ripple-down effect on later
> > FMA forming?
> >
>
> The above code calculates the number of such FMA chains in the op
> list. And if the length of each chain doesn't exceed
> param_max_fma_chain_len, then width is set to the number of chains,
> so we won't break them (because rewrite_expr_tree_parallel handles
> this well).
>
> > The change still feels like playing whack-a-mole rather than understanding
> > the fundamental issue on the targets.
>
> I think the complexity is in how the instructions are pipelined.
> Some Arm CPUs such as Neoverse V2 support "late-forwarding":
> "FP multiply-accumulate pipelines support late-forwarding of
> accumulate operands from similar μOPs, allowing a typical
> sequence of multiply-accumulate μOPs to issue one every N
> cycles". ("N" is smaller than the latency of a single FMA
> instruction.) So keeping such FMA chains can exploit this
> feature and use fewer FP units. I guess the case is similar on
> some recent x86 CPUs.
>
> If we try to compute the minimum cycles of each option, I think
> we'll at least need to know whether the target has such a
> feature, and the latency of each uop. Using an empirical
> length of beneficial FMA chain could be a shortcut.
> (Maybe allowing different lengths for different data widths is
> better.)

Hm.  So even when we can late-forward in an FMA chain,
increasing the width should typically still be better?

_1 = FMA (_2 * _3 + _4);
_5 = FMA (_6 * _7 + _1);

say with late-forwarding we can hide the latency of the _6 * _7
multiply and the overall latency of the two FMAs above becomes
lat (FMA) + lat (ADD) in the ideal case.  Alternatively we do

_1 = FMA (_2 * _3 + _4);
_8 = _6 * _7;
_5 = _1 + _8;

where if the FMA and the multiply can execute in parallel
(we have two FMA pipes) the latency would be lat (FMA) + lat (ADD).
But when we only have a single pipeline capable of
FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL) + 1)
+ lat (ADD); it depends on luck whether the FMA or the MUL is
issued first there.
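
(Plugging in the Zen-like numbers from earlier in the thread, lat (FMA) = 4
and lat (MUL) = 3, plus an assumed lat (ADD) = 2: the forwarded FMA chain
costs 4 + 2 = 6 cycles, the two-pipe variant 4 + 2 = 6, and the single-pipe
variant at least MIN (4 + 1, 3 + 1) + 2 = 6, so with these particular
latencies it is a wash and everything hinges on the effective latency of
the forwarded accumulate.)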

So if late-forwarding works really well and the add part of the FMA
has very low latency compared to the multiplication part, having
a smaller reassoc width should pay off here, and we might be
able to control this simply via the existing target hook?

I'm not aware of x86 CPUs having late-forwarding capabilities
but usually the latency of multiplication and FMA is very similar
and one can issue two FMAs and possibly more ADDs in parallel.

As said, I think this detail (late-forwarding) should maybe be
reflected in get_required_cycles, possibly guided by a different
targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?
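
(To sketch that concretely, toy code only; the separate MULT_EXPR query
is hypothetical, since reassoc currently asks the hook just once, for
the code of the statement being rewritten:

  /* Hypothetical: charge the multiplies against their own issue width
     before the adds are accounted for.  */
  int add_width = targetm.sched.reassociation_width (PLUS_EXPR, mode);
  int mul_width = targetm.sched.reassociation_width (MULT_EXPR, mode);
  int cycles = CEIL (mult_num, mul_width)
               + get_required_cycles (ops_num - mult_num + 1, add_width);

A target with late-forwarding could then advertise different widths for
the two codes, instead of adding a new knob.)
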

> >
> > +  /* If there's loop dependent FMA result, return width=2 to avoid it.  This
> > is
> > +     better than skipping these FMA candidates in widening_mul.  */
> >
> > better than skipping, but you don't touch it there?  I suppose width == 2
> > will bypass the skipping, right?  This heuristic only comes in when the above
> > change made width == 1, since otherwise we have an earlier
> >
> >   if (width == 1)
> >     return width;
> >
> > which also guarantees width == 2 was allowed by the hook/param, right?
>
> Yes, that's right.
>
> >
> > +  if (width == 1 && mult_num
> > +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
> > +                  param_avoid_fma_max_bits))
> > +    {
> > +      /* Look for cross backedge dependency:
> > +       1. LHS is a phi argument in the same basic block it is defined.
> > +       2. And the result of the phi node is used in OPS.  */
> > +      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > +      gimple_stmt_iterator gsi;
> > +      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > +       {
> > +         gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
> > +         for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > +           {
> > +             tree op = PHI_ARG_DEF (phi, i);
> > +             if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
> > +               continue;
> >
> > I think it's easier to iterate over the immediate uses of LHS like
> >
> >   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> >      if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
> >        {
> >           if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> > (use_p))->src != bb)
> >             continue;
> > ...
> >        }
> >
> > otherwise I think _this_ part of the patch looks reasonable.
> >
> > As you say heuristically they might go together but I think we should split
> > the
> > patch - the cross-loop part can probably stand independently.  Can you adjust
> > and re-post?
>
> Attached is the separated part for cross-loop FMA. Thank you for the correction.

That cross-loop FMA patch is OK.

Thanks,
Richard.

> >
> > As for the first part I still don't understand very well and am still hoping
> > we
> > can get away without yet another knob to tune.
> >
> > Richard.
> >
> > > >
> > > > > 2. To avoid regressions, included the other patch
> > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > > > on this tracker again. This is because more FMA will be kept
> > > > > with 1., so we need to rule out the loop dependent
> > > > > FMA chains when param_avoid_fma_max_bits is set.
> > > >
> > > > Sorry again for taking so long to reply.
> > > >
> > > > I'll note we have an odd case on x86 Zen2(?) as well which we don't really
> > > > understand from a CPU behavior perspective.
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Di Zhao
> > > > >
> > > > > ----
> > > > >
> > > > >         PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > > > >         New function to check whether ranking the ops results in
> > > > >         better parallelism.
> > > > >         (get_reassociation_width): Add new parameters. Search for
> > > > >         smaller width considering the benefit of FMA.
> > > > >         (rank_ops_for_fma): Change return value to be number of
> > > > >         MULT_EXPRs.
> > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > >         swap_ops_for_binary_stmt.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.dg/pr110279.c: New test.
> > >
> > > Thanks,
> > > Di Zhao
> > >
> > > ----
> > >
> > >         PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * doc/invoke.texi: Description of param_max_fma_chain_len.
> > >         * params.opt: New parameter param_max_fma_chain_len.
> > >         * tree-ssa-reassoc.cc (get_reassociation_width):
> > >         Support param_max_fma_chain_len; check for loop dependent
> > >         FMAs.
> > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > >         (reassociate_bb): For 3 ops, refine the condition to call
> > >         swap_ops_for_binary_stmt.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279-1.c: New test.
> > >         * gcc.dg/pr110279-2.c: New test.
> > >         * gcc.dg/pr110279-3.c: New test.
>
> ---
>
>         PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * tree-ssa-reassoc.cc (get_reassociation_width): check
>         for loop dependent FMAs.
>         (reassociate_bb): For 3 ops, refine the condition to call
>         swap_ops_for_binary_stmt.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-1.c: New test.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-11-21 13:01         ` Richard Biener
@ 2023-11-29 14:35           ` Di Zhao OS
  2023-12-11 11:01             ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread
From: Di Zhao OS @ 2023-11-29 14:35 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 21434 bytes --]

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Tuesday, November 21, 2023 9:01 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS <dizhao@os.amperecomputing.com>
> wrote:
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Tuesday, October 31, 2023 9:48 PM
> > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > Cc: gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > get_reassociation_width
> > >
> > > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS <dizhao@os.amperecomputing.com>
> > > wrote:
> > > >
> > > > Attached is a new version of the patch.
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > get_reassociation_width
> > > > >
> > > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > > > > <dizhao@os.amperecomputing.com> wrote:
> > > > > >
> > > > > > This is a new version of the patch on "nested FMA".
> > > > > > Sorry for updating this after so long, I've been studying and
> > > > > > writing micro cases to sort out the cause of the regression.
> > > > >
> > > > > Sorry for taking so long to reply.
> > > > >
> > > > > > First, following previous discussion:
> > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> September/629080.html)
> > > > > >
> > > > > > 1. From testing more altered cases, I don't think the
> > > > > > problem is that reassociation works locally. In that:
> > > > > >
> > > > > >   1) On the example with multiplications:
> > > > > >
> > > > > >         tmp1 = a + c * c + d * d + x * y;
> > > > > >         tmp2 = x * tmp1;
> > > > > >         result += (a + c + d + tmp2);
> > > > > >
> > > > > >   Given "result" rewritten by width=2, the performance is
> > > > > >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > > > >   remove the multiplications from the example (and make "tmp1"
> > > > > >   not singe used), and still rewrite "result" by width=2, then
> > > > > >   rewriting "tmp1" with width=2 is better. (Make sense because
> > > > > >   the tree's depth at "result" is still smaller if we rewrite
> > > > > >   "tmp1".)
> > > > > >
> > > > > >   2) I tried to modify the assembly code of the example without
> > > > > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > > > > >   obvious improvement. So although this is an interesting
> > > > > >   problem, it doesn't seem like the cause of the regression.
> > > > >
> > > > > OK, I see.
> > > > >
> > > > > > 2. From assembly code of the case with FMA, one problem is
> > > > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > > increased code size, so the overhead is increased.
> > > > > >
> > > > > >    a) When "tmp1" is not re-written to parallel:
> > > > > >         fmadd d31, d2, d2, d30
> > > > > >         fmadd d31, d3, d3, d31
> > > > > >         fmadd d31, d4, d5, d31  //"tmp1"
> > > > > >         fmadd d31, d31, d4, d3
> > > > > >
> > > > > >    b) When "tmp1" is re-written to parallel:
> > > > > >         fmul  d31, d4, d5
> > > > > >         fmadd d27, d2, d2, d30
> > > > > >         fmadd d31, d3, d3, d31
> > > > > >         fadd  d31, d31, d27     //"tmp1"
> > > > > >         fmadd d31, d31, d4, d3
> > > > > >
> > > > > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > > > > For version b), there are also 3 dependent instructions in the
> > > > > > longer path: the 1st, 3rd and 4th.
> > > > >
> > > > > Yes, it doesn't really change anything.  The patch has
> > > > >
> > > > > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop,
> > > some
> > > > > +     uarchs can execute results like:
> > > > > +
> > > > > +       _1 = a * b;
> > > > > +       _2 = .FMA (c, d, _1);
> > > > > +       acc_1 = acc_0 + _2;
> > > > > +
> > > > > +     in parallel, while turning it into
> > > > > +
> > > > > +       _1 = .FMA(a, b, acc_0);
> > > > > +       acc_1 = .FMA(c, d, _1);
> > > > > +
> > > > > +     hinders that, because then the first FMA depends on the result
> > > > > of preceding
> > > > > +     iteration.  */
> > > > >
> > > > > I can't see what can be run in parallel for the first case.  The .FMA
> > > > > depends on the multiplication a * b.  Iff the uarch somehow decomposes
> > > > > .FMA into multiply + add then the c * d multiply could run in parallel
> > > > > with the a * b multiply which _might_ be able to hide some of the
> > > > > latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> > > > > cycles but a multiply only 3.  But I never got confirmation from any
> > > > > of the CPU designers that .FMAs are issued when the multiply
> > > > > operands are ready and the add operand can be forwarded.
> > > > >
> > > > > I also wonder why the multiplications of the two-FMA sequence
> > > > > then cannot be executed at the same time?  So I have some doubt
> > > > > of the theory above.
> > > >
> > > > The parallel execution for the code snippet above was the other
> > > > issue (previously discussed here:
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> > > > Sorry it's a bit confusing to include that here, but these 2 fixes
> > > > need to be combined to avoid new regressions. Since considering
> > > > FMA in get_reassociation_width produces more results of width=1,
> > > > there would be more loop-dependent FMA chains.
> > > >
> > > > > Iff this really is the reason for the sequence to execute with lower
> > > > > overall latency and we want to attack this on GIMPLE then I think
> > > > > we need a target hook telling us this fact (I also wonder if such
> > > > > behavior can be modeled in the scheduler pipeline description at all?)
> > > > >
> > > > > > So it seems to me the current get_reassociation_width algorithm
> > > > > > isn't optimal in the presence of FMA. So I modified the patch to
> > > > > > improve get_reassociation_width, rather than check for code
> > > > > > patterns. (Although there could be some other complicated
> > > > > > factors so the regression is more obvious when there's "nested
> > > > > > FMA". But with this patch that should be avoided or reduced.)
> > > > > >
> > > > > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > > > > Ampere1; on Intel Xeon there's about 3%. While I'm still
> > > > > > collecting data on other CPUs, I'd like to know what you
> > > > > > think of this.
> > > > > >
> > > > > > About changes in the patch:
> > > > > >
> > > > > > 1. When the op list forms a complete FMA chain, try to search
> > > > > > for a smaller width considering the benefit of using FMA. With
> > > > > > a smaller width, the increment of code size is smaller when
> > > > > > breaking the chain.
> > > > >
> > > > > But this is all highly target specific (code size even more so).
> > > > >
> > > > > How I understand your approach to fixing the issue leads me to
> > > > > the suggestion to prioritize parallel rewriting, thus alter
> > > rank_ops_for_fma,
> > > > > taking the reassoc width into account (the computed width should be
> > > > > unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> > > > > rewriting of FMAs (well, they are not yet formed of course).
> > > > > get_reassociation_width has 'get_required_cycles', the above theory
> > > > > could be verified with a very simple toy pipeline model.  We'd have
> > > > > to ask the target for the reassoc width for MULT_EXPRs as well (or
> maybe
> > > > > even FMA_EXPRs).
> > > > >
> > > > > Taking the width of FMAs into account when computing the reassoc width
> > > > > might be another way to attack this.
> > > >
> > > > Previously I tried to solve this generally, on the assumption that
> > > > FMA (smaller code size) is preferred. Now I agree it's difficult
> > > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> > > > be different. 2) From my test result on different machines we
> > > > have, it seems simply adding the cycles together is not a good way
> > > > to estimate the latency of consecutive FMA.
> > > >
> > > > I think an easier way to fix this is to add a parameter to suggest
> > > > the length of complete FMA chain to keep. (It can be set by target
> > > > specific tuning then.) And we can break longer FMA chains for
> > > > better parallelism. Attached is the new implementation. With
> > > > max-fma-chain-len=8, there's about 7% improvement in spec2017
> > > > 508.namd_r on ampere1, and the overall improvement on fprate is
> > > > about 1%.
> > > >
> > > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> > > > others, I left it before get_reassociation_width so the number of
> > > > MULT_EXPRs can be used.
> > >
> > > Sorry again for the delay in replying.
> > >
> > > +  /* Check if keeping complete FMA chains is preferred.  */
> > > +  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> > > +    {
> > > +      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
> > > +      int num_others = ops_num - mult_num;
> > > +      int num_fma_chain = CEIL (num_others + 1, 2);
> > > +
> > > +      if (num_fma_chain < width
> > > +         && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
> > > +       width = num_fma_chain;
> > > +    }
> > >
> > > so here 'mult_num' serves as a heuristic value for how many
> > > FMAs we could build.  If that were close to ops_num - 1 then
> > > we'd have a chain of FMAs.  Not sure how you get at
> > > num_others / 2 here.  Maybe we need to elaborate on what an
> > > FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e), f, g)
> > > where each (b,c) pair is really just one operand in the ops array,
> > > one of the 'mult's.  Thus a FMA chain is _not_
> > > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
> >
> > The "FMA chain" here refers to consecutive FMAs, each taking
> > the previous one's result as the third operand, i.e.
> > ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So the original op
> > list looks like "r + a * b + c * d + e * f + ...". These FMAs
> > will end up using the same accumulate register.
> >
> > When num_others=2 or 3, there can be 2 complete chains, e.g.
> >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> > or
> >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> > And so on; that's where the "CEIL (num_others + 1, 2)" comes from.
> >
> > >
> > > Forming an FMA chain effectively reduces the reassociation width
> > > of the participating multiplies.  If we were not to form FMAs, all
> > > the multiplies could execute in parallel.
> > >
> > > So what does the above do, in terms of adjusting the reassociation
> > > width for the _adds_, and what's the ripple-down effect on later
> > > FMA forming?
> > >
> >
> > The above code calculates the number of such FMA chains in the op
> > list. And if the length of each chain doesn't exceed
> > param_max_fma_chain_len, then width is set to the number of chains,
> > so we won't break them (because rewrite_expr_tree_parallel handles
> > this well).
> >
> > > The change still feels like whack-a-mole playing rather than understanding
> > > the fundamental issue on the targets.
> >
> > I think the complexity is in how the instructions are piped.
> > Some Arm CPUs such as Neoverse V2 support "late-forwarding":
> > "FP multiply-accumulate pipelines support late-forwarding of
> > accumulate operands from similar μOPs, allowing a typical
> > sequence of multiply-accumulate μOPs to issue one every N
> > cycles". ("N" is smaller than the latency of a single FMA
> > instruction.) So keeping such FMA chains can utilize such
> > feature and uses less FP units. I guess the case is similar on
> > some late X86 CPUs.
> >
> > If we try to compute the minimum cycles of each option, I think
> > at least we'll need to know whether the target has a similar
> > feature, and the latency of each uop. Using an empirical
> > length of a beneficial FMA chain could be a shortcut.
> > (Maybe allowing different lengths for different data widths is
> > better.)
> 
> Hm.  So even when we can late-forward in an FMA chain,
> increasing the width should typically still be better?
> 
> _1 = FMA (_2 * _3 + _4);
> _5 = FMA (_6 * _7 + _1);
> 
> say with late-forwarding we can hide the latency of the _6 * _7
> multiply and the overall latency of the two FMAs above becomes
> lat (FMA) + lat (ADD) in the ideal case.  Alternatively we do
> 
> _1 = FMA (_2 * _3 + _4);
> _8 = _6 * _7;
> _5 = _1 + _8;
> 
> where if the FMA and the multiply can execute in parallel
> (we have two FMA pipes) the latency would be lat (FMA) + lat (ADD).
> But when we only have a single pipeline capable of
> FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL) + 1)
> + lat (ADD); it depends on luck whether the FMA or the MUL is
> issued first there.
> 
> So if late-forward works really well and the add part of the FMA
> has very low latency compared to the multiplication part, then having
> a smaller reassoc width should pay off here, and we might be
> able to simply control this via the existing target hook?
> 
> I'm not aware of x86 CPUs having late-forwarding capabilities
> but usually the latency of multiplication and FMA is very similar
> and one can issue two FMAs and possibly more ADDs in parallel.
> 
> As said I think this detail (late-forward) should maybe be reflected
> into get_required_cycles, possibly guided by a different
> targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?
> 

To my understanding, the question is whether the target fully
pipelines FMA instructions, so that the MULT part can start as soon
as its operands are ready. targetm.sched.reassociation_width
reflects the number of pipes for a given operation, so it can guide
get_required_cycles for a sequence of identical operations
(e.g. A * B * C * D or A + B + C + D). Since the problem in
this case is not the number of pipes for FMA, I think another
indicator may be better.
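
(To illustrate what the hook already covers: for A + B + C + D,
width=1 leaves 3 dependent additions,

    _1 = A + B; _2 = _1 + C; _3 = _2 + D;

while width=2 needs only 2 dependent levels,

    _1 = A + B; _2 = C + D; _3 = _1 + _2;

and get_required_cycles reports exactly this difference, 3 cycles
vs. 2.)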

(Currently the fma_reassoc_width for AArch64 only controls
whether reassociation on FADD is OK. This workaround doesn't
work well in some cases; for example, it turns off reassociation
even when there's no FMA at all. So I think we'd better not
follow that scheme.)
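
A reduced example of the kind of case I mean (hypothetical, not from
the original regression):

    /* No multiplies, so no FMA can ever be formed; yet on a core
       where fma_reassoc_width is 1, this sum is still kept as one
       serial chain of FADDs.  */
    double
    sum8 (double a, double b, double c, double d,
          double e, double f, double g, double h)
    {
      return a + b + c + d + e + f + g + h;
    }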

Attached is a new version of the patch with a flag to indicate
that FMA is fully pipelined, and that: 1) lat (MUL) >= lat (ADD);
2) symmetric units are used for FMUL/FADD/FMA. Otherwise the
patch may not be beneficial.

It tries to calculate the latencies including MULT_EXPRs. Since
the code is different from the current code (the quick-search
part), I haven't included it inside get_required_cycles.
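
For example, on the complete FMA chain in the attached test
(tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4, so ops_num = 4 and
mult_num = 3), the search goes roughly:

    initial width = 4:  lat_mul = 2, cycles_best = 2
    try width = 1:      lat_mul_new = 1 (the chain can start from the
                        non-mult op a1, so only the first FMA waits on
                        a multiply), lat_add_new = 3
    MULT saved (2 - 1) >= ADD added (3 - 2), so width becomes 1

and the whole chain is kept, which is what pr110279-2.c checks.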

> > >
> > > +  /* If there's loop dependent FMA result, return width=2 to avoid it.
> This
> > > is
> > > +     better than skipping these FMA candidates in widening_mul.  */
> > >
> > > better than skipping, but you don't touch it there?  I suppose width == 2
> > > will bypass the skipping, right?  This heuristic only comes in when the
> above
> > > change made width == 1, since otherwise we have an earlier
> > >
> > >   if (width == 1)
> > >     return width;
> > >
> > > which also guarantees width == 2 was allowed by the hook/param, right?
> >
> > Yes, that's right.
> >
> > >
> > > +  if (width == 1 && mult_num
> > > +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
> > > +                  param_avoid_fma_max_bits))
> > > +    {
> > > +      /* Look for cross backedge dependency:
> > > +       1. LHS is a phi argument in the same basic block it is defined.
> > > +       2. And the result of the phi node is used in OPS.  */
> > > +      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > > +      gimple_stmt_iterator gsi;
> > > +      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > +       {
> > > +         gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
> > > +         for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > > +           {
> > > +             tree op = PHI_ARG_DEF (phi, i);
> > > +             if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
> > > +               continue;
> > >
> > > I think it's easier to iterate over the immediate uses of LHS like
> > >
> > >   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> > >      if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
> > >        {
> > >           if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> > > (use_p))->src != bb)
> > >             continue;
> > > ...
> > >        }
> > >
> > > otherwise I think _this_ part of the patch looks reasonable.
> > >
> > > As you say heuristically they might go together but I think we should
> split
> > > the
> > > patch - the cross-loop part can probably stand independently.  Can you
> adjust
> > > and re-post?
> >
> > Attached is the separated part for cross-loop FMA. Thank you for the
> correction.
> 
> That cross-loop FMA patch is OK.

Committed this part at 746344dd.

Thanks,
Di

> 
> Thanks,
> Richard.
> 
> > >
> > > As for the first part I still don't understand very well and am still
> hoping
> > > we
> > > can get away without yet another knob to tune.
> > >
> > > Richard.
> > >
> > > > >
> > > > > > 2. To avoid regressions, included the other patch
> > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> September/629203.html)
> > > > > > on this tracker again. This is because more FMA will be kept
> > > > > > with 1., so we need to rule out the loop dependent
> > > > > > FMA chains when param_avoid_fma_max_bits is set.
> > > > >
> > > > > Sorry again for taking so long to reply.
> > > > >
> > > > > I'll note we have an odd case on x86 Zen2(?) as well which we don't
> really
> > > > > understand from a CPU behavior perspective.
> > > > >
> > > > > Thanks,
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Di Zhao
> > > > > >
> > > > > > ----
> > > > > >
> > > > > >         PR tree-optimization/110279
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > > > > >         New function to check whether ranking the ops results in
> > > > > >         better parallelism.
> > > > > >         (get_reassociation_width): Add new parameters. Search for
> > > > > >         smaller width considering the benefit of FMA.
> > > > > >         (rank_ops_for_fma): Change return value to be number of
> > > > > >         MULT_EXPRs.
> > > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > > >         swap_ops_for_binary_stmt.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         * gcc.dg/pr110279.c: New test.
> > > >
> > > > Thanks,
> > > > Di Zhao
> > > >
> > > > ----
> > > >
> > > >         PR tree-optimization/110279
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * doc/invoke.texi: Description of param_max_fma_chain_len.
> > > >         * params.opt: New parameter param_max_fma_chain_len.
> > > >         * tree-ssa-reassoc.cc (get_reassociation_width):
> > > >         Support param_max_fma_chain_len; check for loop dependent
> > > >         FMAs.
> > > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > >         swap_ops_for_binary_stmt.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.dg/pr110279-1.c: New test.
> > > >         * gcc.dg/pr110279-2.c: New test.
> > > >         * gcc.dg/pr110279-3.c: New test.
> >
> > ---
> >
> >         PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> >         * tree-ssa-reassoc.cc (get_reassociation_width): check
> >         for loop dependent FMAs.
> >         (reassociate_bb): For 3 ops, refine the condition to call
> >         swap_ops_for_binary_stmt.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/pr110279-1.c: New test.
---

        PR tree-optimization/110279

gcc/ChangeLog:

        * common.opt: New flag fully-pipelined-fma.
        * tree-ssa-reassoc.cc (get_mult_latency_consider_fma):
        Return the latency of MULT_EXPRs that can't be hidden by FMA.
        (get_reassociation_width): Search for smaller widths
        considering the benefit of fully pipelined FMA.
        (rank_ops_for_fma): Return the number of MULT_EXPRs.
        (reassociate_bb): Pass the number of MULT_EXPRs to
        get_reassociation_width; avoid calling
        get_reassociation_width twice.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-2.c: New test.
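
For reference, the options the new test uses (also a quick way to try
the flag by hand):

    gcc -Ofast --param tree-reassoc-width=4 -ffully-pipelined-fma \
        -fdump-tree-reassoc2-details -fdump-tree-optimized pr110279-2.c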

[-- Attachment #2: 0001-Consider-fully-pipelined-FMA-in-get_reassociation_wi.patch --]
[-- Type: application/octet-stream, Size: 10001 bytes --]

From 485069bf0a7fc471480de08ad37c10d01a76030d Mon Sep 17 00:00:00 2001
From: "dzhao.ampere" <di.zhao@amperecomputing.com>
Date: Mon, 27 Nov 2023 18:19:14 +0800
Subject: [PATCH] Consider fully pipelined FMA in get_reassociation_width

---
 gcc/common.opt                    |   7 ++
 gcc/testsuite/gcc.dg/pr110279-2.c |  41 +++++++++
 gcc/tree-ssa-reassoc.cc           | 140 ++++++++++++++++++++++++------
 3 files changed, 161 insertions(+), 27 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-2.c

diff --git a/gcc/common.opt b/gcc/common.opt
index 35971c501fc..c53e4d5e5b0 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1710,6 +1710,13 @@ ffunction-cse
 Common Var(flag_no_function_cse,0) Optimization
 Allow function addresses to be held in registers.
 
+; If the flag 'fully-pipelined-fma' is set, reassociation takes into account
+; the benefit of parallelizing FMA's multiply part and addition part.
+ffully-pipelined-fma
+Common Var(flag_fully_pipelined_fma)
+Assume the target fully pipelines FMA instruction, and symmetric units are used
+for FMUL/FADD/FMA.
+
 ffunction-sections
 Common Var(flag_function_sections)
 Place each function into its own section.
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
new file mode 100644
index 00000000000..3e5c57a7c0e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -0,0 +1,41 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 -ffully-pipelined-fma -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+__attribute_noinline__ data_e
+foo (data_e in)
+{
+  data_e a1, a2, a3, a4;
+  data_e tmp, result = 0;
+  a1 = in + 0.1;
+  a2 = in * 0.1;
+  a3 = in + 0.01;
+  a4 = in * 0.59;
+
+  data_e result2 = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* Test that a complete FMA chain with length=4 is not broken.  */
+      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
+      result += tmp - ic;
+      result2 = result2 / 2 - tmp;
+
+      a1 += 0.91;
+      a2 += 0.1;
+      a3 -= 0.01;
+      a4 -= 0.89;
+
+    }
+
+  return result + result2;
+}
+
+/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
+/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 07fc8e2f1f9..6248dda2f60 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5430,13 +5430,32 @@ get_required_cycles (int ops_num, int cpu_width)
   return res;
 }
 
+/* Given that the target fully pipelines FMA instructions, return latency of
+   MULT_EXPRs that can't be hided by FMA.  WIDTH is the number of pipes.  */
+
+static inline int
+get_mult_latency_consider_fma (int ops_num, int mult_num, int width)
+{
+  /* For each partition, if mult_num == ops_num, there's latency(MULT)*2.
+     e.g:
+
+	A * B + C * D
+	=>
+	_1 = A * B;
+	_2 = .FMA (C, D, _1);
+
+      Otherwise there's latency(MULT)*1 in the first FMA.  */
+  return CEIL (ops_num, width) == CEIL (mult_num, width) ? 2 : 1;
+}
+
 /* Returns an optimal number of registers to use for computation of
    given statements.
 
-   LHS is the result ssa name of OPS.  */
+   LHS is the result ssa name of OPS.  MULT_NUM is number of sub-expressions
+   that are MULT_EXPRs, when OPS are PLUS_EXPRs or MINUS_EXPRs.  */
 
 static int
-get_reassociation_width (vec<operand_entry *> *ops, tree lhs,
+get_reassociation_width (vec<operand_entry *> *ops, int mult_num, tree lhs,
 			 enum tree_code opc, machine_mode mode)
 {
   int param_width = param_tree_reassoc_width;
@@ -5462,16 +5481,61 @@ get_reassociation_width (vec<operand_entry *> *ops, tree lhs,
      so we can perform a binary search for the minimal width that still
      results in the optimal cycle count.  */
   width_min = 1;
-  while (width > width_min)
+
+  /* If the target fully pipelines FMA instruction, the multiply part can start
+     first if its operands are ready.  Assuming symmetric pipes are used for
+     FMUL/FADD/FMA, then for a sequence of FMA like:
+
+	_8 = .FMA (_2, _3, _1);
+	_9 = .FMA (_5, _4, _8);
+	_10 = .FMA (_7, _6, _9);
+
+     , if width=1, the latency is latency(MULT) + latency(ADD)*3.
+     While with width=2:
+
+	_8 = _4 * _5;
+	_9 = .FMA (_2, _3, _1);
+	_10 = .FMA (_6, _7, _8);
+	_11 = _9 + _10;
+
+     , it is latency(MULT)*2 + latency(ADD)*2.  Assuming latency(MULT) <=
+     latency(ADD), the previous one is preferred.
+
+     Find out if we can get a smaller width considering FMA.  */
+  if (width > 1 && mult_num && flag_fully_pipelined_fma)
     {
-      int width_mid = (width + width_min) / 2;
+      /* When flag_fully_pipelined_fma is set, assumes symmetric pipes are used
+	 for FMUL/FADD/FMA.  */
+      int lat_mul = get_mult_latency_consider_fma (ops_num, mult_num, width);
 
-      if (get_required_cycles (ops_num, width_mid) == cycles_best)
-	width = width_mid;
-      else if (width_min < width_mid)
-	width_min = width_mid;
-      else
-	break;
+      /* Quick search might not apply.  So start from 1.  */
+      for (int i = 1; i < width; i++)
+	{
+	  int lat_mul_new
+	    = get_mult_latency_consider_fma (ops_num, mult_num, i);
+	  int lat_add_new = get_required_cycles (ops_num, i);
+
+	  /* Assume latency(MULT) >= latency(ADD).  */
+	  if (lat_mul - lat_mul_new >= lat_add_new - cycles_best)
+	    {
+	      width = i;
+	      break;
+	    }
+	}
+    }
+  else
+    {
+      while (width > width_min)
+	{
+	  int width_mid = (width + width_min) / 2;
+
+	  if (get_required_cycles (ops_num, width_mid) == cycles_best)
+	    width = width_mid;
+	  else if (width_min < width_mid)
+	    width_min = width_mid;
+	  else
+	    break;
+	}
     }
 
   /* If there's loop dependent FMA result, return width=2 to avoid it.  This is
@@ -6841,8 +6905,10 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
    Rearrange ops to -> e + a * b + c * d generates:
 
    _4  = .FMA (c_7(D), d_8(D), _3);
-   _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
+   _11 = .FMA (a_5(D), b_6(D), _4);
+
+   Return the number of MULT_EXPRs in the chain.  */
+static int
 rank_ops_for_fma (vec<operand_entry *> *ops)
 {
   operand_entry *oe;
@@ -6856,9 +6922,26 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
       if (TREE_CODE (oe->op) == SSA_NAME)
 	{
 	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
-	  if (is_gimple_assign (def_stmt)
-	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
-	    ops_mult.safe_push (oe);
+	  if (is_gimple_assign (def_stmt))
+	    {
+	      if (gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		ops_mult.safe_push (oe);
+	      /* A negate on the multiplication leads to FNMA.  */
+	      else if (gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
+		       && TREE_CODE (gimple_assign_rhs1 (def_stmt)) == SSA_NAME)
+		{
+		  gimple *neg_def_stmt
+		    = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (def_stmt));
+		  if (is_gimple_assign (neg_def_stmt)
+		      && gimple_bb (neg_def_stmt) == gimple_bb (def_stmt)
+		      && gimple_assign_rhs_code (neg_def_stmt) == MULT_EXPR)
+		    ops_mult.safe_push (oe);
+		  else
+		    ops_others.safe_push (oe);
+		}
+	      else
+		ops_others.safe_push (oe);
+	    }
 	  else
 	    ops_others.safe_push (oe);
 	}
@@ -6874,7 +6957,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
      Putting ops that not def from mult in front can generate more FMAs.
 
      2. If all ops are defined with mult, we don't need to rearrange them.  */
-  if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
+  unsigned mult_num = ops_mult.length ();
+  if (mult_num >= 2 && mult_num != ops_length)
     {
       /* Put no-mult ops and mult ops alternately at the end of the
 	 queue, which is conducive to generating more FMA and reducing the
@@ -6890,9 +6974,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
     }
-  return false;
+  return mult_num;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -7057,8 +7140,8 @@ reassociate_bb (basic_block bb)
 		{
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
-		  int width;
-		  bool has_fma = false;
+		  int width = 0;
+		  int mult_num = 0;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7081,16 +7164,17 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      mult_num = rank_ops_for_fma (&ops);
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
+		  bool has_fma = mult_num >= 2 && mult_num != ops_num;
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
-		      && (width
-			  = get_reassociation_width (&ops, lhs, rhs_code, mode))
+		      && (width = get_reassociation_width (&ops, mult_num, lhs,
+							   rhs_code, mode))
 			   > 1)
 		    {
 		      if (dump_file && (dump_flags & TDF_DETAILS))
@@ -7111,10 +7195,12 @@ reassociate_bb (basic_block bb)
 		      if (len >= 3
 			  && (!has_fma
 			      /* width > 1 means ranking ops results in better
-				 parallelism.  */
-			      || get_reassociation_width (&ops, lhs, rhs_code,
-							  mode)
-				   > 1))
+				 parallelism.  Check current value to avoid
+				 calling get_reassociation_width again.  */
+			      || (width != 1
+				  && get_reassociation_width (
+				       &ops, mult_num, lhs, rhs_code, mode)
+				       > 1)))
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-11-29 14:35           ` Di Zhao OS
@ 2023-12-11 11:01             ` Richard Biener
  2023-12-13  8:14               ` Di Zhao OS
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-12-11 11:01 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: gcc-patches

On Wed, Nov 29, 2023 at 3:36 PM Di Zhao OS
<dizhao@os.amperecomputing.com> wrote:
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Tuesday, November 21, 2023 9:01 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS <dizhao@os.amperecomputing.com>
> > wrote:
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > Sent: Tuesday, October 31, 2023 9:48 PM
> > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > Cc: gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > get_reassociation_width
> > > >
> > > > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > wrote:
> > > > >
> > > > > Attached is a new version of the patch.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > > get_reassociation_width
> > > > > >
> > > > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > > > > > <dizhao@os.amperecomputing.com> wrote:
> > > > > > >
> > > > > > > This is a new version of the patch on "nested FMA".
> > > > > > > Sorry for updating this after so long, I've been studying and
> > > > > > > writing micro cases to sort out the cause of the regression.
> > > > > >
> > > > > > Sorry for taking so long to reply.
> > > > > >
> > > > > > > First, following previous discussion:
> > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > September/629080.html)
> > > > > > >
> > > > > > > 1. From testing more altered cases, I don't think the
> > > > > > > problem is that reassociation works locally. In that:
> > > > > > >
> > > > > > >   1) On the example with multiplications:
> > > > > > >
> > > > > > >         tmp1 = a + c * c + d * d + x * y;
> > > > > > >         tmp2 = x * tmp1;
> > > > > > >         result += (a + c + d + tmp2);
> > > > > > >
> > > > > > >   Given "result" rewritten by width=2, the performance is
> > > > > > >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > > > > >   remove the multiplications from the example (and make "tmp1"
> > > > > > >   not single-used), and still rewrite "result" by width=2, then
> > > > > > >   rewriting "tmp1" with width=2 is better. (Makes sense because
> > > > > > >   the tree's depth at "result" is still smaller if we rewrite
> > > > > > >   "tmp1".)
> > > > > > >
> > > > > > >   2) I tried to modify the assembly code of the example without
> > > > > > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > > > > > >   obvious improvement. So although this is an interesting
> > > > > > >   problem, it doesn't seem like the cause of the regression.
> > > > > >
> > > > > > OK, I see.
> > > > > >
> > > > > > > 2. From assembly code of the case with FMA, one problem is
> > > > > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > > > increased code size, so the overhead is increased.
> > > > > > >
> > > > > > >    a) When "tmp1" is not re-written to parallel:
> > > > > > >         fmadd d31, d2, d2, d30
> > > > > > >         fmadd d31, d3, d3, d31
> > > > > > >         fmadd d31, d4, d5, d31  //"tmp1"
> > > > > > >         fmadd d31, d31, d4, d3
> > > > > > >
> > > > > > >    b) When "tmp1" is re-written to parallel:
> > > > > > >         fmul  d31, d4, d5
> > > > > > >         fmadd d27, d2, d2, d30
> > > > > > >         fmadd d31, d3, d3, d31
> > > > > > >         fadd  d31, d31, d27     //"tmp1"
> > > > > > >         fmadd d31, d31, d4, d3
> > > > > > >
> > > > > > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > > > > > For version b), there are also 3 dependent instructions in the
> > > > > > > longer path: the 1st, 3rd and 4th.
> > > > > >
> > > > > > Yes, it doesn't really change anything.  The patch has
> > > > > >
> > > > > > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop,
> > > > some
> > > > > > +     uarchs can execute results like:
> > > > > > +
> > > > > > +       _1 = a * b;
> > > > > > +       _2 = .FMA (c, d, _1);
> > > > > > +       acc_1 = acc_0 + _2;
> > > > > > +
> > > > > > +     in parallel, while turning it into
> > > > > > +
> > > > > > +       _1 = .FMA(a, b, acc_0);
> > > > > > +       acc_1 = .FMA(c, d, _1);
> > > > > > +
> > > > > > +     hinders that, because then the first FMA depends on the result
> > > > > > of preceding
> > > > > > +     iteration.  */
> > > > > >
> > > > > > I can't see what can be run in parallel for the first case.  The .FMA
> > > > > > depends on the multiplication a * b.  Iff the uarch somehow decomposes
> > > > > > .FMA into multiply + add then the c * d multiply could run in parallel
> > > > > > with the a * b multiply which _might_ be able to hide some of the
> > > > > > latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> > > > > > cycles but a multiply only 3.  But I never got confirmation from any
> > > > > > of the CPU designers that .FMAs are issued when the multiply
> > > > > > operands are ready and the add operand can be forwarded.
> > > > > >
> > > > > > I also wonder why the multiplications of the two-FMA sequence
> > > > > > then cannot be executed at the same time?  So I have some doubt
> > > > > > of the theory above.
> > > > >
> > > > > The parallel execution for the code snippet above was the other
> > > > > issue (previously discussed here:
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> > > > > Sorry it's a bit confusing to include that here, but these 2 fixes
> > > > > need to be combined to avoid new regressions. Since considering
> > > > > FMA in get_reassociation_width produces more results of width=1,
> > > > > there would be more loop-dependent FMA chains.
> > > > >
> > > > > > Iff this really is the reason for the sequence to execute with lower
> > > > > > overall latency and we want to attack this on GIMPLE then I think
> > > > > > we need a target hook telling us this fact (I also wonder if such
> > > > > > behavior can be modeled in the scheduler pipeline description at all?)
> > > > > >
> > > > > > > So it seems to me the current get_reassociation_width algorithm
> > > > > > > isn't optimal in the presence of FMA. So I modified the patch to
> > > > > > > improve get_reassociation_width, rather than check for code
> > > > > > > patterns. (Although there could be some other complicated
> > > > > > > factors so the regression is more obvious when there's "nested
> > > > > > > FMA". But with this patch that should be avoided or reduced.)
> > > > > > >
> > > > > > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > > > > > Ampere1; on Intel Xeon there's about 3%. While I'm still
> > > > > > > collecting data on other CPUs, I'd like to know what you
> > > > > > > think of this.
> > > > > > >
> > > > > > > About changes in the patch:
> > > > > > >
> > > > > > > 1. When the op list forms a complete FMA chain, try to search
> > > > > > > for a smaller width considering the benefit of using FMA. With
> > > > > > > a smaller width, the increment of code size is smaller when
> > > > > > > breaking the chain.
> > > > > >
> > > > > > But this is all highly target specific (code size even more so).
> > > > > >
> > > > > > How I understand your approach to fixing the issue leads me to
> > > > > > the suggestion to prioritize parallel rewriting, thus alter
> > > > rank_ops_for_fma,
> > > > > > taking the reassoc width into account (the computed width should be
> > > > > > unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> > > > > > rewriting of FMAs (well, they are not yet formed of course).
> > > > > > get_reassociation_width has 'get_required_cycles', the above theory
> > > > > > could be verified with a very simple toy pipeline model.  We'd have
> > > > > > to ask the target for the reassoc width for MULT_EXPRs as well (or
> > maybe
> > > > > > even FMA_EXPRs).
> > > > > >
> > > > > > Taking the width of FMAs into account when computing the reassoc width
> > > > > > might be another way to attack this.
> > > > >
> > > > > Previously I tried to solve this generally, on the assumption that
> > > > > FMA (smaller code size) is preferred. Now I agree it's difficult
> > > > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> > > > > be different. 2) From my test result on different machines we
> > > > > have, it seems simply adding the cycles together is not a good way
> > > > > to estimate the latency of consecutive FMA.
> > > > >
> > > > > I think an easier way to fix this is to add a parameter to suggest
> > > > > the length of complete FMA chain to keep. (It can be set by target
> > > > > specific tuning then.) And we can break longer FMA chains for
> > > > > better parallelism. Attached is the new implementation. With
> > > > > max-fma-chain-len=8, there's about 7% improvement in spec2017
> > > > > 508.namd_r on ampere1, and the overall improvement on fprate is
> > > > > about 1%.
> > > > >
> > > > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> > > > > others, I left it before get_reassociation_width so the number of
> > > > > MULT_EXPRs can be used.
> > > >
> > > > Sorry again for the delay in replying.
> > > >
> > > > +  /* Check if keeping complete FMA chains is preferred.  */
> > > > +  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> > > > +    {
> > > > +      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
> > > > +      int num_others = ops_num - mult_num;
> > > > +      int num_fma_chain = CEIL (num_others + 1, 2);
> > > > +
> > > > +      if (num_fma_chain < width
> > > > +         && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
> > > > +       width = num_fma_chain;
> > > > +    }
> > > >
> > > > so here 'mult_num' serves as a heuristic value for how many
> > > > FMAs we could build.  If that were close to ops_num - 1 then
> > > > we'd have a chain of FMAs.  Not sure how you get at
> > > > num_others / 2 here.  Maybe we need to elaborate on what an
> > > > FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e), f, g)
> > > > where each (b,c) pair is really just one operand in the ops array,
> > > > one of the 'mult's.  Thus a FMA chain is _not_
> > > > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
> > >
> > > The "FMA chain" here refers to consecutive FMAs, each taking
> > > the previous one's result as the third operand, i.e.
> > > ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So the original op
> > > list looks like "r + a * b + c * d + e * f + ...". These FMAs
> > > will end up using the same accumulate register.
> > >
> > > When num_others=2 or 3, there can be 2 complete chains, e.g.
> > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> > > or
> > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> > > And so on; that's where the "CEIL (num_others + 1, 2)" comes from.
> > >
> > > >
> > > > Forming an FMA chain effectively reduces the reassociation width
> > > > of the participating multiplies.  If we were not to form FMAs, all
> > > > the multiplies could execute in parallel.
> > > >
> > > > So what does the above do, in terms of adjusting the reassociation
> > > > width for the _adds_, and what's the ripple-down effect on later
> > > > FMA forming?
> > > >
> > >
> > > The above code calculates the number of such FMA chains in the op
> > > list. And if the length of each chain doesn't exceed
> > > param_max_fma_chain_len, then width is set to the number of chains,
> > > so we won't break them (because rewrite_expr_tree_parallel handles
> > > this well).
> > >
> > > > The change still feels like whack-a-mole playing rather than understanding
> > > > the fundamental issue on the targets.
> > >
> > > I think the complexity is in how the instructions are piped.
> > > Some Arm CPUs such as Neoverse V2 support "late-forwarding":
> > > "FP multiply-accumulate pipelines support late-forwarding of
> > > accumulate operands from similar μOPs, allowing a typical
> > > sequence of multiply-accumulate μOPs to issue one every N
> > > cycles". ("N" is smaller than the latency of a single FMA
> > > instruction.) So keeping such FMA chains can utilize such
> > > feature and uses less FP units. I guess the case is similar on
> > > some late X86 CPUs.
> > >
> > > If we try to compute the minimum cycles of each option, I think
> > > at least we'll need to know whether the target has a similar
> > > feature, and the latency of each uop. Using an empirical
> > > length of a beneficial FMA chain could be a shortcut.
> > > (Maybe allowing different lengths for different data widths is
> > > better.)
> >
> > Hm.  So even when we can late-forward in an FMA chain,
> > increasing the width should typically still be better?
> >
> > _1 = FMA (_2 * _3 + _4);
> > _5 = FMA (_6 * _7 + _1);
> >
> > say with late-forwarding we can hide the latency of the _6 * _7
> > multiply and the overall latency of the two FMAs above becomes
> > lat (FMA) + lat (ADD) in the ideal case.  Alternatively we do
> >
> > _1 = FMA (_2 * _3 + _4);
> > _8 = _6 * _7;
> > _5 = _1 + _8;
> >
> > where if the FMA and the multiply can execute in parallel
> > (we have two FMA pipes) the latency would be lat (FMA) + lat (ADD).
> > But when we only have a single pipeline capable of
> > FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL) + 1)
> > + lat (ADD); it depends on luck whether the FMA or the MUL is
> > issued first there.
> >
> > So if late-forward works really well and the add part of the FMA
> > has very low latency compared to the multiplication part, then having
> > a smaller reassoc width should pay off here, and we might be
> > able to simply control this via the existing target hook?
> >
> > I'm not aware of x86 CPUs having late-forwarding capabilities
> > but usually the latency of multiplication and FMA is very similar
> > and one can issue two FMAs and possibly more ADDs in parallel.
> >
> > As said I think this detail (late-forward) should maybe be reflected
> > into get_required_cycles, possibly guided by a different
> > targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?
> >
>
> To my understanding, the question is whether the target fully
> pipelines FMA instructions, so the MULT part can start first if
> its operands are ready. While targetm.sched.reassociation_width
> reflects the number of pipes for some operation, so it can guide
> get_required_cycles for a sequence of identical operations
> (e.g. A * B * C * D or A + B + C + D). Since the problem in
> this case is not the number of pipes for FMA, I think another
> indicator may be better.
>
> (Currently the fma_reassoc_width for AArch64 is to control
> whether reassociation on FADD is OK. This workaround doesn't
> work well in some cases; for example, it turns off reassociation
> even when there's no FMA at all. So I think we'd better not
> follow that scheme.)
>
> Attached is a new version of the patch with a flag to indicate
> whether FMA is fully pipelined, and: 1) lat (MUL) >= lat (ADD);
> 2) symmetric units are used for FMUL/FADD/FMA. Otherwise the
> patch may not be beneficial.
>
> It tries to calculate the latencies including MULT_EXPRs. Since
> the code is different from the current code (the quick-search
> part), I haven't included it inside get_required_cycles.

+; If the flag 'fully-pipelined-fma' is set, reassociation takes into account
+; the benefit of parallelizing FMA's multiply part and addition part.
+ffully-pipelined-fma
+Common Var(flag_fully_pipelined_fma)
+Assume the target fully pipelines FMA instruction, and symmetric units are used
+for FMUL/FADD/FMA.

please use a --param for now; I think targets might want to set this based
on active core tuning.
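
Roughly like the following in params.opt (a sketch only; pick the
name and range as you see fit):

    -param=fully-pipelined-fma=
    Common Joined UInteger Var(param_fully_pipelined_fma) Init(0) IntegerRange(0, 1) Param Optimization
    Whether the target fully pipelines FMA instructions.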

+/* Given that the target fully pipelines FMA instructions, return latency of
+   MULT_EXPRs that can't be hided by FMA.  WIDTH is the number of pipes.  */
+

return the latency .. can't be hidden by the FMA

For documentation purposes it should be stated that mult_num <= ops_num
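E.g. extend the function comment with something like

    /* ... MULT_NUM is the number of MULT_EXPRs among the ops,
       so MULT_NUM <= OPS_NUM.  */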

+  /* If the target fully pipelines FMA instruction, the multiply part can start

instructions

+     first if its operands are ready.  Assuming symmetric pipes are used for

s/first/already/

+     FMUL/FADD/FMA, then for a sequence of FMA like:
+
+       _8 = .FMA (_2, _3, _1);
+       _9 = .FMA (_5, _4, _8);
+       _10 = .FMA (_7, _6, _9);
+
+     , if width=1, the latency is latency(MULT) + latency(ADD)*3.
+     While with width=2:
+
+       _8 = _4 * _5;
+       _9 = .FMA (_2, _3, _1);
+       _10 = .FMA (_6, _7, _8);
+       _11 = _9 + _10;
+
+     , it is latency(MULT)*2 + latency(ADD)*2.  Assuming latency(MULT) <=
+     latency(ADD), the previous one is preferred.

latency (MULT) >= latency (ADD)?

".. the first variant is preferred."

+
+     Find out if we can get a smaller width considering FMA.  */


+      /* When flag_fully_pipelined_fma is set, assumes symmetric pipes are used
+        for FMUL/FADD/FMA.  */
+      int lat_mul = get_mult_latency_consider_fma (ops_num, mult_num, width);

what does "symmetric pipes" actually mean?  For x86 Zen3 we have
two FMA pipes (that can also do ADD and MUL) and one pipe that can
do ADD.  Is that then non-symmetric because we can issue more adds
in parallel than FMA/MUL?
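
(If so, the toy model would presumably want two widths instead of
one, hypothetically along the lines of

    int mul_width = targetm.sched.reassociation_width (MULT_EXPR, mode);
    int add_width = targetm.sched.reassociation_width (PLUS_EXPR, mode);

with a Zen3-like topology answering 2 for MULT_EXPR and 3 for
PLUS_EXPR.)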

Otherwise this looks OK now.

Thanks,
Richard.

> > > > +  /* If there's loop dependent FMA result, return width=2 to avoid it.
> > This
> > > > is
> > > > +     better than skipping these FMA candidates in widening_mul.  */
> > > >
> > > > better than skipping, but you don't touch it there?  I suppose width == 2
> > > > will bypass the skipping, right?  This heuristic only comes in when the
> > above
> > > > change made width == 1, since otherwise we have an earlier
> > > >
> > > >   if (width == 1)
> > > >     return width;
> > > >
> > > > which also guarantees width == 2 was allowed by the hook/param, right?
> > >
> > > Yes, that's right.
> > >
> > > >
> > > > +  if (width == 1 && mult_num
> > > > +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
> > > > +                  param_avoid_fma_max_bits))
> > > > +    {
> > > > +      /* Look for cross backedge dependency:
> > > > +       1. LHS is a phi argument in the same basic block it is defined.
> > > > +       2. And the result of the phi node is used in OPS.  */
> > > > +      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > > > +      gimple_stmt_iterator gsi;
> > > > +      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > +       {
> > > > +         gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
> > > > +         for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > > > +           {
> > > > +             tree op = PHI_ARG_DEF (phi, i);
> > > > +             if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
> > > > +               continue;
> > > >
> > > > I think it's easier to iterate over the immediate uses of LHS like
> > > >
> > > >   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> > > >      if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
> > > >        {
> > > >           if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> > > > (use_p))->src != bb)
> > > >             continue;
> > > > ...
> > > >        }
> > > >
> > > > otherwise I think _this_ part of the patch looks reasonable.
> > > >
> > > > As you say heuristically they might go together but I think we should
> > split
> > > > the
> > > > patch - the cross-loop part can probably stand independently.  Can you
> > adjust
> > > > and re-post?
> > >
> > > Attached is the separated part for cross-loop FMA. Thank you for the
> > correction.
> >
> > That cross-loop FMA patch is OK.
>
> Committed this part at 746344dd.
>
> Thanks,
> Di
>
> >
> > Thanks,
> > Richard.
> >
> > > >
> > > > As for the first part I still don't understand very well and am still
> > hoping
> > > > we
> > > > can get away without yet another knob to tune.
> > > >
> > > > Richard.
> > > >
> > > > > >
> > > > > > > 2. To avoid regressions, included the other patch
> > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > September/629203.html)
> > > > > > > on this tracker again. This is because more FMA will be kept
> > > > > > > with 1., so we need to rule out the loop dependent
> > > > > > > FMA chains when param_avoid_fma_max_bits is set.
> > > > > >
> > > > > > Sorry again for taking so long to reply.
> > > > > >
> > > > > > I'll note we have an odd case on x86 Zen2(?) as well which we don't
> > really
> > > > > > understand from a CPU behavior perspective.
> > > > > >
> > > > > > Thanks,
> > > > > > Richard.
> > > > > >
> > > > > > > Thanks,
> > > > > > > Di Zhao
> > > > > > >
> > > > > > > ----
> > > > > > >
> > > > > > >         PR tree-optimization/110279
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > > > > > >         New function to check whether ranking the ops results in
> > > > > > >         better parallelism.
> > > > > > >         (get_reassociation_width): Add new parameters. Search for
> > > > > > >         smaller width considering the benefit of FMA.
> > > > > > >         (rank_ops_for_fma): Change return value to be number of
> > > > > > >         MULT_EXPRs.
> > > > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > > > >         swap_ops_for_binary_stmt.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >         * gcc.dg/pr110279.c: New test.
> > > > >
> > > > > Thanks,
> > > > > Di Zhao
> > > > >
> > > > > ----
> > > > >
> > > > >         PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * doc/invoke.texi: Description of param_max_fma_chain_len.
> > > > >         * params.opt: New parameter param_max_fma_chain_len.
> > > > >         * tree-ssa-reassoc.cc (get_reassociation_width):
> > > > >         Support param_max_fma_chain_len; check for loop dependent
> > > > >         FMAs.
> > > > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > >         swap_ops_for_binary_stmt.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.dg/pr110279-1.c: New test.
> > > > >         * gcc.dg/pr110279-2.c: New test.
> > > > >         * gcc.dg/pr110279-3.c: New test.
> > >
> > > ---
> > >
> > >         PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * tree-ssa-reassoc.cc (get_reassociation_width): check
> > >         for loop dependent FMAs.
> > >         (reassociate_bb): For 3 ops, refine the condition to call
> > >         swap_ops_for_binary_stmt.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279-1.c: New test.
> ---
>
>         PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * common.opt: New flag fully-pipelined-fma.
>         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma):
>         Return the latency of MULT_EXPRs that can't be hidden by FMA.
>         (get_reassociation_width): Search for smaller widths
>         considering the benefit of fully pipelined FMA.
>         (rank_ops_for_fma): Return the number of MULT_EXPRs.
>         (reassociate_bb): Pass the number of MULT_EXPRs to
>         get_reassociation_width; avoid calling
>         get_reassociation_width twice.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-2.c: New test.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-11 11:01             ` Richard Biener
@ 2023-12-13  8:14               ` Di Zhao OS
  2023-12-13  9:00                 ` Richard Biener
  2023-12-15  9:46                 ` Thomas Schwinge
  0 siblings, 2 replies; 18+ messages in thread
From: Di Zhao OS @ 2023-12-13  8:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 27831 bytes --]

Hello Richard,

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Monday, December 11, 2023 7:01 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> On Wed, Nov 29, 2023 at 3:36 PM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Tuesday, November 21, 2023 9:01 PM
> > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > Cc: gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > get_reassociation_width
> > >
> > > On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS <dizhao@os.amperecomputing.com>
> > > wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > Sent: Tuesday, October 31, 2023 9:48 PM
> > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > get_reassociation_width
> > > > >
> > > > > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS
> <dizhao@os.amperecomputing.com>
> > > > > wrote:
> > > > > >
> > > > > > Attached is a new version of the patch.
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > > > get_reassociation_width
> > > > > > >
> > > > > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > > > > > > <dizhao@os.amperecomputing.com> wrote:
> > > > > > > >
> > > > > > > > This is a new version of the patch on "nested FMA".
> > > > > > > > Sorry for updating this after so long, I've been studying and
> > > > > > > > writing micro cases to sort out the cause of the regression.
> > > > > > >
> > > > > > > Sorry for taking so long to reply.
> > > > > > >
> > > > > > > > First, following previous discussion:
> > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > > September/629080.html)
> > > > > > > >
> > > > > > > > 1. From testing more altered cases, I don't think the
> > > > > > > > problem is that reassociation works locally. In that:
> > > > > > > >
> > > > > > > >   1) On the example with multiplications:
> > > > > > > >
> > > > > > > >         tmp1 = a + c * c + d * d + x * y;
> > > > > > > >         tmp2 = x * tmp1;
> > > > > > > >         result += (a + c + d + tmp2);
> > > > > > > >
> > > > > > > >   Given "result" rewritten by width=2, the performance is
> > > > > > > >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > > > > > >   remove the multiplications from the example (and make "tmp1"
> > > > > > > >   not single-used), and still rewrite "result" by width=2, then
> > > > > > > >   rewriting "tmp1" with width=2 is better. (Makes sense because
> > > > > > > >   the tree's depth at "result" is still smaller if we rewrite
> > > > > > > >   "tmp1".)
> > > > > > > >
> > > > > > > >   2) I tried to modify the assembly code of the example without
> > > > > > > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > > > > > > >   obvious improvement. So although this is an interesting
> > > > > > > >   problem, it doesn't seem like the cause of the regression.
> > > > > > >
> > > > > > > OK, I see.
> > > > > > >
> > > > > > > > 2. From assembly code of the case with FMA, one problem is
> > > > > > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > > > > increased code size, so the overhead is increased.
> > > > > > > >
> > > > > > > >    a) When "tmp1" is not re-written to parallel:
> > > > > > > >         fmadd d31, d2, d2, d30
> > > > > > > >         fmadd d31, d3, d3, d31
> > > > > > > >         fmadd d31, d4, d5, d31  //"tmp1"
> > > > > > > >         fmadd d31, d31, d4, d3
> > > > > > > >
> > > > > > > >    b) When "tmp1" is re-written to parallel:
> > > > > > > >         fmul  d31, d4, d5
> > > > > > > >         fmadd d27, d2, d2, d30
> > > > > > > >         fmadd d31, d3, d3, d31
> > > > > > > >         fadd  d31, d31, d27     //"tmp1"
> > > > > > > >         fmadd d31, d31, d4, d3
> > > > > > > >
> > > > > > > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > > > > > > For version b), there are also 3 dependent instructions in the
> > > > > > > > longer path: the 1st, 3rd and 4th.
> > > > > > >
> > > > > > > Yes, it doesn't really change anything.  The patch has
> > > > > > >
> > > > > > > +  /* If there's code like "acc = a * b + c * d + acc" in a tight
> loop,
> > > > > some
> > > > > > > +     uarchs can execute results like:
> > > > > > > +
> > > > > > > +       _1 = a * b;
> > > > > > > +       _2 = .FMA (c, d, _1);
> > > > > > > +       acc_1 = acc_0 + _2;
> > > > > > > +
> > > > > > > +     in parallel, while turning it into
> > > > > > > +
> > > > > > > +       _1 = .FMA(a, b, acc_0);
> > > > > > > +       acc_1 = .FMA(c, d, _1);
> > > > > > > +
> > > > > > > +     hinders that, because then the first FMA depends on the
> result
> > > > > > > of preceding
> > > > > > > +     iteration.  */
> > > > > > >
> > > > > > > I can't see what can be run in parallel for the first case.  The .FMA
> > > > > > > depends on the multiplication a * b.  Iff the uarch somehow decomposes
> > > > > > > .FMA into multiply + add then the c * d multiply could run in parallel
> > > > > > > with the a * b multiply which _might_ be able to hide some of the
> > > > > > > latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> > > > > > > cycles but a multiply only 3.  But I never got confirmation from any
> > > > > > > of the CPU designers that .FMAs are issued when the multiply
> > > > > > > operands are ready and the add operand can be forwarded.
> > > > > > >
> > > > > > > I also wonder why the multiplications of the two-FMA sequence
> > > > > > > then cannot be executed at the same time?  So I have some doubt
> > > > > > > of the theory above.
> > > > > >
> > > > > > The parallel execution for the code snippet above was the other
> > > > > > issue (previously discussed here:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> > > > > > Sorry it's a bit confusing to include that here, but these 2 fixes
> > > > > > need to be combined to avoid new regressions. Since considering
> > > > > > FMA in get_reassociation_width produces more results of width=1,
> > > > > > there would be more loop-dependent FMA chains.
> > > > > >
> > > > > > > Iff this really is the reason for the sequence to execute with
> lower
> > > > > > > overall latency and we want to attack this on GIMPLE then I think
> > > > > > > we need a target hook telling us this fact (I also wonder if such
> > > > > > > behavior can be modeled in the scheduler pipeline description at
> all?)
> > > > > > >
> > > > > > > > So it seems to me the current get_reassociation_width algorithm
> > > > > > > > isn't optimal in the presence of FMA. So I modified the patch to
> > > > > > > > improve get_reassociation_width, rather than check for code
> > > > > > > > patterns. (Although there could be some other complicated
> > > > > > > > factors so the regression is more obvious when there's "nested
> > > > > > > > FMA". But with this patch that should be avoided or reduced.)
> > > > > > > >
> > > > > > > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > > > > > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > > > > > > collecting data on other CPUs, I'd like to know what you
> > > > > > > > think of this.
> > > > > > > >
> > > > > > > > About changes in the patch:
> > > > > > > >
> > > > > > > > 1. When the op list forms a complete FMA chain, try to search
> > > > > > > > for a smaller width considering the benefit of using FMA. With
> > > > > > > > a smaller width, the increment of code size is smaller when
> > > > > > > > breaking the chain.
> > > > > > >
> > > > > > > But this is all highly target specific (code size even more so).
> > > > > > >
> > > > > > > How I understand your approach to fixing the issue leads me to
> > > > > > > the suggestion to prioritize parallel rewriting, thus alter rank_ops_for_fma,
> > > > > > > taking the reassoc width into account (the computed width should be
> > > > > > > unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> > > > > > > rewriting of FMAs (well, they are not yet formed of course).
> > > > > > > get_reassociation_width has 'get_required_cycles', the above theory
> > > > > > > could be verified with a very simple toy pipeline model.  We'd have
> > > > > > > to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
> > > > > > > even FMA_EXPRs).
> > > > > > >
> > > > > > > Taking the width of FMAs into account when computing the reassoc width
> > > > > > > might be another way to attack this.
> > > > > >
> > > > > > Previously I tried to solve this generally, on the assumption that
> > > > > > FMA (smaller code size) is preferred. Now I agree it's difficult
> > > > > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> > > > > > be different. 2) From my test result on different machines we
> > > > > > have, it seems simply adding the cycles together is not a good way
> > > > > > to estimate the latency of consecutive FMA.
> > > > > >
> > > > > > I think an easier way to fix this is to add a parameter to suggest
> > > > > > the length of the complete FMA chain to keep. (It can be set by target
> > > > > > specific tuning then.) And we can break longer FMA chains for
> > > > > > better parallelism. Attached is the new implementation. With
> > > > > > max-fma-chain-len=8, there's about 7% improvement in spec2017
> > > > > > 508.namd_r on ampere1, and the overall improvement on fprate is
> > > > > > about 1%.
> > > > > >
> > > > > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> > > > > > others, I left it before get_reassociation_width so the number of
> > > > > > MULT_EXPRs can be used.
> > > > >
> > > > > Sorry again for the delay in replying.
> > > > >
> > > > > +  /* Check if keeping complete FMA chains is preferred.  */
> > > > > +  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> > > > > +    {
> > > > > +      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
> > > > > +      int num_others = ops_num - mult_num;
> > > > > +      int num_fma_chain = CEIL (num_others + 1, 2);
> > > > > +
> > > > > +      if (num_fma_chain < width
> > > > > +         && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
> > > > > +       width = num_fma_chain;
> > > > > +    }
> > > > >
> > > > > so here 'mult_num' serves as a heuristic value for how many
> > > > > FMAs we could build.  If that were close to ops_num - 1 then
> > > > > we'd have a chain of FMAs.  Not sure how you get at
> > > > > num_others / 2 here.  Maybe we need to elaborate on what an
> > > > > FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e), f, g)
> > > > > where each (b,c) pair is really just one operand in the ops array,
> > > > > one of the 'mult's.  Thus a FMA chain is _not_
> > > > > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
> > > >
> > > > The "FMA chain" here refers to consecutive FMAs, each taking
> > > > The previous one's result as the third operator, i.e.
> > > > ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So original op
> > > > list looks like "r + a * b + c * d + e * f + ...". These FMAs
> > > > will end up using the same accumulate register.
> > > >
> > > > When num_others=2 or 3, there can be 2 complete chains, e.g.
> > > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> > > > or
> > > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> > > > And so on, that's where the "CEIL (num_others + 1, 2)" comes from.
> > > >
> > > > >
> > > > > Forming an FMA chain effectively reduces the reassociation width
> > > > > of the participating multiplies.  If we were not to form FMAs all
> > > > > the multiplies could execute in parallel.
> > > > >
> > > > > So what does the above do, in terms of adjusting the reassociation
> > > > > width for the _adds_, and what's the ripple-down effect on later
> > > > > FMA forming?
> > > > >
> > > >
> > > > The above code calculates the number of such FMA chains in the op
> > > > list. And if the length of each chain doesn't exceed
> > > > param_max_fma_chain_len, then width is set to the number of chains,
> > > > so we won't break them (because rewrite_expr_tree_parallel handles
> > > > this well).
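> > > >
> > > > For a concrete walk-through of the snippet above (the op list is
> > > > hypothetical, chosen only to show the arithmetic): for
> > > > "r + a * b + c * d + e * f + g + h", ops_num = 6 and mult_num = 3,
> > > > so num_others = 3 and num_fma_chain = CEIL (3 + 1, 2) = 2.  If
> > > > 2 < width and the longest chain, CEIL (3, 2) = 2 FMAs, doesn't
> > > > exceed param_max_fma_chain_len, width is set to 2 and both chains
> > > > are kept intact.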
> > > >
> > > > > The change still feels like whack-a-mole playing rather than understanding
> > > > > the fundamental issue on the targets.
> > > >
> > > > I think the complexity is in how the instructions are piped.
> > > > Some Arm CPUs such as Neoverse V2 support "late-forwarding":
> > > > "FP multiply-accumulate pipelines support late-forwarding of
> > > > accumulate operands from similar μOPs, allowing a typical
> > > > sequence of multiply-accumulate μOPs to issue one every N
> > > > cycles". ("N" is smaller than the latency of a single FMA
> > > > instruction.) So keeping such FMA chains can utilize this
> > > > feature and use fewer FP units. I guess the case is similar on
> > > > some recent x86 CPUs.
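> > > >
> > > > As a rough worked instance (numbers hypothetical, only to show the
> > > > shape of the saving): with lat(FMA) = 4 cycles and a late-forwarding
> > > > issue interval of N = 2, the last of 4 dependent FMAs issues at
> > > > cycle 3 * 2 = 6 and completes at cycle 6 + 4 = 10, instead of
> > > > 4 * 4 = 16 when each FMA must wait for its predecessor's full
> > > > latency.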
> > > >
> > > > If we try to compute the minimum cycle count of each option, I think
> > > > at least we'll need to know whether the target has a similar
> > > > feature, and the latency of each uop. Meanwhile, using an
> > > > empirical length for a beneficial FMA chain could be a shortcut.
> > > > (Maybe allowing different lengths for different data widths is
> > > > better.)
> > >
> > > Hm.  So even when we can late-forward in an FMA chain,
> > > increasing the width should typically still be better?
> > >
> > > _1 = FMA (_2 * _3 + _4);
> > > _5 = FMA (_6 * _7 + _1);
> > >
> > > say with late-forwarding we can hide the latency of the _6 * _7
> > > multiply and the overall latency of the two FMAs above becomes
> > > lat (FMA) + lat (ADD) in the ideal case.  Alternatively we do
> > >
> > > _1 = FMA (_2 * _3 + _4);
> > > _8 = _6 * _7;
> > > _5 = _1 + _8;
> > >
> > > where if the FMA and the multiply can execute in parallel
> > > (we have two FMA pipes) the latency would be lat (FMA) + lat (ADD).
> > > But when we only have a single pipeline capable of
> > > FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL) + 1)
> > > + lat (ADD); it depends on luck whether the FMA or the MUL is
> > > issued first there.
> > >
> > > So if late-forward works really well and the add part of the FMA
> > > has very low latency compared to the multiplication part, having
> > > a smaller reassoc width should pay off here and we might be
> > > able to simply control this via the existing target hook?
> > >
> > > I'm not aware of x86 CPUs having late-forwarding capabilities
> > > but usually the latency of multiplication and FMA is very similar
> > > and one can issue two FMAs and possibly more ADDs in parallel.
> > >
> > > As said I think this detail (late-forward) should maybe be reflected
> > > into get_required_cycles, possibly guided by a different
> > > targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?
> > >
> >
> > To my understanding, the question is whether the target fully
> > pipelines FMA instructions, so the MULT part can start early if
> > its operands are ready. targetm.sched.reassociation_width
> > reflects the number of pipes for a given operation, so it can guide
> > get_required_cycles for a sequence of identical operations
> > (e.g. A * B * C * D or A + B + C + D). Since the problem in
> > this case is not the number of pipes for FMA, I think another
> > indicator may be better.
> >
> > (Currently the fma_reassoc_width for AArch64 is there to control
> > whether reassociation on FADD is OK. This workaround doesn't
> > work well in some cases; for example, it turns down reassociation
> > even when there's no FMA at all. So I think we'd better not
> > follow that scheme.)
> >
> > Attached is a new version of the patch with a flag to indicate
> > whether FMA is fully pipelined, and: 1) lat (MUL) >= lat (ADD);
> > 2) symmetric units are used for FMUL/FADD/FMA. Otherwise the
> > patch may not be beneficial.
> >
> > It tries to calculate the latencies including MULT_EXPRs. Since
> > the code differs from the current code (the quick-search
> > part), I haven't included it inside get_required_cycles.
> 
> +; If the flag 'fully-pipelined-fma' is set, reassociation takes into account
> +; the benefit of parallelizing FMA's multiply part and addition part.
> +ffully-pipelined-fma
> +Common Var(flag_fully_pipelined_fma)
> +Assume the target fully pipelines FMA instruction, and symmetric units are used
> +for FMUL/FADD/FMA.
> 
> please use a --param for now, I think targets might want to set this based
> on active core tuning.
> 
> +/* Given that the target fully pipelines FMA instructions, return latency of
> +   MULT_EXPRs that can't be hided by FMA.  WIDTH is the number of pipes.  */
> +
> 
> return the latency .. can't be hidden by the FMA
> 
> For documentation purposes it should be stated that mult_num <= ops_num
> 
> +  /* If the target fully pipelines FMA instruction, the multiply part can start
> 
> instructions
> 
> +     first if its operands are ready.  Assuming symmetric pipes are used for
> 
> s/first/already/
> 
> +     FMUL/FADD/FMA, then for a sequence of FMA like:
> +
> +       _8 = .FMA (_2, _3, _1);
> +       _9 = .FMA (_5, _4, _8);
> +       _10 = .FMA (_7, _6, _9);
> +
> +     , if width=1, the latency is latency(MULT) + latency(ADD)*3.
> +     While with width=2:
> +
> +       _8 = _4 * _5;
> +       _9 = .FMA (_2, _3, _1);
> +       _10 = .FMA (_6, _7, _8);
> +       _11 = _9 + _10;
> +
> +     , it is latency(MULT)*2 + latency(ADD)*2.  Assuming latency(MULT) <=
> +     latency(ADD), the previous one is preferred.
> 
> latency (MULT) >= latency (ADD)?
> 
> ".. the first variant is preferred."
> 
> +
> +     Find out if we can get a smaller width considering FMA.  */

Corrected these errors. Thank you for the corrections.

> +      /* When flag_fully_pipelined_fma is set, assumes symmetric pipes are used
> +        for FMUL/FADD/FMA.  */
> +      int lat_mul = get_mult_latency_consider_fma (ops_num, mult_num, width);
> 
> what does "symmetric pipes" actually mean?  For x86 Zen3 we have
> two FMA pipes (that can also do ADD and MUL) and one pipe that can
> do ADD.  Is that then non-symmetric because we can issue more adds
> in parallel than FMA/MUL?

"symmetric pipes" was to indicate that FADD/FMA/FMUL use the same unit
set, so the widths are uniform, and the calculations in this patch can
apply. I think this can be relaxed for scenarios like Zen3, by searching
for a smaller width only using the pipes for FMUL/FMA. But if the pipes
for FMUL and FADD are separated, for example 1 for FMA/FMUL and 2 other
pipes for FADD, then the minimum cycle count might be incorrect.

Changed the descriptions and code in get_reassociation_width a bit to
include cases like Zen3:

+      /* When param_fully_pipelined_fma is set, assume FMUL and FMA use the
+	 same units that can also do FADD.  For other scenarios, such as when
+	 FMUL and FADD are using distinct units, the following code may not
+	 apply.  */
+      int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
+      gcc_checking_assert (width_mult <= width);
+
+      /* Latency of MULT_EXPRs.  */
+      int lat_mul
+	= get_mult_latency_consider_fma (ops_num, mult_num, width_mult);
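
To see what this does in isolation, here is a rough standalone model
(a hypothetical driver re-typing the patch's one-line expression, for
illustration only):

#include <stdio.h>

#define CEIL(x, y) (((x) + (y) - 1) / (y))

/* Same expression as get_mult_latency_consider_fma in the patch: when
   CEIL (ops_num, width) == CEIL (mult_num, width), some partition is
   made up of MULT_EXPRs only, so its FMA chain has to start from a
   multiply and latency(MULT)*2 is exposed; otherwise only the first
   multiply's latency is.  */
static int
mult_latency (int ops_num, int mult_num, int width)
{
  return CEIL (ops_num, width) == CEIL (mult_num, width) ? 2 : 1;
}

int
main (void)
{
  /* E.g. r + a * b + c * d + e * f: ops_num = 4, mult_num = 3.  */
  for (int w = 1; w <= 3; w++)
    printf ("width %d: latency(MULT) * %d exposed\n", w,
            mult_latency (4, 3, w));
  return 0;
}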

> Otherwise this looks OK now.
> 
> Thanks,
> Richard.
> 
> > > > > +  /* If there's loop dependent FMA result, return width=2 to avoid it.  This is
> > > > > +     better than skipping these FMA candidates in widening_mul.  */
> > > > >
> > > > > better than skipping, but you don't touch it there?  I suppose width == 2
> > > > > will bypass the skipping, right?  This heuristic only comes in when the above
> > > > > change made width == 1, since otherwise we have an earlier
> > > > >
> > > > >   if (width == 1)
> > > > >     return width;
> > > > >
> > > > > which also guarantees width == 2 was allowed by the hook/param, right?
> > > >
> > > > Yes, that's right.
> > > >
> > > > >
> > > > > +  if (width == 1 && mult_num
> > > > > +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
> > > > > +                  param_avoid_fma_max_bits))
> > > > > +    {
> > > > > +      /* Look for cross backedge dependency:
> > > > > +       1. LHS is a phi argument in the same basic block it is defined.
> > > > > +       2. And the result of the phi node is used in OPS.  */
> > > > > +      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > > > > +      gimple_stmt_iterator gsi;
> > > > > +      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > +       {
> > > > > +         gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
> > > > > +         for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > > > > +           {
> > > > > +             tree op = PHI_ARG_DEF (phi, i);
> > > > > +             if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
> > > > > +               continue;
> > > > >
> > > > > I think it's easier to iterate over the immediate uses of LHS like
> > > > >
> > > > >   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> > > > >      if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
> > > > >        {
> > > > >           if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> > > > > (use_p))->src != bb)
> > > > >             continue;
> > > > > ...
> > > > >        }
> > > > >
> > > > > otherwise I think _this_ part of the patch looks reasonable.
> > > > >
> > > > > As you say heuristically they might go together but I think we should split the
> > > > > patch - the cross-loop part can probably stand independently.  Can you adjust
> > > > > and re-post?
> > > >
> > > > Attached is the separated part for cross-loop FMA. Thank you for the
> > > > correction.
> > >
> > > That cross-loop FMA patch is OK.
> >
> > Committed this part at 746344dd.
> >
> > Thanks,
> > Di
> >
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > >
> > > > > As for the first part I still don't understand very well and am still hoping
> > > > > we can get away without yet another knob to tune.
> > > > >
> > > > > Richard.
> > > > >
> > > > > > >
> > > > > > > > 2. To avoid regressions, included the other patch
> > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > > > > > > on this tracker again. This is because more FMA will be kept
> > > > > > > > with 1., so we need to rule out the loop dependent
> > > > > > > > FMA chains when param_avoid_fma_max_bits is set.
> > > > > > >
> > > > > > > Sorry again for taking so long to reply.
> > > > > > >
> > > > > > > I'll note we have an odd case on x86 Zen2(?) as well which we don't really
> > > > > > > understand from a CPU behavior perspective.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Di Zhao
> > > > > > > >
> > > > > > > > ----
> > > > > > > >
> > > > > > > >         PR tree-optimization/110279
> > > > > > > >
> > > > > > > > gcc/ChangeLog:
> > > > > > > >
> > > > > > > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > > > > > > >         New function to check whether ranking the ops results in
> > > > > > > >         better parallelism.
> > > > > > > >         (get_reassociation_width): Add new parameters. Search for
> > > > > > > >         smaller width considering the benefit of FMA.
> > > > > > > >         (rank_ops_for_fma): Change return value to be number of
> > > > > > > >         MULT_EXPRs.
> > > > > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > > > > >         swap_ops_for_binary_stmt.
> > > > > > > >
> > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > >
> > > > > > > >         * gcc.dg/pr110279.c: New test.
> > > > > >
> > > > > > Thanks,
> > > > > > Di Zhao
> > > > > >
> > > > > > ----
> > > > > >
> > > > > >         PR tree-optimization/110279
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         * doc/invoke.texi: Description of param_max_fma_chain_len.
> > > > > >         * params.opt: New parameter param_max_fma_chain_len.
> > > > > >         * tree-ssa-reassoc.cc (get_reassociation_width):
> > > > > >         Support param_max_fma_chain_len; check for loop dependent
> > > > > >         FMAs.
> > > > > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > > >         swap_ops_for_binary_stmt.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         * gcc.dg/pr110279-1.c: New test.
> > > > > >         * gcc.dg/pr110279-2.c: New test.
> > > > > >         * gcc.dg/pr110279-3.c: New test.
> > > >
> > > > ---
> > > >
> > > >         PR tree-optimization/110279
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-ssa-reassoc.cc (get_reassociation_width): Check
> > > >         for loop dependent FMAs.
> > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > >         swap_ops_for_binary_stmt.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.dg/pr110279-1.c: New test.
> > ---
> >
> >         PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> >         * common.opt: New flag fully-pipelined-fma.
> >         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma):
> >         Return latency of MULT_EXPRs that can't be hided by FMA.
> >         (get_reassociation_width): Search for smaller widths
> >         considering the benefit of fully pipelined FMA.
> >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> >         (reassociate_bb): Pass the number of MULT_EXPRs to
> >         get_reassociation_width; avoid calling
> >         get_reassociation_width twice.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/pr110279-2.c: New test.

Thanks,
Di

---

	PR tree-optimization/110279

gcc/ChangeLog:

	* doc/invoke.texi: New parameter fully-pipelined-fma.
	* params.opt: New parameter fully-pipelined-fma.
	* tree-ssa-reassoc.cc (get_mult_latency_consider_fma): Return
	the latency of MULT_EXPRs that can't be hidden by the FMAs.
	(get_reassociation_width): Search for a smaller width
	considering the benefit of fully pipelined FMA.
	(rank_ops_for_fma): Return the number of MULT_EXPRs.
	(reassociate_bb): Pass the number of MULT_EXPRs to
	get_reassociation_width; avoid calling
	get_reassociation_width twice.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-2.c: New test.
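
For reference, the new knob can be exercised the way the added testcase
below does (flags copied from pr110279-2.c):

  gcc -Ofast --param tree-reassoc-width=4 --param fully-pipelined-fma=1 \
      -fdump-tree-reassoc2-details -fdump-tree-optimized pr110279-2.c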

[-- Attachment #2: 0001-Consider-fully-pipelined-FMA-in-get_reassociation_wi.patch --]
[-- Type: application/octet-stream, Size: 11556 bytes --]

From 5bf21ade03704e965e4d07c2603b9542dc4c0ac4 Mon Sep 17 00:00:00 2001
From: "Di Zhao" <dizhao@os.amperecomputing.com>
Date: Tue, 12 Dec 2023 23:06:14 +0800
Subject: [PATCH] Consider fully pipelined FMA in get_reassociation_width

---
 gcc/doc/invoke.texi               |   6 ++
 gcc/params.opt                    |   7 ++
 gcc/testsuite/gcc.dg/pr110279-2.c |  41 ++++++++
 gcc/tree-ssa-reassoc.cc           | 150 ++++++++++++++++++++++++------
 4 files changed, 177 insertions(+), 27 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-2.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d395a6a747e..f9ec93a8106 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16159,6 +16159,12 @@ Enable loop vectorization of floating point inductions.
 @item avoid-fma-max-bits
 Maximum number of bits for which we avoid creating FMAs.
 
+@item fully-pipelined-fma
+Whether the target fully pipelines FMA instructions.  If non-zero,
+reassociation considers the benefit of parallelizing FMA's multiplication
+part and addition part, assuming FMUL and FMA use the same units that can
+also do FADD.
+
 @item sms-loop-average-count-threshold
 A threshold on the average loop count considered by the swing modulo scheduler.
 
diff --git a/gcc/params.opt b/gcc/params.opt
index e3a93ac7435..ab12b528540 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -146,6 +146,13 @@ Maximum number of outgoing edges in a switch before EVRP will not process it.
 Common Joined UInteger Var(param_fsm_scale_path_stmts) Init(2) IntegerRange(1, 10) Param Optimization
 Scale factor to apply to the number of statements in a threading path crossing a loop backedge when comparing to max-jump-thread-duplication-stmts.
 
+-param=fully-pipelined-fma=
+Common Joined UInteger Var(param_fully_pipelined_fma) Init(0) IntegerRange(0, 1) Param Optimization
+Whether the target fully pipelines FMA instructions.  If non-zero,
+reassociation considers the benefit of parallelizing FMA's multiplication
+part and addition part, assuming FMUL and FMA use the same units that can
+also do FADD.
+
 -param=gcse-after-reload-critical-fraction=
 Common Joined UInteger Var(param_gcse_after_reload_critical_fraction) Init(10) Param Optimization
 The threshold ratio of critical edges execution count that permit performing redundancy elimination after reload.
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
new file mode 100644
index 00000000000..0304a77aa66
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -0,0 +1,41 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+__attribute_noinline__ data_e
+foo (data_e in)
+{
+  data_e a1, a2, a3, a4;
+  data_e tmp, result = 0;
+  a1 = in + 0.1;
+  a2 = in * 0.1;
+  a3 = in + 0.01;
+  a4 = in * 0.59;
+
+  data_e result2 = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* Test that a complete FMA chain with length=4 is not broken.  */
+      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
+      result += tmp - ic;
+      result2 = result2 / 2 - tmp;
+
+      a1 += 0.91;
+      a2 += 0.1;
+      a3 -= 0.01;
+      a4 -= 0.89;
+
+    }
+
+  return result + result2;
+}
+
+/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
+/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index ce97fc9a8b8..d45898ea1d5 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5425,13 +5425,35 @@ get_required_cycles (int ops_num, int cpu_width)
   return res;
 }
 
+/* Given that the target fully pipelines FMA instructions, return the latency
+   of MULT_EXPRs that can't be hidden by the FMAs.  WIDTH is the number of
+   pipes.  */
+
+static inline int
+get_mult_latency_consider_fma (int ops_num, int mult_num, int width)
+{
+  gcc_checking_assert (mult_num && mult_num <= ops_num);
+
+  /* For each partition, if mult_num == ops_num, there's latency(MULT)*2.
+     e.g:
+
+	A * B + C * D
+	=>
+	_1 = A * B;
+	_2 = .FMA (C, D, _1);
+
+      Otherwise there's latency(MULT)*1 in the first FMA.  */
+  return CEIL (ops_num, width) == CEIL (mult_num, width) ? 2 : 1;
+}
+
 /* Returns an optimal number of registers to use for computation of
    given statements.
 
-   LHS is the result ssa name of OPS.  */
+   LHS is the result ssa name of OPS.  MULT_NUM is number of sub-expressions
+   that are MULT_EXPRs, when OPS are PLUS_EXPRs or MINUS_EXPRs.  */
 
 static int
-get_reassociation_width (vec<operand_entry *> *ops, tree lhs,
+get_reassociation_width (vec<operand_entry *> *ops, int mult_num, tree lhs,
 			 enum tree_code opc, machine_mode mode)
 {
   int param_width = param_tree_reassoc_width;
@@ -5457,16 +5479,68 @@ get_reassociation_width (vec<operand_entry *> *ops, tree lhs,
      so we can perform a binary search for the minimal width that still
      results in the optimal cycle count.  */
   width_min = 1;
-  while (width > width_min)
+
+  /* If the target fully pipelines FMA instruction, the multiply part can start
+     already if its operands are ready.  Assuming symmetric pipes are used for
+     FMUL/FADD/FMA, then for a sequence of FMA like:
+
+	_8 = .FMA (_2, _3, _1);
+	_9 = .FMA (_5, _4, _8);
+	_10 = .FMA (_7, _6, _9);
+
+     , if width=1, the latency is latency(MULT) + latency(ADD)*3.
+     While with width=2:
+
+	_8 = _4 * _5;
+	_9 = .FMA (_2, _3, _1);
+	_10 = .FMA (_6, _7, _8);
+	_11 = _9 + _10;
+
+     , it is latency(MULT)*2 + latency(ADD)*2.  Assuming latency(MULT) >=
+     latency(ADD), the first variant is preferred.
+
+     Find out if we can get a smaller width considering FMA.  */
+  if (width > 1 && mult_num && param_fully_pipelined_fma)
     {
-      int width_mid = (width + width_min) / 2;
+      /* When param_fully_pipelined_fma is set, assume FMUL and FMA use the
+	 same units that can also do FADD.  For other scenarios, such as when
+	 FMUL and FADD are using separated units, the following code may not
+	 apply.  */
+      int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
+      gcc_checking_assert (width_mult <= width);
+
+      /* Latency of MULT_EXPRs.  */
+      int lat_mul
+	= get_mult_latency_consider_fma (ops_num, mult_num, width_mult);
+
+      /* Quick search might not apply.  So start from 1.  */
+      for (int i = 1; i < width_mult; i++)
+	{
+	  int lat_mul_new
+	    = get_mult_latency_consider_fma (ops_num, mult_num, i);
+	  int lat_add_new = get_required_cycles (ops_num, i);
 
-      if (get_required_cycles (ops_num, width_mid) == cycles_best)
-	width = width_mid;
-      else if (width_min < width_mid)
-	width_min = width_mid;
-      else
-	break;
+	  /* Assume latency(MULT) >= latency(ADD).  */
+	  if (lat_mul - lat_mul_new >= lat_add_new - cycles_best)
+	    {
+	      width = i;
+	      break;
+	    }
+	}
+    }
+  else
+    {
+      while (width > width_min)
+	{
+	  int width_mid = (width + width_min) / 2;
+
+	  if (get_required_cycles (ops_num, width_mid) == cycles_best)
+	    width = width_mid;
+	  else if (width_min < width_mid)
+	    width_min = width_mid;
+	  else
+	    break;
+	}
     }
 
   /* If there's loop dependent FMA result, return width=2 to avoid it.  This is
@@ -6836,8 +6910,10 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
    Rearrange ops to -> e + a * b + c * d generates:
 
    _4  = .FMA (c_7(D), d_8(D), _3);
-   _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
+   _11 = .FMA (a_5(D), b_6(D), _4);
+
+   Return the number of MULT_EXPRs in the chain.  */
+static int
 rank_ops_for_fma (vec<operand_entry *> *ops)
 {
   operand_entry *oe;
@@ -6851,9 +6927,26 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
       if (TREE_CODE (oe->op) == SSA_NAME)
 	{
 	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
-	  if (is_gimple_assign (def_stmt)
-	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
-	    ops_mult.safe_push (oe);
+	  if (is_gimple_assign (def_stmt))
+	    {
+	      if (gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		ops_mult.safe_push (oe);
+	      /* A negate on the multiplication leads to FNMA.  */
+	      else if (gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
+		       && TREE_CODE (gimple_assign_rhs1 (def_stmt)) == SSA_NAME)
+		{
+		  gimple *neg_def_stmt
+		    = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (def_stmt));
+		  if (is_gimple_assign (neg_def_stmt)
+		      && gimple_bb (neg_def_stmt) == gimple_bb (def_stmt)
+		      && gimple_assign_rhs_code (neg_def_stmt) == MULT_EXPR)
+		    ops_mult.safe_push (oe);
+		  else
+		    ops_others.safe_push (oe);
+		}
+	      else
+		ops_others.safe_push (oe);
+	    }
 	  else
 	    ops_others.safe_push (oe);
 	}
@@ -6869,7 +6962,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
      Putting ops that not def from mult in front can generate more FMAs.
 
      2. If all ops are defined with mult, we don't need to rearrange them.  */
-  if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
+  unsigned mult_num = ops_mult.length ();
+  if (mult_num >= 2 && mult_num != ops_length)
     {
       /* Put no-mult ops and mult ops alternately at the end of the
 	 queue, which is conducive to generating more FMA and reducing the
@@ -6885,9 +6979,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
     }
-  return false;
+  return mult_num;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -7052,8 +7145,8 @@ reassociate_bb (basic_block bb)
 		{
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
-		  int width;
-		  bool has_fma = false;
+		  int width = 0;
+		  int mult_num = 0;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7076,16 +7169,17 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      mult_num = rank_ops_for_fma (&ops);
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
+		  bool has_fma = mult_num >= 2 && mult_num != ops_num;
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
-		      && (width
-			  = get_reassociation_width (&ops, lhs, rhs_code, mode))
+		      && (width = get_reassociation_width (&ops, mult_num, lhs,
+							   rhs_code, mode))
 			   > 1)
 		    {
 		      if (dump_file && (dump_flags & TDF_DETAILS))
@@ -7106,10 +7200,12 @@ reassociate_bb (basic_block bb)
 		      if (len >= 3
 			  && (!has_fma
 			      /* width > 1 means ranking ops results in better
-				 parallelism.  */
-			      || get_reassociation_width (&ops, lhs, rhs_code,
-							  mode)
-				   > 1))
+				 parallelism.  Check current value to avoid
+				 calling get_reassociation_width again.  */
+			      || (width != 1
+				  && get_reassociation_width (
+				       &ops, mult_num, lhs, rhs_code, mode)
+				       > 1)))
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-13  8:14               ` Di Zhao OS
@ 2023-12-13  9:00                 ` Richard Biener
  2023-12-14 20:55                   ` Di Zhao OS
  2023-12-15  9:46                 ` Thomas Schwinge
  1 sibling, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-12-13  9:00 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: gcc-patches

On Wed, Dec 13, 2023 at 9:14 AM Di Zhao OS
<dizhao@os.amperecomputing.com> wrote:
>
> Hello Richard,
>
> > +      /* When flag_fully_pipelined_fma is set, assumes symmetric pipes are used
> > +        for FMUL/FADD/FMA.  */
> > +      int lat_mul = get_mult_latency_consider_fma (ops_num, mult_num, width);
> >
> > what does "symmetric pipes" actually mean?  For x86 Zen3 we have
> > two FMA pipes (that can also do ADD and MUL) and one pipe that can
> > do ADD.  Is that then non-symmetric because we can issue more adds
> > in parallel than FMA/MUL?

btw, I double-checked and Zen3/4 have two pipes for FMUL/FMA and two
separate pipes for FADD; the FMUL/FMA pipes cannot do FADD.  FADD
has a latency of 3 cycles while FMUL/FMA has a latency of 4 cycles.

I'd say Zen is then not "symmetric" as in your definition?  I do wonder
what part of the pipeline characteristic could be derived from the
reassoc_width target hook (maybe the number of pipes but not whether
they are shared with another op).  In theory the scheduling description
could offer the info (if correct and precise enough), but I don't think there's
a good way to query these details.

> "symmetric pipes" was to indicate that FADD/FMA/FMUL use the same unit
> set, so the widths are uniform, and the calculations in this patch can
> apply. I think this can be relaxed for scenarios like Zen3, by searching
> for a smaller width only using the pipes for FMUL/FMA. But if the pipes
> for FMUL and FADD are separated, for example 1 for FMA/FMUL and 2 other
> pipes for FADD, then the minimum cycle count might be incorrect.
>
> Changed the descriptions and code in get_reassociation_width a bit to
> include cases like Zen3:
>
> +      /* When param_fully_pipelined_fma is set, assume FMUL and FMA use the
> +        same units that can also do FADD.  For other scenarios, such as when
> +        FMUL and FADD are using distinct units, the following code may not
> +        apply.  */
> +      int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
> +      gcc_checking_assert (width_mult <= width);
> +
> +      /* Latency of MULT_EXPRs.  */
> +      int lat_mul
> +       = get_mult_latency_consider_fma (ops_num, mult_num, width_mult);

The updated patch is OK.

Thanks for your patience.

Thanks,
Richard.

> > > > > > >         * gcc.dg/pr110279-2.c: New test.
> > > > > > >         * gcc.dg/pr110279-3.c: New test.
> > > > >
> > > > > ---
> > > > >
> > > > >         PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * tree-ssa-reassoc.cc (get_reassociation_width): check
> > > > >         for loop dependent FMAs.
> > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > >         swap_ops_for_binary_stmt.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.dg/pr110279-1.c: New test.
> > > ---
> > >
> > >         PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * common.opt: New flag fully-pipelined-fma.
> > >         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma):
> > >         Return latency of MULT_EXPRs that can't be hided by FMA.
> > >         (get_reassociation_width): Search for smaller widths
> > >         considering the benefit of fully pipelined FMA.
> > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > >         (reassociate_bb): Pass the number of MULT_EXPRs to
> > >         get_reassociation_width; avoid calling
> > >         get_reassociation_width twice.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279-2.c: New test.
>
> Thanks,
> Di
>
> ---
>
>         PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * doc/invoke.texi: New parameter fully-pipelined-fma.
>         * params.opt: New parameter fully-pipelined-fma.
>         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma): Return
>         the latency of MULT_EXPRs that can't be hidden by the FMAs.
>         (get_reassociation_width): Search for a smaller width
>         considering the benefit of fully pipelined FMA.
>         (rank_ops_for_fma): Return the number of MULT_EXPRs.
>         (reassociate_bb): Pass the number of MULT_EXPRs to
>         get_reassociation_width; avoid calling
>         get_reassociation_width twice.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-2.c: New test.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-13  9:00                 ` Richard Biener
@ 2023-12-14 20:55                   ` Di Zhao OS
  2023-12-15  7:23                     ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread
From: Di Zhao OS @ 2023-12-14 20:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches


> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Wednesday, December 13, 2023 5:01 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> On Wed, Dec 13, 2023 at 9:14 AM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > Hello Richard,
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Monday, December 11, 2023 7:01 PM
> > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > Cc: gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > get_reassociation_width
> > >
> > > On Wed, Nov 29, 2023 at 3:36 PM Di Zhao OS
> > > <dizhao@os.amperecomputing.com> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > Sent: Tuesday, November 21, 2023 9:01 PM
> > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > get_reassociation_width
> > > > >
> > > > > On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS
> <dizhao@os.amperecomputing.com>
> > > > > wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > > Sent: Tuesday, October 31, 2023 9:48 PM
> > > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > > > get_reassociation_width
> > > > > > >
> > > > > > > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS
> > > <dizhao@os.amperecomputing.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Attached is a new version of the patch.
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider
> FMA in
> > > > > > > > > get_reassociation_width
> > > > > > > > >
> > > > > > > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > > > > > > > > <dizhao@os.amperecomputing.com> wrote:
> > > > > > > > > >
> > > > > > > > > > This is a new version of the patch on "nested FMA".
> > > > > > > > > > Sorry for updating this after so long, I've been studying
> and
> > > > > > > > > > writing micro cases to sort out the cause of the regression.
> > > > > > > > >
> > > > > > > > > Sorry for taking so long to reply.
> > > > > > > > >
> > > > > > > > > > First, following previous discussion:
> > > > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > > > > September/629080.html)
> > > > > > > > > >
> > > > > > > > > > 1. From testing more altered cases, I don't think the
> > > > > > > > > > problem is that reassociation works locally. In that:
> > > > > > > > > >
> > > > > > > > > >   1) On the example with multiplications:
> > > > > > > > > >
> > > > > > > > > >         tmp1 = a + c * c + d * d + x * y;
> > > > > > > > > >         tmp2 = x * tmp1;
> > > > > > > > > >         result += (a + c + d + tmp2);
> > > > > > > > > >
> > > > > > > > > >   Given "result" rewritten by width=2, the performance is
> > > > > > > > > >   worse if we rewrite "tmp1" with width=2. In contrast, if
> we
> > > > > > > > > >   remove the multiplications from the example (and make
> "tmp1"
> > > > > > > > > >   not singe used), and still rewrite "result" by width=2,
> then
> > > > > > > > > >   rewriting "tmp1" with width=2 is better. (Make sense
> because
> > > > > > > > > >   the tree's depth at "result" is still smaller if we
> rewrite
> > > > > > > > > >   "tmp1".)
> > > > > > > > > >
> > > > > > > > > >   2) I tried to modify the assembly code of the example
> without
> > > > > > > > > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > > > > > > > > >   obvious improvement. So although this is an interesting
> > > > > > > > > >   problem, it doesn't seem like the cause of the regression.
> > > > > > > > >
> > > > > > > > > OK, I see.
> > > > > > > > >
> > > > > > > > > > 2. From assembly code of the case with FMA, one problem is
> > > > > > > > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > > > > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > > > > > > increased code size, so the overhead is increased.
> > > > > > > > > >
> > > > > > > > > >    a) When "tmp1" is not re-written to parallel:
> > > > > > > > > >         fmadd d31, d2, d2, d30
> > > > > > > > > >         fmadd d31, d3, d3, d31
> > > > > > > > > >         fmadd d31, d4, d5, d31  //"tmp1"
> > > > > > > > > >         fmadd d31, d31, d4, d3
> > > > > > > > > >
> > > > > > > > > >    b) When "tmp1" is re-written to parallel:
> > > > > > > > > >         fmul  d31, d4, d5
> > > > > > > > > >         fmadd d27, d2, d2, d30
> > > > > > > > > >         fmadd d31, d3, d3, d31
> > > > > > > > > >         fadd  d31, d31, d27     //"tmp1"
> > > > > > > > > >         fmadd d31, d31, d4, d3
> > > > > > > > > >
> > > > > > > > > > For version a), there are 3 dependent FMAs to calculate
> "tmp1".
> > > > > > > > > > For version b), there are also 3 dependent instructions in
> the
> > > > > > > > > > longer path: the 1st, 3rd and 4th.
> > > > > > > > >
> > > > > > > > > Yes, it doesn't really change anything.  The patch has
> > > > > > > > >
> > > > > > > > > +  /* If there's code like "acc = a * b + c * d + acc" in a
> tight
> > > loop,
> > > > > > > some
> > > > > > > > > +     uarchs can execute results like:
> > > > > > > > > +
> > > > > > > > > +       _1 = a * b;
> > > > > > > > > +       _2 = .FMA (c, d, _1);
> > > > > > > > > +       acc_1 = acc_0 + _2;
> > > > > > > > > +
> > > > > > > > > +     in parallel, while turning it into
> > > > > > > > > +
> > > > > > > > > +       _1 = .FMA(a, b, acc_0);
> > > > > > > > > +       acc_1 = .FMA(c, d, _1);
> > > > > > > > > +
> > > > > > > > > +     hinders that, because then the first FMA depends on the
> > > result
> > > > > > > > > of preceding
> > > > > > > > > +     iteration.  */
> > > > > > > > >
> > > > > > > > > I can't see what can be run in parallel for the first case.
> > > The .FMA
> > > > > > > > > depends on the multiplication a * b.  Iff the uarch somehow
> > > decomposes
> > > > > > > > > .FMA into multiply + add then the c * d multiply could run in
> > > parallel
> > > > > > > > > with the a * b multiply which _might_ be able to hide some of
> the
> > > > > > > > > latency of the full .FMA.  Like on x86 Zen FMA has a latency
> of 4
> > > > > > > > > cycles but a multiply only 3.  But I never got confirmation
> from
> > > any
> > > > > > > > > of the CPU designers that .FMAs are issued when the multiply
> > > > > > > > > operands are ready and the add operand can be forwarded.
> > > > > > > > >
> > > > > > > > > I also wonder why the multiplications of the two-FMA sequence
> > > > > > > > > then cannot be executed at the same time?  So I have some
> doubt
> > > > > > > > > of the theory above.
> > > > > > > >
> > > > > > > > The parallel execution for the code snippet above was the other
> > > > > > > > issue (previously discussed here:
> > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-
> August/628960.html).
> > > > > > > > Sorry it's a bit confusing to include that here, but these 2
> fixes
> > > > > > > > needs to be combined to avoid new regressions. Since considering
> > > > > > > > FMA in get_reassociation_width produces more results of width=1,
> > > > > > > > so there would be more loop depending FMA chains.
> > > > > > > >
> > > > > > > > > Iff this really is the reason for the sequence to execute with
> > > lower
> > > > > > > > > overall latency and we want to attack this on GIMPLE then I
> think
> > > > > > > > > we need a target hook telling us this fact (I also wonder if
> such
> > > > > > > > > behavior can be modeled in the scheduler pipeline description
> at
> > > all?)
> > > > > > > > >
> > > > > > > > > > So it seems to me the current get_reassociation_width
> algorithm
> > > > > > > > > > isn't optimal in the presence of FMA. So I modified the
> patch to
> > > > > > > > > > improve get_reassociation_width, rather than check for code
> > > > > > > > > > patterns. (Although there could be some other complicated
> > > > > > > > > > factors so the regression is more obvious when there's
> "nested
> > > > > > > > > > FMA". But with this patch that should be avoided or reduced.)
> > > > > > > > > >
> > > > > > > > > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > > > > > > > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > > > > > > > > collecting data on other CPUs, I'd like to know how do you
> > > > > > > > > > think of this.
> > > > > > > > > >
> > > > > > > > > > About changes in the patch:
> > > > > > > > > >
> > > > > > > > > > 1. When the op list forms a complete FMA chain, try to
> search
> > > > > > > > > > for a smaller width considering the benefit of using FMA.
> With
> > > > > > > > > > a smaller width, the increment of code size is smaller when
> > > > > > > > > > breaking the chain.
> > > > > > > > >
> > > > > > > > > But this is all highly target specific (code size even more
> so).
> > > > > > > > >
> > > > > > > > > How I understand your approach to fixing the issue leads me to
> > > > > > > > > the suggestion to prioritize parallel rewriting, thus alter
> > > > > > > rank_ops_for_fma,
> > > > > > > > > taking the reassoc width into account (the computed width
> should
> > > be
> > > > > > > > > unchanged from rank_ops_for_fma) instead of "fixing up" the
> > > parallel
> > > > > > > > > rewriting of FMAs (well, they are not yet formed of course).
> > > > > > > > > get_reassociation_width has 'get_required_cycles', the above
> > > theory
> > > > > > > > > could be verified with a very simple toy pipeline model.  We'd
> > > have
> > > > > > > > > to ask the target for the reassoc width for MULT_EXPRs as well
> (or
> > > > > maybe
> > > > > > > > > even FMA_EXPRs).
> > > > > > > > >
> > > > > > > > > Taking the width of FMAs into account when computing the
> reassoc
> > > width
> > > > > > > > > might be another way to attack this.
> > > > > > > >
> > > > > > > > Previously I tried to solve this generally, on the assumption
> that
> > > > > > > > FMA (smaller code size) is preferred. Now I agree it's difficult
> > > > > > > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD
> can
> > > > > > > > be different. 2) From my test result on different machines we
> > > > > > > > have, it seems simply adding the cycles together is not a good
> way
> > > > > > > > to estimate the latency of consecutive FMA.
> > > > > > > >
> > > > > > > > I think an easier way to fix this is to add a parameter to
> suggest
> > > > > > > > the length of complete FMA chain to keep. (It can be set by
> target
> > > > > > > > specific tuning then.) And we can break longer FMA chains for
> > > > > > > > better parallelism. Attached is the new implementation. With
> > > > > > > > max-fma-chain-len=8, there's about 7% improvement in spec2017
> > > > > > > > 508.namd_r on ampere1, and the overall improvement on fprate is
> > > > > > > > about 1%.
> > > > > > > >
> > > > > > > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs
> from
> > > > > > > > others, I left it before get_reassociation_width so the number
> of
> > > > > > > > MULT_EXPRs can be used.
> > > > > > >
> > > > > > > Sorry again for the delay in replying.
> > > > > > >
> > > > > > > +  /* Check if keeping complete FMA chains is preferred.  */
> > > > > > > +  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> > > > > > > +    {
> > > > > > > +      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
> > > > > > > +      int num_others = ops_num - mult_num;
> > > > > > > +      int num_fma_chain = CEIL (num_others + 1, 2);
> > > > > > > +
> > > > > > > +      if (num_fma_chain < width
> > > > > > > +         && CEIL (mult_num, num_fma_chain) <=
> param_max_fma_chain_len)
> > > > > > > +       width = num_fma_chain;
> > > > > > > +    }
> > > > > > >
> > > > > > > so here 'mult_num' serves as a heuristical value how many
> > > > > > > FMAs we could build.  If that were close to ops_num - 1 then
> > > > > > > we'd have a chain of FMAs.  Not sure how you get at
> > > > > > > num_others / 2 here.  Maybe we need to elaborate on what an
> > > > > > > FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e), f,
> g)
> > > > > > > where each (b,c) pair is really just one operand in the ops array,
> > > > > > > one of the 'mult's.  Thus a FMA chain is _not_
> > > > > > > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
> > > > > >
> > > > > > The "FMA chain" here refers to consecutive FMAs, each taking
> > > > > > The previous one's result as the third operator, i.e.
> > > > > > ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So original op
> > > > > > list looks like "r + a * b + c * d + e * f + ...". These FMAs
> > > > > > will end up using the same accumulate register.
> > > > > >
> > > > > > When num_others=2 or 3, there can be 2 complete chains, e.g.
> > > > > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> > > > > > or
> > > > > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> > > > > > And so on, that's where the "CEIL (num_others + 1, 2)" comes from.
> > > > > >
> > > > > > >
> > > > > > > Forming an FMA chain effectively reduces the reassociation width
> > > > > > > of the participating multiplies.  If we were not to form FMAs all
> > > > > > > the multiplies could execute in parallel.
> > > > > > >
> > > > > > > So what does the above do, in terms of adjusting the reassociation
> > > > > > > width for the _adds_, and what's the ripple-down effect on later
> > > > > > > FMA forming?
> > > > > > >
> > > > > >
> > > > > > The above code calculates the number of such FMA chains in the op
> > > > > > list. And if the length of each chain doesn't exceed
> > > > > > param_max_fma_chain_len, then width is set to the number of chains,
> > > > > > so we won't break them (because rewrite_expr_tree_parallel handles
> > > > > > this well).
> > > > > >
> > > > > > > The change still feels like whack-a-mole playing rather than
> > > understanding
> > > > > > > the fundamental issue on the targets.
> > > > > >
> > > > > > I think the complexity is in how the instructions are piped.
> > > > > > Some Arm CPUs such as Neoverse V2 supports "late-forwarding":
> > > > > > "FP multiply-accumulate pipelines support late-forwarding of
> > > > > > accumulate operands from similar μOPs, allowing a typical
> > > > > > sequence of multiply-accumulate μOPs to issue one every N
> > > > > > cycles". ("N" is smaller than the latency of a single FMA
> > > > > > instruction.) So keeping such FMA chains can utilize such
> > > > > > feature and uses less FP units. I guess the case is similar on
> > > > > > some late X86 CPUs.
> > > > > >
> > > > > > If we try to compute the minimum cycles of each option, I think
> > > > > > at least we'll need to know whether the target has similar
> > > > > > feature, and the latency of each uop. While using an
> > > > > > experiential length of beneficial FMA chain could be a shortcut.
> > > > > > (Maybe allowing different lengths for different data widths is
> > > > > > better.)
> > > > >
> > > > > Hm.  So even when we can late-forward in an FMA chain
> > > > > increasing the width should typically be still better?
> > > > >
> > > > > _1 = FMA (_2 * _3 + _4);
> > > > > _5 = FMA (_6 * _7 + _1);
> > > > >
> > > > > say with late-forwarding we can hide the latency of the _6 * _7
> > > > > multiply and the overall latency of the two FMAs above become
> > > > > lat (FMA) + lat (ADD) in the ideal case.  Alternatively we do
> > > > >
> > > > > _1 = FMA (_2 * _ 3 + _4);
> > > > > _8 = _6 * _ 7;
> > > > > _5 = _1 + _8;
> > > > >
> > > > > where if the FMA and the multiply can execute in parallel
> > > > > (we have two FMA pipes) the latency would be lat (FMA) + lat (ADD).
> > > > > But when we only have a single pipeline capable of
> > > > > FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL) +
> 1)
> > > > > + lat (ADD), it depends on luck whether the FMA or the MUL is
> > > > > issued first there.
> > > > >
> > > > > So if late-forward works really well and the add part of the FMA
> > > > > has very low latency compared to the multiplication part having
> > > > > a smaller reassoc width should pay off here and we might be
> > > > > able to simply control this via the existing target hook?
> > > > >
> > > > > I'm not aware of x86 CPUs having late-forwarding capabilities
> > > > > but usually the latency of multiplication and FMA is very similar
> > > > > and one can issue two FMAs and possibly more ADDs in parallel.
> > > > >
> > > > > As said I think this detail (late-forward) should maybe reflected
> > > > > into get_required_cycles, possibly guided by a different
> > > > > targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?
> > > > >
> > > >
> > > > To my understanding, the question is whether the target fully
> > > > pipelines FMA instructions, so the MULT part can start first if
> > > > its operands are ready. While targetm.sched.reassociation_width
> > > > reflects the number of pipes for some operation, so it can guide
> > > > get_required_cycles for a sequence of identical operations
> > > > (e.g. A * B * C * D or A + B + C + D). Since the problem in
> > > > this case is not the number of pipes for FMA, I think another
> > > > indicator maybe better.
> > > >
> > > > (Currently the fma_reassoc_width for AArch64 is to control
> > > > whether reassociation on FADD is OK. This workaround doesn't
> > > > work well on some cases, for example it turns down reassociation
> > > > even when there's no FMA at all. So I think we'd better not
> > > > follow the schema.)
> > > >
> > > > Attached is a new version of the patch with a flag to indicate
> > > > whether FMA is fully pipelined, and: 1) lat (MUL) >= lat (ADD);
> > > > 2) symmetric units are used or FMUL/FADD/FMA. Otherwise the
> > > > patch may not be beneficial.
> > > >
> > > > It tries to calculate the latencies including MULT_EXPRs. Since
> > > > the code is different with the current code (the quick-search
> > > > part), I haven't included it inside get_required_cycles.
> > >
> > > +; If the flag 'fully-pipelined-fma' is set, reassociation takes into
> account
> > > +; the benifit of parallelizing FMA's multiply part and addition part.
> > > +ffully-pipelined-fma
> > > +Common Var(flag_fully_pipelined_fma)
> > > +Assume the target fully pipelines FMA instruction, and symmetric units
> are
> > > used
> > > +for FMUL/FADD/FMA.
> > >
> > > please use a --param for now, I think targets might want to set this based
> > > on active core tuning.
> > >
> > > +/* Given that the target fully pipelines FMA instructions, return latency
> of
> > > +   MULT_EXPRs that can't be hided by FMA.  WIDTH is the number of pipes.
> */
> > > +
> > >
> > > return the latency .. can't be hidden by the FMA
> > >
> > > For documentation purposes it should be stated that mult_num <= ops_num
> > >
> > > +  /* If the target fully pipelines FMA instruction, the multiply part can
> > > start
> > >
> > > instructions
> > >
> > > +     first if its operands are ready.  Assuming symmetric pipes are used
> for
> > >
> > > s/first/already/
> > >
> > > +     FMUL/FADD/FMA, then for a sequence of FMA like:
> > > +
> > > +       _8 = .FMA (_2, _3, _1);
> > > +       _9 = .FMA (_5, _4, _8);
> > > +       _10 = .FMA (_7, _6, _9);
> > > +
> > > +     , if width=1, the latency is latency(MULT) + latency(ADD)*3.
> > > +     While with width=2:
> > > +
> > > +       _8 = _4 * _5;
> > > +       _9 = .FMA (_2, _3, _1);
> > > +       _10 = .FMA (_6, _7, _8);
> > > +       _11 = _9 + _10;
> > > +
> > > +     , it is latency(MULT)*2 + latency(ADD)*2.  Assuming latency(MULT) <=
> > > +     latency(ADD), the previous one is preferred.
> > >
> > > latency (MULT) >= latency (ADD)?
> > >
> > > ".. the first variant is preferred."
> > >
> > > +
> > > +     Find out if we can get a smaller width considering FMA.  */
> >
> > Corrected these errors. Thank you for the corrections.
> >
> > > +      /* When flag_fully_pipelined_fma is set, assumes symmetric pipes
> are
> > > used
> > > +        for FMUL/FADD/FMA.  */
> > > +      int lat_mul = get_mult_latency_consider_fma (ops_num, mult_num,
> width);
> > >
> > > what does "symmetric pipes" actually mean?  For x86 Zen3 we have
> > > two FMA pipes (that can also do ADD and MUL) and one pipe that can
> > > do ADD.  Is that then non-symmetric because we can issue more adds
> > > in parallel than FMA/MUL?
> 
> btw, I double-checked and Zen3/4 have two pipes for FMUL/FMA and two
> separate pipes for FADD, the FMUL/FMA pipes cannot do FADD.  FADD
> has a latency of 3 cycles while FMUL/FMA has a latency of 4 cycles.
> 
> I'd say Zen is then not "symmetric" as in your definition?  I do wonder
> what part of the pipeline characteristic could be derived from the
> reassoc_width target hook (maybe the number of pipes but not whether
> they are shared with another op).  In theory the scheduling description
> could offer the info (if correct and precise enough), but I don't think
> there's
> a good way to query these details.

Yes, by that definition Zen3 is not "symmetric", and the current
code is at least confusing in that scenario: for instance,
get_mult_latency_consider_fma uses the width of FMUL, while
get_required_cycles should use the width of FADD, I think.
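
A minimal sketch of the split I mean (illustrative only; the
width_add variable and the way the two results are combined are my
assumptions, not the committed code):

  /* Illustrative only: query the two pipe sets separately.  */
  int width_add = targetm.sched.reassociation_width (PLUS_EXPR, mode);
  int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
  /* The multiplies would be bounded by the FMUL/FMA pipes...  */
  int lat_mul
    = get_mult_latency_consider_fma (ops_num, mult_num, width_mult);
  /* ...while the additions would be bounded by the FADD pipes.  */
  int lat_add = get_required_cycles (ops_num, width_add);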

As you pointed out earlier, a problem with reassociation is that it
works locally on single-use operator lists, so the result may not
be globally optimal. For example, if using 2 pipes takes 4 cycles
and using 1 pipe takes 5 cycles, the latter might still be
preferable, since it leaves 1 pipe free for other calculations in
the basic block, if there are any. To enhance reassociation for
this, it seems we would need to know the arrangement of pipes for
each kind of operator. (Perhaps targetm.sched.reassociation_width
could return an index number identifying the set of pipes used by
the operator, along with the width. But this may not cover cases
like Haswell, which uses ports 0 and 1 for multiplication but only
port 1 for addition. Besides, I'm not yet clear on what the
algorithm should look like.)
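
As a rough sketch of that hypothetical interface (nothing like this
exists in GCC today; the names are invented to illustrate the idea):

  /* Hypothetical extension, for illustration only.  */
  struct reassoc_pipe_info
  {
    int width;     /* number of pipes for this operation */
    int pipe_set;  /* id of the unit set; equal ids mean two
                      operations compete for the same pipes */
  };
  /* A Zen3-like target could return {2, 0} for PLUS_EXPR and {2, 1}
     for MULT_EXPR; a target with shared FADD/FMUL/FMA units would
     return {2, 0} for both.  Haswell-style partial sharing (ports 0
     and 1 for multiply, port 1 for add) would still not fit.  */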

Committed the patch at 8afdbcdd.

Thanks,
Di

> 
> > "symmetric pipes" was to indicate that FADD/FMA/FMUL use the same unit
> > set, so the widths are uniform, and the calculations in this patch can
> > apply. I think this can be relaxed for scenarios like Zen3, by searching
> > for a smaller width only using the pipes for FMUL/FMA. But if the pipes
> > for FMUL and FADD are separated, for example 1 for FMA/FMUL and 2 other
> > pipes for FADD, then the minimum cycle count might be incorrect.
> >
> > Changed the descriptions and code in get_reassociation_width a bit to
> > include the case like Zen3:
> >
> > +      /* When param_fully_pipelined_fma is set, assume FMUL and FMA use the
> > +        same units that can also do FADD.  For other scenarios, such as
> when
> > +        FMUL and FADD are using distinct units, the following code may not
> > +        apply.  */
> > +      int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
> > +      gcc_checking_assert (width_mult <= width);
> > +
> > +      /* Latency of MULT_EXPRs.  */
> > +      int lat_mul
> > +       = get_mult_latency_consider_fma (ops_num, mult_num, width_mult);
> 
> The updated patch is OK.
> 
> Thanks for your patience.
> 
> Thanks,
> Richard.
> 
> > > Otherwise this looks OK now.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > > > > +  /* If there's loop dependent FMA result, return width=2 to
> avoid it.
> > > > > This
> > > > > > > is
> > > > > > > +     better than skipping these FMA candidates in widening_mul.
> */
> > > > > > >
> > > > > > > better than skipping, but you don't touch it there?  I suppose
> width
> > > == 2
> > > > > > > will bypass the skipping, right?  This heuristic only comes in
> when
> > > the
> > > > > above
> > > > > > > change made width == 1, since otherwise we have an earlier
> > > > > > >
> > > > > > >   if (width == 1)
> > > > > > >     return width;
> > > > > > >
> > > > > > > which also guarantees width == 2 was allowed by the hook/param,
> right?
> > > > > >
> > > > > > Yes, that's right.
> > > > > >
> > > > > > >
> > > > > > > +  if (width == 1 && mult_num
> > > > > > > +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE
> (lhs))),
> > > > > > > +                  param_avoid_fma_max_bits))
> > > > > > > +    {
> > > > > > > +      /* Look for cross backedge dependency:
> > > > > > > +       1. LHS is a phi argument in the same basic block it is
> defined.
> > > > > > > +       2. And the result of the phi node is used in OPS.  */
> > > > > > > +      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > > > > > > +      gimple_stmt_iterator gsi;
> > > > > > > +      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next
> > > (&gsi))
> > > > > > > +       {
> > > > > > > +         gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
> > > > > > > +         for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > > > > > > +           {
> > > > > > > +             tree op = PHI_ARG_DEF (phi, i);
> > > > > > > +             if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src
> ==
> > > bb))
> > > > > > > +               continue;
> > > > > > >
> > > > > > > I think it's easier to iterate over the immediate uses of LHS like
> > > > > > >
> > > > > > >   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> > > > > > >      if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
> > > > > > >        {
> > > > > > >           if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> > > > > > > (use_p))->src != bb)
> > > > > > >             continue;
> > > > > > > ...
> > > > > > >        }
> > > > > > >
> > > > > > > otherwise I think _this_ part of the patch looks reasonable.
> > > > > > >
> > > > > > > As you say heuristically they might go together but I think we
> should
> > > > > split
> > > > > > > the
> > > > > > > patch - the cross-loop part can probably stand independently.  Can
> you
> > > > > adjust
> > > > > > > and re-post?
> > > > > >
> > > > > > Attached is the separated part for cross-loop FMA. Thank you for the
> > > > > correction.
> > > > >
> > > > > That cross-loop FMA patch is OK.
> > > >
> > > > Committed this part at 746344dd.
> > > >
> > > > Thanks,
> > > > Di
> > > >
> > > > >
> > > > > Thanks,
> > > > > Richard.
> > > > >
> > > > > > >
> > > > > > > As for the first part I still don't understand very well and am
> still
> > > > > hoping
> > > > > > > we
> > > > > > > can get away without yet another knob to tune.
> > > > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. To avoid regressions, included the other patch
> > > > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > > > > September/629203.html)
> > > > > > > > > > on this tracker again. This is because more FMA will be kept
> > > > > > > > > > with 1., so we need to rule out the loop dependent
> > > > > > > > > > FMA chains when param_avoid_fma_max_bits is set.
> > > > > > > > >
> > > > > > > > > Sorry again for taking so long to reply.
> > > > > > > > >
> > > > > > > > > I'll note we have an odd case on x86 Zen2(?) as well which we
> > > don't
> > > > > really
> > > > > > > > > understand from a CPU behavior perspective.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Richard.
> > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Di Zhao
> > > > > > > > > >
> > > > > > > > > > ----
> > > > > > > > > >
> > > > > > > > > >         PR tree-optimization/110279
> > > > > > > > > >
> > > > > > > > > > gcc/ChangeLog:
> > > > > > > > > >
> > > > > > > > > >         * tree-ssa-reassoc.cc
> > > (rank_ops_for_better_parallelism_p):
> > > > > > > > > >         New function to check whether ranking the ops
> results in
> > > > > > > > > >         better parallelism.
> > > > > > > > > >         (get_reassociation_width): Add new parameters.
> Search
> > > for
> > > > > > > > > >         smaller width considering the benefit of FMA.
> > > > > > > > > >         (rank_ops_for_fma): Change return value to be number
> of
> > > > > > > > > >         MULT_EXPRs.
> > > > > > > > > >         (reassociate_bb): For 3 ops, refine the condition to
> > > call
> > > > > > > > > >         swap_ops_for_binary_stmt.
> > > > > > > > > >
> > > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > > >
> > > > > > > > > >         * gcc.dg/pr110279.c: New test.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Di Zhao
> > > > > > > >
> > > > > > > > ----
> > > > > > > >
> > > > > > > >         PR tree-optimization/110279
> > > > > > > >
> > > > > > > > gcc/ChangeLog:
> > > > > > > >
> > > > > > > >         * doc/invoke.texi: Description of
> param_max_fma_chain_len.
> > > > > > > >         * params.opt: New parameter param_max_fma_chain_len.
> > > > > > > >         * tree-ssa-reassoc.cc (get_reassociation_width):
> > > > > > > >         Support param_max_fma_chain_len; check for loop
> dependent
> > > > > > > >         FMAs.
> > > > > > > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > > > > > >         (reassociate_bb): For 3 ops, refine the condition to
> call
> > > > > > > >         swap_ops_for_binary_stmt.
> > > > > > > >
> > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > >
> > > > > > > >         * gcc.dg/pr110279-1.c: New test.
> > > > > > > >         * gcc.dg/pr110279-2.c: New test.
> > > > > > > >         * gcc.dg/pr110279-3.c: New test.
> > > > > >
> > > > > > ---
> > > > > >
> > > > > >         PR tree-optimization/110279
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         * tree-ssa-reassoc.cc (get_reassociation_width): check
> > > > > >         for loop dependent FMAs.
> > > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > > >         swap_ops_for_binary_stmt.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         * gcc.dg/pr110279-1.c: New test.
> > > > ---
> > > >
> > > >         PR tree-optimization/110279
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * common.opt: New flag fully-pipelined-fma.
> > > >         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma):
> > > >         Return latency of MULT_EXPRs that can't be hided by FMA.
> > > >         (get_reassociation_width): Search for smaller widths
> > > >         considering the benefit of fully pipelined FMA.
> > > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > >         (reassociate_bb): Pass the number of MULT_EXPRs to
> > > >         get_reassociation_width; avoid calling
> > > >         get_reassociation_width twice.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.dg/pr110279-2.c: New test.
> >
> > Thanks,
> > Di
> >
> > ---
> >
> >         PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> >         * doc/invoke.texi: New parameter fully-pipelined-fma.
> >         * params.opt: New parameter fully-pipelined-fma.
> >         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma): Return
> >         the latency of MULT_EXPRs that can't be hidden by the FMAs.
> >         (get_reassociation_width): Search for a smaller width
> >         considering the benefit of fully pipelined FMA.
> >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> >         (reassociate_bb): Pass the number of MULT_EXPRs to
> >         get_reassociation_width; avoid calling
> >         get_reassociation_width twice.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/pr110279-2.c: New test.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-14 20:55                   ` Di Zhao OS
@ 2023-12-15  7:23                     ` Richard Biener
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Biener @ 2023-12-15  7:23 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: gcc-patches

On Thu, Dec 14, 2023 at 9:55 PM Di Zhao OS
<dizhao@os.amperecomputing.com> wrote:
>
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Wednesday, December 13, 2023 5:01 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Wed, Dec 13, 2023 at 9:14 AM Di Zhao OS
> > <dizhao@os.amperecomputing.com> wrote:
> > >
> > > Hello Richard,
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > Sent: Monday, December 11, 2023 7:01 PM
> > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > Cc: gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > get_reassociation_width
> > > >
> > > > On Wed, Nov 29, 2023 at 3:36 PM Di Zhao OS
> > > > <dizhao@os.amperecomputing.com> wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > Sent: Tuesday, November 21, 2023 9:01 PM
> > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > > get_reassociation_width
> > > > > >
> > > > > > On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS
> > <dizhao@os.amperecomputing.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > > > Sent: Tuesday, October 31, 2023 9:48 PM
> > > > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > > > > get_reassociation_width
> > > > > > > >
> > > > > > > > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS
> > > > <dizhao@os.amperecomputing.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Attached is a new version of the patch.
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > > > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > > > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider
> > FMA in
> > > > > > > > > > get_reassociation_width
> > > > > > > > > >
> > > > > > > > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > > > > > > > > > <dizhao@os.amperecomputing.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > This is a new version of the patch on "nested FMA".
> > > > > > > > > > > Sorry for updating this after so long, I've been studying
> > and
> > > > > > > > > > > writing micro cases to sort out the cause of the regression.
> > > > > > > > > >
> > > > > > > > > > Sorry for taking so long to reply.
> > > > > > > > > >
> > > > > > > > > > > First, following previous discussion:
> > > > > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > > > > > September/629080.html)
> > > > > > > > > > >
> > > > > > > > > > > 1. From testing more altered cases, I don't think the
> > > > > > > > > > > problem is that reassociation works locally. In that:
> > > > > > > > > > >
> > > > > > > > > > >   1) On the example with multiplications:
> > > > > > > > > > >
> > > > > > > > > > >         tmp1 = a + c * c + d * d + x * y;
> > > > > > > > > > >         tmp2 = x * tmp1;
> > > > > > > > > > >         result += (a + c + d + tmp2);
> > > > > > > > > > >
> > > > > > > > > > >   Given "result" rewritten by width=2, the performance is
> > > > > > > > > > >   worse if we rewrite "tmp1" with width=2. In contrast, if
> > we
> > > > > > > > > > >   remove the multiplications from the example (and make
> > "tmp1"
> > > > > > > > > > >   not singe used), and still rewrite "result" by width=2,
> > then
> > > > > > > > > > >   rewriting "tmp1" with width=2 is better. (Make sense
> > because
> > > > > > > > > > >   the tree's depth at "result" is still smaller if we
> > rewrite
> > > > > > > > > > >   "tmp1".)
> > > > > > > > > > >
> > > > > > > > > > >   2) I tried to modify the assembly code of the example
> > without
> > > > > > > > > > >   FMA, so the width of "result" is 4. On Ampere1 there's no
> > > > > > > > > > >   obvious improvement. So although this is an interesting
> > > > > > > > > > >   problem, it doesn't seem like the cause of the regression.
> > > > > > > > > >
> > > > > > > > > > OK, I see.
> > > > > > > > > >
> > > > > > > > > > > 2. From assembly code of the case with FMA, one problem is
> > > > > > > > > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > > > > > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > > > > > > > increased code size, so the overhead is increased.
> > > > > > > > > > >
> > > > > > > > > > >    a) When "tmp1" is not re-written to parallel:
> > > > > > > > > > >         fmadd d31, d2, d2, d30
> > > > > > > > > > >         fmadd d31, d3, d3, d31
> > > > > > > > > > >         fmadd d31, d4, d5, d31  //"tmp1"
> > > > > > > > > > >         fmadd d31, d31, d4, d3
> > > > > > > > > > >
> > > > > > > > > > >    b) When "tmp1" is re-written to parallel:
> > > > > > > > > > >         fmul  d31, d4, d5
> > > > > > > > > > >         fmadd d27, d2, d2, d30
> > > > > > > > > > >         fmadd d31, d3, d3, d31
> > > > > > > > > > >         fadd  d31, d31, d27     //"tmp1"
> > > > > > > > > > >         fmadd d31, d31, d4, d3
> > > > > > > > > > >
> > > > > > > > > > > For version a), there are 3 dependent FMAs to calculate
> > "tmp1".
> > > > > > > > > > > For version b), there are also 3 dependent instructions in
> > the
> > > > > > > > > > > longer path: the 1st, 3rd and 4th.
> > > > > > > > > >
> > > > > > > > > > Yes, it doesn't really change anything.  The patch has
> > > > > > > > > >
> > > > > > > > > > +  /* If there's code like "acc = a * b + c * d + acc" in a
> > tight
> > > > loop,
> > > > > > > > some
> > > > > > > > > > +     uarchs can execute results like:
> > > > > > > > > > +
> > > > > > > > > > +       _1 = a * b;
> > > > > > > > > > +       _2 = .FMA (c, d, _1);
> > > > > > > > > > +       acc_1 = acc_0 + _2;
> > > > > > > > > > +
> > > > > > > > > > +     in parallel, while turning it into
> > > > > > > > > > +
> > > > > > > > > > +       _1 = .FMA(a, b, acc_0);
> > > > > > > > > > +       acc_1 = .FMA(c, d, _1);
> > > > > > > > > > +
> > > > > > > > > > +     hinders that, because then the first FMA depends on the
> > > > result
> > > > > > > > > > of preceding
> > > > > > > > > > +     iteration.  */
> > > > > > > > > >
> > > > > > > > > > I can't see what can be run in parallel for the first case.
> > > > The .FMA
> > > > > > > > > > depends on the multiplication a * b.  Iff the uarch somehow
> > > > decomposes
> > > > > > > > > > .FMA into multiply + add then the c * d multiply could run in
> > > > parallel
> > > > > > > > > > with the a * b multiply which _might_ be able to hide some of
> > the
> > > > > > > > > > latency of the full .FMA.  Like on x86 Zen FMA has a latency
> > of 4
> > > > > > > > > > cycles but a multiply only 3.  But I never got confirmation
> > from
> > > > any
> > > > > > > > > > of the CPU designers that .FMAs are issued when the multiply
> > > > > > > > > > operands are ready and the add operand can be forwarded.
> > > > > > > > > >
> > > > > > > > > > I also wonder why the multiplications of the two-FMA sequence
> > > > > > > > > > then cannot be executed at the same time?  So I have some
> > doubt
> > > > > > > > > > of the theory above.
> > > > > > > > >
> > > > > > > > > The parallel execution for the code snippet above was the other
> > > > > > > > > issue (previously discussed here:
> > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > August/628960.html).
> > > > > > > > > Sorry it's a bit confusing to include that here, but these 2
> > fixes
> > > > > > > > > needs to be combined to avoid new regressions. Since considering
> > > > > > > > > FMA in get_reassociation_width produces more results of width=1,
> > > > > > > > > so there would be more loop depending FMA chains.
> > > > > > > > >
> > > > > > > > > > Iff this really is the reason for the sequence to execute with
> > > > lower
> > > > > > > > > > overall latency and we want to attack this on GIMPLE then I
> > think
> > > > > > > > > > we need a target hook telling us this fact (I also wonder if
> > such
> > > > > > > > > > behavior can be modeled in the scheduler pipeline description
> > at
> > > > all?)
> > > > > > > > > >
> > > > > > > > > > > So it seems to me the current get_reassociation_width
> > algorithm
> > > > > > > > > > > isn't optimal in the presence of FMA. So I modified the
> > patch to
> > > > > > > > > > > improve get_reassociation_width, rather than check for code
> > > > > > > > > > > patterns. (Although there could be some other complicated
> > > > > > > > > > > factors so the regression is more obvious when there's
> > "nested
> > > > > > > > > > > FMA". But with this patch that should be avoided or reduced.)
> > > > > > > > > > >
> > > > > > > > > > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > > > > > > > > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > > > > > > > > > collecting data on other CPUs, I'd like to know how do you
> > > > > > > > > > > think of this.
> > > > > > > > > > >
> > > > > > > > > > > About changes in the patch:
> > > > > > > > > > >
> > > > > > > > > > > 1. When the op list forms a complete FMA chain, try to
> > search
> > > > > > > > > > > for a smaller width considering the benefit of using FMA.
> > With
> > > > > > > > > > > a smaller width, the increment of code size is smaller when
> > > > > > > > > > > breaking the chain.
> > > > > > > > > >
> > > > > > > > > > But this is all highly target specific (code size even more
> > so).
> > > > > > > > > >
> > > > > > > > > > How I understand your approach to fixing the issue leads me to
> > > > > > > > > > the suggestion to prioritize parallel rewriting, thus alter
> > > > > > > > rank_ops_for_fma,
> > > > > > > > > > taking the reassoc width into account (the computed width
> > should
> > > > be
> > > > > > > > > > unchanged from rank_ops_for_fma) instead of "fixing up" the
> > > > parallel
> > > > > > > > > > rewriting of FMAs (well, they are not yet formed of course).
> > > > > > > > > > get_reassociation_width has 'get_required_cycles', the above
> > > > theory
> > > > > > > > > > could be verified with a very simple toy pipeline model.  We'd
> > > > have
> > > > > > > > > > to ask the target for the reassoc width for MULT_EXPRs as well
> > (or
> > > > > > maybe
> > > > > > > > > > even FMA_EXPRs).
> > > > > > > > > >
> > > > > > > > > > Taking the width of FMAs into account when computing the
> > reassoc
> > > > width
> > > > > > > > > > might be another way to attack this.
> > > > > > > > >
> > > > > > > > > Previously I tried to solve this generally, on the assumption
> > that
> > > > > > > > > FMA (smaller code size) is preferred. Now I agree it's difficult
> > > > > > > > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD
> > can
> > > > > > > > > be different. 2) From my test result on different machines we
> > > > > > > > > have, it seems simply adding the cycles together is not a good
> > way
> > > > > > > > > to estimate the latency of consecutive FMA.
> > > > > > > > >
> > > > > > > > > I think an easier way to fix this is to add a parameter to
> > suggest
> > > > > > > > > the length of complete FMA chain to keep. (It can be set by
> > target
> > > > > > > > > specific tuning then.) And we can break longer FMA chains for
> > > > > > > > > better parallelism. Attached is the new implementation. With
> > > > > > > > > max-fma-chain-len=8, there's about 7% improvement in spec2017
> > > > > > > > > 508.namd_r on ampere1, and the overall improvement on fprate is
> > > > > > > > > about 1%.
> > > > > > > > >
> > > > > > > > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs
> > from
> > > > > > > > > others, I left it before get_reassociation_width so the number
> > of
> > > > > > > > > MULT_EXPRs can be used.
> > > > > > > >
> > > > > > > > Sorry again for the delay in replying.
> > > > > > > >
> > > > > > > > +  /* Check if keeping complete FMA chains is preferred.  */
> > > > > > > > +  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> > > > > > > > +    {
> > > > > > > > +      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
> > > > > > > > +      int num_others = ops_num - mult_num;
> > > > > > > > +      int num_fma_chain = CEIL (num_others + 1, 2);
> > > > > > > > +
> > > > > > > > +      if (num_fma_chain < width
> > > > > > > > +         && CEIL (mult_num, num_fma_chain) <=
> > param_max_fma_chain_len)
> > > > > > > > +       width = num_fma_chain;
> > > > > > > > +    }
> > > > > > > >
> > > > > > > > so here 'mult_num' serves as a heuristical value how many
> > > > > > > > FMAs we could build.  If that were close to ops_num - 1 then
> > > > > > > > we'd have a chain of FMAs.  Not sure how you get at
> > > > > > > > num_others / 2 here.  Maybe we need to elaborate on what an
> > > > > > > > FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e), f,
> > g)
> > > > > > > > where each (b,c) pair is really just one operand in the ops array,
> > > > > > > > one of the 'mult's.  Thus a FMA chain is _not_
> > > > > > > > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
> > > > > > >
> > > > > > > The "FMA chain" here refers to consecutive FMAs, each taking
> > > > > > > The previous one's result as the third operator, i.e.
> > > > > > > ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So original op
> > > > > > > list looks like "r + a * b + c * d + e * f + ...". These FMAs
> > > > > > > will end up using the same accumulate register.
> > > > > > >
> > > > > > > When num_others=2 or 3, there can be 2 complete chains, e.g.
> > > > > > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> > > > > > > or
> > > > > > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> > > > > > > And so on, that's where the "CEIL (num_others + 1, 2)" comes from.
> > > > > > >
> > > > > > > >
> > > > > > > > Forming an FMA chain effectively reduces the reassociation width
> > > > > > > > of the participating multiplies.  If we were not to form FMAs all
> > > > > > > > the multiplies could execute in parallel.
> > > > > > > >
> > > > > > > > So what does the above do, in terms of adjusting the reassociation
> > > > > > > > width for the _adds_, and what's the ripple-down effect on later
> > > > > > > > FMA forming?
> > > > > > > >
> > > > > > >
> > > > > > > The above code calculates the number of such FMA chains in the op
> > > > > > > list. And if the length of each chain doesn't exceed
> > > > > > > param_max_fma_chain_len, then width is set to the number of chains,
> > > > > > > so we won't break them (because rewrite_expr_tree_parallel handles
> > > > > > > this well).
> > > > > > >
> > > > > > > > The change still feels like whack-a-mole playing rather than
> > > > understanding
> > > > > > > > the fundamental issue on the targets.
> > > > > > >
> > > > > > > I think the complexity is in how the instructions are piped.
> > > > > > > Some Arm CPUs such as Neoverse V2 supports "late-forwarding":
> > > > > > > "FP multiply-accumulate pipelines support late-forwarding of
> > > > > > > accumulate operands from similar μOPs, allowing a typical
> > > > > > > sequence of multiply-accumulate μOPs to issue one every N
> > > > > > > cycles". ("N" is smaller than the latency of a single FMA
> > > > > > > instruction.) So keeping such FMA chains can utilize such
> > > > > > > feature and uses less FP units. I guess the case is similar on
> > > > > > > some late X86 CPUs.
> > > > > > >
> > > > > > > If we try to compute the minimum cycles of each option, I think
> > > > > > > at least we'll need to know whether the target has similar
> > > > > > > feature, and the latency of each uop. While using an
> > > > > > > experiential length of beneficial FMA chain could be a shortcut.
> > > > > > > (Maybe allowing different lengths for different data widths is
> > > > > > > better.)
> > > > > >
> > > > > > Hm.  So even when we can late-forward in an FMA chain
> > > > > > increasing the width should typically be still better?
> > > > > >
> > > > > > _1 = FMA (_2 * _3 + _4);
> > > > > > _5 = FMA (_6 * _7 + _1);
> > > > > >
> > > > > > say with late-forwarding we can hide the latency of the _6 * _7
> > > > > > multiply and the overall latency of the two FMAs above become
> > > > > > lat (FMA) + lat (ADD) in the ideal case.  Alternatively we do
> > > > > >
> > > > > > _1 = FMA (_2 * _ 3 + _4);
> > > > > > _8 = _6 * _ 7;
> > > > > > _5 = _1 + _8;
> > > > > >
> > > > > > where if the FMA and the multiply can execute in parallel
> > > > > > (we have two FMA pipes) the latency would be lat (FMA) + lat (ADD).
> > > > > > But when we only have a single pipeline capable of
> > > > > > FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL) +
> > 1)
> > > > > > + lat (ADD), it depends on luck whether the FMA or the MUL is
> > > > > > issued first there.
> > > > > >
> > > > > > So if late-forward works really well and the add part of the FMA
> > > > > > has very low latency compared to the multiplication part having
> > > > > > a smaller reassoc width should pay off here and we might be
> > > > > > able to simply control this via the existing target hook?
> > > > > >
> > > > > > I'm not aware of x86 CPUs having late-forwarding capabilities
> > > > > > but usually the latency of multiplication and FMA is very similar
> > > > > > and one can issue two FMAs and possibly more ADDs in parallel.
> > > > > >
> > > > > > As said I think this detail (late-forward) should maybe reflected
> > > > > > into get_required_cycles, possibly guided by a different
> > > > > > targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?
> > > > > >
> > > > >
> > > > > To my understanding, the question is whether the target fully
> > > > > pipelines FMA instructions, so the MULT part can start first if
> > > > > its operands are ready.  Meanwhile, targetm.sched.reassociation_width
> > > > > reflects the number of pipes for some operation, so it can guide
> > > > > get_required_cycles for a sequence of identical operations
> > > > > (e.g. A * B * C * D or A + B + C + D).  Since the problem in
> > > > > this case is not the number of pipes for FMA, I think another
> > > > > indicator may be better.
> > > > >
> > > > > (Currently fma_reassoc_width for AArch64 only controls
> > > > > whether reassociation on FADD is OK.  This workaround doesn't
> > > > > work well in some cases; for example, it turns down reassociation
> > > > > even when there's no FMA at all.  So I think we'd better not
> > > > > follow that scheme.)
> > > > >
> > > > > Attached is a new version of the patch with a flag to indicate
> > > > > that FMA is fully pipelined, that lat (MUL) >= lat (ADD), and
> > > > > that symmetric units are used for FMUL/FADD/FMA.  Otherwise the
> > > > > patch may not be beneficial.
> > > > >
> > > > > It tries to calculate the latencies including MULT_EXPRs.  Since
> > > > > the code differs from the current code (the quick-search part),
> > > > > I haven't included it inside get_required_cycles.
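
As a concrete standalone illustration of that calculation for a chain of
three FMAs (the latency values are assumed, not queried from any target):

  #include <stdio.h>

  int
  main (void)
  {
    int lat_mul = 4, lat_add = 2;  /* assumptions for illustration */
    /* Chain kept whole (width=1): one multiply latency on the critical
       path, then three dependent accumulations.  */
    int width1 = lat_mul + 3 * lat_add;
    /* Chain broken in two (width=2): two multiply latencies on the
       critical path plus two additions.  */
    int width2 = 2 * lat_mul + 2 * lat_add;
    printf ("width=1: %d cycles, width=2: %d cycles\n", width1, width2);
    return 0;
  }
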
> > > >
> > > > +; If the flag 'fully-pipelined-fma' is set, reassociation takes into account
> > > > +; the benifit of parallelizing FMA's multiply part and addition part.
> > > > +ffully-pipelined-fma
> > > > +Common Var(flag_fully_pipelined_fma)
> > > > +Assume the target fully pipelines FMA instruction, and symmetric units are used
> > > > +for FMUL/FADD/FMA.
> > > >
> > > > please use a --param for now, I think targets might want to set this based
> > > > on active core tuning.
> > > >
> > > > +/* Given that the target fully pipelines FMA instructions, return latency of
> > > > +   MULT_EXPRs that can't be hided by FMA.  WIDTH is the number of pipes.  */
> > > > +
> > > >
> > > > return the latency .. can't be hidden by the FMA
> > > >
> > > > For documentation purposes it should be stated that mult_num <= ops_num
> > > >
> > > > +  /* If the target fully pipelines FMA instruction, the multiply part can start
> > > >
> > > > instructions
> > > >
> > > > +     first if its operands are ready.  Assuming symmetric pipes are used for
> > > >
> > > > s/first/already/
> > > >
> > > > +     FMUL/FADD/FMA, then for a sequence of FMA like:
> > > > +
> > > > +       _8 = .FMA (_2, _3, _1);
> > > > +       _9 = .FMA (_5, _4, _8);
> > > > +       _10 = .FMA (_7, _6, _9);
> > > > +
> > > > +     , if width=1, the latency is latency(MULT) + latency(ADD)*3.
> > > > +     While with width=2:
> > > > +
> > > > +       _8 = _4 * _5;
> > > > +       _9 = .FMA (_2, _3, _1);
> > > > +       _10 = .FMA (_6, _7, _8);
> > > > +       _11 = _9 + _10;
> > > > +
> > > > +     , it is latency(MULT)*2 + latency(ADD)*2.  Assuming latency(MULT) <=
> > > > +     latency(ADD), the previous one is preferred.
> > > >
> > > > latency (MULT) >= latency (ADD)?
> > > >
> > > > ".. the first variant is preferred."
> > > >
> > > > +
> > > > +     Find out if we can get a smaller width considering FMA.  */
> > >
> > > Corrected these errors. Thank you for the corrections.
> > >
> > > > +      /* When flag_fully_pipelined_fma is set, assumes symmetric pipes are used
> > > > +        for FMUL/FADD/FMA.  */
> > > > +      int lat_mul = get_mult_latency_consider_fma (ops_num, mult_num, width);
> > > >
> > > > what does "symmetric pipes" actually mean?  For x86 Zen3 we have
> > > > two FMA pipes (that can also do ADD and MUL) and one pipe that can
> > > > do ADD.  Is that then non-symmetric because we can issue more adds
> > > > in parallel than FMA/MUL?
> >
> > btw, I double-checked and Zen3/4 have two pipes for FMUL/FMA and two
> > separate pipes for FADD; the FMUL/FMA pipes cannot do FADD.  FADD
> > has a latency of 3 cycles while FMUL/FMA has a latency of 4 cycles.
> >
> > I'd say Zen is then not "symmetric" as in your definition?  I do wonder
> > what part of the pipeline characteristics could be derived from the
> > reassoc_width target hook (maybe the number of pipes, but not whether
> > they are shared with another op).  In theory the scheduling description
> > could offer the info (if correct and precise enough), but I don't think
> > there's a good way to query these details.
>
> Yes, that's not "symmetric".  The current code is at least confusing
> in that scenario.  For instance, get_mult_latency_consider_fma uses
> the width of FMUL, and get_required_cycles should use the width of
> FADD, I think.
>
> As you pointed out earlier, a problem with reassociation is that it
> works locally on single-use operator lists, so the result may not
> be globally optimal.  For example, if using 2 pipes results in 4 cycles
> and using 1 pipe results in 5 cycles, the latter might be preferable
> because it saves 1 pipe for other calculations (in the basic block), if
> there are any.  To enhance reassociation for this, it seems we need
> to know the arrangement of pipes for each kind of operator.  (Perhaps
> targetm.sched.reassociation_width could return an index number
> indicating the set of pipes for the operator, along with the width.
> But this may not cover cases like Haswell, which has ports 0 and 1
> for multiply and port 1 for addition.  Besides, I'm not yet clear
> what the algorithm should look like.)

Yeah, while targetm.sched.reassociation_width is at least per
operation and mode, I expect it to return a "heuristic" value
trying to factor this in.

I think it would be useful to look at whether generating a query API
from the scheduler descriptions is possible somehow.  Currently
the main issue is that the pipeline description only indirectly
couples to define_insns (via insn attributes, usually) and those
in turn only indirectly lead to the actual operation implemented
(unless it's a define_expand, and if not, only if an RTL pattern is
present).  So I guess that the easiest way would be to amend
the scheduler descriptions with optional, say,
the scheduler descriptions with optional, say,

(define_cpu_unit_for_code "port0" "PLUS_EXPR")

somehow also specifying the modes applicable.  But maybe
reverse engineering this from the scheduler + insn descriptions
is reasonably possible as well (the information is more-or-less
there I think).

For just FMA we could also invent a new target hook to describe
the setup of the plus/mult/fma pipes and their latency, but for
more precise cost modeling of, say, vectorization or unrolling,
a scheduling model that can be applied to all "gimple" operations
is needed.
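
Purely as a sketch of what such a hook could hand back (every name and
field here is invented for illustration; no such interface exists in
GCC today):

  /* Hypothetical per-operation pipe description a target could expose.  */
  struct fp_pipe_desc
  {
    int n_pipes;      /* number of execution pipes for this operation   */
    int pipe_set;     /* id of the pipe set, so operations that share
                         pipes (e.g. FMUL and FMA on Zen3) report the
                         same id                                        */
    int latency;      /* result latency in cycles                       */
    int fwd_latency;  /* accumulator late-forwarding interval, equal to
                         latency when late-forwarding is not supported  */
  };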

Thanks,
Richard.

> Committed the patch at 8afdbcdd.
>
> Thanks,
> Di
>
> >
> > > "symmetric pipes" was to indicate that FADD/FMA/FMUL use the same unit
> > > set, so the widths are uniform, and the calculations in this patch can
> > > apply. I think this can be relaxed for scenarios like Zen3, by searching
> > > for a smaller width only using the pipes for FMUL/FMA. But if the pipes
> > > for FMUL and FADD are separated, for example 1 for FMA/FMUL and 2 other
> > > pipes for FADD, then the minimum circle might be incorrect.
> > >
> > > Changed the descriptions and code in get_reassociation_width a bit to
> > > include the case like Zen3:
> > >
> > > +      /* When param_fully_pipelined_fma is set, assume FMUL and FMA use the
> > > +        same units that can also do FADD.  For other scenarios, such as when
> > > +        FMUL and FADD are using distinct units, the following code may not
> > > +        apply.  */
> > > +      int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
> > > +      gcc_checking_assert (width_mult <= width);
> > > +
> > > +      /* Latency of MULT_EXPRs.  */
> > > +      int lat_mul
> > > +       = get_mult_latency_consider_fma (ops_num, mult_num, width_mult);
> >
> > The updated patch is OK.
> >
> > Thanks for your patience.
> >
> > Thanks,
> > Richard.
> >
> > > > Otherwise this looks OK now.
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > > > > +  /* If there's loop dependent FMA result, return width=2 to avoid it.  This
> > > > > > > > +     is better than skipping these FMA candidates in widening_mul.  */
> > > > > > > >
> > > > > > > > better than skipping, but you don't touch it there?  I suppose
> > > > > > > > width == 2 will bypass the skipping, right?  This heuristic only
> > > > > > > > comes in when the above change made width == 1, since otherwise
> > > > > > > > we have an earlier
> > > > > > > >
> > > > > > > >   if (width == 1)
> > > > > > > >     return width;
> > > > > > > >
> > > > > > > > which also guarantees width == 2 was allowed by the hook/param,
> > > > > > > > right?
> > > > > > >
> > > > > > > Yes, that's right.
> > > > > > >
> > > > > > > >
> > > > > > > > +  if (width == 1 && mult_num
> > > > > > > > +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
> > > > > > > > +                  param_avoid_fma_max_bits))
> > > > > > > > +    {
> > > > > > > > +      /* Look for cross backedge dependency:
> > > > > > > > +       1. LHS is a phi argument in the same basic block it is
> > defined.
> > > > > > > > +       2. And the result of the phi node is used in OPS.  */
> > > > > > > > +      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > > > > > > > +      gimple_stmt_iterator gsi;
> > > > > > > > +      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > > > > +       {
> > > > > > > > +         gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
> > > > > > > > +         for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > > > > > > > +           {
> > > > > > > > +             tree op = PHI_ARG_DEF (phi, i);
> > > > > > > > +             if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
> > > > > > > > +               continue;
> > > > > > > >
> > > > > > > > I think it's easier to iterate over the immediate uses of LHS like
> > > > > > > >
> > > > > > > >   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> > > > > > > >      if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
> > > > > > > >        {
> > > > > > > >           if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> > > > > > > > (use_p))->src != bb)
> > > > > > > >             continue;
> > > > > > > > ...
> > > > > > > >        }
> > > > > > > >
> > > > > > > > otherwise I think _this_ part of the patch looks reasonable.
> > > > > > > >
> > > > > > > > As you say heuristically they might go together but I think we
> > > > > > > > should split the patch - the cross-loop part can probably stand
> > > > > > > > independently.  Can you adjust and re-post?
> > > > > > >
> > > > > > > Attached is the separated part for cross-loop FMA.  Thank you for
> > > > > > > the correction.
> > > > > >
> > > > > > That cross-loop FMA patch is OK.
> > > > >
> > > > > Committed this part at 746344dd.
> > > > >
> > > > > Thanks,
> > > > > Di
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Richard.
> > > > > >
> > > > > > > >
> > > > > > > > As for the first part I still don't understand very well and am
> > > > > > > > still hoping we can get away without yet another knob to tune.
> > > > > > > >
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. To avoid regressions, included the other patch
> > > > > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > > > > > > > > > on this tracker again. This is because more FMA will be kept
> > > > > > > > > > > with 1., so we need to rule out the loop dependent
> > > > > > > > > > > FMA chains when param_avoid_fma_max_bits is set.
> > > > > > > > > >
> > > > > > > > > > Sorry again for taking so long to reply.
> > > > > > > > > >
> > > > > > > > > > I'll note we have an odd case on x86 Zen2(?) as well which we
> > > > > > > > > > don't really understand from a CPU behavior perspective.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Richard.
> > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Di Zhao
> > > > > > > > > > >
> > > > > > > > > > > ----
> > > > > > > > > > >
> > > > > > > > > > >         PR tree-optimization/110279
> > > > > > > > > > >
> > > > > > > > > > > gcc/ChangeLog:
> > > > > > > > > > >
> > > > > > > > > > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > > > > > > > > > >         New function to check whether ranking the ops results in
> > > > > > > > > > >         better parallelism.
> > > > > > > > > > >         (get_reassociation_width): Add new parameters.  Search for
> > > > > > > > > > >         smaller width considering the benefit of FMA.
> > > > > > > > > > >         (rank_ops_for_fma): Change return value to be number of
> > > > > > > > > > >         MULT_EXPRs.
> > > > > > > > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > > > > > > > >         swap_ops_for_binary_stmt.
> > > > > > > > > > >
> > > > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > > > >
> > > > > > > > > > >         * gcc.dg/pr110279.c: New test.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Di Zhao
> > > > > > > > >
> > > > > > > > > ----
> > > > > > > > >
> > > > > > > > >         PR tree-optimization/110279
> > > > > > > > >
> > > > > > > > > gcc/ChangeLog:
> > > > > > > > >
> > > > > > > > >         * doc/invoke.texi: Description of param_max_fma_chain_len.
> > > > > > > > >         * params.opt: New parameter param_max_fma_chain_len.
> > > > > > > > >         * tree-ssa-reassoc.cc (get_reassociation_width):
> > > > > > > > >         Support param_max_fma_chain_len; check for loop dependent
> > > > > > > > >         FMAs.
> > > > > > > > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > > > > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > > > > > >         swap_ops_for_binary_stmt.
> > > > > > > > >
> > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > >
> > > > > > > > >         * gcc.dg/pr110279-1.c: New test.
> > > > > > > > >         * gcc.dg/pr110279-2.c: New test.
> > > > > > > > >         * gcc.dg/pr110279-3.c: New test.
> > > > > > >
> > > > > > > ---
> > > > > > >
> > > > > > >         PR tree-optimization/110279
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         * tree-ssa-reassoc.cc (get_reassociation_width): check
> > > > > > >         for loop dependent FMAs.
> > > > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > > > >         swap_ops_for_binary_stmt.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >         * gcc.dg/pr110279-1.c: New test.
> > > > > ---
> > > > >
> > > > >         PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * common.opt: New flag fully-pipelined-fma.
> > > > >         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma):
> > > > >         Return latency of MULT_EXPRs that can't be hided by FMA.
> > > > >         (get_reassociation_width): Search for smaller widths
> > > > >         considering the benefit of fully pipelined FMA.
> > > > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > > >         (reassociate_bb): Pass the number of MULT_EXPRs to
> > > > >         get_reassociation_width; avoid calling
> > > > >         get_reassociation_width twice.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.dg/pr110279-2.c: New test.
> > >
> > > Thanks,
> > > Di
> > >
> > > ---
> > >
> > >         PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * doc/invoke.texi: New parameter fully-pipelined-fma.
> > >         * params.opt: New parameter fully-pipelined-fma.
> > >         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma): Return
> > >         the latency of MULT_EXPRs that can't be hidden by the FMAs.
> > >         (get_reassociation_width): Search for a smaller width
> > >         considering the benefit of fully pipelined FMA.
> > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > >         (reassociate_bb): Pass the number of MULT_EXPRs to
> > >         get_reassociation_width; avoid calling
> > >         get_reassociation_width twice.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279-2.c: New test.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-13  8:14               ` Di Zhao OS
  2023-12-13  9:00                 ` Richard Biener
@ 2023-12-15  9:46                 ` Thomas Schwinge
  2023-12-17 12:30                   ` Di Zhao OS
  1 sibling, 1 reply; 18+ messages in thread
From: Thomas Schwinge @ 2023-12-15  9:46 UTC (permalink / raw)
  To: Di Zhao OS, gcc-patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3000 bytes --]

Hi!

On 2023-12-13T08:14:28+0000, Di Zhao OS <dizhao@os.amperecomputing.com> wrote:
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr110279-2.c
> @@ -0,0 +1,41 @@
> +/* PR tree-optimization/110279 */
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
> +/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
> +
> +#define LOOP_COUNT 800000000
> +typedef double data_e;
> +
> +#include <stdio.h>
> +
> +__attribute_noinline__ data_e
> +foo (data_e in)

Pushed to master branch commit 91e9e8faea4086b3b8aef2355fc12c1559d425f6
"Fix 'gcc.dg/pr110279-2.c' syntax error due to '__attribute_noinline__'",
see attached.

However:

> +{
> +  data_e a1, a2, a3, a4;
> +  data_e tmp, result = 0;
> +  a1 = in + 0.1;
> +  a2 = in * 0.1;
> +  a3 = in + 0.01;
> +  a4 = in * 0.59;
> +
> +  data_e result2 = 0;
> +
> +  for (int ic = 0; ic < LOOP_COUNT; ic++)
> +    {
> +      /* Test that a complete FMA chain with length=4 is not broken.  */
> +      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
> +      result += tmp - ic;
> +      result2 = result2 / 2 - tmp;
> +
> +      a1 += 0.91;
> +      a2 += 0.1;
> +      a3 -= 0.01;
> +      a4 -= 0.89;
> +
> +    }
> +
> +  return result + result2;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
> +/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */

..., I still see these latter two tree dump scans FAIL, for GCN:

    $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
      2 *: a3_40
      2 *: a2_39
    Width = 4 was chosen for reassociation
    Transforming _15 = powmult_1 + powmult_3;
     into _63 = powmult_1 + a1_38;
    $ grep -F .FMA pr110279-2.c.265t.optimized
      _63 = .FMA (a2_39, a2_39, a1_38);
      _64 = .FMA (a3_40, a3_40, powmult_5);

..., nvptx:

    $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
      2 *: a3_40
      2 *: a2_39
    Width = 4 was chosen for reassociation
    Transforming _15 = powmult_1 + powmult_3;
     into _63 = powmult_1 + a1_38;
    $ grep -F .FMA pr110279-2.c.265t.optimized
      _63 = .FMA (a2_39, a2_39, a1_38);
      _64 = .FMA (a3_40, a3_40, powmult_5);

..., but also x86_64-pc-linux-gnu:

    $  grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
      2 *: a3_40
      2 *: a2_39
    Width = 2 was chosen for reassociation
    Transforming _15 = powmult_1 + powmult_3;
     into _63 = powmult_1 + powmult_3;
    $ grep -cF .FMA pr110279-2.c.265t.optimized
    0


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Fix-gcc.dg-pr110279-2.c-syntax-error-due-to-__attrib.patch --]
[-- Type: text/x-diff, Size: 1535 bytes --]

From 91e9e8faea4086b3b8aef2355fc12c1559d425f6 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 15 Dec 2023 10:03:12 +0100
Subject: [PATCH] Fix 'gcc.dg/pr110279-2.c' syntax error due to
 '__attribute_noinline__'

For example, for GCN or nvptx target configurations, using newlib:

    FAIL: gcc.dg/pr110279-2.c (test for excess errors)
    UNRESOLVED: gcc.dg/pr110279-2.c scan-tree-dump-not reassoc2 "was chosen for reassociation"
    UNRESOLVED: gcc.dg/pr110279-2.c scan-tree-dump-times optimized "\\.FMA " 3

    [...]/source-gcc/gcc/testsuite/gcc.dg/pr110279-2.c:11:1: error: unknown type name '__attribute_noinline__'
    [...]/source-gcc/gcc/testsuite/gcc.dg/pr110279-2.c:12:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'foo'

We cannot assume 'stdio.h' to define '__attribute_noinline__' -- but then, that
also isn't necessary for this test case (there is nothing to inline into).

	gcc/testsuite/
	* gcc.dg/pr110279-2.c: Don't '#include <stdio.h>'.  Remove
	'__attribute_noinline__'.
---
 gcc/testsuite/gcc.dg/pr110279-2.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
index 0304a77aa66..b6b69969c6b 100644
--- a/gcc/testsuite/gcc.dg/pr110279-2.c
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -6,9 +6,7 @@
 #define LOOP_COUNT 800000000
 typedef double data_e;
 
-#include <stdio.h>
-
-__attribute_noinline__ data_e
+data_e
 foo (data_e in)
 {
   data_e a1, a2, a3, a4;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-15  9:46                 ` Thomas Schwinge
@ 2023-12-17 12:30                   ` Di Zhao OS
  2023-12-22 15:05                     ` Di Zhao OS
  0 siblings, 1 reply; 18+ messages in thread
From: Di Zhao OS @ 2023-12-17 12:30 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches; +Cc: Richard Biener

Hello Thomas,

> -----Original Message-----
> From: Thomas Schwinge <thomas@codesourcery.com>
> Sent: Friday, December 15, 2023 5:46 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> Cc: Richard Biener <richard.guenther@gmail.com>
> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> Hi!
> 
> On 2023-12-13T08:14:28+0000, Di Zhao OS <dizhao@os.amperecomputing.com> wrote:
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/pr110279-2.c
> > @@ -0,0 +1,41 @@
> > +/* PR tree-optimization/110279 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
> > +/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
> > +
> > +#define LOOP_COUNT 800000000
> > +typedef double data_e;
> > +
> > +#include <stdio.h>
> > +
> > +__attribute_noinline__ data_e
> > +foo (data_e in)
> 
> Pushed to master branch commit 91e9e8faea4086b3b8aef2355fc12c1559d425f6
> "Fix 'gcc.dg/pr110279-2.c' syntax error due to '__attribute_noinline__'",
> see attached.
> 
> However:
> 
> > +{
> > +  data_e a1, a2, a3, a4;
> > +  data_e tmp, result = 0;
> > +  a1 = in + 0.1;
> > +  a2 = in * 0.1;
> > +  a3 = in + 0.01;
> > +  a4 = in * 0.59;
> > +
> > +  data_e result2 = 0;
> > +
> > +  for (int ic = 0; ic < LOOP_COUNT; ic++)
> > +    {
> > +      /* Test that a complete FMA chain with length=4 is not broken.  */
> > +      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
> > +      result += tmp - ic;
> > +      result2 = result2 / 2 - tmp;
> > +
> > +      a1 += 0.91;
> > +      a2 += 0.1;
> > +      a3 -= 0.01;
> > +      a4 -= 0.89;
> > +
> > +    }
> > +
> > +  return result + result2;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
> > +/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */

Thank you for the fix.

> ..., I still see these latter two tree dump scans FAIL, for GCN:
> 
>     $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
>       2 *: a3_40
>       2 *: a2_39
>     Width = 4 was chosen for reassociation
>     Transforming _15 = powmult_1 + powmult_3;
>      into _63 = powmult_1 + a1_38;
>     $ grep -F .FMA pr110279-2.c.265t.optimized
>       _63 = .FMA (a2_39, a2_39, a1_38);
>       _64 = .FMA (a3_40, a3_40, powmult_5);
> 
> ..., nvptx:
> 
>     $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
>       2 *: a3_40
>       2 *: a2_39
>     Width = 4 was chosen for reassociation
>     Transforming _15 = powmult_1 + powmult_3;
>      into _63 = powmult_1 + a1_38;
>     $ grep -F .FMA pr110279-2.c.265t.optimized
>       _63 = .FMA (a2_39, a2_39, a1_38);
>       _64 = .FMA (a3_40, a3_40, powmult_5);

For these 2 targets, the reassoc_width for FMUL is 1 (the default value),
while the testcase assumes it to be 4.  The bug was introduced when I
updated the patch but forgot to update the testcase.

> ..., but also x86_64-pc-linux-gnu:
> 
>     $  grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
>       2 *: a3_40
>       2 *: a2_39
>     Width = 2 was chosen for reassociation
>     Transforming _15 = powmult_1 + powmult_3;
>      into _63 = powmult_1 + powmult_3;
>     $ grep -cF .FMA pr110279-2.c.265t.optimized
>     0

For x86_64 this needs "-mfma"; sorry, the compile options missed that.
Can the change below fix these issues?  I moved the tests into
testsuite/gcc.target/aarch64, since they rely on tunings.

Tested on aarch64-unknown-linux-gnu.

> 
> Grüße
>  Thomas
> 
> 
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634
> München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas
> Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht
> München, HRB 106955

Thanks,
Di Zhao

---
 gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-1.c | 3 +--
 gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-2.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)
 rename gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-1.c (83%)
 rename gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-2.c (78%)

diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c b/gcc/testsuite/gcc.target/aarch64/pr110279-1.c
similarity index 83%
rename from gcc/testsuite/gcc.dg/pr110279-1.c
rename to gcc/testsuite/gcc.target/aarch64/pr110279-1.c
index f25b6aec967..97d693f56a5 100644
--- a/gcc/testsuite/gcc.dg/pr110279-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr110279-1.c
@@ -1,6 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
-/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+/* { dg-options "-Ofast -mcpu=generic --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
 
 #define LOOP_COUNT 800000000
 typedef double data_e;
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.target/aarch64/pr110279-2.c
similarity index 78%
rename from gcc/testsuite/gcc.dg/pr110279-2.c
rename to gcc/testsuite/gcc.target/aarch64/pr110279-2.c
index b6b69969c6b..a88cb361fdc 100644
--- a/gcc/testsuite/gcc.dg/pr110279-2.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr110279-2.c
@@ -1,7 +1,6 @@
 /* PR tree-optimization/110279 */
 /* { dg-do compile } */
-/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
-/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+/* { dg-options "-Ofast -mcpu=generic --param tree-reassoc-width=4 --param fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
 
 #define LOOP_COUNT 800000000
 typedef double data_e;
-- 
2.25.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-17 12:30                   ` Di Zhao OS
@ 2023-12-22 15:05                     ` Di Zhao OS
  2023-12-22 15:39                       ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread
From: Di Zhao OS @ 2023-12-22 15:05 UTC (permalink / raw)
  To: Di Zhao OS, Thomas Schwinge, gcc-patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 6905 bytes --]

Updated the fix in the attachment.

Is it OK for trunk?

Tested on aarch64-unknown-linux-gnu and x86_64-pc-linux-gnu.

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS <dizhao@os.amperecomputing.com>
> Sent: Sunday, December 17, 2023 8:31 PM
> To: Thomas Schwinge <thomas@codesourcery.com>; gcc-patches@gcc.gnu.org
> Cc: Richard Biener <richard.guenther@gmail.com>
> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> Hello Thomas,
> 
> > -----Original Message-----
> > From: Thomas Schwinge <thomas@codesourcery.com>
> > Sent: Friday, December 15, 2023 5:46 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> > Cc: Richard Biener <richard.guenther@gmail.com>
> > Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > Hi!
> >
> > On 2023-12-13T08:14:28+0000, Di Zhao OS <dizhao@os.amperecomputing.com>
> wrote:
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/pr110279-2.c
> > > @@ -0,0 +1,41 @@
> > > +/* PR tree-optimization/110279 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
> > > +/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
> > > +
> > > +#define LOOP_COUNT 800000000
> > > +typedef double data_e;
> > > +
> > > +#include <stdio.h>
> > > +
> > > +__attribute_noinline__ data_e
> > > +foo (data_e in)
> >
> > Pushed to master branch commit 91e9e8faea4086b3b8aef2355fc12c1559d425f6
> > "Fix 'gcc.dg/pr110279-2.c' syntax error due to '__attribute_noinline__'",
> > see attached.
> >
> > However:
> >
> > > +{
> > > +  data_e a1, a2, a3, a4;
> > > +  data_e tmp, result = 0;
> > > +  a1 = in + 0.1;
> > > +  a2 = in * 0.1;
> > > +  a3 = in + 0.01;
> > > +  a4 = in * 0.59;
> > > +
> > > +  data_e result2 = 0;
> > > +
> > > +  for (int ic = 0; ic < LOOP_COUNT; ic++)
> > > +    {
> > > +      /* Test that a complete FMA chain with length=4 is not broken.  */
> > > +      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
> > > +      result += tmp - ic;
> > > +      result2 = result2 / 2 - tmp;
> > > +
> > > +      a1 += 0.91;
> > > +      a2 += 0.1;
> > > +      a3 -= 0.01;
> > > +      a4 -= 0.89;
> > > +
> > > +    }
> > > +
> > > +  return result + result2;
> > > +}
> > > +
> > > > +/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
> > > +/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
> 
> Thank you for the fix.
> 
> > ..., I still see these latter two tree dump scans FAIL, for GCN:
> >
> >     $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
> >       2 *: a3_40
> >       2 *: a2_39
> >     Width = 4 was chosen for reassociation
> >     Transforming _15 = powmult_1 + powmult_3;
> >      into _63 = powmult_1 + a1_38;
> >     $ grep -F .FMA pr110279-2.c.265t.optimized
> >       _63 = .FMA (a2_39, a2_39, a1_38);
> >       _64 = .FMA (a3_40, a3_40, powmult_5);
> >
> > ..., nvptx:
> >
> >     $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
> >       2 *: a3_40
> >       2 *: a2_39
> >     Width = 4 was chosen for reassociation
> >     Transforming _15 = powmult_1 + powmult_3;
> >      into _63 = powmult_1 + a1_38;
> >     $ grep -F .FMA pr110279-2.c.265t.optimized
> >       _63 = .FMA (a2_39, a2_39, a1_38);
> >       _64 = .FMA (a3_40, a3_40, powmult_5);
> 
> For these 2 targets, the reassoc_width for FMUL is 1 (the default value),
> while the testcase assumes it to be 4.  The bug was introduced when I
> updated the patch but forgot to update the testcase.
> 
> > ..., but also x86_64-pc-linux-gnu:
> >
> >     $  grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
> >       2 *: a3_40
> >       2 *: a2_39
> >     Width = 2 was chosen for reassociation
> >     Transforming _15 = powmult_1 + powmult_3;
> >      into _63 = powmult_1 + powmult_3;
> >     $ grep -cF .FMA pr110279-2.c.265t.optimized
> >     0
> 
> For x86_64 this needs "-mfma"; sorry, the compile options missed that.
> Can the change below fix these issues?  I moved the tests into
> testsuite/gcc.target/aarch64, since they rely on tunings.
> 
> Tested on aarch64-unknown-linux-gnu.
> 
> >
> > Grüße
> >  Thomas
> >
> >
> > -----------------
> > Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
> 80634
> > München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas
> > Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht
> > München, HRB 106955
> 
> Thanks,
> Di Zhao
> 
> ---
>  gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-1.c | 3 +--
>  gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-2.c | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
>  rename gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-1.c (83%)
>  rename gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-2.c (78%)
> 
> diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c
> b/gcc/testsuite/gcc.target/aarch64/pr110279-1.c
> similarity index 83%
> rename from gcc/testsuite/gcc.dg/pr110279-1.c
> rename to gcc/testsuite/gcc.target/aarch64/pr110279-1.c
> index f25b6aec967..97d693f56a5 100644
> --- a/gcc/testsuite/gcc.dg/pr110279-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110279-1.c
> @@ -1,6 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-
> width=4 -fdump-tree-widening_mul-details" } */
> -/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
> +/* { dg-options "-Ofast -mcpu=generic --param avoid-fma-max-bits=512 --param
> tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
> 
>  #define LOOP_COUNT 800000000
>  typedef double data_e;
> diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c
> b/gcc/testsuite/gcc.target/aarch64/pr110279-2.c
> similarity index 78%
> rename from gcc/testsuite/gcc.dg/pr110279-2.c
> rename to gcc/testsuite/gcc.target/aarch64/pr110279-2.c
> index b6b69969c6b..a88cb361fdc 100644
> --- a/gcc/testsuite/gcc.dg/pr110279-2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110279-2.c
> @@ -1,7 +1,6 @@
>  /* PR tree-optimization/110279 */
>  /* { dg-do compile } */
> -/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-
> fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
> -/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
> +/* { dg-options "-Ofast -mcpu=generic --param tree-reassoc-width=4 --param
> fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
> 
>  #define LOOP_COUNT 800000000
>  typedef double data_e;
> --
> 2.25.1

[-- Attachment #2: 0001-Fix-compile-options-of-pr110279-1.c-and-pr110279-2.c.patch --]
[-- Type: application/octet-stream, Size: 2606 bytes --]

From 216976028c4d5d66b1666fe501abb869d480c214 Mon Sep 17 00:00:00 2001
From: "dzhao.ampere" <di.zhao@amperecomputing.com>
Date: Sun, 17 Dec 2023 19:33:42 +0800
Subject: [PATCH] Fix compile options of pr110279-1.c and pr110279-2.c

The two testcases are for targets that support FMA. And
pr110279-2.c assumes reassoc_width of FMUL to be 4.

This patch adds missing options, to fix regression test failures
on nvptx/GCN (default reassoc_width of FMUL is 1) and x86_64
(need "-mfma").

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: Add "-mcpu=generic" for aarch64; add
	"-mfma" for x86_64.
        * gcc.dg/pr110279-2.c: Replace "-march=armv8.2-a" with
	"-mcpu=generic"; limit the check to be on aarch64.
---
 gcc/testsuite/gcc.dg/pr110279-1.c | 3 ++-
 gcc/testsuite/gcc.dg/pr110279-2.c | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c b/gcc/testsuite/gcc.dg/pr110279-1.c
index f25b6aec967..c2737418afe 100644
--- a/gcc/testsuite/gcc.dg/pr110279-1.c
+++ b/gcc/testsuite/gcc.dg/pr110279-1.c
@@ -1,6 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
-/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+/* { dg-additional-options "-mfma" { target i?86-*-* x86_64-*-* } } */
+/* { dg-additional-options "-mcpu=generic" { target aarch64*-*-* } } */
 
 #define LOOP_COUNT 800000000
 typedef double data_e;
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
index b6b69969c6b..135e64882d1 100644
--- a/gcc/testsuite/gcc.dg/pr110279-2.c
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -1,7 +1,7 @@
 /* PR tree-optimization/110279 */
 /* { dg-do compile } */
 /* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
-/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+/* { dg-additional-options "-mcpu=generic" { target aarch64*-*-* } } */
 
 #define LOOP_COUNT 800000000
 typedef double data_e;
@@ -35,5 +35,5 @@ foo (data_e in)
   return result + result2;
 }
 
-/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
-/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
\ No newline at end of file
+/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2" { target aarch64*-*-* }} } */
+/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized" { target aarch64*-*-* }} } */
\ No newline at end of file
-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-22 15:05                     ` Di Zhao OS
@ 2023-12-22 15:39                       ` Richard Biener
  2023-12-27  9:35                         ` Di Zhao OS
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-12-22 15:39 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: Thomas Schwinge, gcc-patches



> Am 22.12.2023 um 16:05 schrieb Di Zhao OS <dizhao@os.amperecomputing.com>:
> 
> Updated the fix in attachment.
> 
> Is it OK for trunk?

Ok

> Tested on aarch64-unknown-linux-gnu and x86_64-pc-linux-gnu.
> 
> Thanks,
> Di Zhao
> 
>> -----Original Message-----
>> From: Di Zhao OS <dizhao@os.amperecomputing.com>
>> Sent: Sunday, December 17, 2023 8:31 PM
>> To: Thomas Schwinge <thomas@codesourcery.com>; gcc-patches@gcc.gnu.org
>> Cc: Richard Biener <richard.guenther@gmail.com>
>> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
>> get_reassociation_width
>> 
>> Hello Thomas,
>> 
>>> -----Original Message-----
>>> From: Thomas Schwinge <thomas@codesourcery.com>
>>> Sent: Friday, December 15, 2023 5:46 PM
>>> To: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
>>> Cc: Richard Biener <richard.guenther@gmail.com>
>>> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
>>> get_reassociation_width
>>> 
>>> Hi!
>>> 
>>> On 2023-12-13T08:14:28+0000, Di Zhao OS <dizhao@os.amperecomputing.com>
>> wrote:
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.dg/pr110279-2.c
>>>> @@ -0,0 +1,41 @@
>>>> +/* PR tree-optimization/110279 */
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-
>>> pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
>>>> +/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
>>>> +
>>>> +#define LOOP_COUNT 800000000
>>>> +typedef double data_e;
>>>> +
>>>> +#include <stdio.h>
>>>> +
>>>> +__attribute_noinline__ data_e
>>>> +foo (data_e in)
>>> 
>>> Pushed to master branch commit 91e9e8faea4086b3b8aef2355fc12c1559d425f6
>>> "Fix 'gcc.dg/pr110279-2.c' syntax error due to '__attribute_noinline__'",
>>> see attached.
>>> 
>>> However:
>>> 
>>>> +{
>>>> +  data_e a1, a2, a3, a4;
>>>> +  data_e tmp, result = 0;
>>>> +  a1 = in + 0.1;
>>>> +  a2 = in * 0.1;
>>>> +  a3 = in + 0.01;
>>>> +  a4 = in * 0.59;
>>>> +
>>>> +  data_e result2 = 0;
>>>> +
>>>> +  for (int ic = 0; ic < LOOP_COUNT; ic++)
>>>> +    {
>>>> +      /* Test that a complete FMA chain with length=4 is not broken.  */
>>>> +      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
>>>> +      result += tmp - ic;
>>>> +      result2 = result2 / 2 - tmp;
>>>> +
>>>> +      a1 += 0.91;
>>>> +      a2 += 0.1;
>>>> +      a3 -= 0.01;
>>>> +      a4 -= 0.89;
>>>> +
>>>> +    }
>>>> +
>>>> +  return result + result2;
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-tree-dump-not "was chosen for reassociation"
>>> "reassoc2"} } */
>>>> +/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
>> 
>> Thank you for the fix.
>> 
>>> ..., I still see these latter two tree dump scans FAIL, for GCN:
>>> 
>>>    $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
>>>      2 *: a3_40
>>>      2 *: a2_39
>>>    Width = 4 was chosen for reassociation
>>>    Transforming _15 = powmult_1 + powmult_3;
>>>     into _63 = powmult_1 + a1_38;
>>>    $ grep -F .FMA pr110279-2.c.265t.optimized
>>>      _63 = .FMA (a2_39, a2_39, a1_38);
>>>      _64 = .FMA (a3_40, a3_40, powmult_5);
>>> 
>>> ..., nvptx:
>>> 
>>>    $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
>>>      2 *: a3_40
>>>      2 *: a2_39
>>>    Width = 4 was chosen for reassociation
>>>    Transforming _15 = powmult_1 + powmult_3;
>>>     into _63 = powmult_1 + a1_38;
>>>    $ grep -F .FMA pr110279-2.c.265t.optimized
>>>      _63 = .FMA (a2_39, a2_39, a1_38);
>>>      _64 = .FMA (a3_40, a3_40, powmult_5);
>> 
>> For these 2 targets, the reassoc_width for FMUL is 1 (the default value),
>> while the testcase assumes it to be 4.  The bug was introduced when I
>> updated the patch but forgot to update the testcase.
>> 
>>> ..., but also x86_64-pc-linux-gnu:
>>> 
>>>    $  grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
>>>      2 *: a3_40
>>>      2 *: a2_39
>>>    Width = 2 was chosen for reassociation
>>>    Transforming _15 = powmult_1 + powmult_3;
>>>     into _63 = powmult_1 + powmult_3;
>>>    $ grep -cF .FMA pr110279-2.c.265t.optimized
>>>    0
>> 
>> For x86_64 this needs "-mfma"; sorry, the compile options missed that.
>> Can the change below fix these issues?  I moved the tests into
>> testsuite/gcc.target/aarch64, since they rely on tunings.
>> 
>> Tested on aarch64-unknown-linux-gnu.
>> 
>>> 
>>> Grüße
>>> Thomas
>>> 
>>> 
>>> -----------------
>>> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
>> 80634
>>> München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas
>>> Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht
>>> München, HRB 106955
>> 
>> Thanks,
>> Di Zhao
>> 
>> ---
>> gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-1.c | 3 +--
>> gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-2.c | 3 +--
>> 2 files changed, 2 insertions(+), 4 deletions(-)
>> rename gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-1.c (83%)
>> rename gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-2.c (78%)
>> 
>> diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c
>> b/gcc/testsuite/gcc.target/aarch64/pr110279-1.c
>> similarity index 83%
>> rename from gcc/testsuite/gcc.dg/pr110279-1.c
>> rename to gcc/testsuite/gcc.target/aarch64/pr110279-1.c
>> index f25b6aec967..97d693f56a5 100644
>> --- a/gcc/testsuite/gcc.dg/pr110279-1.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/pr110279-1.c
>> @@ -1,6 +1,5 @@
>> /* { dg-do compile } */
>> -/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-
>> width=4 -fdump-tree-widening_mul-details" } */
>> -/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
>> +/* { dg-options "-Ofast -mcpu=generic --param avoid-fma-max-bits=512 --param
>> tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
>> 
>> #define LOOP_COUNT 800000000
>> typedef double data_e;
>> diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c
>> b/gcc/testsuite/gcc.target/aarch64/pr110279-2.c
>> similarity index 78%
>> rename from gcc/testsuite/gcc.dg/pr110279-2.c
>> rename to gcc/testsuite/gcc.target/aarch64/pr110279-2.c
>> index b6b69969c6b..a88cb361fdc 100644
>> --- a/gcc/testsuite/gcc.dg/pr110279-2.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/pr110279-2.c
>> @@ -1,7 +1,6 @@
>> /* PR tree-optimization/110279 */
>> /* { dg-do compile } */
>> -/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-
>> fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
>> -/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
>> +/* { dg-options "-Ofast -mcpu=generic --param tree-reassoc-width=4 --param
>> fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
>> 
>> #define LOOP_COUNT 800000000
>> typedef double data_e;
>> --
>> 2.25.1
> <0001-Fix-compile-options-of-pr110279-1.c-and-pr110279-2.c.patch>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
  2023-12-22 15:39                       ` Richard Biener
@ 2023-12-27  9:35                         ` Di Zhao OS
  0 siblings, 0 replies; 18+ messages in thread
From: Di Zhao OS @ 2023-12-27  9:35 UTC (permalink / raw)
  To: Richard Biener; +Cc: Thomas Schwinge, gcc-patches

Committed at 6cec7b06b3c8187b36fc05cfd4dd38b42313d727

Thanks,
Di

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Friday, December 22, 2023 11:40 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: Thomas Schwinge <thomas@codesourcery.com>; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> 
> 
> > Am 22.12.2023 um 16:05 schrieb Di Zhao OS <dizhao@os.amperecomputing.com>:
> >
> > Updated the fix in attachment.
> >
> > Is it OK for trunk?
> 
> Ok
> 
> > Tested on aarch64-unknown-linux-gnu and x86_64-pc-linux-gnu.
> >
> > Thanks,
> > Di Zhao
> >
> >> -----Original Message-----
> >> From: Di Zhao OS <dizhao@os.amperecomputing.com>
> >> Sent: Sunday, December 17, 2023 8:31 PM
> >> To: Thomas Schwinge <thomas@codesourcery.com>; gcc-patches@gcc.gnu.org
> >> Cc: Richard Biener <richard.guenther@gmail.com>
> >> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
> >> get_reassociation_width
> >>
> >> Hello Thomas,
> >>
> >>> -----Original Message-----
> >>> From: Thomas Schwinge <thomas@codesourcery.com>
> >>> Sent: Friday, December 15, 2023 5:46 PM
> >>> To: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> >>> Cc: Richard Biener <richard.guenther@gmail.com>
> >>> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
> >>> get_reassociation_width
> >>>
> >>> Hi!
> >>>
> >>> On 2023-12-13T08:14:28+0000, Di Zhao OS <dizhao@os.amperecomputing.com>
> >> wrote:
> >>>> --- /dev/null
> >>>> +++ b/gcc/testsuite/gcc.dg/pr110279-2.c
> >>>> @@ -0,0 +1,41 @@
> >>>> +/* PR tree-optimization/110279 */
> >>>> +/* { dg-do compile } */
> >>>> +/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-
> >>> pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
> >>>> +/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } }
> */
> >>>> +
> >>>> +#define LOOP_COUNT 800000000
> >>>> +typedef double data_e;
> >>>> +
> >>>> +#include <stdio.h>
> >>>> +
> >>>> +__attribute_noinline__ data_e
> >>>> +foo (data_e in)
> >>>
> >>> Pushed to master branch commit 91e9e8faea4086b3b8aef2355fc12c1559d425f6
> >>> "Fix 'gcc.dg/pr110279-2.c' syntax error due to '__attribute_noinline__'",
> >>> see attached.
> >>>
> >>> However:
> >>>
> >>>> +{
> >>>> +  data_e a1, a2, a3, a4;
> >>>> +  data_e tmp, result = 0;
> >>>> +  a1 = in + 0.1;
> >>>> +  a2 = in * 0.1;
> >>>> +  a3 = in + 0.01;
> >>>> +  a4 = in * 0.59;
> >>>> +
> >>>> +  data_e result2 = 0;
> >>>> +
> >>>> +  for (int ic = 0; ic < LOOP_COUNT; ic++)
> >>>> +    {
> >>>> +      /* Test that a complete FMA chain with length=4 is not broken.  */
> >>>> +      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
> >>>> +      result += tmp - ic;
> >>>> +      result2 = result2 / 2 - tmp;
> >>>> +
> >>>> +      a1 += 0.91;
> >>>> +      a2 += 0.1;
> >>>> +      a3 -= 0.01;
> >>>> +      a4 -= 0.89;
> >>>> +
> >>>> +    }
> >>>> +
> >>>> +  return result + result2;
> >>>> +}
> >>>> +
> >>>> +/* { dg-final { scan-tree-dump-not "was chosen for reassociation"
> >>> "reassoc2"} } */
> >>>> +/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
> >>
> >> Thank you for the fix.
> >>
> >>> ..., I still see these latter two tree dump scans FAIL, for GCN:
> >>>
> >>>    $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
> >>>      2 *: a3_40
> >>>      2 *: a2_39
> >>>    Width = 4 was chosen for reassociation
> >>>    Transforming _15 = powmult_1 + powmult_3;
> >>>     into _63 = powmult_1 + a1_38;
> >>>    $ grep -F .FMA pr110279-2.c.265t.optimized
> >>>      _63 = .FMA (a2_39, a2_39, a1_38);
> >>>      _64 = .FMA (a3_40, a3_40, powmult_5);
> >>>
> >>> ..., nvptx:
> >>>
> >>>    $ grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
> >>>      2 *: a3_40
> >>>      2 *: a2_39
> >>>    Width = 4 was chosen for reassociation
> >>>    Transforming _15 = powmult_1 + powmult_3;
> >>>     into _63 = powmult_1 + a1_38;
> >>>    $ grep -F .FMA pr110279-2.c.265t.optimized
> >>>      _63 = .FMA (a2_39, a2_39, a1_38);
> >>>      _64 = .FMA (a3_40, a3_40, powmult_5);
> >>
> >> For these 2 targets, the reassoc_width for FMUL is 1 (the default value),
> >> while the testcase assumes it to be 4.  The bug was introduced when I
> >> updated the patch but forgot to update the testcase.
> >>
> >>> ..., but also x86_64-pc-linux-gnu:
> >>>
> >>>    $  grep -C2 'was chosen for reassociation' pr110279-2.c.197t.reassoc2
> >>>      2 *: a3_40
> >>>      2 *: a2_39
> >>>    Width = 2 was chosen for reassociation
> >>>    Transforming _15 = powmult_1 + powmult_3;
> >>>     into _63 = powmult_1 + powmult_3;
> >>>    $ grep -cF .FMA pr110279-2.c.265t.optimized
> >>>    0
> >>
> >> For x86_64 this needs "-mfma"; sorry, the compile options missed that.
> >> Can the change below fix these issues?  I moved the tests into
> >> testsuite/gcc.target/aarch64, since they rely on tunings.
> >>
> >> Tested on aarch64-unknown-linux-gnu.
> >>
> >>>
> >>> Grüße
> >>> Thomas
> >>>
> >>>
> >>> -----------------
> >>> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
> >> 80634
> >>> München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas
> >>> Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht
> >>> München, HRB 106955
> >>
> >> Thanks,
> >> Di Zhao
> >>
> >> ---
> >> gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-1.c | 3 +--
> >> gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-2.c | 3 +--
> >> 2 files changed, 2 insertions(+), 4 deletions(-)
> >> rename gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-1.c (83%)
> >> rename gcc/testsuite/{gcc.dg => gcc.target/aarch64}/pr110279-2.c (78%)
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c
> >> b/gcc/testsuite/gcc.target/aarch64/pr110279-1.c
> >> similarity index 83%
> >> rename from gcc/testsuite/gcc.dg/pr110279-1.c
> >> rename to gcc/testsuite/gcc.target/aarch64/pr110279-1.c
> >> index f25b6aec967..97d693f56a5 100644
> >> --- a/gcc/testsuite/gcc.dg/pr110279-1.c
> >> +++ b/gcc/testsuite/gcc.target/aarch64/pr110279-1.c
> >> @@ -1,6 +1,5 @@
> >> /* { dg-do compile } */
> >> -/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-
> reassoc-
> >> width=4 -fdump-tree-widening_mul-details" } */
> >> -/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
> >> +/* { dg-options "-Ofast -mcpu=generic --param avoid-fma-max-bits=512 --
> param
> >> tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
> >>
> >> #define LOOP_COUNT 800000000
> >> typedef double data_e;
> >> diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c
> >> b/gcc/testsuite/gcc.target/aarch64/pr110279-2.c
> >> similarity index 78%
> >> rename from gcc/testsuite/gcc.dg/pr110279-2.c
> >> rename to gcc/testsuite/gcc.target/aarch64/pr110279-2.c
> >> index b6b69969c6b..a88cb361fdc 100644
> >> --- a/gcc/testsuite/gcc.dg/pr110279-2.c
> >> +++ b/gcc/testsuite/gcc.target/aarch64/pr110279-2.c
> >> @@ -1,7 +1,6 @@
> >> /* PR tree-optimization/110279 */
> >> /* { dg-do compile } */
> >> -/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-
> pipelined-
> >> fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
> >> -/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
> >> +/* { dg-options "-Ofast -mcpu=generic --param tree-reassoc-width=4 --param
> >> fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" }
> */
> >>
> >> #define LOOP_COUNT 800000000
> >> typedef double data_e;
> >> --
> >> 2.25.1
> > <0001-Fix-compile-options-of-pr110279-1.c-and-pr110279-2.c.patch>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-12-27  9:35 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-14 12:43 [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width Di Zhao OS
2023-10-06  9:33 ` Richard Biener
2023-10-08 16:39   ` Di Zhao OS
2023-10-23  3:49     ` [PING][PATCH " Di Zhao OS
2023-10-31 13:47     ` [PATCH " Richard Biener
2023-11-09 17:53       ` Di Zhao OS
2023-11-21 13:01         ` Richard Biener
2023-11-29 14:35           ` Di Zhao OS
2023-12-11 11:01             ` Richard Biener
2023-12-13  8:14               ` Di Zhao OS
2023-12-13  9:00                 ` Richard Biener
2023-12-14 20:55                   ` Di Zhao OS
2023-12-15  7:23                     ` Richard Biener
2023-12-15  9:46                 ` Thomas Schwinge
2023-12-17 12:30                   ` Di Zhao OS
2023-12-22 15:05                     ` Di Zhao OS
2023-12-22 15:39                       ` Richard Biener
2023-12-27  9:35                         ` Di Zhao OS

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).