From: Di Zhao OS <dizhao@os.amperecomputing.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
Date: Sun, 8 Oct 2023 16:39:56 +0000	[thread overview]
Message-ID: <SN6PR01MB4240C36D35F02379B33C258DE8CFA@SN6PR01MB4240.prod.exchangelabs.com> (raw)
In-Reply-To: <CAFiYyc02ooZrRicHc1rDtjKuc2pHn_DwfBW441F-yTWE5UG-Ew@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9162 bytes --]

Attached is a new version of the patch.

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Friday, October 6, 2023 5:33 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
> 
> On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > This is a new version of the patch on "nested FMA".
> > Sorry for updating this after so long, I've been studying and
> > writing micro cases to sort out the cause of the regression.
> 
> Sorry for taking so long to reply.
> 
> > First, following previous discussion:
> > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> >
> > 1. From testing more altered cases, I don't think the
> > problem is that reassociation works locally. In that:
> >
> >   1) On the example with multiplications:
> >
> >         tmp1 = a + c * c + d * d + x * y;
> >         tmp2 = x * tmp1;
> >         result += (a + c + d + tmp2);
> >
> >   Given "result" rewritten by width=2, the performance is
> >   worse if we rewrite "tmp1" with width=2. In contrast, if we
> >   remove the multiplications from the example (and make "tmp1"
> >   not single-use), and still rewrite "result" by width=2, then
> >   rewriting "tmp1" with width=2 is better. (This makes sense because
> >   the tree's depth at "result" is still smaller if we rewrite
> >   "tmp1".)
> >
> >   2) I tried to modify the assembly code of the example without
> >   FMA, so the width of "result" is 4. On Ampere1 there's no
> >   obvious improvement. So although this is an interesting
> >   problem, it doesn't seem like the cause of the regression.
> 
> OK, I see.
> 
> > 2. From the assembly code of the case with FMA, one problem is
> > that rewriting "tmp1" to parallel didn't decrease the
> > minimum CPU cycles (taking MULT_EXPRs into account), but
> > increased code size, so the overhead increased.
> >
> >    a) When "tmp1" is not re-written to parallel:
> >         fmadd d31, d2, d2, d30
> >         fmadd d31, d3, d3, d31
> >         fmadd d31, d4, d5, d31  //"tmp1"
> >         fmadd d31, d31, d4, d3
> >
> >    b) When "tmp1" is re-written to parallel:
> >         fmul  d31, d4, d5
> >         fmadd d27, d2, d2, d30
> >         fmadd d31, d3, d3, d31
> >         fadd  d31, d31, d27     //"tmp1"
> >         fmadd d31, d31, d4, d3
> >
> > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > For version b), there are also 3 dependent instructions in the
> > longer path: the 1st, 3rd and 4th.
> 
> Yes, it doesn't really change anything.  The patch has
> 
> +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
> +     uarchs can execute results like:
> +
> +       _1 = a * b;
> +       _2 = .FMA (c, d, _1);
> +       acc_1 = acc_0 + _2;
> +
> +     in parallel, while turning it into
> +
> +       _1 = .FMA(a, b, acc_0);
> +       acc_1 = .FMA(c, d, _1);
> +
> +     hinders that, because then the first FMA depends on the result
> of preceding
> +     iteration.  */
> 
> I can't see what can be run in parallel for the first case.  The .FMA
> depends on the multiplication a * b.  Iff the uarch somehow decomposes
> .FMA into multiply + add then the c * d multiply could run in parallel
> with the a * b multiply which _might_ be able to hide some of the
> latency of the full .FMA.  Like on x86 Zen FMA has a latency of 4
> cycles but a multiply only 3.  But I never got confirmation from any
> of the CPU designers that .FMAs are issued when the multiply
> operands are ready and the add operand can be forwarded.
> 
> I also wonder why the multiplications of the two-FMA sequence
> then cannot be executed at the same time?  So I have some doubt
> of the theory above.

The parallel execution of the code snippet above was the other
issue (previously discussed here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
Sorry it's a bit confusing to include that here, but the two fixes
need to be combined to avoid new regressions: since considering
FMA in get_reassociation_width produces more results with width=1,
there would be more loop-dependent FMA chains.
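
For illustration, a minimal loop with such a backedge dependency
(the same shape as foo2 in the attached pr110279-3.c; the function
name is invented for this sketch) looks like:

  /* With width=1 the whole sum stays in one chain, so after
     widening_mul the first FMA of each iteration consumes the
     "result" produced by the previous iteration.  */
  double
  loop_dep_fma (double a, double b, double c, double d, int n)
  {
    double result = 0;
    for (int ic = 0; ic < n; ic++)
      {
        result = a * b + result + c * d;
        a -= 0.1;    /* Vary the inputs across iterations.  */
        c *= 1.02;
      }
    return result;
  }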

> Iff this really is the reason for the sequence to execute with lower
> overall latency and we want to attack this on GIMPLE then I think
> we need a target hook telling us this fact (I also wonder if such
> behavior can be modeled in the scheduler pipeline description at all?)
> 
> > So it seems to me the current get_reassociation_width algorithm
> > isn't optimal in the presence of FMA. So I modified the patch to
> > improve get_reassociation_width, rather than checking for code
> > patterns. (There could be other complicating factors that make
> > the regression more obvious when there's "nested FMA", but with
> > this patch that should be avoided or reduced.)
> >
> > With this patch 508.namd_r 1-copy run has 7% improvement on
> > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > collecting data on other CPUs, I'd like to know what you
> > think of this.
> >
> > About changes in the patch:
> >
> > 1. When the op list forms a complete FMA chain, try to search
> > for a smaller width, considering the benefit of using FMA. With
> > a smaller width, the code size increase from breaking the chain
> > is smaller.
> 
> But this is all highly target specific (code size even more so).
>
> How I understand your approach to fixing the issue leads me to
> the suggestion to prioritize parallel rewriting, thus alter rank_ops_for_fma,
> taking the reassoc width into account (the computed width should be
> unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> rewriting of FMAs (well, they are not yet formed of course).
> get_reassociation_width has 'get_required_cycles', the above theory
> could be verified with a very simple toy pipeline model.  We'd have
> to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
> even FMA_EXPRs).
> 
> Taking the width of FMAs into account when computing the reassoc width
> might be another way to attack this.

Previously I tried to solve this generally, on the assumption that
FMA (with its smaller code size) is preferred. Now I agree it's
difficult, since: 1) as you mentioned, the latencies of FMA, FMUL
and FADD can differ; 2) from my test results on the different
machines we have, simply adding the cycles together is not a good
way to estimate the latency of consecutive FMAs.

I think an easier way to fix this is to add a parameter suggesting
the maximum length of complete FMA chain to keep. (It can then be
set by target-specific tuning.) Longer FMA chains can be broken for
better parallelism. Attached is the new implementation. With
max-fma-chain-len=8, there's about 7% improvement in SPEC2017
508.namd_r on Ampere1, and the overall improvement on fprate is
about 1%.
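
To make the heuristic concrete, here is the new width computation
worked through on the shape used in the attached pr110279-2.c:

  /* ops_num  = 10    operands in the PLUS chain
     mult_num = 8     operands defined by MULT_EXPRs

     Each FMA chain can absorb one non-mult op as its accumulator
     seed, and joining num_fma_chain partial sums takes
     num_fma_chain - 1 additions, hence
     num_fma_chain + (num_fma_chain - 1) >= num_others:

       num_others    = ops_num - mult_num       = 2
       num_fma_chain = CEIL (num_others + 1, 2) = 2

     Since CEIL (mult_num, num_fma_chain) = CEIL (8, 2) = 4 is
     within max-fma-chain-len=8, the width is lowered to 2 and each
     of the two chains keeps its 4 FMAs unbroken.  */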

Since rank_ops_for_fma already contains code to separate MULT_EXPRs
from the other ops, I left it running before get_reassociation_width,
so that the number of MULT_EXPRs is available there.
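
As a reminder of what that ordering buys (this restates the example
from the comment above rank_ops_for_fma in the patch): for
e + a * b + c * d, putting the non-mult op first lets it seed the
chain, so widening_mul can form

  _4  = .FMA (c_7(D), d_8(D), _3);
  _11 = .FMA (a_5(D), b_6(D), _4);

(where _3 holds e) instead of a separate multiply plus add.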

> 
> > 2. To avoid regressions, included the other patch
> > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > on this tracker again. This is because more FMAs will be kept
> > with change 1., so we need to rule out the loop-dependent
> > FMA chains when param_avoid_fma_max_bits is set.
> 
> Sorry again for taking so long to reply.
> 
> I'll note we have an odd case on x86 Zen2(?) as well which we don't really
> understand from a CPU behavior perspective.
> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Di Zhao
> >
> > ----
> >
> >         PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> >         New function to check whether ranking the ops results in
> >         better parallelism.
> >         (get_reassociation_width): Add new parameters. Search for
> >         smaller width considering the benefit of FMA.
> >         (rank_ops_for_fma): Change return value to be number of
> >         MULT_EXPRs.
> >         (reassociate_bb): For 3 ops, refine the condition to call
> >         swap_ops_for_binary_stmt.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/pr110279.c: New test.

Thanks,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * doc/invoke.texi: Document new parameter max-fma-chain-len.
        * params.opt: Add new parameter max-fma-chain-len.
        * tree-ssa-reassoc.cc (get_reassociation_width):
        Support param_max_fma_chain_len; check for loop-dependent
        FMAs.
        (rank_ops_for_fma): Return the number of MULT_EXPRs.
        (reassociate_bb): For 3 ops, refine the condition to call
        swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: New test.
        * gcc.dg/pr110279-2.c: New test.
        * gcc.dg/pr110279-3.c: New test.
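
For reference, the new parameter can be exercised the same way the
added tests do (aarch64 options shown; substitute your own source
file):

  gcc -Ofast --param tree-reassoc-width=4 --param max-fma-chain-len=8 \
      -march=armv8.2-a -fdump-tree-reassoc2-details -S pr110279-1.c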

[-- Attachment #2: 0001-Keep-FMA-chains-in-reassoc-based-on-new-parameter.patch --]
[-- Type: application/octet-stream, Size: 14002 bytes --]

From 4890f1a78a85e9b731b34bfded9fa79b782fc0a6 Mon Sep 17 00:00:00 2001
From: "dzhao.ampere" <di.zhao@amperecomputing.com>
Date: Sat, 7 Oct 2023 19:27:22 +0800
Subject: [PATCH] Keep FMA chains in reassoc based on new parameter

Add a new parameter param_max_fma_chain_len, to suggest the
maximum length of FMA chain to be kept in reassoc2.
---
 gcc/doc/invoke.texi               |   3 +
 gcc/params.opt                    |   4 +
 gcc/testsuite/gcc.dg/pr110279-1.c |  47 +++++++++++
 gcc/testsuite/gcc.dg/pr110279-2.c |  50 ++++++++++++
 gcc/testsuite/gcc.dg/pr110279-3.c |  65 +++++++++++++++
 gcc/tree-ssa-reassoc.cc           | 131 ++++++++++++++++++++++++++----
 6 files changed, 284 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-2.c
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-3.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 23083467d47..c928f4bfb0e 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16204,6 +16204,9 @@ Emit instrumentation calls to __tsan_func_entry() and __tsan_func_exit().
 Maximum number of instructions to copy when duplicating blocks on a
 finite state automaton jump thread path.
 
+@item max-fma-chain-len
+The maximum number of consecutive FMAs that we'd like not to break.
+
 @item threader-debug
 threader-debug=[none|all] Enables verbose dumping of the threader solver.
 
diff --git a/gcc/params.opt b/gcc/params.opt
index fffa8b1bc64..c22dbda1e9f 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -502,6 +502,10 @@ The maximum number of nested indirect inlining performed by early inliner.
 Common Joined UInteger Var(param_max_fields_for_field_sensitive) Param
 Maximum number of fields in a structure before pointer analysis treats the structure as a single variable.
 
+-param=max-fma-chain-len=
+Common Joined UInteger Var(param_max_fma_chain_len) IntegerRange(0, 512) Param Optimization
+The maximum number of consecutive FMAs that we'd like not to break.
+
 -param=max-fsm-thread-path-insns=
 Common Joined UInteger Var(param_max_fsm_thread_path_insns) Init(100) IntegerRange(1, 999999) Param Optimization
 Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path.
diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c b/gcc/testsuite/gcc.dg/pr110279-1.c
new file mode 100644
index 00000000000..51cccd5d8b7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-1.c
@@ -0,0 +1,47 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 --param max-fma-chain-len=8 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+__attribute_noinline__ data_e
+foo (data_e in)
+{
+  data_e a1, a2, a3, a4;
+  data_e tmp, result = 0;
+  a1 = in + 0.1;
+  a2 = in * 0.1;
+  a3 = in + 0.01;
+  a4 = in * 0.59;
+
+  data_e result2 = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* Test that a complete FMA chain with length=4 is not broken.  */
+      tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
+      result += tmp - ic;
+      result2 = result2 / 2 - tmp;
+
+      a1 += 0.91;
+      a2 += 0.1;
+      a3 -= 0.01;
+      a4 -= 0.89;
+
+    }
+
+  return result + result2;
+}
+
+int
+main (int argc, char **argv)
+{
+  printf ("%f\n", foo (-1.2));
+}
+
+/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
+/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
new file mode 100644
index 00000000000..423e65d2d7f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -0,0 +1,50 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 --param max-fma-chain-len=8 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+data_e
+foo (data_e in)
+{
+  data_e a1, a2, a3, a4, a5, a6, a7, a8, a9, a10;
+  data_e tmp, result, result2 = 0;
+  a1 = in + 0.1;
+  a2 = in * 0.1;
+  a3 = in + 0.01;
+  a4 = in * 0.59;
+  a5 = in;
+  a6 = in * 2;
+  a7 = in - 0.1;
+  a8 = in * 0.09;
+  a9 = in * 2 - 2;
+  a10 = in / 2 + 0.7;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* op_num=10, mult_exprs_num=8. Test that the op list is broken into 2
+         complete FMA chains.  */
+      tmp = a1 * a1 + a2 * a2 + a3 * a3 + a4 + a5 * a5 + a6 + a7 * a7 + a8 * a8
+	    + a9 * a9 + a10 * a10;
+      result += tmp - ic;
+
+      a1 += 0.91;
+      a2 += 0.1;
+      a3 -= 0.01;
+      a4 -= 1.0;
+      a6 += 0.09;
+      a7 -= 1.9;
+      a8 = a1 + a2;
+      a9 += a2;
+      a10 -= a4;
+
+      result2 = result2 - a1 - tmp * 0.2;
+    }
+
+  return result * result2;
+}
+
+/* { dg-final { scan-tree-dump-times "Width = 2 was chosen for reassociation" 1 "reassoc2"} } */
+/* { dg-final { scan-tree-dump-times {\.FMA } 9 "optimized"} } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/pr110279-3.c b/gcc/testsuite/gcc.dg/pr110279-3.c
new file mode 100644
index 00000000000..d94474cc42d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-3.c
@@ -0,0 +1,65 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+/* Check that FMAs with backedge dependency are avoided.  Otherwise no FMA
+   will be generated with "--param avoid-fma-max-bits=512".  */
+
+data_e
+foo1 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += (a * b + c * d);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e
+foo2 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result = a * b + result + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e
+foo3 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result = result + a * b + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+/* { dg-final { scan-tree-dump-times "Generated FMA" 3 "widening_mul"} } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 41ee36413b5..babd58aa88a 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5431,16 +5431,20 @@ get_required_cycles (int ops_num, int cpu_width)
 }
 
 /* Returns an optimal number of registers to use for computation of
-   given statements.  */
+   given statements.
+
+   LHS is the result ssa name of OPS.  MULT_NUM is number of MULT_EXPRs in
+   OPS.  */
 
 static int
-get_reassociation_width (int ops_num, enum tree_code opc,
-			 machine_mode mode)
+get_reassociation_width (vec<operand_entry *> *ops, int mult_num, tree lhs,
+			 enum tree_code opc, machine_mode mode)
 {
   int param_width = param_tree_reassoc_width;
   int width;
   int width_min;
   int cycles_best;
+  int ops_num = ops->length ();
 
   if (param_width > 0)
     width = param_width;
@@ -5471,6 +5475,74 @@ get_reassociation_width (int ops_num, enum tree_code opc,
 	break;
     }
 
+  /* Check if keeping complete FMA chains is preferred.  */
+  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
+    {
+      /* num_fma_chain + (num_fma_chain - 1) >= num_others.  */
+      int num_others = ops_num - mult_num;
+      int num_fma_chain = CEIL (num_others + 1, 2);
+
+      if (num_fma_chain < width
+	  && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
+	width = num_fma_chain;
+    }
+
+  /* If there's a loop-dependent FMA result, return width=2 to avoid it.  This is
+     better than skipping these FMA candidates in widening_mul.  */
+  if (width == 1 && mult_num
+      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+		   param_avoid_fma_max_bits))
+    {
+      /* Look for cross backedge dependency:
+	1. LHS is a phi argument in the same basic block it is defined.
+	2. And the result of the phi node is used in OPS.  */
+      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
+	  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+	    {
+	      tree op = PHI_ARG_DEF (phi, i);
+	      if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+		continue;
+	      tree phi_result = gimple_phi_result (phi);
+	      operand_entry *oe;
+	      unsigned int j;
+	      FOR_EACH_VEC_ELT (*ops, j, oe)
+		{
+		  if (TREE_CODE (oe->op) != SSA_NAME)
+		    continue;
+
+		  /* Result of phi is operand of PLUS_EXPR.  */
+		  if (oe->op == phi_result)
+		    return 2;
+
+		  /* Check if result of phi is operand of MULT_EXPR.  */
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+		  if (is_gimple_assign (def_stmt)
+		      && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR)
+		    {
+		      tree rhs = gimple_assign_rhs1 (def_stmt);
+		      if (TREE_CODE (rhs) == SSA_NAME)
+			{
+			  if (rhs == phi_result)
+			    return 2;
+			  def_stmt = SSA_NAME_DEF_STMT (rhs);
+			}
+		    }
+		  if (is_gimple_assign (def_stmt)
+		      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		    {
+		      if (gimple_assign_rhs1 (def_stmt) == phi_result
+			  || gimple_assign_rhs2 (def_stmt) == phi_result)
+			return 2;
+		    }
+		}
+	    }
+	}
+    }
+
   return width;
 }
 
@@ -6783,8 +6855,10 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
    Rearrange ops to -> e + a * b + c * d generates:
 
    _4  = .FMA (c_7(D), d_8(D), _3);
-   _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
+   _11 = .FMA (a_5(D), b_6(D), _4);
+
+   Return the number of MULT_EXPRs in the chain.  */
+static int
 rank_ops_for_fma (vec<operand_entry *> *ops)
 {
   operand_entry *oe;
@@ -6798,9 +6872,26 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
       if (TREE_CODE (oe->op) == SSA_NAME)
 	{
 	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
-	  if (is_gimple_assign (def_stmt)
-	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
-	    ops_mult.safe_push (oe);
+	  if (is_gimple_assign (def_stmt))
+	    {
+	      if (gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		ops_mult.safe_push (oe);
+	      /* A negate on the multiplication leads to FNMA.  */
+	      else if (gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
+		       && TREE_CODE (gimple_assign_rhs1 (def_stmt)) == SSA_NAME)
+		{
+		  gimple *neg_def_stmt
+		    = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (def_stmt));
+		  if (is_gimple_assign (neg_def_stmt)
+		      && gimple_bb (neg_def_stmt) == gimple_bb (def_stmt)
+		      && gimple_assign_rhs_code (neg_def_stmt) == MULT_EXPR)
+		    ops_mult.safe_push (oe);
+		  else
+		    ops_others.safe_push (oe);
+		}
+	      else
+		ops_others.safe_push (oe);
+	    }
 	  else
 	    ops_others.safe_push (oe);
 	}
@@ -6816,7 +6907,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
      Putting ops that not def from mult in front can generate more FMAs.
 
      2. If all ops are defined with mult, we don't need to rearrange them.  */
-  if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
+  unsigned mult_num = ops_mult.length ();
+  if (mult_num >= 2 && mult_num != ops_length)
     {
       /* Put no-mult ops and mult ops alternately at the end of the
 	 queue, which is conducive to generating more FMA and reducing the
@@ -6832,9 +6924,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
     }
-  return false;
+  return mult_num;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -7000,7 +7091,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
-		  bool has_fma = false;
+		  int mult_num = 0;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7023,16 +7114,18 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      mult_num = rank_ops_for_fma (&ops);
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
+		  bool has_fma = mult_num >= 2 && mult_num != ops_num;
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
-		      && (width = get_reassociation_width (ops_num, rhs_code,
-							   mode)) > 1)
+		      && (width = get_reassociation_width (&ops, mult_num, lhs,
+							   rhs_code, mode))
+			   > 1)
 		    {
 		      if (dump_file && (dump_flags & TDF_DETAILS))
 			fprintf (dump_file,
@@ -7049,7 +7142,13 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3
+			  && (!has_fma
+			      /* width > 1 means ranking ops results in better
+				 parallelism.  */
+			      || get_reassociation_width (&ops, mult_num, lhs,
+							  rhs_code, mode)
+				   > 1))
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


Thread overview: 18+ messages
2023-09-14 12:43 Di Zhao OS
2023-10-06  9:33 ` Richard Biener
2023-10-08 16:39   ` Di Zhao OS [this message]
2023-10-23  3:49     ` [PING][PATCH " Di Zhao OS
2023-10-31 13:47     ` [PATCH " Richard Biener
2023-11-09 17:53       ` Di Zhao OS
2023-11-21 13:01         ` Richard Biener
2023-11-29 14:35           ` Di Zhao OS
2023-12-11 11:01             ` Richard Biener
2023-12-13  8:14               ` Di Zhao OS
2023-12-13  9:00                 ` Richard Biener
2023-12-14 20:55                   ` Di Zhao OS
2023-12-15  7:23                     ` Richard Biener
2023-12-15  9:46                 ` Thomas Schwinge
2023-12-17 12:30                   ` Di Zhao OS
2023-12-22 15:05                     ` Di Zhao OS
2023-12-22 15:39                       ` Richard Biener
2023-12-27  9:35                         ` Di Zhao OS
