public inbox for gcc-patches@gcc.gnu.org
* [RFC][PATCH] Improve generating FMA by adding a widening_mul pass
@ 2023-05-26  7:14 Di Zhao OS
  2023-05-30  7:52 ` Di Zhao OS
  0 siblings, 1 reply; 2+ messages in thread
From: Di Zhao OS @ 2023-05-26  7:14 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2532 bytes --]

As GCC's reassociation pass has no knowledge of FMA, it reduces the
opportunities to generate FMAs when transforming expression lists to
parallel. Currently there's a workaround on AArch64
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114), which disables
the parallel rewriting for floating-point additions. However, this
approach may cause regressions. For example, in the code below only
floating-point additions are involved in calculating
"result += array[j]", and rewriting to parallel is better:

// Compile with -Ofast on aarch64
float foo (int n, float in)
{
  float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
  float result = 0.0;
  for (int i = 0; i < n; i++)
    {
      if (i % 10)
        for (unsigned j = 0; j < 8; j++)
          array[j] *= in;

      for (unsigned j = 0; j < 8; j++)
        result += array[j];
    }
  return result;
}

To improve this, one option is to count the number of MUL_EXPRs in an
operator list before rewriting to parallel, and allow the rewriting
only when there is none (or at most one MUL_EXPR). This is simple and
unlikely to introduce regressions. However, it lacks flexibility and
cannot handle more general cases.
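
For illustration, here is a rough sketch of such a check over the
reassociation pass's operand list, written against GCC's internal APIs;
the helper name count_mul_exprs is hypothetical and not part of this
patch:

/* Count the entries in the operand list OPS whose defining statement
   is a MULT_EXPR, i.e. the potential FMA candidates.  */
static unsigned
count_mul_exprs (const vec<operand_entry *> &ops)
{
  unsigned i, n = 0;
  operand_entry *oe;
  FOR_EACH_VEC_ELT (ops, i, oe)
    if (TREE_CODE (oe->op) == SSA_NAME)
      {
        gimple *def = SSA_NAME_DEF_STMT (oe->op);
        if (is_gimple_assign (def)
            && gimple_assign_rhs_code (def) == MULT_EXPR)
          n++;
      }
  return n;
}

The rewriting to parallel would then be allowed only when this returns
0 or 1.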

Here's an attempt to address the issue more generally.

1. Added an additional widening_mul pass before the original reassoc2
pass. The new pass is limited to inserting FMAs only, and leaves other
operations like convert_mult_to_widen to the old, late widening_mul
pass, so that other optimizations between the two passes are not
hindered.

2. On some platforms, for a very long FMA chain, rewriting to parallel
can be faster. Extended the original "deferring" logic so that all
conversions to FMA can be deferred. Introduced a new parameter
op-count-prefer-reassoc to control this behavior (an example invocation
is shown after this list).

3. Additionally, the new widening_mul pass calls execute_reassoc first,
to avoid losing opportunities such as folding constants and
undistributing.
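
With the patch applied, the threshold from point 2 can be tuned on the
command line, for example (the value 8 below is just an illustration;
it matches the default this patch sets for ampere1):

gcc -Ofast --param=op-count-prefer-reassoc=8 foo.c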
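
To illustrate the opportunities mentioned in point 3 (a made-up example,
not taken from the patch), in the function below reassociation can fold
the two constants together and undistribute the common factor "c" before
FMAs are inserted:

// Compile with -Ofast
float bar (float a, float b, float c)
{
  /* Reassociation can fold 1.0f + 2.0f into 3.0f and rewrite
     a * c + b * c into (a + b) * c before FMA insertion.  */
  return a * c + 1.0f + b * c + 2.0f;
}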

However, changing the order of FMA generation and reassociation may
expose more FMA chains that are slow (see commit 4a0d0ed2).
To reduce possible regressions, the handling of slow FMA chains is
improved by:

1. Modified result_of_phi to support checking an additional FADD/FMUL.

2. On some CPUs, rather than removing the whole FMA chain, only skipping
a few candidates may generate faster code. Added a new option
-fskip-fma-heuristic to control this behavior.
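
As an illustration of such a chain (again made up, not from the patch's
testsuite), in the loop below the four FMA candidates of each iteration
form one serial chain whose last result feeds the next iteration through
the PHI node of "s"; with the patch applied, -fskip-fma-heuristic=1 would
skip only the first candidate of the chain and generate the rest:

// Compile with -Ofast on a target with scalar FMA
float chain (const float *a, const float *b, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i += 4)
    s = s + a[i] * b[i] + a[i + 1] * b[i + 1]
          + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
  return s;
}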

This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.

Thanks,
Di Zhao


[-- Attachment #2: 0001-Improve-generating-FMA-by-adding-a-widening_mul-pass.patch --]
[-- Type: application/octet-stream, Size: 46911 bytes --]

From c49f70bb180dc00b0fd5a4041443ef46ee0ad663 Mon Sep 17 00:00:00 2001
From: "dzhao.ampere" <di.zhao@amperecomputing.com>
Date: Tue, 21 Mar 2023 18:42:31 +0800
Subject: [PATCH] Improve generating FMA by adding a widening_mul pass

- Removed the workaround in config/aarch64/aarch64.cc that inserts more FMAs,
  since it stops reassociation even when MUL_EXPRs are not present.

- Added a new widening_mul pass to insert FMA before reassoc2.

    - The new pass does nothing other than inserting FMAs.
      Added a pass parameter to control this.

    - It runs execute_reassoc first, to avoid losing opportunities such
      as folding constants and undistributing.

    - Extended the original "deferring" logic so that all conversions to
      FMA can be deferred, because for a very long FMA chain,
      reassociation may be faster.  Added a new parameter
      `op-count-prefer-reassoc` to control this behavior.

- Improved the elimination of slow FMA chains in the original widening_mul by:

  - Changed result_of_phi to support checking an additional FADD/FMUL.

  - Added new parameter `fskip-fma-heuristic` to specify the number of
    FMAs to skip.

- Added tuning for ampere1 to include these parameters.
---
 gcc/common.opt                              |   5 +
 gcc/config/aarch64/aarch64-tuning-flags.def |   4 +
 gcc/config/aarch64/aarch64.cc               |  26 +-
 gcc/doc/invoke.texi                         |  14 +-
 gcc/params.opt                              |   4 +
 gcc/passes.def                              |   3 +-
 gcc/testsuite/gcc.dg/divmod-1-simode.c      |   2 +-
 gcc/testsuite/gcc.dg/divmod-1.c             |   2 +-
 gcc/testsuite/gcc.dg/divmod-2-simode.c      |   2 +-
 gcc/testsuite/gcc.dg/divmod-2.c             |   2 +-
 gcc/testsuite/gcc.dg/divmod-3-simode.c      |   2 +-
 gcc/testsuite/gcc.dg/divmod-3.c             |   2 +-
 gcc/testsuite/gcc.dg/divmod-4-simode.c      |   2 +-
 gcc/testsuite/gcc.dg/divmod-4.c             |   2 +-
 gcc/testsuite/gcc.dg/divmod-5.c             |   2 +-
 gcc/testsuite/gcc.dg/divmod-6-simode.c      |   2 +-
 gcc/testsuite/gcc.dg/divmod-6.c             |   2 +-
 gcc/testsuite/gcc.dg/divmod-7.c             |   2 +-
 gcc/testsuite/gcc.dg/fma-1.c                |   2 +-
 gcc/testsuite/gcc.dg/fma-2.c                |   2 +-
 gcc/testsuite/gcc.dg/fma-3.c                |   2 +-
 gcc/testsuite/gcc.dg/fma-4.c                |   2 +-
 gcc/testsuite/gcc.dg/pr46309-2.c            |   2 +-
 gcc/testsuite/gcc.dg/pr46309.c              |   2 +-
 gcc/testsuite/gcc.dg/pr50717-1.c            |   4 +-
 gcc/testsuite/gcc.dg/pr67089-6.c            |   6 +-
 gcc/testsuite/gcc.dg/pr67089-7.c            |   6 +-
 gcc/testsuite/gcc.dg/pr95853.c              |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr96272.c     |   2 +-
 gcc/testsuite/gcc.dg/wmul-1.c               |   3 +-
 gcc/tree-ssa-math-opts.cc                   | 349 ++++++++++++++++----
 gcc/tree-ssa-reassoc.cc                     |  20 +-
 gcc/tree-ssa-reassoc.h                      |   3 +
 33 files changed, 370 insertions(+), 117 deletions(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index a28ca13385a..48363100930 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1027,6 +1027,11 @@ Align the start of functions.
 falign-functions=
 Common RejectNegative Joined Var(str_align_functions) Optimization
 
+fskip-fma-heuristic=
+Common Joined RejectNegative UInteger Var(flag_skip_fma_heuristic) Optimization Init(-1)
+When avoiding FMA chains that could be slow, heuristically skip the
+specified number of FMA candidates and generate the rest.
+
 flimit-function-alignment
 Common Var(flag_limit_function_alignment) Optimization Init(0)
 
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 52112ba7c48..bc78c4bd595 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -47,6 +47,10 @@ AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
 /* Disallow load-pair instructions to be formed in combine/peephole.  */
 AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
 
+/* Use heuristics on FMA: avoid FMA chains in tight loops by skipping the
+   first candidate.  */
+AARCH64_EXTRA_TUNING_OPTION ("fma_heuristic", FMA_HEURISTIC)
+
 AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
 
 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 29dbacfa917..3a3beecbd01 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1973,7 +1973,8 @@ static const struct tune_params ampere1a_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE |
+   AARCH64_EXTRA_TUNE_FMA_HEURISTIC),	/* tune_flags.  */
   &ampere1_prefetch_tune
 };
 
@@ -3303,21 +3304,15 @@ aarch64_min_divisions_for_recip_mul (machine_mode mode)
 
 /* Return the reassociation width of treeop OPC with mode MODE.  */
 static int
-aarch64_reassociation_width (unsigned opc, machine_mode mode)
+aarch64_reassociation_width (unsigned opc ATTRIBUTE_UNUSED, machine_mode mode)
 {
   if (VECTOR_MODE_P (mode))
     return aarch64_tune_params.vec_reassoc_width;
   if (INTEGRAL_MODE_P (mode))
     return aarch64_tune_params.int_reassoc_width;
-  /* Reassociation reduces the number of FMAs which may result in worse
-     performance.  Use a per-CPU setting for FMA reassociation which allows
-     narrow CPUs with few FP pipes to switch it off (value of 1), and wider
-     CPUs with many FP pipes to enable reassociation.
-     Since the reassociation pass doesn't understand FMA at all, assume
-     that any FP addition might turn into FMA.  */
+  /* TODO: fma_reassoc_width is no longer useful?  */
   if (FLOAT_MODE_P (mode))
-    return opc == PLUS_EXPR ? aarch64_tune_params.fma_reassoc_width
-			    : aarch64_tune_params.fp_reassoc_width;
+    return aarch64_tune_params.fp_reassoc_width;
   return 1;
 }
 
@@ -18002,6 +17997,17 @@ aarch64_override_options_internal (struct gcc_options *opts)
     SET_OPTION_IF_UNSET (opts, &global_options_set,
 			 aarch64_sve_compare_costs, 0);
 
+  if (aarch64_tune_params.extra_tuning_flags
+	& AARCH64_EXTRA_TUNE_FMA_HEURISTIC)
+    {
+      SET_OPTION_IF_UNSET (opts, &global_options_set, param_avoid_fma_max_bits,
+			   512);
+      SET_OPTION_IF_UNSET (opts, &global_options_set, flag_skip_fma_heuristic,
+			   1);
+      SET_OPTION_IF_UNSET (opts, &global_options_set,
+			   param_op_count_prefer_reassoc, 8);
+    }
+
   /* Set up parameters to be used in prefetching algorithm.  Do not
      override the defaults unless we are tuning for a core we have
      researched values for.  */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index b92b8576027..b259c7b0872 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12356,8 +12356,15 @@ This pass is always skipped on architectures that do not have
 instructions to support this.  Enabled by default at @option{-O1} and
 higher on architectures that support this.
 
-@opindex fdce
+@item -fskip-fma-heuristic
+@opindex fskip-fma-heuristic
+When avoiding FMA chains that could be slow, heuristically skip the
+specified number of FMA candidates and generate the rest.  This option
+is only effective when @option{avoid-fma-max-bits} is non-zero.  The
+default value is -1, which means all such FMAs are skipped.
+
 @item -fdce
+@opindex fdce
 Perform dead code elimination (DCE) on RTL@.
 Enabled by default at @option{-O1} and higher.
 
@@ -15834,6 +15841,11 @@ Set the maximum number of instructions executed in parallel in
 reassociated tree. This parameter overrides target dependent
 heuristics used by default if has non zero value.
 
+@item op-count-prefer-reassoc
+Set the minimum number of operators in a list at which reassociation is
+preferred (over inserting as many FMAs as possible).  If 0, always
+insert as many FMAs as possible.
+
 @item sched-pressure-algorithm
 Choose between the two available implementations of
 @option{-fsched-pressure}.  Algorithm 1 is the original implementation
diff --git a/gcc/params.opt b/gcc/params.opt
index 66f1c99036a..ef1463bf30f 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -98,6 +98,10 @@ Average number of iterations of a loop.
 Common Joined UInteger Var(param_avoid_fma_max_bits) IntegerRange(0, 512) Param Optimization
 Maximum number of bits for which we avoid creating FMAs.
 
+-param=op-count-prefer-reassoc=
+Common Joined UInteger Var(param_op_count_prefer_reassoc) Param Optimization
+Set the minimum number of operators in a list at which reassociation is preferred (over inserting as many FMAs as possible).  If 0, always insert as many FMAs as possible.
+
 -param=builtin-expect-probability=
 Common Joined UInteger Var(param_builtin_expect_probability) Init(90) IntegerRange(0, 100) Param Optimization
 Set the estimated probability in percentage for builtin expect. The default value is 90% probability.
diff --git a/gcc/passes.def b/gcc/passes.def
index c9a8f19747b..0714bf28e74 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -330,6 +330,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_lower_switch);
       NEXT_PASS (pass_cse_sincos);
       NEXT_PASS (pass_cse_reciprocals);
+      NEXT_PASS (pass_optimize_widening_mul, true /* early_p */);
       NEXT_PASS (pass_reassoc, false /* early_p */);
       NEXT_PASS (pass_strength_reduction);
       NEXT_PASS (pass_split_paths);
@@ -353,7 +354,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_sink_code, true /* unsplit edges */);
       NEXT_PASS (pass_phiopt, false /* early_p */);
       NEXT_PASS (pass_fold_builtins);
-      NEXT_PASS (pass_optimize_widening_mul);
+      NEXT_PASS (pass_optimize_widening_mul, false /* early_p */);
       NEXT_PASS (pass_store_merging);
       /* If DCE is not run before checking for uninitialized uses,
 	 we may get false warnings (e.g., testsuite/gcc.dg/uninit-5.c).
diff --git a/gcc/testsuite/gcc.dg/divmod-1-simode.c b/gcc/testsuite/gcc.dg/divmod-1-simode.c
index 9e477997bcf..1cb3c97d585 100644
--- a/gcc/testsuite/gcc.dg/divmod-1-simode.c
+++ b/gcc/testsuite/gcc.dg/divmod-1-simode.c
@@ -22,4 +22,4 @@ FOO(SImode, SImode, 1)
 FOO(SImode, USImode, 2)
 FOO(USImode, USImode, 3)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-1.c b/gcc/testsuite/gcc.dg/divmod-1.c
index edcc2a107da..34ba02dd783 100644
--- a/gcc/testsuite/gcc.dg/divmod-1.c
+++ b/gcc/testsuite/gcc.dg/divmod-1.c
@@ -29,4 +29,4 @@ FOO(DImode, DImode, 5)
 FOO(DImode, UDImode, 6)
 FOO(UDImode, UDImode, 7)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-2-simode.c b/gcc/testsuite/gcc.dg/divmod-2-simode.c
index fa28beb3cef..d55d6473ccb 100644
--- a/gcc/testsuite/gcc.dg/divmod-2-simode.c
+++ b/gcc/testsuite/gcc.dg/divmod-2-simode.c
@@ -22,4 +22,4 @@ FOO(SImode, SImode, 1)
 FOO(SImode, USImode, 2)
 FOO(USImode, USImode, 3)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-2.c b/gcc/testsuite/gcc.dg/divmod-2.c
index ded732e121d..2d99d5aea51 100644
--- a/gcc/testsuite/gcc.dg/divmod-2.c
+++ b/gcc/testsuite/gcc.dg/divmod-2.c
@@ -29,4 +29,4 @@ FOO(DImode, DImode, 5)
 FOO(DImode, UDImode, 6)
 FOO(UDImode, UDImode, 7)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-3-simode.c b/gcc/testsuite/gcc.dg/divmod-3-simode.c
index 9dee5bf603b..8aab7d823d6 100644
--- a/gcc/testsuite/gcc.dg/divmod-3-simode.c
+++ b/gcc/testsuite/gcc.dg/divmod-3-simode.c
@@ -20,4 +20,4 @@ FOO(SImode, SImode, 1)
 FOO(SImode, USImode, 2)
 FOO(USImode, USImode, 3)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-3.c b/gcc/testsuite/gcc.dg/divmod-3.c
index 02aa367ac6e..5e6520bb97c 100644
--- a/gcc/testsuite/gcc.dg/divmod-3.c
+++ b/gcc/testsuite/gcc.dg/divmod-3.c
@@ -27,4 +27,4 @@ FOO(DImode, DImode, 5)
 FOO(DImode, UDImode, 6)
 FOO(UDImode, UDImode, 7)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-4-simode.c b/gcc/testsuite/gcc.dg/divmod-4-simode.c
index dbe29cb761d..b8ddc97e093 100644
--- a/gcc/testsuite/gcc.dg/divmod-4-simode.c
+++ b/gcc/testsuite/gcc.dg/divmod-4-simode.c
@@ -20,4 +20,4 @@ FOO(SImode, SImode, 1)
 FOO(SImode, USImode, 2)
 FOO(USImode, USImode, 3)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-4.c b/gcc/testsuite/gcc.dg/divmod-4.c
index 861ecbdec4b..a1bc117a8ef 100644
--- a/gcc/testsuite/gcc.dg/divmod-4.c
+++ b/gcc/testsuite/gcc.dg/divmod-4.c
@@ -27,4 +27,4 @@ FOO(DImode, DImode, 8)
 FOO(DImode, UDImode, 9)
 FOO(UDImode, UDImode, 10)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-5.c b/gcc/testsuite/gcc.dg/divmod-5.c
index 8a8cee50ae2..73b27fcb5ff 100644
--- a/gcc/testsuite/gcc.dg/divmod-5.c
+++ b/gcc/testsuite/gcc.dg/divmod-5.c
@@ -16,4 +16,4 @@ int f(int x, int y)
   return q + r;
 }
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 0 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 0 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-6-simode.c b/gcc/testsuite/gcc.dg/divmod-6-simode.c
index 1107f760b42..4f30da87161 100644
--- a/gcc/testsuite/gcc.dg/divmod-6-simode.c
+++ b/gcc/testsuite/gcc.dg/divmod-6-simode.c
@@ -23,4 +23,4 @@ FOO(SImode, SImode, 1)
 FOO(SImode, USImode, 2)
 FOO(USImode, USImode, 3)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 3 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-6.c b/gcc/testsuite/gcc.dg/divmod-6.c
index 495ebaff805..e0e3b26c34a 100644
--- a/gcc/testsuite/gcc.dg/divmod-6.c
+++ b/gcc/testsuite/gcc.dg/divmod-6.c
@@ -30,4 +30,4 @@ FOO(DImode, DImode, 8)
 FOO(DImode, UDImode, 9)
 FOO(UDImode, UDImode, 10)
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 7 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/divmod-7.c b/gcc/testsuite/gcc.dg/divmod-7.c
index faa90b3ac8f..dd0a5588f2a 100644
--- a/gcc/testsuite/gcc.dg/divmod-7.c
+++ b/gcc/testsuite/gcc.dg/divmod-7.c
@@ -18,4 +18,4 @@ int f(int x, int y)
   return q + r2;
 }
 
-/* { dg-final { scan-tree-dump-times "DIVMOD" 1 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "DIVMOD" 1 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/fma-1.c b/gcc/testsuite/gcc.dg/fma-1.c
index 8c33cb2d64c..8aef2295364 100644
--- a/gcc/testsuite/gcc.dg/fma-1.c
+++ b/gcc/testsuite/gcc.dg/fma-1.c
@@ -12,4 +12,4 @@ f2 (double a, double b, double c)
   return a * b + c;
 }
 
-/* { dg-final { scan-tree-dump-times { = \.FMA \(} 2 "widening_mul" { target scalar_all_fma } } } */
+/* { dg-final { scan-tree-dump-times { = \.FMA \(} 2 "widening_mul1" { target scalar_all_fma } } } */
diff --git a/gcc/testsuite/gcc.dg/fma-2.c b/gcc/testsuite/gcc.dg/fma-2.c
index 41d20a42967..2363e49e4d2 100644
--- a/gcc/testsuite/gcc.dg/fma-2.c
+++ b/gcc/testsuite/gcc.dg/fma-2.c
@@ -12,4 +12,4 @@ f2 (double a, double b, double c)
   return a * b - c;
 }
 
-/* { dg-final { scan-tree-dump-times { = \.FMS \(} 2 "widening_mul" { target scalar_all_fma } } } */
+/* { dg-final { scan-tree-dump-times { = \.FMS \(} 2 "widening_mul1" { target scalar_all_fma } } } */
diff --git a/gcc/testsuite/gcc.dg/fma-3.c b/gcc/testsuite/gcc.dg/fma-3.c
index 699aa2c9530..3366d217cc6 100644
--- a/gcc/testsuite/gcc.dg/fma-3.c
+++ b/gcc/testsuite/gcc.dg/fma-3.c
@@ -12,4 +12,4 @@ f2 (double a, double b, double c)
   return c - a * b;
 }
 
-/* { dg-final { scan-tree-dump-times { = \.FNMA \(} 2 "widening_mul" { target scalar_all_fma } } } */
+/* { dg-final { scan-tree-dump-times { = \.FNMA \(} 2 "widening_mul1" { target scalar_all_fma } } } */
diff --git a/gcc/testsuite/gcc.dg/fma-4.c b/gcc/testsuite/gcc.dg/fma-4.c
index bff928f1fac..e486ff7b408 100644
--- a/gcc/testsuite/gcc.dg/fma-4.c
+++ b/gcc/testsuite/gcc.dg/fma-4.c
@@ -12,4 +12,4 @@ f2 (double a, double b, double c)
   return -(a * b) - c;
 }
 
-/* { dg-final { scan-tree-dump-times { = \.FNMS \(} 2 "widening_mul" { target scalar_all_fma } } } */
+/* { dg-final { scan-tree-dump-times { = \.FNMS \(} 2 "widening_mul1" { target scalar_all_fma } } } */
diff --git a/gcc/testsuite/gcc.dg/pr46309-2.c b/gcc/testsuite/gcc.dg/pr46309-2.c
index 2903fff225a..64f99d45a9a 100644
--- a/gcc/testsuite/gcc.dg/pr46309-2.c
+++ b/gcc/testsuite/gcc.dg/pr46309-2.c
@@ -1,6 +1,6 @@
 /* PR tree-optimization/46309 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fno-ipa-icf -fno-jump-tables -fno-bit-tests -fdump-tree-reassoc-details" } */
+/* { dg-options "-O2 -fno-ipa-icf -fno-jump-tables -fdisable-tree-widening_mul1 -fno-bit-tests -fdump-tree-reassoc-details" } */
 
 int foo (void);
 
diff --git a/gcc/testsuite/gcc.dg/pr46309.c b/gcc/testsuite/gcc.dg/pr46309.c
index 615d6574ef9..04e408d281c 100644
--- a/gcc/testsuite/gcc.dg/pr46309.c
+++ b/gcc/testsuite/gcc.dg/pr46309.c
@@ -1,6 +1,6 @@
 /* PR tree-optimization/46309 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc-details --param logical-op-non-short-circuit=1" } */
+/* { dg-options "-O2 -fdisable-tree-widening_mul1 -fdump-tree-reassoc-details --param logical-op-non-short-circuit=1" } */
 
 int
 f1 (int a)
diff --git a/gcc/testsuite/gcc.dg/pr50717-1.c b/gcc/testsuite/gcc.dg/pr50717-1.c
index de691a8aece..ffaec1c16a9 100644
--- a/gcc/testsuite/gcc.dg/pr50717-1.c
+++ b/gcc/testsuite/gcc.dg/pr50717-1.c
@@ -2,7 +2,7 @@
 /* Ensure that widening multiply-and-accumulate is not used where integer
    type promotion or users' casts should prevent it.  */
 
-/* { dg-options "-O2 -fdump-tree-widening_mul" } */
+/* { dg-options "-O2 -fdisable-tree-widening_mul1 -fdump-tree-widening_mul" } */
 
 long long
 f (unsigned int a, char b, long long c)
@@ -22,4 +22,4 @@ h (char a, char b, int c)
   return (char)(a * b) + c;
 }
 
-/* { dg-final { scan-tree-dump-times "WIDEN_MULT_PLUS_EXPR" 0 "widening_mul" } } */
+/* { dg-final { scan-tree-dump-times "WIDEN_MULT_PLUS_EXPR" 0 "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/pr67089-6.c b/gcc/testsuite/gcc.dg/pr67089-6.c
index b59d75b2318..fe26fbab9d0 100644
--- a/gcc/testsuite/gcc.dg/pr67089-6.c
+++ b/gcc/testsuite/gcc.dg/pr67089-6.c
@@ -56,6 +56,6 @@ T (24, unsigned long long, x + y, if (d || y > r) foo (0))
 T (25, unsigned short, 2U - x, if (r > 2U) foo (0))
 T (26, unsigned char, 2U - x, if (r <= 2U) foo (0))
 
-/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 16 "widening_mul" { target { i?86-*-* x86_64-*-* } } } } */
-/* { dg-final { scan-tree-dump-times "SUB_OVERFLOW" 11 "widening_mul" { target { { i?86-*-* x86_64-*-* } && { ! ia32 } } } } } */
-/* { dg-final { scan-tree-dump-times "SUB_OVERFLOW" 9 "widening_mul" { target { { i?86-*-* x86_64-*-* } && ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 16 "widening_mul2" { target { i?86-*-* x86_64-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "SUB_OVERFLOW" 11 "widening_mul2" { target { { i?86-*-* x86_64-*-* } && { ! ia32 } } } } } */
+/* { dg-final { scan-tree-dump-times "SUB_OVERFLOW" 9 "widening_mul2" { target { { i?86-*-* x86_64-*-* } && ia32 } } } } */
diff --git a/gcc/testsuite/gcc.dg/pr67089-7.c b/gcc/testsuite/gcc.dg/pr67089-7.c
index dc05e3b2263..0180e009647 100644
--- a/gcc/testsuite/gcc.dg/pr67089-7.c
+++ b/gcc/testsuite/gcc.dg/pr67089-7.c
@@ -58,5 +58,7 @@ T (38, unsigned long, x + y, if (d || r <= y) foo (0))
 T (39, unsigned char, x + y, if (d || y < r) foo (0))
 T (40, unsigned long long, x + y, if (d || y >= r) foo (0))
 
-/* { dg-final { scan-tree-dump-not "ADD_OVERFLOW" "widening_mul" } } */
-/* { dg-final { scan-tree-dump-not "SUB_OVERFLOW" "widening_mul" } } */
+/* { dg-final { scan-tree-dump-not "ADD_OVERFLOW" "widening_mul1" } } */
+/* { dg-final { scan-tree-dump-not "SUB_OVERFLOW" "widening_mul1" } } */
+/* { dg-final { scan-tree-dump-not "ADD_OVERFLOW" "widening_mul2" } } */
+/* { dg-final { scan-tree-dump-not "SUB_OVERFLOW" "widening_mul2" } } */
diff --git a/gcc/testsuite/gcc.dg/pr95853.c b/gcc/testsuite/gcc.dg/pr95853.c
index fdd3c30c456..0a7af34586c 100644
--- a/gcc/testsuite/gcc.dg/pr95853.c
+++ b/gcc/testsuite/gcc.dg/pr95853.c
@@ -56,4 +56,4 @@ garple (T x, T y)
   return (struct S) { z > ~(T) 0, w };
 }
 
-/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 6 "widening_mul" { target { i?86-*-* x86_64-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 6 "widening_mul2" { target { i?86-*-* x86_64-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr96272.c b/gcc/testsuite/gcc.dg/tree-ssa/pr96272.c
index 4c9fa6360d3..a59d280bb89 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr96272.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr96272.c
@@ -34,4 +34,4 @@ qux (unsigned a, unsigned b)
   return a + b;
 }
 
-/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 4 "widening_mul" { target { i?86-*-* x86_64-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 4 "widening_mul2" { target { i?86-*-* x86_64-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/wmul-1.c b/gcc/testsuite/gcc.dg/wmul-1.c
index 468a58db946..ae253c8bd8d 100644
--- a/gcc/testsuite/gcc.dg/wmul-1.c
+++ b/gcc/testsuite/gcc.dg/wmul-1.c
@@ -15,4 +15,5 @@ foo (ArrT Arr, int Idx)
   Arr[Idx + 10][Idx] = 2;
 }
 
-/* { dg-final { scan-tree-dump-not "WIDEN_MULT_PLUS_EXPR" "widening_mul" } } */
+/* { dg-final { scan-tree-dump-not "WIDEN_MULT_PLUS_EXPR" "widening_mul2" } } */
+/* { dg-final { scan-tree-dump-not "WIDEN_MULT_PLUS_EXPR" "widening_mul1" } } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index b58a2ac9e6a..beb6648e53e 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "targhooks.h"
 #include "domwalk.h"
+#include "tree-ssa-reassoc.h"
 #include "tree-ssa-math-opts.h"
 
 /* This structure represents one basic block that either computes a
@@ -2688,9 +2689,13 @@ is_copysign_call_with_1 (gimple *call)
 /* Try to expand the pattern x * copysign (1, y) into xorsign (x, y).
    This only happens when the xorsign optab is defined, if the
    pattern is not a xorsign pattern or if expansion fails FALSE is
-   returned, otherwise TRUE is returned.  */
+   returned, otherwise TRUE is returned.
+
+   If CHECK_ONLY_P, only return the check result and leave the statement
+   unchanged.  */
 static bool
-convert_expand_mult_copysign (gimple *stmt, gimple_stmt_iterator *gsi)
+convert_expand_mult_copysign (gimple *stmt, gimple_stmt_iterator *gsi,
+			      bool check_only_p)
 {
   tree treeop0, treeop1, lhs, type;
   location_t loc = gimple_location (stmt);
@@ -2717,14 +2722,17 @@ convert_expand_mult_copysign (gimple *stmt, gimple_stmt_iterator *gsi)
 	if (optab_handler (xorsign_optab, mode) == CODE_FOR_nothing)
 	  return false;
 
-	gcall *c = as_a<gcall*> (call0);
-	treeop0 = gimple_call_arg (c, 1);
-
-	gcall *call_stmt
-	  = gimple_build_call_internal (IFN_XORSIGN, 2, treeop1, treeop0);
-	gimple_set_lhs (call_stmt, lhs);
-	gimple_set_location (call_stmt, loc);
-	gsi_replace (gsi, call_stmt, true);
+	if (!check_only_p)
+	  {
+	    gcall *c = as_a<gcall *> (call0);
+	    treeop0 = gimple_call_arg (c, 1);
+
+	    gcall *call_stmt
+	      = gimple_build_call_internal (IFN_XORSIGN, 2, treeop1, treeop0);
+	    gimple_set_lhs (call_stmt, lhs);
+	    gimple_set_location (call_stmt, loc);
+	    gsi_replace (gsi, call_stmt, true);
+	  }
 	return true;
     }
 
@@ -2733,10 +2741,14 @@ convert_expand_mult_copysign (gimple *stmt, gimple_stmt_iterator *gsi)
 
 /* Process a single gimple statement STMT, which has a MULT_EXPR as
    its rhs, and try to convert it into a WIDEN_MULT_EXPR.  The return
-   value is true iff we converted the statement.  */
+   value is true iff we converted the statement.
+
+   If CHECK_ONLY_P, only return the check result and leave the statement
+   unchanged.  */
 
 static bool
-convert_mult_to_widen (gimple *stmt, gimple_stmt_iterator *gsi)
+convert_mult_to_widen (gimple *stmt, gimple_stmt_iterator *gsi,
+		       bool check_only_p)
 {
   tree lhs, rhs1, rhs2, type, type1, type2;
   enum insn_code handler;
@@ -2817,6 +2829,10 @@ convert_mult_to_widen (gimple *stmt, gimple_stmt_iterator *gsi)
   actual_precision = GET_MODE_PRECISION (actual_mode);
   if (2 * actual_precision > TYPE_PRECISION (type))
     return false;
+
+  if (check_only_p)
+    return true;
+
   if (actual_precision != TYPE_PRECISION (type1)
       || from_unsigned1 != TYPE_UNSIGNED (type1))
     rhs1 = build_and_insert_cast (gsi, loc,
@@ -3062,11 +3078,47 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
   return true;
 }
 
+/* Number of FMA candidates to skip, to avoid generating an FMA chain whose
+   last result is fed back through a PHI node.  When SKIP_FMA_HEURISTIC == -1
+   (the default), all candidates are skipped.  */
+static int skip_fma_heuristic = -1;
+
+/* If reassociation is preferred for candidates forming a long FMA chain, we
+   need to defer all FMA conversions in widening_mul1.  */
+static bool defer_all_fma_p = false;
+
+/* Return true if STMT is a call to one of the internal FMA functions
+   (including the conditional and negated variants).  */
+
+static bool
+is_internal_fma_fn_p (gimple *stmt)
+{
+  if (!(is_gimple_call (stmt) && gimple_call_internal_p (stmt)))
+    return false;
+  internal_fn ifn = gimple_call_internal_fn (stmt);
+  switch (ifn)
+    {
+    case IFN_FMA:
+    case IFN_FMS:
+    case IFN_FNMA:
+    case IFN_FNMS:
+    case IFN_COND_FMA:
+    case IFN_COND_FMS:
+    case IFN_COND_FNMA:
+    case IFN_COND_FNMS:
+      return true;
+
+    default:
+      return false;
+    }
+}
+
 /* Given a result MUL_RESULT which is a result of a multiplication of OP1 and
    OP2 and which we know is used in statements that can be, together with the
-   multiplication, converted to FMAs, perform the transformation.  */
+   multiplication, converted to FMAs, perform the transformation.  FALSE is
+   returned if the conversion fails, otherwise TRUE is returned.  */
 
-static void
+static bool
 convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 {
   tree type = TREE_TYPE (mul_result);
@@ -3091,6 +3140,18 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	  use_operand_p use_p;
 	  gimple *neguse_stmt;
 	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
+	  /* For a case like: a = - (b * c) - (d * e); the neguse_stmt is
+	     possibly already transformed into an FMA.  For now just skip
+	     the current transformation.  */
+	  if (is_internal_fma_fn_p (neguse_stmt))
+	    {
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		{
+		  fprintf (dump_file, "Not transformed due to use: ");
+		  print_gimple_stmt (dump_file, neguse_stmt, 0);
+		}
+	      return false;
+	    }
 	  gsi_remove (&gsi, true);
 	  release_defs (use_stmt);
 
@@ -3098,6 +3159,15 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	  gsi = gsi_for_stmt (use_stmt);
 	  negate_p = true;
 	}
+      if (is_internal_fma_fn_p (use_stmt))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file, "Not transformed due to use: ");
+	      print_gimple_stmt (dump_file, use_stmt, 0);
+	    }
+	  return false;
+	}
 
       tree cond, else_value, ops[3];
       tree_code code;
@@ -3179,6 +3249,7 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 
       widen_mul_stats.fmas_inserted++;
     }
+  return true;
 }
 
 /* Data necessary to perform the actual transformation from a multiplication
@@ -3205,7 +3276,9 @@ public:
 
   fma_deferring_state (bool perform_deferring)
     : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
-      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+      m_initial_phi2 (NULL), m_last_result (NULL_TREE),
+      m_deferring_p (perform_deferring)
+  {}
 
   /* List of FMA candidates for which we the transformation has been determined
      possible but we at this point in BB analysis we do not consider them
@@ -3219,6 +3292,7 @@ public:
   /* The PHI that supposedly feeds back result of a FMA to another over loop
      boundary.  */
   gphi *m_initial_phi;
+  gphi *m_initial_phi2;
 
   /* Result of the last produced FMA candidate or NULL if there has not been
      one.  */
@@ -3227,42 +3301,97 @@ public:
   /* If true, deferring might still be profitable.  If false, transform all
      candidates and no longer defer.  */
   bool m_deferring_p;
+
+  hash_set<gimple *> use_stmt_set;
 };
 
-/* Transform all deferred FMA candidates and mark STATE as no longer
-   deferring.  */
+/* Transform deferred FMA candidates and mark STATE as no longer deferring.
+   Skip the first SKIP_N candidates.  */
 
 static void
-cancel_fma_deferring (fma_deferring_state *state)
+cancel_fma_deferring (fma_deferring_state *state, unsigned skip_n = 0)
 {
   if (!state->m_deferring_p)
     return;
 
-  for (unsigned i = 0; i < state->m_candidates.length (); i++)
+  if (defer_all_fma_p)
+    {
+      int left = state->m_candidates.length () - skip_n;
+      gcc_checking_assert (param_op_count_prefer_reassoc);
+      if (left >= param_op_count_prefer_reassoc)
+	return;
+    }
+  for (unsigned i = skip_n; i < state->m_candidates.length (); i++)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file, "Generating deferred FMA\n");
 
       const fma_transformation_info &fti = state->m_candidates[i];
-      convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2);
-
-      gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt);
-      gsi_remove (&gsi, true);
-      release_defs (fti.mul_stmt);
+      if (convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2))
+	{
+	  gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt);
+	  gsi_remove (&gsi, true);
+	  release_defs (fti.mul_stmt);
+	}
+    }
+  if (defer_all_fma_p)
+    {
+      state->m_candidates.truncate (0);
+      state->m_mul_result_set.empty ();
+      state->use_stmt_set.empty ();
+      state->m_last_result = NULL;
+      state->m_initial_phi = NULL;
+      state->m_initial_phi2 = NULL;
     }
-  state->m_deferring_p = false;
+  else
+    state->m_deferring_p = false;
 }
 
 /* If OP is an SSA name defined by a PHI node, return the PHI statement.
+   If OP is defined by a PLUS_EXPR or MULT_EXPR and at least one of its
+   operands is an SSA_NAME defined by a PHI node, return that PHI node,
+   and set RESULT_PHI2 if there are two such PHI nodes.
+
    Otherwise return NULL.  */
 
 static gphi *
-result_of_phi (tree op)
+result_of_phi (tree op, gphi **result_phi2)
 {
   if (TREE_CODE (op) != SSA_NAME)
     return NULL;
 
-  return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));
+  gimple *def_stmt = SSA_NAME_DEF_STMT (op);
+  if (gimple_code (def_stmt) == GIMPLE_PHI)
+    return dyn_cast<gphi *> (def_stmt);
+
+  /* Consider an additional FMUL/FADD also in the chain.  */
+  else if (gimple_code (def_stmt) == GIMPLE_ASSIGN)
+    {
+      if (gimple_assign_single_p (def_stmt)
+	  || gimple_assign_unary_nop_p (def_stmt))
+	;
+      else if (gimple_assign_rhs_code (def_stmt) == PLUS_EXPR
+	       || gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+	{
+	  gphi *phi1 = NULL, *phi2 = NULL;
+	  tree rhs1 = gimple_assign_rhs1 (def_stmt);
+	  tree rhs2 = gimple_assign_rhs2 (def_stmt);
+	  if (TREE_CODE (rhs1) == SSA_NAME
+	      && gimple_code (SSA_NAME_DEF_STMT (rhs1)) == GIMPLE_PHI)
+	    phi1 = dyn_cast<gphi *> (SSA_NAME_DEF_STMT (rhs1));
+	  if (TREE_CODE (rhs2) == SSA_NAME
+	      && gimple_code (SSA_NAME_DEF_STMT (rhs2)) == GIMPLE_PHI)
+	    phi2 = dyn_cast<gphi *> (SSA_NAME_DEF_STMT (rhs2));
+	  if (phi1)
+	    {
+	      *result_phi2 = phi2;
+	      return phi1;
+	    }
+	  else
+	    return phi2;
+	}
+    }
+  return NULL;
 }
 
 /* After processing statements of a BB and recording STATE, return true if the
@@ -3282,6 +3411,15 @@ last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
 	  || last_result_set->contains (t))
 	return true;
     }
+  if (state->m_initial_phi2)
+    {
+      FOR_EACH_PHI_ARG (use, state->m_initial_phi2, iter, SSA_OP_USE)
+	{
+	  tree t = USE_FROM_PTR (use);
+	  if (t == state->m_last_result || last_result_set->contains (t))
+	    return true;
+	}
+    }
 
   return false;
 }
@@ -3307,7 +3445,8 @@ last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
 
 static bool
 convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
-		     fma_deferring_state *state, tree mul_cond = NULL_TREE)
+		     fma_deferring_state *state,
+		     hash_set<tree> *last_result_set, tree mul_cond = NULL_TREE)
 {
   tree mul_result = gimple_get_lhs (mul_stmt);
   /* If there isn't a LHS then this can't be an FMA.  There can be no LHS
@@ -3340,10 +3479,10 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
   if (has_zero_uses (mul_result))
     return false;
 
-  bool check_defer
-    = (state->m_deferring_p
-       && maybe_le (tree_to_poly_int64 (TYPE_SIZE (type)),
-		    param_avoid_fma_max_bits));
+  bool check_defer = defer_all_fma_p
+		     || (state->m_deferring_p
+			 && maybe_le (tree_to_poly_int64 (TYPE_SIZE (type)),
+				      param_avoid_fma_max_bits));
   bool defer = check_defer;
   bool seen_negate_p = false;
 
@@ -3360,6 +3499,8 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
       && !has_single_use (mul_result))
     return false;
 
+  bool cancel_prev_chain = false;
+  gphi *phi = NULL, *phi2 = NULL;
   /* Make sure that the multiplication statement becomes dead after
      the transformation, thus that all uses are transformed to FMAs.
      This means we assume that an FMA operation has the same cost
@@ -3488,27 +3629,31 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
 		  || ops[0] == state->m_last_result)
 		defer = true;
 	      else
-		defer = false;
+		{
+		  if (defer_all_fma_p)
+		    {
+		      if (!(state->m_initial_phi
+			    && last_fma_candidate_feeds_initial_phi (
+			      state, last_result_set)))
+			cancel_prev_chain = true;
+		    }
+		  defer = defer_all_fma_p;
+		}
 	    }
-	  else
+	  if (!state->m_last_result || cancel_prev_chain)
 	    {
-	      gcc_checking_assert (!state->m_initial_phi);
-	      gphi *phi;
 	      if (ops[0] == result)
-		phi = result_of_phi (ops[1]);
+		phi = result_of_phi (ops[1], &phi2);
 	      else
 		{
 		  gcc_assert (ops[1] == result);
-		  phi = result_of_phi (ops[0]);
+		  phi = result_of_phi (ops[0], &phi2);
 		}
 
 	      if (phi)
-		{
-		  state->m_initial_phi = phi;
-		  defer = true;
-		}
+		defer = true;
 	      else
-		defer = false;
+		defer = defer_all_fma_p;
 	    }
 
 	  state->m_last_result = use_lhs;
@@ -3517,6 +3662,12 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
       else
 	defer = false;
 
+      /* There's a chance two FMA candidates share the same use statement
+	 because of negate_expr.  For now just give up on the second one.  */
+      if (state->use_stmt_set.contains (use_stmt))
+	return false;
+      state->use_stmt_set.add (use_stmt);
+
       /* While it is possible to validate whether or not the exact form that
 	 we've recognized is available in the backend, the assumption is that
 	 if the deferring logic above did not trigger, the transformation is
@@ -3529,11 +3680,18 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
 
   if (defer)
     {
+      if (cancel_prev_chain)
+	cancel_fma_deferring (state);
       fma_transformation_info fti;
       fti.mul_stmt = mul_stmt;
       fti.mul_result = mul_result;
       fti.op1 = op1;
       fti.op2 = op2;
+      if (phi)
+	{
+	  state->m_initial_phi = phi;
+	  state->m_initial_phi2 = phi2;
+	}
       state->m_candidates.safe_push (fti);
       state->m_mul_result_set.add (mul_result);
 
@@ -3550,8 +3708,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
     {
       if (state->m_deferring_p)
 	cancel_fma_deferring (state);
-      convert_mult_to_fma_1 (mul_result, op1, op2);
-      return true;
+      return convert_mult_to_fma_1 (mul_result, op1, op2);
     }
 }
 
@@ -4968,10 +5125,17 @@ class pass_optimize_widening_mul : public gimple_opt_pass
 {
 public:
   pass_optimize_widening_mul (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_optimize_widening_mul, ctxt)
+    : gimple_opt_pass (pass_data_optimize_widening_mul, ctxt),
+      only_fma_p (false)
   {}
 
   /* opt_pass methods: */
+  opt_pass *clone () { return new pass_optimize_widening_mul (m_ctxt); }
+  void set_pass_param (unsigned int n, bool param)
+    {
+      gcc_assert (n == 0);
+      only_fma_p = param;
+    }
   bool gate (function *) final override
     {
       return flag_expensive_optimizations && optimize;
@@ -4979,6 +5143,9 @@ public:
 
   unsigned int execute (function *) final override;
 
+private:
+  /* Only generate FMAs, do nothing else.  */
+  bool only_fma_p;
 }; // class pass_optimize_widening_mul
 
 /* Walker class to perform the transformation in reverse dominance order. */
@@ -4989,9 +5156,10 @@ public:
   /* Constructor, CFG_CHANGED is a pointer to a boolean flag that will be set
      if walking modidifes the CFG.  */
 
-  math_opts_dom_walker (bool *cfg_changed_p)
+  math_opts_dom_walker (bool *cfg_changed_p, bool only_fma_p)
     : dom_walker (CDI_DOMINATORS), m_last_result_set (),
-      m_cfg_changed_p (cfg_changed_p) {}
+      m_cfg_changed_p (cfg_changed_p), m_only_fma_p (only_fma_p)
+  {}
 
   /* The actual actions performed in the walk.  */
 
@@ -5004,14 +5172,17 @@ public:
   /* Pointer to a flag of the user that needs to be set if CFG has been
      modified.  */
   bool *m_cfg_changed_p;
+
+  /* Whether we're only inserting FMAs and do nothing else.  */
+  bool m_only_fma_p;
 };
 
 void
 math_opts_dom_walker::after_dom_children (basic_block bb)
 {
   gimple_stmt_iterator gsi;
-
-  fma_deferring_state fma_state (param_avoid_fma_max_bits > 0);
+  fma_deferring_state fma_state (param_avoid_fma_max_bits > 0
+				 || defer_all_fma_p);
 
   for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
     {
@@ -5024,37 +5195,42 @@ math_opts_dom_walker::after_dom_children (basic_block bb)
 	  switch (code)
 	    {
 	    case MULT_EXPR:
-	      if (!convert_mult_to_widen (stmt, &gsi)
-		  && !convert_expand_mult_copysign (stmt, &gsi)
+	      if (!convert_mult_to_widen (stmt, &gsi, m_only_fma_p)
+		  && !convert_expand_mult_copysign (stmt, &gsi, m_only_fma_p)
 		  && convert_mult_to_fma (stmt,
 					  gimple_assign_rhs1 (stmt),
 					  gimple_assign_rhs2 (stmt),
-					  &fma_state))
+					  &fma_state, &m_last_result_set))
 		{
 		  gsi_remove (&gsi, true);
 		  release_defs (stmt);
 		  continue;
 		}
-	      match_arith_overflow (&gsi, stmt, code, m_cfg_changed_p);
+	      if (!m_only_fma_p)
+		match_arith_overflow (&gsi, stmt, code, m_cfg_changed_p);
 	      break;
 
 	    case PLUS_EXPR:
 	    case MINUS_EXPR:
-	      if (!convert_plusminus_to_widen (&gsi, stmt, code))
+	      if (!m_only_fma_p
+		  && !convert_plusminus_to_widen (&gsi, stmt, code))
 		match_arith_overflow (&gsi, stmt, code, m_cfg_changed_p);
 	      break;
 
 	    case BIT_NOT_EXPR:
-	      if (match_arith_overflow (&gsi, stmt, code, m_cfg_changed_p))
+	      if (!m_only_fma_p
+		  && match_arith_overflow (&gsi, stmt, code, m_cfg_changed_p))
 		continue;
 	      break;
 
 	    case TRUNC_MOD_EXPR:
-	      convert_to_divmod (as_a<gassign *> (stmt));
+	      if (!m_only_fma_p)
+		convert_to_divmod (as_a<gassign *> (stmt));
 	      break;
 
 	    case RSHIFT_EXPR:
-	      convert_mult_to_highpart (as_a<gassign *> (stmt), &gsi);
+	      if (!m_only_fma_p)
+		convert_mult_to_highpart (as_a<gassign *> (stmt), &gsi);
 	      break;
 
 	    default:;
@@ -5072,7 +5248,7 @@ math_opts_dom_walker::after_dom_children (basic_block bb)
 		  && convert_mult_to_fma (stmt,
 					  gimple_call_arg (stmt, 0),
 					  gimple_call_arg (stmt, 0),
-					  &fma_state))
+					  &fma_state, &m_last_result_set))
 		{
 		  unlink_stmt_vdef (stmt);
 		  if (gsi_remove (&gsi, true)
@@ -5087,7 +5263,7 @@ math_opts_dom_walker::after_dom_children (basic_block bb)
 	      if (convert_mult_to_fma (stmt,
 				       gimple_call_arg (stmt, 1),
 				       gimple_call_arg (stmt, 2),
-				       &fma_state,
+				       &fma_state, &m_last_result_set,
 				       gimple_call_arg (stmt, 0)))
 
 		{
@@ -5105,23 +5281,35 @@ math_opts_dom_walker::after_dom_children (basic_block bb)
 	      break;
 	    }
 	}
-      else if (gimple_code (stmt) == GIMPLE_COND)
+      else if (!m_only_fma_p && gimple_code (stmt) == GIMPLE_COND)
 	optimize_spaceship (as_a <gcond *> (stmt));
       gsi_next (&gsi);
     }
-  if (fma_state.m_deferring_p
-      && fma_state.m_initial_phi)
+
+  if (fma_state.m_deferring_p && fma_state.m_last_result)
     {
-      gcc_checking_assert (fma_state.m_last_result);
-      if (!last_fma_candidate_feeds_initial_phi (&fma_state,
-						 &m_last_result_set))
-	cancel_fma_deferring (&fma_state);
-      else
-	m_last_result_set.add (fma_state.m_last_result);
+      bool cancel = true;
+      unsigned skip = 0;
+      if (fma_state.m_initial_phi
+	  && last_fma_candidate_feeds_initial_phi (&fma_state,
+						   &m_last_result_set))
+	{
+	  cancel = false;
+	  if (skip_fma_heuristic >= 0)
+	    {
+	      /* We have an FMA chain whose last result is fed back into the
+		 first one's input, but it is heuristically still beneficial
+		 to generate some FMAs.  */
+	      cancel = true;
+	      skip = skip_fma_heuristic;
+	    }
+	  m_last_result_set.add (fma_state.m_last_result);
+	}
+      if (cancel)
+	cancel_fma_deferring (&fma_state, skip);
     }
 }
 
-
 unsigned int
 pass_optimize_widening_mul::execute (function *fun)
 {
@@ -5131,7 +5319,26 @@ pass_optimize_widening_mul::execute (function *fun)
   calculate_dominance_info (CDI_DOMINATORS);
   renumber_gimple_stmt_uids (cfun);
 
-  math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+  /* Opportunities for reassociation and constant folding can be lost due
+     to the conversions to FMA, so run execute_reassoc first.  */
+  if (only_fma_p && flag_tree_reassoc)
+    cfg_changed |= execute_reassoc (false, false, false);
+
+  /* Defer all conversions to FMA if reassociation is preferred for long
+     chains.  */
+  defer_all_fma_p = only_fma_p && param_op_count_prefer_reassoc
+		    && flag_tree_reassoc && flag_associative_math;
+
+  /* On ruling out FMA chains that could be slow (when param_avoid_fma_max_bits
+     is set):
+
+     1) In widening_mul1, discard all such FMA chains.
+     2) In widening_mul2, discard the first SKIP_FMA_HEURISTIC FMA candidates,
+     and generate the rest.  */
+  skip_fma_heuristic = only_fma_p ? -1 : flag_skip_fma_heuristic;
+
+  math_opts_dom_walker (&cfg_changed, only_fma_p)
+    .walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
 
   statistics_counter_event (fun, "widening multiplications inserted",
 			    widen_mul_stats.widen_mults_inserted);
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 6956a3dadb5..ed6aee32be0 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -184,6 +184,9 @@ static bool reassoc_insert_powi_p;
    vectorization, since it interferes with reduction chains.  */
 static bool reassoc_bias_loop_carried_phi_ranks_p;
 
+/* Whether this is the last reassoc pass.  */
+static bool reassoc_is_the_last_p;
+
 /* Statistics */
 static struct
 {
@@ -3559,7 +3562,7 @@ optimize_range_tests_cmp_bitwise (enum tree_code opcode, int first, int length,
 		  reassoc pass, as it interferes with the reassociation
 		  itself or could also with VRP etc. which might not
 		  be able to virtually undo the optimization.  */
-	       && !reassoc_insert_powi_p
+	       && reassoc_is_the_last_p
 	       && !TYPE_UNSIGNED (TREE_TYPE (ranges[i].exp))
 	       && integer_zerop (ranges[i].low))
 	idx = 3;
@@ -6831,7 +6834,7 @@ reassociate_bb (basic_block bb)
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
-		  if (!reassoc_insert_powi_p
+		  if (reassoc_is_the_last_p
 		      && ops.length () > 3
 		      && (width = get_reassociation_width (ops_num, rhs_code,
 							   mode)) > 1)
@@ -7088,15 +7091,19 @@ fini_reassoc (void)
 }
 
 /* Gate and execute functions for Reassociation.  If INSERT_POWI_P, enable
-   insertion of __builtin_powi calls.
+   insertion of __builtin_powi calls.  If IS_THE_LAST_REASSOC_P, this is
+   the last reassoc pass, so the expression tree may be rewritten to
+   parallel form.
 
    Returns TODO_cfg_cleanup if a CFG cleanup pass is desired due to
    optimization of a gimple conditional.  Otherwise returns zero.  */
 
-static unsigned int
-execute_reassoc (bool insert_powi_p, bool bias_loop_carried_phi_ranks_p)
+unsigned int
+execute_reassoc (bool insert_powi_p, bool bias_loop_carried_phi_ranks_p,
+		 bool is_the_last_reassoc_p)
 {
   reassoc_insert_powi_p = insert_powi_p;
+  reassoc_is_the_last_p = is_the_last_reassoc_p;
   reassoc_bias_loop_carried_phi_ranks_p = bias_loop_carried_phi_ranks_p;
 
   init_reassoc ();
@@ -7143,7 +7150,8 @@ public:
   bool gate (function *) final override { return flag_tree_reassoc != 0; }
   unsigned int execute (function *) final override
   {
-    return execute_reassoc (insert_powi_p, bias_loop_carried_phi_ranks_p);
+    return execute_reassoc (insert_powi_p, bias_loop_carried_phi_ranks_p,
+			    !insert_powi_p);
   }
 
  private:
diff --git a/gcc/tree-ssa-reassoc.h b/gcc/tree-ssa-reassoc.h
index cbd7170d165..71df4b16bb4 100644
--- a/gcc/tree-ssa-reassoc.h
+++ b/gcc/tree-ssa-reassoc.h
@@ -44,5 +44,8 @@ void dump_range_entry (FILE *file, struct range_entry *r);
 void debug_range_entry (struct range_entry *r);
 void init_range_entry (struct range_entry *r, tree exp, gimple *stmt);
 bool no_side_effect_bb (basic_block bb);
+unsigned int
+execute_reassoc (bool insert_powi_p, bool bias_loop_carried_phi_ranks_p,
+		 bool is_the_last_reassoc_p);
 
 #endif  /* GCC_SSA_REASSOC_H  */
-- 
2.25.1



* RE: [RFC][PATCH] Improve generating FMA by adding a widening_mul pass
  2023-05-26  7:14 [RFC][PATCH] Improve generating FMA by adding a widening_mul pass Di Zhao OS
@ 2023-05-30  7:52 ` Di Zhao OS
  0 siblings, 0 replies; 2+ messages in thread
From: Di Zhao OS @ 2023-05-30  7:52 UTC (permalink / raw)
  To: gcc-patches

Sorry, I missed the recent updates on trunk regarding FMA handling.
I'll measure again to see whether anything in this patch still helps.

Thanks,
Di Zhao

> -----Original Message-----
> From: Di Zhao OS
> Sent: Friday, May 26, 2023 3:15 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [RFC][PATCH] Improve generating FMA by adding a widening_mul pass
> 
> As GCC's reassociation pass has no knowledge of FMA, it reduces the
> opportunities to generate FMAs when transforming expression lists to
> parallel. Currently there's a workaround on AArch64
> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114), which disables
> the parallel rewriting for floating-point additions. However, this
> approach may cause regressions. For example, in the code below only
> floating-point additions are involved in calculating
> "result += array[j]", and rewriting to parallel is better:
> 
> // Compile with -Ofast on aarch64
> float foo (int n, float in)
> {
>   float array[8] = { 0.1, 1.0, 1.1, 100.0, 10.5, 0.5, 0.01, 9.9 };
>   float result = 0.0;
>   for (int i = 0; i < n; i++)
>     {
>       if (i % 10)
>         for (unsigned j = 0; j < 8; j++)
>           array[j] *= in;
> 
>       for (unsigned j = 0; j < 8; j++)
>         result += array[j];
>     }
>   return result;
> }
> 
> To improve this, one option is to count the number of MUL_EXPRs in an
> operator list before rewriting to parallel, and allow the rewriting
> only when there is none (or at most one MUL_EXPR). This is simple and
> unlikely to introduce regressions. However, it lacks flexibility and
> cannot handle more general cases.
> 
> Here's an attempt to address the issue more generally.
> 
> 1. Added an additional widening_mul pass before the original reassoc2
> pass. The new pass is limited to inserting FMAs only, and leaves other
> operations like convert_mult_to_widen to the old, late widening_mul
> pass, so that other optimizations between the two passes are not
> hindered.
> 
> 2. On some platforms, for a very long FMA chain, rewriting to parallel
> can be faster. Extended the original "deferring" logic so that all
> conversions to FMA can be deferred. Introduced a new parameter
> op-count-prefer-reassoc to control this behavior.
> 
> 3. Additionally, the new widening_mul pass calls execute_reassoc first,
> to avoid losing opportunities such as folding constants and
> undistributing.
> 
> However, changing the order of FMA generation and reassociation may
> expose more FMA chains that are slow (see commit 4a0d0ed2).
> To reduce possible regressions, the handling of slow FMA chains is
> improved by:
> 
> 1. Modified result_of_phi to support checking an additional FADD/FMUL.
> 
> 2. On some CPUs, rather than removing the whole FMA chain, only skipping
> a few candidates may generate faster code. Added a new option
> -fskip-fma-heuristic to control this behavior.
> 
> This patch also solves https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350.
> 
> Thanks,
> Di Zhao


