* [PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc
@ 2023-06-16 8:50 Di Zhao OS
2023-07-07 8:26 ` [PING][PATCH] " Di Zhao OS
0 siblings, 1 reply; 3+ messages in thread
From: Di Zhao OS @ 2023-06-16 8:50 UTC (permalink / raw)
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 1508 bytes --]
This patch is to fix the regressions found in SPEC2017 fprate cases
on aarch64.
1. Reused code in pass widening_mul to check for nested FMA chains
(those connected by MULT_EXPRs), since re-writing to parallel
generates worse codes.
2. Avoid re-arrange to produce less FMA chains that can be slow.
Tested on ampere1 and neoverse-n1, this fixed the regressions in
508.namd_r and 510.parest_r 1 copy run. While I'm still collecting data
on x86 machines we have, I'd like to know what do you think of this.
(Previously I tried to improve things with FMA by adding a widening_mul
pass before reassoc2 for it's easier to recognize different patterns
of FMA chains and decide whether to split them. But I suppose handling
them all in reassoc pass is more efficient.)
Thanks,
Di Zhao
---
gcc/ChangeLog:
* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
Support new mode that merely do the checking.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Add new parameter.
* tree-ssa-math-opts.h (struct fma_transformation_info):
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Add function decl.
* tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
(rank_ops_for_fma): Return -1 if nested FMAs are found.
(reassociate_bb): Avoid rewriting to parallel if nested FMAs are found.
[-- Attachment #2: pr110279-Check-for-nested-FMA-chains-in-reassoc.diff --]
[-- Type: application/octet-stream, Size: 12037 bytes --]
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 9c9ca571bcd..98b7b70e14c 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3067,7 +3067,8 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
multiplication, converted to FMAs, perform the transformation. */
static void
-convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
+convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2,
+ tree *return_lhs = NULL)
{
tree type = TREE_TYPE (mul_result);
gimple *use_stmt;
@@ -3091,8 +3092,11 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
use_operand_p use_p;
gimple *neguse_stmt;
single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
- gsi_remove (&gsi, true);
- release_defs (use_stmt);
+ if (!return_lhs)
+ {
+ gsi_remove (&gsi, true);
+ release_defs (use_stmt);
+ }
use_stmt = neguse_stmt;
gsi = gsi_for_stmt (use_stmt);
@@ -3106,6 +3110,12 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
gcc_unreachable ();
addop = ops[0] == result ? ops[1] : ops[0];
+ if (return_lhs)
+ {
+ *return_lhs = gimple_get_lhs (use_stmt);
+ return;
+ }
+
if (code == MINUS_EXPR)
{
if (ops[0] == result)
@@ -3181,54 +3191,6 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
}
}
-/* Data necessary to perform the actual transformation from a multiplication
- and an addition to an FMA after decision is taken it should be done and to
- then delete the multiplication statement from the function IL. */
-
-struct fma_transformation_info
-{
- gimple *mul_stmt;
- tree mul_result;
- tree op1;
- tree op2;
-};
-
-/* Structure containing the current state of FMA deferring, i.e. whether we are
- deferring, whether to continue deferring, and all data necessary to come
- back and perform all deferred transformations. */
-
-class fma_deferring_state
-{
-public:
- /* Class constructor. Pass true as PERFORM_DEFERRING in order to actually
- do any deferring. */
-
- fma_deferring_state (bool perform_deferring)
- : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
- m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
-
- /* List of FMA candidates for which we the transformation has been determined
- possible but we at this point in BB analysis we do not consider them
- beneficial. */
- auto_vec<fma_transformation_info, 8> m_candidates;
-
- /* Set of results of multiplication that are part of an already deferred FMA
- candidates. */
- hash_set<tree> m_mul_result_set;
-
- /* The PHI that supposedly feeds back result of a FMA to another over loop
- boundary. */
- gphi *m_initial_phi;
-
- /* Result of the last produced FMA candidate or NULL if there has not been
- one. */
- tree m_last_result;
-
- /* If true, deferring might still be profitable. If false, transform all
- candidates and no longer defer. */
- bool m_deferring_p;
-};
-
/* Transform all deferred FMA candidates and mark STATE as no longer
deferring. */
@@ -3303,12 +3265,18 @@ last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
or its unrolled version, i.e. with several FMA candidates that feed result
of one into the addend of another. Instead, we add them to a list in STATE
and if we later discover an FMA candidate that is not part of such a chain,
- we go back and perform all deferred past candidates. */
+ we go back and perform all deferred past candidates.
-static bool
+ If RETURN_LHS is not NULL, instead of the actual conversion, set RETURN_LHS
+ and return true. */
+
+bool
convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
- fma_deferring_state *state, tree mul_cond = NULL_TREE)
+ fma_deferring_state *state, tree mul_cond,
+ tree *return_lhs)
{
+ if (return_lhs)
+ *return_lhs = NULL_TREE;
tree mul_result = gimple_get_lhs (mul_stmt);
/* If there isn't a LHS then this can't be an FMA. There can be no LHS
if the statement was left just for the side-effects. */
@@ -3550,7 +3518,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
{
if (state->m_deferring_p)
cancel_fma_deferring (state);
- convert_mult_to_fma_1 (mul_result, op1, op2);
+ convert_mult_to_fma_1 (mul_result, op1, op2, return_lhs);
return true;
}
}
diff --git a/gcc/tree-ssa-math-opts.h b/gcc/tree-ssa-math-opts.h
index 52b7938b599..252597ba428 100644
--- a/gcc/tree-ssa-math-opts.h
+++ b/gcc/tree-ssa-math-opts.h
@@ -23,4 +23,57 @@ along with GCC; see the file COPYING3. If not see
extern tree powi_as_mults (gimple_stmt_iterator *, location_t,
tree, HOST_WIDE_INT);
+/* Data necessary to perform the actual transformation from a multiplication
+ and an addition to an FMA after decision is taken it should be done and to
+ then delete the multiplication statement from the function IL. */
+
+struct fma_transformation_info
+{
+ gimple *mul_stmt;
+ tree mul_result;
+ tree op1;
+ tree op2;
+};
+
+/* Structure containing the current state of FMA deferring, i.e. whether we are
+ deferring, whether to continue deferring, and all data necessary to come
+ back and perform all deferred transformations. */
+
+class fma_deferring_state
+{
+public:
+ /* Class constructor. Pass true as PERFORM_DEFERRING in order to actually
+ do any deferring. */
+
+ fma_deferring_state (bool perform_deferring)
+ : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
+ m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+
+ /* List of FMA candidates for which we the transformation has been determined
+ possible but we at this point in BB analysis we do not consider them
+ beneficial. */
+ auto_vec<fma_transformation_info, 8> m_candidates;
+
+ /* Set of results of multiplication that are part of an already deferred FMA
+ candidates. */
+ hash_set<tree> m_mul_result_set;
+
+ /* The PHI that supposedly feeds back result of a FMA to another over loop
+ boundary. */
+ gphi *m_initial_phi;
+
+ /* Result of the last produced FMA candidate or NULL if there has not been
+ one. */
+ tree m_last_result;
+
+ /* If true, deferring might still be profitable. If false, transform all
+ candidates and no longer defer. */
+ bool m_deferring_p;
+};
+
+bool
+convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
+ fma_deferring_state *state, tree mul_cond = NULL_TREE,
+ tree *return_lhs = NULL);
+
#endif /* GCC_TREE_SSA_MATH_OPTS_H */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 96c88ec003e..dd78f41ae7b 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5488,7 +5488,7 @@ get_reassociation_width (int ops_num, enum tree_code opc,
as possible. */
static void
rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
- const vec<operand_entry *> &ops)
+ const vec<operand_entry *> &ops)
{
enum tree_code opcode = gimple_assign_rhs_code (stmt);
int op_num = ops.length ();
@@ -5517,7 +5517,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
/* Build parallel dependency chain according to width. */
for (i = 0; i < width; i++)
{
- /* If the chain has FAM, we do not swap two operands. */
+ /* If the chain has FMA, we do not swap two operands. */
if (op_index > 1 && !has_fma)
swap_ops_for_binary_stmt (ops, op_index - 2);
@@ -6678,8 +6678,8 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
_4 = .FMA (c_7(D), d_8(D), _3);
_11 = .FMA (a_5(D), b_6(D), _4); */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static int
+rank_ops_for_fma (vec<operand_entry *> *ops, tree type)
{
operand_entry *oe;
unsigned int i;
@@ -6701,6 +6701,59 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
else
ops_others.safe_push (oe);
}
+
+ /* Check the lhs of MULT_EXPRs, if there's a FMA chain with pattern:
+
+ d = .FMA (a, b, c);
+ g = .FMA (d, e, f);
+ ...
+
+ then return -1, to avoid rewrite_expr_tree_parallel. */
+ fma_deferring_state fma_state (false);
+ FOR_EACH_VEC_ELT (ops_mult, i, oe)
+ {
+ gimple *mul_stmt = SSA_NAME_DEF_STMT (oe->op);
+ tree lhs, lhs2;
+ if (convert_mult_to_fma (mul_stmt, gimple_assign_rhs1 (mul_stmt),
+ gimple_assign_rhs2 (mul_stmt), &fma_state, NULL,
+ &lhs))
+ {
+ /* Check for MULT_EXPR that uses LHS. */
+ imm_use_iterator imm_iter;
+ gimple *use_stmt;
+ use_operand_p use_p;
+ FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
+ {
+ use_stmt = USE_STMT (use_p);
+ if (is_gimple_debug (use_stmt))
+ continue;
+ if (gimple_bb (use_stmt) != gimple_bb (mul_stmt))
+ continue;
+
+ if (is_gimple_assign (use_stmt)
+ && gimple_assign_rhs_code (use_stmt) == MULT_EXPR
+ && convert_mult_to_fma (use_stmt,
+ gimple_assign_rhs1 (use_stmt),
+ gimple_assign_rhs2 (use_stmt),
+ &fma_state, NULL, &lhs2)
+ && gimple_bb (SSA_NAME_DEF_STMT (lhs2))
+ == gimple_bb (mul_stmt))
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file,
+ "Found possible non-flat FMA chain. LHS1: ");
+ print_generic_expr (dump_file, lhs);
+ fprintf (dump_file, ", LHS2: ");
+ print_generic_expr (dump_file, lhs2);
+ fprintf (dump_file, "\n");
+ }
+ return -1;
+ }
+ }
+ }
+ }
+
/* 1. When ops_mult.length == 2, like the following case,
a * b + c * d + e.
@@ -6712,6 +6765,11 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
2. If all ops are defined with mult, we don't need to rearrange them. */
if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
{
+ if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (type)),
+ param_avoid_fma_max_bits))
+ /* Avoid re-arrange to produce less FMA chains that can be slow. */
+ return 1;
+
/* Put no-mult ops and mult ops alternately at the end of the
queue, which is conducive to generating more FMA and reducing the
loss of FMA when breaking the chain. */
@@ -6726,9 +6784,9 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
if (opindex > 0)
opindex--;
}
- return true;
+ return 1;
}
- return false;
+ return 0;
}
/* Reassociate expressions in basic block BB and its post-dominator as
children.
@@ -6894,7 +6952,7 @@ reassociate_bb (basic_block bb)
machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
int ops_num = ops.length ();
int width;
- bool has_fma = false;
+ int has_fma = 0;
/* For binary bit operations, if there are at least 3
operands and the last operand in OPS is a constant,
@@ -6917,7 +6975,7 @@ reassociate_bb (basic_block bb)
opt_type)
&& (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
{
- has_fma = rank_ops_for_fma (&ops);
+ has_fma = rank_ops_for_fma (&ops, TREE_TYPE (lhs));
}
/* Only rewrite the expression tree to parallel in the
@@ -6925,6 +6983,7 @@ reassociate_bb (basic_block bb)
with initial linearization. */
if (!reassoc_insert_powi_p
&& ops.length () > 3
+ && has_fma >= 0
&& (width = get_reassociation_width (ops_num, rhs_code,
mode)) > 1)
{
@@ -6934,7 +6993,7 @@ reassociate_bb (basic_block bb)
width);
rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
width,
- has_fma,
+ has_fma > 0,
ops);
}
else
@@ -6943,7 +7002,7 @@ reassociate_bb (basic_block bb)
to make sure the ones that get the double
binary op are chosen wisely. */
int len = ops.length ();
- if (len >= 3 && !has_fma)
+ if (len >= 3 && has_fma == 0)
swap_ops_for_binary_stmt (ops, len - 3);
new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
--
2.25.1
^ permalink raw reply [flat|nested] 3+ messages in thread
* [PING][PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc
2023-06-16 8:50 [PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc Di Zhao OS
@ 2023-07-07 8:26 ` Di Zhao OS
2023-07-07 14:31 ` Philipp Tomsich
0 siblings, 1 reply; 3+ messages in thread
From: Di Zhao OS @ 2023-07-07 8:26 UTC (permalink / raw)
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 2133 bytes --]
Update the patch so it can apply.
Tested on spec2017 fprate cases again. With option "-funroll-loops -Ofast -flto",
the improvements of 1-copy run are:
Ampere1:
508.namd_r 4.26%
510.parest_r 2.55%
Overall 0.54%
Intel Xeon:
503.bwaves_r 1.3%
508.namd_r 1.58%
overall 0.42%
Thanks,
Di Zhao
> -----Original Message-----
> From: Di Zhao OS
> Sent: Friday, June 16, 2023 4:51 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] tree-optimization/110279- Check for nested FMA chains in
> reassoc
>
> This patch is to fix the regressions found in SPEC2017 fprate cases
> on aarch64.
>
> 1. Reused code in pass widening_mul to check for nested FMA chains
> (those connected by MULT_EXPRs), since re-writing to parallel
> generates worse codes.
>
> 2. Avoid re-arrange to produce less FMA chains that can be slow.
>
> Tested on ampere1 and neoverse-n1, this fixed the regressions in
> 508.namd_r and 510.parest_r 1 copy run. While I'm still collecting data
> on x86 machines we have, I'd like to know what do you think of this.
>
> (Previously I tried to improve things with FMA by adding a widening_mul
> pass before reassoc2 for it's easier to recognize different patterns
> of FMA chains and decide whether to split them. But I suppose handling
> them all in reassoc pass is more efficient.)
>
> Thanks,
> Di Zhao
>
> ---
> gcc/ChangeLog:
>
> * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
> Support new mode that merely do the checking.
> (struct fma_transformation_info): Moved to header.
> (class fma_deferring_state): Moved to header.
> (convert_mult_to_fma): Add new parameter.
> * tree-ssa-math-opts.h (struct fma_transformation_info):
> (class fma_deferring_state): Moved from .cc.
> (convert_mult_to_fma): Add function decl.
> * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
> (rank_ops_for_fma): Return -1 if nested FMAs are found.
> (reassociate_bb): Avoid rewriting to parallel if nested FMAs are
> found.
[-- Attachment #2: 0001-Check-for-nested-FMA-chains-in-reassoc.patch --]
[-- Type: application/octet-stream, Size: 11274 bytes --]
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 68fc518b1ab..19e2bef9307 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3067,7 +3067,8 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
multiplication, converted to FMAs, perform the transformation. */
static void
-convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
+convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2,
+ tree *return_lhs = NULL)
{
tree type = TREE_TYPE (mul_result);
gimple *use_stmt;
@@ -3091,8 +3092,11 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
use_operand_p use_p;
gimple *neguse_stmt;
single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
- gsi_remove (&gsi, true);
- release_defs (use_stmt);
+ if (!return_lhs)
+ {
+ gsi_remove (&gsi, true);
+ release_defs (use_stmt);
+ }
use_stmt = neguse_stmt;
gsi = gsi_for_stmt (use_stmt);
@@ -3106,6 +3110,12 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
gcc_unreachable ();
addop = ops[0] == result ? ops[1] : ops[0];
+ if (return_lhs)
+ {
+ *return_lhs = gimple_get_lhs (use_stmt);
+ return;
+ }
+
if (code == MINUS_EXPR)
{
if (ops[0] == result)
@@ -3181,54 +3191,6 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
}
}
-/* Data necessary to perform the actual transformation from a multiplication
- and an addition to an FMA after decision is taken it should be done and to
- then delete the multiplication statement from the function IL. */
-
-struct fma_transformation_info
-{
- gimple *mul_stmt;
- tree mul_result;
- tree op1;
- tree op2;
-};
-
-/* Structure containing the current state of FMA deferring, i.e. whether we are
- deferring, whether to continue deferring, and all data necessary to come
- back and perform all deferred transformations. */
-
-class fma_deferring_state
-{
-public:
- /* Class constructor. Pass true as PERFORM_DEFERRING in order to actually
- do any deferring. */
-
- fma_deferring_state (bool perform_deferring)
- : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
- m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
-
- /* List of FMA candidates for which we the transformation has been determined
- possible but we at this point in BB analysis we do not consider them
- beneficial. */
- auto_vec<fma_transformation_info, 8> m_candidates;
-
- /* Set of results of multiplication that are part of an already deferred FMA
- candidates. */
- hash_set<tree> m_mul_result_set;
-
- /* The PHI that supposedly feeds back result of a FMA to another over loop
- boundary. */
- gphi *m_initial_phi;
-
- /* Result of the last produced FMA candidate or NULL if there has not been
- one. */
- tree m_last_result;
-
- /* If true, deferring might still be profitable. If false, transform all
- candidates and no longer defer. */
- bool m_deferring_p;
-};
-
/* Transform all deferred FMA candidates and mark STATE as no longer
deferring. */
@@ -3303,12 +3265,18 @@ last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
or its unrolled version, i.e. with several FMA candidates that feed result
of one into the addend of another. Instead, we add them to a list in STATE
and if we later discover an FMA candidate that is not part of such a chain,
- we go back and perform all deferred past candidates. */
+ we go back and perform all deferred past candidates.
-static bool
+ If RETURN_LHS is not NULL, instead of the actual conversion, set RETURN_LHS
+ and return true. */
+
+bool
convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
- fma_deferring_state *state, tree mul_cond = NULL_TREE)
+ fma_deferring_state *state, tree mul_cond,
+ tree *return_lhs)
{
+ if (return_lhs)
+ *return_lhs = NULL_TREE;
tree mul_result = gimple_get_lhs (mul_stmt);
/* If there isn't a LHS then this can't be an FMA. There can be no LHS
if the statement was left just for the side-effects. */
@@ -3550,7 +3518,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
{
if (state->m_deferring_p)
cancel_fma_deferring (state);
- convert_mult_to_fma_1 (mul_result, op1, op2);
+ convert_mult_to_fma_1 (mul_result, op1, op2, return_lhs);
return true;
}
}
diff --git a/gcc/tree-ssa-math-opts.h b/gcc/tree-ssa-math-opts.h
index 52b7938b599..252597ba428 100644
--- a/gcc/tree-ssa-math-opts.h
+++ b/gcc/tree-ssa-math-opts.h
@@ -23,4 +23,57 @@ along with GCC; see the file COPYING3. If not see
extern tree powi_as_mults (gimple_stmt_iterator *, location_t,
tree, HOST_WIDE_INT);
+/* Data necessary to perform the actual transformation from a multiplication
+ and an addition to an FMA after decision is taken it should be done and to
+ then delete the multiplication statement from the function IL. */
+
+struct fma_transformation_info
+{
+ gimple *mul_stmt;
+ tree mul_result;
+ tree op1;
+ tree op2;
+};
+
+/* Structure containing the current state of FMA deferring, i.e. whether we are
+ deferring, whether to continue deferring, and all data necessary to come
+ back and perform all deferred transformations. */
+
+class fma_deferring_state
+{
+public:
+ /* Class constructor. Pass true as PERFORM_DEFERRING in order to actually
+ do any deferring. */
+
+ fma_deferring_state (bool perform_deferring)
+ : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
+ m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+
+ /* List of FMA candidates for which we the transformation has been determined
+ possible but we at this point in BB analysis we do not consider them
+ beneficial. */
+ auto_vec<fma_transformation_info, 8> m_candidates;
+
+ /* Set of results of multiplication that are part of an already deferred FMA
+ candidates. */
+ hash_set<tree> m_mul_result_set;
+
+ /* The PHI that supposedly feeds back result of a FMA to another over loop
+ boundary. */
+ gphi *m_initial_phi;
+
+ /* Result of the last produced FMA candidate or NULL if there has not been
+ one. */
+ tree m_last_result;
+
+ /* If true, deferring might still be profitable. If false, transform all
+ candidates and no longer defer. */
+ bool m_deferring_p;
+};
+
+bool
+convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
+ fma_deferring_state *state, tree mul_cond = NULL_TREE,
+ tree *return_lhs = NULL);
+
#endif /* GCC_TREE_SSA_MATH_OPTS_H */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..73f5c93595c 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -6781,8 +6781,8 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
_4 = .FMA (c_7(D), d_8(D), _3);
_11 = .FMA (a_5(D), b_6(D), _4); */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static int
+rank_ops_for_fma (vec<operand_entry *> *ops, tree type)
{
operand_entry *oe;
unsigned int i;
@@ -6804,6 +6804,59 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
else
ops_others.safe_push (oe);
}
+
+ /* Check the lhs of MULT_EXPRs, if there's a FMA chain with pattern:
+
+ d = .FMA (a, b, c);
+ g = .FMA (d, e, f);
+ ...
+
+ then return -1, to avoid rewrite_expr_tree_parallel. */
+ fma_deferring_state fma_state (false);
+ FOR_EACH_VEC_ELT (ops_mult, i, oe)
+ {
+ gimple *mul_stmt = SSA_NAME_DEF_STMT (oe->op);
+ tree lhs, lhs2;
+ if (convert_mult_to_fma (mul_stmt, gimple_assign_rhs1 (mul_stmt),
+ gimple_assign_rhs2 (mul_stmt), &fma_state, NULL,
+ &lhs))
+ {
+ /* Check for MULT_EXPR that uses LHS. */
+ imm_use_iterator imm_iter;
+ gimple *use_stmt;
+ use_operand_p use_p;
+ FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
+ {
+ use_stmt = USE_STMT (use_p);
+ if (is_gimple_debug (use_stmt))
+ continue;
+ if (gimple_bb (use_stmt) != gimple_bb (mul_stmt))
+ continue;
+
+ if (is_gimple_assign (use_stmt)
+ && gimple_assign_rhs_code (use_stmt) == MULT_EXPR
+ && convert_mult_to_fma (use_stmt,
+ gimple_assign_rhs1 (use_stmt),
+ gimple_assign_rhs2 (use_stmt),
+ &fma_state, NULL, &lhs2)
+ && gimple_bb (SSA_NAME_DEF_STMT (lhs2))
+ == gimple_bb (mul_stmt))
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file,
+ "Found possible non-flat FMA chain. LHS1: ");
+ print_generic_expr (dump_file, lhs);
+ fprintf (dump_file, ", LHS2: ");
+ print_generic_expr (dump_file, lhs2);
+ fprintf (dump_file, "\n");
+ }
+ return -1;
+ }
+ }
+ }
+ }
+
/* 1. When ops_mult.length == 2, like the following case,
a * b + c * d + e.
@@ -6815,6 +6868,11 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
2. If all ops are defined with mult, we don't need to rearrange them. */
if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
{
+ if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (type)),
+ param_avoid_fma_max_bits))
+ /* Avoid re-arrange to produce less FMA chains that can be slow. */
+ return 1;
+
/* Put no-mult ops and mult ops alternately at the end of the
queue, which is conducive to generating more FMA and reducing the
loss of FMA when breaking the chain. */
@@ -6829,9 +6887,9 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
if (opindex > 0)
opindex--;
}
- return true;
+ return 1;
}
- return false;
+ return 0;
}
/* Reassociate expressions in basic block BB and its post-dominator as
children.
@@ -6997,7 +7055,7 @@ reassociate_bb (basic_block bb)
machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
int ops_num = ops.length ();
int width;
- bool has_fma = false;
+ int has_fma = 0;
/* For binary bit operations, if there are at least 3
operands and the last operand in OPS is a constant,
@@ -7020,7 +7078,7 @@ reassociate_bb (basic_block bb)
opt_type)
&& (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
{
- has_fma = rank_ops_for_fma (&ops);
+ has_fma = rank_ops_for_fma (&ops, TREE_TYPE (lhs));
}
/* Only rewrite the expression tree to parallel in the
@@ -7028,6 +7086,7 @@ reassociate_bb (basic_block bb)
with initial linearization. */
if (!reassoc_insert_powi_p
&& ops.length () > 3
+ && has_fma >= 0
&& (width = get_reassociation_width (ops_num, rhs_code,
mode)) > 1)
{
@@ -7037,7 +7096,7 @@ reassociate_bb (basic_block bb)
width);
rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
width,
- has_fma,
+ has_fma > 0,
ops);
}
else
@@ -7046,7 +7105,7 @@ reassociate_bb (basic_block bb)
to make sure the ones that get the double
binary op are chosen wisely. */
int len = ops.length ();
- if (len >= 3 && !has_fma)
+ if (len >= 3 && has_fma == 0)
swap_ops_for_binary_stmt (ops, len - 3);
new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
--
2.25.1
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PING][PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc
2023-07-07 8:26 ` [PING][PATCH] " Di Zhao OS
@ 2023-07-07 14:31 ` Philipp Tomsich
0 siblings, 0 replies; 3+ messages in thread
From: Philipp Tomsich @ 2023-07-07 14:31 UTC (permalink / raw)
To: Di Zhao OS; +Cc: gcc-patches
On Fri, 7 Jul 2023 at 10:28, Di Zhao OS via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Update the patch so it can apply.
>
> Tested on spec2017 fprate cases again. With option "-funroll-loops -Ofast -flto",
> the improvements of 1-copy run are:
>
> Ampere1:
> 508.namd_r 4.26%
> 510.parest_r 2.55%
> Overall 0.54%
> Intel Xeon:
> 503.bwaves_r 1.3%
> 508.namd_r 1.58%
> overall 0.42%
This looks like a worthwhile improvement.
From reviewing the patch, a few nit-picks:
- given that 'has_fma' can now take three values { -1, 0, 1 }, an enum
with more descriptive names for these 3 states should be used;
- using "has_fma >= 0" and "fma > 0" tests are hard to read; after
changing this to an enum, you can use macros or helper functions to
test the predicates (i.e., *_P macros or *_p helpers) for readability
- the meaning of the return values of rank_ops_for_fma should be
documented in the comment describing the function
- changing convert_mult_to_fma_1 to return a tree* (i.e., return_lhs
or NULL_TREE) removes the need for an in/out parameter
Thanks,
Philipp.
>
>
> Thanks,
> Di Zhao
>
>
> > -----Original Message-----
> > From: Di Zhao OS
> > Sent: Friday, June 16, 2023 4:51 PM
> > To: gcc-patches@gcc.gnu.org
> > Subject: [PATCH] tree-optimization/110279- Check for nested FMA chains in
> > reassoc
> >
> > This patch is to fix the regressions found in SPEC2017 fprate cases
> > on aarch64.
> >
> > 1. Reused code in pass widening_mul to check for nested FMA chains
> > (those connected by MULT_EXPRs), since re-writing to parallel
> > generates worse codes.
> >
> > 2. Avoid re-arrange to produce less FMA chains that can be slow.
> >
> > Tested on ampere1 and neoverse-n1, this fixed the regressions in
> > 508.namd_r and 510.parest_r 1 copy run. While I'm still collecting data
> > on x86 machines we have, I'd like to know what do you think of this.
> >
> > (Previously I tried to improve things with FMA by adding a widening_mul
> > pass before reassoc2 for it's easier to recognize different patterns
> > of FMA chains and decide whether to split them. But I suppose handling
> > them all in reassoc pass is more efficient.)
> >
> > Thanks,
> > Di Zhao
> >
> > ---
> > gcc/ChangeLog:
> >
> > * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
> > Support new mode that merely do the checking.
> > (struct fma_transformation_info): Moved to header.
> > (class fma_deferring_state): Moved to header.
> > (convert_mult_to_fma): Add new parameter.
> > * tree-ssa-math-opts.h (struct fma_transformation_info):
> > (class fma_deferring_state): Moved from .cc.
> > (convert_mult_to_fma): Add function decl.
> > * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
> > (rank_ops_for_fma): Return -1 if nested FMAs are found.
> > (reassociate_bb): Avoid rewriting to parallel if nested FMAs are
> > found.
>
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2023-07-07 14:31 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-16 8:50 [PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc Di Zhao OS
2023-07-07 8:26 ` [PING][PATCH] " Di Zhao OS
2023-07-07 14:31 ` Philipp Tomsich
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).