From: Di Zhao OS <dizhao@os.amperecomputing.com>
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Cc: Richard Biener <richard.guenther@gmail.com>
Subject: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
Date: Mon, 28 Aug 2023 08:17:29 +0000 [thread overview]
Message-ID: <SN6PR01MB4240A22F29D390F5B96FD057E8E0A@SN6PR01MB4240.prod.exchangelabs.com> (raw)
[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]
This patch tries to fix the 2% regression in 510.parest_r on
ampere1 in the tracker. (Previous discussion is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
1. Add testcases for the problem. For an op list in the form of
"acc = a * b + c * d + acc", currently reassociation doesn't
swap the operands so that more FMAs can be generated.
After widening_mul the result looks like:
_1 = .FMA(a, b, acc_0);
acc_1 = .FMA(c, d, _1);
While previously (before the "Handle FMA friendly..." patch),
widening_mul's result was like:
_1 = a * b;
_2 = .FMA (c, d, _1);
acc_1 = acc_0 + _2;
If the code fragment is in a loop, some architectures can execute
the latter in parallel, so it can be much faster than the
former. For the small testcase, the performance gap is over
10% on both ampere1 and neoverse-n1. So the point here is to avoid
turning the last statement into FMA, and keep it a PLUS_EXPR as
much as possible. (If we are rewriting the op list into parallel,
no special treatment is needed, since the last statement after
rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
2. Add function result_feeds_back_from_phi_p to check for cross
backedge dependency, and a new enum fma_state to describe the
state of FMA candidates for an op list.
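For reference, the SSA shape being detected is roughly the following (hypothetical names; compare foo1 in the new testcase):

```
  <bb 3>:
  # acc_0 = PHI <0.0 (2), acc_1 (3)>
  _1 = a * b;
  _2 = c * d;
  _3 = _1 + _2;
  acc_1 = acc_0 + _3;   ;; acc_1 feeds the PHI back over the backedge
```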
With this patch, there's a 3% improvement in 510.parest_r 1-copy
run on ampere1. The compile options are:
"-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
Best regards,
Di Zhao
----
PR tree-optimization/110279
gcc/ChangeLog:
* tree-ssa-reassoc.cc (enum fma_state): New enum to
describe the state of FMA candidates for an op list.
(rewrite_expr_tree_parallel): Change boolean
parameter to enum type.
(result_feeds_back_from_phi_p): New function to check
for cross backedge dependency.
(rank_ops_for_fma): Return enum fma_state. Add new
parameter.
(reassociate_bb): If there's backedge dependency in an
op list, swap the operands before rewrite_expr_tree.
gcc/testsuite/ChangeLog:
* gcc.dg/pr110279.c: New test.
[-- Attachment #2: 0001-swap-operands-in-reassoc-to-reduce-cross-backedge-FM.patch --]
[-- Type: application/octet-stream, Size: 8928 bytes --]
From 663f9822b2f5b458ae4fd0983ec59fd1eb47877f Mon Sep 17 00:00:00 2001
From: amptest <amptest@amperecomputing.com>
Date: Wed, 23 Aug 2023 01:26:24 -0700
Subject: [PATCH] swap operands in reassoc to reduce cross backedge FMA
---
gcc/testsuite/gcc.dg/pr110279.c | 62 ++++++++++++++++++
gcc/tree-ssa-reassoc.cc | 111 +++++++++++++++++++++++++++++---
2 files changed, 163 insertions(+), 10 deletions(-)
create mode 100644 gcc/testsuite/gcc.dg/pr110279.c
diff --git a/gcc/testsuite/gcc.dg/pr110279.c b/gcc/testsuite/gcc.dg/pr110279.c
new file mode 100644
index 00000000000..9dc72658bff
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279.c
@@ -0,0 +1,62 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
+/* { dg-additional-options "-march=armv8.2-a" } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+/* Check that FMAs with a backedge dependency are avoided.  Otherwise no FMA
+   would be generated with "--param avoid-fma-max-bits=512".  */
+
+data_e foo1 (data_e a, data_e b, data_e c, data_e d)
+{
+ data_e result = 0;
+
+ for (int ic = 0; ic < LOOP_COUNT; ic++)
+ {
+ result += (a * b + c * d);
+
+ a -= 0.1;
+ b += 0.9;
+ c *= 1.02;
+ d *= 0.61;
+ }
+
+ return result;
+}
+
+data_e foo2 (data_e a, data_e b, data_e c, data_e d)
+{
+ data_e result = 0;
+
+ for (int ic = 0; ic < LOOP_COUNT; ic++)
+ {
+ result += a * b + result + c * d;
+
+ a -= 0.1;
+ b += 0.9;
+ c *= 1.02;
+ d *= 0.61;
+ }
+
+ return result;
+}
+
+data_e foo3 (data_e a, data_e b, data_e c, data_e d)
+{
+ data_e result = 0;
+
+ for (int ic = 0; ic < LOOP_COUNT; ic++)
+ {
+ result += result + a * b + c * d;
+
+ a -= 0.1;
+ b += 0.9;
+ c *= 1.02;
+ d *= 0.61;
+ }
+
+ return result;
+}
+
+/* { dg-final { scan-tree-dump-times "Generated FMA" 3 "widening_mul"} } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..3552ee9c0e7 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5471,6 +5471,24 @@ get_reassociation_width (int ops_num, enum tree_code opc,
return width;
}
+/* This enum describes the state of FMA candidates for a list of operands. */
+
+enum fma_state
+{
+  /* There is at most one FMA candidate in the op list.  In this case, no
+     special treatment is needed when rewriting the expression tree.  */
+ FMA_STATE_NONE_OR_SINGLE = 0,
+
+  /* The result of the op list has a cross-backedge dependency, and we need
+     to avoid turning it into an FMA.  In this case, swap the operands if we
+     are not rewriting it into parallel.  */
+ FMA_STATE_CROSS_BACKEDGE,
+
+ /* There are multiple FMA candidates in the op list. Avoid swapping operands
+ in this case. */
+ FMA_STATE_MULTIPLE
+};
+
#define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops. */
#define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops. */
#define NORMAL_END_STMT 2 /* It is the end stmt of normal ops. */
@@ -5507,7 +5525,7 @@ get_reassociation_width (int ops_num, enum tree_code opc,
SSA6 = SSA4 + SSA5;
*/
static void
-rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
+rewrite_expr_tree_parallel (gassign *stmt, int width, fma_state fs,
const vec<operand_entry *> &ops)
{
enum tree_code opcode = gimple_assign_rhs_code (stmt);
@@ -5582,7 +5600,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
}
/* Swap the operands if no FMA in the chain. */
- if (ops1->length () > 2 && !has_fma)
+ if (ops1->length () > 2 && fs == FMA_STATE_NONE_OR_SINGLE)
swap_ops_for_binary_stmt (*ops1, ops1->length () - 3);
if (i < width_active
@@ -6766,7 +6784,69 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
}
}
+/* Given that LHS is the result SSA_NAME of OPS, return whether it has a
+   cross-backedge dependency, i.e.:
+
+   1. LHS is a phi argument in the same basic block where it is defined,
+
+   2. and the result of that phi node is used in OPS.  */
+
+static bool
+result_feeds_back_from_phi_p (vec<operand_entry *> *ops, tree lhs)
+{
+ basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+ gimple_stmt_iterator gsi;
+
+  /* Iterate over the phi nodes in the basic block where LHS is defined.  */
+ for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+ {
+ gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
+ for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+ {
+ tree op = PHI_ARG_DEF (phi, i);
+ if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+ continue;
+ tree phi_result = gimple_phi_result (phi);
+ operand_entry *oe;
+ unsigned int j;
+ FOR_EACH_VEC_ELT (*ops, j, oe)
+ {
+ if (TREE_CODE (oe->op) != SSA_NAME)
+ continue;
+
+ /* Result of phi is operand of PLUS_EXPR. */
+ if (oe->op == phi_result)
+ return true;
+
+	      /* Check if the result of phi is an operand of a MULT_EXPR.  */
+ gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+ if (is_gimple_assign (def_stmt)
+ && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR)
+ {
+ tree rhs = gimple_assign_rhs1 (def_stmt);
+ if (TREE_CODE (rhs) == SSA_NAME)
+ {
+ if (rhs == phi_result)
+ return true;
+ def_stmt = SSA_NAME_DEF_STMT (rhs);
+ }
+ }
+ if (is_gimple_assign (def_stmt)
+ && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+ {
+ if (gimple_assign_rhs1 (def_stmt) == phi_result
+ || gimple_assign_rhs2 (def_stmt) == phi_result)
+ return true;
+ }
+ }
+ }
+ }
+ return false;
+}
+
 /* Rearranging ops may produce more FMAs when the chain contains more than 2 FMAs.
+ LHS is the result SSA_NAME of OPS.
+
Put no-mult ops and mult ops alternately at the end of the queue, which is
conducive to generating more FMA and reducing the loss of FMA when breaking
the chain.
@@ -6781,14 +6861,15 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
_4 = .FMA (c_7(D), d_8(D), _3);
_11 = .FMA (a_5(D), b_6(D), _4); */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static fma_state
+rank_ops_for_fma (vec<operand_entry *> *ops, tree lhs)
{
operand_entry *oe;
unsigned int i;
unsigned int ops_length = ops->length ();
auto_vec<operand_entry *> ops_mult;
auto_vec<operand_entry *> ops_others;
+ fma_state fs = FMA_STATE_NONE_OR_SINGLE;
FOR_EACH_VEC_ELT (*ops, i, oe)
{
@@ -6815,6 +6896,17 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
2. If all ops are defined with mult, we don't need to rearrange them. */
if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
{
+      /* On some platforms, if the result has a cross-backedge dependency,
+ transforming it into an FMA leads to worse data parallelism. If we're
+ rewriting the op list to parallel we don't need to worry about this,
+ otherwise swap the operands. */
+ if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+ param_avoid_fma_max_bits)
+ && result_feeds_back_from_phi_p (ops, lhs))
+ fs = FMA_STATE_CROSS_BACKEDGE;
+ else
+ fs = FMA_STATE_MULTIPLE;
+
/* Put no-mult ops and mult ops alternately at the end of the
queue, which is conducive to generating more FMA and reducing the
loss of FMA when breaking the chain. */
@@ -6829,9 +6921,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
if (opindex > 0)
opindex--;
}
- return true;
}
- return false;
+ return fs;
}
/* Reassociate expressions in basic block BB and its post-dominator as
children.
@@ -6997,7 +7088,7 @@ reassociate_bb (basic_block bb)
machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
int ops_num = ops.length ();
int width;
- bool has_fma = false;
+ fma_state fs = FMA_STATE_NONE_OR_SINGLE;
/* For binary bit operations, if there are at least 3
operands and the last operand in OPS is a constant,
@@ -7020,7 +7111,7 @@ reassociate_bb (basic_block bb)
opt_type)
&& (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
{
- has_fma = rank_ops_for_fma (&ops);
+ fs = rank_ops_for_fma (&ops, lhs);
}
/* Only rewrite the expression tree to parallel in the
@@ -7037,7 +7128,7 @@ reassociate_bb (basic_block bb)
width);
rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
width,
- has_fma,
+ fs,
ops);
}
else
@@ -7046,7 +7137,7 @@ reassociate_bb (basic_block bb)
to make sure the ones that get the double
binary op are chosen wisely. */
int len = ops.length ();
- if (len >= 3 && !has_fma)
+ if (len >= 3 && fs != FMA_STATE_MULTIPLE)
swap_ops_for_binary_stmt (ops, len - 3);
new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
--
2.25.1
Thread overview: 11+ messages
2023-08-28 8:17 Di Zhao OS [this message]
2023-08-28 23:22 ` Jeff Law
2023-08-29 7:41 ` Richard Biener
2023-08-29 7:49 ` Di Zhao OS
2023-08-29 8:09 ` Richard Biener
2023-08-29 8:58 ` Di Zhao OS
2023-08-29 11:11 ` Richard Biener
2023-08-30 9:33 ` Di Zhao OS
2023-08-31 12:22 ` Richard Biener
2023-09-04 10:33 ` Di Zhao OS
2023-08-29 12:34 ` Jeff Law