From: Richard Biener <rguenther@suse.de>
To: Martin Jambor <mjambor@suse.cz>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>, Jan Hubicka <jh@suse.cz>
Subject: Re: [PR 81616] Deferring FMA transformations in tight loops
Date: Fri, 12 Jan 2018 12:23:00 -0000 [thread overview]
Message-ID: <alpine.LSU.2.20.1801121313580.32271@zhemvz.fhfr.qr> (raw)
In-Reply-To: <ri6lgh51ri9.fsf@suse.cz>
On Wed, 10 Jan 2018, Martin Jambor wrote:
> Hello,
>
> I would really like to ping the FMA transformation prevention patch that
> I sent here in December, which, after incorporating a suggestion from
> Richi, re-base and re-testing, I re-post below. I really think that it
> should make into gcc 8 in some form, because the performance wins are
> really big.
>
> I am still opened to all sorts of comments, of course, especially to
> suggestions about the form of the param controlling this behavior (or
> how to communicate that we want to do this on Zen in general). It might
> even be a binary value since not forming FMAs does not seem to harm
> bigger vectors either (as far as we know, I should add).
>
> For the record, I have not yet received any information from AMD why
> FMAs on 256bit vectors do not have this problem, I assume all people
> that could give an authoritative answer are now looking into covert
> channel mitigation stuff. But very probably it is just that the
> internal split FMAs can be scheduled so that while one is still waiting
> for its addend, another can already execute.
OK, and sorry for the delay.
Thanks,
Richard.
> Thanks,
>
> Martin
>
>
> On Fri, Dec 15 2017, Martin Jambor wrote:
> > Hello,
> >
> > the patch below prevents creation if fused-multiply-and-add instructions
> > in the widening_mul gimple pass on the Zen-based AMD CPUs and as a
> > result fixes regressions of native znver1 tuning when compared to
> > generic tuning in:
> >
> > - the matrix.c testcase of PR 81616 (straightforward matrix
> > multiplication) at -O2 and -O3 which is currently 60% (!),
> >
> > - SPEC 2006 454.calculix at -O2, which is currently over 20%, and
> >
> > - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also
> > about 20% in both cases.
> >
> > The basic idea is to detect loops in the following form:
> >
> > <bb 6>
> > # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> > ...
> > _65 = _14 * _16;
> > accumulator_66 = _65 + accumulator_111;
> >
> > and prevents from creating FMA for it. Because at least in the parest
> > and calculix cases it has to, it also deals with more than one chain of
> > FMA candidates that feed the next one's addend:
> >
> >
> > <bb 6>
> > # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> > ...
> > _65 = _14 * _16;
> > accumulator_55 = _65 + accumulator_111;
> > _65 = _24 * _36;
> > accumulator_66 = _65 + accumulator_55;
> >
> > Unfortunately, to really get rid of the calculix regression, the
> > algorithm cannot just look at one BB at a time but also has to work for
> > cases like the following:
> >
> > 1 void mult(void)
> > 2 {
> > 3 int i, j, k, l;
> > 4
> > 5 for(i=0; i<SIZE; ++i)
> > 6 {
> > 7 for(j=0; j<SIZE; ++j)
> > 8 {
> > 9 for(l=0; l<SIZE; l+=10)
> > 10 {
> > 11 c[i][j] += a[i][l] * b[k][l];
> > 12 for(k=1; k<10; ++k)
> > 13 {
> > 14 c[i][j] += a[i][l+k] * b[k][l+k];
> > 15 }
> > 16
> > 17 }
> > 18 }
> > 19 }
> > 20 }
> >
> > where the FMA on line 14 feeds into the one on line 11 in an
> > encompassing loop. Therefore I have changed the structure of the pass
> > to work in reverse dominance order and it keeps a hash set of results of
> > rejected FMAs candidates which it checks when looking at PHI nodes of
> > the current BB. Without this reorganization, calculix was still 8%
> > slower with native tuning than with generic one.
> >
> > When the deferring mechanism realizes that in the current BB, the FMA
> > candidates do not all form a one chain tight-loop like in the examples
> > above, it goes back to all the previously deferred candidates (in the
> > current BB only) and performs the transformation.
> >
> > The main reason is to keep the patch conservative (and also simple), but
> > it also means that the following function is not affected and is still
> > 20% slower when compiled with native march and tuning compared to the
> > generic one:
> >
> > 1 void mult(struct s *p1, struct s *p2)
> > 2 {
> > 3 int i, j, k;
> > 4
> > 5 for(i=0; i<SIZE; ++i)
> > 6 {
> > 7 for(j=0; j<SIZE; ++j)
> > 8 {
> > 9 for(k=0; k<SIZE; ++k)
> > 10 {
> > 11 p1->c[i][j] += p1->a[i][k] * p1->b[k][j];
> > 12 p2->c[i][j] += p2->a[i][k] * p2->b[k][j];
> > 13 }
> > 14 }
> > 15 }
> > 16 }
> >
> > I suppose that the best optimization for the above would be to split the
> > loops, but one could probably construct at least an artificial testcase
> > where the FMAs would keep enough locality that it is not the case. The
> > mechanism can be easily extended to keep track of not just one chain but
> > a few, preferably as a followup, if people think it makes sense.
> >
> > An interesting observation is that the matrix multiplication does not
> > suffer the penalty when compiled with -O3 -mprefer-vector-width=256.
> > Apparently the 256 vector processing can hide the latency penalty when
> > internally it is split into two halves. The same goes for 512 bit
> > vectors. That is why the patch leaves those be - well, there is a param
> > for the threshold which is set to zero for everybody but znver1. If
> > maintainers of any other architecture suspect that their FMAs might
> > suffer similar latency problem, they can easily try tweaking that
> > parameter and see what happens with the matrix multiplication example.
> >
> > I have bootstrapped and tested the patch on x86_64-linux (as it is and
> > also with the param set to a 256 by default to make it trigger). I have
> > also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017
> > FPrate and the only changes are the big improvements of calculix and
> > parest.
> >
>
> 2018-01-09 Martin Jambor <mjambor@suse.cz>
>
> PR target/81616
> * params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.
> * tree-ssa-math-opts.c: Include domwalk.h.
> (convert_mult_to_fma_1): New function.
> (fma_transformation_info): New type.
> (fma_deferring_state): Likewise.
> (cancel_fma_deferring): New function.
> (result_of_phi): Likewise.
> (last_fma_candidate_feeds_initial_phi): Likewise.
> (convert_mult_to_fma): Added deferring logic, split actual
> transformation to convert_mult_to_fma_1.
> (math_opts_dom_walker): New type.
> (math_opts_dom_walker::after_dom_children): New method, body moved
> here from pass_optimize_widening_mul::execute, added deferring logic
> bits.
> (pass_optimize_widening_mul::execute): Moved most of code to
> math_opts_dom_walker::after_dom_children.
> * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.
> * config/i386/i386.c (ix86_option_override_internal): Added
> maybe_setting of PARAM_AVOID_FMA_MAX_BITS.
> ---
> gcc/config/i386/i386.c | 5 +
> gcc/config/i386/x86-tune.def | 4 +
> gcc/params.def | 5 +
> gcc/tree-ssa-math-opts.c | 517 ++++++++++++++++++++++++++++++++-----------
> 4 files changed, 403 insertions(+), 128 deletions(-)
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 8696f931806..8200a136dd1 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -4900,6 +4900,11 @@ ix86_option_override_internal (bool main_args_p,
> (cf_protection_level) (opts->x_flag_cf_protection | CF_SET);
> }
>
> + if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS])
> + maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128,
> + opts->x_param_values,
> + opts_set->x_param_values);
> +
> return true;
> }
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 9401d51cdc1..0effce759f1 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
> DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
> m_SILVERMONT | m_INTEL)
>
> +/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> + smaller FMA chain. */
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1)
> +
> /*****************************************************************************/
> /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */
> /*****************************************************************************/
> diff --git a/gcc/params.def b/gcc/params.def
> index d9f8d8208a1..a0cd06db339 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1326,6 +1326,11 @@ DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL,
> "Maximum unroll factor for the unroll-and-jam transformation.",
> 4, 0, 0)
>
> +DEFPARAM(PARAM_AVOID_FMA_MAX_BITS,
> + "avoid-fma-max-bits",
> + "Maximum number of bits for which we avoid creating FMAs.",
> + 0, 0, 512)
> +
> /*
>
> Local variables:
> diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
> index ea880c7b1d8..16d9399af0b 100644
> --- a/gcc/tree-ssa-math-opts.c
> +++ b/gcc/tree-ssa-math-opts.c
> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3. If not see
> #include "optabs-libfuncs.h"
> #include "tree-eh.h"
> #include "targhooks.h"
> +#include "domwalk.h"
>
> /* This structure represents one basic block that either computes a
> division, or is a common dominator for basic block that compute a
> @@ -2639,17 +2640,214 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
> return true;
> }
>
> +/* Given a result MUL_RESULT which is a result of a multiplication of OP1 and
> + OP2 and which we know is used in statements that can be, together with the
> + multiplication, converted to FMAs, perform the transformation. */
> +
> +static void
> +convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
> +{
> + tree type = TREE_TYPE (mul_result);
> + gimple *use_stmt;
> + imm_use_iterator imm_iter;
> + gassign *fma_stmt;
> +
> + FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
> + {
> + gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
> + enum tree_code use_code;
> + tree addop, mulop1 = op1, result = mul_result;
> + bool negate_p = false;
> +
> + if (is_gimple_debug (use_stmt))
> + continue;
> +
> + use_code = gimple_assign_rhs_code (use_stmt);
> + if (use_code == NEGATE_EXPR)
> + {
> + result = gimple_assign_lhs (use_stmt);
> + use_operand_p use_p;
> + gimple *neguse_stmt;
> + single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
> + gsi_remove (&gsi, true);
> + release_defs (use_stmt);
> +
> + use_stmt = neguse_stmt;
> + gsi = gsi_for_stmt (use_stmt);
> + use_code = gimple_assign_rhs_code (use_stmt);
> + negate_p = true;
> + }
> +
> + if (gimple_assign_rhs1 (use_stmt) == result)
> + {
> + addop = gimple_assign_rhs2 (use_stmt);
> + /* a * b - c -> a * b + (-c) */
> + if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
> + addop = force_gimple_operand_gsi (&gsi,
> + build1 (NEGATE_EXPR,
> + type, addop),
> + true, NULL_TREE, true,
> + GSI_SAME_STMT);
> + }
> + else
> + {
> + addop = gimple_assign_rhs1 (use_stmt);
> + /* a - b * c -> (-b) * c + a */
> + if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
> + negate_p = !negate_p;
> + }
> +
> + if (negate_p)
> + mulop1 = force_gimple_operand_gsi (&gsi,
> + build1 (NEGATE_EXPR,
> + type, mulop1),
> + true, NULL_TREE, true,
> + GSI_SAME_STMT);
> +
> + fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
> + FMA_EXPR, mulop1, op2, addop);
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "Generated FMA ");
> + print_gimple_stmt (dump_file, fma_stmt, 0, 0);
> + fprintf (dump_file, "\n");
> + }
> +
> + gsi_replace (&gsi, fma_stmt, true);
> + widen_mul_stats.fmas_inserted++;
> + }
> +}
> +
> +/* Data necessary to perform the actual transformation from a multiplication
> + and an addition to an FMA after decision is taken it should be done and to
> + then delete the multiplication statement from the function IL. */
> +
> +struct fma_transformation_info
> +{
> + gimple *mul_stmt;
> + tree mul_result;
> + tree op1;
> + tree op2;
> +};
> +
> +/* Structure containing the current state of FMA deferring, i.e. whether we are
> + deferring, whether to continue deferring, and all data necessary to come
> + back and perform all deferred transformations. */
> +
> +class fma_deferring_state
> +{
> +public:
> + /* Class constructor. Pass true as PERFORM_DEFERRING in order to actually
> + do any deferring. */
> +
> + fma_deferring_state (bool perform_deferring)
> + : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
> + m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
> +
> + /* List of FMA candidates for which we the transformation has been determined
> + possible but we at this point in BB analysis we do not consider them
> + beneficial. */
> + auto_vec<fma_transformation_info, 8> m_candidates;
> +
> + /* Set of results of multiplication that are part of an already deferred FMA
> + candidates. */
> + hash_set<tree> m_mul_result_set;
> +
> + /* The PHI that supposedly feeds back result of a FMA to another over loop
> + boundary. */
> + gphi *m_initial_phi;
> +
> + /* Result of the last produced FMA candidate or NULL if there has not been
> + one. */
> + tree m_last_result;
> +
> + /* If true, deferring might still be profitable. If false, transform all
> + candidates and no longer defer. */
> + bool m_deferring_p;
> +};
> +
> +/* Transform all deferred FMA candidates and mark STATE as no longer
> + deferring. */
> +
> +static void
> +cancel_fma_deferring (fma_deferring_state *state)
> +{
> + if (!state->m_deferring_p)
> + return;
> +
> + for (unsigned i = 0; i < state->m_candidates.length (); i++)
> + {
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "Generating deferred FMA\n");
> +
> + const fma_transformation_info &fti = state->m_candidates[i];
> + convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2);
> +
> + gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt);
> + gsi_remove (&gsi, true);
> + release_defs (fti.mul_stmt);
> + }
> + state->m_deferring_p = false;
> +}
> +
> +/* If OP is an SSA name defined by a PHI node, return the PHI statement.
> + Otherwise return NULL. */
> +
> +static gphi *
> +result_of_phi (tree op)
> +{
> + if (TREE_CODE (op) != SSA_NAME)
> + return NULL;
> +
> + return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));
> +}
> +
> +/* After processing statements of a BB and recording STATE, return true if the
> + initial phi is fed by the last FMA candidate result ore one such result from
> + previously processed BBs marked in LAST_RESULT_SET. */
> +
> +static bool
> +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
> + hash_set<tree> *last_result_set)
> +{
> + ssa_op_iter iter;
> + use_operand_p use;
> + FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)
> + {
> + tree t = USE_FROM_PTR (use);
> + if (t == state->m_last_result
> + || last_result_set->contains (t))
> + return true;
> + }
> +
> + return false;
> +}
> +
> /* Combine the multiplication at MUL_STMT with operands MULOP1 and MULOP2
> with uses in additions and subtractions to form fused multiply-add
> - operations. Returns true if successful and MUL_STMT should be removed. */
> + operations. Returns true if successful and MUL_STMT should be removed.
> +
> + If STATE indicates that we are deferring FMA transformation, that means
> + that we do not produce FMAs for basic blocks which look like:
> +
> + <bb 6>
> + # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> + _65 = _14 * _16;
> + accumulator_66 = _65 + accumulator_111;
> +
> + or its unrolled version, i.e. with several FMA candidates that feed result
> + of one into the addend of another. Instead, we add them to a list in STATE
> + and if we later discover an FMA candidate that is not part of such a chain,
> + we go back and perform all deferred past candidates. */
>
> static bool
> -convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
> +convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
> + fma_deferring_state *state)
> {
> tree mul_result = gimple_get_lhs (mul_stmt);
> tree type = TREE_TYPE (mul_result);
> gimple *use_stmt, *neguse_stmt;
> - gassign *fma_stmt;
> use_operand_p use_p;
> imm_use_iterator imm_iter;
>
> @@ -2673,6 +2871,11 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
> if (has_zero_uses (mul_result))
> return false;
>
> + bool check_defer
> + = (state->m_deferring_p
> + && (tree_to_shwi (TYPE_SIZE (type))
> + <= PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS)));
> + bool defer = check_defer;
> /* Make sure that the multiplication statement becomes dead after
> the transformation, thus that all uses are transformed to FMAs.
> This means we assume that an FMA operation has the same cost
> @@ -2770,77 +2973,92 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2)
> }
> }
>
> + tree use_rhs1 = gimple_assign_rhs1 (use_stmt);
> + tree use_rhs2 = gimple_assign_rhs2 (use_stmt);
> /* We can't handle a * b + a * b. */
> - if (gimple_assign_rhs1 (use_stmt) == gimple_assign_rhs2 (use_stmt))
> + if (use_rhs1 == use_rhs2)
> + return false;
> + /* If deferring, make sure we are not looking at an instruction that
> + wouldn't have existed if we were not. */
> + if (state->m_deferring_p
> + && (state->m_mul_result_set.contains (use_rhs1)
> + || state->m_mul_result_set.contains (use_rhs2)))
> return false;
>
> - /* While it is possible to validate whether or not the exact form
> - that we've recognized is available in the backend, the assumption
> - is that the transformation is never a loss. For instance, suppose
> - the target only has the plain FMA pattern available. Consider
> - a*b-c -> fma(a,b,-c): we've exchanged MUL+SUB for FMA+NEG, which
> - is still two operations. Consider -(a*b)-c -> fma(-a,b,-c): we
> - still have 3 operations, but in the FMA form the two NEGs are
> - independent and could be run in parallel. */
> - }
> -
> - FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
> - {
> - gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
> - enum tree_code use_code;
> - tree addop, mulop1 = op1, result = mul_result;
> - bool negate_p = false;
> -
> - if (is_gimple_debug (use_stmt))
> - continue;
> -
> - use_code = gimple_assign_rhs_code (use_stmt);
> - if (use_code == NEGATE_EXPR)
> + if (check_defer)
> {
> - result = gimple_assign_lhs (use_stmt);
> - single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
> - gsi_remove (&gsi, true);
> - release_defs (use_stmt);
> + tree use_lhs = gimple_assign_lhs (use_stmt);
> + if (state->m_last_result)
> + {
> + if (use_rhs2 == state->m_last_result
> + || use_rhs1 == state->m_last_result)
> + defer = true;
> + else
> + defer = false;
> + }
> + else
> + {
> + gcc_checking_assert (!state->m_initial_phi);
> + gphi *phi;
> + if (use_rhs1 == result)
> + phi = result_of_phi (use_rhs2);
> + else
> + {
> + gcc_assert (use_rhs2 == result);
> + phi = result_of_phi (use_rhs1);
> + }
>
> - use_stmt = neguse_stmt;
> - gsi = gsi_for_stmt (use_stmt);
> - use_code = gimple_assign_rhs_code (use_stmt);
> - negate_p = true;
> - }
> + if (phi)
> + {
> + state->m_initial_phi = phi;
> + defer = true;
> + }
> + else
> + defer = false;
> + }
>
> - if (gimple_assign_rhs1 (use_stmt) == result)
> - {
> - addop = gimple_assign_rhs2 (use_stmt);
> - /* a * b - c -> a * b + (-c) */
> - if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
> - addop = force_gimple_operand_gsi (&gsi,
> - build1 (NEGATE_EXPR,
> - type, addop),
> - true, NULL_TREE, true,
> - GSI_SAME_STMT);
> + state->m_last_result = use_lhs;
> + check_defer = false;
> }
> else
> + defer = false;
> +
> + /* While it is possible to validate whether or not the exact form that
> + we've recognized is available in the backend, the assumption is that
> + if the deferring logic above did not trigger, the transformation is
> + never a loss. For instance, suppose the target only has the plain FMA
> + pattern available. Consider a*b-c -> fma(a,b,-c): we've exchanged
> + MUL+SUB for FMA+NEG, which is still two operations. Consider
> + -(a*b)-c -> fma(-a,b,-c): we still have 3 operations, but in the FMA
> + form the two NEGs are independent and could be run in parallel. */
> + }
> +
> + if (defer)
> + {
> + fma_transformation_info fti;
> + fti.mul_stmt = mul_stmt;
> + fti.mul_result = mul_result;
> + fti.op1 = op1;
> + fti.op2 = op2;
> + state->m_candidates.safe_push (fti);
> + state->m_mul_result_set.add (mul_result);
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> {
> - addop = gimple_assign_rhs1 (use_stmt);
> - /* a - b * c -> (-b) * c + a */
> - if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR)
> - negate_p = !negate_p;
> + fprintf (dump_file, "Deferred generating FMA for multiplication ");
> + print_gimple_stmt (dump_file, mul_stmt, 0, 0);
> + fprintf (dump_file, "\n");
> }
>
> - if (negate_p)
> - mulop1 = force_gimple_operand_gsi (&gsi,
> - build1 (NEGATE_EXPR,
> - type, mulop1),
> - true, NULL_TREE, true,
> - GSI_SAME_STMT);
> -
> - fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
> - FMA_EXPR, mulop1, op2, addop);
> - gsi_replace (&gsi, fma_stmt, true);
> - widen_mul_stats.fmas_inserted++;
> + return false;
> + }
> + else
> + {
> + if (state->m_deferring_p)
> + cancel_fma_deferring (state);
> + convert_mult_to_fma_1 (mul_result, op1, op2);
> + return true;
> }
> -
> - return true;
> }
>
>
> @@ -3270,92 +3488,135 @@ public:
>
> }; // class pass_optimize_widening_mul
>
> -unsigned int
> -pass_optimize_widening_mul::execute (function *fun)
> +/* Walker class to perform the transformation in reverse dominance order. */
> +
> +class math_opts_dom_walker : public dom_walker
> {
> - basic_block bb;
> - bool cfg_changed = false;
> +public:
> + /* Constructor, CFG_CHANGED is a pointer to a boolean flag that will be set
> + if walking modidifes the CFG. */
>
> - memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));
> - calculate_dominance_info (CDI_DOMINATORS);
> - renumber_gimple_stmt_uids ();
> + math_opts_dom_walker (bool *cfg_changed_p)
> + : dom_walker (CDI_DOMINATORS), m_last_result_set (),
> + m_cfg_changed_p (cfg_changed_p) {}
>
> - FOR_EACH_BB_FN (bb, fun)
> + /* The actual actions performed in the walk. */
> +
> + virtual void after_dom_children (basic_block);
> +
> + /* Set of results of chains of multiply and add statement combinations that
> + were not transformed into FMAs because of active deferring. */
> + hash_set<tree> m_last_result_set;
> +
> + /* Pointer to a flag of the user that needs to be set if CFG has been
> + modified. */
> + bool *m_cfg_changed_p;
> +};
> +
> +void
> +math_opts_dom_walker::after_dom_children (basic_block bb)
> +{
> + gimple_stmt_iterator gsi;
> +
> + fma_deferring_state fma_state (PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS) > 0);
> +
> + for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
> {
> - gimple_stmt_iterator gsi;
> + gimple *stmt = gsi_stmt (gsi);
> + enum tree_code code;
>
> - for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);)
> - {
> - gimple *stmt = gsi_stmt (gsi);
> - enum tree_code code;
> + if (is_gimple_assign (stmt))
> + {
> + code = gimple_assign_rhs_code (stmt);
> + switch (code)
> + {
> + case MULT_EXPR:
> + if (!convert_mult_to_widen (stmt, &gsi)
> + && !convert_expand_mult_copysign (stmt, &gsi)
> + && convert_mult_to_fma (stmt,
> + gimple_assign_rhs1 (stmt),
> + gimple_assign_rhs2 (stmt),
> + &fma_state))
> + {
> + gsi_remove (&gsi, true);
> + release_defs (stmt);
> + continue;
> + }
> + break;
> +
> + case PLUS_EXPR:
> + case MINUS_EXPR:
> + if (!convert_plusminus_to_widen (&gsi, stmt, code))
> + match_uaddsub_overflow (&gsi, stmt, code);
> + break;
>
> - if (is_gimple_assign (stmt))
> + case TRUNC_MOD_EXPR:
> + convert_to_divmod (as_a<gassign *> (stmt));
> + break;
> +
> + default:;
> + }
> + }
> + else if (is_gimple_call (stmt))
> + {
> + tree fndecl = gimple_call_fndecl (stmt);
> + if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> {
> - code = gimple_assign_rhs_code (stmt);
> - switch (code)
> + switch (DECL_FUNCTION_CODE (fndecl))
> {
> - case MULT_EXPR:
> - if (!convert_mult_to_widen (stmt, &gsi)
> - && !convert_expand_mult_copysign (stmt, &gsi)
> + case BUILT_IN_POWF:
> + case BUILT_IN_POW:
> + case BUILT_IN_POWL:
> + if (gimple_call_lhs (stmt)
> + && TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST
> + && real_equal
> + (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),
> + &dconst2)
> && convert_mult_to_fma (stmt,
> - gimple_assign_rhs1 (stmt),
> - gimple_assign_rhs2 (stmt)))
> + gimple_call_arg (stmt, 0),
> + gimple_call_arg (stmt, 0),
> + &fma_state))
> {
> - gsi_remove (&gsi, true);
> + unlink_stmt_vdef (stmt);
> + if (gsi_remove (&gsi, true)
> + && gimple_purge_dead_eh_edges (bb))
> + *m_cfg_changed_p = true;
> release_defs (stmt);
> continue;
> }
> break;
>
> - case PLUS_EXPR:
> - case MINUS_EXPR:
> - if (!convert_plusminus_to_widen (&gsi, stmt, code))
> - match_uaddsub_overflow (&gsi, stmt, code);
> - break;
> -
> - case TRUNC_MOD_EXPR:
> - convert_to_divmod (as_a<gassign *> (stmt));
> - break;
> -
> default:;
> }
> }
> - else if (is_gimple_call (stmt)
> - && gimple_call_lhs (stmt))
> - {
> - tree fndecl = gimple_call_fndecl (stmt);
> - if (fndecl
> - && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> - {
> - switch (DECL_FUNCTION_CODE (fndecl))
> - {
> - case BUILT_IN_POWF:
> - case BUILT_IN_POW:
> - case BUILT_IN_POWL:
> - if (TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST
> - && real_equal
> - (&TREE_REAL_CST (gimple_call_arg (stmt, 1)),
> - &dconst2)
> - && convert_mult_to_fma (stmt,
> - gimple_call_arg (stmt, 0),
> - gimple_call_arg (stmt, 0)))
> - {
> - unlink_stmt_vdef (stmt);
> - if (gsi_remove (&gsi, true)
> - && gimple_purge_dead_eh_edges (bb))
> - cfg_changed = true;
> - release_defs (stmt);
> - continue;
> - }
> - break;
> -
> - default:;
> - }
> - }
> - }
> - gsi_next (&gsi);
> + else
> + cancel_fma_deferring (&fma_state);
> }
> + gsi_next (&gsi);
> }
> + if (fma_state.m_deferring_p
> + && fma_state.m_initial_phi)
> + {
> + gcc_checking_assert (fma_state.m_last_result);
> + if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> + &m_last_result_set))
> + cancel_fma_deferring (&fma_state);
> + else
> + m_last_result_set.add (fma_state.m_last_result);
> + }
> +}
> +
> +
> +unsigned int
> +pass_optimize_widening_mul::execute (function *fun)
> +{
> + bool cfg_changed = false;
> +
> + memset (&widen_mul_stats, 0, sizeof (widen_mul_stats));
> + calculate_dominance_info (CDI_DOMINATORS);
> + renumber_gimple_stmt_uids ();
> +
> + math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
>
> statistics_counter_event (fun, "widening multiplications inserted",
> widen_mul_stats.widen_mults_inserted);
>
--
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
prev parent reply other threads:[~2018-01-12 12:14 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-15 14:19 Martin Jambor
2017-12-18 11:20 ` Richard Biener
2017-12-22 0:01 ` Martin Jambor
2018-01-10 19:01 ` Martin Jambor
2018-01-10 19:16 ` Jeff Law
2018-01-12 12:23 ` Richard Biener [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LSU.2.20.1801121313580.32271@zhemvz.fhfr.qr \
--to=rguenther@suse.de \
--cc=gcc-patches@gcc.gnu.org \
--cc=jh@suse.cz \
--cc=mjambor@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).