Date: Fri, 12 Jan 2018 12:23:00 -0000
From: Richard Biener
To: Martin Jambor
cc: GCC Patches, Jan Hubicka
Subject: Re: [PR 81616] Deferring FMA transformations in tight loops

On Wed, 10 Jan 2018, Martin Jambor wrote:

> Hello,
>
> I would really like to ping the FMA transformation prevention patch that
> I sent here in December, which, after incorporating a suggestion from
> Richi, re-basing and re-testing, I re-post below.  I really think that it
> should make it into GCC 8 in some form, because the performance wins are
> really big.
>
> I am still open to all sorts of comments, of course, especially to
> suggestions about the form of the param controlling this behavior (or
> how to communicate that we want to do this on Zen in general).  It might
> even be a binary value, since not forming FMAs does not seem to harm
> bigger vectors either (as far as we know, I should add).
>
> For the record, I have not yet received any information from AMD about
> why FMAs on 256-bit vectors do not have this problem; I assume all the
> people that could give an authoritative answer are now looking into
> covert channel mitigation stuff.  But very probably it is just that the
> internally split FMAs can be scheduled so that while one is still
> waiting for its addend, another can already execute.

OK, and sorry for the delay.

Thanks,
Richard.

> Thanks,
>
> Martin
>
>
> On Fri, Dec 15 2017, Martin Jambor wrote:
> > Hello,
> >
> > the patch below prevents the creation of fused multiply-add
> > instructions in the widening_mul gimple pass on Zen-based AMD CPUs
> > and as a result fixes regressions of native znver1 tuning when
> > compared to generic tuning in:
> >
> > - the matrix.c testcase of PR 81616 (straightforward matrix
> >   multiplication) at -O2 and -O3, where the regression is currently
> >   60% (!),
> >
> > - SPEC 2006 454.calculix at -O2, which is currently over 20%, and
> >
> > - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also
> >   about 20% in both cases.
> >
> > The basic idea is to detect loops of the following form:
> >
> >   # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> >   ...
> >   _65 = _14 * _16;
> >   accumulator_66 = _65 + accumulator_111;
> >
> > and to prevent creating an FMA for them.
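> >
> > To make the pattern concrete: a source loop along the following lines
> > (just a minimal sketch with made-up array names and size, not the
> > actual matrix.c from the PR) ends up in exactly this shape once the
> > accumulated element is kept in a register across the innermost loop:
> >
> >   #define N 1000
> >   static double a[N][N], b[N][N], c[N][N];
> >
> >   void mult(void)
> >   {
> >     int i, j, k;
> >
> >     for(i=0; i<N; i++)
> >       for(j=0; j<N; j++)
> >         for(k=0; k<N; k++)
> >           /* The addition always needs the accumulator value produced
> >              by the previous iteration, so a chain of FMAs here
> >              serializes on the addend.  */
> >           c[i][j] += a[i][k] * b[k][j];
> >   }
> >
> > The multiplication and the addition are the only statements in the
> > loop body, and the addend is the value carried over the back edge.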
> >
> > Because it has to, at least in the parest and calculix cases, the
> > patch also handles more than one chain of FMA candidates in which
> > each result feeds the next one's addend:
> >
> >   # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> >   ...
> >   _65 = _14 * _16;
> >   accumulator_55 = _65 + accumulator_111;
> >   _65 = _24 * _36;
> >   accumulator_66 = _65 + accumulator_55;
> >
> > Unfortunately, to really get rid of the calculix regression, the
> > algorithm cannot just look at one BB at a time but also has to work
> > for cases like the following:
> >
> >  1 void mult(void)
> >  2 {
> >  3   int i, j, k, l;
> >  4
> >  5   for(i=0; i
> >  6     {
> >  7       for(j=0; j
> >  8         {
> >  9           for(l=0; l
> > 10             {
> > 11               c[i][j] += a[i][l] * b[k][l];
> > 12               for(k=1; k<10; ++k)
> > 13                 {
> > 14                   c[i][j] += a[i][l+k] * b[k][l+k];
> > 15                 }
> > 16
> > 17             }
> > 18         }
> > 19     }
> > 20 }
> >
> > where the FMA on line 14 feeds into the one on line 11 in an
> > encompassing loop.  Therefore I have changed the structure of the pass
> > to work in reverse dominance order, and it keeps a hash set of results
> > of rejected FMA candidates which it checks when looking at PHI nodes
> > of the current BB.  Without this reorganization, calculix was still 8%
> > slower with native tuning than with generic tuning.
> >
> > When the deferring mechanism realizes that the FMA candidates in the
> > current BB do not all form one tight-loop chain like in the examples
> > above, it goes back to all the previously deferred candidates (in the
> > current BB only) and performs the transformation.
> >
> > The main reason is to keep the patch conservative (and also simple),
> > but it also means that the following function is not affected and is
> > still 20% slower when compiled with native march and tuning compared
> > to the generic one:
> >
> >  1 void mult(struct s *p1, struct s *p2)
> >  2 {
> >  3   int i, j, k;
> >  4
> >  5   for(i=0; i
> >  6     {
> >  7       for(j=0; j
> >  8         {
> >  9           for(k=0; k
> > 10             {
> > 11               p1->c[i][j] += p1->a[i][k] * p1->b[k][j];
> > 12               p2->c[i][j] += p2->a[i][k] * p2->b[k][j];
> > 13             }
> > 14         }
> > 15     }
> > 16 }
> >
> > I suppose that the best optimization for the above would be to split
> > the loops, but one could probably construct at least an artificial
> > testcase where the FMAs would keep enough locality that this is not
> > the case.  The mechanism can easily be extended to keep track of not
> > just one chain but a few, preferably as a followup, if people think it
> > makes sense.
> >
> > An interesting observation is that the matrix multiplication does not
> > suffer the penalty when compiled with -O3 -mprefer-vector-width=256.
> > Apparently the 256-bit vector processing can hide the latency penalty
> > when it is internally split into two halves.  The same goes for
> > 512-bit vectors.  That is why the patch leaves those be - well, there
> > is a param for the threshold, which is set to zero for everybody but
> > znver1.  If maintainers of any other architecture suspect that their
> > FMAs might suffer a similar latency problem, they can easily try
> > tweaking that parameter and see what happens with the matrix
> > multiplication example.
> >
> > I have bootstrapped and tested the patch on x86_64-linux (as it is and
> > also with the param set to 256 by default to make it trigger).  I have
> > also measured run-times of all benchmarks in SPEC 2006 FP and SPEC
> > 2017 FPrate, and the only changes are the big improvements of calculix
> > and parest.
> >
>
> 2018-01-09  Martin Jambor
>
> 	PR target/81616
> 	* params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.
> * tree-ssa-math-opts.c: Include domwalk.h. > (convert_mult_to_fma_1): New function. > (fma_transformation_info): New type. > (fma_deferring_state): Likewise. > (cancel_fma_deferring): New function. > (result_of_phi): Likewise. > (last_fma_candidate_feeds_initial_phi): Likewise. > (convert_mult_to_fma): Added deferring logic, split actual > transformation to convert_mult_to_fma_1. > (math_opts_dom_walker): New type. > (math_opts_dom_walker::after_dom_children): New method, body moved > here from pass_optimize_widening_mul::execute, added deferring logic > bits. > (pass_optimize_widening_mul::execute): Moved most of code to > math_opts_dom_walker::after_dom_children. > * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New. > * config/i386/i386.c (ix86_option_override_internal): Added > maybe_setting of PARAM_AVOID_FMA_MAX_BITS. > --- > gcc/config/i386/i386.c | 5 + > gcc/config/i386/x86-tune.def | 4 + > gcc/params.def | 5 + > gcc/tree-ssa-math-opts.c | 517 ++++++++++++++++++++++++++++++++----------- > 4 files changed, 403 insertions(+), 128 deletions(-) > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > index 8696f931806..8200a136dd1 100644 > --- a/gcc/config/i386/i386.c > +++ b/gcc/config/i386/i386.c > @@ -4900,6 +4900,11 @@ ix86_option_override_internal (bool main_args_p, > (cf_protection_level) (opts->x_flag_cf_protection | CF_SET); > } > > + if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS]) > + maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128, > + opts->x_param_values, > + opts_set->x_param_values); > + > return true; > } > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def > index 9401d51cdc1..0effce759f1 100644 > --- a/gcc/config/i386/x86-tune.def > +++ b/gcc/config/i386/x86-tune.def > @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb", > DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes", > m_SILVERMONT | m_INTEL) > > +/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or > + smaller FMA chain. */ > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1) > + > /*****************************************************************************/ > /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */ > /*****************************************************************************/ > diff --git a/gcc/params.def b/gcc/params.def > index d9f8d8208a1..a0cd06db339 100644 > --- a/gcc/params.def > +++ b/gcc/params.def > @@ -1326,6 +1326,11 @@ DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL, > "Maximum unroll factor for the unroll-and-jam transformation.", > 4, 0, 0) > > +DEFPARAM(PARAM_AVOID_FMA_MAX_BITS, > + "avoid-fma-max-bits", > + "Maximum number of bits for which we avoid creating FMAs.", > + 0, 0, 512) > + > /* > > Local variables: > diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c > index ea880c7b1d8..16d9399af0b 100644 > --- a/gcc/tree-ssa-math-opts.c > +++ b/gcc/tree-ssa-math-opts.c > @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3. 
If not see > #include "optabs-libfuncs.h" > #include "tree-eh.h" > #include "targhooks.h" > +#include "domwalk.h" > > /* This structure represents one basic block that either computes a > division, or is a common dominator for basic block that compute a > @@ -2639,17 +2640,214 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt, > return true; > } > > +/* Given a result MUL_RESULT which is a result of a multiplication of OP1 and > + OP2 and which we know is used in statements that can be, together with the > + multiplication, converted to FMAs, perform the transformation. */ > + > +static void > +convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2) > +{ > + tree type = TREE_TYPE (mul_result); > + gimple *use_stmt; > + imm_use_iterator imm_iter; > + gassign *fma_stmt; > + > + FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result) > + { > + gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt); > + enum tree_code use_code; > + tree addop, mulop1 = op1, result = mul_result; > + bool negate_p = false; > + > + if (is_gimple_debug (use_stmt)) > + continue; > + > + use_code = gimple_assign_rhs_code (use_stmt); > + if (use_code == NEGATE_EXPR) > + { > + result = gimple_assign_lhs (use_stmt); > + use_operand_p use_p; > + gimple *neguse_stmt; > + single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt); > + gsi_remove (&gsi, true); > + release_defs (use_stmt); > + > + use_stmt = neguse_stmt; > + gsi = gsi_for_stmt (use_stmt); > + use_code = gimple_assign_rhs_code (use_stmt); > + negate_p = true; > + } > + > + if (gimple_assign_rhs1 (use_stmt) == result) > + { > + addop = gimple_assign_rhs2 (use_stmt); > + /* a * b - c -> a * b + (-c) */ > + if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR) > + addop = force_gimple_operand_gsi (&gsi, > + build1 (NEGATE_EXPR, > + type, addop), > + true, NULL_TREE, true, > + GSI_SAME_STMT); > + } > + else > + { > + addop = gimple_assign_rhs1 (use_stmt); > + /* a - b * c -> (-b) * c + a */ > + if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR) > + negate_p = !negate_p; > + } > + > + if (negate_p) > + mulop1 = force_gimple_operand_gsi (&gsi, > + build1 (NEGATE_EXPR, > + type, mulop1), > + true, NULL_TREE, true, > + GSI_SAME_STMT); > + > + fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt), > + FMA_EXPR, mulop1, op2, addop); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "Generated FMA "); > + print_gimple_stmt (dump_file, fma_stmt, 0, 0); > + fprintf (dump_file, "\n"); > + } > + > + gsi_replace (&gsi, fma_stmt, true); > + widen_mul_stats.fmas_inserted++; > + } > +} > + > +/* Data necessary to perform the actual transformation from a multiplication > + and an addition to an FMA after decision is taken it should be done and to > + then delete the multiplication statement from the function IL. */ > + > +struct fma_transformation_info > +{ > + gimple *mul_stmt; > + tree mul_result; > + tree op1; > + tree op2; > +}; > + > +/* Structure containing the current state of FMA deferring, i.e. whether we are > + deferring, whether to continue deferring, and all data necessary to come > + back and perform all deferred transformations. */ > + > +class fma_deferring_state > +{ > +public: > + /* Class constructor. Pass true as PERFORM_DEFERRING in order to actually > + do any deferring. 
*/ > + > + fma_deferring_state (bool perform_deferring) > + : m_candidates (), m_mul_result_set (), m_initial_phi (NULL), > + m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {} > + > + /* List of FMA candidates for which we the transformation has been determined > + possible but we at this point in BB analysis we do not consider them > + beneficial. */ > + auto_vec m_candidates; > + > + /* Set of results of multiplication that are part of an already deferred FMA > + candidates. */ > + hash_set m_mul_result_set; > + > + /* The PHI that supposedly feeds back result of a FMA to another over loop > + boundary. */ > + gphi *m_initial_phi; > + > + /* Result of the last produced FMA candidate or NULL if there has not been > + one. */ > + tree m_last_result; > + > + /* If true, deferring might still be profitable. If false, transform all > + candidates and no longer defer. */ > + bool m_deferring_p; > +}; > + > +/* Transform all deferred FMA candidates and mark STATE as no longer > + deferring. */ > + > +static void > +cancel_fma_deferring (fma_deferring_state *state) > +{ > + if (!state->m_deferring_p) > + return; > + > + for (unsigned i = 0; i < state->m_candidates.length (); i++) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "Generating deferred FMA\n"); > + > + const fma_transformation_info &fti = state->m_candidates[i]; > + convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2); > + > + gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt); > + gsi_remove (&gsi, true); > + release_defs (fti.mul_stmt); > + } > + state->m_deferring_p = false; > +} > + > +/* If OP is an SSA name defined by a PHI node, return the PHI statement. > + Otherwise return NULL. */ > + > +static gphi * > +result_of_phi (tree op) > +{ > + if (TREE_CODE (op) != SSA_NAME) > + return NULL; > + > + return dyn_cast (SSA_NAME_DEF_STMT (op)); > +} > + > +/* After processing statements of a BB and recording STATE, return true if the > + initial phi is fed by the last FMA candidate result ore one such result from > + previously processed BBs marked in LAST_RESULT_SET. */ > + > +static bool > +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state, > + hash_set *last_result_set) > +{ > + ssa_op_iter iter; > + use_operand_p use; > + FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE) > + { > + tree t = USE_FROM_PTR (use); > + if (t == state->m_last_result > + || last_result_set->contains (t)) > + return true; > + } > + > + return false; > +} > + > /* Combine the multiplication at MUL_STMT with operands MULOP1 and MULOP2 > with uses in additions and subtractions to form fused multiply-add > - operations. Returns true if successful and MUL_STMT should be removed. */ > + operations. Returns true if successful and MUL_STMT should be removed. > + > + If STATE indicates that we are deferring FMA transformation, that means > + that we do not produce FMAs for basic blocks which look like: > + > + > + # accumulator_111 = PHI <0.0(5), accumulator_66(6)> > + _65 = _14 * _16; > + accumulator_66 = _65 + accumulator_111; > + > + or its unrolled version, i.e. with several FMA candidates that feed result > + of one into the addend of another. Instead, we add them to a list in STATE > + and if we later discover an FMA candidate that is not part of such a chain, > + we go back and perform all deferred past candidates. 
*/ > > static bool > -convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2) > +convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2, > + fma_deferring_state *state) > { > tree mul_result = gimple_get_lhs (mul_stmt); > tree type = TREE_TYPE (mul_result); > gimple *use_stmt, *neguse_stmt; > - gassign *fma_stmt; > use_operand_p use_p; > imm_use_iterator imm_iter; > > @@ -2673,6 +2871,11 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2) > if (has_zero_uses (mul_result)) > return false; > > + bool check_defer > + = (state->m_deferring_p > + && (tree_to_shwi (TYPE_SIZE (type)) > + <= PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS))); > + bool defer = check_defer; > /* Make sure that the multiplication statement becomes dead after > the transformation, thus that all uses are transformed to FMAs. > This means we assume that an FMA operation has the same cost > @@ -2770,77 +2973,92 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2) > } > } > > + tree use_rhs1 = gimple_assign_rhs1 (use_stmt); > + tree use_rhs2 = gimple_assign_rhs2 (use_stmt); > /* We can't handle a * b + a * b. */ > - if (gimple_assign_rhs1 (use_stmt) == gimple_assign_rhs2 (use_stmt)) > + if (use_rhs1 == use_rhs2) > + return false; > + /* If deferring, make sure we are not looking at an instruction that > + wouldn't have existed if we were not. */ > + if (state->m_deferring_p > + && (state->m_mul_result_set.contains (use_rhs1) > + || state->m_mul_result_set.contains (use_rhs2))) > return false; > > - /* While it is possible to validate whether or not the exact form > - that we've recognized is available in the backend, the assumption > - is that the transformation is never a loss. For instance, suppose > - the target only has the plain FMA pattern available. Consider > - a*b-c -> fma(a,b,-c): we've exchanged MUL+SUB for FMA+NEG, which > - is still two operations. Consider -(a*b)-c -> fma(-a,b,-c): we > - still have 3 operations, but in the FMA form the two NEGs are > - independent and could be run in parallel. 
*/ > - } > - > - FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result) > - { > - gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt); > - enum tree_code use_code; > - tree addop, mulop1 = op1, result = mul_result; > - bool negate_p = false; > - > - if (is_gimple_debug (use_stmt)) > - continue; > - > - use_code = gimple_assign_rhs_code (use_stmt); > - if (use_code == NEGATE_EXPR) > + if (check_defer) > { > - result = gimple_assign_lhs (use_stmt); > - single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt); > - gsi_remove (&gsi, true); > - release_defs (use_stmt); > + tree use_lhs = gimple_assign_lhs (use_stmt); > + if (state->m_last_result) > + { > + if (use_rhs2 == state->m_last_result > + || use_rhs1 == state->m_last_result) > + defer = true; > + else > + defer = false; > + } > + else > + { > + gcc_checking_assert (!state->m_initial_phi); > + gphi *phi; > + if (use_rhs1 == result) > + phi = result_of_phi (use_rhs2); > + else > + { > + gcc_assert (use_rhs2 == result); > + phi = result_of_phi (use_rhs1); > + } > > - use_stmt = neguse_stmt; > - gsi = gsi_for_stmt (use_stmt); > - use_code = gimple_assign_rhs_code (use_stmt); > - negate_p = true; > - } > + if (phi) > + { > + state->m_initial_phi = phi; > + defer = true; > + } > + else > + defer = false; > + } > > - if (gimple_assign_rhs1 (use_stmt) == result) > - { > - addop = gimple_assign_rhs2 (use_stmt); > - /* a * b - c -> a * b + (-c) */ > - if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR) > - addop = force_gimple_operand_gsi (&gsi, > - build1 (NEGATE_EXPR, > - type, addop), > - true, NULL_TREE, true, > - GSI_SAME_STMT); > + state->m_last_result = use_lhs; > + check_defer = false; > } > else > + defer = false; > + > + /* While it is possible to validate whether or not the exact form that > + we've recognized is available in the backend, the assumption is that > + if the deferring logic above did not trigger, the transformation is > + never a loss. For instance, suppose the target only has the plain FMA > + pattern available. Consider a*b-c -> fma(a,b,-c): we've exchanged > + MUL+SUB for FMA+NEG, which is still two operations. Consider > + -(a*b)-c -> fma(-a,b,-c): we still have 3 operations, but in the FMA > + form the two NEGs are independent and could be run in parallel. 
*/ > + } > + > + if (defer) > + { > + fma_transformation_info fti; > + fti.mul_stmt = mul_stmt; > + fti.mul_result = mul_result; > + fti.op1 = op1; > + fti.op2 = op2; > + state->m_candidates.safe_push (fti); > + state->m_mul_result_set.add (mul_result); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > { > - addop = gimple_assign_rhs1 (use_stmt); > - /* a - b * c -> (-b) * c + a */ > - if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR) > - negate_p = !negate_p; > + fprintf (dump_file, "Deferred generating FMA for multiplication "); > + print_gimple_stmt (dump_file, mul_stmt, 0, 0); > + fprintf (dump_file, "\n"); > } > > - if (negate_p) > - mulop1 = force_gimple_operand_gsi (&gsi, > - build1 (NEGATE_EXPR, > - type, mulop1), > - true, NULL_TREE, true, > - GSI_SAME_STMT); > - > - fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt), > - FMA_EXPR, mulop1, op2, addop); > - gsi_replace (&gsi, fma_stmt, true); > - widen_mul_stats.fmas_inserted++; > + return false; > + } > + else > + { > + if (state->m_deferring_p) > + cancel_fma_deferring (state); > + convert_mult_to_fma_1 (mul_result, op1, op2); > + return true; > } > - > - return true; > } > > > @@ -3270,92 +3488,135 @@ public: > > }; // class pass_optimize_widening_mul > > -unsigned int > -pass_optimize_widening_mul::execute (function *fun) > +/* Walker class to perform the transformation in reverse dominance order. */ > + > +class math_opts_dom_walker : public dom_walker > { > - basic_block bb; > - bool cfg_changed = false; > +public: > + /* Constructor, CFG_CHANGED is a pointer to a boolean flag that will be set > + if walking modidifes the CFG. */ > > - memset (&widen_mul_stats, 0, sizeof (widen_mul_stats)); > - calculate_dominance_info (CDI_DOMINATORS); > - renumber_gimple_stmt_uids (); > + math_opts_dom_walker (bool *cfg_changed_p) > + : dom_walker (CDI_DOMINATORS), m_last_result_set (), > + m_cfg_changed_p (cfg_changed_p) {} > > - FOR_EACH_BB_FN (bb, fun) > + /* The actual actions performed in the walk. */ > + > + virtual void after_dom_children (basic_block); > + > + /* Set of results of chains of multiply and add statement combinations that > + were not transformed into FMAs because of active deferring. */ > + hash_set m_last_result_set; > + > + /* Pointer to a flag of the user that needs to be set if CFG has been > + modified. 
*/ > + bool *m_cfg_changed_p; > +}; > + > +void > +math_opts_dom_walker::after_dom_children (basic_block bb) > +{ > + gimple_stmt_iterator gsi; > + > + fma_deferring_state fma_state (PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS) > 0); > + > + for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);) > { > - gimple_stmt_iterator gsi; > + gimple *stmt = gsi_stmt (gsi); > + enum tree_code code; > > - for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);) > - { > - gimple *stmt = gsi_stmt (gsi); > - enum tree_code code; > + if (is_gimple_assign (stmt)) > + { > + code = gimple_assign_rhs_code (stmt); > + switch (code) > + { > + case MULT_EXPR: > + if (!convert_mult_to_widen (stmt, &gsi) > + && !convert_expand_mult_copysign (stmt, &gsi) > + && convert_mult_to_fma (stmt, > + gimple_assign_rhs1 (stmt), > + gimple_assign_rhs2 (stmt), > + &fma_state)) > + { > + gsi_remove (&gsi, true); > + release_defs (stmt); > + continue; > + } > + break; > + > + case PLUS_EXPR: > + case MINUS_EXPR: > + if (!convert_plusminus_to_widen (&gsi, stmt, code)) > + match_uaddsub_overflow (&gsi, stmt, code); > + break; > > - if (is_gimple_assign (stmt)) > + case TRUNC_MOD_EXPR: > + convert_to_divmod (as_a (stmt)); > + break; > + > + default:; > + } > + } > + else if (is_gimple_call (stmt)) > + { > + tree fndecl = gimple_call_fndecl (stmt); > + if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) > { > - code = gimple_assign_rhs_code (stmt); > - switch (code) > + switch (DECL_FUNCTION_CODE (fndecl)) > { > - case MULT_EXPR: > - if (!convert_mult_to_widen (stmt, &gsi) > - && !convert_expand_mult_copysign (stmt, &gsi) > + case BUILT_IN_POWF: > + case BUILT_IN_POW: > + case BUILT_IN_POWL: > + if (gimple_call_lhs (stmt) > + && TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST > + && real_equal > + (&TREE_REAL_CST (gimple_call_arg (stmt, 1)), > + &dconst2) > && convert_mult_to_fma (stmt, > - gimple_assign_rhs1 (stmt), > - gimple_assign_rhs2 (stmt))) > + gimple_call_arg (stmt, 0), > + gimple_call_arg (stmt, 0), > + &fma_state)) > { > - gsi_remove (&gsi, true); > + unlink_stmt_vdef (stmt); > + if (gsi_remove (&gsi, true) > + && gimple_purge_dead_eh_edges (bb)) > + *m_cfg_changed_p = true; > release_defs (stmt); > continue; > } > break; > > - case PLUS_EXPR: > - case MINUS_EXPR: > - if (!convert_plusminus_to_widen (&gsi, stmt, code)) > - match_uaddsub_overflow (&gsi, stmt, code); > - break; > - > - case TRUNC_MOD_EXPR: > - convert_to_divmod (as_a (stmt)); > - break; > - > default:; > } > } > - else if (is_gimple_call (stmt) > - && gimple_call_lhs (stmt)) > - { > - tree fndecl = gimple_call_fndecl (stmt); > - if (fndecl > - && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) > - { > - switch (DECL_FUNCTION_CODE (fndecl)) > - { > - case BUILT_IN_POWF: > - case BUILT_IN_POW: > - case BUILT_IN_POWL: > - if (TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST > - && real_equal > - (&TREE_REAL_CST (gimple_call_arg (stmt, 1)), > - &dconst2) > - && convert_mult_to_fma (stmt, > - gimple_call_arg (stmt, 0), > - gimple_call_arg (stmt, 0))) > - { > - unlink_stmt_vdef (stmt); > - if (gsi_remove (&gsi, true) > - && gimple_purge_dead_eh_edges (bb)) > - cfg_changed = true; > - release_defs (stmt); > - continue; > - } > - break; > - > - default:; > - } > - } > - } > - gsi_next (&gsi); > + else > + cancel_fma_deferring (&fma_state); > } > + gsi_next (&gsi); > } > + if (fma_state.m_deferring_p > + && fma_state.m_initial_phi) > + { > + gcc_checking_assert (fma_state.m_last_result); > + if (!last_fma_candidate_feeds_initial_phi (&fma_state, > + 
&m_last_result_set)) > + cancel_fma_deferring (&fma_state); > + else > + m_last_result_set.add (fma_state.m_last_result); > + } > +} > + > + > +unsigned int > +pass_optimize_widening_mul::execute (function *fun) > +{ > + bool cfg_changed = false; > + > + memset (&widen_mul_stats, 0, sizeof (widen_mul_stats)); > + calculate_dominance_info (CDI_DOMINATORS); > + renumber_gimple_stmt_uids (); > + > + math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun)); > > statistics_counter_event (fun, "widening multiplications inserted", > widen_mul_stats.widen_mults_inserted); > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)