Date: Fri, 12 Jan 2018 12:23:00 -0000
From: Richard Biener
To: Martin Jambor
cc: GCC Patches, Jan Hubicka
Subject: Re: [PR 81616] Deferring FMA transformations in tight loops

On Wed, 10 Jan 2018, Martin Jambor wrote:

> Hello,
>
> I would really like to ping the FMA transformation prevention patch that
> I sent here in December, which, after incorporating a suggestion from
> Richi, re-basing and re-testing, I re-post below.  I really think that it
> should make it into GCC 8 in some form, because the performance wins are
> really big.
>
> I am still open to all sorts of comments, of course, especially to
> suggestions about the form of the param controlling this behavior (or
> how to communicate that we want to do this on Zen in general).  It might
> even be a binary value, since not forming FMAs does not seem to harm
> bigger vectors either (as far as we know, I should add).
>
> For the record, I have not yet received any information from AMD about
> why FMAs on 256-bit vectors do not have this problem; I assume all the
> people that could give an authoritative answer are now looking into
> covert channel mitigation stuff.  But very probably it is just that the
> internally split FMAs can be scheduled so that while one is still
> waiting for its addend, another can already execute.

OK, and sorry for the delay.

Thanks,
Richard.

> Thanks,
>
> Martin
>
>
> On Fri, Dec 15 2017, Martin Jambor wrote:
> > Hello,
> >
> > the patch below prevents the creation of fused multiply-add
> > instructions in the widening_mul gimple pass on Zen-based AMD CPUs
> > and as a result fixes regressions of native znver1 tuning when
> > compared to generic tuning in:
> >
> > - the matrix.c testcase of PR 81616 (straightforward matrix
> >   multiplication) at -O2 and -O3, where the regression is currently
> >   60% (!),
> >
> > - SPEC 2006 454.calculix at -O2, which is currently over 20%, and
> >
> > - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also
> >   about 20% in both cases.
> >
> > The basic idea is to detect loops of the following form:
> >
> >   # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> >   ...
> >   _65 = _14 * _16;
> >   accumulator_66 = _65 + accumulator_111;
> >
> > and to prevent creating an FMA for them.
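> >
> > To make the pattern concrete: a source loop along the following lines
> > (just a minimal sketch with made-up array names and size, not the
> > actual matrix.c from the PR) ends up in exactly this shape once the
> > accumulated element is kept in a register across the innermost loop:
> >
> >   #define N 1000
> >   static double a[N][N], b[N][N], c[N][N];
> >
> >   void mult(void)
> >   {
> >     int i, j, k;
> >
> >     for(i=0; i<N; i++)
> >       for(j=0; j<N; j++)
> >         for(k=0; k<N; k++)
> >           /* The addition always needs the accumulator value produced
> >              by the previous iteration, so a chain of FMAs here
> >              serializes on the addend.  */
> >           c[i][j] += a[i][k] * b[k][j];
> >   }
> >
> > The multiplication and the addition are the only statements in the
> > loop body, and the addend is the value carried over the back edge.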
> >
> > Because it has to, at least in the parest and calculix cases, the
> > patch also handles more than one chain of FMA candidates in which
> > each result feeds the next one's addend:
> >
> >   # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
> >   ...
> >   _65 = _14 * _16;
> >   accumulator_55 = _65 + accumulator_111;
> >   _65 = _24 * _36;
> >   accumulator_66 = _65 + accumulator_55;
> >
> > Unfortunately, to really get rid of the calculix regression, the
> > algorithm cannot just look at one BB at a time but also has to work
> > for cases like the following:
> >
> >  1 void mult(void)
> >  2 {
> >  3   int i, j, k, l;
> >  4
> >  5   for(i=0; i
> >  6     {
> >  7       for(j=0; j
> >  8         {
> >  9           for(l=0; l
> > 10             {
> > 11               c[i][j] += a[i][l] * b[k][l];
> > 12               for(k=1; k<10; ++k)
> > 13                 {
> > 14                   c[i][j] += a[i][l+k] * b[k][l+k];
> > 15                 }
> > 16
> > 17             }
> > 18         }
> > 19     }
> > 20 }
> >
> > where the FMA on line 14 feeds into the one on line 11 in an
> > encompassing loop.  Therefore I have changed the structure of the pass
> > to work in reverse dominance order, and it keeps a hash set of results
> > of rejected FMA candidates which it checks when looking at PHI nodes
> > of the current BB.  Without this reorganization, calculix was still 8%
> > slower with native tuning than with generic tuning.
> >
> > When the deferring mechanism realizes that the FMA candidates in the
> > current BB do not all form one tight-loop chain like in the examples
> > above, it goes back to all the previously deferred candidates (in the
> > current BB only) and performs the transformation.
> >
> > The main reason is to keep the patch conservative (and also simple),
> > but it also means that the following function is not affected and is
> > still 20% slower when compiled with native march and tuning compared
> > to the generic one:
> >
> >  1 void mult(struct s *p1, struct s *p2)
> >  2 {
> >  3   int i, j, k;
> >  4
> >  5   for(i=0; i
> >  6     {
> >  7       for(j=0; j
> >  8         {
> >  9           for(k=0; k
> > 10             {
> > 11               p1->c[i][j] += p1->a[i][k] * p1->b[k][j];
> > 12               p2->c[i][j] += p2->a[i][k] * p2->b[k][j];
> > 13             }
> > 14         }
> > 15     }
> > 16 }
> >
> > I suppose that the best optimization for the above would be to split
> > the loops, but one could probably construct at least an artificial
> > testcase where the FMAs would keep enough locality that this is not
> > the case.  The mechanism can easily be extended to keep track of not
> > just one chain but a few, preferably as a followup, if people think it
> > makes sense.
> >
> > An interesting observation is that the matrix multiplication does not
> > suffer the penalty when compiled with -O3 -mprefer-vector-width=256.
> > Apparently the 256-bit vector processing can hide the latency penalty
> > when it is internally split into two halves.  The same goes for
> > 512-bit vectors.  That is why the patch leaves those be - well, there
> > is a param for the threshold, which is set to zero for everybody but
> > znver1.  If maintainers of any other architecture suspect that their
> > FMAs might suffer a similar latency problem, they can easily try
> > tweaking that parameter and see what happens with the matrix
> > multiplication example.
> >
> > I have bootstrapped and tested the patch on x86_64-linux (as it is and
> > also with the param set to 256 by default to make it trigger).  I have
> > also measured run-times of all benchmarks in SPEC 2006 FP and SPEC
> > 2017 FPrate, and the only changes are the big improvements of calculix
> > and parest.
> >
>
> 2018-01-09  Martin Jambor
>
> 	PR target/81616
> 	* params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.
> * tree-ssa-math-opts.c: Include domwalk.h. > (convert_mult_to_fma_1): New function. > (fma_transformation_info): New type. > (fma_deferring_state): Likewise. > (cancel_fma_deferring): New function. > (result_of_phi): Likewise. > (last_fma_candidate_feeds_initial_phi): Likewise. > (convert_mult_to_fma): Added deferring logic, split actual > transformation to convert_mult_to_fma_1. > (math_opts_dom_walker): New type. > (math_opts_dom_walker::after_dom_children): New method, body moved > here from pass_optimize_widening_mul::execute, added deferring logic > bits. > (pass_optimize_widening_mul::execute): Moved most of code to > math_opts_dom_walker::after_dom_children. > * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New. > * config/i386/i386.c (ix86_option_override_internal): Added > maybe_setting of PARAM_AVOID_FMA_MAX_BITS. > --- > gcc/config/i386/i386.c | 5 + > gcc/config/i386/x86-tune.def | 4 + > gcc/params.def | 5 + > gcc/tree-ssa-math-opts.c | 517 ++++++++++++++++++++++++++++++++----------- > 4 files changed, 403 insertions(+), 128 deletions(-) > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > index 8696f931806..8200a136dd1 100644 > --- a/gcc/config/i386/i386.c > +++ b/gcc/config/i386/i386.c > @@ -4900,6 +4900,11 @@ ix86_option_override_internal (bool main_args_p, > (cf_protection_level) (opts->x_flag_cf_protection | CF_SET); > } > > + if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS]) > + maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128, > + opts->x_param_values, > + opts_set->x_param_values); > + > return true; > } > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def > index 9401d51cdc1..0effce759f1 100644 > --- a/gcc/config/i386/x86-tune.def > +++ b/gcc/config/i386/x86-tune.def > @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb", > DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes", > m_SILVERMONT | m_INTEL) > > +/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or > + smaller FMA chain. */ > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1) > + > /*****************************************************************************/ > /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */ > /*****************************************************************************/ > diff --git a/gcc/params.def b/gcc/params.def > index d9f8d8208a1..a0cd06db339 100644 > --- a/gcc/params.def > +++ b/gcc/params.def > @@ -1326,6 +1326,11 @@ DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL, > "Maximum unroll factor for the unroll-and-jam transformation.", > 4, 0, 0) > > +DEFPARAM(PARAM_AVOID_FMA_MAX_BITS, > + "avoid-fma-max-bits", > + "Maximum number of bits for which we avoid creating FMAs.", > + 0, 0, 512) > + > /* > > Local variables: > diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c > index ea880c7b1d8..16d9399af0b 100644 > --- a/gcc/tree-ssa-math-opts.c > +++ b/gcc/tree-ssa-math-opts.c > @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3. 
If not see > #include "optabs-libfuncs.h" > #include "tree-eh.h" > #include "targhooks.h" > +#include "domwalk.h" > > /* This structure represents one basic block that either computes a > division, or is a common dominator for basic block that compute a > @@ -2639,17 +2640,214 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt, > return true; > } > > +/* Given a result MUL_RESULT which is a result of a multiplication of OP1 and > + OP2 and which we know is used in statements that can be, together with the > + multiplication, converted to FMAs, perform the transformation. */ > + > +static void > +convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2) > +{ > + tree type = TREE_TYPE (mul_result); > + gimple *use_stmt; > + imm_use_iterator imm_iter; > + gassign *fma_stmt; > + > + FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result) > + { > + gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt); > + enum tree_code use_code; > + tree addop, mulop1 = op1, result = mul_result; > + bool negate_p = false; > + > + if (is_gimple_debug (use_stmt)) > + continue; > + > + use_code = gimple_assign_rhs_code (use_stmt); > + if (use_code == NEGATE_EXPR) > + { > + result = gimple_assign_lhs (use_stmt); > + use_operand_p use_p; > + gimple *neguse_stmt; > + single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt); > + gsi_remove (&gsi, true); > + release_defs (use_stmt); > + > + use_stmt = neguse_stmt; > + gsi = gsi_for_stmt (use_stmt); > + use_code = gimple_assign_rhs_code (use_stmt); > + negate_p = true; > + } > + > + if (gimple_assign_rhs1 (use_stmt) == result) > + { > + addop = gimple_assign_rhs2 (use_stmt); > + /* a * b - c -> a * b + (-c) */ > + if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR) > + addop = force_gimple_operand_gsi (&gsi, > + build1 (NEGATE_EXPR, > + type, addop), > + true, NULL_TREE, true, > + GSI_SAME_STMT); > + } > + else > + { > + addop = gimple_assign_rhs1 (use_stmt); > + /* a - b * c -> (-b) * c + a */ > + if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR) > + negate_p = !negate_p; > + } > + > + if (negate_p) > + mulop1 = force_gimple_operand_gsi (&gsi, > + build1 (NEGATE_EXPR, > + type, mulop1), > + true, NULL_TREE, true, > + GSI_SAME_STMT); > + > + fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt), > + FMA_EXPR, mulop1, op2, addop); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "Generated FMA "); > + print_gimple_stmt (dump_file, fma_stmt, 0, 0); > + fprintf (dump_file, "\n"); > + } > + > + gsi_replace (&gsi, fma_stmt, true); > + widen_mul_stats.fmas_inserted++; > + } > +} > + > +/* Data necessary to perform the actual transformation from a multiplication > + and an addition to an FMA after decision is taken it should be done and to > + then delete the multiplication statement from the function IL. */ > + > +struct fma_transformation_info > +{ > + gimple *mul_stmt; > + tree mul_result; > + tree op1; > + tree op2; > +}; > + > +/* Structure containing the current state of FMA deferring, i.e. whether we are > + deferring, whether to continue deferring, and all data necessary to come > + back and perform all deferred transformations. */ > + > +class fma_deferring_state > +{ > +public: > + /* Class constructor. Pass true as PERFORM_DEFERRING in order to actually > + do any deferring. 
*/ > + > + fma_deferring_state (bool perform_deferring) > + : m_candidates (), m_mul_result_set (), m_initial_phi (NULL), > + m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {} > + > + /* List of FMA candidates for which we the transformation has been determined > + possible but we at this point in BB analysis we do not consider them > + beneficial. */ > + auto_vec m_candidates; > + > + /* Set of results of multiplication that are part of an already deferred FMA > + candidates. */ > + hash_set m_mul_result_set; > + > + /* The PHI that supposedly feeds back result of a FMA to another over loop > + boundary. */ > + gphi *m_initial_phi; > + > + /* Result of the last produced FMA candidate or NULL if there has not been > + one. */ > + tree m_last_result; > + > + /* If true, deferring might still be profitable. If false, transform all > + candidates and no longer defer. */ > + bool m_deferring_p; > +}; > + > +/* Transform all deferred FMA candidates and mark STATE as no longer > + deferring. */ > + > +static void > +cancel_fma_deferring (fma_deferring_state *state) > +{ > + if (!state->m_deferring_p) > + return; > + > + for (unsigned i = 0; i < state->m_candidates.length (); i++) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "Generating deferred FMA\n"); > + > + const fma_transformation_info &fti = state->m_candidates[i]; > + convert_mult_to_fma_1 (fti.mul_result, fti.op1, fti.op2); > + > + gimple_stmt_iterator gsi = gsi_for_stmt (fti.mul_stmt); > + gsi_remove (&gsi, true); > + release_defs (fti.mul_stmt); > + } > + state->m_deferring_p = false; > +} > + > +/* If OP is an SSA name defined by a PHI node, return the PHI statement. > + Otherwise return NULL. */ > + > +static gphi * > +result_of_phi (tree op) > +{ > + if (TREE_CODE (op) != SSA_NAME) > + return NULL; > + > + return dyn_cast (SSA_NAME_DEF_STMT (op)); > +} > + > +/* After processing statements of a BB and recording STATE, return true if the > + initial phi is fed by the last FMA candidate result ore one such result from > + previously processed BBs marked in LAST_RESULT_SET. */ > + > +static bool > +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state, > + hash_set *last_result_set) > +{ > + ssa_op_iter iter; > + use_operand_p use; > + FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE) > + { > + tree t = USE_FROM_PTR (use); > + if (t == state->m_last_result > + || last_result_set->contains (t)) > + return true; > + } > + > + return false; > +} > + > /* Combine the multiplication at MUL_STMT with operands MULOP1 and MULOP2 > with uses in additions and subtractions to form fused multiply-add > - operations. Returns true if successful and MUL_STMT should be removed. */ > + operations. Returns true if successful and MUL_STMT should be removed. > + > + If STATE indicates that we are deferring FMA transformation, that means > + that we do not produce FMAs for basic blocks which look like: > + > + > + # accumulator_111 = PHI <0.0(5), accumulator_66(6)> > + _65 = _14 * _16; > + accumulator_66 = _65 + accumulator_111; > + > + or its unrolled version, i.e. with several FMA candidates that feed result > + of one into the addend of another. Instead, we add them to a list in STATE > + and if we later discover an FMA candidate that is not part of such a chain, > + we go back and perform all deferred past candidates. 
*/ > > static bool > -convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2) > +convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2, > + fma_deferring_state *state) > { > tree mul_result = gimple_get_lhs (mul_stmt); > tree type = TREE_TYPE (mul_result); > gimple *use_stmt, *neguse_stmt; > - gassign *fma_stmt; > use_operand_p use_p; > imm_use_iterator imm_iter; > > @@ -2673,6 +2871,11 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2) > if (has_zero_uses (mul_result)) > return false; > > + bool check_defer > + = (state->m_deferring_p > + && (tree_to_shwi (TYPE_SIZE (type)) > + <= PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS))); > + bool defer = check_defer; > /* Make sure that the multiplication statement becomes dead after > the transformation, thus that all uses are transformed to FMAs. > This means we assume that an FMA operation has the same cost > @@ -2770,77 +2973,92 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2) > } > } > > + tree use_rhs1 = gimple_assign_rhs1 (use_stmt); > + tree use_rhs2 = gimple_assign_rhs2 (use_stmt); > /* We can't handle a * b + a * b. */ > - if (gimple_assign_rhs1 (use_stmt) == gimple_assign_rhs2 (use_stmt)) > + if (use_rhs1 == use_rhs2) > + return false; > + /* If deferring, make sure we are not looking at an instruction that > + wouldn't have existed if we were not. */ > + if (state->m_deferring_p > + && (state->m_mul_result_set.contains (use_rhs1) > + || state->m_mul_result_set.contains (use_rhs2))) > return false; > > - /* While it is possible to validate whether or not the exact form > - that we've recognized is available in the backend, the assumption > - is that the transformation is never a loss. For instance, suppose > - the target only has the plain FMA pattern available. Consider > - a*b-c -> fma(a,b,-c): we've exchanged MUL+SUB for FMA+NEG, which > - is still two operations. Consider -(a*b)-c -> fma(-a,b,-c): we > - still have 3 operations, but in the FMA form the two NEGs are > - independent and could be run in parallel. 
*/ > - } > - > - FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result) > - { > - gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt); > - enum tree_code use_code; > - tree addop, mulop1 = op1, result = mul_result; > - bool negate_p = false; > - > - if (is_gimple_debug (use_stmt)) > - continue; > - > - use_code = gimple_assign_rhs_code (use_stmt); > - if (use_code == NEGATE_EXPR) > + if (check_defer) > { > - result = gimple_assign_lhs (use_stmt); > - single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt); > - gsi_remove (&gsi, true); > - release_defs (use_stmt); > + tree use_lhs = gimple_assign_lhs (use_stmt); > + if (state->m_last_result) > + { > + if (use_rhs2 == state->m_last_result > + || use_rhs1 == state->m_last_result) > + defer = true; > + else > + defer = false; > + } > + else > + { > + gcc_checking_assert (!state->m_initial_phi); > + gphi *phi; > + if (use_rhs1 == result) > + phi = result_of_phi (use_rhs2); > + else > + { > + gcc_assert (use_rhs2 == result); > + phi = result_of_phi (use_rhs1); > + } > > - use_stmt = neguse_stmt; > - gsi = gsi_for_stmt (use_stmt); > - use_code = gimple_assign_rhs_code (use_stmt); > - negate_p = true; > - } > + if (phi) > + { > + state->m_initial_phi = phi; > + defer = true; > + } > + else > + defer = false; > + } > > - if (gimple_assign_rhs1 (use_stmt) == result) > - { > - addop = gimple_assign_rhs2 (use_stmt); > - /* a * b - c -> a * b + (-c) */ > - if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR) > - addop = force_gimple_operand_gsi (&gsi, > - build1 (NEGATE_EXPR, > - type, addop), > - true, NULL_TREE, true, > - GSI_SAME_STMT); > + state->m_last_result = use_lhs; > + check_defer = false; > } > else > + defer = false; > + > + /* While it is possible to validate whether or not the exact form that > + we've recognized is available in the backend, the assumption is that > + if the deferring logic above did not trigger, the transformation is > + never a loss. For instance, suppose the target only has the plain FMA > + pattern available. Consider a*b-c -> fma(a,b,-c): we've exchanged > + MUL+SUB for FMA+NEG, which is still two operations. Consider > + -(a*b)-c -> fma(-a,b,-c): we still have 3 operations, but in the FMA > + form the two NEGs are independent and could be run in parallel. 
*/ > + } > + > + if (defer) > + { > + fma_transformation_info fti; > + fti.mul_stmt = mul_stmt; > + fti.mul_result = mul_result; > + fti.op1 = op1; > + fti.op2 = op2; > + state->m_candidates.safe_push (fti); > + state->m_mul_result_set.add (mul_result); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > { > - addop = gimple_assign_rhs1 (use_stmt); > - /* a - b * c -> (-b) * c + a */ > - if (gimple_assign_rhs_code (use_stmt) == MINUS_EXPR) > - negate_p = !negate_p; > + fprintf (dump_file, "Deferred generating FMA for multiplication "); > + print_gimple_stmt (dump_file, mul_stmt, 0, 0); > + fprintf (dump_file, "\n"); > } > > - if (negate_p) > - mulop1 = force_gimple_operand_gsi (&gsi, > - build1 (NEGATE_EXPR, > - type, mulop1), > - true, NULL_TREE, true, > - GSI_SAME_STMT); > - > - fma_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt), > - FMA_EXPR, mulop1, op2, addop); > - gsi_replace (&gsi, fma_stmt, true); > - widen_mul_stats.fmas_inserted++; > + return false; > + } > + else > + { > + if (state->m_deferring_p) > + cancel_fma_deferring (state); > + convert_mult_to_fma_1 (mul_result, op1, op2); > + return true; > } > - > - return true; > } > > > @@ -3270,92 +3488,135 @@ public: > > }; // class pass_optimize_widening_mul > > -unsigned int > -pass_optimize_widening_mul::execute (function *fun) > +/* Walker class to perform the transformation in reverse dominance order. */ > + > +class math_opts_dom_walker : public dom_walker > { > - basic_block bb; > - bool cfg_changed = false; > +public: > + /* Constructor, CFG_CHANGED is a pointer to a boolean flag that will be set > + if walking modidifes the CFG. */ > > - memset (&widen_mul_stats, 0, sizeof (widen_mul_stats)); > - calculate_dominance_info (CDI_DOMINATORS); > - renumber_gimple_stmt_uids (); > + math_opts_dom_walker (bool *cfg_changed_p) > + : dom_walker (CDI_DOMINATORS), m_last_result_set (), > + m_cfg_changed_p (cfg_changed_p) {} > > - FOR_EACH_BB_FN (bb, fun) > + /* The actual actions performed in the walk. */ > + > + virtual void after_dom_children (basic_block); > + > + /* Set of results of chains of multiply and add statement combinations that > + were not transformed into FMAs because of active deferring. */ > + hash_set m_last_result_set; > + > + /* Pointer to a flag of the user that needs to be set if CFG has been > + modified. 
*/ > + bool *m_cfg_changed_p; > +}; > + > +void > +math_opts_dom_walker::after_dom_children (basic_block bb) > +{ > + gimple_stmt_iterator gsi; > + > + fma_deferring_state fma_state (PARAM_VALUE (PARAM_AVOID_FMA_MAX_BITS) > 0); > + > + for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);) > { > - gimple_stmt_iterator gsi; > + gimple *stmt = gsi_stmt (gsi); > + enum tree_code code; > > - for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi);) > - { > - gimple *stmt = gsi_stmt (gsi); > - enum tree_code code; > + if (is_gimple_assign (stmt)) > + { > + code = gimple_assign_rhs_code (stmt); > + switch (code) > + { > + case MULT_EXPR: > + if (!convert_mult_to_widen (stmt, &gsi) > + && !convert_expand_mult_copysign (stmt, &gsi) > + && convert_mult_to_fma (stmt, > + gimple_assign_rhs1 (stmt), > + gimple_assign_rhs2 (stmt), > + &fma_state)) > + { > + gsi_remove (&gsi, true); > + release_defs (stmt); > + continue; > + } > + break; > + > + case PLUS_EXPR: > + case MINUS_EXPR: > + if (!convert_plusminus_to_widen (&gsi, stmt, code)) > + match_uaddsub_overflow (&gsi, stmt, code); > + break; > > - if (is_gimple_assign (stmt)) > + case TRUNC_MOD_EXPR: > + convert_to_divmod (as_a (stmt)); > + break; > + > + default:; > + } > + } > + else if (is_gimple_call (stmt)) > + { > + tree fndecl = gimple_call_fndecl (stmt); > + if (fndecl && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) > { > - code = gimple_assign_rhs_code (stmt); > - switch (code) > + switch (DECL_FUNCTION_CODE (fndecl)) > { > - case MULT_EXPR: > - if (!convert_mult_to_widen (stmt, &gsi) > - && !convert_expand_mult_copysign (stmt, &gsi) > + case BUILT_IN_POWF: > + case BUILT_IN_POW: > + case BUILT_IN_POWL: > + if (gimple_call_lhs (stmt) > + && TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST > + && real_equal > + (&TREE_REAL_CST (gimple_call_arg (stmt, 1)), > + &dconst2) > && convert_mult_to_fma (stmt, > - gimple_assign_rhs1 (stmt), > - gimple_assign_rhs2 (stmt))) > + gimple_call_arg (stmt, 0), > + gimple_call_arg (stmt, 0), > + &fma_state)) > { > - gsi_remove (&gsi, true); > + unlink_stmt_vdef (stmt); > + if (gsi_remove (&gsi, true) > + && gimple_purge_dead_eh_edges (bb)) > + *m_cfg_changed_p = true; > release_defs (stmt); > continue; > } > break; > > - case PLUS_EXPR: > - case MINUS_EXPR: > - if (!convert_plusminus_to_widen (&gsi, stmt, code)) > - match_uaddsub_overflow (&gsi, stmt, code); > - break; > - > - case TRUNC_MOD_EXPR: > - convert_to_divmod (as_a (stmt)); > - break; > - > default:; > } > } > - else if (is_gimple_call (stmt) > - && gimple_call_lhs (stmt)) > - { > - tree fndecl = gimple_call_fndecl (stmt); > - if (fndecl > - && gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) > - { > - switch (DECL_FUNCTION_CODE (fndecl)) > - { > - case BUILT_IN_POWF: > - case BUILT_IN_POW: > - case BUILT_IN_POWL: > - if (TREE_CODE (gimple_call_arg (stmt, 1)) == REAL_CST > - && real_equal > - (&TREE_REAL_CST (gimple_call_arg (stmt, 1)), > - &dconst2) > - && convert_mult_to_fma (stmt, > - gimple_call_arg (stmt, 0), > - gimple_call_arg (stmt, 0))) > - { > - unlink_stmt_vdef (stmt); > - if (gsi_remove (&gsi, true) > - && gimple_purge_dead_eh_edges (bb)) > - cfg_changed = true; > - release_defs (stmt); > - continue; > - } > - break; > - > - default:; > - } > - } > - } > - gsi_next (&gsi); > + else > + cancel_fma_deferring (&fma_state); > } > + gsi_next (&gsi); > } > + if (fma_state.m_deferring_p > + && fma_state.m_initial_phi) > + { > + gcc_checking_assert (fma_state.m_last_result); > + if (!last_fma_candidate_feeds_initial_phi (&fma_state, > + 
&m_last_result_set)) > + cancel_fma_deferring (&fma_state); > + else > + m_last_result_set.add (fma_state.m_last_result); > + } > +} > + > + > +unsigned int > +pass_optimize_widening_mul::execute (function *fun) > +{ > + bool cfg_changed = false; > + > + memset (&widen_mul_stats, 0, sizeof (widen_mul_stats)); > + calculate_dominance_info (CDI_DOMINATORS); > + renumber_gimple_stmt_uids (); > + > + math_opts_dom_walker (&cfg_changed).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun)); > > statistics_counter_event (fun, "widening multiplications inserted", > widen_mul_stats.widen_mults_inserted); > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)