From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-484851-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 99407 invoked by alias); 31 Aug 2018 11:07:27 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 99395 invoked by uid 89); 31 Aug 2018 11:07:26 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=0.8 required=5.0 tests=BAYES_50,KAM_SHORT,SPF_PASS autolearn=ham version=3.3.2 spammy=as_a, ccing, is_square_of, msg01496html
X-HELO: mx1.suse.de
Received: from mx2.suse.de (HELO mx1.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 31 Aug 2018 11:07:24 +0000
Received: from relay2.suse.de (unknown [195.135.220.254])	by mx1.suse.de (Postfix) with ESMTP id E9EFDAFD2;	Fri, 31 Aug 2018 11:07:20 +0000 (UTC)
Date: Fri, 31 Aug 2018 11:07:00 -0000
From: Richard Biener <rguenther@suse.de>
To: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,     richard.sandiford@arm.com
Subject: Re: [PATCH] Optimise sqrt reciprocal multiplications
In-Reply-To: <5B87B55F.3020707@foss.arm.com>
Message-ID: <alpine.LSU.2.20.1808311238020.16707@zhemvz.fhfr.qr>
References: <5B7C371C.8040207@foss.arm.com> <87bm9ttmq5.fsf@arm.com> <5B7EEA31.6000908@foss.arm.com> <5B87B55F.3020707@foss.arm.com>
User-Agent: Alpine 2.20 (LSU 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-SW-Source: 2018-08/txt/msg02024.txt.bz2

On Thu, 30 Aug 2018, Kyrill Tkachov wrote:

> Ping.
> 
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01496.html
> 
> Thanks,
> Kyrill
> 
> On 23/08/18 18:09, Kyrill Tkachov wrote:
> > Hi Richard,
> > 
> > On 23/08/18 11:13, Richard Sandiford wrote:
> > > Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com> writes:
> > > > Hi all,
> > > > 
> > > > This patch aims to optimise sequences involving uses of 1.0 / sqrt (a)
> > > > under -freciprocal-math and -funsafe-math-optimizations.
> > > > In particular consider:
> > > > 
> > > > x = 1.0 / sqrt (a);
> > > > r1 = x * x;  // same as 1.0 / a
> > > > r2 = a * x; // same as sqrt (a)
> > > > 
> > > > If x, r1 and r2 are all used further on in the code, this can be
> > > > transformed into:
> > > > tmp1 = 1.0 / a
> > > > tmp2 = sqrt (a)
> > > > tmp3 = tmp1 * tmp2
> > > > x = tmp3
> > > > r1 = tmp1
> > > > r2 = tmp2
> > > Nice optimisation :-)  Someone who knows the pass better should review,
> > > but:
> > 
> > Thanks for the review.
> > 
> > > There seems to be an implicit assumption that this is a win even
> > > when the r1 and r2 assignments are only conditionally executed.
> > > That's probably true, but it might be worth saying explicitly.
> > 
> > I'll admit I had not considered that case.
> > I think it won't make a difference in practice, as the really expensive
> > operations here
> > are the sqrt and the division and they are on the executed path in either
> > case and them
> > becoming independent should be a benefit of its own.
> > 
> > > > +/* Return TRUE if USE_STMT is a multiplication of DEF by A.  */
> > > > +
> > > > +static inline bool
> > > > +is_mult_by (gimple *use_stmt, tree def, tree a)
> > > > +{
> > > > +  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
> > > > +      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
> > > > +    {
> > > > +      tree op0 = gimple_assign_rhs1 (use_stmt);
> > > > +      tree op1 = gimple_assign_rhs2 (use_stmt);
> > > > +
> > > > +      return (op0 == def && op1 == a)
> > > > +          || (op0 == a && op1 == def);
> > > > +    }
> > > > +  return 0;
> > > > +}
> > > Seems like is_square_of could now be a light-weight wrapper around this.
> > 
> > Indeed, I've done the wrapping now.
> > 
> > > > @@ -652,6 +669,180 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator
> > > > *def_gsi, tree def)
> > > >     occ_head = NULL;
> > > >   }
> > > >   +/* Transform sequences like
> > > > +   x = 1.0 / sqrt (a);
> > > > +   r1 = x * x;
> > > > +   r2 = a * x;
> > > > +   into:
> > > > +   tmp1 = 1.0 / a;
> > > > +   tmp2 = sqrt (a);
> > > > +   tmp3 = tmp1 * tmp2;
> > > > +   x = tmp3;
> > > > +   r1 = tmp1;
> > > > +   r2 = tmp2;
> > > > +   depending on the uses of x, r1, r2.  This removes one multiplication
> > > > and
> > > > +   allows the sqrt and division operations to execute in parallel.
> > > > +   DEF_GSI is the gsi of the initial division by sqrt that defines
> > > > +   DEF (x in the example abovs).  */
> > > > +
> > > > +static void
> > > > +optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def)
> > > > +{
> > > > +  use_operand_p use_p;
> > > > +  imm_use_iterator use_iter;
> > > > +  gimple *stmt = gsi_stmt (*def_gsi);
> > > > +  tree x = def;
> > > > +  tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt);
> > > > +  tree div_rhs1 = gimple_assign_rhs1 (stmt);
> > > > +
> > > > +  if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME
> > > > +      || TREE_CODE (div_rhs1) != REAL_CST
> > > > +      || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1))
> > > > +    return;
> > > > +
> > > > +  gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name);
> > > > +  if (!is_gimple_call (sqrt_stmt)
> > > > +      || !gimple_call_lhs (sqrt_stmt))
> > > > +    return;
> > > > +
> > > > +  gcall *call = as_a <gcall *> (sqrt_stmt);
> > > Very minor, but:
> > > 
> > >    gcall *sqrt_stmt
> > >      = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name));
> > >    if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt))
> > >      return;
> > > 
> > > would avoid the need for the separate as_a<>, and would mean that
> > > we only call gimple_call_* on gcalls.
> > 
> > Ok.
> > 
> > > > +  if (has_other_use)
> > > > +    {
> > > > +      /* Using the two temporaries tmp1, tmp2 from above
> > > > +     the original x is now:
> > > > +     x = tmp1 * tmp2.  */
> > > > +      gcc_assert (mult_ssa_name);
> > > > +      gcc_assert (sqr_ssa_name);
> > > > +      gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt);
> > > > +
> > > > +      tree new_ssa_name
> > > > +    = make_temp_ssa_name (TREE_TYPE (a), NULL,
> > > > "recip_sqrt_transformed");
> > > > +      gimple *new_stmt
> > > > +    = gimple_build_assign (new_ssa_name, MULT_EXPR,
> > > > +                   mult_ssa_name, sqr_ssa_name);
> > > > +      gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT);
> > > > +      gcc_assert (gsi_stmt (gsi2) == stmt);
> > > > +      gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name);
> > > > +      fold_stmt (&gsi2);
> > > > +      update_stmt (stmt);
> > > In this case we're replacing the statement in its original position,
> > > so there's no real need to use a temporary.  It seems better to
> > > change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same
> > > lhs as before.
> > 
> > Yes, that's cleaner.
> > 
> > > > @@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun)
> > > >         if (optimize_bb_for_size_p (bb))
> > > >           continue;
> > > Seems unnecessary to skip the new optimisation when optimising for size.
> > > Like you say, it saves a multiplication overall.  Also:
> > 
> > Indeed.
> > 
> > > > +      if (flag_unsafe_math_optimizations)
> > > > +    {
> > > > +      for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
> > > > +           !gsi_end_p (gsi);
> > > > +           gsi_next (&gsi))
> > > > +        {
> > > > +          gimple *stmt = gsi_stmt (gsi);
> > > > +
> > > > +          if (gimple_has_lhs (stmt)
> > > > +          && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) != NULL
> > > > +          && FLOAT_TYPE_P (TREE_TYPE (def))
> > > > +          && TREE_CODE (def) == SSA_NAME
> > > > +          && is_gimple_assign (stmt)
> > > > +          && gimple_assign_rhs_code (stmt) == RDIV_EXPR)
> > > > +        optimize_recip_sqrt (&gsi, def);
> > > > +        }
> > > > +    }
> > > It looks like this could safely be done in one of the existing walks
> > > (e.g. the execute_cse_reciprocals_1 one, if we do this when optimising
> > > for size).
> > 
> > You're right. I've moved this into one of the walks above this.
> > 
> > Bootstrapped and tested on aarch64-none-linux-gnu and
> > x86_64-unknown-linux-gnu.
> > CC'ing richi as he's reviewed these kinds of patches in the past.
> > 
> > Is this ok for trunk?

I wonder how it interacts with execute_cse_reciprocals_1 given it
introduces a division 1.0 / a which will be not processed by
execute_cse_reciprocals_1 given that operates by walking uses of 'a'.
That may be just a missed optimization of course.

+  if (has_other_use)
+    {
+      /* Using the two temporaries tmp1, tmp2 from above
+        the original x is now:
+        x = tmp1 * tmp2.  */
+      gcc_assert (mult_ssa_name);
+      gcc_assert (sqr_ssa_name);
+
+      gimple_assign_set_rhs1 (stmt, mult_ssa_name);
+      gimple_assign_set_rhs2 (stmt, sqr_ssa_name);
+      gimple_assign_set_rhs_code (stmt, MULT_EXPR);
+      fold_stmt_inplace (def_gsi);
+      update_stmt (stmt);
+    }

so you are leaving the original stmt in place unchanged even if it is
not used?  Why?  Note that with -fno-call-exceptions this stmt may
throw, so you should arrange to code-generate 1./a in place of the
original division to preserve EH behavior.  Watch out because then
with other uses you have to find a place to insert its computation.

+      if (is_square_of (stmt2, x))
+       {
+         if (!sqr_stmts.contains (stmt2))
+           sqr_stmts.safe_push (stmt2);
+       }

this is quadratic in the number of square stmts...  please consider
making sqr_stmts a bitmap of SSA defs (so the stmt you have now
is then SSA_NAME_DEF_STMT (ssa_name (bitmap-element))).

You do not seem to restrict placement of the use stmts but insert
before the 1/sqrt(a) stmt.  That possibly hoists the multiplications
where they are not needed.  Consider

  x = 1./sqrt(a);
  if (l)
    l1 = x * 3.;
  else if (l2)
    l1 = x * x;
  else if (l3)
    l1 = a * x;

or similar where on the path to x * 3. you now perform two extra
multiplications.

Your testcases do not cover the case of other uses at all.  Or of
EH.

Richard.


> > Thanks,
> > Kyrill
> > 
> > 2018-08-23  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> > 
> >     * tree-ssa-math-opts.c (is_mult_by): New function.
> >     (is_square_of): Use the above.
> >     (optimize_recip_sqrt): New function.
> >     (pass_cse_reciprocals::execute): Use the above.
> > 
> > 2018-08-23  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> > 
> >     * gcc.dg/recip_sqrt_mult_1.c: New test.
> >     * gcc.dg/recip_sqrt_mult_2.c: Likewise.
> >     * gcc.dg/recip_sqrt_mult_3.c: Likewise.
> > 
> > > Thanks,
> > > Richard
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)