From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 99407 invoked by alias); 31 Aug 2018 11:07:27 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 99395 invoked by uid 89); 31 Aug 2018 11:07:26 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=0.8 required=5.0 tests=BAYES_50,KAM_SHORT,SPF_PASS autolearn=ham version=3.3.2 spammy=as_a, ccing, is_square_of, msg01496html X-HELO: mx1.suse.de Received: from mx2.suse.de (HELO mx1.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 31 Aug 2018 11:07:24 +0000 Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id E9EFDAFD2; Fri, 31 Aug 2018 11:07:20 +0000 (UTC) Date: Fri, 31 Aug 2018 11:07:00 -0000 From: Richard Biener To: Kyrill Tkachov cc: "gcc-patches@gcc.gnu.org" , richard.sandiford@arm.com Subject: Re: [PATCH] Optimise sqrt reciprocal multiplications In-Reply-To: <5B87B55F.3020707@foss.arm.com> Message-ID: References: <5B7C371C.8040207@foss.arm.com> <87bm9ttmq5.fsf@arm.com> <5B7EEA31.6000908@foss.arm.com> <5B87B55F.3020707@foss.arm.com> User-Agent: Alpine 2.20 (LSU 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-SW-Source: 2018-08/txt/msg02024.txt.bz2 On Thu, 30 Aug 2018, Kyrill Tkachov wrote: > Ping. > > https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01496.html > > Thanks, > Kyrill > > On 23/08/18 18:09, Kyrill Tkachov wrote: > > Hi Richard, > > > > On 23/08/18 11:13, Richard Sandiford wrote: > > > Kyrill Tkachov writes: > > > > Hi all, > > > > > > > > This patch aims to optimise sequences involving uses of 1.0 / sqrt (a) > > > > under -freciprocal-math and -funsafe-math-optimizations. > > > > In particular consider: > > > > > > > > x = 1.0 / sqrt (a); > > > > r1 = x * x; // same as 1.0 / a > > > > r2 = a * x; // same as sqrt (a) > > > > > > > > If x, r1 and r2 are all used further on in the code, this can be > > > > transformed into: > > > > tmp1 = 1.0 / a > > > > tmp2 = sqrt (a) > > > > tmp3 = tmp1 * tmp2 > > > > x = tmp3 > > > > r1 = tmp1 > > > > r2 = tmp2 > > > Nice optimisation :-) Someone who knows the pass better should review, > > > but: > > > > Thanks for the review. > > > > > There seems to be an implicit assumption that this is a win even > > > when the r1 and r2 assignments are only conditionally executed. > > > That's probably true, but it might be worth saying explicitly. > > > > I'll admit I had not considered that case. > > I think it won't make a difference in practice, as the really expensive > > operations here > > are the sqrt and the division and they are on the executed path in either > > case and them > > becoming independent should be a benefit of its own. > > > > > > +/* Return TRUE if USE_STMT is a multiplication of DEF by A. */ > > > > + > > > > +static inline bool > > > > +is_mult_by (gimple *use_stmt, tree def, tree a) > > > > +{ > > > > + if (gimple_code (use_stmt) == GIMPLE_ASSIGN > > > > + && gimple_assign_rhs_code (use_stmt) == MULT_EXPR) > > > > + { > > > > + tree op0 = gimple_assign_rhs1 (use_stmt); > > > > + tree op1 = gimple_assign_rhs2 (use_stmt); > > > > + > > > > + return (op0 == def && op1 == a) > > > > + || (op0 == a && op1 == def); > > > > + } > > > > + return 0; > > > > +} > > > Seems like is_square_of could now be a light-weight wrapper around this. > > > > Indeed, I've done the wrapping now. > > > > > > @@ -652,6 +669,180 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator > > > > *def_gsi, tree def) > > > > occ_head = NULL; > > > > } > > > > +/* Transform sequences like > > > > + x = 1.0 / sqrt (a); > > > > + r1 = x * x; > > > > + r2 = a * x; > > > > + into: > > > > + tmp1 = 1.0 / a; > > > > + tmp2 = sqrt (a); > > > > + tmp3 = tmp1 * tmp2; > > > > + x = tmp3; > > > > + r1 = tmp1; > > > > + r2 = tmp2; > > > > + depending on the uses of x, r1, r2. This removes one multiplication > > > > and > > > > + allows the sqrt and division operations to execute in parallel. > > > > + DEF_GSI is the gsi of the initial division by sqrt that defines > > > > + DEF (x in the example abovs). */ > > > > + > > > > +static void > > > > +optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def) > > > > +{ > > > > + use_operand_p use_p; > > > > + imm_use_iterator use_iter; > > > > + gimple *stmt = gsi_stmt (*def_gsi); > > > > + tree x = def; > > > > + tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt); > > > > + tree div_rhs1 = gimple_assign_rhs1 (stmt); > > > > + > > > > + if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME > > > > + || TREE_CODE (div_rhs1) != REAL_CST > > > > + || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1)) > > > > + return; > > > > + > > > > + gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name); > > > > + if (!is_gimple_call (sqrt_stmt) > > > > + || !gimple_call_lhs (sqrt_stmt)) > > > > + return; > > > > + > > > > + gcall *call = as_a (sqrt_stmt); > > > Very minor, but: > > > > > > gcall *sqrt_stmt > > > = dyn_cast (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name)); > > > if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt)) > > > return; > > > > > > would avoid the need for the separate as_a<>, and would mean that > > > we only call gimple_call_* on gcalls. > > > > Ok. > > > > > > + if (has_other_use) > > > > + { > > > > + /* Using the two temporaries tmp1, tmp2 from above > > > > + the original x is now: > > > > + x = tmp1 * tmp2. */ > > > > + gcc_assert (mult_ssa_name); > > > > + gcc_assert (sqr_ssa_name); > > > > + gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt); > > > > + > > > > + tree new_ssa_name > > > > + = make_temp_ssa_name (TREE_TYPE (a), NULL, > > > > "recip_sqrt_transformed"); > > > > + gimple *new_stmt > > > > + = gimple_build_assign (new_ssa_name, MULT_EXPR, > > > > + mult_ssa_name, sqr_ssa_name); > > > > + gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT); > > > > + gcc_assert (gsi_stmt (gsi2) == stmt); > > > > + gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name); > > > > + fold_stmt (&gsi2); > > > > + update_stmt (stmt); > > > In this case we're replacing the statement in its original position, > > > so there's no real need to use a temporary. It seems better to > > > change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same > > > lhs as before. > > > > Yes, that's cleaner. > > > > > > @@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun) > > > > if (optimize_bb_for_size_p (bb)) > > > > continue; > > > Seems unnecessary to skip the new optimisation when optimising for size. > > > Like you say, it saves a multiplication overall. Also: > > > > Indeed. > > > > > > + if (flag_unsafe_math_optimizations) > > > > + { > > > > + for (gimple_stmt_iterator gsi = gsi_after_labels (bb); > > > > + !gsi_end_p (gsi); > > > > + gsi_next (&gsi)) > > > > + { > > > > + gimple *stmt = gsi_stmt (gsi); > > > > + > > > > + if (gimple_has_lhs (stmt) > > > > + && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) != NULL > > > > + && FLOAT_TYPE_P (TREE_TYPE (def)) > > > > + && TREE_CODE (def) == SSA_NAME > > > > + && is_gimple_assign (stmt) > > > > + && gimple_assign_rhs_code (stmt) == RDIV_EXPR) > > > > + optimize_recip_sqrt (&gsi, def); > > > > + } > > > > + } > > > It looks like this could safely be done in one of the existing walks > > > (e.g. the execute_cse_reciprocals_1 one, if we do this when optimising > > > for size). > > > > You're right. I've moved this into one of the walks above this. > > > > Bootstrapped and tested on aarch64-none-linux-gnu and > > x86_64-unknown-linux-gnu. > > CC'ing richi as he's reviewed these kinds of patches in the past. > > > > Is this ok for trunk? I wonder how it interacts with execute_cse_reciprocals_1 given it introduces a division 1.0 / a which will be not processed by execute_cse_reciprocals_1 given that operates by walking uses of 'a'. That may be just a missed optimization of course. + if (has_other_use) + { + /* Using the two temporaries tmp1, tmp2 from above + the original x is now: + x = tmp1 * tmp2. */ + gcc_assert (mult_ssa_name); + gcc_assert (sqr_ssa_name); + + gimple_assign_set_rhs1 (stmt, mult_ssa_name); + gimple_assign_set_rhs2 (stmt, sqr_ssa_name); + gimple_assign_set_rhs_code (stmt, MULT_EXPR); + fold_stmt_inplace (def_gsi); + update_stmt (stmt); + } so you are leaving the original stmt in place unchanged even if it is not used? Why? Note that with -fno-call-exceptions this stmt may throw, so you should arrange to code-generate 1./a in place of the original division to preserve EH behavior. Watch out because then with other uses you have to find a place to insert its computation. + if (is_square_of (stmt2, x)) + { + if (!sqr_stmts.contains (stmt2)) + sqr_stmts.safe_push (stmt2); + } this is quadratic in the number of square stmts... please consider making sqr_stmts a bitmap of SSA defs (so the stmt you have now is then SSA_NAME_DEF_STMT (ssa_name (bitmap-element))). You do not seem to restrict placement of the use stmts but insert before the 1/sqrt(a) stmt. That possibly hoists the multiplications where they are not needed. Consider x = 1./sqrt(a); if (l) l1 = x * 3.; else if (l2) l1 = x * x; else if (l3) l1 = a * x; or similar where on the path to x * 3. you now perform two extra multiplications. Your testcases do not cover the case of other uses at all. Or of EH. Richard. > > Thanks, > > Kyrill > > > > 2018-08-23 Kyrylo Tkachov > > > > * tree-ssa-math-opts.c (is_mult_by): New function. > > (is_square_of): Use the above. > > (optimize_recip_sqrt): New function. > > (pass_cse_reciprocals::execute): Use the above. > > > > 2018-08-23 Kyrylo Tkachov > > > > * gcc.dg/recip_sqrt_mult_1.c: New test. > > * gcc.dg/recip_sqrt_mult_2.c: Likewise. > > * gcc.dg/recip_sqrt_mult_3.c: Likewise. > > > > > Thanks, > > > Richard > > > > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)