Date: Thu, 06 May 2021 15:57:50 +0800
From: guojiufu <guojiufu@linux.ibm.com>
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org, wschmidt@linux.ibm.com, segher@kernel.crashing.org, dje.gcc@gmail.com, jlaw@tachyum.com
Subject: Re: [PATCH] split loop for NE condition.
References: <20210429095048.76239-1-guojiufu@linux.ibm.com>
Message-ID: <78b3f181bf040ae995809d921851a2ee@imap.linux.ibm.com>
List-Id: Gcc-patches mailing list

On 2021-05-03 20:18, Richard Biener wrote:
> On Thu, 29 Apr 2021, Jiufu Guo wrote:
>
>> When there is the possibility that overflow may happen on the loop
>> index, a few optimizations would not happen.  For example code:
>>
>> foo (int *a, int *b, unsigned k, unsigned n)
>> {
>>   while (++k != n)
>>     a[k] = b[k] + 1;
>> }
>>
>> For this code, if "k > n", overflow may happen.  If "k < n" at the
>> beginning, it could be optimized (e.g. vectorization).
>>
>> We can split the loop into two loops:
>>
>>   while (++k > n)
>>     a[k] = b[k] + 1;
>>   while (++k < n)
>>     a[k] = b[k] + 1;
>>
>> then the second loop could be optimized.
>>
>> This patch splits this kind of small loop to achieve better
>> performance.
>>
>> Bootstrap and regtest pass on ppc64le.  Is this ok for trunk?
>
> Do you have any statistics on how often this splits a loop during
> bootstrap (use --with-build-config=bootstrap-O3)?  Or alternatively
> on SPEC?

In SPEC2017, ~240 loops are split.  And I saw some performance
improvement on xz.  I will try bootstrap-O3 (it currently encounters
an ICE).

>
> Actual comments on the patch inline.
>
>> Thanks!
>>
>> Jiufu Guo.
>>
>> gcc/ChangeLog:
>>
>> 2021-04-29  Jiufu Guo  <guojiufu@linux.ibm.com>
>>
>> 	* params.opt (max-insns-ne-cond-split): New.
>> 	* tree-ssa-loop-split.c (connect_loop_phis): Add new param.
>> 	(get_ne_cond_branch): New function.
>> 	(split_ne_loop): New function.
>> 	(split_loop_on_ne_cond): New function.
>> 	(tree_ssa_split_loops): Use split_loop_on_ne_cond.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2021-04-29  Jiufu Guo  <guojiufu@linux.ibm.com>
>>
>> 	* gcc.dg/loop-split1.c: New test.
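As an aside, the equivalence of the split form above can be checked
exhaustively with an 8-bit index (an editor's simulation sketch, not
part of the patch; the function names are illustrative only).  The
second loop restarts from the previous index value so the increment
is re-run, which models what `connect_loop_phis` with `use_prev =
true` arranges in the patch:

```c
#include <assert.h>
#include <string.h>

/* Original loop: run a[k] = b[k] + 1 for k+1, k+2, ... (mod 256)
   until the incremented index equals n.  */
static void
ne_loop (int *a, const int *b, unsigned char k, unsigned char n)
{
  for (;;)
    {
      k = (unsigned char) (k + 1);
      if (k == n)
        break;
      a[k] = b[k] + 1;
    }
}

/* Split form: the first loop covers the wrapped range (index > n),
   the second the simple upward range (index < n).  */
static void
split_ne_loop_sim (int *a, const int *b, unsigned char k, unsigned char n)
{
  for (;;)
    {
      k = (unsigned char) (k + 1);
      if (!(k > n))
        break;
      a[k] = b[k] + 1;
    }
  k = (unsigned char) (k - 1);  /* Restart from the previous value.  */
  for (;;)
    {
      k = (unsigned char) (k + 1);
      if (!(k < n))
        break;
      a[k] = b[k] + 1;
    }
}

/* Return nonzero iff both versions produce identical stores for the
   given start index and bound.  */
static int
equivalent (unsigned k0, unsigned n)
{
  int a1[256], a2[256], b[256];
  for (int i = 0; i < 256; i++)
    {
      b[i] = i;
      a1[i] = a2[i] = -1;
    }
  ne_loop (a1, b, (unsigned char) k0, (unsigned char) n);
  split_ne_loop_sim (a2, b, (unsigned char) k0, (unsigned char) n);
  return memcmp (a1, a2, sizeof a1) == 0;
}
```

Checking all 256 x 256 combinations of start value and bound covers
both the wrapping and the non-wrapping case.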
>>
>> ---
>>  gcc/params.opt                     |   4 +
>>  gcc/testsuite/gcc.dg/loop-split1.c |  28 ++++
>>  gcc/tree-ssa-loop-split.c          | 219 ++++++++++++++++++++++++++++-
>>  3 files changed, 247 insertions(+), 4 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/loop-split1.c
>>
>> diff --git a/gcc/params.opt b/gcc/params.opt
>> index 2e4cbdd7a71..900b59b5136 100644
>> --- a/gcc/params.opt
>> +++ b/gcc/params.opt
>> @@ -766,6 +766,10 @@ Min. ratio of insns to prefetches to enable prefetching for a loop with an unkno
>>  Common Joined UInteger Var(param_min_loop_cond_split_prob) Init(30) IntegerRange(0, 100) Param Optimization
>>  The minimum threshold for probability of semi-invariant condition statement to trigger loop split.
>>
>> +-param=max-insns-ne-cond-split=
>> +Common Joined UInteger Var(param_max_insn_ne_cond_split) Init(64) Param Optimization
>> +The maximum threshold for insnstructions number of a loop with ne condition to split.
>> +
>>  -param=min-nondebug-insn-uid=
>>  Common Joined UInteger Var(param_min_nondebug_insn_uid) Param
>>  The minimum UID to be used for a nondebug insn.
>> diff --git a/gcc/testsuite/gcc.dg/loop-split1.c b/gcc/testsuite/gcc.dg/loop-split1.c
>> new file mode 100644
>> index 00000000000..4c466aa9f54
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/loop-split1.c
>> @@ -0,0 +1,28 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
>> +
>> +void
>> +foo (int *a, int *b, unsigned l, unsigned n)
>> +{
>> +  while (++l != n)
>> +    a[l] = b[l] + 1;
>> +}
>> +
>> +void
>> +foo1 (int *a, int *b, unsigned l, unsigned n)
>> +{
>> +  while (l++ != n)
>> +    a[l] = b[l] + 1;
>> +}
>> +
>> +unsigned
>> +foo2 (char *a, char *b, unsigned l, unsigned n)
>> +{
>> +  while (++l != n)
>> +    if (a[l] != b[l])
>> +      break;
>> +
>> +  return l;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Loop split" 3 "lsplit" } } */
>> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
>> index b80b6a75e62..a6d28078e5e 100644
>> --- a/gcc/tree-ssa-loop-split.c
>> +++ b/gcc/tree-ssa-loop-split.c
>> @@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "cfghooks.h"
>>  #include "gimple-fold.h"
>>  #include "gimplify-me.h"
>> +#include "tree-ssa-loop-ivopts.h"
>>
>>  /* This file implements two kinds of loop splitting.
>>
>> @@ -233,7 +234,8 @@ easy_exit_values (class loop *loop)
>>     this.  The loops need to fulfill easy_exit_values().  */
>>
>>  static void
>> -connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
>> +connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e,
>> +		   bool use_prev = false)
>>  {
>>    basic_block rest = loop_preheader_edge (loop2)->src;
>>    gcc_assert (new_e->dest == rest);
>> @@ -248,13 +250,14 @@ connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
>>         !gsi_end_p (psi_first);
>>         gsi_next (&psi_first), gsi_next (&psi_second))
>>      {
>> -      tree init, next, new_init;
>> +      tree init, next, new_init, prev;
>>        use_operand_p op;
>>        gphi *phi_first = psi_first.phi ();
>>        gphi *phi_second = psi_second.phi ();
>>
>>        init = PHI_ARG_DEF_FROM_EDGE (phi_first, firste);
>>        next = PHI_ARG_DEF_FROM_EDGE (phi_first, firstn);
>> +      prev = PHI_RESULT (phi_first);
>>        op = PHI_ARG_DEF_PTR_FROM_EDGE (phi_second, seconde);
>>        gcc_assert (operand_equal_for_phi_arg_p (init, USE_FROM_PTR (op)));
>>
>> @@ -279,7 +282,7 @@ connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
>>
>>        gphi * newphi = create_phi_node (new_init, rest);
>>        add_phi_arg (newphi, init, skip_first, UNKNOWN_LOCATION);
>> -      add_phi_arg (newphi, next, new_e, UNKNOWN_LOCATION);
>> +      add_phi_arg (newphi, use_prev ? prev : next, new_e, UNKNOWN_LOCATION);
>>        SET_USE (op, new_init);
>>      }
>>  }
>> @@ -1599,6 +1602,213 @@ split_loop_on_cond (struct loop *loop)
>>    return do_split;
>>  }
>>
>> +/* Check if the LOOP exit branch likes "if (idx != bound)".
>> +   if INV is not NULL and the branch is "if (bound != idx)", set *INV to true.
>
> If

Ok, I will update accordingly.

>
>> +   return the branch edge which exit loop.  */
>
> Return

Thanks.

>
>> +
>> +static edge
>> +get_ne_cond_branch (struct loop *loop, bool *inv)
>> +{
>> +  int i;
>> +  edge e;
>> +
>> +  auto_vec<edge> edges = get_loop_exit_edges (loop);
>> +  FOR_EACH_VEC_ELT (edges, i, e)
>> +    {
>> +      basic_block bb = e->src;
>> +
>> +      /* Check gcond.
>> +	 */
>> +      gimple *last = last_stmt (bb);
>> +      if (!last || gimple_code (last) != GIMPLE_COND)
>> +	continue;
>> +      gcond *cond = as_a <gcond *> (last);
>> +      enum tree_code code = gimple_cond_code (cond);
>> +      if (code != NE_EXPR)
>> +	continue;
>
> I'm not sure we canonicalize the case with code == EQ_EXPR,
> at least for
>
> void bar();
> void foo(unsigned n)
> {
>   unsigned i = 0;
>   do
>     {
>       if (i == n)
>         return;
>       bar();
>       ++i;
>     }
>   while (1);
> }
>
> we don't.  Since you return the exit edge, can this case be
> handled transparently?

Oh, this case was not handled; the patch can be enhanced to handle
this kind of case.

>
>> +
>> +      /* Make sure idx and bound.  */
>> +      tree idx = gimple_cond_lhs (cond);
>> +      tree bnd = gimple_cond_rhs (cond);
>> +      if (expr_invariant_in_loop_p (loop, idx))
>> +	{
>> +	  std::swap (idx, bnd);
>> +	  if (inv)
>> +	    *inv = true;
>> +	}
>> +      else if (!expr_invariant_in_loop_p (loop, bnd))
>> +	continue;
>
> We canonicalize i < UINT_MAX to i != UINT_MAX so you want to
> detect that and not split the loop if 'bnd' is the maximum
> or minimum value of the type, I think.

Yeap!

>
>> +      /* Extract conversion.  */
>> +      if (TREE_CODE (idx) == SSA_NAME)
>> +	{
>> +	  gimple *stmt = SSA_NAME_DEF_STMT (idx);
>> +	  if (is_gimple_assign (stmt)
>> +	      && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt))
>> +	      && flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
>> +	    idx = gimple_assign_rhs1 (stmt);
>> +	}
>
> This skips arbitrary extensions and truncations - is that intended?

Yes.  This could handle code like:

  _6 = (sizetype) idx;
  if (_6 != n)

>
>> +      /* Make sure idx is iv.  */
>> +      class loop *useloop = loop_containing_stmt (cond);
>> +      affine_iv iv;
>> +      if (!simple_iv (loop, useloop, idx, &iv, false))
>> +	continue;
>> +
>> +      /* No need to split loop, if base is know value.
>> +	 Or check range info.  */
>> +      if (TREE_CODE (iv.base) == INTEGER_CST)
>> +	continue;
>
> I think it would be better to check iv.no_overflow?  Also looking
> it might be possible to use simple_iv_with_niters with IV_NITERS
> not NULL for most of this analysis?

Yes, it may be better to use iv.no_overflow directly.  simple_iv
invokes simple_iv_with_niters with IV_NITERS = NULL.  Because a wrap
may happen, IV_NITERS may not be computable.  Thanks!

>
>> +      /* There is type conversion on idx (or rhs of idx's def).
>> +	 And there is converting shorter to longer type. */
>> +      tree type = TREE_TYPE (idx);
>> +      if (!INTEGRAL_TYPE_P (type) || TREE_CODE (idx) != SSA_NAME
>> +	  || !TYPE_UNSIGNED (type)
>> +	  || TYPE_PRECISION (type) == TYPE_PRECISION (sizetype))
>> +	continue;
>
> That check can be done before the (expensive) simple_iv check.

Ok, will update accordingly.

> I wonder what the 'sizetype' precision check is about?  The
> function level comment should probably clarify what kind of
> conversions we handle (and why).

Currently, this patch cares more about the conversions which may
generate ext/trunc stmts.  I will try to remove this check, since we
may split the loop whenever there is a possible wrap/overflow.

>
>> +      /* Check loop is simple to split.  */
>> +      gcc_assert (bb != loop->latch);
>> +
>> +      if (single_pred_p (loop->latch)
>> +	  && single_pred_edge (loop->latch)->src == bb
>> +	  && empty_block_p (loop->latch))
>> +	return e;
>> +
>> +      /* Splitting is cheap for idx increase header.  */
>> +      if (bb == loop->header)
>> +	{
>> +	  if (get_virtual_phi (loop->header))
>> +	    continue;
>> +
>> +	  /* In loop header: i++ or ++i.  */
>> +	  gimple_stmt_iterator gsi = gsi_start_bb (bb);
>> +	  if (gsi_end_p (gsi))
>> +	    return e;
>> +
>> +	  gimple *s1 = gsi_stmt (gsi);
>> +	  if (!(is_gimple_assign (s1)
>> +		&& (idx == gimple_assign_lhs (s1)
>> +		    || idx == gimple_assign_rhs1 (s1))))
>> +	    continue;
>> +
>> +	  gsi_next (&gsi);
>> +	  if (!gsi_end_p (gsi) && gsi_stmt (gsi) == cond)
>> +	    return e;
>> +	}
>
> I wonder if these "cheapness" heuristics should simply fold
> into the cost of the extra duplication of the header/tail
> in the overall stmt limit?
Without this heuristic, the loop can be split only if the branch to
exit is located at the end of the loop (just before the empty latch).
With the heuristic, the loop can also be split if the branch is in a
simple header.  And I feel the cost of re-running the duplicated
simple header is small: maybe just 'move' instructions, or one add
instruction.

>
>> +    }
>> +
>> +  return NULL;
>> +}
>> +
>> +/* Split the LOOP with NE_EXPR into two loops with GT_EXPR and LT_EXPR.  */
>> +
>> +static bool
>> +split_ne_loop (struct loop *loop, edge cond_e)
>> +{
>> +  initialize_original_copy_tables ();
>> +
>> +  struct loop *loop2 = loop_version (loop, boolean_true_node, NULL,
>> +				     profile_probability::always (),
>> +				     profile_probability::never (),
>> +				     profile_probability::always (),
>> +				     profile_probability::always (), true);
>> +
>> +  gcc_assert (loop2);
>> +  update_ssa (TODO_update_ssa);
>> +
>> +  free_original_copy_tables ();
>> +
>> +  /* Change if (i != n) to LOOP1:if (i > n) and LOOP2:if (i < n) */
>> +  bool inv = false;
>> +  edge dup_cond = get_ne_cond_branch (loop2, &inv);
>
> I don't think you should rely on pattern-matching to detect the same
> condition in the versioned loop - instead you can use the copy
> tables, do
>
>   2nd_loop_exit_block = get_bb_copy (cond_e->src);
>
> to get to the block with the COND_EXPR (before free_original_copy_tables
> obviously).

Thanks!  Your suggestion is great; it will save a lot of time to get
the new exit branch.

>
>> +  enum tree_code up_code = inv ? LT_EXPR : GT_EXPR;
>> +  enum tree_code down_code = inv ? GT_EXPR : LT_EXPR;
>> +
>> +  gcond *gc = as_a <gcond *> (last_stmt (cond_e->src));
>> +  gimple_cond_set_code (gc, up_code);
>> +
>> +  gcond *dup_gc = as_a <gcond *> (last_stmt (dup_cond->src));
>> +  gimple_cond_set_code (dup_gc, down_code);
>> +
>> +  /* Link the exit cond edge to new loop.  */
>> +  gcond *break_cond = as_a <gcond *> (gimple_copy (gc));
>> +  edge pred_e = single_pred_edge (loop->latch);
>> +  gcc_assert (pred_e);
>> +  bool simple_loop = pred_e->src == cond_e->src && empty_block_p (loop->latch);
>> +  if (simple_loop)
>> +    gimple_cond_set_code (break_cond, down_code);
>> +  else
>> +    gimple_cond_make_true (break_cond);
>> +
>> +  basic_block break_bb = split_edge (cond_e);
>> +  gimple_stmt_iterator gsi = gsi_last_bb (break_bb);
>> +  gsi_insert_after (&gsi, break_cond, GSI_NEW_STMT);
>> +
>> +  edge to_exit = single_succ_edge (break_bb);
>> +  edge to_new_loop = make_edge (break_bb, loop_preheader_edge (loop2)->src, 0);
>> +  to_new_loop->flags |= EDGE_TRUE_VALUE;
>> +  to_exit->flags |= EDGE_FALSE_VALUE;
>> +  to_exit->flags &= ~EDGE_FALLTHRU;
>> +  to_exit->probability = cond_e->probability;
>> +  to_new_loop->probability = to_exit->probability.invert ();
>> +
>> +  update_ssa (TODO_update_ssa);
>> +
>> +  connect_loop_phis (loop, loop2, to_new_loop, !simple_loop);
>> +
>> +  rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_USE, loop);
>> +  if (dump_file && (dump_flags & TDF_DETAILS))
>> +    fprintf (dump_file, ";; Loop split.\n");
>
> Maybe ";; Loop split on != condition.\n"?

Ok!

>
>> +
>> +  return true;
>> +}
>> +
>> +/* Checks if LOOP contains a suitable NE_EXPR conditional block to split.
>> +L_H:
>> +   if (i != N)
>> +     S;
>> +   i++;
>> +   goto L_H;
>> +
>> +The "i != N" is like "i > N || i < N", so it can be transformed into:
>> +
>> +L_H:
>> +   if (i > N)
>> +     S;
>> +   i++;
>> +   goto L_H;
>> +L1_H:
>> +   if (i < N)
>> +     S;
>> +   i++;
>> +   goto L1_H;
>> +
>> +The loop with "i < N" is a simple loop and can be optimized.  */
>> +
>> +static bool
>> +split_loop_on_ne_cond (class loop *loop)
>> +{
>> +  if (!can_duplicate_loop_p (loop))
>> +    return false;
>> +
>> +  int num = 0;
>> +  basic_block *bbs = get_loop_body (loop);
>
> To avoid repeated DFS walks of the loop body do the can_duplicate_loop_p
> check here as
>
>   if (!can_copy_bbs_p (bbs, loop->num_nodes))
>     {
>       free (bbs);
>       return false;
>     }
>
> (see split_loop)

Thanks!
This could also save GCC runtime.

>> +  for (unsigned i = 0; i < loop->num_nodes; i++)
>> +    num += estimate_num_insns_seq (bb_seq (bbs[i]), &eni_size_weights);
>> +  free (bbs);
>
> So with this and the suggestion above it is maybe possible to re-use
> compute_added_num_insns?  That already seems to handle splitting at
> arbitrary branches (but maybe not loops with multiple exits?).
> The code using this computation uses param_max_peeled_insns - is
> that limit not sufficient for your case (to avoid another param?)?
> Its default is 100, a bit higher than yours (64).

This patch uses this simple code to calculate and does not re-use
compute_added_num_insns, because not many stmts in the loop would be
deleted.  I introduced a new param because I think it can be used
independently to control different optimization parts.  Also, the
name param_max_peeled_insns seems designed for other behavior. :)
Do we prefer to reuse existing params for new optimizations?  If so,
we can reuse it.

>
> I'd really like to see some numbers on how much this triggers since
> the splitting itself is O (number-of-split-loops * function-size)
> complexity and thus worse than quadratic as we do update_ssa for each
> split loop.  You can experience this by simply concatting one of your
> testcase loops N times in a function ...

If I duplicate the test case loop 15 times in the function, it
triggers 15 ";; Loop split" messages.

Copies  Compile time  Delta
    4      0m0.106s
    8      0m0.186s   0.080s
   12      0m0.267s   0.082s
   16      0m0.350s   0.083s
   20      0m0.434s   0.084s
   24      0m0.522s   0.088s
   28      0m0.609s   0.087s
   32      0m0.705s   0.086s

It seems near linear complexity.

Thanks a lot for the review!

Jiufu Guo.

>
>> +  if (num > param_max_insn_ne_cond_split)
>> +    return false;
>> +
>> +  edge branch_edge = get_ne_cond_branch (loop, NULL);
>> +  if (branch_edge && split_ne_loop (loop, branch_edge))
>> +    return true;
>> +
>> +  return false;
>> +}
>> +
>> +/* Main entry point.  Perform loop splitting on all suitable loops.
>>  */
>>
>>  static unsigned int
>> @@ -1628,7 +1838,8 @@ tree_ssa_split_loops (void)
>>        if (optimize_loop_for_size_p (loop))
>>  	continue;
>>
>> -      if (split_loop (loop) || split_loop_on_cond (loop))
>> +      if (split_loop (loop) || split_loop_on_cond (loop)
>> +	  || split_loop_on_ne_cond (loop))
>>  	{
>>  	  /* Mark our containing loop as having had some split inner loops.  */
>>  	  loop_outer (loop)->aux = loop;
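P.S. To make the wrap concern above concrete: the trip count of
"while (++k != n)" is (n - k - 1) only modulo the type's precision,
which is why niter analysis (and thus vectorization) must give up
unless it can prove k < n.  A small editor's sketch (not part of the
patch; the function name is illustrative) checks this identity
exhaustively for an 8-bit index:

```c
#include <assert.h>

/* Count how many times the body of "while (++k != n) body;" runs,
   with 8-bit wrap-around semantics.  */
static unsigned
body_count (unsigned char k, unsigned char n)
{
  unsigned c = 0;
  for (;;)
    {
      k = (unsigned char) (k + 1);
      if (k == n)
        break;
      c++;
    }
  return c;
}
```

For every start value and bound, the count equals (n - k - 1)
truncated to 8 bits, covering both the simple case (k < n) and the
wrapped case (k >= n).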