From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out1.suse.de (smtp-out1.suse.de [IPv6:2a07:de40:b251:101:10:150:64:1]) by sourceware.org (Postfix) with ESMTPS id 2AE493858D3C for ; Fri, 24 Nov 2023 12:38:56 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 2AE493858D3C Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 2AE493858D3C Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2a07:de40:b251:101:10:150:64:1 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1700829539; cv=none; b=AzixRegdInZ2HWuhsKaTz3uRNadc4qIy/LNPXGFY3sqY6Yt3mjOw3sQnMfpU/yHCaq4vdPEC95Rqg8Gs4OwtOiToyGM4makMJ+t7PGNUxIdMEI9Wh67x6die4rn+SNXejSE0d/5Wo3enbN131LjyxQe4RMWJIrrjtMZ8A1oAg0E= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1700829539; c=relaxed/simple; bh=sX5IVe8GQsaEQQt/8ByhGN9MpWvMH4tKc3AT0RmsqKY=; h=Date:From:To:Subject:Message-ID:MIME-Version; b=W5paFV4owsf1ILXYeHBXCbV5GExvGLb0O9kYKL6mIFI1FtFHVwMmTu07sN9b15vw3ZFXW8evfxzhdLJJaAfhVBvQpK1dd2PqqoAPbQ+HW9qi6FOBxmCnQTjX0n/Ll3oNS4j0XTF884EMJvHINMMUrouFWP230QcmwWWL52ZMftw= ARC-Authentication-Results: i=1; server2.sourceware.org Received: from relay2.suse.de (unknown [149.44.160.134]) by smtp-out1.suse.de (Postfix) with ESMTP id 495B221E70; Fri, 24 Nov 2023 12:38:55 +0000 (UTC) Received: from wotan.suse.de (wotan.suse.de [10.160.0.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 6772B2C17D; Fri, 24 Nov 2023 12:38:53 +0000 (UTC) Date: Fri, 24 Nov 2023 12:38:53 +0000 (UTC) From: Richard Biener To: Tamar Christina cc: "gcc-patches@gcc.gnu.org" , nd , "jlaw@ventanamicro.com" Subject: RE: [PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA form for early breaks In-Reply-To: Message-ID: References: User-Agent: Alpine 2.22 (LSU 394 2020-01-19) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Level: Authentication-Results: smtp-out1.suse.de; none X-Rspamd-Server: rspamd2 X-Spamd-Result: default: False [-4.00 / 50.00]; REPLY(-4.00)[] X-Spam-Score: -4.00 X-Rspamd-Queue-Id: 495B221E70 X-Spam-Status: No, score=-10.7 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_STATUS,KAM_LOTSOFHASH,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, 24 Nov 2023, Tamar Christina wrote: > Hi All, > > Here's an updated patch, which takes a slightly different approach but makes things much easier later on. > > Peeling for early breaks works by redirecting all early break exits to a > single "early break" block and combine them and the normal exit edge together > later in a different block which then goes into the epilog preheader. > > This allows us to re-use all the existing code for IV updates, Additionally this > also enables correct linking for multiple vector epilogues. > > flush_pending_stmts cannot be used in this scenario since it updates the PHI > nodes in the order that they are in the exit destination blocks. This means > they are in CFG visit order. With a single exit this doesn't matter but with > multiple exits with different live values through the different exits the order > usually does not line up. > > Additionally the vectorizer helper functions expect to be able to iterate over > the nodes in the order that they occur in the loop header blocks. This is an > invariant we must maintain. To do this we just inline the work > flush_pending_stmts but maintain the order by using the header blocks to guide > the work. > > The way peeling is done result in LIM noticing that in some cases the condition > and the results are loop invariant and tries to move them out of the loop. > > While the resulting code is operationally sound, moving the compare out of the > gcond results in generating code that no longer branches, so cbranch is no > longer applicable. As such I now add code to check during this motion to see > if the target supports flag setting vector comparison as general operation. > > Because of the change in peeling I now also have to update the BB counts for > the loop exit intermediate block. > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > Ok for master? > > Thanks, > Tamar > > gcc/ChangeLog: > > * gcc/tree-ssa-loop-im.cc (compute_invariantness): Import insn-codes.h > and optabs-tree.h and check for vector compare motion out of gcond. > * tree-vect-loop-manip.cc > (slpeel_tree_duplicate_loop_to_edge_cfg): Peel using intermediate blocks. > (vect_update_ivs_after_vectorizer): Drop assert. > (vect_do_peeling): Correct BB count for new intermediate block. > * tree-vectorizer.h (is_loop_header_bb_p): Drop assert. > (slpeel_tree_duplicate_loop_to_edge_cfg): Update signature. > > --- inline copy of patch --- > > diff --git a/gcc/tree-ssa-loop-im.cc b/gcc/tree-ssa-loop-im.cc > index 396963b6754c7671e2e5404302a69129918555e2..92a9318a1ca0a2da50ff2f29cf271d2e78fddd77 100644 > --- a/gcc/tree-ssa-loop-im.cc > +++ b/gcc/tree-ssa-loop-im.cc > @@ -48,6 +48,8 @@ along with GCC; see the file COPYING3. If not see > #include "tree-dfa.h" > #include "tree-ssa.h" > #include "dbgcnt.h" > +#include "insn-codes.h" > +#include "optabs-tree.h" > > /* TODO: Support for predicated code motion. I.e. > > @@ -1138,6 +1140,24 @@ compute_invariantness (basic_block bb) > continue; > } > > + /* Check if one of the depedent statement is a vector compare whether > + the target supports it, otherwise it's invalid to hoist it out of > + the gcond it belonged to. */ > + for (auto dep_stmt : lim_data->depends) > + { > + if (is_gimple_assign (dep_stmt) > + && VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (dep_stmt)))) > + { > + tree type = TREE_TYPE (gimple_assign_lhs (dep_stmt)); > + auto code = gimple_assign_rhs_code (dep_stmt); > + if (!target_supports_op_p (type, code, optab_vector)) > + pos = MOVE_IMPOSSIBLE; > + } > + } I think it's more natural to handle this in determine_max_movement where we specifically look at the condition we are going to turn into a COND_EXPR condition. I think it's also independent of this series - the issue should be latent, but possibly only triggerable with a GIMPLE testcase. Can you split it out? The rest of the patch is OK. Thanks, Richard. > + > + if (pos == MOVE_IMPOSSIBLE) > + continue; > + > if (dump_file && (dump_flags & TDF_DETAILS)) > { > print_gimple_stmt (dump_file, stmt, 2); > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc > index b9161274ce401a7307f3e61ad23aa036701190d7..0b042b2baf976572af962dd40d5dc311a419ee60 100644 > --- a/gcc/tree-vect-loop-manip.cc > +++ b/gcc/tree-vect-loop-manip.cc > @@ -1403,13 +1403,16 @@ vect_set_loop_condition (class loop *loop, edge loop_e, loop_vec_info loop_vinfo > copies remains the same. > > If UPDATED_DOMS is not NULL it is update with the list of basic blocks whoms > - dominators were updated during the peeling. */ > + dominators were updated during the peeling. When doing early break vectorization > + then LOOP_VINFO needs to be provided and is used to keep track of any newly created > + memory references that need to be updated should we decide to vectorize. */ > > class loop * > slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, > class loop *scalar_loop, > edge scalar_exit, edge e, edge *new_e, > - bool flow_loops) > + bool flow_loops, > + vec *updated_doms) > { > class loop *new_loop; > basic_block *new_bbs, *bbs, *pbbs; > @@ -1526,7 +1529,9 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, > } > > auto loop_exits = get_loop_exit_edges (loop); > + bool multiple_exits_p = loop_exits.length () > 1; > auto_vec doms; > + class loop *update_loop = NULL; > > if (at_exit) /* Add the loop copy at exit. */ > { > @@ -1536,39 +1541,65 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, > flush_pending_stmts (new_exit); > } > > + bool multiple_exits_p = loop_exits.length () > 1; > + basic_block main_loop_exit_block = new_preheader; > + basic_block alt_loop_exit_block = NULL; > + /* Create intermediate edge for main exit. But only useful for early > + exits. */ > + if (multiple_exits_p) > + { > + edge loop_e = single_succ_edge (new_preheader); > + new_preheader = split_edge (loop_e); > + } > + > auto_vec new_phis; > hash_map new_phi_args; > /* First create the empty phi nodes so that when we flush the > statements they can be filled in. However because there is no order > between the PHI nodes in the exits and the loop headers we need to > order them base on the order of the two headers. First record the new > - phi nodes. */ > - for (auto gsi_from = gsi_start_phis (scalar_exit->dest); > + phi nodes. Then redirect the edges and flush the changes. This writes > + out the new SSA names. */ > + for (auto gsi_from = gsi_start_phis (loop_exit->dest); > !gsi_end_p (gsi_from); gsi_next (&gsi_from)) > { > gimple *from_phi = gsi_stmt (gsi_from); > tree new_res = copy_ssa_name (gimple_phi_result (from_phi)); > - gphi *res = create_phi_node (new_res, new_preheader); > + gphi *res = create_phi_node (new_res, main_loop_exit_block); > new_phis.safe_push (res); > } > > - /* Then redirect the edges and flush the changes. This writes out the new > - SSA names. */ > - for (edge exit : loop_exits) > + for (auto exit : loop_exits) > { > - edge temp_e = redirect_edge_and_branch (exit, new_preheader); > - flush_pending_stmts (temp_e); > + basic_block dest = main_loop_exit_block; > + if (exit != loop_exit) > + { > + if (!alt_loop_exit_block) > + { > + alt_loop_exit_block = split_edge (exit); > + edge res = redirect_edge_and_branch ( > + single_succ_edge (alt_loop_exit_block), > + new_preheader); > + flush_pending_stmts (res); > + continue; > + } > + dest = alt_loop_exit_block; > + } > + edge e = redirect_edge_and_branch (exit, dest); > + flush_pending_stmts (e); > } > + > /* Record the new SSA names in the cache so that we can skip materializing > them again when we fill in the rest of the LCSSA variables. */ > for (auto phi : new_phis) > { > - tree new_arg = gimple_phi_arg (phi, 0)->def; > + tree new_arg = gimple_phi_arg (phi, loop_exit->dest_idx)->def; > > if (!SSA_VAR_P (new_arg)) > continue; > + > /* If the PHI MEM node dominates the loop then we shouldn't create > - a new LC-SSSA PHI for it in the intermediate block. */ > + a new LC-SSSA PHI for it in the intermediate block. */ > /* A MEM phi that consitutes a new DEF for the vUSE chain can either > be a .VDEF or a PHI that operates on MEM. And said definition > must not be inside the main loop. Or we must be a parameter. > @@ -1584,6 +1615,9 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, > remove_phi_node (&gsi, true); > continue; > } > + > + /* If we decide to remove the PHI node we should also not > + rematerialize it later on. */ > new_phi_args.put (new_arg, gimple_phi_result (phi)); > > if (TREE_CODE (new_arg) != SSA_NAME) > @@ -1595,34 +1629,68 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, > preheader block and still find the right LC nodes. */ > edge loop_entry = single_succ_edge (new_preheader); > if (flow_loops) > - for (auto gsi_from = gsi_start_phis (loop->header), > - gsi_to = gsi_start_phis (new_loop->header); > - !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to); > - gsi_next (&gsi_from), gsi_next (&gsi_to)) > - { > - gimple *from_phi = gsi_stmt (gsi_from); > - gimple *to_phi = gsi_stmt (gsi_to); > - tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, > - loop_latch_edge (loop)); > + { > + /* Link through the main exit first. */ > + for (auto gsi_from = gsi_start_phis (loop->header), > + gsi_to = gsi_start_phis (new_loop->header); > + !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to); > + gsi_next (&gsi_from), gsi_next (&gsi_to)) > + { > + gimple *from_phi = gsi_stmt (gsi_from); > + gimple *to_phi = gsi_stmt (gsi_to); > + tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, > + loop_latch_edge (loop)); > + > + /* Check if we've already created a new phi node during edge > + redirection. If we have, only propagate the value > + downwards. */ > + if (tree *res = new_phi_args.get (new_arg)) > + { > + if (multiple_exits_p) > + new_arg = *res; > + else > + { > + adjust_phi_and_debug_stmts (to_phi, loop_entry, *res); > + continue; > + } > + } > > - /* Check if we've already created a new phi node during edge > - redirection. If we have, only propagate the value downwards. */ > - if (tree *res = new_phi_args.get (new_arg)) > - { > - adjust_phi_and_debug_stmts (to_phi, loop_entry, *res); > - continue; > - } > + tree new_res = copy_ssa_name (gimple_phi_result (from_phi)); > + gphi *lcssa_phi = create_phi_node (new_res, new_preheader); > > - tree new_res = copy_ssa_name (gimple_phi_result (from_phi)); > - gphi *lcssa_phi = create_phi_node (new_res, new_preheader); > + /* Main loop exit should use the final iter value. */ > + SET_PHI_ARG_DEF (lcssa_phi, loop_exit->dest_idx, new_arg); > > - /* Main loop exit should use the final iter value. */ > - add_phi_arg (lcssa_phi, new_arg, loop_exit, UNKNOWN_LOCATION); > + adjust_phi_and_debug_stmts (to_phi, loop_entry, new_res); > + } > > - adjust_phi_and_debug_stmts (to_phi, loop_entry, new_res); > - } > + set_immediate_dominator (CDI_DOMINATORS, main_loop_exit_block, > + loop_exit->src); > + > + /* Now link the alternative exits. */ > + if (multiple_exits_p) > + { > + set_immediate_dominator (CDI_DOMINATORS, new_preheader, > + main_loop_exit_block); > + for (auto gsi_from = gsi_start_phis (loop->header), > + gsi_to = gsi_start_phis (new_preheader); > + !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to); > + gsi_next (&gsi_from), gsi_next (&gsi_to)) > + { > + gimple *from_phi = gsi_stmt (gsi_from); > + gimple *to_phi = gsi_stmt (gsi_to); > + > + tree alt_arg = gimple_phi_result (from_phi); > + edge main_e = single_succ_edge (alt_loop_exit_block); > + for (edge e : loop_exits) > + if (e != loop_exit) > + SET_PHI_ARG_DEF (to_phi, main_e->dest_idx, alt_arg); > + } > > - set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src); > + set_immediate_dominator (CDI_DOMINATORS, new_preheader, > + loop->header); > + } > + } > > if (was_imm_dom || duplicate_outer_loop) > set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit->src); > @@ -1634,6 +1702,21 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, > delete_basic_block (preheader); > set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header, > loop_preheader_edge (scalar_loop)->src); > + > + /* Finally after wiring the new epilogue we need to update its main exit > + to the original function exit we recorded. Other exits are already > + correct. */ > + if (multiple_exits_p) > + { > + update_loop = new_loop; > + for (edge e : get_loop_exit_edges (loop)) > + doms.safe_push (e->dest); > + doms.safe_push (exit_dest); > + > + /* Likely a fall-through edge, so update if needed. */ > + if (single_succ_p (exit_dest)) > + doms.safe_push (single_succ (exit_dest)); > + } > } > else /* Add the copy at entry. */ > { > @@ -1681,6 +1764,34 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, > delete_basic_block (new_preheader); > set_immediate_dominator (CDI_DOMINATORS, new_loop->header, > loop_preheader_edge (new_loop)->src); > + > + if (multiple_exits_p) > + update_loop = loop; > + } > + > + if (multiple_exits_p) > + { > + for (edge e : get_loop_exit_edges (update_loop)) > + { > + edge ex; > + edge_iterator ei; > + FOR_EACH_EDGE (ex, ei, e->dest->succs) > + { > + /* Find the first non-fallthrough block as fall-throughs can't > + dominate other blocks. */ > + if (single_succ_p (ex->dest)) > + { > + doms.safe_push (ex->dest); > + ex = single_succ_edge (ex->dest); > + } > + doms.safe_push (ex->dest); > + } > + doms.safe_push (e->dest); > + } > + > + iterate_fix_dominators (CDI_DOMINATORS, doms, false); > + if (updated_doms) > + updated_doms->safe_splice (doms); > } > > free (new_bbs); > @@ -2050,7 +2161,6 @@ vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo, > > /* Make sure there exists a single-predecessor exit bb: */ > gcc_assert (single_pred_p (exit_bb)); > - gcc_assert (single_succ_edge (exit_bb) == update_e); > > for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis (update_bb); > !gsi_end_p (gsi) && !gsi_end_p (gsi1); > @@ -3138,6 +3248,11 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, > epilog->force_vectorize = false; > bb_before_epilog = loop_preheader_edge (epilog)->src; > > + /* Fixup the probabities of the new intermediate blocks that we use to connect > + to the merge block. The rest are dealt with via bb_before_epilog > + adjustments. */ > + e->dest->count = e->count (); > + > /* Scalar version loop may be preferred. In this case, add guard > and skip to epilog. Note this only happens when the number of > iterations of loop is unknown at compile time, otherwise this > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index b5e27d1c46d9cb3dfe5b44f1b49c9e4204572ff1..39aa4d1250efe308acccf484d370f8adfd1ba843 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -1821,7 +1821,7 @@ is_loop_header_bb_p (basic_block bb) > { > if (bb == (bb->loop_father)->header) > return true; > - gcc_checking_assert (EDGE_COUNT (bb->preds) == 1); > + > return false; > } > > @@ -2212,7 +2212,8 @@ extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge, > const_edge); > class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *, edge, > class loop *, edge, > - edge, edge *, bool = true); > + edge, edge *, bool = true, > + vec * = NULL); > class loop *vect_loop_versioning (loop_vec_info, gimple *); > extern class loop *vect_do_peeling (loop_vec_info, tree, tree, > tree *, tree *, tree *, int, bool, bool, > -- Richard Biener SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)