Hi, On 2021/9/28 20:09, Richard Biener wrote: > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo wrote: >> >> Update the patch to v3, not sure whether you prefer the paste style >> and continue to link the previous thread as Segher dislikes this... >> >> >> [PATCH v3] Don't move cold code out of loop by checking bb count >> >> >> Changes: >> 1. Handle max_loop in determine_max_movement instead of >> outermost_invariant_loop. >> 2. Remove unnecessary changes. >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused >> infinite loop when implementing v1 and the iteration is missed to be >> updated actually. >> >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html >> >> There was a patch trying to avoid move cold block out of loop: >> >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >> >> Richard suggested to "never hoist anything from a bb with lower execution >> frequency to a bb with higher one in LIM invariantness_dom_walker >> before_dom_children". >> >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to >> expected target loop, if profile count of the loop bb is colder >> than target loop preheader, it won't be hoisted out of loop. >> Likely for store motion, if all locations of the REF in loop is cold, >> don't do store motion of it. >> >> SPEC2017 performance evaluation shows 1% performance improvement for >> intrate GEOMEAN and no obvious regression for others. Especially, >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >> on P8LE. >> >> gcc/ChangeLog: >> >> * loop-invariant.c (find_invariants_bb): Check profile count >> before motion. >> (find_invariants_body): Add argument. >> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. >> (determine_max_movement): Use find_coldest_out_loop. >> (move_computations_worker): Adjust and fix iteration udpate. >> (execute_sm_exit): Check pointer validness. >> (class ref_in_loop_hot_body): New functor. >> (ref_in_loop_hot_body::operator): New. >> (can_sm_ref_p): Use for_all_locs_in_loop. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.dg/tree-ssa/recip-3.c: Adjust. >> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. >> --- >> gcc/loop-invariant.c | 10 ++-- >> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- >> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ >> 7 files changed, 165 insertions(+), 8 deletions(-) >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c >> >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c >> index fca0c2b24be..5c3be7bf0eb 100644 >> --- a/gcc/loop-invariant.c >> +++ b/gcc/loop-invariant.c >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) >> call. */ >> >> static void >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, >> + bool always_executed) >> { >> rtx_insn *insn; >> + basic_block preheader = loop_preheader_edge (loop)->src; >> + >> + if (preheader->count > bb->count) >> + return; >> >> FOR_BB_INSNS (bb, insn) >> { >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, >> unsigned i; >> >> for (i = 0; i < loop->num_nodes; i++) >> - find_invariants_bb (body[i], >> - bitmap_bit_p (always_reached, i), >> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), >> bitmap_bit_p (always_executed, i)); >> } >> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >> index 4b187c2cdaf..655fab03442 100644 >> --- a/gcc/tree-ssa-loop-im.c >> +++ b/gcc/tree-ssa-loop-im.c >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) >> return ret; >> } >> >> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ >> + >> +static class loop * >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >> + basic_block curr_bb) >> +{ >> + class loop *cold_loop, *min_loop; >> + cold_loop = min_loop = outmost_loop; >> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; >> + >> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) > > Honza - can you comment on whether we should compare BB counts this way? > > I would suspect that for, say, > > for (...) > if (a) > X; > else > Y; > > that the counts for X and Y will be less than that of the preheader of the loop > only when the loop is estimated to run once. That is, should we really compare > the to the preheader count or maybe better to the _header_ count which > would keep the number of iterations out of the equation? I quickly tried to replace all the loop_preheader_edge (loop)->src with loop_preheader_edge (loop)->dest, it will cause many failures in gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems reasonable to compare the bb count with preheader count as both gimple lim and RTL loop-invariant move instructions to *preheader* instead of *header* after analysis? > > If we look at maybe_hot_count_p that's a quite sophisticated thing to > compare a count to the "IPA hot", here we're comparing two counts > within a function where it actually matters whether we use a !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p > function). > > Xionghu, you error on the side of not hoisting for unordered counts here > >> + return NULL; >> + >> + while (min_loop != loop) >> + { >> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); >> + if (loop_preheader_edge (min_loop)->src->count < min_count) > > but in the other direction here and on the side of not hoisting > in ref_in_loop_hot_body. > > The three-state relational operator overloads are probably not the > very best idea... > (see profile-count.h for them) > Added new function bb_colder_than_loop_preheader to encapsulate the comparision, if FALSE is returned due to three-state inequality, find_coldest_out_loop will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator () will return true to continue perform store motion, both preserve the previous behavior. >> + cold_loop = min_loop; >> + } >> + return cold_loop; >> +} >> + >> /* Suppose that operand DEF is used inside the LOOP. Returns the outermost >> loop to that we could move the expression using DEF if it did not have >> other operands, i.e. the outermost loop enclosing LOOP in that the value >> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) >> level = ALWAYS_EXECUTED_IN (bb); >> else >> level = superloop_at_depth (loop, 1); >> - lim_data->max_loop = level; >> + lim_data->max_loop = find_coldest_out_loop (level, loop, bb); >> + if (!lim_data->max_loop) >> + return false; >> >> if (gphi *phi = dyn_cast (stmt)) >> { >> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb) >> for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); ) >> { >> edge e; >> - >> gimple *stmt = gsi_stmt (bsi); >> >> lim_data = get_lim_data (stmt); >> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb) >> /* We do not really want to move conditionals out of the loop; we just >> placed it here to force its operands to be moved if necessary. */ >> if (gimple_code (stmt) == GIMPLE_COND) >> - continue; >> + { >> + gsi_next (&bsi); >> + continue; >> + } >> >> if (dump_file && (dump_flags & TDF_DETAILS)) >> { >> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec &seq, >> } >> else >> { >> - sm_aux *aux = *aux_map.get (ref); >> + sm_aux **paux = aux_map.get (ref); >> + sm_aux *aux; >> + if (paux) >> + aux = *paux; >> + else >> + continue; > > do you really need this? I doubt so. Removed. > >> if (!aux->store_flag || kind == sm_ord) >> { >> gassign *store; >> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) >> return indep_p; >> } >> >> +class ref_in_loop_hot_body >> +{ >> +public: >> + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} >> + bool operator () (mem_ref_loc *loc); >> + class loop *l; >> +}; >> + >> +bool >> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) >> +{ >> + basic_block curr_bb = gimple_bb (loc->stmt); >> + edge e = loop_preheader_edge (l); >> + if (e->src->count > curr_bb->count) >> + return false; >> + else >> + return true; >> +} >> + >> >> /* Returns true if we can perform store motion of REF from LOOP. */ >> >> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) >> if (!ref_indep_loop_p (loop, ref, sm_war)) >> return false; >> > > Add a comment here what this is about. Done. > > Otherwise the GIMPLE invariant motion parts look sensible, but I'd > really like to have > the issue on the profile_count API sorted out. > > Can you split out the RTL invariant motion part to a separate patch please? Done. Attached the two patches, thanks. BR, Xionghu