From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ed1-x52c.google.com (mail-ed1-x52c.google.com [IPv6:2a00:1450:4864:20::52c]) by sourceware.org (Postfix) with ESMTPS id F247B3858412; Fri, 15 Oct 2021 08:12:04 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org F247B3858412 Received: by mail-ed1-x52c.google.com with SMTP id z20so35010937edc.13; Fri, 15 Oct 2021 01:12:04 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=97vuL0TAP5k0tS3XE3NsTmDk4vJng+ei3lHrPpb68EQ=; b=HleF+ovAjak8oVOTlrqo4g0gwTlqN9ZE+VJzEl/FWbbIiSN8V3KHFcRLVsxcU1deTA u5k1LPSbyadhjj6XJLeT54o1aiSAPfl3/57GDkjmtCdORuIJhfy9yohtwTGyw8UN0pXB Ni0KFlVeLm+cU6WtbLRev76tFky/LDz1Zvm1U4zLP53S+NRcaO5juCMTA80813N6fk5l /8M/CE03k4Jy9NHWbwe/NaH6ss4Nbezrv6XWAWnNv2qLr8xfYWnW4QV643wW4kcrnLl6 Q4PS5Is3INRikkM3M0xyiLmkZnfT3qIepfuFZ1Zf/CKWvAAdjK2poiDW9JS3mXXJzyEH 5xXw== X-Gm-Message-State: AOAM533ZjurW3rzqXXb5eKwS9768UCz26yv3YY/7cNwwsOyeVgANEXXK e06+y9qfzNFh3NIMjOGBmvTw3ANI7BsWSGVWTtA= X-Google-Smtp-Source: ABdhPJzPp2yVBkjbp+efYFuicp1W6WCSnQyVpTgXmDVIsizwhpDvRPXmbAdzqBEzGXc6kCfNDVtSxF51jIvGYuQG8Fw= X-Received: by 2002:a50:cfc1:: with SMTP id i1mr15812773edk.251.1634285523752; Fri, 15 Oct 2021 01:12:03 -0700 (PDT) MIME-Version: 1.0 References: <20210802050501.159058-1-luoxhu@linux.ibm.com> <53b7c729-33c0-138f-fa06-d6efb7a43911@linux.ibm.com> In-Reply-To: <53b7c729-33c0-138f-fa06-d6efb7a43911@linux.ibm.com> From: Richard Biener Date: Fri, 15 Oct 2021 10:11:52 +0200 Message-ID: Subject: Re: [RFC] Don't move cold code out of loop by checking bb count To: Xionghu Luo Cc: GCC Patches , Segher Boessenkool , David Edelsohn , Bill Schmidt , Jiufu Guo , linkw@gcc.gnu.org, Feng Xue OS , Jan Hubicka Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-8.5 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Oct 2021 08:12:07 -0000 On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo wrote: > > Hi, > > On 2021/9/28 20:09, Richard Biener wrote: > > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo wrote: > >> > >> Update the patch to v3, not sure whether you prefer the paste style > >> and continue to link the previous thread as Segher dislikes this... > >> > >> > >> [PATCH v3] Don't move cold code out of loop by checking bb count > >> > >> > >> Changes: > >> 1. Handle max_loop in determine_max_movement instead of > >> outermost_invariant_loop. > >> 2. Remove unnecessary changes. > >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. > >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused > >> infinite loop when implementing v1 and the iteration is missed to be > >> updated actually. > >> > >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html > >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html > >> > >> There was a patch trying to avoid move cold block out of loop: > >> > >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > >> > >> Richard suggested to "never hoist anything from a bb with lower execution > >> frequency to a bb with higher one in LIM invariantness_dom_walker > >> before_dom_children". > >> > >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to > >> expected target loop, if profile count of the loop bb is colder > >> than target loop preheader, it won't be hoisted out of loop. > >> Likely for store motion, if all locations of the REF in loop is cold, > >> don't do store motion of it. > >> > >> SPEC2017 performance evaluation shows 1% performance improvement for > >> intrate GEOMEAN and no obvious regression for others. Especially, > >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > >> on P8LE. > >> > >> gcc/ChangeLog: > >> > >> * loop-invariant.c (find_invariants_bb): Check profile count > >> before motion. > >> (find_invariants_body): Add argument. > >> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. > >> (determine_max_movement): Use find_coldest_out_loop. > >> (move_computations_worker): Adjust and fix iteration udpate. > >> (execute_sm_exit): Check pointer validness. > >> (class ref_in_loop_hot_body): New functor. > >> (ref_in_loop_hot_body::operator): New. > >> (can_sm_ref_p): Use for_all_locs_in_loop. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/recip-3.c: Adjust. > >> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > >> --- > >> gcc/loop-invariant.c | 10 ++-- > >> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- > >> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ > >> 7 files changed, 165 insertions(+), 8 deletions(-) > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > >> > >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > >> index fca0c2b24be..5c3be7bf0eb 100644 > >> --- a/gcc/loop-invariant.c > >> +++ b/gcc/loop-invariant.c > >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) > >> call. */ > >> > >> static void > >> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) > >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > >> + bool always_executed) > >> { > >> rtx_insn *insn; > >> + basic_block preheader = loop_preheader_edge (loop)->src; > >> + > >> + if (preheader->count > bb->count) > >> + return; > >> > >> FOR_BB_INSNS (bb, insn) > >> { > >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, > >> unsigned i; > >> > >> for (i = 0; i < loop->num_nodes; i++) > >> - find_invariants_bb (body[i], > >> - bitmap_bit_p (always_reached, i), > >> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > >> bitmap_bit_p (always_executed, i)); > >> } > >> > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index 4b187c2cdaf..655fab03442 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) > >> return ret; > >> } > >> > >> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ > >> + > >> +static class loop * > >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > >> + basic_block curr_bb) > >> +{ > >> + class loop *cold_loop, *min_loop; > >> + cold_loop = min_loop = outmost_loop; > >> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > >> + > >> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) > > > > Honza - can you comment on whether we should compare BB counts this way? > > > > I would suspect that for, say, > > > > for (...) > > if (a) > > X; > > else > > Y; > > > > that the counts for X and Y will be less than that of the preheader of the loop > > only when the loop is estimated to run once. That is, should we really compare > > the to the preheader count or maybe better to the _header_ count which > > would keep the number of iterations out of the equation? > > I quickly tried to replace all the loop_preheader_edge (loop)->src with > loop_preheader_edge (loop)->dest, it will cause many failures in > gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems > reasonable to compare the bb count with preheader count as both gimple lim > and RTL loop-invariant move instructions to *preheader* instead of *header* > after analysis? Hmm, yeah - guess I was confused here. > > > > If we look at maybe_hot_count_p that's a quite sophisticated thing to > > compare a count to the "IPA hot", here we're comparing two counts > > within a function where it actually matters whether we use a > !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p > > function). > > > > Xionghu, you error on the side of not hoisting for unordered counts here > > > >> + return NULL; > >> + > >> + while (min_loop != loop) > >> + { > >> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > >> + if (loop_preheader_edge (min_loop)->src->count < min_count) > > > > but in the other direction here and on the side of not hoisting > > in ref_in_loop_hot_body. > > > > The three-state relational operator overloads are probably not the > > very best idea... > > (see profile-count.h for them) > > > Added new function bb_colder_than_loop_preheader to encapsulate the comparision, > if FALSE is returned due to three-state inequality, find_coldest_out_loop > will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator () > will return true to continue perform store motion, both preserve the previous > behavior. Thanks. But I don't think the abstraction as written is useful: +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state + as stated in profile-count.h, FALSE is returned if inequality cannot be + decided. */ +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) +{ + if (count1 < count2) + return true; + else + return false; +} given the following seems to pass the preheader count in place of the BB count. + if (bb_colder_than_loop_preheader ( + loop_preheader_edge (min_loop)->src->count, min_count)) + cold_loop = min_loop; find_coldest_out_loop is also a bit weird, I think we want to find the outermost loop between outmost_loop and loop that has a lower count than the curr_bb count but + while (min_loop != loop) + { + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); + if (bb_colder_than_loop_preheader ( + loop_preheader_edge (min_loop)->src->count, min_count)) + cold_loop = min_loop; compares the outermost loops count (min_count) against the preheader count? So we're searching for a cold loop with respect to its enclosing loop here? Why is this function not simply +static class loop * +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, + basic_block curr_bb) +{ while (bb_colder_than_loop_preheader (curr_bb->count, loop_preheader_edge (outermost_loop)->src->count)) { if (outermost_loop == loop) return NULL; outermost_loop = superloop_at_depth (loop, loop_depth (outermost_loop) + 1); } return outermost_loop; } ? Likewise I wonder why ref_in_loop_hot_body::operator () needs to call find_coldest_out_loop and why it not simply does +bool +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) +{ + basic_block curr_bb = gimple_bb (loc->stmt); if (bb_colder_than_loop_preheader (curr_bb->count, loop_preheader_edge (l)->src->count)) return false; return true; } ? > > >> + cold_loop = min_loop; > >> + } > >> + return cold_loop; > >> +} > >> + > >> /* Suppose that operand DEF is used inside the LOOP. Returns the outermost > >> loop to that we could move the expression using DEF if it did not have > >> other operands, i.e. the outermost loop enclosing LOOP in that the value > >> @@ -685,7 +707,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) > >> level = ALWAYS_EXECUTED_IN (bb); > >> else > >> level = superloop_at_depth (loop, 1); > >> - lim_data->max_loop = level; > >> + lim_data->max_loop = find_coldest_out_loop (level, loop, bb); > >> + if (!lim_data->max_loop) > >> + return false; > >> > >> if (gphi *phi = dyn_cast (stmt)) > >> { > >> @@ -1198,7 +1222,6 @@ move_computations_worker (basic_block bb) > >> for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); ) > >> { > >> edge e; > >> - > >> gimple *stmt = gsi_stmt (bsi); > >> > >> lim_data = get_lim_data (stmt); > >> @@ -1221,7 +1244,10 @@ move_computations_worker (basic_block bb) > >> /* We do not really want to move conditionals out of the loop; we just > >> placed it here to force its operands to be moved if necessary. */ > >> if (gimple_code (stmt) == GIMPLE_COND) > >> - continue; > >> + { > >> + gsi_next (&bsi); > >> + continue; > >> + } > >> > >> if (dump_file && (dump_flags & TDF_DETAILS)) > >> { > >> @@ -2241,7 +2267,12 @@ execute_sm_exit (class loop *loop, edge ex, vec &seq, > >> } > >> else > >> { > >> - sm_aux *aux = *aux_map.get (ref); > >> + sm_aux **paux = aux_map.get (ref); > >> + sm_aux *aux; > >> + if (paux) > >> + aux = *paux; > >> + else > >> + continue; > > > > do you really need this? I doubt so. > > Removed. > > > > >> if (!aux->store_flag || kind == sm_ord) > >> { > >> gassign *store; > >> @@ -2887,6 +2918,25 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) > >> return indep_p; > >> } > >> > >> +class ref_in_loop_hot_body > >> +{ > >> +public: > >> + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} > >> + bool operator () (mem_ref_loc *loc); > >> + class loop *l; > >> +}; > >> + > >> +bool > >> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) > >> +{ > >> + basic_block curr_bb = gimple_bb (loc->stmt); > >> + edge e = loop_preheader_edge (l); > >> + if (e->src->count > curr_bb->count) > >> + return false; > >> + else > >> + return true; > >> +} > >> + > >> > >> /* Returns true if we can perform store motion of REF from LOOP. */ > >> > >> @@ -2941,6 +2991,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) > >> if (!ref_indep_loop_p (loop, ref, sm_war)) > >> return false; > >> > > > > Add a comment here what this is about. > > Done. > > > > > Otherwise the GIMPLE invariant motion parts look sensible, but I'd > > really like to have > > the issue on the profile_count API sorted out. > > > > Can you split out the RTL invariant motion part to a separate patch please? > > Done. Attached the two patches, thanks. > > > BR, > Xionghu