From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by sourceware.org (Postfix) with ESMTPS id 2127D3858402; Wed, 27 Oct 2021 02:40:28 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 2127D3858402 Received: from pps.filterd (m0098420.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.1.2/8.16.1.2) with SMTP id 19R1IM4Z023060; Wed, 27 Oct 2021 02:40:26 GMT Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com with ESMTP id 3bxw0222hm-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 27 Oct 2021 02:40:26 +0000 Received: from m0098420.ppops.net (m0098420.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 19R2NRni013115; Wed, 27 Oct 2021 02:40:25 GMT Received: from ppma03fra.de.ibm.com (6b.4a.5195.ip4.static.sl-reverse.com [149.81.74.107]) by mx0b-001b2d01.pphosted.com with ESMTP id 3bxw0222gw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 27 Oct 2021 02:40:25 +0000 Received: from pps.filterd (ppma03fra.de.ibm.com [127.0.0.1]) by ppma03fra.de.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 19R2WK6Y027034; Wed, 27 Oct 2021 02:40:24 GMT Received: from b06cxnps4076.portsmouth.uk.ibm.com (d06relay13.portsmouth.uk.ibm.com [9.149.109.198]) by ppma03fra.de.ibm.com with ESMTP id 3bx4epsmc9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 27 Oct 2021 02:40:23 +0000 Received: from d06av25.portsmouth.uk.ibm.com (d06av25.portsmouth.uk.ibm.com [9.149.105.61]) by b06cxnps4076.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 19R2eKFH64094480 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 27 Oct 2021 02:40:20 GMT Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 915DE11C054; Wed, 27 Oct 2021 02:40:20 +0000 (GMT) Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 879B211C050; Wed, 27 Oct 2021 02:40:17 +0000 (GMT) Received: from [9.200.154.17] (unknown [9.200.154.17]) by d06av25.portsmouth.uk.ibm.com (Postfix) with ESMTPS; Wed, 27 Oct 2021 02:40:17 +0000 (GMT) Message-ID: Date: Wed, 27 Oct 2021 10:40:15 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.2.1 Subject: Re: [RFC] Don't move cold code out of loop by checking bb count Content-Language: en-US To: Richard Biener Cc: GCC Patches , Segher Boessenkool , David Edelsohn , Bill Schmidt , Jiufu Guo , linkw@gcc.gnu.org, Feng Xue OS , Jan Hubicka References: <20210802050501.159058-1-luoxhu@linux.ibm.com> <53b7c729-33c0-138f-fa06-d6efb7a43911@linux.ibm.com> <0b675ba1-cdab-652a-0bba-704b0090f1c5@linux.ibm.com> From: Xionghu Luo In-Reply-To: Content-Type: text/plain; charset=UTF-8 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: OiCsqAz891huBzPNwZUxLV9BFkQacmoq X-Proofpoint-ORIG-GUID: wAXCoLY1pepuaWsWw4VjoEx-iDts58_y Content-Transfer-Encoding: 7bit X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.182.1,Aquarius:18.0.790,Hydra:6.0.425,FMLib:17.0.607.475 definitions=2021-10-26_07,2021-10-26_01,2020-04-07_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 lowpriorityscore=0 suspectscore=0 phishscore=0 mlxlogscore=999 spamscore=0 malwarescore=0 adultscore=0 impostorscore=0 priorityscore=1501 clxscore=1015 bulkscore=0 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2110150000 definitions=main-2110270012 X-Spam-Status: No, score=-11.5 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, NICE_REPLY_A, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 27 Oct 2021 02:40:30 -0000 On 2021/10/26 21:20, Richard Biener wrote: > On Mon, Oct 18, 2021 at 6:29 AM Xionghu Luo wrote: >> >> >> >> On 2021/10/15 16:11, Richard Biener wrote: >>> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo wrote: >>>> >>>> Hi, >>>> >>>> On 2021/9/28 20:09, Richard Biener wrote: >>>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo wrote: >>>>>> >>>>>> Update the patch to v3, not sure whether you prefer the paste style >>>>>> and continue to link the previous thread as Segher dislikes this... >>>>>> >>>>>> >>>>>> [PATCH v3] Don't move cold code out of loop by checking bb count >>>>>> >>>>>> >>>>>> Changes: >>>>>> 1. Handle max_loop in determine_max_movement instead of >>>>>> outermost_invariant_loop. >>>>>> 2. Remove unnecessary changes. >>>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. >>>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused >>>>>> infinite loop when implementing v1 and the iteration is missed to be >>>>>> updated actually. >>>>>> >>>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html >>>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html >>>>>> >>>>>> There was a patch trying to avoid move cold block out of loop: >>>>>> >>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >>>>>> >>>>>> Richard suggested to "never hoist anything from a bb with lower execution >>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker >>>>>> before_dom_children". >>>>>> >>>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to >>>>>> expected target loop, if profile count of the loop bb is colder >>>>>> than target loop preheader, it won't be hoisted out of loop. >>>>>> Likely for store motion, if all locations of the REF in loop is cold, >>>>>> don't do store motion of it. >>>>>> >>>>>> SPEC2017 performance evaluation shows 1% performance improvement for >>>>>> intrate GEOMEAN and no obvious regression for others. Especially, >>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >>>>>> on P8LE. >>>>>> >>>>>> gcc/ChangeLog: >>>>>> >>>>>> * loop-invariant.c (find_invariants_bb): Check profile count >>>>>> before motion. >>>>>> (find_invariants_body): Add argument. >>>>>> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. >>>>>> (determine_max_movement): Use find_coldest_out_loop. >>>>>> (move_computations_worker): Adjust and fix iteration udpate. >>>>>> (execute_sm_exit): Check pointer validness. >>>>>> (class ref_in_loop_hot_body): New functor. >>>>>> (ref_in_loop_hot_body::operator): New. >>>>>> (can_sm_ref_p): Use for_all_locs_in_loop. >>>>>> >>>>>> gcc/testsuite/ChangeLog: >>>>>> >>>>>> * gcc.dg/tree-ssa/recip-3.c: Adjust. >>>>>> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. >>>>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. >>>>>> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. >>>>>> --- >>>>>> gcc/loop-invariant.c | 10 ++-- >>>>>> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- >>>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- >>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ >>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ >>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ >>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ >>>>>> 7 files changed, 165 insertions(+), 8 deletions(-) >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c >>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c >>>>>> >>>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c >>>>>> index fca0c2b24be..5c3be7bf0eb 100644 >>>>>> --- a/gcc/loop-invariant.c >>>>>> +++ b/gcc/loop-invariant.c >>>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) >>>>>> call. */ >>>>>> >>>>>> static void >>>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) >>>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, >>>>>> + bool always_executed) >>>>>> { >>>>>> rtx_insn *insn; >>>>>> + basic_block preheader = loop_preheader_edge (loop)->src; >>>>>> + >>>>>> + if (preheader->count > bb->count) >>>>>> + return; >>>>>> >>>>>> FOR_BB_INSNS (bb, insn) >>>>>> { >>>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, >>>>>> unsigned i; >>>>>> >>>>>> for (i = 0; i < loop->num_nodes; i++) >>>>>> - find_invariants_bb (body[i], >>>>>> - bitmap_bit_p (always_reached, i), >>>>>> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), >>>>>> bitmap_bit_p (always_executed, i)); >>>>>> } >>>>>> >>>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>>>>> index 4b187c2cdaf..655fab03442 100644 >>>>>> --- a/gcc/tree-ssa-loop-im.c >>>>>> +++ b/gcc/tree-ssa-loop-im.c >>>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) >>>>>> return ret; >>>>>> } >>>>>> >>>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ >>>>>> + >>>>>> +static class loop * >>>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >>>>>> + basic_block curr_bb) >>>>>> +{ >>>>>> + class loop *cold_loop, *min_loop; >>>>>> + cold_loop = min_loop = outmost_loop; >>>>>> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; >>>>>> + >>>>>> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) >>>>> >>>>> Honza - can you comment on whether we should compare BB counts this way? >>>>> >>>>> I would suspect that for, say, >>>>> >>>>> for (...) >>>>> if (a) >>>>> X; >>>>> else >>>>> Y; >>>>> >>>>> that the counts for X and Y will be less than that of the preheader of the loop >>>>> only when the loop is estimated to run once. That is, should we really compare >>>>> the to the preheader count or maybe better to the _header_ count which >>>>> would keep the number of iterations out of the equation? >>>> >>>> I quickly tried to replace all the loop_preheader_edge (loop)->src with >>>> loop_preheader_edge (loop)->dest, it will cause many failures in >>>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems >>>> reasonable to compare the bb count with preheader count as both gimple lim >>>> and RTL loop-invariant move instructions to *preheader* instead of *header* >>>> after analysis? >>> >>> Hmm, yeah - guess I was confused here. >>> >>>>> >>>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to >>>>> compare a count to the "IPA hot", here we're comparing two counts >>>>> within a function where it actually matters whether we use a>>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p >>>>> function). >>>>> >>>>> Xionghu, you error on the side of not hoisting for unordered counts here >>>>> >>>>>> + return NULL; >>>>>> + >>>>>> + while (min_loop != loop) >>>>>> + { >>>>>> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); >>>>>> + if (loop_preheader_edge (min_loop)->src->count < min_count) >>>>> >>>>> but in the other direction here and on the side of not hoisting >>>>> in ref_in_loop_hot_body. >>>>> >>>>> The three-state relational operator overloads are probably not the >>>>> very best idea... >>>>> (see profile-count.h for them) >>>>> >>>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision, >>>> if FALSE is returned due to three-state inequality, find_coldest_out_loop >>>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator () >>>> will return true to continue perform store motion, both preserve the previous >>>> behavior. >>> >>> Thanks. But I don't think the abstraction as written is useful: >>> >>> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state >>> + as stated in profile-count.h, FALSE is returned if inequality cannot be >>> + decided. */ >>> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) >>> +{ >>> + if (count1 < count2) >>> + return true; >>> + else >>> + return false; >>> +} >>> >>> given the following seems to pass the preheader count in place of the BB count. >>> >>> + if (bb_colder_than_loop_preheader ( >>> + loop_preheader_edge (min_loop)->src->count, min_count)) >>> + cold_loop = min_loop; >>> >>> find_coldest_out_loop is also a bit weird, I think we want to find >>> the outermost loop between outmost_loop and loop that has a >>> lower count than the curr_bb count but >>> >>> + while (min_loop != loop) >>> + { >>> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); >>> + if (bb_colder_than_loop_preheader ( >>> + loop_preheader_edge (min_loop)->src->count, min_count)) >>> + cold_loop = min_loop; >>> >>> compares the outermost loops count (min_count) against the preheader >>> count? So we're searching for a cold loop with respect to its enclosing loop >>> here? >> >> Let me try to explain how it works :) >> >> find_coldest_out_loop does two steps check: >> 1) Check whether curr_bb is cold in it's own loop_father, if it is cold, >> just return NULL which means it should not be moved out at all; >> 2) curr_bb is NOT cold, assuming the current loop L[m] is the coldest first, >> than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]}, >> if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop >> that has smallest profile_count. >> >> >> Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in >> loop 3, if it is cold, just return NULL, otherwise select the coldest loop in >> {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to >> be the target hoist loop. The first check could AVOID hoist if curr_bb is colder >> than loop3, but it is still hot than loop1 and loop2. Not sure whether it is possible >> to construct such cases? >> >> >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> >> volatile int x; >> void >> bar (int, char *, char *); >> void >> foo (int *a, int n, int m, int s, int t) >> { >> int i; >> int j; >> int k; >> >> for (i = 0; i < m; i++) // loop 1 >> { >> if (__builtin_expect (x, 0)) >> for (j = 0; j < n; j++) // loop 2 >> for (k = 0; k < n; k++) // loop 3 >> { >> bar (s / 5, "one", "two"); // curr_bb >> a[t] = s; >> } >> a[t] = t; // curr_bb2 >> } >> } >> >> The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1) >> with this patch. >> There are totally 6 combinations when curr_bb is hotter than loop 3. We need >> to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness", >> returning the coldest loop for this function find_coldest_out_loop, otherwise >> unexpected behavior happens. >> >> L1 > L2 > L3 => return L3 >> L1 > L3 > L2 => return L2 >> L2 > L1 > L3 => return L3 >> L2 > L3 > L1 => return L1 >> L3 > L1 > L2 => return L2 >> L3 > L2 > L1 => return L1 >> >> So bb_colder_than_loop_preheader does two kind of checks, one is checking >> L3 preheader count with curr_bb count, another is checking L3 preheader count >> with L1 preheader count, L2 preheader count, etc... >> >> >> ssa-lim-19.c.138t.lim2: >> ... >> [local count: 16057869]: // L1 preheader >> - _4 = s_22(D) / 5; >> - _5 = (long unsigned int) t_24(D); >> - _6 = _5 * 4; >> - _7 = a_25(D) + _6; >> _8 = (long unsigned int) t_24(D); >> _9 = _8 * 4; >> _10 = a_25(D) + _9; >> >> [local count: 145980626]: >> # i_34 = PHI >> x.0_1 ={v} x; >> if (x.0_1 != 0) >> goto ; [10.00%] >> else >> goto ; [90.00%] >> >> [local count: 14598063]: >> if (n_20(D) > 0) >> goto ; [89.00%] >> else >> goto ; [11.00%] >> >> [local count: 12992276]: // L2 preheader >> + _4 = s_22(D) / 5; >> + _5 = (long unsigned int) t_24(D); >> + _6 = _5 * 4; >> + _7 = a_25(D) + _6; >> goto ; [100.00%] >> >> [local count: 850510901]: >> >> [local count: 955630225]: // curr_bb >> # k_36 = PHI >> bar (_4, "one", "two"); >> *_7 = s_22(D); >> k_27 = k_36 + 1; >> if (n_20(D) > k_27) >> goto ; [89.00%] >> else >> goto ; [11.00%] >> >> [local count: 118111600]: >> j_21 = j_35 + 1; >> if (n_20(D) > j_21) >> goto ; [89.00%] >> else >> goto ; [11.00%] >> >> [local count: 105119324]: >> >> [local count: 118111600]: // L3 preheader >> # j_35 = PHI >> goto ; [100.00%] >> >> [local count: 145980626]: >> *_10 = t_24(D); >> i_29 = i_34 + 1; >> >> Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop: >> >> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state >> + as stated in profile-count.h, FALSE is returned if inequality cannot be >> + decided. */ >> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) >> +{ >> + if (count1 < count2) >> + return true; >> + else >> + return false; >> +} >> + >> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count. >> + */ >> + >> +static class loop * >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >> + basic_block curr_bb) >> +{ >> + class loop *cold_loop, *min_loop; >> + cold_loop = min_loop = outmost_loop; >> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; >> + >> + /* If bb_colder_than_loop_preheader returns false due to three-state >> + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. >> + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. */ >> + if (curr_bb >> + && bb_colder_than_loop_preheader (curr_bb->count, >> + loop_preheader_edge (loop)->src->count)) >> + return NULL; >> + >> + while (min_loop != loop) >> + { >> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); >> + if (bb_colder_than_loop_preheader ( >> + loop_preheader_edge (min_loop)->src->count, min_count)) >> + cold_loop = min_loop; >> + } >> + return cold_loop; >> +} >> + >> >> >>> >>> Why is this function not simply >>> >>> +static class loop * >>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >>> + basic_block curr_bb) >>> +{ >>> while (bb_colder_than_loop_preheader (curr_bb->count, >>> loop_preheader_edge (outermost_loop)->src->count)) >>> { >>> if (outermost_loop == loop) >>> return NULL; >>> outermost_loop = superloop_at_depth (loop, loop_depth >>> (outermost_loop) + 1); >>> } >>> return outermost_loop; >>> } >> >> If change like this, when processing curr_bb(5), outermost_loop will >> return loop 1 since curr_bb->count > Loop1_prehead->count, the while >> loop stopped. This doesn't meet what we want. > > Why? curr_bb is executed at least as often as loop1 preheader if > we look at the counts? So either the counts do not really tell us > anything of help or I am missing something. Are you merely > looking for a block with a lower count on the path from the outermost > loop entry to the block in question and deciding you do not want to > hoist further than that? So it's not about not hoisting to a hot place > but instead hoist to the coldest place within a loop nest? > > So we have > > for (i = 0; i < m; i++) // loop 1 > { > if (__builtin_expect (x, 0)) > for (j = 0; j < n; j++) // loop 2 > > > [local count: 16057869]: // L1 preheader > ... > [local count: 145980626]: > # i_34 = PHI > ... > [local count: 12992276]: // L2 preheader > ... > [local count: 118111600]: // L3 preheader > # j_35 = PHI > goto ; [100.00%] > > and we want to hoist to the L2 preheader because that's less frequently > executed than the L1 preheader (which is less frequently executed > than the L3 preheader or the block we are hoisting from). Yes, this is exactly what I want, sorry for not describe it clear before ;( The updated patch[1] may reflect find_coldest_out_loop better: It first check whether curr_bb is hotter than it's preheader, if false, return NULL which means no need hoist at all; Then find a *coldest* preheader to hoist within a loop nest from outmost_loop. [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html +/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count. + It does two steps check: + 1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just + return NULL which means it should not be moved out at all; + 2) CURR_BB is NOT cold, set LOOP to cold_loop, then iteratively search loops + from {L[outmost_loop], L[outmost_loop+1], ... L[loop]}, if L[i] is colder + than L[cold_loop], reset cold_loop to L[i] until get the loop that has + smallest profile_count. */ + +static class loop * +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, + basic_block curr_bb) +{ + class loop *cold_loop; + + /* If bb_colder_than_loop_preheader returns false due to three-state + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. */ + if (curr_bb + && bb_colder_than_loop_preheader (curr_bb, + loop_preheader_edge (loop)->src)) + return NULL; + + cold_loop = loop; + while (outmost_loop != loop) + { + if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src, + loop_preheader_edge (cold_loop)->src)) + cold_loop = outmost_loop; + outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1); + } + return cold_loop; +} > > I'm concerned with compile-time complexity re-evaluating counts on the > loop nest many times. So it looks to me that we can pre-compute > this lowest-preheader-count loop for a loop nest at least for the > store-motion case where we know the outermost loop? > > But the lowest-preheader-count loop may change for a loop/bb with different outermost loop. For example if, L1_preheader_count < L2_preheader_count < L3_preheader_count < L4_preheader_count < curr_bb_count then, find_coldest_out_loop (L1, loop, curr_bb) => coldest preheader loop is L1 find_coldest_out_loop (L2, loop, curr_bb) => coldest preheader loop is L2 So it will be a 1:N map? Pre-compute it in find_coldest_out_loop and save it also in lim_data with a new variable coldest_preheader_loop[outmost_loop][coldest_preheader_loop]? each call of find_coldest_out_loop will check whether that variable is set already, only continue the search if coldest_preheader_loop[outmost_loop][coldest_preheader_loop] is not set? Seems a bit complicated and not sure whether it helps to reduce compile-time complexity or I am misunderstanding... -- Thanks, Xionghu