From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <luoxhu@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com
 [148.163.158.5])
 by sourceware.org (Postfix) with ESMTPS id 2127D3858402;
 Wed, 27 Oct 2021 02:40:28 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 2127D3858402
Received: from pps.filterd (m0098420.ppops.net [127.0.0.1])
 by mx0b-001b2d01.pphosted.com (8.16.1.2/8.16.1.2) with SMTP id 19R1IM4Z023060; 
 Wed, 27 Oct 2021 02:40:26 GMT
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0b-001b2d01.pphosted.com with ESMTP id 3bxw0222hm-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 27 Oct 2021 02:40:26 +0000
Received: from m0098420.ppops.net (m0098420.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 19R2NRni013115;
 Wed, 27 Oct 2021 02:40:25 GMT
Received: from ppma03fra.de.ibm.com (6b.4a.5195.ip4.static.sl-reverse.com
 [149.81.74.107])
 by mx0b-001b2d01.pphosted.com with ESMTP id 3bxw0222gw-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 27 Oct 2021 02:40:25 +0000
Received: from pps.filterd (ppma03fra.de.ibm.com [127.0.0.1])
 by ppma03fra.de.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 19R2WK6Y027034;
 Wed, 27 Oct 2021 02:40:24 GMT
Received: from b06cxnps4076.portsmouth.uk.ibm.com
 (d06relay13.portsmouth.uk.ibm.com [9.149.109.198])
 by ppma03fra.de.ibm.com with ESMTP id 3bx4epsmc9-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 27 Oct 2021 02:40:23 +0000
Received: from d06av25.portsmouth.uk.ibm.com (d06av25.portsmouth.uk.ibm.com
 [9.149.105.61])
 by b06cxnps4076.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 19R2eKFH64094480
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Wed, 27 Oct 2021 02:40:20 GMT
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 915DE11C054;
 Wed, 27 Oct 2021 02:40:20 +0000 (GMT)
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 879B211C050;
 Wed, 27 Oct 2021 02:40:17 +0000 (GMT)
Received: from [9.200.154.17] (unknown [9.200.154.17])
 by d06av25.portsmouth.uk.ibm.com (Postfix) with ESMTPS;
 Wed, 27 Oct 2021 02:40:17 +0000 (GMT)
Message-ID: <d18754f1-08b2-9a59-40ff-485374ee0e7e@linux.ibm.com>
Date: Wed, 27 Oct 2021 10:40:15 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
 Gecko/20100101 Thunderbird/91.2.1
Subject: Re: [RFC] Don't move cold code out of loop by checking bb count
Content-Language: en-US
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
 Segher Boessenkool <segher@kernel.crashing.org>,
 David Edelsohn <dje.gcc@gmail.com>, Bill Schmidt <wschmidt@linux.ibm.com>,
 Jiufu Guo <guojiufu@linux.ibm.com>, linkw@gcc.gnu.org,
 Feng Xue OS <fxue@os.amperecomputing.com>, Jan Hubicka <hubicka@ucw.cz>
References: <20210802050501.159058-1-luoxhu@linux.ibm.com>
 <CAFiYyc2tg7h9hczNtedviM5UaEYOAbQV+DGH_6EkjxeGKc9hzA@mail.gmail.com>
 <e9cc0874-b066-5eaf-6334-e7c8492a22a5@linux.ibm.com>
 <CAFiYyc0ZxonSQ8kjoErua7VSVkYw5PmaLUADt3cC659wV_8KiA@mail.gmail.com>
 <f4382ef5-2825-4c75-720f-1e475b9b9636@linux.ibm.com>
 <CAFiYyc0KnsADFpD2ktNR+fJhAup+J6fekuQRVa78wFBq-F__jw@mail.gmail.com>
 <b98ca191-0f61-8658-0bcf-dcbc9739c22d@linux.ibm.com>
 <CAFiYyc1jTuGoKS2TBRe35BJo5ZhK=ttDn5zJ_TUhinrK2pB3Sw@mail.gmail.com>
 <53b7c729-33c0-138f-fa06-d6efb7a43911@linux.ibm.com>
 <CAFiYyc0vSB=+n9SBzQ49dBow3mq7FcNqmELicLaGX4tCgmH7Bw@mail.gmail.com>
 <0b675ba1-cdab-652a-0bba-704b0090f1c5@linux.ibm.com>
 <CAFiYyc0ZYNLouDJieqtZD5HMj5v3skdpfML9=sSob4j5a-GYXQ@mail.gmail.com>
From: Xionghu Luo <luoxhu@linux.ibm.com>
In-Reply-To: <CAFiYyc0ZYNLouDJieqtZD5HMj5v3skdpfML9=sSob4j5a-GYXQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: OiCsqAz891huBzPNwZUxLV9BFkQacmoq
X-Proofpoint-ORIG-GUID: wAXCoLY1pepuaWsWw4VjoEx-iDts58_y
Content-Transfer-Encoding: 7bit
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.182.1,Aquarius:18.0.790,Hydra:6.0.425,FMLib:17.0.607.475
 definitions=2021-10-26_07,2021-10-26_01,2020-04-07_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 lowpriorityscore=0
 suspectscore=0 phishscore=0 mlxlogscore=999 spamscore=0 malwarescore=0
 adultscore=0 impostorscore=0 priorityscore=1501 clxscore=1015 bulkscore=0
 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2110150000 definitions=main-2110270012
X-Spam-Status: No, score=-11.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, NICE_REPLY_A,
 RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Oct 2021 02:40:30 -0000


On 2021/10/26 21:20, Richard Biener wrote:
> On Mon, Oct 18, 2021 at 6:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/10/15 16:11, Richard Biener wrote:
>>> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 2021/9/28 20:09, Richard Biener wrote:
>>>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>>>
>>>>>> Update the patch to v3, not sure whether you prefer the paste style
>>>>>> and continue to link the previous thread as Segher dislikes this...
>>>>>>
>>>>>>
>>>>>> [PATCH v3] Don't move cold code out of loop by checking bb count
>>>>>>
>>>>>>
>>>>>> Changes:
>>>>>> 1. Handle max_loop in determine_max_movement instead of
>>>>>> outermost_invariant_loop.
>>>>>> 2. Remove unnecessary changes.
>>>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
>>>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
>>>>>> infinite loop when implementing v1 and the iteration is missed to be
>>>>>> updated actually.
>>>>>>
>>>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
>>>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
>>>>>>
>>>>>> There was a patch trying to avoid move cold block out of loop:
>>>>>>
>>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>>>
>>>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>>>> before_dom_children".
>>>>>>
>>>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>>>>>> expected target loop, if profile count of the loop bb is colder
>>>>>> than target loop preheader, it won't be hoisted out of loop.
>>>>>> Likely for store motion, if all locations of the REF in loop is cold,
>>>>>> don't do store motion of it.
>>>>>>
>>>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>>>> on P8LE.
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>>         * loop-invariant.c (find_invariants_bb): Check profile count
>>>>>>         before motion.
>>>>>>         (find_invariants_body): Add argument.
>>>>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>>>>>>         (determine_max_movement): Use find_coldest_out_loop.
>>>>>>         (move_computations_worker): Adjust and fix iteration udpate.
>>>>>>         (execute_sm_exit): Check pointer validness.
>>>>>>         (class ref_in_loop_hot_body): New functor.
>>>>>>         (ref_in_loop_hot_body::operator): New.
>>>>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
>>>>>>
>>>>>> gcc/testsuite/ChangeLog:
>>>>>>
>>>>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>>>>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>>>>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>>>>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>>>>>> ---
>>>>>>  gcc/loop-invariant.c                       | 10 ++--
>>>>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
>>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
>>>>>>  7 files changed, 165 insertions(+), 8 deletions(-)
>>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>>>>>>
>>>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>>>>>> index fca0c2b24be..5c3be7bf0eb 100644
>>>>>> --- a/gcc/loop-invariant.c
>>>>>> +++ b/gcc/loop-invariant.c
>>>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>>>>>>     call.  */
>>>>>>
>>>>>>  static void
>>>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
>>>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>>>>>> +                   bool always_executed)
>>>>>>  {
>>>>>>    rtx_insn *insn;
>>>>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>>>>>> +
>>>>>> +  if (preheader->count > bb->count)
>>>>>> +    return;
>>>>>>
>>>>>>    FOR_BB_INSNS (bb, insn)
>>>>>>      {
>>>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>>>>>>    unsigned i;
>>>>>>
>>>>>>    for (i = 0; i < loop->num_nodes; i++)
>>>>>> -    find_invariants_bb (body[i],
>>>>>> -                       bitmap_bit_p (always_reached, i),
>>>>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>>>>>                         bitmap_bit_p (always_executed, i));
>>>>>>  }
>>>>>>
>>>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>>>>>> index 4b187c2cdaf..655fab03442 100644
>>>>>> --- a/gcc/tree-ssa-loop-im.c
>>>>>> +++ b/gcc/tree-ssa-loop-im.c
>>>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>>>>>>    return ret;
>>>>>>  }
>>>>>>
>>>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
>>>>>> +
>>>>>> +static class loop *
>>>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>>>>>> +                      basic_block curr_bb)
>>>>>> +{
>>>>>> +  class loop *cold_loop, *min_loop;
>>>>>> +  cold_loop = min_loop = outmost_loop;
>>>>>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>>>>>> +
>>>>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
>>>>>
>>>>> Honza - can you comment on whether we should compare BB counts this way?
>>>>>
>>>>> I would suspect that for, say,
>>>>>
>>>>>   for (...)
>>>>>      if (a)
>>>>>        X;
>>>>>      else
>>>>>        Y;
>>>>>
>>>>> that the counts for X and Y will be less than that of the preheader of the loop
>>>>> only when the loop is estimated to run once.  That is, should we really compare
>>>>> the to the preheader count or maybe better to the _header_ count which
>>>>> would keep the number of iterations out of the equation?
>>>>
>>>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
>>>> loop_preheader_edge (loop)->dest, it will cause many failures in
>>>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
>>>> reasonable to compare the bb count with preheader count as both gimple lim
>>>> and RTL loop-invariant move instructions to *preheader* instead of *header*
>>>> after analysis?
>>>
>>> Hmm, yeah - guess I was confused here.
>>>
>>>>>
>>>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to
>>>>> compare a count to the "IPA hot", here we're comparing two counts
>>>>> within a function where it actually matters whether we use a<b or
>>>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
>>>>> function).
>>>>>
>>>>> Xionghu, you error on the side of not hoisting for unordered counts here
>>>>>
>>>>>> +    return NULL;
>>>>>> +
>>>>>> +  while (min_loop != loop)
>>>>>> +    {
>>>>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>>>>>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
>>>>>
>>>>> but in the other direction here and on the side of not hoisting
>>>>> in ref_in_loop_hot_body.
>>>>>
>>>>> The three-state relational operator overloads are probably not the
>>>>> very best idea...
>>>>> (see profile-count.h for them)
>>>>>
>>>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
>>>> if FALSE is returned due to three-state inequality,  find_coldest_out_loop
>>>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
>>>> will return true to continue perform store motion, both preserve the previous
>>>> behavior.
>>>
>>> Thanks.  But I don't think the abstraction as written is useful:
>>>
>>> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
>>> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
>>> +   decided.  */
>>> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
>>> +{
>>> +  if (count1 < count2)
>>> +    return true;
>>> +  else
>>> +    return false;
>>> +}
>>>
>>> given the following seems to pass the preheader count in place of the BB count.
>>>
>>> +      if (bb_colder_than_loop_preheader (
>>> +           loop_preheader_edge (min_loop)->src->count, min_count))
>>> +       cold_loop = min_loop;
>>>
>>> find_coldest_out_loop is also a bit weird, I think we want to find
>>> the outermost loop between outmost_loop and loop that has a
>>> lower count than the curr_bb count but
>>>
>>> +  while (min_loop != loop)
>>> +    {
>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>>> +      if (bb_colder_than_loop_preheader (
>>> +           loop_preheader_edge (min_loop)->src->count, min_count))
>>> +       cold_loop = min_loop;
>>>
>>> compares the outermost loops count (min_count) against the preheader
>>> count?  So we're searching for a cold loop with respect to its enclosing loop
>>> here?
>>
>> Let me try to explain how it works :)
>>
>> find_coldest_out_loop does two steps check:
>> 1) Check whether curr_bb is cold in it's own loop_father, if it is cold,
>> just return NULL which means it should not be moved out at all;
>> 2)  curr_bb is NOT cold, assuming the current loop L[m] is the coldest first,
>> than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]},
>> if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop
>> that has smallest profile_count.
>>
>>
>> Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in
>> loop 3, if it is cold, just return NULL, otherwise select the coldest loop in
>> {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to
>> be the target hoist loop.  The first check could AVOID hoist if curr_bb is colder
>> than loop3, but it is still hot than loop1 and loop2.  Not sure whether it is possible
>> to construct such cases?
>>
>>
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>
>> volatile int x;
>> void
>> bar (int, char *, char *);
>> void
>> foo (int *a, int n, int m, int s, int t)
>> {
>>   int i;
>>   int j;
>>   int k;
>>
>>   for (i = 0; i < m; i++)  // loop 1
>>     {
>>       if (__builtin_expect (x, 0))
>>         for (j = 0; j < n; j++)   // loop 2
>>           for (k = 0; k < n; k++)   // loop 3
>>            {
>>              bar (s / 5, "one", "two");  // curr_bb
>>              a[t] = s;
>>            }
>>       a[t] = t;  // curr_bb2
>>     }
>> }
>>
>> The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1)
>> with this patch.
>> There are totally 6 combinations when curr_bb is hotter than loop 3.  We need
>> to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness",
>> returning the coldest loop for this function find_coldest_out_loop, otherwise
>> unexpected behavior happens.
>>
>> L1 > L2 > L3   =>  return L3
>> L1 > L3 > L2   =>  return L2
>> L2 > L1 > L3   =>  return L3
>> L2 > L3 > L1   =>  return L1
>> L3 > L1 > L2   =>  return L2
>> L3 > L2 > L1   =>  return L1
>>
>> So bb_colder_than_loop_preheader does two kind of checks, one is checking
>> L3 preheader count with curr_bb count, another is checking L3 preheader count
>> with L1 preheader count, L2 preheader count, etc...
>>
>>
>> ssa-lim-19.c.138t.lim2:
>> ...
>>    <bb 10> [local count: 16057869]:  // L1 preheader
>> -  _4 = s_22(D) / 5;
>> -  _5 = (long unsigned int) t_24(D);
>> -  _6 = _5 * 4;
>> -  _7 = a_25(D) + _6;
>>    _8 = (long unsigned int) t_24(D);
>>    _9 = _8 * 4;
>>    _10 = a_25(D) + _9;
>>
>>    <bb 3> [local count: 145980626]:
>>    # i_34 = PHI <i_29(12), 0(10)>
>>    x.0_1 ={v} x;
>>    if (x.0_1 != 0)
>>      goto <bb 4>; [10.00%]
>>    else
>>      goto <bb 8>; [90.00%]
>>
>>    <bb 4> [local count: 14598063]:
>>    if (n_20(D) > 0)
>>      goto <bb 11>; [89.00%]
>>    else
>>      goto <bb 8>; [11.00%]
>>
>>    <bb 11> [local count: 12992276]:  // L2 preheader
>> +  _4 = s_22(D) / 5;
>> +  _5 = (long unsigned int) t_24(D);
>> +  _6 = _5 * 4;
>> +  _7 = a_25(D) + _6;
>>    goto <bb 7>; [100.00%]
>>
>>    <bb 14> [local count: 850510901]:
>>
>>    <bb 5> [local count: 955630225]:  // curr_bb
>>    # k_36 = PHI <k_27(14), 0(7)>
>>    bar (_4, "one", "two");
>>    *_7 = s_22(D);
>>    k_27 = k_36 + 1;
>>    if (n_20(D) > k_27)
>>      goto <bb 14>; [89.00%]
>>    else
>>      goto <bb 6>; [11.00%]
>>
>>    <bb 6> [local count: 118111600]:
>>    j_21 = j_35 + 1;
>>    if (n_20(D) > j_21)
>>      goto <bb 13>; [89.00%]
>>    else
>>      goto <bb 8>; [11.00%]
>>
>>    <bb 13> [local count: 105119324]:
>>
>>    <bb 7> [local count: 118111600]:   // L3 preheader
>>    # j_35 = PHI <j_21(13), 0(11)>
>>    goto <bb 5>; [100.00%]
>>
>>    <bb 8> [local count: 145980626]:
>>    *_10 = t_24(D);
>>    i_29 = i_34 + 1;
>>
>> Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop:
>>
>> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
>> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
>> +   decided.  */
>> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
>> +{
>> +  if (count1 < count2)
>> +    return true;
>> +  else
>> +    return false;
>> +}
>> +
>> +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count.
>> + */
>> +
>> +static class loop *
>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>> +                      basic_block curr_bb)
>> +{
>> +  class loop *cold_loop, *min_loop;
>> +  cold_loop = min_loop = outmost_loop;
>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>> +
>> +  /* If bb_colder_than_loop_preheader returns false due to three-state
>> +    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
>> +    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
>> +  if (curr_bb
>> +      && bb_colder_than_loop_preheader (curr_bb->count,
>> +                                       loop_preheader_edge (loop)->src->count))
>> +    return NULL;
>> +
>> +  while (min_loop != loop)
>> +    {
>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>> +      if (bb_colder_than_loop_preheader (
>> +           loop_preheader_edge (min_loop)->src->count, min_count))
>> +       cold_loop = min_loop;
>> +    }
>> +  return cold_loop;
>> +}
>> +
>>
>>
>>>
>>> Why is this function not simply
>>>
>>> +static class loop *
>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>>> +                      basic_block curr_bb)
>>> +{
>>>      while (bb_colder_than_loop_preheader (curr_bb->count,
>>>                loop_preheader_edge (outermost_loop)->src->count))
>>>         {
>>>             if (outermost_loop == loop)
>>>               return NULL;
>>>             outermost_loop = superloop_at_depth (loop, loop_depth
>>> (outermost_loop) + 1);
>>>         }
>>>      return outermost_loop;
>>> }
>>
>> If change like this, when processing curr_bb(5), outermost_loop will
>> return loop 1 since curr_bb->count > Loop1_prehead->count, the while
>> loop stopped.  This doesn't meet what we want.
> 
> Why?  curr_bb is executed at least as often as loop1 preheader if
> we look at the counts?  So either the counts do not really tell us
> anything of help or I am missing something.  Are you merely
> looking for a block with a lower count on the path from the outermost
> loop entry to the block in question and deciding you do not want to
> hoist further than that?  So it's not about not hoisting to a hot place
> but instead hoist to the coldest place within a loop nest?
> 
> So we have
> 
>   for (i = 0; i < m; i++)  // loop 1
>     {
>       if (__builtin_expect (x, 0))
>         for (j = 0; j < n; j++)   // loop 2
> 
> 
>    <bb 10> [local count: 16057869]:  // L1 preheader
>        ...
>  <bb 3> [local count: 145980626]:
>    # i_34 = PHI <i_29(12), 0(10)>
>  ...
>    <bb 11> [local count: 12992276]:  // L2 preheader
>    ...
>     <bb 7> [local count: 118111600]:   // L3 preheader
>    # j_35 = PHI <j_21(13), 0(11)>
>    goto <bb 5>; [100.00%]
> 
> and we want to hoist to the L2 preheader because that's less frequently
> executed than the L1 preheader (which is less frequently executed
> than the L3 preheader or the block we are hoisting from).

Yes, this is exactly what I want, sorry for not describe it clear before ;(

The updated patch[1] may reflect find_coldest_out_loop better:
It first check whether curr_bb is hotter than it's preheader, if false, return NULL
which means no need hoist at all; Then find a *coldest* preheader to hoist
within a loop nest from outmost_loop.


[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html


+/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count.
+   It does two steps check:
+   1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
+   return NULL which means it should not be moved out at all;
+   2)  CURR_BB is NOT cold, set LOOP to cold_loop, then iteratively search loops
+   from {L[outmost_loop], L[outmost_loop+1], ... L[loop]}, if L[i] is colder
+   than L[cold_loop], reset cold_loop to L[i] until get the loop that has
+   smallest profile_count.  */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		       basic_block curr_bb)
+{
+  class loop *cold_loop;
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
+    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb,
+					loop_preheader_edge (loop)->src))
+    return NULL;
+
+  cold_loop = loop;
+  while (outmost_loop != loop)
+    {
+      if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src,
+					 loop_preheader_edge (cold_loop)->src))
+	cold_loop = outmost_loop;
+      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
+    }
+  return cold_loop;
+}


> 
> I'm concerned with compile-time complexity re-evaluating counts on the
> loop nest many times.  So it looks to me that we can pre-compute
> this lowest-preheader-count loop for a loop nest at least for the
> store-motion case where we know the outermost loop?
> 
> 

But the lowest-preheader-count loop may change for a loop/bb with different
outermost loop.  For example if,

L1_preheader_count < L2_preheader_count < L3_preheader_count < L4_preheader_count < curr_bb_count

then,

find_coldest_out_loop (L1, loop, curr_bb)  => coldest preheader loop is L1
find_coldest_out_loop (L2, loop, curr_bb)  => coldest preheader loop is L2

So it will be a 1:N map?  Pre-compute it in find_coldest_out_loop
and save it also in lim_data with a new variable
coldest_preheader_loop[outmost_loop][coldest_preheader_loop]?
each call of find_coldest_out_loop will check whether that variable is set
already, only continue the search if
coldest_preheader_loop[outmost_loop][coldest_preheader_loop] is not set?
Seems a bit complicated and not sure whether it helps to reduce
compile-time complexity or I am misunderstanding...


-- 
Thanks,
Xionghu