From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by sourceware.org (Postfix) with ESMTPS id 0FC6A3858D28; Mon, 18 Oct 2021 04:29:19 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0FC6A3858D28 Received: from pps.filterd (m0098396.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.1.2/8.16.1.2) with SMTP id 19I3pR4L020456; Mon, 18 Oct 2021 00:29:19 -0400 Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com with ESMTP id 3bs1cw0gtj-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 18 Oct 2021 00:29:18 -0400 Received: from m0098396.ppops.net (m0098396.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 19I3qf5N022253; Mon, 18 Oct 2021 00:29:18 -0400 Received: from ppma06fra.de.ibm.com (48.49.7a9f.ip4.static.sl-reverse.com [159.122.73.72]) by mx0a-001b2d01.pphosted.com with ESMTP id 3bs1cw0gt2-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 18 Oct 2021 00:29:17 -0400 Received: from pps.filterd (ppma06fra.de.ibm.com [127.0.0.1]) by ppma06fra.de.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 19I4S6ug010639; Mon, 18 Oct 2021 04:29:15 GMT Received: from b06cxnps3074.portsmouth.uk.ibm.com (d06relay09.portsmouth.uk.ibm.com [9.149.109.194]) by ppma06fra.de.ibm.com with ESMTP id 3bqp0jh6cc-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 18 Oct 2021 04:29:15 +0000 Received: from d06av22.portsmouth.uk.ibm.com (d06av22.portsmouth.uk.ibm.com [9.149.105.58]) by b06cxnps3074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 19I4TCtn59244808 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 18 Oct 2021 04:29:12 GMT Received: from d06av22.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id EA9054C050; Mon, 18 Oct 2021 04:29:11 +0000 (GMT) Received: from d06av22.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 9E81F4C046; Mon, 18 Oct 2021 04:29:08 +0000 (GMT) Received: from [9.200.154.17] (unknown [9.200.154.17]) by d06av22.portsmouth.uk.ibm.com (Postfix) with ESMTPS; Mon, 18 Oct 2021 04:29:08 +0000 (GMT) Message-ID: <0b675ba1-cdab-652a-0bba-704b0090f1c5@linux.ibm.com> Date: Mon, 18 Oct 2021 12:29:06 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.2.0 Subject: Re: [RFC] Don't move cold code out of loop by checking bb count Content-Language: en-US To: Richard Biener Cc: GCC Patches , Segher Boessenkool , David Edelsohn , Bill Schmidt , Jiufu Guo , linkw@gcc.gnu.org, Feng Xue OS , Jan Hubicka References: <20210802050501.159058-1-luoxhu@linux.ibm.com> <53b7c729-33c0-138f-fa06-d6efb7a43911@linux.ibm.com> From: Xionghu Luo In-Reply-To: Content-Type: text/plain; charset=UTF-8 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: PyD5noJsX2bi_0PmgDDyh8jX6gKwslcB X-Proofpoint-ORIG-GUID: PX4R35_L3E4_MDGrefRdN7dmnlv_56Zi Content-Transfer-Encoding: 7bit X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.182.1,Aquarius:18.0.790,Hydra:6.0.425,FMLib:17.0.607.475 definitions=2021-10-18_01,2021-10-14_02,2020-04-07_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 adultscore=0 mlxscore=0 phishscore=0 lowpriorityscore=0 clxscore=1015 suspectscore=0 bulkscore=0 spamscore=0 mlxlogscore=999 impostorscore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2109230001 definitions=main-2110180023 X-Spam-Status: No, score=-11.4 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, NICE_REPLY_A, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Oct 2021 04:29:22 -0000 On 2021/10/15 16:11, Richard Biener wrote: > On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo wrote: >> >> Hi, >> >> On 2021/9/28 20:09, Richard Biener wrote: >>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo wrote: >>>> >>>> Update the patch to v3, not sure whether you prefer the paste style >>>> and continue to link the previous thread as Segher dislikes this... >>>> >>>> >>>> [PATCH v3] Don't move cold code out of loop by checking bb count >>>> >>>> >>>> Changes: >>>> 1. Handle max_loop in determine_max_movement instead of >>>> outermost_invariant_loop. >>>> 2. Remove unnecessary changes. >>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. >>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused >>>> infinite loop when implementing v1 and the iteration is missed to be >>>> updated actually. >>>> >>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html >>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html >>>> >>>> There was a patch trying to avoid move cold block out of loop: >>>> >>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >>>> >>>> Richard suggested to "never hoist anything from a bb with lower execution >>>> frequency to a bb with higher one in LIM invariantness_dom_walker >>>> before_dom_children". >>>> >>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to >>>> expected target loop, if profile count of the loop bb is colder >>>> than target loop preheader, it won't be hoisted out of loop. >>>> Likely for store motion, if all locations of the REF in loop is cold, >>>> don't do store motion of it. >>>> >>>> SPEC2017 performance evaluation shows 1% performance improvement for >>>> intrate GEOMEAN and no obvious regression for others. Especially, >>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >>>> on P8LE. >>>> >>>> gcc/ChangeLog: >>>> >>>> * loop-invariant.c (find_invariants_bb): Check profile count >>>> before motion. >>>> (find_invariants_body): Add argument. >>>> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. >>>> (determine_max_movement): Use find_coldest_out_loop. >>>> (move_computations_worker): Adjust and fix iteration udpate. >>>> (execute_sm_exit): Check pointer validness. >>>> (class ref_in_loop_hot_body): New functor. >>>> (ref_in_loop_hot_body::operator): New. >>>> (can_sm_ref_p): Use for_all_locs_in_loop. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.dg/tree-ssa/recip-3.c: Adjust. >>>> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. >>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. >>>> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. >>>> --- >>>> gcc/loop-invariant.c | 10 ++-- >>>> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- >>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ >>>> 7 files changed, 165 insertions(+), 8 deletions(-) >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c >>>> >>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c >>>> index fca0c2b24be..5c3be7bf0eb 100644 >>>> --- a/gcc/loop-invariant.c >>>> +++ b/gcc/loop-invariant.c >>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) >>>> call. */ >>>> >>>> static void >>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) >>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, >>>> + bool always_executed) >>>> { >>>> rtx_insn *insn; >>>> + basic_block preheader = loop_preheader_edge (loop)->src; >>>> + >>>> + if (preheader->count > bb->count) >>>> + return; >>>> >>>> FOR_BB_INSNS (bb, insn) >>>> { >>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, >>>> unsigned i; >>>> >>>> for (i = 0; i < loop->num_nodes; i++) >>>> - find_invariants_bb (body[i], >>>> - bitmap_bit_p (always_reached, i), >>>> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), >>>> bitmap_bit_p (always_executed, i)); >>>> } >>>> >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>>> index 4b187c2cdaf..655fab03442 100644 >>>> --- a/gcc/tree-ssa-loop-im.c >>>> +++ b/gcc/tree-ssa-loop-im.c >>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) >>>> return ret; >>>> } >>>> >>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ >>>> + >>>> +static class loop * >>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >>>> + basic_block curr_bb) >>>> +{ >>>> + class loop *cold_loop, *min_loop; >>>> + cold_loop = min_loop = outmost_loop; >>>> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; >>>> + >>>> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) >>> >>> Honza - can you comment on whether we should compare BB counts this way? >>> >>> I would suspect that for, say, >>> >>> for (...) >>> if (a) >>> X; >>> else >>> Y; >>> >>> that the counts for X and Y will be less than that of the preheader of the loop >>> only when the loop is estimated to run once. That is, should we really compare >>> the to the preheader count or maybe better to the _header_ count which >>> would keep the number of iterations out of the equation? >> >> I quickly tried to replace all the loop_preheader_edge (loop)->src with >> loop_preheader_edge (loop)->dest, it will cause many failures in >> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems >> reasonable to compare the bb count with preheader count as both gimple lim >> and RTL loop-invariant move instructions to *preheader* instead of *header* >> after analysis? > > Hmm, yeah - guess I was confused here. > >>> >>> If we look at maybe_hot_count_p that's a quite sophisticated thing to >>> compare a count to the "IPA hot", here we're comparing two counts >>> within a function where it actually matters whether we use a>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p >>> function). >>> >>> Xionghu, you error on the side of not hoisting for unordered counts here >>> >>>> + return NULL; >>>> + >>>> + while (min_loop != loop) >>>> + { >>>> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); >>>> + if (loop_preheader_edge (min_loop)->src->count < min_count) >>> >>> but in the other direction here and on the side of not hoisting >>> in ref_in_loop_hot_body. >>> >>> The three-state relational operator overloads are probably not the >>> very best idea... >>> (see profile-count.h for them) >>> >> Added new function bb_colder_than_loop_preheader to encapsulate the comparision, >> if FALSE is returned due to three-state inequality, find_coldest_out_loop >> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator () >> will return true to continue perform store motion, both preserve the previous >> behavior. > > Thanks. But I don't think the abstraction as written is useful: > > +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state > + as stated in profile-count.h, FALSE is returned if inequality cannot be > + decided. */ > +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) > +{ > + if (count1 < count2) > + return true; > + else > + return false; > +} > > given the following seems to pass the preheader count in place of the BB count. > > + if (bb_colder_than_loop_preheader ( > + loop_preheader_edge (min_loop)->src->count, min_count)) > + cold_loop = min_loop; > > find_coldest_out_loop is also a bit weird, I think we want to find > the outermost loop between outmost_loop and loop that has a > lower count than the curr_bb count but > > + while (min_loop != loop) > + { > + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > + if (bb_colder_than_loop_preheader ( > + loop_preheader_edge (min_loop)->src->count, min_count)) > + cold_loop = min_loop; > > compares the outermost loops count (min_count) against the preheader > count? So we're searching for a cold loop with respect to its enclosing loop > here? Let me try to explain how it works :) find_coldest_out_loop does two steps check: 1) Check whether curr_bb is cold in it's own loop_father, if it is cold, just return NULL which means it should not be moved out at all; 2) curr_bb is NOT cold, assuming the current loop L[m] is the coldest first, than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]}, if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop that has smallest profile_count. Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in loop 3, if it is cold, just return NULL, otherwise select the coldest loop in {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to be the target hoist loop. The first check could AVOID hoist if curr_bb is colder than loop3, but it is still hot than loop1 and loop2. Not sure whether it is possible to construct such cases? gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c volatile int x; void bar (int, char *, char *); void foo (int *a, int n, int m, int s, int t) { int i; int j; int k; for (i = 0; i < m; i++) // loop 1 { if (__builtin_expect (x, 0)) for (j = 0; j < n; j++) // loop 2 for (k = 0; k < n; k++) // loop 3 { bar (s / 5, "one", "two"); // curr_bb a[t] = s; } a[t] = t; // curr_bb2 } } The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1) with this patch. There are totally 6 combinations when curr_bb is hotter than loop 3. We need to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness", returning the coldest loop for this function find_coldest_out_loop, otherwise unexpected behavior happens. L1 > L2 > L3 => return L3 L1 > L3 > L2 => return L2 L2 > L1 > L3 => return L3 L2 > L3 > L1 => return L1 L3 > L1 > L2 => return L2 L3 > L2 > L1 => return L1 So bb_colder_than_loop_preheader does two kind of checks, one is checking L3 preheader count with curr_bb count, another is checking L3 preheader count with L1 preheader count, L2 preheader count, etc... ssa-lim-19.c.138t.lim2: ... [local count: 16057869]: // L1 preheader - _4 = s_22(D) / 5; - _5 = (long unsigned int) t_24(D); - _6 = _5 * 4; - _7 = a_25(D) + _6; _8 = (long unsigned int) t_24(D); _9 = _8 * 4; _10 = a_25(D) + _9; [local count: 145980626]: # i_34 = PHI x.0_1 ={v} x; if (x.0_1 != 0) goto ; [10.00%] else goto ; [90.00%] [local count: 14598063]: if (n_20(D) > 0) goto ; [89.00%] else goto ; [11.00%] [local count: 12992276]: // L2 preheader + _4 = s_22(D) / 5; + _5 = (long unsigned int) t_24(D); + _6 = _5 * 4; + _7 = a_25(D) + _6; goto ; [100.00%] [local count: 850510901]: [local count: 955630225]: // curr_bb # k_36 = PHI bar (_4, "one", "two"); *_7 = s_22(D); k_27 = k_36 + 1; if (n_20(D) > k_27) goto ; [89.00%] else goto ; [11.00%] [local count: 118111600]: j_21 = j_35 + 1; if (n_20(D) > j_21) goto ; [89.00%] else goto ; [11.00%] [local count: 105119324]: [local count: 118111600]: // L3 preheader # j_35 = PHI goto ; [100.00%] [local count: 145980626]: *_10 = t_24(D); i_29 = i_34 + 1; Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop: +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state + as stated in profile-count.h, FALSE is returned if inequality cannot be + decided. */ +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) +{ + if (count1 < count2) + return true; + else + return false; +} + +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count. + */ + +static class loop * +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, + basic_block curr_bb) +{ + class loop *cold_loop, *min_loop; + cold_loop = min_loop = outmost_loop; + profile_count min_count = loop_preheader_edge (min_loop)->src->count; + + /* If bb_colder_than_loop_preheader returns false due to three-state + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. */ + if (curr_bb + && bb_colder_than_loop_preheader (curr_bb->count, + loop_preheader_edge (loop)->src->count)) + return NULL; + + while (min_loop != loop) + { + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); + if (bb_colder_than_loop_preheader ( + loop_preheader_edge (min_loop)->src->count, min_count)) + cold_loop = min_loop; + } + return cold_loop; +} + > > Why is this function not simply > > +static class loop * > +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > + basic_block curr_bb) > +{ > while (bb_colder_than_loop_preheader (curr_bb->count, > loop_preheader_edge (outermost_loop)->src->count)) > { > if (outermost_loop == loop) > return NULL; > outermost_loop = superloop_at_depth (loop, loop_depth > (outermost_loop) + 1); > } > return outermost_loop; > } If change like this, when processing curr_bb(5), outermost_loop will return loop 1 since curr_bb->count > Loop1_prehead->count, the while loop stopped. This doesn't meet what we want. > > ? > > Likewise I wonder why ref_in_loop_hot_body::operator () needs to call > find_coldest_out_loop and why it not simply does > > +bool > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) > +{ > + basic_block curr_bb = gimple_bb (loc->stmt); > if (bb_colder_than_loop_preheader (curr_bb->count, > loop_preheader_edge (l)->src->count)) > return false; > return true; > } Likely for this part, +/* Find out the coldest loop between loop L and innermost loop, compare the + hotness between current BB and coldest loop preheader by profile count. */ +bool +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) +{ + basic_block curr_bb = gimple_bb (loc->stmt); + class loop *inner_loop = curr_bb->loop_father; + class loop *cold_loop = l; + if (l != inner_loop) + cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb); + if (!cold_loop) + return false; + edge e = loop_preheader_edge (cold_loop); + /* If bb_colder_than_loop_preheader is false due to three-state inequality + comparision, TRUE is returned to continue perform store motion. */ + if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count)) + return false; + else + return true; +} l is the input of ref_in_loop_hot_body, it is an out loop, we need to find a cold_loop between l and inner_loop. Reason is there may be cold loop between l and inner_loop, which means we shouldn't do store-motion from curr_bb to l directly. After reconsideration, I think the bb_colder_than_loop_preheader could be removed since curr_bb is checked in find_coldest_out_loop already. And remove the "l != inner_loop" check: +/* Find out the coldest loop between loop L and innermost loop. */ +bool +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) +{ + basic_block curr_bb = gimple_bb (loc->stmt); + class loop *inner_loop = curr_bb->loop_father; + class loop *cold_loop = l; + cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb); + if (!cold_loop) + return false; + return true; +} -- Thanks, Xionghu