public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	David Edelsohn <dje.gcc@gmail.com>,
	Bill Schmidt <wschmidt@linux.ibm.com>,
	Jiufu Guo <guojiufu@linux.ibm.com>,
	linkw@gcc.gnu.org, Feng Xue OS <fxue@os.amperecomputing.com>,
	Jan Hubicka <hubicka@ucw.cz>
Subject: Re: [RFC] Don't move cold code out of loop by checking bb count
Date: Mon, 18 Oct 2021 12:29:06 +0800	[thread overview]
Message-ID: <0b675ba1-cdab-652a-0bba-704b0090f1c5@linux.ibm.com> (raw)
In-Reply-To: <CAFiYyc0vSB=+n9SBzQ49dBow3mq7FcNqmELicLaGX4tCgmH7Bw@mail.gmail.com>



On 2021/10/15 16:11, Richard Biener wrote:
> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> On 2021/9/28 20:09, Richard Biener wrote:
>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> Update the patch to v3, not sure whether you prefer the paste style
>>>> and continue to link the previous thread as Segher dislikes this...
>>>>
>>>>
>>>> [PATCH v3] Don't move cold code out of loop by checking bb count
>>>>
>>>>
>>>> Changes:
>>>> 1. Handle max_loop in determine_max_movement instead of
>>>> outermost_invariant_loop.
>>>> 2. Remove unnecessary changes.
>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
>>>> infinite loop when implementing v1 and the iteration is missed to be
>>>> updated actually.
>>>>
>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
>>>>
>>>> There was a patch trying to avoid move cold block out of loop:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>
>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>> before_dom_children".
>>>>
>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>>>> expected target loop, if profile count of the loop bb is colder
>>>> than target loop preheader, it won't be hoisted out of loop.
>>>> Likely for store motion, if all locations of the REF in loop is cold,
>>>> don't do store motion of it.
>>>>
>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>> on P8LE.
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * loop-invariant.c (find_invariants_bb): Check profile count
>>>>         before motion.
>>>>         (find_invariants_body): Add argument.
>>>>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>>>>         (determine_max_movement): Use find_coldest_out_loop.
>>>>         (move_computations_worker): Adjust and fix iteration udpate.
>>>>         (execute_sm_exit): Check pointer validness.
>>>>         (class ref_in_loop_hot_body): New functor.
>>>>         (ref_in_loop_hot_body::operator): New.
>>>>         (can_sm_ref_p): Use for_all_locs_in_loop.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>>>>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>>>>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>>>>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>>>> ---
>>>>  gcc/loop-invariant.c                       | 10 ++--
>>>>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
>>>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
>>>>  7 files changed, 165 insertions(+), 8 deletions(-)
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>>>>
>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>>>> index fca0c2b24be..5c3be7bf0eb 100644
>>>> --- a/gcc/loop-invariant.c
>>>> +++ b/gcc/loop-invariant.c
>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>>>>     call.  */
>>>>
>>>>  static void
>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>>>> +                   bool always_executed)
>>>>  {
>>>>    rtx_insn *insn;
>>>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>>>> +
>>>> +  if (preheader->count > bb->count)
>>>> +    return;
>>>>
>>>>    FOR_BB_INSNS (bb, insn)
>>>>      {
>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>>>>    unsigned i;
>>>>
>>>>    for (i = 0; i < loop->num_nodes; i++)
>>>> -    find_invariants_bb (body[i],
>>>> -                       bitmap_bit_p (always_reached, i),
>>>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>>>                         bitmap_bit_p (always_executed, i));
>>>>  }
>>>>
>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>>>> index 4b187c2cdaf..655fab03442 100644
>>>> --- a/gcc/tree-ssa-loop-im.c
>>>> +++ b/gcc/tree-ssa-loop-im.c
>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>>>>    return ret;
>>>>  }
>>>>
>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
>>>> +
>>>> +static class loop *
>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
>>>> +                      basic_block curr_bb)
>>>> +{
>>>> +  class loop *cold_loop, *min_loop;
>>>> +  cold_loop = min_loop = outmost_loop;
>>>> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
>>>> +
>>>> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
>>>
>>> Honza - can you comment on whether we should compare BB counts this way?
>>>
>>> I would suspect that for, say,
>>>
>>>   for (...)
>>>      if (a)
>>>        X;
>>>      else
>>>        Y;
>>>
>>> that the counts for X and Y will be less than that of the preheader of the loop
>>> only when the loop is estimated to run once.  That is, should we really compare
>>> the to the preheader count or maybe better to the _header_ count which
>>> would keep the number of iterations out of the equation?
>>
>> I quickly tried to replace all the loop_preheader_edge (loop)->src with
>> loop_preheader_edge (loop)->dest, it will cause many failures in
>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
>> reasonable to compare the bb count with preheader count as both gimple lim
>> and RTL loop-invariant move instructions to *preheader* instead of *header*
>> after analysis?
> 
> Hmm, yeah - guess I was confused here.
> 
>>>
>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to
>>> compare a count to the "IPA hot", here we're comparing two counts
>>> within a function where it actually matters whether we use a<b or
>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
>>> function).
>>>
>>> Xionghu, you error on the side of not hoisting for unordered counts here
>>>
>>>> +    return NULL;
>>>> +
>>>> +  while (min_loop != loop)
>>>> +    {
>>>> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
>>>> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
>>>
>>> but in the other direction here and on the side of not hoisting
>>> in ref_in_loop_hot_body.
>>>
>>> The three-state relational operator overloads are probably not the
>>> very best idea...
>>> (see profile-count.h for them)
>>>
>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision,
>> if FALSE is returned due to three-state inequality,  find_coldest_out_loop
>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator ()
>> will return true to continue perform store motion, both preserve the previous
>> behavior.
> 
> Thanks.  But I don't think the abstraction as written is useful:
> 
> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> +   decided.  */
> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
> +{
> +  if (count1 < count2)
> +    return true;
> +  else
> +    return false;
> +}
> 
> given the following seems to pass the preheader count in place of the BB count.
> 
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (min_loop)->src->count, min_count))
> +       cold_loop = min_loop;
> 
> find_coldest_out_loop is also a bit weird, I think we want to find
> the outermost loop between outmost_loop and loop that has a
> lower count than the curr_bb count but
> 
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (bb_colder_than_loop_preheader (
> +           loop_preheader_edge (min_loop)->src->count, min_count))
> +       cold_loop = min_loop;
> 
> compares the outermost loops count (min_count) against the preheader
> count?  So we're searching for a cold loop with respect to its enclosing loop
> here?

Let me try to explain how it works :)

find_coldest_out_loop does two steps check:
1) Check whether curr_bb is cold in it's own loop_father, if it is cold,
just return NULL which means it should not be moved out at all;
2)  curr_bb is NOT cold, assuming the current loop L[m] is the coldest first, 
than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]}, 
if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop
that has smallest profile_count.


Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in
loop 3, if it is cold, just return NULL, otherwise select the coldest loop in
{loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to
be the target hoist loop.  The first check could AVOID hoist if curr_bb is colder
than loop3, but it is still hot than loop1 and loop2.  Not sure whether it is possible
to construct such cases?


gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c

volatile int x;
void
bar (int, char *, char *);
void
foo (int *a, int n, int m, int s, int t)
{
  int i;
  int j;
  int k;

  for (i = 0; i < m; i++)  // loop 1
    {
      if (__builtin_expect (x, 0))
        for (j = 0; j < n; j++)   // loop 2
          for (k = 0; k < n; k++)   // loop 3
           {
             bar (s / 5, "one", "two");  // curr_bb 
             a[t] = s;
           }
      a[t] = t;  // curr_bb2
    }
}

The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1)
with this patch.
There are totally 6 combinations when curr_bb is hotter than loop 3.  We need
to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness",
returning the coldest loop for this function find_coldest_out_loop, otherwise
unexpected behavior happens.

L1 > L2 > L3   =>  return L3
L1 > L3 > L2   =>  return L2
L2 > L1 > L3   =>  return L3
L2 > L3 > L1   =>  return L1
L3 > L1 > L2   =>  return L2
L3 > L2 > L1   =>  return L1

So bb_colder_than_loop_preheader does two kind of checks, one is checking
L3 preheader count with curr_bb count, another is checking L3 preheader count
with L1 preheader count, L2 preheader count, etc...


ssa-lim-19.c.138t.lim2:
...
   <bb 10> [local count: 16057869]:  // L1 preheader
-  _4 = s_22(D) / 5;
-  _5 = (long unsigned int) t_24(D);
-  _6 = _5 * 4;
-  _7 = a_25(D) + _6;
   _8 = (long unsigned int) t_24(D);
   _9 = _8 * 4;
   _10 = a_25(D) + _9;

   <bb 3> [local count: 145980626]:
   # i_34 = PHI <i_29(12), 0(10)>
   x.0_1 ={v} x;
   if (x.0_1 != 0)
     goto <bb 4>; [10.00%]
   else
     goto <bb 8>; [90.00%]

   <bb 4> [local count: 14598063]:
   if (n_20(D) > 0)
     goto <bb 11>; [89.00%]
   else
     goto <bb 8>; [11.00%]

   <bb 11> [local count: 12992276]:  // L2 preheader
+  _4 = s_22(D) / 5;
+  _5 = (long unsigned int) t_24(D);
+  _6 = _5 * 4;
+  _7 = a_25(D) + _6;
   goto <bb 7>; [100.00%]

   <bb 14> [local count: 850510901]:

   <bb 5> [local count: 955630225]:  // curr_bb
   # k_36 = PHI <k_27(14), 0(7)>
   bar (_4, "one", "two");
   *_7 = s_22(D);
   k_27 = k_36 + 1;
   if (n_20(D) > k_27)
     goto <bb 14>; [89.00%]
   else
     goto <bb 6>; [11.00%]

   <bb 6> [local count: 118111600]:
   j_21 = j_35 + 1;
   if (n_20(D) > j_21)
     goto <bb 13>; [89.00%]
   else
     goto <bb 8>; [11.00%]

   <bb 13> [local count: 105119324]:

   <bb 7> [local count: 118111600]:   // L3 preheader
   # j_35 = PHI <j_21(13), 0(11)>
   goto <bb 5>; [100.00%]

   <bb 8> [local count: 145980626]:
   *_10 = t_24(D);
   i_29 = i_34 + 1;

Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop:

+/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state
+   as stated in profile-count.h, FALSE is returned if inequality cannot be
+   decided.  */
+bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2)
+{
+  if (count1 < count2)
+    return true;
+  else
+    return false;
+}
+
+/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count.
+ */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		       basic_block curr_bb)
+{
+  class loop *cold_loop, *min_loop;
+  cold_loop = min_loop = outmost_loop;
+  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+    comparision, OUTMOST_LOOP is returned finally to preserve the behavior.
+    Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb->count,
+					loop_preheader_edge (loop)->src->count))
+    return NULL;
+
+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (bb_colder_than_loop_preheader (
+	    loop_preheader_edge (min_loop)->src->count, min_count))
+	cold_loop = min_loop;
+    }
+  return cold_loop;
+}
+


> 
> Why is this function not simply
> 
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                      basic_block curr_bb)
> +{
>      while (bb_colder_than_loop_preheader (curr_bb->count,
>                loop_preheader_edge (outermost_loop)->src->count))
>         {
>             if (outermost_loop == loop)
>               return NULL;
>             outermost_loop = superloop_at_depth (loop, loop_depth
> (outermost_loop) + 1);
>         }
>      return outermost_loop;
> }

If change like this, when processing curr_bb(5), outermost_loop will
return loop 1 since curr_bb->count > Loop1_prehead->count, the while
loop stopped.  This doesn't meet what we want.  

> 
> ?
> 
> Likewise I wonder why ref_in_loop_hot_body::operator () needs to call
> find_coldest_out_loop and why it not simply does
> 
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
>     if (bb_colder_than_loop_preheader (curr_bb->count,
> loop_preheader_edge (l)->src->count))
>       return false;
>    return true;
>    }

Likely for this part,

+/* Find out the coldest loop between loop L and innermost loop, compare the
+   hotness between current BB and coldest loop preheader by profile count.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  class loop *cold_loop = l;
+  if (l != inner_loop)
+    cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
+  if (!cold_loop)
+    return false;
+  edge e = loop_preheader_edge (cold_loop);
+  /*  If bb_colder_than_loop_preheader is false due to three-state inequality
+     comparision, TRUE is returned to continue perform store motion.  */
+  if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count))
+    return false;
+  else
+    return true;
+}

l is the input of ref_in_loop_hot_body, it is an out loop, we need to find a
cold_loop between l and inner_loop.  Reason is there may be cold loop between
l and inner_loop, which means we shouldn't do store-motion from curr_bb to l
directly.
After reconsideration, I think the bb_colder_than_loop_preheader could be
 removed since curr_bb is checked in find_coldest_out_loop already.  And remove
the "l != inner_loop" check:

+/* Find out the coldest loop between loop L and innermost loop.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  class loop *cold_loop = l;
+  cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb);
+  if (!cold_loop)
+    return false;
+  return true;
+}


-- 
Thanks,
Xionghu

  reply	other threads:[~2021-10-18  4:29 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-02  5:05 Xiong Hu Luo
2021-08-06 12:15 ` Richard Biener
2021-08-10  2:03   ` Xionghu Luo
2021-08-10  4:25     ` Ulrich Drepper
2021-08-19  5:51       ` [PATCH v2] " Xionghu Luo
2021-08-26 11:33     ` [RFC] " Richard Biener
2021-09-09  1:55       ` Xionghu Luo
2021-09-22  9:14         ` Richard Biener
2021-09-23  2:13           ` Xionghu Luo
2021-09-23  2:16             ` Xionghu Luo
2021-09-24  6:29           ` Xionghu Luo
2021-09-28 12:09             ` Richard Biener
2021-10-09  3:44               ` Xionghu Luo
2021-10-15  8:11                 ` Richard Biener
2021-10-18  4:29                   ` Xionghu Luo [this message]
2021-10-19  1:47                     ` [PATCH v5 2/2] " Xionghu Luo
2021-10-26 13:20                     ` [RFC] " Richard Biener
2021-10-27  2:40                       ` Xionghu Luo
2021-10-29 11:48                         ` Richard Biener
2021-11-03  6:49                           ` Xionghu Luo
2021-11-03 13:29                           ` Xionghu Luo
2021-11-04 13:00                             ` Richard Biener
2021-11-10  3:08                               ` [PATCH v7 2/2] " Xionghu Luo
2021-11-24  5:15                                 ` Ping: " Xionghu Luo
2021-11-24  7:35                                   ` Richard Biener
2021-12-01 10:09                                 ` Richard Biener
2021-12-06  5:09                                   ` [PATCH v8 " Xionghu Luo
2021-12-06  5:26                                     ` Xionghu Luo
2021-12-07 12:17                                       ` Richard Biener
2021-12-08  6:32                                         ` Xionghu Luo
2021-12-20  7:29                                           ` Richard Biener
2021-12-21  3:59                                             ` Xionghu Luo
2021-10-27 12:54                 ` [RFC] " Jan Hubicka
2021-10-28  1:49                   ` Xionghu Luo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0b675ba1-cdab-652a-0bba-704b0090f1c5@linux.ibm.com \
    --to=luoxhu@linux.ibm.com \
    --cc=dje.gcc@gmail.com \
    --cc=fxue@os.amperecomputing.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=guojiufu@linux.ibm.com \
    --cc=hubicka@ucw.cz \
    --cc=linkw@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    --cc=segher@kernel.crashing.org \
    --cc=wschmidt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).