Re: [RFC] Don't move cold code out of loop by checking bb count

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	David Edelsohn <dje.gcc@gmail.com>,
	Bill Schmidt <wschmidt@linux.ibm.com>,
	Jiufu Guo <guojiufu@linux.ibm.com>,
	linkw@gcc.gnu.org, Feng Xue OS <fxue@os.amperecomputing.com>,
	Jan Hubicka <hubicka@ucw.cz>
Subject: Re: [RFC] Don't move cold code out of loop by checking bb count
Date: Thu, 9 Sep 2021 09:55:57 +0800	[thread overview]
Message-ID: <f4382ef5-2825-4c75-720f-1e475b9b9636@linux.ibm.com> (raw)
In-Reply-To: <CAFiYyc0ZxonSQ8kjoErua7VSVkYw5PmaLUADt3cC659wV_8KiA@mail.gmail.com>



On 2021/8/26 19:33, Richard Biener wrote:
> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> On 2021/8/6 20:15, Richard Biener wrote:
>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>> There was a patch trying to avoid move cold block out of loop:
>>>>
>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>
>>>> Richard suggested to "never hoist anything from a bb with lower execution
>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>> before_dom_children".
>>>>
>>>> This patch does this profile count check in both gimple LIM
>>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
>>>> if the loop bb is colder than loop preheader, don't hoist it out of
>>>> loop.
>>>>
>>>> Also, the profile count in loop split pass should be corrected to avoid
>>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
>>>> move statement out of loop unexpectely when lim2 didn't move it.  This
>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>>>>
>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>> on P8LE.
>>>>
>>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
>>>
>>> While I'm not familiar with the RTL invariant motion pass the patch there
>>> looks reasonable.  Note that we should assess the profile quality
>>> somehow - I'm not sure how to do that, CCed Honza for that.
>>
>> Thanks.
>>
>>>
>>> For the GIMPLE part the patch looks quite complicated - but note it
>>> probably has to be since LIM performs kind of a "CSE" on loads
>>> (and stores for store-motion), so when there are multiple stmts
>>> affected by a hoisting decision the biggest block count has to be
>>> accounted.  Likewise when there are dependent stmts involved
>>> that might include conditional stmts (a "PHI"), but the overall
>>> cost should be looked at.
>>
>> Currently, The gimple code check two situations with the patch:
>> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
>> of loop;
>> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
>> couldn't be moved out of loop, also don't move it out of loop to avoid definition
>> not dominates use error.
> 
> But part 2) is obviously already done.  What I tried to say is your heuristic
> doesn't integrate nicely with the pass but I admitted that it might be a bit
> difficult to find a place to add this heuristic.
> 
> There is lim_data->cost which we could bias negatively but then this is
> a cost that is independent on the hoisting distance.  But doing this would
> work at least for the case where the immediately enclosing loop preheader
> is hotter than the stmt and with this it would be a patch that's similarly
> simple as the RTL one.
> 
> Another possibility is to simply only adjust PHI processing in
> compute_invariantness, capping movement according to the hotness
> heuristic.  The same could be done for regular stmts there but I'm
> not sure that will do good in the end since this function is supposed
> to compute "correctness" (well, it also has the cost stuff), and it's
> not the place to do overall cost considerations.

Thanks.  I found that adding a function find_coldest_out_loop and check it in
outermost_invariant_loop to find the coldest invariant loop between outermost
loop and itself could also reach the purpose.  Then the gimple code check is
redundant and could be removed.

> 
>> May be I could collect the number of instructions not hoisted with the patch
>> on regression tests and SPEC2017 to do a estimation for "multiple stmts affected"
>> and "overall cost" need to be considered?  But it seems move_computations_worker
>> couldn't rollback if we still want to hoist multiple stmts out during the iterations?
>>
>>>
>>> Now - GIMPLE LIM "costing" is somewhat backward right now
>>> and it isn't set up to consider those multiple involved stmts.  Plus
>>> the store-motion part does not have any cost part (but it depends
>>> on previously decided invariant motions).
>>>
>>> I think the way you implemented the check will cause no hoisting
>>> to be performed instead of, say, hoisting to a different loop level
>>> only.  Possibly shown when you consider a loop nest like
>>>
>>>     for (;;)
>>>       if (unlikely_cond)
>>>         for (;;)
>>>            invariant;
>>>
>>> we want to hoist 'invariant' but only from the inner loop even if it
>>> is invariant also in the outer loop.
>>
>>
>> For this case, theorotically I think the master GCC will optimize it to:
>>
>>    invariant;
>>    for (;;)
>>      if (unlikely_cond)
>>        for (;;)
>>           ;
>>
>> 'invariant' is moved out of outer loop, but with the patch, it will get:
>>
>>    for (;;)
>>      if (unlikely_cond)
>>        {
>>          invariant;
>>          for (;;)
>>             ;
>>        }
>>
>> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
>> so hoist it out of inner loop, this is exactly what we want, right?
> 
> Yes.  I had doubts your patch would achieve that.
> 


The below updated patch could achieve it:


There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis,  add find_coldest_out_loop to move invariants to
expected target loop, then  if profile count of the loop bb is colder
than target loop preheader, it won't be hoisted out of loop.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

Regression and bootstrap tested pass on P8LE, any comments?  Thanks.

gcc/ChangeLog:

	* loop-invariant.c (find_invariants_bb): Check profile count
	before motion.
	(find_invariants_body): Add argument.
	* tree-ssa-loop-im.c (find_coldest_out_loop): New function.
	(outermost_invariant_loop): Use find_coldest_out_loop.
	(determine_max_movement): Likewise.
	(move_computations_worker): Adjust and fix iteration udpate.
	(execute_sm): Likewise.
	(execute_sm_exit): Check pointer validness.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-16.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-17.c: New test.
---
  gcc/loop-invariant.c                       | 10 ++-
  gcc/tree-ssa-loop-im.c                     | 79 ++++++++++++++++++----
  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c | 20 ++++++
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 26 +++++++
  5 files changed, 121 insertions(+), 16 deletions(-)
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index fca0c2b24be..5c3be7bf0eb 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
     call.  */
  
  static void
-find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
+find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
+		    bool always_executed)
  {
    rtx_insn *insn;
+  basic_block preheader = loop_preheader_edge (loop)->src;
+
+  if (preheader->count > bb->count)
+    return;
  
    FOR_BB_INSNS (bb, insn)
      {
@@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
    unsigned i;
  
    for (i = 0; i < loop->num_nodes; i++)
-    find_invariants_bb (body[i],
-			bitmap_bit_p (always_reached, i),
+    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
  			bitmap_bit_p (always_executed, i));
  }
  
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index d9f75d5025e..f5ab6a734e7 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
    return ret;
  }
  
+/* Find coldest loop between outmost_loop and loop by comapring profile count.  */
+
+static class loop *
+find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
+		       basic_block def_bb = NULL)
+{
+  class loop *cold_loop, *min_loop;
+  cold_loop = min_loop = outmost_loop;
+  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
+
+  if (def_bb && def_bb->count < loop_preheader_edge (loop)->src->count)
+    return NULL;
+
+  while (min_loop != loop)
+    {
+      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
+      if (loop_preheader_edge (min_loop)->src->count < min_count)
+	cold_loop = min_loop;
+    }
+  return cold_loop;
+}
+
  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
     loop to that we could move the expression using DEF if it did not have
     other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -431,18 +453,18 @@ outermost_invariant_loop (tree def, class loop *loop)
    struct lim_aux_data *lim_data;
  
    if (!def)
-    return superloop_at_depth (loop, 1);
+    return find_coldest_out_loop (superloop_at_depth (loop, 1), loop);
  
    if (TREE_CODE (def) != SSA_NAME)
      {
        gcc_assert (is_gimple_min_invariant (def));
-      return superloop_at_depth (loop, 1);
+      return find_coldest_out_loop (superloop_at_depth (loop, 1), loop);
      }
  
    def_stmt = SSA_NAME_DEF_STMT (def);
    def_bb = gimple_bb (def_stmt);
    if (!def_bb)
-    return superloop_at_depth (loop, 1);
+    return find_coldest_out_loop (superloop_at_depth (loop, 1), loop, def_bb);
  
    max_loop = find_common_loop (loop, def_bb->loop_father);
  
@@ -452,7 +474,13 @@ outermost_invariant_loop (tree def, class loop *loop)
  				 loop_outer (lim_data->max_loop));
    if (max_loop == loop)
      return NULL;
-  max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
+  max_loop = find_coldest_out_loop (max_loop, loop, def_bb);
+  if (!max_loop)
+    return NULL;
+  if (max_loop == loop)
+    return max_loop;
+  else
+    max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
  
    return max_loop;
  }
@@ -684,7 +712,11 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
    if (must_preserve_exec)
      level = ALWAYS_EXECUTED_IN (bb);
    else
-    level = superloop_at_depth (loop, 1);
+    level = find_coldest_out_loop (superloop_at_depth (loop, 1), loop, bb);
+
+  if (!level)
+    return false;
+
    lim_data->max_loop = level;
  
    if (gphi *phi = dyn_cast <gphi *> (stmt))
@@ -783,8 +815,10 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
        if (ref
  	  && MEM_ANALYZABLE (ref))
  	{
-	  lim_data->max_loop = outermost_indep_loop (lim_data->max_loop,
-						     loop, ref);
+	  level = outermost_indep_loop (lim_data->max_loop, loop, ref);
+	  if (!level)
+	    return false;
+	  lim_data->max_loop = find_coldest_out_loop (level, loop, bb);
  	  if (!lim_data->max_loop)
  	    return false;
  	}
@@ -1154,6 +1188,7 @@ move_computations_worker (basic_block bb)
  	  continue;
  	}
  
+      edge e = loop_preheader_edge (level);
        if (dump_file && (dump_flags & TDF_DETAILS))
  	{
  	  fprintf (dump_file, "Moving PHI node\n");
@@ -1191,14 +1226,13 @@ move_computations_worker (basic_block bb)
  	  tree lhs = gimple_assign_lhs (new_stmt);
  	  SSA_NAME_RANGE_INFO (lhs) = NULL;
  	}
-      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
+      gsi_insert_on_edge (e, new_stmt);
        remove_phi_node (&bsi, false);
      }
  
    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
      {
        edge e;
-
        gimple *stmt = gsi_stmt (bsi);
  
        lim_data = get_lim_data (stmt);
@@ -1221,8 +1255,12 @@ move_computations_worker (basic_block bb)
        /* We do not really want to move conditionals out of the loop; we just
  	 placed it here to force its operands to be moved if necessary.  */
        if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
  
+      e = loop_preheader_edge (level);
        if (dump_file && (dump_flags & TDF_DETAILS))
  	{
  	  fprintf (dump_file, "Moving statement\n");
@@ -1231,7 +1269,6 @@ move_computations_worker (basic_block bb)
  		   cost, level->num);
  	}
  
-      e = loop_preheader_edge (level);
        gcc_assert (!gimple_vdef (stmt));
        if (gimple_vuse (stmt))
  	{
@@ -2133,6 +2170,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
    bool multi_threaded_model_p = false;
    gimple_stmt_iterator gsi;
    sm_aux *aux = new sm_aux;
+  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
+
+  edge e = loop_preheader_edge (loop);
+  if (e->src->count > bb->count)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Don't execute store motion of ");
+	  print_generic_expr (dump_file, ref->mem.ref);
+	  fprintf (dump_file, " from loop %d\n", loop->num);
+	}
+      return;
+    }
  
    if (dump_file && (dump_flags & TDF_DETAILS))
      {
@@ -2241,7 +2291,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
  	}
        else
  	{
-	  sm_aux *aux = *aux_map.get (ref);
+	  sm_aux **paux = aux_map.get (ref);
+	  sm_aux *aux;
+	  if (paux)
+	    aux = *paux;
+	  else
+	    continue;
  	  if (!aux->store_flag || kind == sm_ord)
  	    {
  	      gassign *store;
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@ float h ()
  	F[0] += E / d;
  }
  
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
new file mode 100644
index 00000000000..2303f3d5d86
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+assert_fail (int, char *, char *);
+void
+foo (int *a, int n, int k)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+    {
+      if (__builtin_expect (x, 0))
+	assert_fail (k / 5, "one", "two");
+      a[i] = k;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
new file mode 100644
index 00000000000..3b1c7c0cb3e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+assert_fail (int, char *, char *);
+void
+foo (int *a, int n, int m, int k, int s)
+{
+  int i;
+  int j;
+
+  for (i = 0; i < m; i++)
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++)
+	  {
+	    assert_fail (k / 5, "one", "two");
+	  a[s] = k;
+	}
+      a[s] = s;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
-- 
2.27.0.90.geebb51ba8c

next prev parent reply	other threads:[~2021-09-09  1:56 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-02  5:05 Xiong Hu Luo
2021-08-06 12:15 ` Richard Biener
2021-08-10  2:03   ` Xionghu Luo
2021-08-10  4:25     ` Ulrich Drepper
2021-08-19  5:51       ` [PATCH v2] " Xionghu Luo
2021-08-26 11:33     ` [RFC] " Richard Biener
2021-09-09  1:55       ` Xionghu Luo [this message]
2021-09-22  9:14         ` Richard Biener
2021-09-23  2:13           ` Xionghu Luo
2021-09-23  2:16             ` Xionghu Luo
2021-09-24  6:29           ` Xionghu Luo
2021-09-28 12:09             ` Richard Biener
2021-10-09  3:44               ` Xionghu Luo
2021-10-15  8:11                 ` Richard Biener
2021-10-18  4:29                   ` Xionghu Luo
2021-10-19  1:47                     ` [PATCH v5 2/2] " Xionghu Luo
2021-10-26 13:20                     ` [RFC] " Richard Biener
2021-10-27  2:40                       ` Xionghu Luo
2021-10-29 11:48                         ` Richard Biener
2021-11-03  6:49                           ` Xionghu Luo
2021-11-03 13:29                           ` Xionghu Luo
2021-11-04 13:00                             ` Richard Biener
2021-11-10  3:08                               ` [PATCH v7 2/2] " Xionghu Luo
2021-11-24  5:15                                 ` Ping: " Xionghu Luo
2021-11-24  7:35                                   ` Richard Biener
2021-12-01 10:09                                 ` Richard Biener
2021-12-06  5:09                                   ` [PATCH v8 " Xionghu Luo
2021-12-06  5:26                                     ` Xionghu Luo
2021-12-07 12:17                                       ` Richard Biener
2021-12-08  6:32                                         ` Xionghu Luo
2021-12-20  7:29                                           ` Richard Biener
2021-12-21  3:59                                             ` Xionghu Luo
2021-10-27 12:54                 ` [RFC] " Jan Hubicka
2021-10-28  1:49                   ` Xionghu Luo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f4382ef5-2825-4c75-720f-1e475b9b9636@linux.ibm.com \
    --to=luoxhu@linux.ibm.com \
    --cc=dje.gcc@gmail.com \
    --cc=fxue@os.amperecomputing.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=guojiufu@linux.ibm.com \
    --cc=hubicka@ucw.cz \
    --cc=linkw@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    --cc=segher@kernel.crashing.org \
    --cc=wschmidt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).