Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <rguenther@suse.de>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
Date: Thu, 9 Sep 2021 14:40:28 +0800	[thread overview]
Message-ID: <f0d27d79-56b2-d302-f1e5-33c2c4019657@linux.ibm.com> (raw)
In-Reply-To: <nycvar.YFH.7.76.2109021226110.11781@zhemvz.fhfr.qr>



On 2021/9/2 18:37, Richard Biener wrote:
> On Thu, 2 Sep 2021, Xionghu Luo wrote:
> 
>>
>>
>> On 2021/9/2 16:50, Richard Biener wrote:
>>> On Thu, 2 Sep 2021, Richard Biener wrote:
>>>
>>>> On Thu, 2 Sep 2021, Xionghu Luo wrote:
>>>>
>>>>>
>>>>>
>>>>> On 2021/9/1 17:58, Richard Biener wrote:
>>>>>> This fixes the CFG walk order of fill_always_executed_in to use
>>>>>> RPO oder rather than the dominator based order computed by
>>>>>> get_loop_body_in_dom_order.  That fixes correctness issues with
>>>>>> unordered dominator children.
>>>>>>
>>>>>> The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
>>>>>> its for-iteration mode is a good match for the algorithm.
>>>>>>
>>>>>> Xionghu, I've tried to only fix the CFG walk order issue and not
>>>>>> change anything else with this so we have a more correct base
>>>>>> to work against.  The code still walks inner loop bodies
>>>>>> up to loop depth times and thus is quadratic in the loop depth.
>>>>>>
>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
>>>>>> have any comments I plan to push this and then revisit what we
>>>>>> were circling around.
>>>>>
>>>>> LGTM, thanks.
>>>>
>>>> I pushed it, thought again in the attempt to build a testcase and
>>>> concluded I was wrong with the appearant mishandling of
>>>> contains_call - get_loop_body_in_dom_order seems to be exactly
>>>> correct for this specific case.  So I reverted the commit again.
>>>
>>> And I figured what the
>>>
>>>                 /* In a loop that is always entered we may proceed anyway.
>>>                    But record that we entered it and stop once we leave it.
>>> */
>>>
>>> comment was about.  The code was present before the fix for PR78185
>>> and it was supposed to catch the case where the entered inner loop
>>> is not finite.  Just as the testcase from PR78185 shows the
>>> stopping was done too late when the exit block was already marked
>>> as to be always executed.  A simpler fix for PR78185 would have been
>>> to move
>>>
>>>             if (!flow_bb_inside_loop_p (inn_loop, bb))
>>>               break;
>>>
>>> before setting of last = bb.  In fact the installed fix was more
>>> pessimistic than that given it terminated already when entering
>>> a possibly infinite loop.  So we can improve that by doing
>>> sth like which should also improve the situation for some of
>>> the cases you were looking at?
>>>
>>> What remains is that we continue to stop when entering a
>>> not always executed loop:
>>>
>>>             if (bb->loop_father->header == bb)
>>>               {
>>>                 if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>                   break;
>>
>> Yes.  This will cause blocks after inner loop missed to be check
>> if they are actually ALWAYS_EXECUTED.   I am afraid O(N^2) is
>> inevitable here...
> 
> Yes.  What we can try is pre-computing whether a loop
> has a call or an inner loop that might not terminate and then
> when that's true for the loop to be entered continue to break;
> but when not, skip processing that loop blocks (but we still
> fill the blocks array, and we do need to do this in the order
> for the loop we're processing ...).
> 
> So what I was thinking was to somehow embed the dominator
> walk of get_loop_body_in_dom_order and instead of pre-recording
> the above info (call, infinite loop) for loops, pre-record
> it on the dominator tree so that we can ask "in any of our
> dominator children, is there a call or an infinite loop" and
> thus cut the dominator walk at loop header blocks that are
> not dominating the outer loop latch ...
> 
> Of course the simplistic solution might be to simply do
> 
>     if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)
>         && ((loop_depth (bb->loop_father) - loop_depth (loop))
>             > param_max_lim_loop_depth_lookahead)))
>       break;
> 
> and thus limit the processing of conditionally executed inner
> loops by relative depth ... as you say the actual processing
> is unlikely to be the bottleneck for the degenerate cases
> of a very deep nest of conditionally executed loops.
> 
> But still for this case get_loop_body_in_dom_order is
> doing quadratic processing so we can also say that
> another linear walk over the produced array does not
> increase complexity.> 
>>>
>>> volatile int flag, bar;
>>> double foo (double *valp)
>>> {
>>>     double sum = 0;
>>>     for (int i = 0; i < 256; ++i)
>>>       {
>>>         for (int j = 0; j < 256; ++j)
>>>           bar = flag;
>>>         if (flag)
>>>           sum += 1.;
>>>         sum += *valp;
>>>       }
>>>     return sum;
>>> }
>>
>> The patch still fails to handle cases like this:
>>
>>
>> struct X { int i; int j; int k;};
>> volatile int m;
>>
>> void bar (struct X *x, int n, int l, int k)
>> {
>>    for (int i = 0; i < l; i++)
>>      {
>>       if (k)
>>          for (int j = 0; j < l; j++)
>>            {
>>              if (m)
>>                x->i = m;
>>              else
>>                x->i = 1 - m;
>>
>>              int *r = &x->k;
>>              int tem2 = *r;
>>              x->k += tem2 * j;
>>          }
>>
>>        x->i = m;
>>      }
>> }
>>
>> x->i is still not marked ALWAYS_EXECUTED for outer loop.
> 

Collected data when build gcc stage1 and bootstrap.  There are still
about 9% bbs are missed to be marked with ALWAYS_EXECUTED.  Execution time
of fill_always_executed_in_1 is increased obvious in stage1 but it is in
mini-second level while bootstrap build doesn't show big time change.

base vs. incoming_edge_counter_patch:

1. For stage1: 2214/2216 functions handled in both lim2 and lim4 with:
     -  bbs find:  base vs patched  "always executed":  17666  vs. 19278.
     - total cost time of fill_always_executed_in_1 (usec):  30574 vs. 51238.
2. bootstrap: 33910/33951 functions handled in both lim2 and lim4 with:
     - bbs find:  base vs patched  "always executed":  201171 vs. 222942.
     - total cost time of fill_always_executed_in_1 (usec):  987054  vs. 1061821.5

Do we still have plan to replace the get_loop_body_in_dom_order with
linear search or just "limit the processing of conditionally executed inner
loops by relative depth"? ;)



ALWAYS_EXECUTED_IN is not computed completely for nested loops.
Current design will exit if an inner loop doesn't dominate outer
loop's latch or exit after exiting from inner loop, which
caused early return from outer loop, then ALWAYS EXECUTED blocks after
inner loops are skipped.  Another problem is it has to call
get_loop_body_in_dom_order in each recursive call which is also
inefficient.

This patch does greedy visiting of the loop header successors,
processing further blocks if all entries (ignoring backedges) are
processed, setting SET_ALWAYS_EXECUTED_IN if dominates loop's latch.
When the worklist is empty proceed to inner loops.  For bookkeeping
simply keep a to-visit-incoming-edges counter per BB, and pass it
through to inner loops.

Details discusions:
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html

gcc/ChangeLog:

	* tree-ssa-loop-im.c (fill_always_executed_in_1): Remove
	inn_loop check.  Greedy visit loop header successors and update
	edge counts.
	(fill_always_executed_in): Compute bb_incoming_edges and pass
	down.
	(loop_invariant_motion_in_fun): Compute dfs back edge.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
---
  gcc/tree-ssa-loop-im.c                     | 113 +++++++++++----------
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  31 ++++++
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  30 ++++++
  3 files changed, 122 insertions(+), 52 deletions(-)
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b0582c047a1..e9ce18f822f 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3192,79 +3192,65 @@ do_store_motion (void)
     blocks that contain a nonpure call.  */
  
  static void
-fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
+fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi)
  {
-  basic_block bb = NULL, *bbs, last = NULL;
-  unsigned i;
+  basic_block bb = NULL;
    edge e;
-  class loop *inn_loop = loop;
  
    if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
      {
-      bbs = get_loop_body_in_dom_order (loop);
+      auto_bitmap work_set;
+      bitmap_set_bit (work_set, loop->header->index);
+      unsigned bb_index;
  
-      for (i = 0; i < loop->num_nodes; i++)
-	{
-	  edge_iterator ei;
-	  bb = bbs[i];
+      unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+      int *bbd = XCNEWVEC (int, array_size);
+      bbd = XDUPVEC (int, bbi, array_size);
  
-	  if (!flow_bb_inside_loop_p (inn_loop, bb))
+      while (!bitmap_empty_p (work_set))
+	{
+	  bb_index = bitmap_first_set_bit (work_set);
+	  bitmap_clear_bit (work_set, bb_index);
+	  bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
+	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
  	    {
-	      /* When we are leaving a possibly infinite inner loop
-		 we have to stop processing.  */
-	      if (!finite_loop_p (inn_loop))
-		break;
-	      /* If the loop was finite we can continue with processing
-		 the loop we exited to.  */
-	      inn_loop = bb->loop_father;
+	      SET_ALWAYS_EXECUTED_IN (bb, loop);
+	      if (dump_enabled_p ())
+		dump_printf (MSG_NOTE, "BB %d is always executed in loop %d\n",
+		    bb->index, loop->num);
  	    }
-
-	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-	    last = bb;
-
  	  if (bitmap_bit_p (contains_call, bb->index))
  	    break;
  
-	  /* If LOOP exits from this BB stop processing.  */
-	  FOR_EACH_EDGE (e, ei, bb->succs)
-	    if (!flow_bb_inside_loop_p (loop, e->dest))
-	      break;
-	  if (e)
-	    break;
-
  	  /* A loop might be infinite (TODO use simple loop analysis
  	     to disprove this if possible).  */
  	  if (bb->flags & BB_IRREDUCIBLE_LOOP)
  	    break;
  
-	  if (bb->loop_father->header == bb)
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, bb->succs)
  	    {
-	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
+	      if (e->dest == loop->latch)
+		continue;
+	      /* If there is an exit from this BB.  */
+	      if (!flow_bb_inside_loop_p (loop, e->dest))
+		break;
+	      /* Or we enter a possibly non-finite loop.  */
+	      if (flow_loop_nested_p (bb->loop_father,
+				      e->dest->loop_father)
+		  && ! finite_loop_p (e->dest->loop_father))
  		break;
  
-	      /* In a loop that is always entered we may proceed anyway.
-		 But record that we entered it and stop once we leave it
-		 since it might not be finite.  */
-	      inn_loop = bb->loop_father;
+	      bbd[e->dest->index]--;
+	      if (bbd[e->dest->index] == 0)
+		bitmap_set_bit (work_set, e->dest->index);
  	    }
-	}
-
-      while (1)
-	{
-	  if (dump_enabled_p ())
-	    dump_printf (MSG_NOTE, "BB %d is always executed in loop %d\n",
-			 last->index, loop->num);
-	  SET_ALWAYS_EXECUTED_IN (last, loop);
-	  if (last == loop->header)
+	  if (e)
  	    break;
-	  last = get_immediate_dominator (CDI_DOMINATORS, last);
  	}
-
-      free (bbs);
+      free (bbd);
      }
-
-  for (loop = loop->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
  }
  
  /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
@@ -3275,12 +3261,22 @@ static void
  fill_always_executed_in (void)
  {
    basic_block bb;
-  class loop *loop;
+  edge_iterator ei;
+  edge e;
+
+  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+  int *bb_incoming_edges = XCNEWVEC (int, array_size);
  
    auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
    bitmap_clear (contains_call);
    FOR_EACH_BB_FN (bb, cfun)
      {
+      int len = bb->preds->length ();
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (e->flags & EDGE_DFS_BACK)
+	  len--;
+      bb_incoming_edges[bb->index] = len;
+
        gimple_stmt_iterator gsi;
        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
  	{
@@ -3292,8 +3288,19 @@ fill_always_executed_in (void)
  	bitmap_set_bit (contains_call, bb->index);
      }
  
-  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+
+  for (auto loop : loops_list (cfun, 0))
+    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
  }
  
  
@@ -3396,6 +3403,8 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
    /* Gathers information about memory accesses in the loops.  */
    analyze_memory_references (store_motion);
  
+  mark_dfs_back_edges ();
+
    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
    fill_always_executed_in ();
  
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..3d5fe6f0314
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,31 @@
+/* PR/101293 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+struct X { int i; int j; int k;};
+
+void foo(struct X *x, int n, int l, int m)
+{
+  for (int j = 0; j < l; j++)
+    {
+      for (int i = 0; i < n; ++i)
+	{
+	  if (m)
+	    x->j++;
+	  else
+	    x->j = m+n+l;
+
+	  int *p = &x->j;
+	  int tem = *p;
+	  x->j += tem * i;
+	}
+      int *r = &x->k;
+      int tem2 = *r;
+      x->k += tem2 * j;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..6e180de528e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,30 @@
+/* PR/101293 */
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q)
+{
+  while (--n)
+    {
+      do
+	{
+	  *q += 1;
+	  if (f)
+	    goto out;
+	}
+      while (--m);
+      *p += 1;
+    }
+out:;
+}
+
+int main()
+{
+    int i = 0;
+    foo (10, 10, 1, (void *) 0, &i);
+    if (i != 1)
+      __builtin_abort ();
+    return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */
-- 
2.27.0.90.geebb51ba8c

next prev parent reply	other threads:[~2021-09-09  6:40 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-01  9:58 Richard Biener
2021-09-02  5:09 ` Xionghu Luo
2021-09-02  8:00   ` Richard Biener
2021-09-02  8:50     ` Richard Biener
2021-09-02 10:00       ` Xionghu Luo
2021-09-02 10:37         ` Richard Biener
2021-09-09  6:40           ` Xionghu Luo [this message]
2021-09-09  9:16             ` Richard Biener
2021-09-09 10:55               ` Richard Biener
2021-09-10 13:54                 ` Xionghu Luo
2021-09-13  2:28                   ` Xionghu Luo
2021-09-13  8:17                     ` Richard Biener
2021-09-14  1:15                       ` Xionghu Luo
2021-09-14  7:30                         ` Richard Biener
2021-09-13  6:55                   ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f0d27d79-56b2-d302-f1e5-33c2c4019657@linux.ibm.com \
    --to=luoxhu@linux.ibm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=rguenther@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).