From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <rguenther@suse.de>
Cc: gcc-patches@gcc.gnu.org, segher@kernel.crashing.org,
	dje.gcc@gmail.com, wschmidt@linux.ibm.com,
	guojiufu@linux.ibm.com, linkw@gcc.gnu.org
Subject: Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1
Date: Thu, 26 Aug 2021 13:50:23 +0800	[thread overview]
Message-ID: <1f880a67-50c6-665f-5633-0505a2388c30@linux.ibm.com> (raw)
In-Reply-To: <nycvar.YFH.7.76.2108241009380.11781@zhemvz.fhfr.qr>



On 2021/8/24 16:20, Richard Biener wrote:
> On Tue, 24 Aug 2021, Xionghu Luo wrote:
> 
>>
>>
>> On 2021/8/19 20:11, Richard Biener wrote:
>>>> -  class loop *inn_loop = loop;
>>>>    
>>>>      if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
>>>>        {
>>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>>>>    	     to disprove this if possible).  */
>>>>    	  if (bb->flags & BB_IRREDUCIBLE_LOOP)
>>>>    	    break;
>>>> -
>>>> -	  if (!flow_bb_inside_loop_p (inn_loop, bb))
>>>> -	    break;
>>>> -
>>>> -	  if (bb->loop_father->header == bb)
>>>> -	    {
>>>> -	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>> -		break;
>>>> -
>>>> -	      /* In a loop that is always entered we may proceed anyway.
>>>> -		 But record that we entered it and stop once we leave it.  */
>>>> -	      inn_loop = bb->loop_father;
>>>> -	    }
>>>>    	}
>>>>    
>>>>          while (1)
>>> I'm not sure this will work correctly (I'm not sure how the existing
>>> code makes it so either...).  That said, I can't poke any hole
>>> into the change.  What I see is that definitely
>>>
>>>             if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>               last = bb;
>>>
>>>             if (bitmap_bit_p (contains_call, bb->index))
>>>               break;
>>>
>>> doesn't work reliably since the DOM ordering will process blocks
>>> A B and C in random order for
>>>
>>>     for (;;)
>>>      {
>>>         if (cond)
>>>           {
>>>             A: foo ();
>>>           }
>>>         else B:;
>>>         C:;
>>>      }
>>>
>>> and thus we can end up setting 'last' to C _before_ processing
>>> 'A' and thus arriving at the call foo () ...
>>>
>>> get_loop_body_in_dom_order does some "special sauce" but not
>>> to address the above problem - but it might be that a subtle
>>> issue like the above is the reason for the inner loop handling.
>>> The inner loop block order does _not_ adhere to this "special sauce",
>>> that is - the "Additionally, if a basic block s dominates
>>> the latch, then only blocks dominated by s are be after it."
>>> guarantee holds for the outer loop latch, not for the inner.
>>>
>>> Digging into the history of fill_always_executed_in_1 doesn't
>>> reveal anything - the inner loop handling has been present
>>> since introduction by Zdenek - but usually Zdenek has a reason
>>> for doing things as he does;)
>>
>> Yes, this is really complicated usage, thanks for pointing it out. :)
>> I constructed two cases to verify this, with the inner loop containing "if A; else B; C".
>> I found that fill_sons_in_loop in get_loop_body_in_dom_order also checks
>> whether the bb dominates the outer loop's latch: if C dominates the outer loop's latch,
>> C is postponed, the access order is ABC, and 'last' won't be set to C if A or B contains a call;
> 
> But it depends on the order of visiting ABC and that's hard to put into
> a testcase since it depends on the order of edges and the processing
> of the dominance computation.  ABC are simply unordered with respect
> to a dominator walk.
> 
>> Otherwise, if C doesn't dominate the outer loop's latch in fill_sons_in_loop,
>> the access order is CAB, but 'last' also won't be updated to C in fill_always_executed_in_1
>> since there is a dominance check there too; then if A or B contains a call, it can break
>> out successfully.
>>
>> C won't be set to ALWAYS EXECUTED in either circumstance.
>>
>>>
>>> Note it might be simply a measure against quadratic complexity,
>>> esp. since with your patch we also dive into not always executed
>>> subloops as you remove the
>>>
>>>                 if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>                   break;
>>>
>>> check.  I suggest to evaluate behavior of the patch on a testcase
>>> like
>>>
>>> void foo (int n, int **k)
>>> {
>>>     for (int i = 0; i < n; ++i)
>>>       if (k[0][i])
>>>         for (int j = 0; j < n; ++j)
>>>           if (k[1][j])
>>>             for (int l = 0; l < n; ++l)
>>>               if (k[2][l])
>>>                 ...
>>> }
>>
>> Theoretically the complexity changes from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs),
>> so fill_always_executed_in_1's execution time is supposed to increase from
>> O(n) to O(n^2)?  The time should depend on loop depth and bb counts.  I also drafted a
>> test case with a 73-depth loop function and 25 no-ipa function copies, each timed
>> in lim2 and lim4.  Total execution time of fill_always_executed_in_1
>> increased from 32ms to 58ms, almost doubled but not quadratic?
> 
> It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still
> O(n^2).
> 
>> It seems reasonable that compile time gets longer since most bbs are checked
>> more than once, but that is a MUST to ensure the early break is correct at every loop level...
>> Though the number of loop nodes could be huge, loop depth will never be that large in actual code?
> 
> The "in practice" argument is almost always defeated by automatic
> program generators ;)
> 
>>>   
>>> I suspect you'll see quadratic behavior with your patch.  You
>>> should be at least able to preserve a check like
>>>
>>>             /* Do not process not always executed subloops to avoid
>>>                quadratic behavior.  */
>>>             if (bb->loop_father->header == bb
>>>                 && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>               break;
>>>
>>> which is of course not optimistic for cases like
>>>
>>>     for (..)
>>>      {
>>>        if (cond)
>>>          for (..)
>>>        x = 1; // this is always executed if the inner loop is finite
>>>      }
>>>
>>> but we need to have an eye on the complexity of this function.  I would
>>> have suggested to do greedy visiting of the loop header successors,
>>> processing further blocks if all entries (ignoring backedges) are
>>> processed, setting SET_ALWAYS_EXECUTED_IN.  When the worklist
>>> is empty proceed to inner loops as the current code does.  For
>>> bookkeeping simply keep a to-visit-incoming-edges counter per BB.
>>> Pseudo-code:
>>>
>>>     bitmap_set_bit (worklist, loop-header-bb);
>>>     while (!bitmap_empty_p (worklist))
>>>       {
>>>         bb = pop (worklist);
>>
>> Need to check whether bb dominates the latch before SET_ALWAYS_EXECUTED_IN?
> 
> Ah, sure.
> 
>>>         SET_ALWAYS_EXECUTED_IN (bb, loop);
>>>         if (bitmap_bit_p (contains_call, bb->index))
>>>           continue;
>>>         FOR_EACH_EDGE (e, ei, bb->succs)
>>>           {
>>>             if (!flow_bb_inside_loop_p (loop, e->dest))
>>>               continue;
>>>             if (incoming_count[e->dest->index]-- == 0)
>>>               push (worklist, e->dest);
>>>           }
>>>       }
>>
>> Sorry, I don't fully understand your algorithm.  Should worklist be an
>> auto_vec<basic_block>, which doesn't support bitmap operations?  Is incoming_count
>> the bb's preds count, and why retain it once it is decreased to 0?
> 
> the worklist is an auto_bitmap, pop () would be
> bitmap_first_set_bit/clear_bit.  One could use a vector with the
> caveat of eventually adding duplicate members to the worklist.
> But as said, it's pseudo-code ;)

Thanks a lot!
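
For reference, the pop () idiom I ended up with in the patch below is
roughly the following (just a sketch of how I read your suggestion, not
the exact hunk):

  auto_bitmap work_set;
  bitmap_set_bit (work_set, loop->header->index);
  while (!bitmap_empty_p (work_set))
    {
      unsigned bb_index = bitmap_first_set_bit (work_set);
      bitmap_clear_bit (work_set, bb_index);
      basic_block bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
      /* ... visit bb and push successors whose remaining incoming-edge
	 count drops to zero ...  */
    }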

> 
>>>     
>>>     iterate over inner loops (incoming_count can be retained,
>>>     we just force the inner loop header onto the worklist).
>>
>> Is this the same as:
>>
>>    for (loop = loop->inner; loop; loop = loop->next)
>>      fill_always_executed_in_1 (loop, contains_call)
> 
> Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which
> should iterate over loops in PRE order.

I tried an implementation based on your pseudo-code; it works like the
previous v2 solution, and bbs after inner loops can now be marked
as ALWAYS EXECUTED.  You are really so strong! ;)
Regression tested, passes on P8LE.

Assume,

  A: GCC Base
  B: inner loop check removal
  C: incoming count

Execution time of fill_always_executed_in_1 for the 73-depth loop
(with five function copies) is
(A vs. B vs. C): 32ms vs 58ms vs 38ms.  Much better than O(n^2)?
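
(For completeness, the bound you quoted is the arithmetic series

  n + (n-1) + (n-2) + ... + 1 = n * (n + 1) / 2,

i.e. still O(n^2) in the worst case; the numbers above are measured,
not worst-case, behaviour.)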


Bumped the patch to v3 with TODOs:
 1. The time measurement code will be removed once v3 is OK;
 2. loops_list is not used as my code is not rebased on upstream yet;
 3. Function comments are not updated.



[PATCH v3] Fix incomplete computation in fill_always_executed_in_1


ALWAYS_EXECUTED_IN is not computed completely for nested loops.
The current design exits if an inner loop doesn't dominate the outer
loop's latch, or exits once the walk leaves the inner loop, which
causes an early return from the outer loop; ALWAYS EXECUTED blocks
after inner loops are then skipped.  Another problem is that it has
to call get_loop_body_in_dom_order in each recursive call, which is
also inefficient.

This patch does a greedy visit of the loop header's successors,
processing further blocks once all their entries (ignoring backedges)
are processed, and calling SET_ALWAYS_EXECUTED_IN for blocks that
dominate the loop's latch.  When the worklist is empty it proceeds to
inner loops as the current code does.  For bookkeeping it simply keeps
a to-visit-incoming-edges counter per BB and passes it through the
iteration over inner loops.
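
For example (the same shape as the new ssa-lim-19.c below):

  for (int j = 0; j < l; j++)
    {
      for (int i = 0; i < n; ++i)
	x->j += x->j * i;   /* inner loop */
      x->k += x->k * j;     /* runs on every outer iteration, but the old
			       code stopped at the inner loop and never
			       marked this block ALWAYS_EXECUTED_IN */
    }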

Detailed discussion:
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html

gcc/ChangeLog:

	* tree-ssa-loop-im.c (fill_always_executed_in_1): Remove the
	inn_loop check; walk blocks with a worklist and incoming-edge
	counters instead of get_loop_body_in_dom_order.
	(fill_always_executed_in): Compute per-BB counts of non-back
	incoming edges and pass them down.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
---
 gcc/tree-ssa-loop-im.c                     | 95 +++++++++++++---------
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++
 3 files changed, 111 insertions(+), 38 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b24bc64f2a7..72be6b6bfac 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3192,39 +3192,52 @@ do_store_motion (void)
    blocks that contain a nonpure call.  */
 
 static void
-fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
+fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi)
 {
-  basic_block bb = NULL, *bbs, last = NULL;
-  unsigned i;
+  basic_block bb = NULL;
   edge e;
-  class loop *inn_loop = loop;
 
   if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
     {
-      bbs = get_loop_body_in_dom_order (loop);
+      auto_bitmap work_set;
+      bitmap_clear (work_set);
+      bitmap_set_bit (work_set, loop->header->index);
+      unsigned bb_index;
 
-      for (i = 0; i < loop->num_nodes; i++)
-	{
-	  edge_iterator ei;
-	  bb = bbs[i];
+      unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+      /* Work on a copy so this walk does not clobber the shared counters.  */
+      int *bbd = XDUPVEC (int, bbi, array_size);
 
+      while (!bitmap_empty_p (work_set))
+	{
+	  bb_index = bitmap_first_set_bit (work_set);
+	  bitmap_clear_bit (work_set, bb_index);
+	  bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
 	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-	    last = bb;
-
+	    SET_ALWAYS_EXECUTED_IN (bb, loop);
 	  if (bitmap_bit_p (contains_call, bb->index))
 	    break;
-
+	  edge_iterator ei;
 	  FOR_EACH_EDGE (e, ei, bb->succs)
 	    {
-	      /* If there is an exit from this BB.  */
 	      if (!flow_bb_inside_loop_p (loop, e->dest))
 		break;
+
 	      /* Or we enter a possibly non-finite loop.  */
 	      if (flow_loop_nested_p (bb->loop_father,
 				      e->dest->loop_father)
 		  && ! finite_loop_p (e->dest->loop_father))
 		break;
+
+	      if (bbd[e->dest->index] == 1)
+		{
+		  bitmap_set_bit (work_set, e->dest->index);
+		  bbd[e->dest->index]--;
+		}
+	      else if (bbd[e->dest->index] > 1)
+		bbd[e->dest->index]--;
 	    }
+
 	  if (e)
 	    break;
 
@@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
 	     to disprove this if possible).  */
 	  if (bb->flags & BB_IRREDUCIBLE_LOOP)
 	    break;
-
-	  if (!flow_bb_inside_loop_p (inn_loop, bb))
-	    break;
-
-	  if (bb->loop_father->header == bb)
-	    {
-	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-		break;
-
-	      /* In a loop that is always entered we may proceed anyway.
-		 But record that we entered it and stop once we leave it.  */
-	      inn_loop = bb->loop_father;
-	    }
-	}
-
-      while (1)
-	{
-	  SET_ALWAYS_EXECUTED_IN (last, loop);
-	  if (last == loop->header)
-	    break;
-	  last = get_immediate_dominator (CDI_DOMINATORS, last);
 	}
-
-      free (bbs);
+      free (bbd);
     }
 
   for (loop = loop->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+    fill_always_executed_in_1 (loop, contains_call, bbi);
 }
 
 /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
@@ -3287,8 +3278,36 @@ fill_always_executed_in (void)
 	bitmap_set_bit (contains_call, bb->index);
     }
 
+  mark_dfs_back_edges ();
+  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+  int *bb_incoming_edges = XNEWVEC (int, array_size);
+  memset (bb_incoming_edges, 0, array_size * sizeof (int));
+  edge_iterator ei;
+  edge e;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      int len = bb->preds->length ();
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (e->flags & EDGE_DFS_BACK)
+	  len--;
+      bb_incoming_edges[bb->index] = len;
+    }
+
+  static unsigned long diff = 0;
+  struct timeval tv;
+  gettimeofday (&tv, NULL);
+  unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
+
+  //for (loop : loops_list (cfun, 0))
   for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
+
+  gettimeofday (&tv, NULL);
+  unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
+  diff += end - begin;
+  //printf ("%s,diff:%ld\n", current_function_name (), diff);
+  free (bb_incoming_edges);
 }
 
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..097a5ee4a4b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,24 @@
+/* PR/101293 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+struct X { int i; int j; int k;};
+
+void foo(struct X *x, int n, int l)
+{
+  for (int j = 0; j < l; j++)
+    {
+      for (int i = 0; i < n; ++i)
+	{
+	  int *p = &x->j;
+	  int tem = *p;
+	  x->j += tem * i;
+	}
+      int *r = &x->k;
+      int tem2 = *r;
+      x->k += tem2 * j;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..6e180de528e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,30 @@
+/* PR/101293 */
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q)
+{
+  while (--n)
+    {
+      do
+	{
+	  *q += 1;
+	  if (f)
+	    goto out;
+	}
+      while (--m);
+      *p += 1;
+    }
+out:;
+}
+
+int main()
+{
+    int i = 0;
+    foo (10, 10, 1, (void *) 0, &i);
+    if (i != 1)
+      __builtin_abort ();
+    return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */
-- 
2.27.0.90.geebb51ba8c


