public inbox for gcc-patches@gcc.gnu.org
* [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
@ 2021-09-01  9:58 Richard Biener
  2021-09-02  5:09 ` Xionghu Luo
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2021-09-01  9:58 UTC (permalink / raw)
  To: gcc-patches; +Cc: luoxhu

This fixes the CFG walk order of fill_always_executed_in to use
RPO order rather than the dominator-based order computed by
get_loop_body_in_dom_order.  That fixes correctness issues with
unordered dominator children.

The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
its for-iteration mode is a good match for the algorithm.
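
In sketch form, the driver then becomes (simplified from the patch
below; contains_call is computed exactly as before):

  int *rpo = XNEWVEC (int, n_basic_blocks_for_fn (fun) - NUM_FIXED_BLOCKS);
  auto_bitmap exit_bbs;
  bitmap_set_bit (exit_bbs, EXIT_BLOCK);
  edge entry = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (fun));
  int n = rev_post_order_and_mark_dfs_back_seme (fun, entry, exit_bbs, true,
						 rpo, NULL);

  /* Walk the whole function in RPO and start a walk whenever we run
     into the header of a loop that is not yet marked.  */
  for (int i = 0; i < n; ++i)
    {
      basic_block bb = BASIC_BLOCK_FOR_FN (fun, rpo[i]);
      if (bb->loop_father->header == bb && !ALWAYS_EXECUTED_IN (bb))
	fill_always_executed_in_1 (fun, bb->loop_father, rpo, i, n,
				   contains_call);
    }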

Xionghu, I've tried to only fix the CFG walk order issue and not
change anything else with this so we have a more correct base
to work against.  The code still walks inner loop bodies
up to loop depth times and thus is quadratic in the loop depth.
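
As a hypothetical illustration (not a testcase from the PR): in a nest
like the one below, the blocks of the innermost loop are visited once
for every enclosing loop, so the work grows quadratically with the
nesting depth:

volatile int x;

void deep (int n)
{
  for (int a = 0; a < n; ++a)	/* this walk visits all blocks of the nest */
    for (int b = 0; b < n; ++b)	/* ... and this one visits them again */
      for (int c = 0; c < n; ++c)
	for (int d = 0; d < n; ++d)
	  x = d;
}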

Bootstrapped and tested on x86_64-unknown-linux-gnu.  If you don't
have any comments I plan to push this and then revisit what we
were circling around.

Richard.

2021-09-01  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/102155
	* tree-ssa-loop-im.c (fill_always_executed_in_1): Iterate
	over a part of the RPO array and do not recurse here.
	Dump blocks marked as always executed.
	(fill_always_executed_in): Walk over the RPO array and
	process loops whose header we run into.
	(loop_invariant_motion_in_fun): Compute the first RPO
	using rev_post_order_and_mark_dfs_back_seme in iteration
	order and pass that to fill_always_executed_in.
---
 gcc/tree-ssa-loop-im.c | 136 ++++++++++++++++++++++-------------------
 1 file changed, 73 insertions(+), 63 deletions(-)

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index d9f75d5025e..f3706dcdb8a 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3025,77 +3025,74 @@ do_store_motion (void)
 /* Fills ALWAYS_EXECUTED_IN information for basic blocks of LOOP, i.e.
    for each such basic block bb records the outermost loop for that execution
    of its header implies execution of bb.  CONTAINS_CALL is the bitmap of
-   blocks that contain a nonpure call.  */
+   blocks that contain a nonpure call.  The blocks of LOOP start at index
+   START of the RPO array of size N.  */
 
 static void
-fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
+fill_always_executed_in_1 (function *fun, class loop *loop,
+			   int *rpo, int start, int n, sbitmap contains_call)
 {
-  basic_block bb = NULL, *bbs, last = NULL;
-  unsigned i;
-  edge e;
+  basic_block last = NULL;
   class loop *inn_loop = loop;
 
-  if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
+  for (int i = start; i < n; i++)
     {
-      bbs = get_loop_body_in_dom_order (loop);
-
-      for (i = 0; i < loop->num_nodes; i++)
-	{
-	  edge_iterator ei;
-	  bb = bbs[i];
-
-	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-	    last = bb;
+      basic_block bb = BASIC_BLOCK_FOR_FN (fun, rpo[i]);
+      /* Stop when we iterated over all blocks in this loop.  */
+      if (!flow_bb_inside_loop_p (loop, bb))
+	break;
 
-	  if (bitmap_bit_p (contains_call, bb->index))
-	    break;
+      if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
+	last = bb;
 
-	  FOR_EACH_EDGE (e, ei, bb->succs)
-	    {
-	      /* If there is an exit from this BB.  */
-	      if (!flow_bb_inside_loop_p (loop, e->dest))
-		break;
-	      /* Or we enter a possibly non-finite loop.  */
-	      if (flow_loop_nested_p (bb->loop_father,
-				      e->dest->loop_father)
-		  && ! finite_loop_p (e->dest->loop_father))
-		break;
-	    }
-	  if (e)
-	    break;
+      if (bitmap_bit_p (contains_call, bb->index))
+	break;
 
-	  /* A loop might be infinite (TODO use simple loop analysis
-	     to disprove this if possible).  */
-	  if (bb->flags & BB_IRREDUCIBLE_LOOP)
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  /* If there is an exit from this BB.  */
+	  if (!flow_bb_inside_loop_p (loop, e->dest))
 	    break;
-
-	  if (!flow_bb_inside_loop_p (inn_loop, bb))
+	  /* Or we enter a possibly non-finite loop.  */
+	  if (flow_loop_nested_p (bb->loop_father,
+				  e->dest->loop_father)
+	      && ! finite_loop_p (e->dest->loop_father))
 	    break;
+	}
+      if (e)
+	break;
 
-	  if (bb->loop_father->header == bb)
-	    {
-	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-		break;
+      /* A loop might be infinite (TODO use simple loop analysis
+	 to disprove this if possible).  */
+      if (bb->flags & BB_IRREDUCIBLE_LOOP)
+	break;
 
-	      /* In a loop that is always entered we may proceed anyway.
-		 But record that we entered it and stop once we leave it.  */
-	      inn_loop = bb->loop_father;
-	    }
-	}
+      if (!flow_bb_inside_loop_p (inn_loop, bb))
+	break;
 
-      while (1)
+      if (bb->loop_father->header == bb)
 	{
-	  SET_ALWAYS_EXECUTED_IN (last, loop);
-	  if (last == loop->header)
+	  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
 	    break;
-	  last = get_immediate_dominator (CDI_DOMINATORS, last);
-	}
 
-      free (bbs);
+	  /* In a loop that is always entered we may proceed anyway.
+	     But record that we entered it and stop once we leave it.  */
+	  inn_loop = bb->loop_father;
+	}
     }
 
-  for (loop = loop->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+  while (1)
+    {
+      SET_ALWAYS_EXECUTED_IN (last, loop);
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "bb %d is always executed in loop %d\n",
+		     last->index, loop->num);
+      if (last == loop->header)
+	break;
+      last = get_immediate_dominator (CDI_DOMINATORS, last);
+    }
 }
 
 /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
@@ -3103,14 +3100,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
    of its header implies execution of bb.  */
 
 static void
-fill_always_executed_in (void)
+fill_always_executed_in (function *fun, int *rpo, int n)
 {
   basic_block bb;
-  class loop *loop;
 
-  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
+  auto_sbitmap contains_call (last_basic_block_for_fn (fun));
   bitmap_clear (contains_call);
-  FOR_EACH_BB_FN (bb, cfun)
+  FOR_EACH_BB_FN (bb, fun)
     {
       gimple_stmt_iterator gsi;
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
@@ -3123,8 +3119,18 @@ fill_always_executed_in (void)
 	bitmap_set_bit (contains_call, bb->index);
     }
 
-  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+  /* The RPO order we iterate over is one that visits all blocks of a CFG
+     cycle before leaving it.  That means we can visit a loop once we
+     run into its header and we can skip it if it was determined as always
+     entering when proccessing the containing loop.  */
+  for (int i = 0; i < n; ++i)
+    {
+      basic_block bb = BASIC_BLOCK_FOR_FN (fun, rpo[i]);
+      if (bb->loop_father->header == bb
+	  && !ALWAYS_EXECUTED_IN (bb))
+	fill_always_executed_in_1 (fun, bb->loop_father,
+				   rpo, i, n, contains_call);
+    }
 }
 
 
@@ -3227,23 +3233,27 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
   /* Gathers information about memory accesses in the loops.  */
   analyze_memory_references (store_motion);
 
-  /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
-  fill_always_executed_in ();
+  int *rpo = XNEWVEC (int, n_basic_blocks_for_fn (fun) - NUM_FIXED_BLOCKS);
+  auto_bitmap exit_bbs;
+  bitmap_set_bit (exit_bbs, EXIT_BLOCK);
+  edge entry = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (fun));
+  int n = rev_post_order_and_mark_dfs_back_seme (fun, entry, exit_bbs, true,
+						 rpo, NULL);
 
-  int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
-  int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
+  /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
+  fill_always_executed_in (fun, rpo, n);
 
   /* For each statement determine the outermost loop in that it is
      invariant and cost for computing the invariant.  */
   for (int i = 0; i < n; ++i)
     compute_invariantness (BASIC_BLOCK_FOR_FN (fun, rpo[i]));
+  free (rpo);
 
   /* Execute store motion.  Force the necessary invariants to be moved
      out of the loops as well.  */
   if (store_motion)
     do_store_motion ();
 
-  free (rpo);
   rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
   n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
 
-- 
2.31.1


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-01  9:58 [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk Richard Biener
@ 2021-09-02  5:09 ` Xionghu Luo
  2021-09-02  8:00   ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Xionghu Luo @ 2021-09-02  5:09 UTC (permalink / raw)
  To: Richard Biener, gcc-patches



On 2021/9/1 17:58, Richard Biener wrote:
> This fixes the CFG walk order of fill_always_executed_in to use
> RPO oder rather than the dominator based order computed by
> get_loop_body_in_dom_order.  That fixes correctness issues with
> unordered dominator children.
> 
> The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
> its for-iteration mode is a good match for the algorithm.
> 
> Xionghu, I've tried to only fix the CFG walk order issue and not
> change anything else with this so we have a more correct base
> to work against.  The code still walks inner loop bodies
> up to loop depth times and thus is quadratic in the loop depth.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
> have any comments I plan to push this and then revisit what we
> were circling around.

LGTM, thanks.

> 
> Richard.
> 
> 2021-09-01  Richard Biener  <rguenther@suse.de>
> 
> 	PR tree-optimization/102155
> 	* tree-ssa-loop-im.c (fill_always_executed_in_1): Iterate
> 	over a part of the RPO array and do not recurse here.
> 	Dump blocks marked as always executed.
> 	(fill_always_executed_in): Walk over the RPO array and
> 	process loops whose header we run into.
> 	(loop_invariant_motion_in_fun): Compute the first RPO
> 	using rev_post_order_and_mark_dfs_back_seme in iteration
> 	order and pass that to fill_always_executed_in.
> ---
>   gcc/tree-ssa-loop-im.c | 136 ++++++++++++++++++++++-------------------
>   1 file changed, 73 insertions(+), 63 deletions(-)
> 
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index d9f75d5025e..f3706dcdb8a 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3025,77 +3025,74 @@ do_store_motion (void)
>   /* Fills ALWAYS_EXECUTED_IN information for basic blocks of LOOP, i.e.
>      for each such basic block bb records the outermost loop for that execution
>      of its header implies execution of bb.  CONTAINS_CALL is the bitmap of
> -   blocks that contain a nonpure call.  */
> +   blocks that contain a nonpure call.  The blocks of LOOP start at index
> +   START of the RPO array of size N.  */
>   
>   static void
> -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
> +fill_always_executed_in_1 (function *fun, class loop *loop,
> +			   int *rpo, int start, int n, sbitmap contains_call)
>   {
> -  basic_block bb = NULL, *bbs, last = NULL;
> -  unsigned i;
> -  edge e;
> +  basic_block last = NULL;
>     class loop *inn_loop = loop;
>   
> -  if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
> +  for (int i = start; i < n; i++)
>       {
> -      bbs = get_loop_body_in_dom_order (loop);
> -
> -      for (i = 0; i < loop->num_nodes; i++)
> -	{
> -	  edge_iterator ei;
> -	  bb = bbs[i];
> -
> -	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> -	    last = bb;
> +      basic_block bb = BASIC_BLOCK_FOR_FN (fun, rpo[i]);
> +      /* Stop when we iterated over all blocks in this loop.  */
> +      if (!flow_bb_inside_loop_p (loop, bb))
> +	break;
>   
> -	  if (bitmap_bit_p (contains_call, bb->index))
> -	    break;
> +      if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> +	last = bb;
>   
> -	  FOR_EACH_EDGE (e, ei, bb->succs)
> -	    {
> -	      /* If there is an exit from this BB.  */
> -	      if (!flow_bb_inside_loop_p (loop, e->dest))
> -		break;
> -	      /* Or we enter a possibly non-finite loop.  */
> -	      if (flow_loop_nested_p (bb->loop_father,
> -				      e->dest->loop_father)
> -		  && ! finite_loop_p (e->dest->loop_father))
> -		break;
> -	    }
> -	  if (e)
> -	    break;
> +      if (bitmap_bit_p (contains_call, bb->index))
> +	break;
>   
> -	  /* A loop might be infinite (TODO use simple loop analysis
> -	     to disprove this if possible).  */
> -	  if (bb->flags & BB_IRREDUCIBLE_LOOP)
> +      edge_iterator ei;
> +      edge e;
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +	{
> +	  /* If there is an exit from this BB.  */
> +	  if (!flow_bb_inside_loop_p (loop, e->dest))
>   	    break;
> -
> -	  if (!flow_bb_inside_loop_p (inn_loop, bb))
> +	  /* Or we enter a possibly non-finite loop.  */
> +	  if (flow_loop_nested_p (bb->loop_father,
> +				  e->dest->loop_father)
> +	      && ! finite_loop_p (e->dest->loop_father))
>   	    break;
> +	}
> +      if (e)
> +	break;
>   
> -	  if (bb->loop_father->header == bb)
> -	    {
> -	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> -		break;
> +      /* A loop might be infinite (TODO use simple loop analysis
> +	 to disprove this if possible).  */
> +      if (bb->flags & BB_IRREDUCIBLE_LOOP)
> +	break;
>   
> -	      /* In a loop that is always entered we may proceed anyway.
> -		 But record that we entered it and stop once we leave it.  */
> -	      inn_loop = bb->loop_father;
> -	    }
> -	}
> +      if (!flow_bb_inside_loop_p (inn_loop, bb))
> +	break;
>   
> -      while (1)
> +      if (bb->loop_father->header == bb)
>   	{
> -	  SET_ALWAYS_EXECUTED_IN (last, loop);
> -	  if (last == loop->header)
> +	  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>   	    break;
> -	  last = get_immediate_dominator (CDI_DOMINATORS, last);
> -	}
>   
> -      free (bbs);
> +	  /* In a loop that is always entered we may proceed anyway.
> +	     But record that we entered it and stop once we leave it.  */
> +	  inn_loop = bb->loop_father;
> +	}
>       }
>   
> -  for (loop = loop->inner; loop; loop = loop->next)
> -    fill_always_executed_in_1 (loop, contains_call);
> +  while (1)
> +    {
> +      SET_ALWAYS_EXECUTED_IN (last, loop);
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "bb %d is always executed in loop %d\n",
> +		     last->index, loop->num);
> +      if (last == loop->header)
> +	break;
> +      last = get_immediate_dominator (CDI_DOMINATORS, last);
> +    }
>   }
>   
>   /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
> @@ -3103,14 +3100,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>      of its header implies execution of bb.  */
>   
>   static void
> -fill_always_executed_in (void)
> +fill_always_executed_in (function *fun, int *rpo, int n)
>   {
>     basic_block bb;
> -  class loop *loop;
>   
> -  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
> +  auto_sbitmap contains_call (last_basic_block_for_fn (fun));
>     bitmap_clear (contains_call);
> -  FOR_EACH_BB_FN (bb, cfun)
> +  FOR_EACH_BB_FN (bb, fun)
>       {
>         gimple_stmt_iterator gsi;
>         for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> @@ -3123,8 +3119,18 @@ fill_always_executed_in (void)
>   	bitmap_set_bit (contains_call, bb->index);
>       }
>   
> -  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
> -    fill_always_executed_in_1 (loop, contains_call);
> +  /* The RPO order we iterate over is one that visits all blocks of a CFG
> +     cycle before leaving it.  That means we can visit a loop once we
> +     run into its header and we can skip it if it was determined as always
> +     entering when proccessing the containing loop.  */
> +  for (int i = 0; i < n; ++i)
> +    {
> +      basic_block bb = BASIC_BLOCK_FOR_FN (fun, rpo[i]);
> +      if (bb->loop_father->header == bb
> +	  && !ALWAYS_EXECUTED_IN (bb))
> +	fill_always_executed_in_1 (fun, bb->loop_father,
> +				   rpo, i, n, contains_call);
> +    }
>   }
>   
>   
> @@ -3227,23 +3233,27 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
>     /* Gathers information about memory accesses in the loops.  */
>     analyze_memory_references (store_motion);
>   
> -  /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
> -  fill_always_executed_in ();
> +  int *rpo = XNEWVEC (int, n_basic_blocks_for_fn (fun) - NUM_FIXED_BLOCKS);
> +  auto_bitmap exit_bbs;
> +  bitmap_set_bit (exit_bbs, EXIT_BLOCK);
> +  edge entry = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (fun));
> +  int n = rev_post_order_and_mark_dfs_back_seme (fun, entry, exit_bbs, true,
> +						 rpo, NULL);
>   
> -  int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
> -  int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> +  /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
> +  fill_always_executed_in (fun, rpo, n);
>   
>     /* For each statement determine the outermost loop in that it is
>        invariant and cost for computing the invariant.  */
>     for (int i = 0; i < n; ++i)
>       compute_invariantness (BASIC_BLOCK_FOR_FN (fun, rpo[i]));
> +  free (rpo);
>   
>     /* Execute store motion.  Force the necessary invariants to be moved
>        out of the loops as well.  */
>     if (store_motion)
>       do_store_motion ();
>   
> -  free (rpo);
>     rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>     n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
>   
> 

-- 
Thanks,
Xionghu


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-02  5:09 ` Xionghu Luo
@ 2021-09-02  8:00   ` Richard Biener
  2021-09-02  8:50     ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2021-09-02  8:00 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches

On Thu, 2 Sep 2021, Xionghu Luo wrote:

> 
> 
> On 2021/9/1 17:58, Richard Biener wrote:
> > This fixes the CFG walk order of fill_always_executed_in to use
> > RPO oder rather than the dominator based order computed by
> > get_loop_body_in_dom_order.  That fixes correctness issues with
> > unordered dominator children.
> > 
> > The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
> > its for-iteration mode is a good match for the algorithm.
> > 
> > Xionghu, I've tried to only fix the CFG walk order issue and not
> > change anything else with this so we have a more correct base
> > to work against.  The code still walks inner loop bodies
> > up to loop depth times and thus is quadratic in the loop depth.
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
> > have any comments I plan to push this and then revisit what we
> > were circling around.
> 
> LGTM, thanks.

I pushed it, then thought about it again while trying to build a testcase
and concluded I was wrong about the apparent mishandling of
contains_call - get_loop_body_in_dom_order seems to be exactly
correct for this specific case.  So I reverted the commit again.

Richard.

> > 
> > Richard.
> > 
> > 2021-09-01  Richard Biener  <rguenther@suse.de>
> > 
> >  PR tree-optimization/102155
> >  * tree-ssa-loop-im.c (fill_always_executed_in_1): Iterate
> >  over a part of the RPO array and do not recurse here.
> >  Dump blocks marked as always executed.
> >  (fill_always_executed_in): Walk over the RPO array and
> >  process loops whose header we run into.
> >  (loop_invariant_motion_in_fun): Compute the first RPO
> >  using rev_post_order_and_mark_dfs_back_seme in iteration
> >  order and pass that to fill_always_executed_in.
> > ---
> >   gcc/tree-ssa-loop-im.c | 136 ++++++++++++++++++++++-------------------
> >   1 file changed, 73 insertions(+), 63 deletions(-)
> > 
> > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> > index d9f75d5025e..f3706dcdb8a 100644
> > --- a/gcc/tree-ssa-loop-im.c
> > +++ b/gcc/tree-ssa-loop-im.c
> > @@ -3025,77 +3025,74 @@ do_store_motion (void)
> >   /* Fills ALWAYS_EXECUTED_IN information for basic blocks of LOOP, i.e.
> >      for each such basic block bb records the outermost loop for that
> >      execution
> >      of its header implies execution of bb.  CONTAINS_CALL is the bitmap of
> > -   blocks that contain a nonpure call.  */
> > +   blocks that contain a nonpure call.  The blocks of LOOP start at index
> > +   START of the RPO array of size N.  */
> >   
> >   static void
> > -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
> > +fill_always_executed_in_1 (function *fun, class loop *loop,
> > +			   int *rpo, int start, int n, sbitmap contains_call)
> >   {
> > -  basic_block bb = NULL, *bbs, last = NULL;
> > -  unsigned i;
> > -  edge e;
> > +  basic_block last = NULL;
> >     class loop *inn_loop = loop;
> >   -  if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
> > +  for (int i = start; i < n; i++)
> >       {
> > -      bbs = get_loop_body_in_dom_order (loop);
> > -
> > -      for (i = 0; i < loop->num_nodes; i++)
> > -	{
> > -	  edge_iterator ei;
> > -	  bb = bbs[i];
> > -
> > -	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> > -	    last = bb;
> > +      basic_block bb = BASIC_BLOCK_FOR_FN (fun, rpo[i]);
> > +      /* Stop when we iterated over all blocks in this loop.  */
> > +      if (!flow_bb_inside_loop_p (loop, bb))
> > +	break;
> >   -	  if (bitmap_bit_p (contains_call, bb->index))
> > -	    break;
> > +      if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> > +	last = bb;
> >   -	  FOR_EACH_EDGE (e, ei, bb->succs)
> > -	    {
> > -	      /* If there is an exit from this BB.  */
> > -	      if (!flow_bb_inside_loop_p (loop, e->dest))
> > -		break;
> > -	      /* Or we enter a possibly non-finite loop.  */
> > -	      if (flow_loop_nested_p (bb->loop_father,
> > -				      e->dest->loop_father)
> > -		  && ! finite_loop_p (e->dest->loop_father))
> > -		break;
> > -	    }
> > -	  if (e)
> > -	    break;
> > +      if (bitmap_bit_p (contains_call, bb->index))
> > +	break;
> >   -	  /* A loop might be infinite (TODO use simple loop analysis
> > -	     to disprove this if possible).  */
> > -	  if (bb->flags & BB_IRREDUCIBLE_LOOP)
> > +      edge_iterator ei;
> > +      edge e;
> > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > +	{
> > +	  /* If there is an exit from this BB.  */
> > +	  if (!flow_bb_inside_loop_p (loop, e->dest))
> >        break;
> > -
> > -	  if (!flow_bb_inside_loop_p (inn_loop, bb))
> > +	  /* Or we enter a possibly non-finite loop.  */
> > +	  if (flow_loop_nested_p (bb->loop_father,
> > +				  e->dest->loop_father)
> > +	      && ! finite_loop_p (e->dest->loop_father))
> >   	    break;
> > +	}
> > +      if (e)
> > +	break;
> >   -	  if (bb->loop_father->header == bb)
> > -	    {
> > -	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> > -		break;
> > +      /* A loop might be infinite (TODO use simple loop analysis
> > +	 to disprove this if possible).  */
> > +      if (bb->flags & BB_IRREDUCIBLE_LOOP)
> > +	break;
> >   -	      /* In a loop that is always entered we may proceed anyway.
> > -		 But record that we entered it and stop once we leave it.  */
> > -	      inn_loop = bb->loop_father;
> > -	    }
> > -	}
> > +      if (!flow_bb_inside_loop_p (inn_loop, bb))
> > +	break;
> >   -      while (1)
> > +      if (bb->loop_father->header == bb)
> >   	{
> > -	  SET_ALWAYS_EXECUTED_IN (last, loop);
> > -	  if (last == loop->header)
> > +	  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >        break;
> > -	  last = get_immediate_dominator (CDI_DOMINATORS, last);
> > -	}
> >   -      free (bbs);
> > +	  /* In a loop that is always entered we may proceed anyway.
> > +	     But record that we entered it and stop once we leave it.  */
> > +	  inn_loop = bb->loop_father;
> > +	}
> >       }
> >   -  for (loop = loop->inner; loop; loop = loop->next)
> > -    fill_always_executed_in_1 (loop, contains_call);
> > +  while (1)
> > +    {
> > +      SET_ALWAYS_EXECUTED_IN (last, loop);
> > +      if (dump_enabled_p ())
> > +	dump_printf (MSG_NOTE, "bb %d is always executed in loop %d\n",
> > +		     last->index, loop->num);
> > +      if (last == loop->header)
> > +	break;
> > +      last = get_immediate_dominator (CDI_DOMINATORS, last);
> > +    }
> >   }
> >   
> >   /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
> > @@ -3103,14 +3100,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap
> > @@ contains_call)
> >      of its header implies execution of bb.  */
> >   
> >   static void
> > -fill_always_executed_in (void)
> > +fill_always_executed_in (function *fun, int *rpo, int n)
> >   {
> >     basic_block bb;
> > -  class loop *loop;
> >   -  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
> > +  auto_sbitmap contains_call (last_basic_block_for_fn (fun));
> >     bitmap_clear (contains_call);
> > -  FOR_EACH_BB_FN (bb, cfun)
> > +  FOR_EACH_BB_FN (bb, fun)
> >       {
> >         gimple_stmt_iterator gsi;
> >         for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > @@ -3123,8 +3119,18 @@ fill_always_executed_in (void)
> >    bitmap_set_bit (contains_call, bb->index);
> >       }
> >   -  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
> > -    fill_always_executed_in_1 (loop, contains_call);
> > +  /* The RPO order we iterate over is one that visits all blocks of a CFG
> > +     cycle before leaving it.  That means we can visit a loop once we
> > +     run into its header and we can skip it if it was determined as always
> > +     entering when proccessing the containing loop.  */
> > +  for (int i = 0; i < n; ++i)
> > +    {
> > +      basic_block bb = BASIC_BLOCK_FOR_FN (fun, rpo[i]);
> > +      if (bb->loop_father->header == bb
> > +	  && !ALWAYS_EXECUTED_IN (bb))
> > +	fill_always_executed_in_1 (fun, bb->loop_father,
> > +				   rpo, i, n, contains_call);
> > +    }
> >   }
> >   
> >   
> > @@ -3227,23 +3233,27 @@ loop_invariant_motion_in_fun (function *fun, bool
> > @@ store_motion)
> >     /* Gathers information about memory accesses in the loops.  */
> >     analyze_memory_references (store_motion);
> >   -  /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
> > -  fill_always_executed_in ();
> > +  int *rpo = XNEWVEC (int, n_basic_blocks_for_fn (fun) - NUM_FIXED_BLOCKS);
> > +  auto_bitmap exit_bbs;
> > +  bitmap_set_bit (exit_bbs, EXIT_BLOCK);
> > +  edge entry = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (fun));
> > +  int n = rev_post_order_and_mark_dfs_back_seme (fun, entry, exit_bbs,
> > true,
> > +						 rpo, NULL);
> >   -  int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
> > -  int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> > +  /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
> > +  fill_always_executed_in (fun, rpo, n);
> >   
> >     /* For each statement determine the outermost loop in that it is
> >        invariant and cost for computing the invariant.  */
> >     for (int i = 0; i < n; ++i)
> >       compute_invariantness (BASIC_BLOCK_FOR_FN (fun, rpo[i]));
> > +  free (rpo);
> >   
> >     /* Execute store motion.  Force the necessary invariants to be moved
> >        out of the loops as well.  */
> >     if (store_motion)
> >       do_store_motion ();
> >   -  free (rpo);
> >     rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
> >     n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> >   
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-02  8:00   ` Richard Biener
@ 2021-09-02  8:50     ` Richard Biener
  2021-09-02 10:00       ` Xionghu Luo
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2021-09-02  8:50 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches

On Thu, 2 Sep 2021, Richard Biener wrote:

> On Thu, 2 Sep 2021, Xionghu Luo wrote:
> 
> > 
> > 
> > On 2021/9/1 17:58, Richard Biener wrote:
> > > This fixes the CFG walk order of fill_always_executed_in to use
> > > RPO oder rather than the dominator based order computed by
> > > get_loop_body_in_dom_order.  That fixes correctness issues with
> > > unordered dominator children.
> > > 
> > > The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
> > > its for-iteration mode is a good match for the algorithm.
> > > 
> > > Xionghu, I've tried to only fix the CFG walk order issue and not
> > > change anything else with this so we have a more correct base
> > > to work against.  The code still walks inner loop bodies
> > > up to loop depth times and thus is quadratic in the loop depth.
> > > 
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
> > > have any comments I plan to push this and then revisit what we
> > > were circling around.
> > 
> > LGTM, thanks.
> 
> I pushed it, thought again in the attempt to build a testcase and
> concluded I was wrong with the appearant mishandling of
> contains_call - get_loop_body_in_dom_order seems to be exactly
> correct for this specific case.  So I reverted the commit again.

And I figured out what the

              /* In a loop that is always entered we may proceed anyway.
                 But record that we entered it and stop once we leave it.  */

comment was about.  The code was present before the fix for PR78185
and it was supposed to catch the case where the entered inner loop
is not finite.  Just as the testcase from PR78185 shows, the
stopping was done too late when the exit block was already marked
as to be always executed.  A simpler fix for PR78185 would have been
to move

          if (!flow_bb_inside_loop_p (inn_loop, bb))
            break;

before setting of last = bb.  In fact the installed fix was more
pessimistic than that given it terminated already when entering
a possibly infinite loop.  So we can improve that by doing
something like the patch below, which should also improve the situation for some of
the cases you were looking at?
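
To illustrate the kind of situation this check is about (a hypothetical
example, not the actual PR78185 testcase): when an always-entered inner
loop might not terminate, the statements after it must not be treated as
always executed in the outer loop, otherwise the possibly trapping load
would be hoisted in front of a loop the program may never leave:

volatile int cond;
double baz (double *valp, int n)
{
  double sum = 0;
  for (int i = 0; i < n; ++i)
    {
      while (cond)	/* always entered, possibly non-finite */
	;
      sum += *valp;	/* only reached if the while loop terminates */
    }
  return sum;
}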

What remains is that we continue to stop when entering a
not always executed loop:

          if (bb->loop_father->header == bb)
            {
              if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
                break;

that I can at this point only explain by possible efficiency
concerns?  Any better idea on that one?

I'm going to test the patch below which improves the situation for

volatile int flag, bar;
double foo (double *valp)
{
  double sum = 0;
  for (int i = 0; i < 256; ++i)
    {
      for (int j = 0; j < 256; ++j)
        bar = flag;
      if (flag)
        sum += 1.;
      sum += *valp;
    }
  return sum;
}

Thanks,
Richard.

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index d9f75d5025e..f0c93d6a882 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3044,23 +3044,27 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
          edge_iterator ei;
          bb = bbs[i];
 
+         if (!flow_bb_inside_loop_p (inn_loop, bb))
+           {
+             /* When we are leaving a possibly infinite inner loop
+                we have to stop processing.  */
+             if (!finite_loop_p (inn_loop))
+               break;
+             /* If the loop was finite we can continue with processing
+                the loop we exited to.  */
+             inn_loop = bb->loop_father;
+           }
+
          if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
            last = bb;
 
          if (bitmap_bit_p (contains_call, bb->index))
            break;
 
+         /* If LOOP exits from this BB stop processing.  */
          FOR_EACH_EDGE (e, ei, bb->succs)
-           {
-             /* If there is an exit from this BB.  */
-             if (!flow_bb_inside_loop_p (loop, e->dest))
-               break;
-             /* Or we enter a possibly non-finite loop.  */
-             if (flow_loop_nested_p (bb->loop_father,
-                                     e->dest->loop_father)
-                 && ! finite_loop_p (e->dest->loop_father))
-               break;
-           }
+           if (!flow_bb_inside_loop_p (loop, e->dest))
+             break;
          if (e)
            break;
 
@@ -3069,16 +3073,14 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
          if (bb->flags & BB_IRREDUCIBLE_LOOP)
            break;
 
-         if (!flow_bb_inside_loop_p (inn_loop, bb))
-           break;
-
          if (bb->loop_father->header == bb)
            {
              if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
                break;
 
              /* In a loop that is always entered we may proceed anyway.
-                But record that we entered it and stop once we leave it.  */
+                But record that we entered it and stop once we leave it
+                since it might not be finite.  */
              inn_loop = bb->loop_father;
            }
        }



* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-02  8:50     ` Richard Biener
@ 2021-09-02 10:00       ` Xionghu Luo
  2021-09-02 10:37         ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Xionghu Luo @ 2021-09-02 10:00 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches



On 2021/9/2 16:50, Richard Biener wrote:
> On Thu, 2 Sep 2021, Richard Biener wrote:
> 
>> On Thu, 2 Sep 2021, Xionghu Luo wrote:
>>
>>>
>>>
>>> On 2021/9/1 17:58, Richard Biener wrote:
>>>> This fixes the CFG walk order of fill_always_executed_in to use
>>>> RPO oder rather than the dominator based order computed by
>>>> get_loop_body_in_dom_order.  That fixes correctness issues with
>>>> unordered dominator children.
>>>>
>>>> The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
>>>> its for-iteration mode is a good match for the algorithm.
>>>>
>>>> Xionghu, I've tried to only fix the CFG walk order issue and not
>>>> change anything else with this so we have a more correct base
>>>> to work against.  The code still walks inner loop bodies
>>>> up to loop depth times and thus is quadratic in the loop depth.
>>>>
>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
>>>> have any comments I plan to push this and then revisit what we
>>>> were circling around.
>>>
>>> LGTM, thanks.
>>
>> I pushed it, thought again in the attempt to build a testcase and
>> concluded I was wrong with the appearant mishandling of
>> contains_call - get_loop_body_in_dom_order seems to be exactly
>> correct for this specific case.  So I reverted the commit again.
> 
> And I figured what the
> 
>                /* In a loop that is always entered we may proceed anyway.
>                   But record that we entered it and stop once we leave it.
> */
> 
> comment was about.  The code was present before the fix for PR78185
> and it was supposed to catch the case where the entered inner loop
> is not finite.  Just as the testcase from PR78185 shows the
> stopping was done too late when the exit block was already marked
> as to be always executed.  A simpler fix for PR78185 would have been
> to move
> 
>            if (!flow_bb_inside_loop_p (inn_loop, bb))
>              break;
> 
> before setting of last = bb.  In fact the installed fix was more
> pessimistic than that given it terminated already when entering
> a possibly infinite loop.  So we can improve that by doing
> sth like which should also improve the situation for some of
> the cases you were looking at?
> 
> What remains is that we continue to stop when entering a
> not always executed loop:
> 
>            if (bb->loop_father->header == bb)
>              {
>                if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>                  break;

Yes.  This will cause blocks after the inner loop to be missed when checking
whether they are actually ALWAYS_EXECUTED.  I am afraid O(N^2) is
inevitable here...

> 
> that I can at this point only explain by possible efficiency
> concerns?  Any better idea on that one?
From experiments, breaking out of the inner loop early does not seem to
cost less time than a full inner-loop walk.  I will take more precise
measurements with a larger data set on the function fill_always_executed_in_1
if necessary.

My previous v2 patch also tried to update inn_loop level by level
when exiting from inner loops, but that proved to be unnecessary;
however, you were worried about the dominance order produced by
get_loop_body_in_dom_order.

> 
> I'm going to test the patch below which improves the situation for
> 
> volatile int flag, bar;
> double foo (double *valp)
> {
>    double sum = 0;
>    for (int i = 0; i < 256; ++i)
>      {
>        for (int j = 0; j < 256; ++j)
>          bar = flag;
>        if (flag)
>          sum += 1.;
>        sum += *valp;
>      }
>    return sum;
> }

The patch still fails to handle cases like this:


struct X { int i; int j; int k;};
volatile int m;

void bar (struct X *x, int n, int l, int k)
{
  for (int i = 0; i < l; i++)
    {
     if (k)
        for (int j = 0; j < l; j++)
          {
            if (m)
              x->i = m;
            else
              x->i = 1 - m;

            int *r = &x->k;
            int tem2 = *r;
            x->k += tem2 * j;
        }

      x->i = m;
    }
}

x->i is still not marked ALWAYS_EXECUTED for the outer loop.


> 
> Thanks,
> Richard.
> 
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index d9f75d5025e..f0c93d6a882 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3044,23 +3044,27 @@ fill_always_executed_in_1 (class loop *loop,
> sbitmap contains_call)
>            edge_iterator ei;
>            bb = bbs[i];
>   
> +         if (!flow_bb_inside_loop_p (inn_loop, bb))
> +           {
> +             /* When we are leaving a possibly infinite inner loop
> +                we have to stop processing.  */
> +             if (!finite_loop_p (inn_loop))
> +               break;
> +             /* If the loop was finite we can continue with processing
> +                the loop we exited to.  */
> +             inn_loop = bb->loop_father;
> +           }
> +
>            if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>              last = bb;
>   
>            if (bitmap_bit_p (contains_call, bb->index))
>              break;
>   
> +         /* If LOOP exits from this BB stop processing.  */
>            FOR_EACH_EDGE (e, ei, bb->succs)
> -           {
> -             /* If there is an exit from this BB.  */
> -             if (!flow_bb_inside_loop_p (loop, e->dest))
> -               break;
> -             /* Or we enter a possibly non-finite loop.  */
> -             if (flow_loop_nested_p (bb->loop_father,
> -                                     e->dest->loop_father)
> -                 && ! finite_loop_p (e->dest->loop_father))
> -               break;
> -           }
> +           if (!flow_bb_inside_loop_p (loop, e->dest))
> +             break;
>            if (e)
>              break;
>   
> @@ -3069,16 +3073,14 @@ fill_always_executed_in_1 (class loop *loop,
> sbitmap contains_call)
>            if (bb->flags & BB_IRREDUCIBLE_LOOP)
>              break;
>   
> -         if (!flow_bb_inside_loop_p (inn_loop, bb))
> -           break;
> -
>            if (bb->loop_father->header == bb)
>              {
>                if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>                  break;
>   
>                /* In a loop that is always entered we may proceed anyway.
> -                But record that we entered it and stop once we leave it.
> */
> +                But record that we entered it and stop once we leave it
> +                since it might not be finite.  */
>                inn_loop = bb->loop_father;
>              }
>          }
> 

-- 
Thanks,
Xionghu


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-02 10:00       ` Xionghu Luo
@ 2021-09-02 10:37         ` Richard Biener
  2021-09-09  6:40           ` Xionghu Luo
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2021-09-02 10:37 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches

On Thu, 2 Sep 2021, Xionghu Luo wrote:

> 
> 
> On 2021/9/2 16:50, Richard Biener wrote:
> > On Thu, 2 Sep 2021, Richard Biener wrote:
> > 
> >> On Thu, 2 Sep 2021, Xionghu Luo wrote:
> >>
> >>>
> >>>
> >>> On 2021/9/1 17:58, Richard Biener wrote:
> >>>> This fixes the CFG walk order of fill_always_executed_in to use
> >>>> RPO oder rather than the dominator based order computed by
> >>>> get_loop_body_in_dom_order.  That fixes correctness issues with
> >>>> unordered dominator children.
> >>>>
> >>>> The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
> >>>> its for-iteration mode is a good match for the algorithm.
> >>>>
> >>>> Xionghu, I've tried to only fix the CFG walk order issue and not
> >>>> change anything else with this so we have a more correct base
> >>>> to work against.  The code still walks inner loop bodies
> >>>> up to loop depth times and thus is quadratic in the loop depth.
> >>>>
> >>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
> >>>> have any comments I plan to push this and then revisit what we
> >>>> were circling around.
> >>>
> >>> LGTM, thanks.
> >>
> >> I pushed it, thought again in the attempt to build a testcase and
> >> concluded I was wrong with the appearant mishandling of
> >> contains_call - get_loop_body_in_dom_order seems to be exactly
> >> correct for this specific case.  So I reverted the commit again.
> > 
> > And I figured what the
> > 
> >                /* In a loop that is always entered we may proceed anyway.
> >                   But record that we entered it and stop once we leave it.
> > */
> > 
> > comment was about.  The code was present before the fix for PR78185
> > and it was supposed to catch the case where the entered inner loop
> > is not finite.  Just as the testcase from PR78185 shows the
> > stopping was done too late when the exit block was already marked
> > as to be always executed.  A simpler fix for PR78185 would have been
> > to move
> > 
> >            if (!flow_bb_inside_loop_p (inn_loop, bb))
> >              break;
> > 
> > before setting of last = bb.  In fact the installed fix was more
> > pessimistic than that given it terminated already when entering
> > a possibly infinite loop.  So we can improve that by doing
> > sth like which should also improve the situation for some of
> > the cases you were looking at?
> > 
> > What remains is that we continue to stop when entering a
> > not always executed loop:
> > 
> >            if (bb->loop_father->header == bb)
> >              {
> >                if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >                  break;
> 
> Yes.  This will cause blocks after inner loop missed to be check
> if they are actually ALWAYS_EXECUTED.   I am afraid O(N^2) is 
> inevitable here...

Yes.  What we can try is pre-computing whether a loop
has a call or an inner loop that might not terminate, and then,
when that's true for the loop to be entered, continue to break;
but when not, skip processing that loop's blocks (but we still
fill the blocks array, and we do need to do this in the order
for the loop we're processing ...).

So what I was thinking was to somehow embed the dominator
walk of get_loop_body_in_dom_order and instead of pre-recording
the above info (call, infinite loop) for loops, pre-record
it on the dominator tree so that we can ask "in any of our
dominator children, is there a call or an infinite loop" and
thus cut the dominator walk at loop header blocks that are
not dominating the outer loop latch ...

Of course the simplistic solution might be to simply do

   if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)
       && ((loop_depth (bb->loop_father) - loop_depth (loop))
           > param_max_lim_loop_depth_lookahead))
     break;

and thus limit the processing of conditionally executed inner
loops by relative depth ... as you say the actual processing
is unlikely to be the bottleneck for the degenerate cases
of a very deep nest of conditionally executed loops.

But still, for this case get_loop_body_in_dom_order is
doing quadratic processing, so we can also say that
another linear walk over the produced array does not
increase complexity.
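
A rough sketch of how the pre-computation idea above could look (an
untested assumption on my side, not part of any posted patch): record
per loop whether it contains a nonpure call or a possibly non-finite
loop, walking the loop tree innermost-first so inner results can be
reused:

/* Hypothetical helper; LOOP_STOPS is sized by number_of_loops (fun) and
   bit LOOP->NUM is set if marking has to stop when entering LOOP.  */
static void
compute_loop_stop_flags (function *fun, sbitmap contains_call,
			 sbitmap loop_stops)
{
  for (auto loop : loops_list (fun, LI_FROM_INNERMOST))
    {
      bool stops = false;
      /* A nonpure call anywhere in the body stops the marking.  */
      basic_block *bbs = get_loop_body (loop);
      for (unsigned i = 0; !stops && i < loop->num_nodes; ++i)
	if (bitmap_bit_p (contains_call, bbs[i]->index))
	  stops = true;
      free (bbs);
      /* So does an inner loop that might not terminate; inner loops
	 were processed already.  */
      for (class loop *inner = loop->inner; !stops && inner;
	   inner = inner->next)
	if (bitmap_bit_p (loop_stops, inner->num) || !finite_loop_p (inner))
	  stops = true;
      if (stops)
	bitmap_set_bit (loop_stops, loop->num);
    }
}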

> > 
> > that I can at this point only explain by possible efficiency
> > concerns?  Any better idea on that one?
> From experiment, early break from inner loop seems not cost shorter
> time than full inner loop walk.  I will take more precise
> measurement and larger data set on the function fill_always_executed_in_1
> if necessary.   
> 
> My previous v2 patch also tried to update inn_loop level by level
> when exiting from inn_loops, but it is proved to be  unnecessary
> but you worried about the dominance order by get_loop_body_in_dom_order.
> 
> > 
> > I'm going to test the patch below which improves the situation for
> > 
> > volatile int flag, bar;
> > double foo (double *valp)
> > {
> >    double sum = 0;
> >    for (int i = 0; i < 256; ++i)
> >      {
> >        for (int j = 0; j < 256; ++j)
> >          bar = flag;
> >        if (flag)
> >          sum += 1.;
> >        sum += *valp;
> >      }
> >    return sum;
> > }
> 
> The patch still fails to handle cases like this:
> 
> 
> struct X { int i; int j; int k;};
> volatile int m;
> 
> void bar (struct X *x, int n, int l, int k)
> {
>   for (int i = 0; i < l; i++)
>     {
>      if (k)
>         for (int j = 0; j < l; j++)
>           {
>             if (m)
>               x->i = m;
>             else
>               x->i = 1 - m;
> 
>             int *r = &x->k;
>             int tem2 = *r;
>             x->k += tem2 * j;
>         }
> 
>       x->i = m;
>     }
> }
> 
> x->i is still not marked ALWAYS_EXECUTED for outer loop.

Yes.

Richard.


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-02 10:37         ` Richard Biener
@ 2021-09-09  6:40           ` Xionghu Luo
  2021-09-09  9:16             ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Xionghu Luo @ 2021-09-09  6:40 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches



On 2021/9/2 18:37, Richard Biener wrote:
> On Thu, 2 Sep 2021, Xionghu Luo wrote:
> 
>>
>>
>> On 2021/9/2 16:50, Richard Biener wrote:
>>> On Thu, 2 Sep 2021, Richard Biener wrote:
>>>
>>>> On Thu, 2 Sep 2021, Xionghu Luo wrote:
>>>>
>>>>>
>>>>>
>>>>> On 2021/9/1 17:58, Richard Biener wrote:
>>>>>> This fixes the CFG walk order of fill_always_executed_in to use
>>>>>> RPO oder rather than the dominator based order computed by
>>>>>> get_loop_body_in_dom_order.  That fixes correctness issues with
>>>>>> unordered dominator children.
>>>>>>
>>>>>> The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
>>>>>> its for-iteration mode is a good match for the algorithm.
>>>>>>
>>>>>> Xionghu, I've tried to only fix the CFG walk order issue and not
>>>>>> change anything else with this so we have a more correct base
>>>>>> to work against.  The code still walks inner loop bodies
>>>>>> up to loop depth times and thus is quadratic in the loop depth.
>>>>>>
>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
>>>>>> have any comments I plan to push this and then revisit what we
>>>>>> were circling around.
>>>>>
>>>>> LGTM, thanks.
>>>>
>>>> I pushed it, thought again in the attempt to build a testcase and
>>>> concluded I was wrong with the appearant mishandling of
>>>> contains_call - get_loop_body_in_dom_order seems to be exactly
>>>> correct for this specific case.  So I reverted the commit again.
>>>
>>> And I figured what the
>>>
>>>                 /* In a loop that is always entered we may proceed anyway.
>>>                    But record that we entered it and stop once we leave it.
>>> */
>>>
>>> comment was about.  The code was present before the fix for PR78185
>>> and it was supposed to catch the case where the entered inner loop
>>> is not finite.  Just as the testcase from PR78185 shows the
>>> stopping was done too late when the exit block was already marked
>>> as to be always executed.  A simpler fix for PR78185 would have been
>>> to move
>>>
>>>             if (!flow_bb_inside_loop_p (inn_loop, bb))
>>>               break;
>>>
>>> before setting of last = bb.  In fact the installed fix was more
>>> pessimistic than that given it terminated already when entering
>>> a possibly infinite loop.  So we can improve that by doing
>>> sth like which should also improve the situation for some of
>>> the cases you were looking at?
>>>
>>> What remains is that we continue to stop when entering a
>>> not always executed loop:
>>>
>>>             if (bb->loop_father->header == bb)
>>>               {
>>>                 if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>                   break;
>>
>> Yes.  This will cause blocks after inner loop missed to be check
>> if they are actually ALWAYS_EXECUTED.   I am afraid O(N^2) is
>> inevitable here...
> 
> Yes.  What we can try is pre-computing whether a loop
> has a call or an inner loop that might not terminate and then
> when that's true for the loop to be entered continue to break;
> but when not, skip processing that loop blocks (but we still
> fill the blocks array, and we do need to do this in the order
> for the loop we're processing ...).
> 
> So what I was thinking was to somehow embed the dominator
> walk of get_loop_body_in_dom_order and instead of pre-recording
> the above info (call, infinite loop) for loops, pre-record
> it on the dominator tree so that we can ask "in any of our
> dominator children, is there a call or an infinite loop" and
> thus cut the dominator walk at loop header blocks that are
> not dominating the outer loop latch ...
> 
> Of course the simplistic solution might be to simply do
> 
>     if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)
>         && ((loop_depth (bb->loop_father) - loop_depth (loop))
>             > param_max_lim_loop_depth_lookahead)))
>       break;
> 
> and thus limit the processing of conditionally executed inner
> loops by relative depth ... as you say the actual processing
> is unlikely to be the bottleneck for the degenerate cases
> of a very deep nest of conditionally executed loops.
> 
> But still for this case get_loop_body_in_dom_order is
> doing quadratic processing so we can also say that
> another linear walk over the produced array does not
> increase complexity.
> 
>>>
>>> volatile int flag, bar;
>>> double foo (double *valp)
>>> {
>>>     double sum = 0;
>>>     for (int i = 0; i < 256; ++i)
>>>       {
>>>         for (int j = 0; j < 256; ++j)
>>>           bar = flag;
>>>         if (flag)
>>>           sum += 1.;
>>>         sum += *valp;
>>>       }
>>>     return sum;
>>> }
>>
>> The patch still fails to handle cases like this:
>>
>>
>> struct X { int i; int j; int k;};
>> volatile int m;
>>
>> void bar (struct X *x, int n, int l, int k)
>> {
>>    for (int i = 0; i < l; i++)
>>      {
>>       if (k)
>>          for (int j = 0; j < l; j++)
>>            {
>>              if (m)
>>                x->i = m;
>>              else
>>                x->i = 1 - m;
>>
>>              int *r = &x->k;
>>              int tem2 = *r;
>>              x->k += tem2 * j;
>>          }
>>
>>        x->i = m;
>>      }
>> }
>>
>> x->i is still not marked ALWAYS_EXECUTED for outer loop.
> 

Collected data when building GCC stage1 and bootstrap.  There are still
about 9% of bbs that are missed and not marked ALWAYS_EXECUTED.  The
execution time of fill_always_executed_in_1 increases noticeably in stage1,
but stays at the millisecond level, while the bootstrap build doesn't show
a big time change.

base vs. incoming_edge_counter_patch:

1. For stage1: 2214/2216 functions handled in both lim2 and lim4 with:
     - bbs found "always executed", base vs. patched:  17666 vs. 19278.
     - total time spent in fill_always_executed_in_1 (usec):  30574 vs. 51238.
2. For bootstrap: 33910/33951 functions handled in both lim2 and lim4 with:
     - bbs found "always executed", base vs. patched:  201171 vs. 222942.
     - total time spent in fill_always_executed_in_1 (usec):  987054 vs. 1061821.5

Do we still plan to replace get_loop_body_in_dom_order with a
linear search, or just "limit the processing of conditionally executed inner
loops by relative depth"? ;)



ALWAYS_EXECUTED_IN is not computed completely for nested loops.
The current design will stop if an inner loop doesn't dominate the outer
loop's latch, or stop after exiting from the inner loop, which
causes an early return from the outer loop's walk; blocks after inner
loops that are in fact always executed are then skipped.  Another
problem is that it has to call get_loop_body_in_dom_order in each
recursive call, which is also inefficient.

This patch greedily visits the loop header's successors,
processing further blocks once all of their entries (ignoring back edges)
have been processed, and calling SET_ALWAYS_EXECUTED_IN for blocks that
dominate the loop's latch.  When the worklist is empty, proceed to inner
loops.  For bookkeeping, simply keep a to-visit-incoming-edges counter
per BB and pass it through to inner loops.

Detailed discussion:
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html

gcc/ChangeLog:

	* tree-ssa-loop-im.c (fill_always_executed_in_1): Remove
	inn_loop check.  Greedy visit loop header successors and update
	edge counts.
	(fill_always_executed_in): Compute bb_incoming_edges and pass
	down.
	(loop_invariant_motion_in_fun): Compute dfs back edge.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
---
  gcc/tree-ssa-loop-im.c                     | 113 +++++++++++----------
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  31 ++++++
  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  30 ++++++
  3 files changed, 122 insertions(+), 52 deletions(-)
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b0582c047a1..e9ce18f822f 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3192,79 +3192,65 @@ do_store_motion (void)
     blocks that contain a nonpure call.  */
  
  static void
-fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
+fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi)
  {
-  basic_block bb = NULL, *bbs, last = NULL;
-  unsigned i;
+  basic_block bb = NULL;
    edge e;
-  class loop *inn_loop = loop;
  
    if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
      {
-      bbs = get_loop_body_in_dom_order (loop);
+      auto_bitmap work_set;
+      bitmap_set_bit (work_set, loop->header->index);
+      unsigned bb_index;
  
-      for (i = 0; i < loop->num_nodes; i++)
-	{
-	  edge_iterator ei;
-	  bb = bbs[i];
+      unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+      int *bbd = XCNEWVEC (int, array_size);
+      bbd = XDUPVEC (int, bbi, array_size);
  
-	  if (!flow_bb_inside_loop_p (inn_loop, bb))
+      while (!bitmap_empty_p (work_set))
+	{
+	  bb_index = bitmap_first_set_bit (work_set);
+	  bitmap_clear_bit (work_set, bb_index);
+	  bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
+	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
  	    {
-	      /* When we are leaving a possibly infinite inner loop
-		 we have to stop processing.  */
-	      if (!finite_loop_p (inn_loop))
-		break;
-	      /* If the loop was finite we can continue with processing
-		 the loop we exited to.  */
-	      inn_loop = bb->loop_father;
+	      SET_ALWAYS_EXECUTED_IN (bb, loop);
+	      if (dump_enabled_p ())
+		dump_printf (MSG_NOTE, "BB %d is always executed in loop %d\n",
+		    bb->index, loop->num);
  	    }
-
-	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-	    last = bb;
-
  	  if (bitmap_bit_p (contains_call, bb->index))
  	    break;
  
-	  /* If LOOP exits from this BB stop processing.  */
-	  FOR_EACH_EDGE (e, ei, bb->succs)
-	    if (!flow_bb_inside_loop_p (loop, e->dest))
-	      break;
-	  if (e)
-	    break;
-
  	  /* A loop might be infinite (TODO use simple loop analysis
  	     to disprove this if possible).  */
  	  if (bb->flags & BB_IRREDUCIBLE_LOOP)
  	    break;
  
-	  if (bb->loop_father->header == bb)
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, bb->succs)
  	    {
-	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
+	      if (e->dest == loop->latch)
+		continue;
+	      /* If there is an exit from this BB.  */
+	      if (!flow_bb_inside_loop_p (loop, e->dest))
+		break;
+	      /* Or we enter a possibly non-finite loop.  */
+	      if (flow_loop_nested_p (bb->loop_father,
+				      e->dest->loop_father)
+		  && ! finite_loop_p (e->dest->loop_father))
  		break;
  
-	      /* In a loop that is always entered we may proceed anyway.
-		 But record that we entered it and stop once we leave it
-		 since it might not be finite.  */
-	      inn_loop = bb->loop_father;
+	      bbd[e->dest->index]--;
+	      if (bbd[e->dest->index] == 0)
+		bitmap_set_bit (work_set, e->dest->index);
  	    }
-	}
-
-      while (1)
-	{
-	  if (dump_enabled_p ())
-	    dump_printf (MSG_NOTE, "BB %d is always executed in loop %d\n",
-			 last->index, loop->num);
-	  SET_ALWAYS_EXECUTED_IN (last, loop);
-	  if (last == loop->header)
+	  if (e)
  	    break;
-	  last = get_immediate_dominator (CDI_DOMINATORS, last);
  	}
-
-      free (bbs);
+      free (bbd);
      }
-
-  for (loop = loop->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
  }
  
  /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
@@ -3275,12 +3261,22 @@ static void
  fill_always_executed_in (void)
  {
    basic_block bb;
-  class loop *loop;
+  edge_iterator ei;
+  edge e;
+
+  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
+  int *bb_incoming_edges = XCNEWVEC (int, array_size);
  
    auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
    bitmap_clear (contains_call);
    FOR_EACH_BB_FN (bb, cfun)
      {
+      int len = bb->preds->length ();
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (e->flags & EDGE_DFS_BACK)
+	  len--;
+      bb_incoming_edges[bb->index] = len;
+
        gimple_stmt_iterator gsi;
        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
  	{
@@ -3292,8 +3288,19 @@ fill_always_executed_in (void)
  	bitmap_set_bit (contains_call, bb->index);
      }
  
-  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
-    fill_always_executed_in_1 (loop, contains_call);
+
+  for (auto loop : loops_list (cfun, 0))
+    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
  }
  
  
@@ -3396,6 +3403,8 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
    /* Gathers information about memory accesses in the loops.  */
    analyze_memory_references (store_motion);
  
+  mark_dfs_back_edges ();
+
    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
    fill_always_executed_in ();
  
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..3d5fe6f0314
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,31 @@
+/* PR/101293 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+struct X { int i; int j; int k;};
+
+void foo(struct X *x, int n, int l, int m)
+{
+  for (int j = 0; j < l; j++)
+    {
+      for (int i = 0; i < n; ++i)
+	{
+	  if (m)
+	    x->j++;
+	  else
+	    x->j = m+n+l;
+
+	  int *p = &x->j;
+	  int tem = *p;
+	  x->j += tem * i;
+	}
+      int *r = &x->k;
+      int tem2 = *r;
+      x->k += tem2 * j;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..6e180de528e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,30 @@
+/* PR/101293 */
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q)
+{
+  while (--n)
+    {
+      do
+	{
+	  *q += 1;
+	  if (f)
+	    goto out;
+	}
+      while (--m);
+      *p += 1;
+    }
+out:;
+}
+
+int main()
+{
+    int i = 0;
+    foo (10, 10, 1, (void *) 0, &i);
+    if (i != 1)
+      __builtin_abort ();
+    return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */
-- 
2.27.0.90.geebb51ba8c


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-09  6:40           ` Xionghu Luo
@ 2021-09-09  9:16             ` Richard Biener
  2021-09-09 10:55               ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2021-09-09  9:16 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches

On Thu, 9 Sep 2021, Xionghu Luo wrote:

> 
> 
> On 2021/9/2 18:37, Richard Biener wrote:
> > On Thu, 2 Sep 2021, Xionghu Luo wrote:
> > 
> >>
> >>
> >> On 2021/9/2 16:50, Richard Biener wrote:
> >>> On Thu, 2 Sep 2021, Richard Biener wrote:
> >>>
> >>>> On Thu, 2 Sep 2021, Xionghu Luo wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> On 2021/9/1 17:58, Richard Biener wrote:
> >>>>>> This fixes the CFG walk order of fill_always_executed_in to use
> >>>>>> RPO oder rather than the dominator based order computed by
> >>>>>> get_loop_body_in_dom_order.  That fixes correctness issues with
> >>>>>> unordered dominator children.
> >>>>>>
> >>>>>> The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
> >>>>>> its for-iteration mode is a good match for the algorithm.
> >>>>>>
> >>>>>> Xionghu, I've tried to only fix the CFG walk order issue and not
> >>>>>> change anything else with this so we have a more correct base
> >>>>>> to work against.  The code still walks inner loop bodies
> >>>>>> up to loop depth times and thus is quadratic in the loop depth.
> >>>>>>
> >>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
> >>>>>> have any comments I plan to push this and then revisit what we
> >>>>>> were circling around.
> >>>>>
> >>>>> LGTM, thanks.
> >>>>
> >>>> I pushed it, thought again in the attempt to build a testcase and
> >>>> concluded I was wrong with the appearant mishandling of
> >>>> contains_call - get_loop_body_in_dom_order seems to be exactly
> >>>> correct for this specific case.  So I reverted the commit again.
> >>>
> >>> And I figured what the
> >>>
> >>>                 /* In a loop that is always entered we may proceed anyway.
> >>>                    But record that we entered it and stop once we leave
> >>> it.
> >>> */
> >>>
> >>> comment was about.  The code was present before the fix for PR78185
> >>> and it was supposed to catch the case where the entered inner loop
> >>> is not finite.  Just as the testcase from PR78185 shows the
> >>> stopping was done too late when the exit block was already marked
> >>> as to be always executed.  A simpler fix for PR78185 would have been
> >>> to move
> >>>
> >>>             if (!flow_bb_inside_loop_p (inn_loop, bb))
> >>>               break;
> >>>
> >>> before setting of last = bb.  In fact the installed fix was more
> >>> pessimistic than that given it terminated already when entering
> >>> a possibly infinite loop.  So we can improve that by doing
> >>> sth like which should also improve the situation for some of
> >>> the cases you were looking at?
> >>>
> >>> What remains is that we continue to stop when entering a
> >>> not always executed loop:
> >>>
> >>>             if (bb->loop_father->header == bb)
> >>>               {
> >>>                 if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>                   break;
> >>
> >> Yes.  This will cause blocks after inner loop missed to be check
> >> if they are actually ALWAYS_EXECUTED.   I am afraid O(N^2) is
> >> inevitable here...
> > 
> > Yes.  What we can try is pre-computing whether a loop
> > has a call or an inner loop that might not terminate and then
> > when that's true for the loop to be entered continue to break;
> > but when not, skip processing that loop blocks (but we still
> > fill the blocks array, and we do need to do this in the order
> > for the loop we're processing ...).
> > 
> > So what I was thinking was to somehow embed the dominator
> > walk of get_loop_body_in_dom_order and instead of pre-recording
> > the above info (call, infinite loop) for loops, pre-record
> > it on the dominator tree so that we can ask "in any of our
> > dominator children, is there a call or an infinite loop" and
> > thus cut the dominator walk at loop header blocks that are
> > not dominating the outer loop latch ...
> > 
> > Of course the simplistic solution might be to simply do
> > 
> >     if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)
> >         && ((loop_depth (bb->loop_father) - loop_depth (loop))
> >             > param_max_lim_loop_depth_lookahead)))
> >       break;
> > 
> > and thus limit the processing of conditionally executed inner
> > loops by relative depth ... as you say the actual processing
> > is unlikely to be the bottleneck for the degenerate cases
> > of a very deep nest of conditionally executed loops.
> > 
> > But still for this case get_loop_body_in_dom_order is
> > doing quadratic processing so we can also say that
> > another linear walk over the produced array does not
> > increase complexity.> 
> >>>
> >>> volatile int flag, bar;
> >>> double foo (double *valp)
> >>> {
> >>>     double sum = 0;
> >>>     for (int i = 0; i < 256; ++i)
> >>>       {
> >>>         for (int j = 0; j < 256; ++j)
> >>>           bar = flag;
> >>>         if (flag)
> >>>           sum += 1.;
> >>>         sum += *valp;
> >>>       }
> >>>     return sum;
> >>> }
> >>
> >> The patch still fails to handle cases like this:
> >>
> >>
> >> struct X { int i; int j; int k;};
> >> volatile int m;
> >>
> >> void bar (struct X *x, int n, int l, int k)
> >> {
> >>    for (int i = 0; i < l; i++)
> >>      {
> >>       if (k)
> >>          for (int j = 0; j < l; j++)
> >>            {
> >>              if (m)
> >>                x->i = m;
> >>              else
> >>                x->i = 1 - m;
> >>
> >>              int *r = &x->k;
> >>              int tem2 = *r;
> >>              x->k += tem2 * j;
> >>          }
> >>
> >>        x->i = m;
> >>      }
> >> }
> >>
> >> x->i is still not marked ALWAYS_EXECUTED for outer loop.
> > 
> 
> Collected data when build gcc stage1 and bootstrap.  There are still
> about 9% bbs are missed to be marked with ALWAYS_EXECUTED.  Execution time
> of fill_always_executed_in_1 is increased obvious in stage1 but it is in
> mini-second level while bootstrap build doesn't show big time change.
> 
> base vs. incoming_edge_counter_patch:
> 
> 1. For stage1: 2214/2216 functions handled in both lim2 and lim4 with:
>     -  bbs find:  base vs patched  "always executed":  17666  vs. 19278.
>     - total cost time of fill_always_executed_in_1 (usec):  30574 vs. 51238.
> 2. bootstrap: 33910/33951 functions handled in both lim2 and lim4 with:
>     - bbs find:  base vs patched  "always executed":  201171 vs. 222942.
>     - total cost time of fill_always_executed_in_1 (usec):  987054  vs.
>     1061821.5
> 
> Do we still have plan to replace the get_loop_body_in_dom_order with
> linear search or just "limit the processing of conditionally executed inner
> loops by relative depth"? ;)

Now that I think I understand how it all works, we're going to stay
with the current dominator based approach.  As yet another
incremental change I have bootstrapped and tested the patch below,
which cuts off some processing time from the current implementation,
making it O(n) as long as we are not walking conditionally
executed subloops.

I don't yet see that the greedy linear walk patch is correct, so
the options going forward to reap all the remaining opportunities would
be to either do limited processing of conditionally executed
inner loops or to process all of them, for example gated by
flag_expensive_optimizations.  Another option is to ignore
the complexity concern given there are other quadraticnesses
in loop depth in LIM (but on other entities).

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

From ae352922abac1d70ffb217349295c23ea35d0c8d Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Thu, 9 Sep 2021 10:52:12 +0200
Subject: [PATCH] Avoid full DOM walk in LIM fill_always_executed_in
To: gcc-patches@gcc.gnu.org

This avoids a full DOM walk via get_loop_body_in_dom_order in the
loop body walk of fill_always_executed_in, which often terminates
the walk of a loop body early, by integrating the DOM walk of
get_loop_body_in_dom_order with the actual processing done by
fill_always_executed_in.  This trades the fully populated loop
body array for a worklist allocation of the same size and thus
should be a strict improvement over the recursive approach of
get_loop_body_in_dom_order.

2021-09-09  Richard Biener  <rguenther@suse.de>

	* tree-ssa-loop-im.c (fill_always_executed_in_1): Integrate
	DOM walk from get_loop_body_in_dom_order using a worklist
	approach.
---
 gcc/tree-ssa-loop-im.c | 39 +++++++++++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 01f3fc2f7f0..5d6845478e7 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3030,19 +3030,19 @@ do_store_motion (void)
 static void
 fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
 {
-  basic_block bb = NULL, *bbs, last = NULL;
-  unsigned i;
+  basic_block bb = NULL, last = NULL;
   edge e;
   class loop *inn_loop = loop;
 
   if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
     {
-      bbs = get_loop_body_in_dom_order (loop);
-
-      for (i = 0; i < loop->num_nodes; i++)
+      auto_vec<basic_block, 64> worklist;
+      worklist.reserve_exact (loop->num_nodes);
+      worklist.quick_push (loop->header);
+      do
 	{
 	  edge_iterator ei;
-	  bb = bbs[i];
+	  bb = worklist.pop ();
 
 	  if (!flow_bb_inside_loop_p (inn_loop, bb))
 	    {
@@ -3083,7 +3083,32 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
 		 since it might not be finite.  */
 	      inn_loop = bb->loop_father;
 	    }
+
+	  /* Walk the body of LOOP sorted by dominance relation.  Additionally,
+	     if a basic block S dominates the latch, then only blocks dominated
+	     by S are after it.
+	     This is get_loop_body_in_dom_order using a worklist algorithm and
+	     stopping once we are no longer interested in visiting further
+	     blocks.  */
+	  unsigned old_len = worklist.length ();
+	  unsigned postpone = 0;
+	  for (basic_block son = first_dom_son (CDI_DOMINATORS, bb);
+	       son;
+	       son = next_dom_son (CDI_DOMINATORS, son))
+	    {
+	      if (!flow_bb_inside_loop_p (loop, son))
+		continue;
+	      if (dominated_by_p (CDI_DOMINATORS, loop->latch, son))
+		postpone = worklist.length ();
+	      worklist.quick_push (son);
+	    }
+	  if (postpone)
+	    /* Postponing the block that dominates the latch means
+	       processing it last and thus putting it earliest in the
+	       worklist.  */
+	    std::swap (worklist[old_len], worklist[postpone]);
 	}
+      while (!worklist.is_empty ());
 
       while (1)
 	{
@@ -3095,8 +3120,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
 	    break;
 	  last = get_immediate_dominator (CDI_DOMINATORS, last);
 	}
-
-      free (bbs);
     }
 
   for (loop = loop->inner; loop; loop = loop->next)
-- 
2.31.1



* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-09  9:16             ` Richard Biener
@ 2021-09-09 10:55               ` Richard Biener
  2021-09-10 13:54                 ` Xionghu Luo
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2021-09-09 10:55 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches

On Thu, 9 Sep 2021, Richard Biener wrote:

> On Thu, 9 Sep 2021, Xionghu Luo wrote:
> 
> > 
> > 
> > On 2021/9/2 18:37, Richard Biener wrote:
> > > On Thu, 2 Sep 2021, Xionghu Luo wrote:
> > > 
> > >>
> > >>
> > >> On 2021/9/2 16:50, Richard Biener wrote:
> > >>> On Thu, 2 Sep 2021, Richard Biener wrote:
> > >>>
> > >>>> On Thu, 2 Sep 2021, Xionghu Luo wrote:
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>> On 2021/9/1 17:58, Richard Biener wrote:
> > >>>>>> This fixes the CFG walk order of fill_always_executed_in to use
> > >>>>>> RPO oder rather than the dominator based order computed by
> > >>>>>> get_loop_body_in_dom_order.  That fixes correctness issues with
> > >>>>>> unordered dominator children.
> > >>>>>>
> > >>>>>> The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
> > >>>>>> its for-iteration mode is a good match for the algorithm.
> > >>>>>>
> > >>>>>> Xionghu, I've tried to only fix the CFG walk order issue and not
> > >>>>>> change anything else with this so we have a more correct base
> > >>>>>> to work against.  The code still walks inner loop bodies
> > >>>>>> up to loop depth times and thus is quadratic in the loop depth.
> > >>>>>>
> > >>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
> > >>>>>> have any comments I plan to push this and then revisit what we
> > >>>>>> were circling around.
> > >>>>>
> > >>>>> LGTM, thanks.
> > >>>>
> > >>>> I pushed it, thought again in the attempt to build a testcase and
> > >>>> concluded I was wrong with the appearant mishandling of
> > >>>> contains_call - get_loop_body_in_dom_order seems to be exactly
> > >>>> correct for this specific case.  So I reverted the commit again.
> > >>>
> > >>> And I figured what the
> > >>>
> > >>>                 /* In a loop that is always entered we may proceed anyway.
> > >>>                    But record that we entered it and stop once we leave
> > >>> it.
> > >>> */
> > >>>
> > >>> comment was about.  The code was present before the fix for PR78185
> > >>> and it was supposed to catch the case where the entered inner loop
> > >>> is not finite.  Just as the testcase from PR78185 shows the
> > >>> stopping was done too late when the exit block was already marked
> > >>> as to be always executed.  A simpler fix for PR78185 would have been
> > >>> to move
> > >>>
> > >>>             if (!flow_bb_inside_loop_p (inn_loop, bb))
> > >>>               break;
> > >>>
> > >>> before setting of last = bb.  In fact the installed fix was more
> > >>> pessimistic than that given it terminated already when entering
> > >>> a possibly infinite loop.  So we can improve that by doing
> > >>> sth like which should also improve the situation for some of
> > >>> the cases you were looking at?
> > >>>
> > >>> What remains is that we continue to stop when entering a
> > >>> not always executed loop:
> > >>>
> > >>>             if (bb->loop_father->header == bb)
> > >>>               {
> > >>>                 if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> > >>>                   break;
> > >>
> > >> Yes.  This will cause blocks after inner loop missed to be check
> > >> if they are actually ALWAYS_EXECUTED.   I am afraid O(N^2) is
> > >> inevitable here...
> > > 
> > > Yes.  What we can try is pre-computing whether a loop
> > > has a call or an inner loop that might not terminate and then
> > > when that's true for the loop to be entered continue to break;
> > > but when not, skip processing that loop blocks (but we still
> > > fill the blocks array, and we do need to do this in the order
> > > for the loop we're processing ...).
> > > 
> > > So what I was thinking was to somehow embed the dominator
> > > walk of get_loop_body_in_dom_order and instead of pre-recording
> > > the above info (call, infinite loop) for loops, pre-record
> > > it on the dominator tree so that we can ask "in any of our
> > > dominator children, is there a call or an infinite loop" and
> > > thus cut the dominator walk at loop header blocks that are
> > > not dominating the outer loop latch ...
> > > 
> > > Of course the simplistic solution might be to simply do
> > > 
> > >     if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)
> > >         && ((loop_depth (bb->loop_father) - loop_depth (loop))
> > >             > param_max_lim_loop_depth_lookahead)))
> > >       break;
> > > 
> > > and thus limit the processing of conditionally executed inner
> > > loops by relative depth ... as you say the actual processing
> > > is unlikely to be the bottleneck for the degenerate cases
> > > of a very deep nest of conditionally executed loops.
> > > 
> > > But still for this case get_loop_body_in_dom_order is
> > > doing quadratic processing so we can also say that
> > > another linear walk over the produced array does not
> > > increase complexity.> 
> > >>>
> > >>> volatile int flag, bar;
> > >>> double foo (double *valp)
> > >>> {
> > >>>     double sum = 0;
> > >>>     for (int i = 0; i < 256; ++i)
> > >>>       {
> > >>>         for (int j = 0; j < 256; ++j)
> > >>>           bar = flag;
> > >>>         if (flag)
> > >>>           sum += 1.;
> > >>>         sum += *valp;
> > >>>       }
> > >>>     return sum;
> > >>> }
> > >>
> > >> The patch still fails to handle cases like this:
> > >>
> > >>
> > >> struct X { int i; int j; int k;};
> > >> volatile int m;
> > >>
> > >> void bar (struct X *x, int n, int l, int k)
> > >> {
> > >>    for (int i = 0; i < l; i++)
> > >>      {
> > >>       if (k)
> > >>          for (int j = 0; j < l; j++)
> > >>            {
> > >>              if (m)
> > >>                x->i = m;
> > >>              else
> > >>                x->i = 1 - m;
> > >>
> > >>              int *r = &x->k;
> > >>              int tem2 = *r;
> > >>              x->k += tem2 * j;
> > >>          }
> > >>
> > >>        x->i = m;
> > >>      }
> > >> }
> > >>
> > >> x->i is still not marked ALWAYS_EXECUTED for outer loop.
> > > 
> > 
> > Collected data when build gcc stage1 and bootstrap.  There are still
> > about 9% bbs are missed to be marked with ALWAYS_EXECUTED.  Execution time
> > of fill_always_executed_in_1 is increased obvious in stage1 but it is in
> > mini-second level while bootstrap build doesn't show big time change.
> > 
> > base vs. incoming_edge_counter_patch:
> > 
> > 1. For stage1: 2214/2216 functions handled in both lim2 and lim4 with:
> >     -  bbs find:  base vs patched  "always executed":  17666  vs. 19278.
> >     - total cost time of fill_always_executed_in_1 (usec):  30574 vs. 51238.
> > 2. bootstrap: 33910/33951 functions handled in both lim2 and lim4 with:
> >     - bbs find:  base vs patched  "always executed":  201171 vs. 222942.
> >     - total cost time of fill_always_executed_in_1 (usec):  987054  vs.
> >     1061821.5
> > 
> > Do we still have plan to replace the get_loop_body_in_dom_order with
> > linear search or just "limit the processing of conditionally executed inner
> > loops by relative depth"? ;)
> 
> Now that I think I understand how it all works we're going to stay
> with the current dominator based approach.  As yet another
> incremental change I have bootstrapped and tested the patch below
> which cuts off some processing time from the current implementation
> making it O(n) as long as we are not walking conditionally
> executed subloops.
> 
> I don't see yet that the greedy linear walk patch is correct, so
> the option forward to reap all the remaining opportunities would
> be to either do limited processing of conditionally executed
> inner loops or process all of them but for example gated by
> flag_expensive_optimizations.  Another option is to ignore
> the complexity concern given there's other quadraticnesses
> in loop depth in LIM (but on other entities).

So I've created a testcase (PR102253) and indeed other issues pop
up, and LIM is not even on the radar here with the patch below,
which I've now bootstrapped and tested on x86_64-unknown-linux-gnu
and pushed to trunk.

Richard.

From 013cfc648405a8a118d07436f103e4d70224fe00 Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Thu, 9 Sep 2021 11:50:20 +0200
Subject: [PATCH] Improve LIM fill_always_executed_in computation
To: gcc-patches@gcc.gnu.org

Currently the DOM walk over a loop body does not walk into not
always executed subloops to avoid scalability issues since doing
so makes the walk quadratic in the loop depth.  It turns out this
is not an issue in practice and even with a loop depth of 1800
this function is way off the radar.

So the following patch removes the limitation, replacing it with
a comment.

2021-09-09  Richard Biener  <rguenther@suse.de>

	* tree-ssa-loop-im.c (fill_always_executed_in_1): Walk
	into all subloops.

	* gcc.dg/tree-ssa/ssa-lim-17.c: New testcase.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 20 ++++++++++++++++++++
 gcc/tree-ssa-loop-im.c                     | 16 +++++++---------
 2 files changed, 27 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
new file mode 100644
index 00000000000..1c840e554b7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int flag, bar;
+double foo (double *valp)
+{
+  double sum = 0;
+  for (int i = 0; i < 256; ++i)
+    {
+      if (flag)
+	for (int j = 0; j < 256; ++j)
+	  bar = flag;
+      if (flag)
+        sum += 1.;
+      sum += *valp; // we should move the load of *valp out of the loop
+    }
+  return sum;
+}
+
+/* { dg-final { scan-tree-dump-times "Moving statement" 1 "lim2" } } */
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 5d6845478e7..4b187c2cdaf 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3074,15 +3074,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
 	    break;
 
 	  if (bb->loop_father->header == bb)
-	    {
-	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-		break;
-
-	      /* In a loop that is always entered we may proceed anyway.
-		 But record that we entered it and stop once we leave it
-		 since it might not be finite.  */
-	      inn_loop = bb->loop_father;
-	    }
+	    /* Record that we enter into a subloop since it might not
+	       be finite.  */
+	    /* ???  Entering into a not always executed subloop makes
+	       fill_always_executed_in quadratic in loop depth since
+	       we walk those loops N times.  This is not a problem
+	       in practice though, see PR102253 for a worst-case testcase.  */
+	    inn_loop = bb->loop_father;
 
 	  /* Walk the body of LOOP sorted by dominance relation.  Additionally,
 	     if a basic block S dominates the latch, then only blocks dominated
-- 
2.31.1



* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-09 10:55               ` Richard Biener
@ 2021-09-10 13:54                 ` Xionghu Luo
  2021-09-13  2:28                   ` Xionghu Luo
  2021-09-13  6:55                   ` Richard Biener
  0 siblings, 2 replies; 15+ messages in thread
From: Xionghu Luo @ 2021-09-10 13:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches



On 2021/9/9 18:55, Richard Biener wrote:
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 5d6845478e7..4b187c2cdaf 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3074,15 +3074,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>   	    break;
>   
>   	  if (bb->loop_father->header == bb)
> -	    {
> -	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> -		break;
> -
> -	      /* In a loop that is always entered we may proceed anyway.
> -		 But record that we entered it and stop once we leave it
> -		 since it might not be finite.  */
> -	      inn_loop = bb->loop_father;
> -	    }
> +	    /* Record that we enter into a subloop since it might not
> +	       be finite.  */
> +	    /* ???  Entering into a not always executed subloop makes
> +	       fill_always_executed_in quadratic in loop depth since
> +	       we walk those loops N times.  This is not a problem
> +	       in practice though, see PR102253 for a worst-case testcase.  */
> +	    inn_loop = bb->loop_father;


Yes, your two patches embedded the get_loop_body_in_dom_order walk and
removed the break for subloop headers that don't dominate the outer
loop's latch.  I confirmed the replacement can save ~10% build time,
since it no longer does a full DOM walk, and that it marks the
previously missed ALWAYS_EXECUTED bbs.
But if we no longer break out of inner loops, why keep the *inn_loop*
variable at all?  It seems unnecessary and confusing; could we just
remove it and restore the original infinite-loop check on bb->succs for
better readability?


diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index d1e2104233b..82a0509e0c4 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3200,7 +3200,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
  {
    basic_block bb = NULL, last = NULL;
    edge e;
-  class loop *inn_loop = loop;

    if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
      {
@@ -3213,17 +3212,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
           edge_iterator ei;
           bb = worklist.pop ();

-         if (!flow_bb_inside_loop_p (inn_loop, bb))
-           {
-             /* When we are leaving a possibly infinite inner loop
-                we have to stop processing.  */
-             if (!finite_loop_p (inn_loop))
-               break;
-             /* If the loop was finite we can continue with processing
-                the loop we exited to.  */
-             inn_loop = bb->loop_father;
-           }
-
           if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
             last = bb;

@@ -3232,8 +3220,15 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)

          /* If LOOP exits from this BB stop processing.  */
           FOR_EACH_EDGE (e, ei, bb->succs)
+         {
             if (!flow_bb_inside_loop_p (loop, e->dest))
               break;
+           /* Or we enter a possibly non-finite loop.  */
+           if (flow_loop_nested_p (bb->loop_father,
+                 e->dest->loop_father)
+               && ! finite_loop_p (e->dest->loop_father))
+             break;
+         }
           if (e)
             break;

@@ -3242,15 +3237,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
           if (bb->flags & BB_IRREDUCIBLE_LOOP)
             break;

-         if (bb->loop_father->header == bb)
-           /* Record that we enter into a subloop since it might not
-              be finite.  */
-           /* ???  Entering into a not always executed subloop makes
-              fill_always_executed_in quadratic in loop depth since
-              we walk those loops N times.  This is not a problem
-              in practice though, see PR102253 for a worst-case testcase.  */
-           inn_loop = bb->loop_father;
-
           /* Walk the body of LOOP sorted by dominance relation.  Additionally,
              if a basic block S dominates the latch, then only blocks dominated
              by S are after it.

  

>   
>   	  /* Walk the body of LOOP sorted by dominance relation.  Additionally,
>   	     if a basic block S dominates the latch, then only blocks dominated

-- 
Thanks,
Xionghu


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-10 13:54                 ` Xionghu Luo
@ 2021-09-13  2:28                   ` Xionghu Luo
  2021-09-13  8:17                     ` Richard Biener
  2021-09-13  6:55                   ` Richard Biener
  1 sibling, 1 reply; 15+ messages in thread
From: Xionghu Luo @ 2021-09-13  2:28 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches



On 2021/9/10 21:54, Xionghu Luo via Gcc-patches wrote:
> 
> 
> On 2021/9/9 18:55, Richard Biener wrote:
>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>> index 5d6845478e7..4b187c2cdaf 100644
>> --- a/gcc/tree-ssa-loop-im.c
>> +++ b/gcc/tree-ssa-loop-im.c
>> @@ -3074,15 +3074,13 @@ fill_always_executed_in_1 (class loop *loop, 
>> sbitmap contains_call)
>>           break;
>>         if (bb->loop_father->header == bb)
>> -        {
>> -          if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>> -        break;
>> -
>> -          /* In a loop that is always entered we may proceed anyway.
>> -         But record that we entered it and stop once we leave it
>> -         since it might not be finite.  */
>> -          inn_loop = bb->loop_father;
>> -        }
>> +        /* Record that we enter into a subloop since it might not
>> +           be finite.  */
>> +        /* ???  Entering into a not always executed subloop makes
>> +           fill_always_executed_in quadratic in loop depth since
>> +           we walk those loops N times.  This is not a problem
>> +           in practice though, see PR102253 for a worst-case 
>> testcase.  */
>> +        inn_loop = bb->loop_father;
> 
> 
> Yes your two patches extracted the get_loop_body_in_dom_order out and 
> removed
> the inn_loop break logic when it doesn't dominate outer loop.  Confirmed 
> the replacement
> could improve for saving ~10% build time due to not full DOM walker and 
> marked the previously
> ignored ALWAYS_EXECUTED bbs.
> But if we don't break for inner loop again, why still keep the 
> *inn_loop* variable?
> It seems unnecessary and confusing, could we just remove it and restore 
> the original
> infinte loop check in bb->succs for better understanding?


What's more, the refinement of this fix is incorrect for PR78185.


commit 483e400870601f650c80f867ec781cd5f83507d6
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Sep 2 10:47:35 2021 +0200

     Refine fix for PR78185, improve LIM for code after inner loops
     
     This refines the fix for PR78185 after understanding that the code
     behind the comment 'In a loop that is always entered we may
     proceed anyway.  But record that we entered it and stop once we leave
     it.' was supposed to protect us from leaving possibly infinite inner
     loops.  The simpler fix of moving the misplaced stopping code
     can then be refined to continue processing when the exited inner
     loop is finite, improving invariant motion for cases like in the
     added testcase.
     
     2021-09-02  Richard Biener  <rguenther@suse.de>
     
             * tree-ssa-loop-im.c (fill_always_executed_in_1): Refine
             fix for PR78185 and continue processing when leaving
             finite inner loops.
     
             * gcc.dg/tree-ssa/ssa-lim-16.c: New testcase.


3<-------
|        |
6<---    |
| \  |   |
|  \ |   |
4    8   |
|---     |
|  |     |
5  7------
|
1

Loop 2 is an infinite loop; BB 6 is only ALWAYS_EXECUTED for loop 2,
but r12-3313-g483e40087 marks it as ALWAYS_EXECUTED for loop 1.
We need to restore the previous behavior like this:
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579195.html
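
For reference, a reduced C example with roughly this shape (a hypothetical
sketch, not the actual PR78185 testcase) would be:

  int a, b;

  void foo (void)
  {
    for (; a; a++)      /* outer loop 1 */
      {
        while (1)       /* inner loop 2, possibly infinite; its header
                           corresponds to BB 6 above */
          if (b)
            break;
        a = b;          /* code after the inner loop, not always executed
                           in loop 1 since loop 2 might not terminate */
      }
  }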


diff of pr78185.c.138t.lim2:
;;
  ;; Loop 1
  ;;  header 3, latch 7
  ;;  depth 1, outer 0
  ;;  nodes: 3 7 4 6 8
  ;;
  ;; Loop 2
  ;;  header 6, latch 8
  ;;  depth 2, outer 1
  ;;  nodes: 6 8
  ;; 2 succs { 3 }
  ;; 3 succs { 6 }
  ;; 6 succs { 4 8 }
  ;; 8 succs { 6 }
  ;; 4 succs { 7 5 }
  ;; 7 succs { 3 }
  ;; 5 succs { 1 }
  Memory reference 1: var1
-BB 6 is always executed in loop 1
  BB 3 is always executed in loop 1
+BB 6 is always executed in loop 2
  Basic block 3 (loop 1 -- depth 1):

  Basic block 6 (loop 2 -- depth 2):


> 
> 
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index d1e2104233b..82a0509e0c4 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3200,7 +3200,6 @@ fill_always_executed_in_1 (class loop *loop, 
> sbitmap contains_call)
>   {
>     basic_block bb = NULL, last = NULL;
>     edge e;
> -  class loop *inn_loop = loop;
> 
>     if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
>       {
> @@ -3213,17 +3212,6 @@ fill_always_executed_in_1 (class loop *loop, 
> sbitmap contains_call)
>            edge_iterator ei;
>            bb = worklist.pop ();
> 
> -         if (!flow_bb_inside_loop_p (inn_loop, bb))
> -           {
> -             /* When we are leaving a possibly infinite inner loop
> -                we have to stop processing.  */
> -             if (!finite_loop_p (inn_loop))
> -               break;
> -             /* If the loop was finite we can continue with processing
> -                the loop we exited to.  */
> -             inn_loop = bb->loop_father;
> -           }
> -
>            if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>              last = bb;
> 
> @@ -3232,8 +3220,15 @@ fill_always_executed_in_1 (class loop *loop, 
> sbitmap contains_call)
> 
>           /* If LOOP exits from this BB stop processing.  */
>            FOR_EACH_EDGE (e, ei, bb->succs)
> +         {
>              if (!flow_bb_inside_loop_p (loop, e->dest))
>                break;
> +           /* Or we enter a possibly non-finite loop.  */
> +           if (flow_loop_nested_p (bb->loop_father,
> +                 e->dest->loop_father)
> +               && ! finite_loop_p (e->dest->loop_father))
> +             break;
> +         }
>            if (e)
>              break;
> 
> @@ -3242,15 +3237,6 @@ fill_always_executed_in_1 (class loop *loop, 
> sbitmap contains_call)
>            if (bb->flags & BB_IRREDUCIBLE_LOOP)
>              break;
> 
> -         if (bb->loop_father->header == bb)
> -           /* Record that we enter into a subloop since it might not
> -              be finite.  */
> -           /* ???  Entering into a not always executed subloop makes
> -              fill_always_executed_in quadratic in loop depth since
> -              we walk those loops N times.  This is not a problem
> -              in practice though, see PR102253 for a worst-case 
> testcase.  */
> -           inn_loop = bb->loop_father;
> -
>            /* Walk the body of LOOP sorted by dominance relation.  
> Additionally,
>               if a basic block S dominates the latch, then only blocks 
> dominated
>               by S are after it.
> 
> 
> 
>>         /* Walk the body of LOOP sorted by dominance relation.  
>> Additionally,
>>            if a basic block S dominates the latch, then only blocks 
>> dominated
> 

-- 
Thanks,
Xionghu


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-10 13:54                 ` Xionghu Luo
  2021-09-13  2:28                   ` Xionghu Luo
@ 2021-09-13  6:55                   ` Richard Biener
  1 sibling, 0 replies; 15+ messages in thread
From: Richard Biener @ 2021-09-13  6:55 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches

On Fri, 10 Sep 2021, Xionghu Luo wrote:

> 
> 
> On 2021/9/9 18:55, Richard Biener wrote:
> > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> > index 5d6845478e7..4b187c2cdaf 100644
> > --- a/gcc/tree-ssa-loop-im.c
> > +++ b/gcc/tree-ssa-loop-im.c
> > @@ -3074,15 +3074,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
> >        break;
> >   
> >   	  if (bb->loop_father->header == bb)
> > -	    {
> > -	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> > -		break;
> > -
> > -	      /* In a loop that is always entered we may proceed anyway.
> > -		 But record that we entered it and stop once we leave it
> > -		 since it might not be finite.  */
> > -	      inn_loop = bb->loop_father;
> > -	    }
> > +	    /* Record that we enter into a subloop since it might not
> > +	       be finite.  */
> > +	    /* ???  Entering into a not always executed subloop makes
> > +	       fill_always_executed_in quadratic in loop depth since
> > +	       we walk those loops N times.  This is not a problem
> > +	       in practice though, see PR102253 for a worst-case testcase.  */
> > +	    inn_loop = bb->loop_father;
> 
> 
> Yes your two patches extracted the get_loop_body_in_dom_order out and removed
> the inn_loop break logic when it doesn't dominate outer loop.  Confirmed the
> replacement
> could improve for saving ~10% build time due to not full DOM walker and marked
> the previously
> ignored ALWAYS_EXECUTED bbs.
> But if we don't break for inner loop again, why still keep the *inn_loop*
> variable?
> It seems unnecessary and confusing, could we just remove it and restore the
> original
> infinte loop check in bb->succs for better understanding?

But that's sub-optimal since we then won't enter an always executed,
possibly infinite loop and thus won't hoist from it.  With the current
code we track that at the point we leave it.
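
For illustration (a hypothetical example, not one from the testsuite): in

  volatile int cond;
  int foo (int *p, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; ++i)
      do
        sum += *p;
      while (cond);
    return sum;
  }

the do-while subloop is entered on every outer iteration, so its header is
always executed in the outer loop and the load of *p can still be hoisted,
even though the subloop might not terminate.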

Richard.

> 
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index d1e2104233b..82a0509e0c4 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3200,7 +3200,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>  {
>    basic_block bb = NULL, last = NULL;
>    edge e;
> -  class loop *inn_loop = loop;
> 
>    if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
>      {
> @@ -3213,17 +3212,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>           edge_iterator ei;
>           bb = worklist.pop ();
> 
> -         if (!flow_bb_inside_loop_p (inn_loop, bb))
> -           {
> -             /* When we are leaving a possibly infinite inner loop
> -                we have to stop processing.  */
> -             if (!finite_loop_p (inn_loop))
> -               break;
> -             /* If the loop was finite we can continue with processing
> -                the loop we exited to.  */
> -             inn_loop = bb->loop_father;
> -           }
> -
>           if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>             last = bb;
> 
> @@ -3232,8 +3220,15 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
> 
>          /* If LOOP exits from this BB stop processing.  */
>           FOR_EACH_EDGE (e, ei, bb->succs)
> +         {
>             if (!flow_bb_inside_loop_p (loop, e->dest))
>               break;
> +           /* Or we enter a possibly non-finite loop.  */
> +           if (flow_loop_nested_p (bb->loop_father,
> +                 e->dest->loop_father)
> +               && ! finite_loop_p (e->dest->loop_father))
> +             break;
> +         }
>           if (e)
>             break;
> 
> @@ -3242,15 +3237,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>           if (bb->flags & BB_IRREDUCIBLE_LOOP)
>             break;
> 
> -         if (bb->loop_father->header == bb)
> -           /* Record that we enter into a subloop since it might not
> -              be finite.  */
> -           /* ???  Entering into a not always executed subloop makes
> -              fill_always_executed_in quadratic in loop depth since
> -              we walk those loops N times.  This is not a problem
> -              in practice though, see PR102253 for a worst-case testcase.  */
> -           inn_loop = bb->loop_father;
> -
>           /* Walk the body of LOOP sorted by dominance relation.
>           Additionally,
>              if a basic block S dominates the latch, then only blocks
>              dominated
>              by S are after it.
> 
>  
> >   
> >      /* Walk the body of LOOP sorted by dominance relation.  Additionally,
> >         if a basic block S dominates the latch, then only blocks dominated
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-13  2:28                   ` Xionghu Luo
@ 2021-09-13  8:17                     ` Richard Biener
  2021-09-14  1:15                       ` Xionghu Luo
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2021-09-13  8:17 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches

On Mon, 13 Sep 2021, Xionghu Luo wrote:

> 
> 
> On 2021/9/10 21:54, Xionghu Luo via Gcc-patches wrote:
> > 
> > 
> > On 2021/9/9 18:55, Richard Biener wrote:
> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >> index 5d6845478e7..4b187c2cdaf 100644
> >> --- a/gcc/tree-ssa-loop-im.c
> >> +++ b/gcc/tree-ssa-loop-im.c
> >> @@ -3074,15 +3074,13 @@ fill_always_executed_in_1 (class loop *loop, 
> >> sbitmap contains_call)
> >>           break;
> >>         if (bb->loop_father->header == bb)
> >> -        {
> >> -          if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >> -        break;
> >> -
> >> -          /* In a loop that is always entered we may proceed anyway.
> >> -         But record that we entered it and stop once we leave it
> >> -         since it might not be finite.  */
> >> -          inn_loop = bb->loop_father;
> >> -        }
> >> +        /* Record that we enter into a subloop since it might not
> >> +           be finite.  */
> >> +        /* ???  Entering into a not always executed subloop makes
> >> +           fill_always_executed_in quadratic in loop depth since
> >> +           we walk those loops N times.  This is not a problem
> >> +           in practice though, see PR102253 for a worst-case testcase.  */
> >> +        inn_loop = bb->loop_father;
> > 
> > 
> > Yes your two patches extracted the get_loop_body_in_dom_order out and
> > removed
> > the inn_loop break logic when it doesn't dominate outer loop.  Confirmed the
> > replacement
> > could improve for saving ~10% build time due to not full DOM walker and
> > marked the previously
> > ignored ALWAYS_EXECUTED bbs.
> > But if we don't break for inner loop again, why still keep the *inn_loop*
> > variable?
> > It seems unnecessary and confusing, could we just remove it and restore the
> > original
> > infinte loop check in bb->succs for better understanding?
> 
> 
> What's more, the refine of this fix is incorrect for PR78185.
> 
> 
> commit 483e400870601f650c80f867ec781cd5f83507d6
> Author: Richard Biener <rguenther@suse.de>
> Date:   Thu Sep 2 10:47:35 2021 +0200
> 
>     Refine fix for PR78185, improve LIM for code after inner loops
>     
>     This refines the fix for PR78185 after understanding that the code
>     regarding to the comment 'In a loop that is always entered we may
>     proceed anyway.  But record that we entered it and stop once we leave
>     it.' was supposed to protect us from leaving possibly infinite inner
>     loops.  The simpler fix of moving the misplaced stopping code
>     can then be refined to continue processing when the exited inner
>     loop is finite, improving invariant motion for cases like in the
>     added testcase.
>     
>     2021-09-02  Richard Biener  <rguenther@suse.de>
>     
>             * tree-ssa-loop-im.c (fill_always_executed_in_1): Refine
>             fix for PR78185 and continue processing when leaving
>             finite inner loops.
>     
>             * gcc.dg/tree-ssa/ssa-lim-16.c: New testcase.
> 
> 
> 3<-------
> |        |
> 6<---    |
> | \  |   |
> |  \ |   |
> 4    8   |
> |---     |
> |  |     |
> 5  7------
> |
> 1
> 
> loop 2 is an infinite loop, it is only ALWAYS_EXECUTED for loop 2,
> but r12-3313-g483e40087 sets it ALWAYS_EXECUTED for loop 1.
> We need to restore it like this:
> https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579195.html

I don't understand - BB6 is the header block of loop 2 which is
always entered and thus BB6 is always executed at least once.

The important part is that BB4 which follows the inner loop is
_not_ always executed because we don't know if we will exit the
inner loop.

What am I missing?

Richard.


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-13  8:17                     ` Richard Biener
@ 2021-09-14  1:15                       ` Xionghu Luo
  2021-09-14  7:30                         ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Xionghu Luo @ 2021-09-14  1:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches



On 2021/9/13 16:17, Richard Biener wrote:
> On Mon, 13 Sep 2021, Xionghu Luo wrote:
> 
>>
>>
>> On 2021/9/10 21:54, Xionghu Luo via Gcc-patches wrote:
>>>
>>>
>>> On 2021/9/9 18:55, Richard Biener wrote:
>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>>>> index 5d6845478e7..4b187c2cdaf 100644
>>>> --- a/gcc/tree-ssa-loop-im.c
>>>> +++ b/gcc/tree-ssa-loop-im.c
>>>> @@ -3074,15 +3074,13 @@ fill_always_executed_in_1 (class loop *loop,
>>>> sbitmap contains_call)
>>>>            break;
>>>>          if (bb->loop_father->header == bb)
>>>> -        {
>>>> -          if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>> -        break;
>>>> -
>>>> -          /* In a loop that is always entered we may proceed anyway.
>>>> -         But record that we entered it and stop once we leave it
>>>> -         since it might not be finite.  */
>>>> -          inn_loop = bb->loop_father;
>>>> -        }
>>>> +        /* Record that we enter into a subloop since it might not
>>>> +           be finite.  */
>>>> +        /* ???  Entering into a not always executed subloop makes
>>>> +           fill_always_executed_in quadratic in loop depth since
>>>> +           we walk those loops N times.  This is not a problem
>>>> +           in practice though, see PR102253 for a worst-case testcase.  */
>>>> +        inn_loop = bb->loop_father;
>>>
>>>
>>> Yes your two patches extracted the get_loop_body_in_dom_order out and
>>> removed
>>> the inn_loop break logic when it doesn't dominate outer loop.  Confirmed the
>>> replacement
>>> could improve for saving ~10% build time due to not full DOM walker and
>>> marked the previously
>>> ignored ALWAYS_EXECUTED bbs.
>>> But if we don't break for inner loop again, why still keep the *inn_loop*
>>> variable?
>>> It seems unnecessary and confusing, could we just remove it and restore the
>>> original
>>> infinte loop check in bb->succs for better understanding?
>>
>>
>> What's more, the refine of this fix is incorrect for PR78185.
>>
>>
>> commit 483e400870601f650c80f867ec781cd5f83507d6
>> Author: Richard Biener <rguenther@suse.de>
>> Date:   Thu Sep 2 10:47:35 2021 +0200
>>
>>      Refine fix for PR78185, improve LIM for code after inner loops
>>      
>>      This refines the fix for PR78185 after understanding that the code
>>      regarding to the comment 'In a loop that is always entered we may
>>      proceed anyway.  But record that we entered it and stop once we leave
>>      it.' was supposed to protect us from leaving possibly infinite inner
>>      loops.  The simpler fix of moving the misplaced stopping code
>>      can then be refined to continue processing when the exited inner
>>      loop is finite, improving invariant motion for cases like in the
>>      added testcase.
>>      
>>      2021-09-02  Richard Biener  <rguenther@suse.de>
>>      
>>              * tree-ssa-loop-im.c (fill_always_executed_in_1): Refine
>>              fix for PR78185 and continue processing when leaving
>>              finite inner loops.
>>      
>>              * gcc.dg/tree-ssa/ssa-lim-16.c: New testcase.
>>
>>
>> 3<-------
>> |        |
>> 6<---    |
>> | \  |   |
>> |  \ |   |
>> 4    8   |
>> |---     |
>> |  |     |
>> 5  7------
>> |
>> 1
>>
>> loop 2 is an infinite loop, it is only ALWAYS_EXECUTED for loop 2,
>> but r12-3313-g483e40087 sets it ALWAYS_EXECUTED for loop 1.
>> We need to restore it like this:
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579195.html
> 
> I don't understand - BB6 is the header block of loop 2 which is
> always entered and thus BB6 is always executed at least once.
> 
> The important part is that BB4 which follows the inner loop is
> _not_ always executed because we don't know if we will exit the
> inner loop.
> 
> What am I missing?

Oh, I see.  I only noticed the behavior change of the patch on that case,
and no check fails because of it, so I misread it as a regression rather
than an improvement that also allows hoisting invariants out of possibly
infinite loops; sorry about that.

So in the end fill_always_executed_in_1 can mark as ALWAYS_EXECUTED both
the blocks of subloops and the blocks after them, and it stops early when
leaving a possibly infinite subloop, with better performance, thanks.  The
only remaining worry is that replacing get_loop_body_in_dom_order makes
the code a bit harder for later readers to follow, now that loop depth and
DOM order are no longer a problem here? ;)

> 
> Richard.
> 

-- 
Thanks,
Xionghu


* Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk
  2021-09-14  1:15                       ` Xionghu Luo
@ 2021-09-14  7:30                         ` Richard Biener
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Biener @ 2021-09-14  7:30 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches

On Tue, 14 Sep 2021, Xionghu Luo wrote:

> 
> 
> On 2021/9/13 16:17, Richard Biener wrote:
> > 
[...]
> > I don't understand - BB6 is the header block of loop 2 which is
> > always entered and thus BB6 is always executed at least once.
> > 
> > The important part is that BB4 which follows the inner loop is
> > _not_ always executed because we don't know if we will exit the
> > inner loop.
> > 
> > What am I missing?
> 
> Oh, I see.  I only noticed the functionality change of the patch on the case
> and no failure check of it, misunderstood it was a regression instead of an
> improvement to also hoisting invariants from infinite loop, sorry about that.
> 
> Finally, the function fill_always_executed_in_1 could mark all ALWAYS_EXECUTED
> bb both including and after all subloops' bb but break after exiting from
> infinite subloops with better performance, thanks.  The only thing to be
> worried is replacing get_loop_body_in_dom_order makes the code a bit more
> complicated for later readers as the loop depth and DOM order is not a problem
> here any more? ;)

Yeah, but embedding the DOM walk _does_ improve the runtime of the code,
and it would in principle allow us to avoid entering conditionally executed
loops that do not always terminate (that's something we could pre-compute
and propagate up the loop tree).  It's just that it didn't seem worth adding
even more code given I couldn't make the function pop up on the radar...
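
A rough sketch of that pre-computation (purely hypothetical, not part of
any committed patch; a real version would cache the result per loop to
stay linear):

  /* Return true if LOOP's body contains a call or a subloop, at any
     depth, that we cannot prove finite.  */
  static bool
  loop_has_call_or_maybe_endless_subloop (class loop *loop,
                                          sbitmap contains_call)
  {
    basic_block *bbs = get_loop_body (loop);
    bool res = false;
    for (unsigned i = 0; i < loop->num_nodes && !res; ++i)
      res = bitmap_bit_p (contains_call, bbs[i]->index);
    free (bbs);
    if (res)
      return true;
    for (class loop *inner = loop->inner; inner; inner = inner->next)
      if (!finite_loop_p (inner)
          || loop_has_call_or_maybe_endless_subloop (inner, contains_call))
        return true;
    return false;
  }

As discussed earlier in the thread, the walk could then skip the body of a
conditionally executed subloop for which this returns false and stop when
it returns true, instead of descending into every subloop.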

Richard.


end of thread

Thread overview: 15+ messages
2021-09-01  9:58 [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in CFG walk Richard Biener
2021-09-02  5:09 ` Xionghu Luo
2021-09-02  8:00   ` Richard Biener
2021-09-02  8:50     ` Richard Biener
2021-09-02 10:00       ` Xionghu Luo
2021-09-02 10:37         ` Richard Biener
2021-09-09  6:40           ` Xionghu Luo
2021-09-09  9:16             ` Richard Biener
2021-09-09 10:55               ` Richard Biener
2021-09-10 13:54                 ` Xionghu Luo
2021-09-13  2:28                   ` Xionghu Luo
2021-09-13  8:17                     ` Richard Biener
2021-09-14  1:15                       ` Xionghu Luo
2021-09-14  7:30                         ` Richard Biener
2021-09-13  6:55                   ` Richard Biener
