Loop-ch improvements, part 3

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Loop-ch improvements, part 3
@ 2023-07-14 12:22 Jan Hubicka
  2023-07-17  9:17 ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Hubicka @ 2023-07-14 12:22 UTC (permalink / raw)
  To: gcc-patches, rguenther

Hi,
loop-ch currently does analysis using ranger for all loops to identify
candidates and then follows by phase where headers are duplicated (which
breaks SSA and ranger).  The second stage does more analysis (to see how
many BBs we want to duplicate) but can't use ranger and thus misses
information about static conditionals.

This patch pushes all analysis into the first stage. We record how many
BBs to duplicate and the second stage just duplicats as it is told so.
This makes it possible to also extend range query done also to basic
blocks that are not headers.  This is easy to do, since we already do
path specific query so we only need to extend the path by headers we
decided to dulicate earlier.

This makes it possible to track situations where exit that is always
false in the first iteration for tests not in the original loop header.
Doing so lets us to update profile better and do better heuristics.  In
particular I changed logic as follows
  1) should_duplicate_loop_header_p counts size of duplicated region.  When we
     know that a given conditional will be constant true or constant false either
     in the duplicated region, by range query, or in the loop body after
     duplication (since it is loop invariant), we do not account it to code size
     costs
  2) don't need account loop invariant compuations that will be duplicated
     as they will become fully invariant
     (maybe we want to have some cap for register pressure eventually?)
  3) optimize_size logic is now different.  Originally we started duplicating
     iff the first conditional was known to be true by ranger query, but then
     we used same limits as for -O2.

     I now simply lower limits to 0. This means that every conditional
     in duplicated sequence must be either loop invariant or constant when
     duplicated and we only duplicate statements computing loop invariants
     and those we account to 0 size anyway,

This makes code IMO more streamlined (and hopefully will let us to merge
ibts with loop peeling logic), but makes little difference in practice.
The problem is that in loop:

void test2();
void test(int n)
{
  for (int i = 0; n && i < 10; i++)
	  test2();
}

We produce:
  <bb 4> [local count: 1073741824 freq: 9.090909]:
  # i_4 = PHI <0(2), i_9(3)>
  _1 = n_7(D) != 0;
  _2 = i_4 <= 9;
  _3 = _1 & _2;
  if (_3 != 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 5>; [11.00%]

and do not understand that the final conditional is a combination of a conditional
that is always true in first iteration and a conditional that is loop invariant.

This is also the case of
void test2();
void test(int n)
{
  for (int i = 0; n; i++)
    {
      if (i > 10)
        break;
      test2();
    }
}
Which we turn to the earlier case in ifcombine.

With disabled ifcombine things however works as exepcted.  This is something
I plan to handle incrementally.  However extending loop-ch and peeling passes
to understand such combined conditionals is still not good enough: at the time ifcombine
merged the two conditionals we lost profile information on how often n is 0,
so we can't recover correct profile or know what is expected number of iterations
after the transofrm.

Bootstrapped/regtested x86_64-linux, OK?

Honza


gcc/ChangeLog:

	* tree-ssa-loop-ch.cc (edge_range_query): Take loop argument; be ready
	for queries not in headers.
	(static_loop_exit): Add basic blck parameter; update use of
	edge_range_query
	(should_duplicate_loop_header_p): Add ranger and static_exits
	parameter.  Do not account statements that will be optimized
	out after duplicaiton in overall size. Add ranger query to
	find static exits.
	(update_profile_after_ch):  Take static_exits has set instead of
	single eliminated_edge.
	(ch_base::copy_headers): Do all analysis in the first pass;
	remember invariant_exits and static_exits.

diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 24e7fbc805a..e0139cb432c 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
    the range of the solved conditional in R.  */
 
 static void
-edge_range_query (irange &r, edge e, gcond *cond, gimple_ranger &ranger)
+edge_range_query (irange &r, class loop *loop, gcond *cond, gimple_ranger &ranger)
 {
-  auto_vec<basic_block> path (2);
-  path.safe_push (e->dest);
-  path.safe_push (e->src);
+  auto_vec<basic_block, 8> path;
+  for (basic_block bb = gimple_bb (cond); bb != loop->header; bb = single_pred_edge (bb)->src)
+    path.safe_push (bb);
+  path.safe_push (loop->header);
+  path.safe_push (loop_preheader_edge (loop)->src);
   path_range_query query (ranger, path);
   if (!query.range_of_stmt (r, cond))
     r.set_varying (boolean_type_node);
@@ -63,17 +65,16 @@ edge_range_query (irange &r, edge e, gcond *cond, gimple_ranger &ranger)
    and NULL otherwise.  */
 
 static edge
-static_loop_exit (class loop *l, gimple_ranger *ranger)
+static_loop_exit (class loop *l, basic_block bb, gimple_ranger *ranger)
 {
-  edge e = loop_preheader_edge (l);
-  gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (e->dest));
+  gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (bb));
   edge ret_e;
 
   if (!last)
     return NULL;
 
   edge true_e, false_e;
-  extract_true_false_edges_from_block (e->dest, &true_e, &false_e);
+  extract_true_false_edges_from_block (bb, &true_e, &false_e);
 
   /* If neither edge is the exit edge, this is not a case we'd like to
      special-case.  */
@@ -93,7 +94,7 @@ static_loop_exit (class loop *l, gimple_ranger *ranger)
   }
 
   int_range<2> r;
-  edge_range_query (r, e, last, *ranger);
+  edge_range_query (r, l, last, *ranger);
   return r == desired_static_range ? ret_e : NULL;
 }
 
@@ -131,8 +132,10 @@ loop_iv_derived_p (class loop *loop,
 
 static bool
 should_duplicate_loop_header_p (basic_block header, class loop *loop,
+				gimple_ranger *ranger,
 				int *limit,
-				hash_set <edge> *invariant_exits)
+				hash_set <edge> *invariant_exits,
+				hash_set <edge> *static_exits)
 {
   gimple_stmt_iterator bsi;
 
@@ -209,6 +212,9 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
       if (is_gimple_debug (last))
 	continue;
 
+      if (gimple_code (last) == GIMPLE_COND)
+	break;
+
       if (gimple_code (last) == GIMPLE_CALL
 	  && (!gimple_inexpensive_call_p (as_a <gcall *> (last))
 	      /* IFN_LOOP_DIST_ALIAS means that inner loop is distributed
@@ -222,16 +228,6 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
 	  return false;
 	}
 
-      *limit -= estimate_num_insns (last, &eni_size_weights);
-      if (*limit < 0)
-	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
-	    fprintf (dump_file,
-		     "  Not duplicating bb %i contains too many insns.\n",
-		     header->index);
-	  return false;
-	}
-
       /* Classify the stmt based on whether its computation is based
          on a IV or whether it is invariant in the loop.  */
       gimple_set_uid (last, 0);
@@ -257,9 +253,36 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
 		  inv = false;
 	      }
 	  gimple_set_uid (last, (iv ? 1 : 0) | (inv ? 2 : 0));
+	  /* Loop invariants will be optimized out in loop body after
+	     duplication; do not account invariant computation in code
+	     size costs.  */
+	  if (inv)
+	    continue;
+	}
+
+      *limit -= estimate_num_insns (last, &eni_size_weights);
+      if (*limit < 0)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "  Not duplicating bb %i contains too many insns.\n",
+		     header->index);
+	  return false;
 	}
     }
 
+  edge static_exit = static_loop_exit (loop, header, ranger);
+
+  if (static_exit && static_exits)
+    {
+      static_exits->add (static_exit);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "    Will eliminate peeled conditional in bb %d.\n",
+		 static_exit->src->index);
+      /* Still look for invariant exits; exit may be both.  */
+    }
+
   /* If the condition tests a non-IV loop variant we do not want to rotate
      the loop further.  Unless this is the original loop header.  */
   tree lhs = gimple_cond_lhs (last);
@@ -284,12 +307,28 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
 	}
       return true;
     }
+
+  if (static_exit)
+    return true;
+
+  /* We was not able to prove that conditional will be eliminated.  */
+  *limit -= estimate_num_insns (last, &eni_size_weights);
+  if (*limit < 0)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "  Not duplicating bb %i contains too many insns.\n",
+		 header->index);
+      return false;
+    }
+
   /* TODO: This is heuristics that claims that IV based ocnditionals will
      likely be optimized out in duplicated header.  We could use ranger
      query instead to tell this more precisely.  */
-  if (header != loop->header
-      && ((!lhs_invariant && !loop_iv_derived_p (loop, lhs))
-	  || (!rhs_invariant && !loop_iv_derived_p (loop, rhs))))
+  if ((lhs_invariant || loop_iv_derived_p (loop, lhs))
+      && (rhs_invariant || loop_iv_derived_p (loop, rhs)))
+    return true;
+  if (header != loop->header)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
@@ -386,8 +425,9 @@ do_while_loop_p (class loop *loop)
 static void
 update_profile_after_ch (class loop *loop,
 			 basic_block *region, basic_block *region_copy,
-			 unsigned n_region, edge eliminated_edge,
+			 unsigned n_region,
 			 hash_set <edge> *invariant_exits,
+			 hash_set <edge> *static_exits,
 			 profile_count entry_count)
 {
   for (unsigned int i = 0; i < n_region; i++)
@@ -421,10 +461,13 @@ update_profile_after_ch (class loop *loop,
 		      && e_copy->dest == region_copy[i + 1]));
       region_copy[i]->count = entry_count;
       profile_count exit_e_count = exit_e->count ();
-      if (eliminated_edge == exit_e)
+      bool was_static = false;
+      if (static_exits->contains (exit_e))
 	{
 	  /* Update profile and the conditional.
 	     CFG update is done by caller.  */
+	  static_exits->remove (exit_e);
+	  was_static = true;
 	  e_copy->probability = profile_probability::always ();
 	  exit_e_copy->probability = profile_probability::never ();
 	  gcond *cond_stmt
@@ -437,7 +480,6 @@ update_profile_after_ch (class loop *loop,
 	  /* Header copying is a special case of jump threading, so use
 	     common code to update loop body exit condition.  */
 	  update_bb_profile_for_threading (region[i], entry_count, e);
-	  eliminated_edge = NULL;
 	}
       else
 	region[i]->count -= region_copy[i]->count;
@@ -448,7 +490,7 @@ update_profile_after_ch (class loop *loop,
 	     loop, so increase probability accordingly.
 	     If the edge is eliminated_edge we already corrected
 	     profile above.  */
-	  if (entry_count.nonzero_p () && eliminated_edge != exit_e)
+	  if (entry_count.nonzero_p () && !was_static)
 	    set_edge_probability_and_rescale_others
 		    (exit_e_copy, exit_e_count.probability_in
 							(entry_count));
@@ -465,7 +507,7 @@ update_profile_after_ch (class loop *loop,
       entry_count = e_copy->count ();
     }
   /* Be sure that we seen all edges we are supposed to update.  */
-  gcc_checking_assert (!eliminated_edge
+  gcc_checking_assert (static_exits->is_empty ()
 		       && invariant_exits->is_empty ());
 }
 
@@ -564,10 +606,7 @@ protected:
 unsigned int
 ch_base::copy_headers (function *fun)
 {
-  basic_block header;
-  edge exit, nonexit, entry;
   basic_block *bbs, *copied_bbs;
-  unsigned n_bbs;
   unsigned bbs_size;
   bool changed = false;
 
@@ -578,7 +617,12 @@ ch_base::copy_headers (function *fun)
   copied_bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (fun));
   bbs_size = n_basic_blocks_for_fn (fun);
 
-  struct candidate {class loop *loop; edge static_exit;};
+  struct candidate
+    {
+      class loop *loop;
+      unsigned int nheaders;
+      hash_set <edge> *invariant_exits, *static_exits;
+    };
   auto_vec<struct candidate> candidates;
   auto_vec<std::pair<edge, loop_p> > copied;
   auto_vec<class loop *> loops_to_unloop;
@@ -588,13 +632,14 @@ ch_base::copy_headers (function *fun)
   gimple_ranger *ranger = new gimple_ranger;
   for (auto loop : loops_list (cfun, 0))
     {
-      int initial_limit = param_max_loop_header_insns;
+      int initial_limit = optimize_loop_for_speed_p (loop)
+			  ? param_max_loop_header_insns : 0;
       int remaining_limit = initial_limit;
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
 		 "Analyzing loop %i\n", loop->num);
 
-      header = loop->header;
+      basic_block header = loop->header;
       if (!get_max_loop_iterations_int (loop))
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
@@ -613,24 +658,38 @@ ch_base::copy_headers (function *fun)
 	  || !process_loop_p (loop))
 	continue;
 
-      edge static_exit = static_loop_exit (loop, ranger);
-
-      /* Avoid loop header copying when optimizing for size unless we can
-	 determine that the loop condition is static in the first
-	 iteration.  */
-      if (optimize_loop_for_size_p (loop)
-	  && !loop->force_vectorize
-	  && !static_exit)
+      /* Iterate the header copying up to limit; this takes care of the cases
+	 like while (a && b) {...}, where we want to have both of the conditions
+	 copied.  TODO -- handle while (a || b) - like cases, by not requiring
+	 the header to have just a single successor and copying up to
+	 postdominator.  */
+      int nheaders = 0;
+      hash_set <edge> *invariant_exits = new hash_set <edge>;
+      hash_set <edge> *static_exits = new hash_set <edge>;
+      while (should_duplicate_loop_header_p (header, loop, ranger,
+					     &remaining_limit,
+					     invariant_exits,
+					     static_exits))
 	{
+	  nheaders++;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
-	    fprintf (dump_file,
-		     "  Not duplicating bb %i: optimizing for size.\n",
-		     header->index);
-	  continue;
+	    fprintf (dump_file, "    Will duplicate bb %i\n", header->index);
+
+	  /* Find a successor of header that is inside a loop; i.e. the new
+	     header after the condition is copied.  */
+	  if (flow_bb_inside_loop_p (loop, EDGE_SUCC (header, 0)->dest))
+	    header = EDGE_SUCC (header, 0)->dest;
+	  else
+	    header = EDGE_SUCC (header, 1)->dest;
 	}
 
-      if (should_duplicate_loop_header_p (header, loop, &remaining_limit, NULL))
-	candidates.safe_push ({loop, static_exit});
+      if (nheaders)
+	candidates.safe_push ({loop, nheaders, invariant_exits, static_exits});
+      else
+	{
+	  delete invariant_exits;
+	  delete static_exits;
+	}
     }
   /* Do not use ranger after we change the IL and not have updated SSA.  */
   delete ranger;
@@ -638,37 +697,25 @@ ch_base::copy_headers (function *fun)
   for (auto candidate : candidates)
     {
       class loop *loop = candidate.loop;
-      int initial_limit = param_max_loop_header_insns;
-      int remaining_limit = initial_limit;
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
 		 "Copying headers of loop %i\n", loop->num);
 
-      header = loop->header;
-
-      /* Iterate the header copying up to limit; this takes care of the cases
-	 like while (a && b) {...}, where we want to have both of the conditions
-	 copied.  TODO -- handle while (a || b) - like cases, by not requiring
-	 the header to have just a single successor and copying up to
-	 postdominator.  */
-
-      nonexit = NULL;
-      n_bbs = 0;
+      basic_block header = loop->header;
+      edge nonexit = NULL;
+      edge exit = NULL;
+      unsigned int n_bbs = 0;
       int nexits = 0;
       profile_count exit_count = profile_count::zero ();
       profile_count entry_count = profile_count::zero ();
       edge e;
       edge_iterator ei;
-      hash_set <edge> invariant_exits;
+
       FOR_EACH_EDGE (e, ei, loop->header->preds)
 	if (e->src != loop->latch)
 	  entry_count += e->count ();
-      while (should_duplicate_loop_header_p (header, loop, &remaining_limit,
-					     &invariant_exits))
+      while (n_bbs < candidate.nheaders)
 	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
-	    fprintf (dump_file, "    Will duplicate bb %i\n", header->index);
-
 	  /* Find a successor of header that is inside a loop; i.e. the new
 	     header after the condition is copied.  */
 	  if (flow_bb_inside_loop_p (loop, EDGE_SUCC (header, 0)->dest))
@@ -688,21 +735,10 @@ ch_base::copy_headers (function *fun)
 	  header = nonexit->dest;
 	}
 
-      if (!nonexit)
-	continue;
-
       if (dump_file && (dump_flags & TDF_DETAILS))
-	{
-	  fprintf (dump_file,
-		   "Duplicating header of the loop %d up to edge %d->%d,"
-		   " %i insns.\n",
-		   loop->num, exit->src->index, exit->dest->index,
-		   initial_limit - remaining_limit);
-	  if (candidate.static_exit)
-	    fprintf (dump_file,
-		     "  Will eliminate peeled conditional in bb %d.\n",
-		     candidate.static_exit->src->index);
-	}
+	fprintf (dump_file,
+		 "Duplicating header of the loop %d up to edge %d->%d\n",
+		 loop->num, exit->src->index, exit->dest->index);
 
       /* Ensure that the header will have just the latch as a predecessor
 	 inside the loop.  */
@@ -712,20 +748,25 @@ ch_base::copy_headers (function *fun)
 	  exit = single_pred_edge (header);
 	}
 
-      entry = loop_preheader_edge (loop);
+      edge entry = loop_preheader_edge (loop);
 
       propagate_threaded_block_debug_into (exit->dest, entry->dest);
       if (!gimple_duplicate_seme_region (entry, exit, bbs, n_bbs, copied_bbs,
 					 true))
 	{
+	  delete candidate.static_exits;
+	  delete candidate.invariant_exits;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "Duplication failed.\n");
 	  continue;
 	}
       if (loop->header->count.initialized_p ())
 	update_profile_after_ch (loop, bbs, copied_bbs, n_bbs,
-				 candidate.static_exit, &invariant_exits,
+				 candidate.invariant_exits,
+				 candidate.static_exits,
 				 entry_count);
+      delete candidate.static_exits;
+      delete candidate.invariant_exits;
       copied.safe_push (std::make_pair (entry, loop));
 
       /* If the loop has the form "for (i = j; i < j + 10; i++)" then

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Loop-ch improvements, part 3
  2023-07-14 12:22 Loop-ch improvements, part 3 Jan Hubicka
@ 2023-07-17  9:17 ` Richard Biener
  2023-08-22  5:24   ` Hongtao Liu
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2023-07-17  9:17 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches

On Fri, 14 Jul 2023, Jan Hubicka wrote:

> Hi,
> loop-ch currently does analysis using ranger for all loops to identify
> candidates and then follows by phase where headers are duplicated (which
> breaks SSA and ranger).  The second stage does more analysis (to see how
> many BBs we want to duplicate) but can't use ranger and thus misses
> information about static conditionals.
> 
> This patch pushes all analysis into the first stage. We record how many
> BBs to duplicate and the second stage just duplicats as it is told so.
> This makes it possible to also extend range query done also to basic
> blocks that are not headers.  This is easy to do, since we already do
> path specific query so we only need to extend the path by headers we
> decided to dulicate earlier.
> 
> This makes it possible to track situations where exit that is always
> false in the first iteration for tests not in the original loop header.
> Doing so lets us to update profile better and do better heuristics.  In
> particular I changed logic as follows
>   1) should_duplicate_loop_header_p counts size of duplicated region.  When we
>      know that a given conditional will be constant true or constant false either
>      in the duplicated region, by range query, or in the loop body after
>      duplication (since it is loop invariant), we do not account it to code size
>      costs
>   2) don't need account loop invariant compuations that will be duplicated
>      as they will become fully invariant
>      (maybe we want to have some cap for register pressure eventually?)
>   3) optimize_size logic is now different.  Originally we started duplicating
>      iff the first conditional was known to be true by ranger query, but then
>      we used same limits as for -O2.
> 
>      I now simply lower limits to 0. This means that every conditional
>      in duplicated sequence must be either loop invariant or constant when
>      duplicated and we only duplicate statements computing loop invariants
>      and those we account to 0 size anyway,
> 
> This makes code IMO more streamlined (and hopefully will let us to merge
> ibts with loop peeling logic), but makes little difference in practice.
> The problem is that in loop:
> 
> void test2();
> void test(int n)
> {
>   for (int i = 0; n && i < 10; i++)
> 	  test2();
> }
> 
> We produce:
>   <bb 4> [local count: 1073741824 freq: 9.090909]:
>   # i_4 = PHI <0(2), i_9(3)>
>   _1 = n_7(D) != 0;
>   _2 = i_4 <= 9;
>   _3 = _1 & _2;
>   if (_3 != 0)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 5>; [11.00%]
> 
> and do not understand that the final conditional is a combination of a conditional
> that is always true in first iteration and a conditional that is loop invariant.
> 
> This is also the case of
> void test2();
> void test(int n)
> {
>   for (int i = 0; n; i++)
>     {
>       if (i > 10)
>         break;
>       test2();
>     }
> }
> Which we turn to the earlier case in ifcombine.
> 
> With disabled ifcombine things however works as exepcted.  This is something
> I plan to handle incrementally.  However extending loop-ch and peeling passes
> to understand such combined conditionals is still not good enough: at the time ifcombine
> merged the two conditionals we lost profile information on how often n is 0,
> so we can't recover correct profile or know what is expected number of iterations
> after the transofrm.
> 
> Bootstrapped/regtested x86_64-linux, OK?

OK.

Thanks,
Richard.

> Honza
> 
> 
> gcc/ChangeLog:
> 
> 	* tree-ssa-loop-ch.cc (edge_range_query): Take loop argument; be ready
> 	for queries not in headers.
> 	(static_loop_exit): Add basic blck parameter; update use of
> 	edge_range_query
> 	(should_duplicate_loop_header_p): Add ranger and static_exits
> 	parameter.  Do not account statements that will be optimized
> 	out after duplicaiton in overall size. Add ranger query to
> 	find static exits.
> 	(update_profile_after_ch):  Take static_exits has set instead of
> 	single eliminated_edge.
> 	(ch_base::copy_headers): Do all analysis in the first pass;
> 	remember invariant_exits and static_exits.
> 
> diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
> index 24e7fbc805a..e0139cb432c 100644
> --- a/gcc/tree-ssa-loop-ch.cc
> +++ b/gcc/tree-ssa-loop-ch.cc
> @@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
>     the range of the solved conditional in R.  */
>  
>  static void
> -edge_range_query (irange &r, edge e, gcond *cond, gimple_ranger &ranger)
> +edge_range_query (irange &r, class loop *loop, gcond *cond, gimple_ranger &ranger)
>  {
> -  auto_vec<basic_block> path (2);
> -  path.safe_push (e->dest);
> -  path.safe_push (e->src);
> +  auto_vec<basic_block, 8> path;
> +  for (basic_block bb = gimple_bb (cond); bb != loop->header; bb = single_pred_edge (bb)->src)
> +    path.safe_push (bb);
> +  path.safe_push (loop->header);
> +  path.safe_push (loop_preheader_edge (loop)->src);
>    path_range_query query (ranger, path);
>    if (!query.range_of_stmt (r, cond))
>      r.set_varying (boolean_type_node);
> @@ -63,17 +65,16 @@ edge_range_query (irange &r, edge e, gcond *cond, gimple_ranger &ranger)
>     and NULL otherwise.  */
>  
>  static edge
> -static_loop_exit (class loop *l, gimple_ranger *ranger)
> +static_loop_exit (class loop *l, basic_block bb, gimple_ranger *ranger)
>  {
> -  edge e = loop_preheader_edge (l);
> -  gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (e->dest));
> +  gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (bb));
>    edge ret_e;
>  
>    if (!last)
>      return NULL;
>  
>    edge true_e, false_e;
> -  extract_true_false_edges_from_block (e->dest, &true_e, &false_e);
> +  extract_true_false_edges_from_block (bb, &true_e, &false_e);
>  
>    /* If neither edge is the exit edge, this is not a case we'd like to
>       special-case.  */
> @@ -93,7 +94,7 @@ static_loop_exit (class loop *l, gimple_ranger *ranger)
>    }
>  
>    int_range<2> r;
> -  edge_range_query (r, e, last, *ranger);
> +  edge_range_query (r, l, last, *ranger);
>    return r == desired_static_range ? ret_e : NULL;
>  }
>  
> @@ -131,8 +132,10 @@ loop_iv_derived_p (class loop *loop,
>  
>  static bool
>  should_duplicate_loop_header_p (basic_block header, class loop *loop,
> +				gimple_ranger *ranger,
>  				int *limit,
> -				hash_set <edge> *invariant_exits)
> +				hash_set <edge> *invariant_exits,
> +				hash_set <edge> *static_exits)
>  {
>    gimple_stmt_iterator bsi;
>  
> @@ -209,6 +212,9 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>        if (is_gimple_debug (last))
>  	continue;
>  
> +      if (gimple_code (last) == GIMPLE_COND)
> +	break;
> +
>        if (gimple_code (last) == GIMPLE_CALL
>  	  && (!gimple_inexpensive_call_p (as_a <gcall *> (last))
>  	      /* IFN_LOOP_DIST_ALIAS means that inner loop is distributed
> @@ -222,16 +228,6 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>  	  return false;
>  	}
>  
> -      *limit -= estimate_num_insns (last, &eni_size_weights);
> -      if (*limit < 0)
> -	{
> -	  if (dump_file && (dump_flags & TDF_DETAILS))
> -	    fprintf (dump_file,
> -		     "  Not duplicating bb %i contains too many insns.\n",
> -		     header->index);
> -	  return false;
> -	}
> -
>        /* Classify the stmt based on whether its computation is based
>           on a IV or whether it is invariant in the loop.  */
>        gimple_set_uid (last, 0);
> @@ -257,9 +253,36 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>  		  inv = false;
>  	      }
>  	  gimple_set_uid (last, (iv ? 1 : 0) | (inv ? 2 : 0));
> +	  /* Loop invariants will be optimized out in loop body after
> +	     duplication; do not account invariant computation in code
> +	     size costs.  */
> +	  if (inv)
> +	    continue;
> +	}
> +
> +      *limit -= estimate_num_insns (last, &eni_size_weights);
> +      if (*limit < 0)
> +	{
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    fprintf (dump_file,
> +		     "  Not duplicating bb %i contains too many insns.\n",
> +		     header->index);
> +	  return false;
>  	}
>      }
>  
> +  edge static_exit = static_loop_exit (loop, header, ranger);
> +
> +  if (static_exit && static_exits)
> +    {
> +      static_exits->add (static_exit);
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file,
> +		 "    Will eliminate peeled conditional in bb %d.\n",
> +		 static_exit->src->index);
> +      /* Still look for invariant exits; exit may be both.  */
> +    }
> +
>    /* If the condition tests a non-IV loop variant we do not want to rotate
>       the loop further.  Unless this is the original loop header.  */
>    tree lhs = gimple_cond_lhs (last);
> @@ -284,12 +307,28 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>  	}
>        return true;
>      }
> +
> +  if (static_exit)
> +    return true;
> +
> +  /* We was not able to prove that conditional will be eliminated.  */
> +  *limit -= estimate_num_insns (last, &eni_size_weights);
> +  if (*limit < 0)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file,
> +		 "  Not duplicating bb %i contains too many insns.\n",
> +		 header->index);
> +      return false;
> +    }
> +
>    /* TODO: This is heuristics that claims that IV based ocnditionals will
>       likely be optimized out in duplicated header.  We could use ranger
>       query instead to tell this more precisely.  */
> -  if (header != loop->header
> -      && ((!lhs_invariant && !loop_iv_derived_p (loop, lhs))
> -	  || (!rhs_invariant && !loop_iv_derived_p (loop, rhs))))
> +  if ((lhs_invariant || loop_iv_derived_p (loop, lhs))
> +      && (rhs_invariant || loop_iv_derived_p (loop, rhs)))
> +    return true;
> +  if (header != loop->header)
>      {
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	fprintf (dump_file,
> @@ -386,8 +425,9 @@ do_while_loop_p (class loop *loop)
>  static void
>  update_profile_after_ch (class loop *loop,
>  			 basic_block *region, basic_block *region_copy,
> -			 unsigned n_region, edge eliminated_edge,
> +			 unsigned n_region,
>  			 hash_set <edge> *invariant_exits,
> +			 hash_set <edge> *static_exits,
>  			 profile_count entry_count)
>  {
>    for (unsigned int i = 0; i < n_region; i++)
> @@ -421,10 +461,13 @@ update_profile_after_ch (class loop *loop,
>  		      && e_copy->dest == region_copy[i + 1]));
>        region_copy[i]->count = entry_count;
>        profile_count exit_e_count = exit_e->count ();
> -      if (eliminated_edge == exit_e)
> +      bool was_static = false;
> +      if (static_exits->contains (exit_e))
>  	{
>  	  /* Update profile and the conditional.
>  	     CFG update is done by caller.  */
> +	  static_exits->remove (exit_e);
> +	  was_static = true;
>  	  e_copy->probability = profile_probability::always ();
>  	  exit_e_copy->probability = profile_probability::never ();
>  	  gcond *cond_stmt
> @@ -437,7 +480,6 @@ update_profile_after_ch (class loop *loop,
>  	  /* Header copying is a special case of jump threading, so use
>  	     common code to update loop body exit condition.  */
>  	  update_bb_profile_for_threading (region[i], entry_count, e);
> -	  eliminated_edge = NULL;
>  	}
>        else
>  	region[i]->count -= region_copy[i]->count;
> @@ -448,7 +490,7 @@ update_profile_after_ch (class loop *loop,
>  	     loop, so increase probability accordingly.
>  	     If the edge is eliminated_edge we already corrected
>  	     profile above.  */
> -	  if (entry_count.nonzero_p () && eliminated_edge != exit_e)
> +	  if (entry_count.nonzero_p () && !was_static)
>  	    set_edge_probability_and_rescale_others
>  		    (exit_e_copy, exit_e_count.probability_in
>  							(entry_count));
> @@ -465,7 +507,7 @@ update_profile_after_ch (class loop *loop,
>        entry_count = e_copy->count ();
>      }
>    /* Be sure that we seen all edges we are supposed to update.  */
> -  gcc_checking_assert (!eliminated_edge
> +  gcc_checking_assert (static_exits->is_empty ()
>  		       && invariant_exits->is_empty ());
>  }
>  
> @@ -564,10 +606,7 @@ protected:
>  unsigned int
>  ch_base::copy_headers (function *fun)
>  {
> -  basic_block header;
> -  edge exit, nonexit, entry;
>    basic_block *bbs, *copied_bbs;
> -  unsigned n_bbs;
>    unsigned bbs_size;
>    bool changed = false;
>  
> @@ -578,7 +617,12 @@ ch_base::copy_headers (function *fun)
>    copied_bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (fun));
>    bbs_size = n_basic_blocks_for_fn (fun);
>  
> -  struct candidate {class loop *loop; edge static_exit;};
> +  struct candidate
> +    {
> +      class loop *loop;
> +      unsigned int nheaders;
> +      hash_set <edge> *invariant_exits, *static_exits;
> +    };
>    auto_vec<struct candidate> candidates;
>    auto_vec<std::pair<edge, loop_p> > copied;
>    auto_vec<class loop *> loops_to_unloop;
> @@ -588,13 +632,14 @@ ch_base::copy_headers (function *fun)
>    gimple_ranger *ranger = new gimple_ranger;
>    for (auto loop : loops_list (cfun, 0))
>      {
> -      int initial_limit = param_max_loop_header_insns;
> +      int initial_limit = optimize_loop_for_speed_p (loop)
> +			  ? param_max_loop_header_insns : 0;
>        int remaining_limit = initial_limit;
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	fprintf (dump_file,
>  		 "Analyzing loop %i\n", loop->num);
>  
> -      header = loop->header;
> +      basic_block header = loop->header;
>        if (!get_max_loop_iterations_int (loop))
>  	{
>  	  if (dump_file && (dump_flags & TDF_DETAILS))
> @@ -613,24 +658,38 @@ ch_base::copy_headers (function *fun)
>  	  || !process_loop_p (loop))
>  	continue;
>  
> -      edge static_exit = static_loop_exit (loop, ranger);
> -
> -      /* Avoid loop header copying when optimizing for size unless we can
> -	 determine that the loop condition is static in the first
> -	 iteration.  */
> -      if (optimize_loop_for_size_p (loop)
> -	  && !loop->force_vectorize
> -	  && !static_exit)
> +      /* Iterate the header copying up to limit; this takes care of the cases
> +	 like while (a && b) {...}, where we want to have both of the conditions
> +	 copied.  TODO -- handle while (a || b) - like cases, by not requiring
> +	 the header to have just a single successor and copying up to
> +	 postdominator.  */
> +      int nheaders = 0;
> +      hash_set <edge> *invariant_exits = new hash_set <edge>;
> +      hash_set <edge> *static_exits = new hash_set <edge>;
> +      while (should_duplicate_loop_header_p (header, loop, ranger,
> +					     &remaining_limit,
> +					     invariant_exits,
> +					     static_exits))
>  	{
> +	  nheaders++;
>  	  if (dump_file && (dump_flags & TDF_DETAILS))
> -	    fprintf (dump_file,
> -		     "  Not duplicating bb %i: optimizing for size.\n",
> -		     header->index);
> -	  continue;
> +	    fprintf (dump_file, "    Will duplicate bb %i\n", header->index);
> +
> +	  /* Find a successor of header that is inside a loop; i.e. the new
> +	     header after the condition is copied.  */
> +	  if (flow_bb_inside_loop_p (loop, EDGE_SUCC (header, 0)->dest))
> +	    header = EDGE_SUCC (header, 0)->dest;
> +	  else
> +	    header = EDGE_SUCC (header, 1)->dest;
>  	}
>  
> -      if (should_duplicate_loop_header_p (header, loop, &remaining_limit, NULL))
> -	candidates.safe_push ({loop, static_exit});
> +      if (nheaders)
> +	candidates.safe_push ({loop, nheaders, invariant_exits, static_exits});
> +      else
> +	{
> +	  delete invariant_exits;
> +	  delete static_exits;
> +	}
>      }
>    /* Do not use ranger after we change the IL and not have updated SSA.  */
>    delete ranger;
> @@ -638,37 +697,25 @@ ch_base::copy_headers (function *fun)
>    for (auto candidate : candidates)
>      {
>        class loop *loop = candidate.loop;
> -      int initial_limit = param_max_loop_header_insns;
> -      int remaining_limit = initial_limit;
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	fprintf (dump_file,
>  		 "Copying headers of loop %i\n", loop->num);
>  
> -      header = loop->header;
> -
> -      /* Iterate the header copying up to limit; this takes care of the cases
> -	 like while (a && b) {...}, where we want to have both of the conditions
> -	 copied.  TODO -- handle while (a || b) - like cases, by not requiring
> -	 the header to have just a single successor and copying up to
> -	 postdominator.  */
> -
> -      nonexit = NULL;
> -      n_bbs = 0;
> +      basic_block header = loop->header;
> +      edge nonexit = NULL;
> +      edge exit = NULL;
> +      unsigned int n_bbs = 0;
>        int nexits = 0;
>        profile_count exit_count = profile_count::zero ();
>        profile_count entry_count = profile_count::zero ();
>        edge e;
>        edge_iterator ei;
> -      hash_set <edge> invariant_exits;
> +
>        FOR_EACH_EDGE (e, ei, loop->header->preds)
>  	if (e->src != loop->latch)
>  	  entry_count += e->count ();
> -      while (should_duplicate_loop_header_p (header, loop, &remaining_limit,
> -					     &invariant_exits))
> +      while (n_bbs < candidate.nheaders)
>  	{
> -	  if (dump_file && (dump_flags & TDF_DETAILS))
> -	    fprintf (dump_file, "    Will duplicate bb %i\n", header->index);
> -
>  	  /* Find a successor of header that is inside a loop; i.e. the new
>  	     header after the condition is copied.  */
>  	  if (flow_bb_inside_loop_p (loop, EDGE_SUCC (header, 0)->dest))
> @@ -688,21 +735,10 @@ ch_base::copy_headers (function *fun)
>  	  header = nonexit->dest;
>  	}
>  
> -      if (!nonexit)
> -	continue;
> -
>        if (dump_file && (dump_flags & TDF_DETAILS))
> -	{
> -	  fprintf (dump_file,
> -		   "Duplicating header of the loop %d up to edge %d->%d,"
> -		   " %i insns.\n",
> -		   loop->num, exit->src->index, exit->dest->index,
> -		   initial_limit - remaining_limit);
> -	  if (candidate.static_exit)
> -	    fprintf (dump_file,
> -		     "  Will eliminate peeled conditional in bb %d.\n",
> -		     candidate.static_exit->src->index);
> -	}
> +	fprintf (dump_file,
> +		 "Duplicating header of the loop %d up to edge %d->%d\n",
> +		 loop->num, exit->src->index, exit->dest->index);
>  
>        /* Ensure that the header will have just the latch as a predecessor
>  	 inside the loop.  */
> @@ -712,20 +748,25 @@ ch_base::copy_headers (function *fun)
>  	  exit = single_pred_edge (header);
>  	}
>  
> -      entry = loop_preheader_edge (loop);
> +      edge entry = loop_preheader_edge (loop);
>  
>        propagate_threaded_block_debug_into (exit->dest, entry->dest);
>        if (!gimple_duplicate_seme_region (entry, exit, bbs, n_bbs, copied_bbs,
>  					 true))
>  	{
> +	  delete candidate.static_exits;
> +	  delete candidate.invariant_exits;
>  	  if (dump_file && (dump_flags & TDF_DETAILS))
>  	    fprintf (dump_file, "Duplication failed.\n");
>  	  continue;
>  	}
>        if (loop->header->count.initialized_p ())
>  	update_profile_after_ch (loop, bbs, copied_bbs, n_bbs,
> -				 candidate.static_exit, &invariant_exits,
> +				 candidate.invariant_exits,
> +				 candidate.static_exits,
>  				 entry_count);
> +      delete candidate.static_exits;
> +      delete candidate.invariant_exits;
>        copied.safe_push (std::make_pair (entry, loop));
>  
>        /* If the loop has the form "for (i = j; i < j + 10; i++)" then
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Loop-ch improvements, part 3
  2023-07-17  9:17 ` Richard Biener
@ 2023-08-22  5:24   ` Hongtao Liu
  2023-08-22  7:53     ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Hongtao Liu @ 2023-08-22  5:24 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, gcc-patches

On Mon, Jul 17, 2023 at 5:18 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, 14 Jul 2023, Jan Hubicka wrote:
>
> > Hi,
> > loop-ch currently does analysis using ranger for all loops to identify
> > candidates and then follows by phase where headers are duplicated (which
> > breaks SSA and ranger).  The second stage does more analysis (to see how
> > many BBs we want to duplicate) but can't use ranger and thus misses
> > information about static conditionals.
> >
> > This patch pushes all analysis into the first stage. We record how many
> > BBs to duplicate and the second stage just duplicats as it is told so.
> > This makes it possible to also extend range query done also to basic
> > blocks that are not headers.  This is easy to do, since we already do
> > path specific query so we only need to extend the path by headers we
> > decided to dulicate earlier.
> >
> > This makes it possible to track situations where exit that is always
> > false in the first iteration for tests not in the original loop header.
> > Doing so lets us to update profile better and do better heuristics.  In
> > particular I changed logic as follows
> >   1) should_duplicate_loop_header_p counts size of duplicated region.  When we
> >      know that a given conditional will be constant true or constant false either
> >      in the duplicated region, by range query, or in the loop body after
> >      duplication (since it is loop invariant), we do not account it to code size
> >      costs
> >   2) don't need account loop invariant compuations that will be duplicated
> >      as they will become fully invariant
> >      (maybe we want to have some cap for register pressure eventually?)
> >   3) optimize_size logic is now different.  Originally we started duplicating
> >      iff the first conditional was known to be true by ranger query, but then
> >      we used same limits as for -O2.
> >
> >      I now simply lower limits to 0. This means that every conditional
> >      in duplicated sequence must be either loop invariant or constant when
> >      duplicated and we only duplicate statements computing loop invariants
> >      and those we account to 0 size anyway,
> >
> > This makes code IMO more streamlined (and hopefully will let us to merge
> > ibts with loop peeling logic), but makes little difference in practice.
> > The problem is that in loop:
> >
> > void test2();
> > void test(int n)
> > {
> >   for (int i = 0; n && i < 10; i++)
> >         test2();
> > }
> >
> > We produce:
> >   <bb 4> [local count: 1073741824 freq: 9.090909]:
> >   # i_4 = PHI <0(2), i_9(3)>
> >   _1 = n_7(D) != 0;
> >   _2 = i_4 <= 9;
> >   _3 = _1 & _2;
> >   if (_3 != 0)
> >     goto <bb 3>; [89.00%]
> >   else
> >     goto <bb 5>; [11.00%]
> >
> > and do not understand that the final conditional is a combination of a conditional
> > that is always true in first iteration and a conditional that is loop invariant.
> >
> > This is also the case of
> > void test2();
> > void test(int n)
> > {
> >   for (int i = 0; n; i++)
> >     {
> >       if (i > 10)
> >         break;
> >       test2();
> >     }
> > }
> > Which we turn to the earlier case in ifcombine.
> >
> > With disabled ifcombine things however works as exepcted.  This is something
> > I plan to handle incrementally.  However extending loop-ch and peeling passes
> > to understand such combined conditionals is still not good enough: at the time ifcombine
> > merged the two conditionals we lost profile information on how often n is 0,
> > so we can't recover correct profile or know what is expected number of iterations
> > after the transofrm.
This regressed
FAIL: gcc.target/i386/pr93089-2.c scan-assembler vmulps[^\n\r]*zmm
FAIL: gcc.target/i386/pr93089-3.c scan-assembler vmulps[^\n\r]*zmm
The testcase is quite simple, not sure why it's regressed.

 1/* PR target/93089 */
 2/* { dg-do compile } */
 3/* { dg-options "-O2 -fopenmp-simd -mtune=znver1" } */
 4/* { dg-final { scan-assembler "vmulps\[^\n\r]*zmm" } } */
 5/* { dg-final { scan-assembler "vmulps\[^\n\r]*ymm" } } */
 6
 7#pragma omp declare simd notinbranch
 8float
 9foo (float x, float y)
10{
11  return x * y;
12}

> >
> > Bootstrapped/regtested x86_64-linux, OK?
>
> OK.
>
> Thanks,
> Richard.
>
> > Honza
> >
> >
> > gcc/ChangeLog:
> >
> >       * tree-ssa-loop-ch.cc (edge_range_query): Take loop argument; be ready
> >       for queries not in headers.
> >       (static_loop_exit): Add basic blck parameter; update use of
> >       edge_range_query
> >       (should_duplicate_loop_header_p): Add ranger and static_exits
> >       parameter.  Do not account statements that will be optimized
> >       out after duplicaiton in overall size. Add ranger query to
> >       find static exits.
> >       (update_profile_after_ch):  Take static_exits has set instead of
> >       single eliminated_edge.
> >       (ch_base::copy_headers): Do all analysis in the first pass;
> >       remember invariant_exits and static_exits.
> >
> > diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
> > index 24e7fbc805a..e0139cb432c 100644
> > --- a/gcc/tree-ssa-loop-ch.cc
> > +++ b/gcc/tree-ssa-loop-ch.cc
> > @@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
> >     the range of the solved conditional in R.  */
> >
> >  static void
> > -edge_range_query (irange &r, edge e, gcond *cond, gimple_ranger &ranger)
> > +edge_range_query (irange &r, class loop *loop, gcond *cond, gimple_ranger &ranger)
> >  {
> > -  auto_vec<basic_block> path (2);
> > -  path.safe_push (e->dest);
> > -  path.safe_push (e->src);
> > +  auto_vec<basic_block, 8> path;
> > +  for (basic_block bb = gimple_bb (cond); bb != loop->header; bb = single_pred_edge (bb)->src)
> > +    path.safe_push (bb);
> > +  path.safe_push (loop->header);
> > +  path.safe_push (loop_preheader_edge (loop)->src);
> >    path_range_query query (ranger, path);
> >    if (!query.range_of_stmt (r, cond))
> >      r.set_varying (boolean_type_node);
> > @@ -63,17 +65,16 @@ edge_range_query (irange &r, edge e, gcond *cond, gimple_ranger &ranger)
> >     and NULL otherwise.  */
> >
> >  static edge
> > -static_loop_exit (class loop *l, gimple_ranger *ranger)
> > +static_loop_exit (class loop *l, basic_block bb, gimple_ranger *ranger)
> >  {
> > -  edge e = loop_preheader_edge (l);
> > -  gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (e->dest));
> > +  gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (bb));
> >    edge ret_e;
> >
> >    if (!last)
> >      return NULL;
> >
> >    edge true_e, false_e;
> > -  extract_true_false_edges_from_block (e->dest, &true_e, &false_e);
> > +  extract_true_false_edges_from_block (bb, &true_e, &false_e);
> >
> >    /* If neither edge is the exit edge, this is not a case we'd like to
> >       special-case.  */
> > @@ -93,7 +94,7 @@ static_loop_exit (class loop *l, gimple_ranger *ranger)
> >    }
> >
> >    int_range<2> r;
> > -  edge_range_query (r, e, last, *ranger);
> > +  edge_range_query (r, l, last, *ranger);
> >    return r == desired_static_range ? ret_e : NULL;
> >  }
> >
> > @@ -131,8 +132,10 @@ loop_iv_derived_p (class loop *loop,
> >
> >  static bool
> >  should_duplicate_loop_header_p (basic_block header, class loop *loop,
> > +                             gimple_ranger *ranger,
> >                               int *limit,
> > -                             hash_set <edge> *invariant_exits)
> > +                             hash_set <edge> *invariant_exits,
> > +                             hash_set <edge> *static_exits)
> >  {
> >    gimple_stmt_iterator bsi;
> >
> > @@ -209,6 +212,9 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
> >        if (is_gimple_debug (last))
> >       continue;
> >
> > +      if (gimple_code (last) == GIMPLE_COND)
> > +     break;
> > +
> >        if (gimple_code (last) == GIMPLE_CALL
> >         && (!gimple_inexpensive_call_p (as_a <gcall *> (last))
> >             /* IFN_LOOP_DIST_ALIAS means that inner loop is distributed
> > @@ -222,16 +228,6 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
> >         return false;
> >       }
> >
> > -      *limit -= estimate_num_insns (last, &eni_size_weights);
> > -      if (*limit < 0)
> > -     {
> > -       if (dump_file && (dump_flags & TDF_DETAILS))
> > -         fprintf (dump_file,
> > -                  "  Not duplicating bb %i contains too many insns.\n",
> > -                  header->index);
> > -       return false;
> > -     }
> > -
> >        /* Classify the stmt based on whether its computation is based
> >           on a IV or whether it is invariant in the loop.  */
> >        gimple_set_uid (last, 0);
> > @@ -257,9 +253,36 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
> >                 inv = false;
> >             }
> >         gimple_set_uid (last, (iv ? 1 : 0) | (inv ? 2 : 0));
> > +       /* Loop invariants will be optimized out in loop body after
> > +          duplication; do not account invariant computation in code
> > +          size costs.  */
> > +       if (inv)
> > +         continue;
> > +     }
> > +
> > +      *limit -= estimate_num_insns (last, &eni_size_weights);
> > +      if (*limit < 0)
> > +     {
> > +       if (dump_file && (dump_flags & TDF_DETAILS))
> > +         fprintf (dump_file,
> > +                  "  Not duplicating bb %i contains too many insns.\n",
> > +                  header->index);
> > +       return false;
> >       }
> >      }
> >
> > +  edge static_exit = static_loop_exit (loop, header, ranger);
> > +
> > +  if (static_exit && static_exits)
> > +    {
> > +      static_exits->add (static_exit);
> > +      if (dump_file && (dump_flags & TDF_DETAILS))
> > +     fprintf (dump_file,
> > +              "    Will eliminate peeled conditional in bb %d.\n",
> > +              static_exit->src->index);
> > +      /* Still look for invariant exits; exit may be both.  */
> > +    }
> > +
> >    /* If the condition tests a non-IV loop variant we do not want to rotate
> >       the loop further.  Unless this is the original loop header.  */
> >    tree lhs = gimple_cond_lhs (last);
> > @@ -284,12 +307,28 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
> >       }
> >        return true;
> >      }
> > +
> > +  if (static_exit)
> > +    return true;
> > +
> > +  /* We was not able to prove that conditional will be eliminated.  */
> > +  *limit -= estimate_num_insns (last, &eni_size_weights);
> > +  if (*limit < 0)
> > +    {
> > +      if (dump_file && (dump_flags & TDF_DETAILS))
> > +     fprintf (dump_file,
> > +              "  Not duplicating bb %i contains too many insns.\n",
> > +              header->index);
> > +      return false;
> > +    }
> > +
> >    /* TODO: This is heuristics that claims that IV based ocnditionals will
> >       likely be optimized out in duplicated header.  We could use ranger
> >       query instead to tell this more precisely.  */
> > -  if (header != loop->header
> > -      && ((!lhs_invariant && !loop_iv_derived_p (loop, lhs))
> > -       || (!rhs_invariant && !loop_iv_derived_p (loop, rhs))))
> > +  if ((lhs_invariant || loop_iv_derived_p (loop, lhs))
> > +      && (rhs_invariant || loop_iv_derived_p (loop, rhs)))
> > +    return true;
> > +  if (header != loop->header)
> >      {
> >        if (dump_file && (dump_flags & TDF_DETAILS))
> >       fprintf (dump_file,
> > @@ -386,8 +425,9 @@ do_while_loop_p (class loop *loop)
> >  static void
> >  update_profile_after_ch (class loop *loop,
> >                        basic_block *region, basic_block *region_copy,
> > -                      unsigned n_region, edge eliminated_edge,
> > +                      unsigned n_region,
> >                        hash_set <edge> *invariant_exits,
> > +                      hash_set <edge> *static_exits,
> >                        profile_count entry_count)
> >  {
> >    for (unsigned int i = 0; i < n_region; i++)
> > @@ -421,10 +461,13 @@ update_profile_after_ch (class loop *loop,
> >                     && e_copy->dest == region_copy[i + 1]));
> >        region_copy[i]->count = entry_count;
> >        profile_count exit_e_count = exit_e->count ();
> > -      if (eliminated_edge == exit_e)
> > +      bool was_static = false;
> > +      if (static_exits->contains (exit_e))
> >       {
> >         /* Update profile and the conditional.
> >            CFG update is done by caller.  */
> > +       static_exits->remove (exit_e);
> > +       was_static = true;
> >         e_copy->probability = profile_probability::always ();
> >         exit_e_copy->probability = profile_probability::never ();
> >         gcond *cond_stmt
> > @@ -437,7 +480,6 @@ update_profile_after_ch (class loop *loop,
> >         /* Header copying is a special case of jump threading, so use
> >            common code to update loop body exit condition.  */
> >         update_bb_profile_for_threading (region[i], entry_count, e);
> > -       eliminated_edge = NULL;
> >       }
> >        else
> >       region[i]->count -= region_copy[i]->count;
> > @@ -448,7 +490,7 @@ update_profile_after_ch (class loop *loop,
> >            loop, so increase probability accordingly.
> >            If the edge is eliminated_edge we already corrected
> >            profile above.  */
> > -       if (entry_count.nonzero_p () && eliminated_edge != exit_e)
> > +       if (entry_count.nonzero_p () && !was_static)
> >           set_edge_probability_and_rescale_others
> >                   (exit_e_copy, exit_e_count.probability_in
> >                                                       (entry_count));
> > @@ -465,7 +507,7 @@ update_profile_after_ch (class loop *loop,
> >        entry_count = e_copy->count ();
> >      }
> >    /* Be sure that we seen all edges we are supposed to update.  */
> > -  gcc_checking_assert (!eliminated_edge
> > +  gcc_checking_assert (static_exits->is_empty ()
> >                      && invariant_exits->is_empty ());
> >  }
> >
> > @@ -564,10 +606,7 @@ protected:
> >  unsigned int
> >  ch_base::copy_headers (function *fun)
> >  {
> > -  basic_block header;
> > -  edge exit, nonexit, entry;
> >    basic_block *bbs, *copied_bbs;
> > -  unsigned n_bbs;
> >    unsigned bbs_size;
> >    bool changed = false;
> >
> > @@ -578,7 +617,12 @@ ch_base::copy_headers (function *fun)
> >    copied_bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (fun));
> >    bbs_size = n_basic_blocks_for_fn (fun);
> >
> > -  struct candidate {class loop *loop; edge static_exit;};
> > +  struct candidate
> > +    {
> > +      class loop *loop;
> > +      unsigned int nheaders;
> > +      hash_set <edge> *invariant_exits, *static_exits;
> > +    };
> >    auto_vec<struct candidate> candidates;
> >    auto_vec<std::pair<edge, loop_p> > copied;
> >    auto_vec<class loop *> loops_to_unloop;
> > @@ -588,13 +632,14 @@ ch_base::copy_headers (function *fun)
> >    gimple_ranger *ranger = new gimple_ranger;
> >    for (auto loop : loops_list (cfun, 0))
> >      {
> > -      int initial_limit = param_max_loop_header_insns;
> > +      int initial_limit = optimize_loop_for_speed_p (loop)
> > +                       ? param_max_loop_header_insns : 0;
> >        int remaining_limit = initial_limit;
> >        if (dump_file && (dump_flags & TDF_DETAILS))
> >       fprintf (dump_file,
> >                "Analyzing loop %i\n", loop->num);
> >
> > -      header = loop->header;
> > +      basic_block header = loop->header;
> >        if (!get_max_loop_iterations_int (loop))
> >       {
> >         if (dump_file && (dump_flags & TDF_DETAILS))
> > @@ -613,24 +658,38 @@ ch_base::copy_headers (function *fun)
> >         || !process_loop_p (loop))
> >       continue;
> >
> > -      edge static_exit = static_loop_exit (loop, ranger);
> > -
> > -      /* Avoid loop header copying when optimizing for size unless we can
> > -      determine that the loop condition is static in the first
> > -      iteration.  */
> > -      if (optimize_loop_for_size_p (loop)
> > -       && !loop->force_vectorize
> > -       && !static_exit)
> > +      /* Iterate the header copying up to limit; this takes care of the cases
> > +      like while (a && b) {...}, where we want to have both of the conditions
> > +      copied.  TODO -- handle while (a || b) - like cases, by not requiring
> > +      the header to have just a single successor and copying up to
> > +      postdominator.  */
> > +      int nheaders = 0;
> > +      hash_set <edge> *invariant_exits = new hash_set <edge>;
> > +      hash_set <edge> *static_exits = new hash_set <edge>;
> > +      while (should_duplicate_loop_header_p (header, loop, ranger,
> > +                                          &remaining_limit,
> > +                                          invariant_exits,
> > +                                          static_exits))
> >       {
> > +       nheaders++;
> >         if (dump_file && (dump_flags & TDF_DETAILS))
> > -         fprintf (dump_file,
> > -                  "  Not duplicating bb %i: optimizing for size.\n",
> > -                  header->index);
> > -       continue;
> > +         fprintf (dump_file, "    Will duplicate bb %i\n", header->index);
> > +
> > +       /* Find a successor of header that is inside a loop; i.e. the new
> > +          header after the condition is copied.  */
> > +       if (flow_bb_inside_loop_p (loop, EDGE_SUCC (header, 0)->dest))
> > +         header = EDGE_SUCC (header, 0)->dest;
> > +       else
> > +         header = EDGE_SUCC (header, 1)->dest;
> >       }
> >
> > -      if (should_duplicate_loop_header_p (header, loop, &remaining_limit, NULL))
> > -     candidates.safe_push ({loop, static_exit});
> > +      if (nheaders)
> > +     candidates.safe_push ({loop, nheaders, invariant_exits, static_exits});
> > +      else
> > +     {
> > +       delete invariant_exits;
> > +       delete static_exits;
> > +     }
> >      }
> >    /* Do not use ranger after we change the IL and not have updated SSA.  */
> >    delete ranger;
> > @@ -638,37 +697,25 @@ ch_base::copy_headers (function *fun)
> >    for (auto candidate : candidates)
> >      {
> >        class loop *loop = candidate.loop;
> > -      int initial_limit = param_max_loop_header_insns;
> > -      int remaining_limit = initial_limit;
> >        if (dump_file && (dump_flags & TDF_DETAILS))
> >       fprintf (dump_file,
> >                "Copying headers of loop %i\n", loop->num);
> >
> > -      header = loop->header;
> > -
> > -      /* Iterate the header copying up to limit; this takes care of the cases
> > -      like while (a && b) {...}, where we want to have both of the conditions
> > -      copied.  TODO -- handle while (a || b) - like cases, by not requiring
> > -      the header to have just a single successor and copying up to
> > -      postdominator.  */
> > -
> > -      nonexit = NULL;
> > -      n_bbs = 0;
> > +      basic_block header = loop->header;
> > +      edge nonexit = NULL;
> > +      edge exit = NULL;
> > +      unsigned int n_bbs = 0;
> >        int nexits = 0;
> >        profile_count exit_count = profile_count::zero ();
> >        profile_count entry_count = profile_count::zero ();
> >        edge e;
> >        edge_iterator ei;
> > -      hash_set <edge> invariant_exits;
> > +
> >        FOR_EACH_EDGE (e, ei, loop->header->preds)
> >       if (e->src != loop->latch)
> >         entry_count += e->count ();
> > -      while (should_duplicate_loop_header_p (header, loop, &remaining_limit,
> > -                                          &invariant_exits))
> > +      while (n_bbs < candidate.nheaders)
> >       {
> > -       if (dump_file && (dump_flags & TDF_DETAILS))
> > -         fprintf (dump_file, "    Will duplicate bb %i\n", header->index);
> > -
> >         /* Find a successor of header that is inside a loop; i.e. the new
> >            header after the condition is copied.  */
> >         if (flow_bb_inside_loop_p (loop, EDGE_SUCC (header, 0)->dest))
> > @@ -688,21 +735,10 @@ ch_base::copy_headers (function *fun)
> >         header = nonexit->dest;
> >       }
> >
> > -      if (!nonexit)
> > -     continue;
> > -
> >        if (dump_file && (dump_flags & TDF_DETAILS))
> > -     {
> > -       fprintf (dump_file,
> > -                "Duplicating header of the loop %d up to edge %d->%d,"
> > -                " %i insns.\n",
> > -                loop->num, exit->src->index, exit->dest->index,
> > -                initial_limit - remaining_limit);
> > -       if (candidate.static_exit)
> > -         fprintf (dump_file,
> > -                  "  Will eliminate peeled conditional in bb %d.\n",
> > -                  candidate.static_exit->src->index);
> > -     }
> > +     fprintf (dump_file,
> > +              "Duplicating header of the loop %d up to edge %d->%d\n",
> > +              loop->num, exit->src->index, exit->dest->index);
> >
> >        /* Ensure that the header will have just the latch as a predecessor
> >        inside the loop.  */
> > @@ -712,20 +748,25 @@ ch_base::copy_headers (function *fun)
> >         exit = single_pred_edge (header);
> >       }
> >
> > -      entry = loop_preheader_edge (loop);
> > +      edge entry = loop_preheader_edge (loop);
> >
> >        propagate_threaded_block_debug_into (exit->dest, entry->dest);
> >        if (!gimple_duplicate_seme_region (entry, exit, bbs, n_bbs, copied_bbs,
> >                                        true))
> >       {
> > +       delete candidate.static_exits;
> > +       delete candidate.invariant_exits;
> >         if (dump_file && (dump_flags & TDF_DETAILS))
> >           fprintf (dump_file, "Duplication failed.\n");
> >         continue;
> >       }
> >        if (loop->header->count.initialized_p ())
> >       update_profile_after_ch (loop, bbs, copied_bbs, n_bbs,
> > -                              candidate.static_exit, &invariant_exits,
> > +                              candidate.invariant_exits,
> > +                              candidate.static_exits,
> >                                entry_count);
> > +      delete candidate.static_exits;
> > +      delete candidate.invariant_exits;
> >        copied.safe_push (std::make_pair (entry, loop));
> >
> >        /* If the loop has the form "for (i = j; i < j + 10; i++)" then
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> HRB 36809 (AG Nuernberg)



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Loop-ch improvements, part 3
  2023-08-22  5:24   ` Hongtao Liu
@ 2023-08-22  7:53     ` Richard Biener
  2023-08-22 12:03       ` Jan Hubicka
  2023-08-23  8:34       ` Jan Hubicka
  0 siblings, 2 replies; 9+ messages in thread
From: Richard Biener @ 2023-08-22  7:53 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: Jan Hubicka, gcc-patches

On Tue, 22 Aug 2023, Hongtao Liu wrote:

> On Mon, Jul 17, 2023 at 5:18?PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Fri, 14 Jul 2023, Jan Hubicka wrote:
> >
> > > Hi,
> > > loop-ch currently does analysis using ranger for all loops to identify
> > > candidates and then follows by phase where headers are duplicated (which
> > > breaks SSA and ranger).  The second stage does more analysis (to see how
> > > many BBs we want to duplicate) but can't use ranger and thus misses
> > > information about static conditionals.
> > >
> > > This patch pushes all analysis into the first stage. We record how many
> > > BBs to duplicate and the second stage just duplicats as it is told so.
> > > This makes it possible to also extend range query done also to basic
> > > blocks that are not headers.  This is easy to do, since we already do
> > > path specific query so we only need to extend the path by headers we
> > > decided to dulicate earlier.
> > >
> > > This makes it possible to track situations where exit that is always
> > > false in the first iteration for tests not in the original loop header.
> > > Doing so lets us to update profile better and do better heuristics.  In
> > > particular I changed logic as follows
> > >   1) should_duplicate_loop_header_p counts size of duplicated region.  When we
> > >      know that a given conditional will be constant true or constant false either
> > >      in the duplicated region, by range query, or in the loop body after
> > >      duplication (since it is loop invariant), we do not account it to code size
> > >      costs
> > >   2) don't need account loop invariant compuations that will be duplicated
> > >      as they will become fully invariant
> > >      (maybe we want to have some cap for register pressure eventually?)
> > >   3) optimize_size logic is now different.  Originally we started duplicating
> > >      iff the first conditional was known to be true by ranger query, but then
> > >      we used same limits as for -O2.
> > >
> > >      I now simply lower limits to 0. This means that every conditional
> > >      in duplicated sequence must be either loop invariant or constant when
> > >      duplicated and we only duplicate statements computing loop invariants
> > >      and those we account to 0 size anyway,
> > >
> > > This makes code IMO more streamlined (and hopefully will let us to merge
> > > ibts with loop peeling logic), but makes little difference in practice.
> > > The problem is that in loop:
> > >
> > > void test2();
> > > void test(int n)
> > > {
> > >   for (int i = 0; n && i < 10; i++)
> > >         test2();
> > > }
> > >
> > > We produce:
> > >   <bb 4> [local count: 1073741824 freq: 9.090909]:
> > >   # i_4 = PHI <0(2), i_9(3)>
> > >   _1 = n_7(D) != 0;
> > >   _2 = i_4 <= 9;
> > >   _3 = _1 & _2;
> > >   if (_3 != 0)
> > >     goto <bb 3>; [89.00%]
> > >   else
> > >     goto <bb 5>; [11.00%]
> > >
> > > and do not understand that the final conditional is a combination of a conditional
> > > that is always true in first iteration and a conditional that is loop invariant.
> > >
> > > This is also the case of
> > > void test2();
> > > void test(int n)
> > > {
> > >   for (int i = 0; n; i++)
> > >     {
> > >       if (i > 10)
> > >         break;
> > >       test2();
> > >     }
> > > }
> > > Which we turn to the earlier case in ifcombine.
> > >
> > > With disabled ifcombine things however works as exepcted.  This is something
> > > I plan to handle incrementally.  However extending loop-ch and peeling passes
> > > to understand such combined conditionals is still not good enough: at the time ifcombine
> > > merged the two conditionals we lost profile information on how often n is 0,
> > > so we can't recover correct profile or know what is expected number of iterations
> > > after the transofrm.
> This regressed
> FAIL: gcc.target/i386/pr93089-2.c scan-assembler vmulps[^\n\r]*zmm
> FAIL: gcc.target/i386/pr93089-3.c scan-assembler vmulps[^\n\r]*zmm
> The testcase is quite simple, not sure why it's regressed.
> 
>  1/* PR target/93089 */
>  2/* { dg-do compile } */
>  3/* { dg-options "-O2 -fopenmp-simd -mtune=znver1" } */
>  4/* { dg-final { scan-assembler "vmulps\[^\n\r]*zmm" } } */
>  5/* { dg-final { scan-assembler "vmulps\[^\n\r]*ymm" } } */
>  6
>  7#pragma omp declare simd notinbranch
>  8float
>  9foo (float x, float y)
> 10{
> 11  return x * y;
> 12}

We seem to peel one iteration for no good reason.  The loop is
a do-while loop already.  The key is we see the first iteration
exit condition is known not taken and then:

 Registering value_relation (path_oracle) (iter.24_6 > iter.24_5) (root: 
bb2)
    Stmt is static (constant in the first iteration)
  Analyzing: if (iter.24_6 != 16)
 Registering killing_def (path_oracle) iter.24_6
 Registering value_relation (path_oracle) (iter.24_6 > iter.24_5) (root: 
bb2)
    Will eliminate peeled conditional in bb 3.
    Duplicating bb 3 is a win; it has zero cost
  Not duplicating bb 5: it is single succ.
Copying headers of loop 1
    Will duplicate bb 3
Duplicating header of the loop 1 up to edge 3->4
Loop 1 is do-while loop
Loop 1 is now do-while loop.
Exit count: 0 (estimated locally)
Entry count: 10631108 (estimated locally)
Peeled all exits: decreased number of iterations of loop 1 by 1.

and that's because of

  /* If the static exit fully optimize out, it is win to "duplicate"
     it.

     TODO: Even if duplication costs some size we may opt to do so in case
     exit probability is significant enough (do partial peeling).  */
  if (static_exit)
    return code_size_cost ? ch_possible_zero_cost : ch_win;

IMHO we're over aggressively apply early peeling here.  That holds
generally, not only for OMP simd loops (which we could identify).

Why are we doing this game for single-block do-while loops?

Richard.


> > >
> > > Bootstrapped/regtested x86_64-linux, OK?
> >
> > OK.
> >
> > Thanks,
> > Richard.
> >
> > > Honza
> > >
> > >
> > > gcc/ChangeLog:
> > >
> > >       * tree-ssa-loop-ch.cc (edge_range_query): Take loop argument; be ready
> > >       for queries not in headers.
> > >       (static_loop_exit): Add basic blck parameter; update use of
> > >       edge_range_query
> > >       (should_duplicate_loop_header_p): Add ranger and static_exits
> > >       parameter.  Do not account statements that will be optimized
> > >       out after duplicaiton in overall size. Add ranger query to
> > >       find static exits.
> > >       (update_profile_after_ch):  Take static_exits has set instead of
> > >       single eliminated_edge.
> > >       (ch_base::copy_headers): Do all analysis in the first pass;
> > >       remember invariant_exits and static_exits.
> > >
> > > diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
> > > index 24e7fbc805a..e0139cb432c 100644
> > > --- a/gcc/tree-ssa-loop-ch.cc
> > > +++ b/gcc/tree-ssa-loop-ch.cc
> > > @@ -49,11 +49,13 @@ along with GCC; see the file COPYING3.  If not see
> > >     the range of the solved conditional in R.  */
> > >
> > >  static void
> > > -edge_range_query (irange &r, edge e, gcond *cond, gimple_ranger &ranger)
> > > +edge_range_query (irange &r, class loop *loop, gcond *cond, gimple_ranger &ranger)
> > >  {
> > > -  auto_vec<basic_block> path (2);
> > > -  path.safe_push (e->dest);
> > > -  path.safe_push (e->src);
> > > +  auto_vec<basic_block, 8> path;
> > > +  for (basic_block bb = gimple_bb (cond); bb != loop->header; bb = single_pred_edge (bb)->src)
> > > +    path.safe_push (bb);
> > > +  path.safe_push (loop->header);
> > > +  path.safe_push (loop_preheader_edge (loop)->src);
> > >    path_range_query query (ranger, path);
> > >    if (!query.range_of_stmt (r, cond))
> > >      r.set_varying (boolean_type_node);
> > > @@ -63,17 +65,16 @@ edge_range_query (irange &r, edge e, gcond *cond, gimple_ranger &ranger)
> > >     and NULL otherwise.  */
> > >
> > >  static edge
> > > -static_loop_exit (class loop *l, gimple_ranger *ranger)
> > > +static_loop_exit (class loop *l, basic_block bb, gimple_ranger *ranger)
> > >  {
> > > -  edge e = loop_preheader_edge (l);
> > > -  gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (e->dest));
> > > +  gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (bb));
> > >    edge ret_e;
> > >
> > >    if (!last)
> > >      return NULL;
> > >
> > >    edge true_e, false_e;
> > > -  extract_true_false_edges_from_block (e->dest, &true_e, &false_e);
> > > +  extract_true_false_edges_from_block (bb, &true_e, &false_e);
> > >
> > >    /* If neither edge is the exit edge, this is not a case we'd like to
> > >       special-case.  */
> > > @@ -93,7 +94,7 @@ static_loop_exit (class loop *l, gimple_ranger *ranger)
> > >    }
> > >
> > >    int_range<2> r;
> > > -  edge_range_query (r, e, last, *ranger);
> > > +  edge_range_query (r, l, last, *ranger);
> > >    return r == desired_static_range ? ret_e : NULL;
> > >  }
> > >
> > > @@ -131,8 +132,10 @@ loop_iv_derived_p (class loop *loop,
> > >
> > >  static bool
> > >  should_duplicate_loop_header_p (basic_block header, class loop *loop,
> > > +                             gimple_ranger *ranger,
> > >                               int *limit,
> > > -                             hash_set <edge> *invariant_exits)
> > > +                             hash_set <edge> *invariant_exits,
> > > +                             hash_set <edge> *static_exits)
> > >  {
> > >    gimple_stmt_iterator bsi;
> > >
> > > @@ -209,6 +212,9 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
> > >        if (is_gimple_debug (last))
> > >       continue;
> > >
> > > +      if (gimple_code (last) == GIMPLE_COND)
> > > +     break;
> > > +
> > >        if (gimple_code (last) == GIMPLE_CALL
> > >         && (!gimple_inexpensive_call_p (as_a <gcall *> (last))
> > >             /* IFN_LOOP_DIST_ALIAS means that inner loop is distributed
> > > @@ -222,16 +228,6 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
> > >         return false;
> > >       }
> > >
> > > -      *limit -= estimate_num_insns (last, &eni_size_weights);
> > > -      if (*limit < 0)
> > > -     {
> > > -       if (dump_file && (dump_flags & TDF_DETAILS))
> > > -         fprintf (dump_file,
> > > -                  "  Not duplicating bb %i contains too many insns.\n",
> > > -                  header->index);
> > > -       return false;
> > > -     }
> > > -
> > >        /* Classify the stmt based on whether its computation is based
> > >           on a IV or whether it is invariant in the loop.  */
> > >        gimple_set_uid (last, 0);
> > > @@ -257,9 +253,36 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
> > >                 inv = false;
> > >             }
> > >         gimple_set_uid (last, (iv ? 1 : 0) | (inv ? 2 : 0));
> > > +       /* Loop invariants will be optimized out in loop body after
> > > +          duplication; do not account invariant computation in code
> > > +          size costs.  */
> > > +       if (inv)
> > > +         continue;
> > > +     }
> > > +
> > > +      *limit -= estimate_num_insns (last, &eni_size_weights);
> > > +      if (*limit < 0)
> > > +     {
> > > +       if (dump_file && (dump_flags & TDF_DETAILS))
> > > +         fprintf (dump_file,
> > > +                  "  Not duplicating bb %i contains too many insns.\n",
> > > +                  header->index);
> > > +       return false;
> > >       }
> > >      }
> > >
> > > +  edge static_exit = static_loop_exit (loop, header, ranger);
> > > +
> > > +  if (static_exit && static_exits)
> > > +    {
> > > +      static_exits->add (static_exit);
> > > +      if (dump_file && (dump_flags & TDF_DETAILS))
> > > +     fprintf (dump_file,
> > > +              "    Will eliminate peeled conditional in bb %d.\n",
> > > +              static_exit->src->index);
> > > +      /* Still look for invariant exits; exit may be both.  */
> > > +    }
> > > +
> > >    /* If the condition tests a non-IV loop variant we do not want to rotate
> > >       the loop further.  Unless this is the original loop header.  */
> > >    tree lhs = gimple_cond_lhs (last);
> > > @@ -284,12 +307,28 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
> > >       }
> > >        return true;
> > >      }
> > > +
> > > +  if (static_exit)
> > > +    return true;
> > > +
> > > +  /* We was not able to prove that conditional will be eliminated.  */
> > > +  *limit -= estimate_num_insns (last, &eni_size_weights);
> > > +  if (*limit < 0)
> > > +    {
> > > +      if (dump_file && (dump_flags & TDF_DETAILS))
> > > +     fprintf (dump_file,
> > > +              "  Not duplicating bb %i contains too many insns.\n",
> > > +              header->index);
> > > +      return false;
> > > +    }
> > > +
> > >    /* TODO: This is heuristics that claims that IV based ocnditionals will
> > >       likely be optimized out in duplicated header.  We could use ranger
> > >       query instead to tell this more precisely.  */
> > > -  if (header != loop->header
> > > -      && ((!lhs_invariant && !loop_iv_derived_p (loop, lhs))
> > > -       || (!rhs_invariant && !loop_iv_derived_p (loop, rhs))))
> > > +  if ((lhs_invariant || loop_iv_derived_p (loop, lhs))
> > > +      && (rhs_invariant || loop_iv_derived_p (loop, rhs)))
> > > +    return true;
> > > +  if (header != loop->header)
> > >      {
> > >        if (dump_file && (dump_flags & TDF_DETAILS))
> > >       fprintf (dump_file,
> > > @@ -386,8 +425,9 @@ do_while_loop_p (class loop *loop)
> > >  static void
> > >  update_profile_after_ch (class loop *loop,
> > >                        basic_block *region, basic_block *region_copy,
> > > -                      unsigned n_region, edge eliminated_edge,
> > > +                      unsigned n_region,
> > >                        hash_set <edge> *invariant_exits,
> > > +                      hash_set <edge> *static_exits,
> > >                        profile_count entry_count)
> > >  {
> > >    for (unsigned int i = 0; i < n_region; i++)
> > > @@ -421,10 +461,13 @@ update_profile_after_ch (class loop *loop,
> > >                     && e_copy->dest == region_copy[i + 1]));
> > >        region_copy[i]->count = entry_count;
> > >        profile_count exit_e_count = exit_e->count ();
> > > -      if (eliminated_edge == exit_e)
> > > +      bool was_static = false;
> > > +      if (static_exits->contains (exit_e))
> > >       {
> > >         /* Update profile and the conditional.
> > >            CFG update is done by caller.  */
> > > +       static_exits->remove (exit_e);
> > > +       was_static = true;
> > >         e_copy->probability = profile_probability::always ();
> > >         exit_e_copy->probability = profile_probability::never ();
> > >         gcond *cond_stmt
> > > @@ -437,7 +480,6 @@ update_profile_after_ch (class loop *loop,
> > >         /* Header copying is a special case of jump threading, so use
> > >            common code to update loop body exit condition.  */
> > >         update_bb_profile_for_threading (region[i], entry_count, e);
> > > -       eliminated_edge = NULL;
> > >       }
> > >        else
> > >       region[i]->count -= region_copy[i]->count;
> > > @@ -448,7 +490,7 @@ update_profile_after_ch (class loop *loop,
> > >            loop, so increase probability accordingly.
> > >            If the edge is eliminated_edge we already corrected
> > >            profile above.  */
> > > -       if (entry_count.nonzero_p () && eliminated_edge != exit_e)
> > > +       if (entry_count.nonzero_p () && !was_static)
> > >           set_edge_probability_and_rescale_others
> > >                   (exit_e_copy, exit_e_count.probability_in
> > >                                                       (entry_count));
> > > @@ -465,7 +507,7 @@ update_profile_after_ch (class loop *loop,
> > >        entry_count = e_copy->count ();
> > >      }
> > >    /* Be sure that we seen all edges we are supposed to update.  */
> > > -  gcc_checking_assert (!eliminated_edge
> > > +  gcc_checking_assert (static_exits->is_empty ()
> > >                      && invariant_exits->is_empty ());
> > >  }
> > >
> > > @@ -564,10 +606,7 @@ protected:
> > >  unsigned int
> > >  ch_base::copy_headers (function *fun)
> > >  {
> > > -  basic_block header;
> > > -  edge exit, nonexit, entry;
> > >    basic_block *bbs, *copied_bbs;
> > > -  unsigned n_bbs;
> > >    unsigned bbs_size;
> > >    bool changed = false;
> > >
> > > @@ -578,7 +617,12 @@ ch_base::copy_headers (function *fun)
> > >    copied_bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (fun));
> > >    bbs_size = n_basic_blocks_for_fn (fun);
> > >
> > > -  struct candidate {class loop *loop; edge static_exit;};
> > > +  struct candidate
> > > +    {
> > > +      class loop *loop;
> > > +      unsigned int nheaders;
> > > +      hash_set <edge> *invariant_exits, *static_exits;
> > > +    };
> > >    auto_vec<struct candidate> candidates;
> > >    auto_vec<std::pair<edge, loop_p> > copied;
> > >    auto_vec<class loop *> loops_to_unloop;
> > > @@ -588,13 +632,14 @@ ch_base::copy_headers (function *fun)
> > >    gimple_ranger *ranger = new gimple_ranger;
> > >    for (auto loop : loops_list (cfun, 0))
> > >      {
> > > -      int initial_limit = param_max_loop_header_insns;
> > > +      int initial_limit = optimize_loop_for_speed_p (loop)
> > > +                       ? param_max_loop_header_insns : 0;
> > >        int remaining_limit = initial_limit;
> > >        if (dump_file && (dump_flags & TDF_DETAILS))
> > >       fprintf (dump_file,
> > >                "Analyzing loop %i\n", loop->num);
> > >
> > > -      header = loop->header;
> > > +      basic_block header = loop->header;
> > >        if (!get_max_loop_iterations_int (loop))
> > >       {
> > >         if (dump_file && (dump_flags & TDF_DETAILS))
> > > @@ -613,24 +658,38 @@ ch_base::copy_headers (function *fun)
> > >         || !process_loop_p (loop))
> > >       continue;
> > >
> > > -      edge static_exit = static_loop_exit (loop, ranger);
> > > -
> > > -      /* Avoid loop header copying when optimizing for size unless we can
> > > -      determine that the loop condition is static in the first
> > > -      iteration.  */
> > > -      if (optimize_loop_for_size_p (loop)
> > > -       && !loop->force_vectorize
> > > -       && !static_exit)
> > > +      /* Iterate the header copying up to limit; this takes care of the cases
> > > +      like while (a && b) {...}, where we want to have both of the conditions
> > > +      copied.  TODO -- handle while (a || b) - like cases, by not requiring
> > > +      the header to have just a single successor and copying up to
> > > +      postdominator.  */
> > > +      int nheaders = 0;
> > > +      hash_set <edge> *invariant_exits = new hash_set <edge>;
> > > +      hash_set <edge> *static_exits = new hash_set <edge>;
> > > +      while (should_duplicate_loop_header_p (header, loop, ranger,
> > > +                                          &remaining_limit,
> > > +                                          invariant_exits,
> > > +                                          static_exits))
> > >       {
> > > +       nheaders++;
> > >         if (dump_file && (dump_flags & TDF_DETAILS))
> > > -         fprintf (dump_file,
> > > -                  "  Not duplicating bb %i: optimizing for size.\n",
> > > -                  header->index);
> > > -       continue;
> > > +         fprintf (dump_file, "    Will duplicate bb %i\n", header->index);
> > > +
> > > +       /* Find a successor of header that is inside a loop; i.e. the new
> > > +          header after the condition is copied.  */
> > > +       if (flow_bb_inside_loop_p (loop, EDGE_SUCC (header, 0)->dest))
> > > +         header = EDGE_SUCC (header, 0)->dest;
> > > +       else
> > > +         header = EDGE_SUCC (header, 1)->dest;
> > >       }
> > >
> > > -      if (should_duplicate_loop_header_p (header, loop, &remaining_limit, NULL))
> > > -     candidates.safe_push ({loop, static_exit});
> > > +      if (nheaders)
> > > +     candidates.safe_push ({loop, nheaders, invariant_exits, static_exits});
> > > +      else
> > > +     {
> > > +       delete invariant_exits;
> > > +       delete static_exits;
> > > +     }
> > >      }
> > >    /* Do not use ranger after we change the IL and not have updated SSA.  */
> > >    delete ranger;
> > > @@ -638,37 +697,25 @@ ch_base::copy_headers (function *fun)
> > >    for (auto candidate : candidates)
> > >      {
> > >        class loop *loop = candidate.loop;
> > > -      int initial_limit = param_max_loop_header_insns;
> > > -      int remaining_limit = initial_limit;
> > >        if (dump_file && (dump_flags & TDF_DETAILS))
> > >       fprintf (dump_file,
> > >                "Copying headers of loop %i\n", loop->num);
> > >
> > > -      header = loop->header;
> > > -
> > > -      /* Iterate the header copying up to limit; this takes care of the cases
> > > -      like while (a && b) {...}, where we want to have both of the conditions
> > > -      copied.  TODO -- handle while (a || b) - like cases, by not requiring
> > > -      the header to have just a single successor and copying up to
> > > -      postdominator.  */
> > > -
> > > -      nonexit = NULL;
> > > -      n_bbs = 0;
> > > +      basic_block header = loop->header;
> > > +      edge nonexit = NULL;
> > > +      edge exit = NULL;
> > > +      unsigned int n_bbs = 0;
> > >        int nexits = 0;
> > >        profile_count exit_count = profile_count::zero ();
> > >        profile_count entry_count = profile_count::zero ();
> > >        edge e;
> > >        edge_iterator ei;
> > > -      hash_set <edge> invariant_exits;
> > > +
> > >        FOR_EACH_EDGE (e, ei, loop->header->preds)
> > >       if (e->src != loop->latch)
> > >         entry_count += e->count ();
> > > -      while (should_duplicate_loop_header_p (header, loop, &remaining_limit,
> > > -                                          &invariant_exits))
> > > +      while (n_bbs < candidate.nheaders)
> > >       {
> > > -       if (dump_file && (dump_flags & TDF_DETAILS))
> > > -         fprintf (dump_file, "    Will duplicate bb %i\n", header->index);
> > > -
> > >         /* Find a successor of header that is inside a loop; i.e. the new
> > >            header after the condition is copied.  */
> > >         if (flow_bb_inside_loop_p (loop, EDGE_SUCC (header, 0)->dest))
> > > @@ -688,21 +735,10 @@ ch_base::copy_headers (function *fun)
> > >         header = nonexit->dest;
> > >       }
> > >
> > > -      if (!nonexit)
> > > -     continue;
> > > -
> > >        if (dump_file && (dump_flags & TDF_DETAILS))
> > > -     {
> > > -       fprintf (dump_file,
> > > -                "Duplicating header of the loop %d up to edge %d->%d,"
> > > -                " %i insns.\n",
> > > -                loop->num, exit->src->index, exit->dest->index,
> > > -                initial_limit - remaining_limit);
> > > -       if (candidate.static_exit)
> > > -         fprintf (dump_file,
> > > -                  "  Will eliminate peeled conditional in bb %d.\n",
> > > -                  candidate.static_exit->src->index);
> > > -     }
> > > +     fprintf (dump_file,
> > > +              "Duplicating header of the loop %d up to edge %d->%d\n",
> > > +              loop->num, exit->src->index, exit->dest->index);
> > >
> > >        /* Ensure that the header will have just the latch as a predecessor
> > >        inside the loop.  */
> > > @@ -712,20 +748,25 @@ ch_base::copy_headers (function *fun)
> > >         exit = single_pred_edge (header);
> > >       }
> > >
> > > -      entry = loop_preheader_edge (loop);
> > > +      edge entry = loop_preheader_edge (loop);
> > >
> > >        propagate_threaded_block_debug_into (exit->dest, entry->dest);
> > >        if (!gimple_duplicate_seme_region (entry, exit, bbs, n_bbs, copied_bbs,
> > >                                        true))
> > >       {
> > > +       delete candidate.static_exits;
> > > +       delete candidate.invariant_exits;
> > >         if (dump_file && (dump_flags & TDF_DETAILS))
> > >           fprintf (dump_file, "Duplication failed.\n");
> > >         continue;
> > >       }
> > >        if (loop->header->count.initialized_p ())
> > >       update_profile_after_ch (loop, bbs, copied_bbs, n_bbs,
> > > -                              candidate.static_exit, &invariant_exits,
> > > +                              candidate.invariant_exits,
> > > +                              candidate.static_exits,
> > >                                entry_count);
> > > +      delete candidate.static_exits;
> > > +      delete candidate.invariant_exits;
> > >        copied.safe_push (std::make_pair (entry, loop));
> > >
> > >        /* If the loop has the form "for (i = j; i < j + 10; i++)" then
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> > HRB 36809 (AG Nuernberg)
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Loop-ch improvements, part 3
  2023-08-22  7:53     ` Richard Biener
@ 2023-08-22 12:03       ` Jan Hubicka
  2023-08-23  8:34       ` Jan Hubicka
  1 sibling, 0 replies; 9+ messages in thread
From: Jan Hubicka @ 2023-08-22 12:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Hongtao Liu, gcc-patches

> 
> We seem to peel one iteration for no good reason.  The loop is
> a do-while loop already.  The key is we see the first iteration
> exit condition is known not taken and then:
> 
>  Registering value_relation (path_oracle) (iter.24_6 > iter.24_5) (root: 
> bb2)
>     Stmt is static (constant in the first iteration)
>   Analyzing: if (iter.24_6 != 16)
>  Registering killing_def (path_oracle) iter.24_6
>  Registering value_relation (path_oracle) (iter.24_6 > iter.24_5) (root: 
> bb2)
>     Will eliminate peeled conditional in bb 3.
>     Duplicating bb 3 is a win; it has zero cost
>   Not duplicating bb 5: it is single succ.
> Copying headers of loop 1
>     Will duplicate bb 3
> Duplicating header of the loop 1 up to edge 3->4
> Loop 1 is do-while loop
> Loop 1 is now do-while loop.
> Exit count: 0 (estimated locally)
> Entry count: 10631108 (estimated locally)
> Peeled all exits: decreased number of iterations of loop 1 by 1.
> 
> and that's because of
> 
>   /* If the static exit fully optimize out, it is win to "duplicate"
>      it.
> 
>      TODO: Even if duplication costs some size we may opt to do so in case
>      exit probability is significant enough (do partial peeling).  */
>   if (static_exit)
>     return code_size_cost ? ch_possible_zero_cost : ch_win;
> 
> IMHO we're over aggressively apply early peeling here.  That holds
> generally, not only for OMP simd loops (which we could identify).
> 
> Why are we doing this game for single-block do-while loops?

It seems I just wrongly updated the old conditional. Sorry for that.
It should be:
diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 6cdb87a762f..8142add4bec 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -464,7 +464,7 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
      TODO: Even if duplication costs some size we may opt to do so in case
      exit probability is significant enough (do partial peeling).  */
   if (static_exit)
-    return code_size_cost ? ch_possible_zero_cost : ch_win;
+    return !code_size_cost ? ch_possible_zero_cost : ch_possible;
 
   /* We was not able to prove that conditional will be eliminated.  */
   int insns = estimate_num_insns (last, &eni_size_weights);

So the heuristics knows that if there is no code produced "peeling" is
good idea since it eliminates one conditional for free. Otherwise it
should know that peeling is possible but only done if it produces
do-while-loop

As TODO says it would make to duplicate also if the exit likely avoids
entering the loop (which would be cheaper than peeling full first
iteration), but that can be done incrementally.

I am testing the fix.

Honza

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Loop-ch improvements, part 3
  2023-08-22  7:53     ` Richard Biener
  2023-08-22 12:03       ` Jan Hubicka
@ 2023-08-23  8:34       ` Jan Hubicka
  2023-08-23  8:58         ` Richard Biener
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Hubicka @ 2023-08-23  8:34 UTC (permalink / raw)
  To: Richard Biener; +Cc: Hongtao Liu, gcc-patches

> We seem to peel one iteration for no good reason.  The loop is
> a do-while loop already.  The key is we see the first iteration
> exit condition is known not taken and then:
Hi,
this is patch fixing wrong return value in should_duplicate_loop_header_p.
Doing so uncovered suboptimal decisions on some jump threading testcases
where we chose to stop duplicating just before basic block that has zero
cost and duplicating so would be always a win.

This is because the heuristics trying to chose right point to duplicate
all winning blocks and to get loop to be do_while did not account
zero_cost blocks in all cases.  The patch simplifies the logic by
simply remembering zero cost blocks and handling them last after
the right stopping point is chosen.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

	* tree-ssa-loop-ch.cc (enum ch_decision): Fix comment.
	(should_duplicate_loop_header_p): Fix return value for static exits.
	(ch_base::copy_headers): Improve handling of ch_possible_zero_cost.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/copy-headers-9.c: Update template.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
index b49d1fc9576..11ee29458a2 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
@@ -13,7 +13,6 @@ void test (int m, int n)
 	}
 	while (i<10);
 }
-/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 2 "ch2" } } */
-/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win. it has zero" 1 "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 1 "ch2" } } */
 /* { dg-final { scan-tree-dump-times "Will duplicate bb" 2 "ch2" } } */
 /* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 6cdb87a762f..461416e4086 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -176,7 +176,7 @@ enum ch_decision
   ch_impossible,
   /* We can copy it if it enables wins.  */
   ch_possible,
-  /* We can "cop" it if it enables wins and doing
+  /* We can "copy" it if it enables wins and doing
      so will introduce no new code.  */
   ch_possible_zero_cost,
   /* We want to copy.  */
@@ -464,7 +464,7 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
      TODO: Even if duplication costs some size we may opt to do so in case
      exit probability is significant enough (do partial peeling).  */
   if (static_exit)
-    return code_size_cost ? ch_possible_zero_cost : ch_win;
+    return !code_size_cost ? ch_possible_zero_cost : ch_possible;
 
   /* We was not able to prove that conditional will be eliminated.  */
   int insns = estimate_num_insns (last, &eni_size_weights);
@@ -824,6 +824,7 @@ ch_base::copy_headers (function *fun)
       int last_win_nheaders = 0;
       bool last_win_invariant_exit = false;
       ch_decision ret;
+      auto_vec <ch_decision, 32> decision;
       hash_set <edge> *invariant_exits = new hash_set <edge>;
       hash_set <edge> *static_exits = new hash_set <edge>;
       while ((ret = should_duplicate_loop_header_p (header, loop, ranger,
@@ -833,6 +834,7 @@ ch_base::copy_headers (function *fun)
 	     != ch_impossible)
 	{
 	  nheaders++;
+	  decision.safe_push (ret);
 	  if (ret >= ch_win)
 	    {
 	      last_win_nheaders = nheaders;
@@ -841,20 +843,6 @@ ch_base::copy_headers (function *fun)
 		fprintf (dump_file, "    Duplicating bb %i is a win\n",
 			 header->index);
 	    }
-	  /* Duplicate BB if has zero cost but be sure it will not
-	     imply duplication of other BBs.  */
-	  else if (ret == ch_possible_zero_cost
-		   && (last_win_nheaders == nheaders - 1
-		       || (last_win_nheaders == nheaders - 2
-			   && last_win_invariant_exit)))
-	    {
-	      last_win_nheaders = nheaders;
-	      last_win_invariant_exit = false;
-	      if (dump_file && (dump_flags & TDF_DETAILS))
-		fprintf (dump_file,
-			 "    Duplicating bb %i is a win; it has zero cost\n",
-			 header->index);
-	    }
 	  else
 	    if (dump_file && (dump_flags & TDF_DETAILS))
 	      fprintf (dump_file, "    May duplicate bb %i\n", header->index);
@@ -884,6 +872,16 @@ ch_base::copy_headers (function *fun)
 	    fprintf (dump_file,
 		     "    Duplicating header BB to obtain do-while loop\n");
 	}
+      /* "Duplicate" all BBs with zero cost following last basic blocks we
+	 decided to copy.  */
+      while (last_win_nheaders < (int)decision.length ()
+	     && decision[last_win_nheaders] == ch_possible_zero_cost)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "    Duplicating extra bb is a win; it has zero cost\n");
+	  last_win_nheaders++;
+	}
 
       if (last_win_nheaders)
 	candidates.safe_push ({loop, last_win_nheaders,

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Loop-ch improvements, part 3
  2023-08-23  8:34       ` Jan Hubicka
@ 2023-08-23  8:58         ` Richard Biener
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Biener @ 2023-08-23  8:58 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Hongtao Liu, gcc-patches

On Wed, 23 Aug 2023, Jan Hubicka wrote:

> > We seem to peel one iteration for no good reason.  The loop is
> > a do-while loop already.  The key is we see the first iteration
> > exit condition is known not taken and then:
> Hi,
> this is patch fixing wrong return value in should_duplicate_loop_header_p.
> Doing so uncovered suboptimal decisions on some jump threading testcases
> where we chose to stop duplicating just before basic block that has zero
> cost and duplicating so would be always a win.
> 
> This is because the heuristics trying to chose right point to duplicate
> all winning blocks and to get loop to be do_while did not account
> zero_cost blocks in all cases.  The patch simplifies the logic by
> simply remembering zero cost blocks and handling them last after
> the right stopping point is chosen.
> 
> Bootstrapped/regtested x86_64-linux, OK?

OK.

Thanks,
Richard.

> gcc/ChangeLog:
> 
> 	* tree-ssa-loop-ch.cc (enum ch_decision): Fix comment.
> 	(should_duplicate_loop_header_p): Fix return value for static exits.
> 	(ch_base::copy_headers): Improve handling of ch_possible_zero_cost.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.dg/tree-ssa/copy-headers-9.c: Update template.
> 
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
> index b49d1fc9576..11ee29458a2 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-9.c
> @@ -13,7 +13,6 @@ void test (int m, int n)
>  	}
>  	while (i<10);
>  }
> -/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 2 "ch2" } } */
> -/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win. it has zero" 1 "ch2" } } */
> +/* { dg-final { scan-tree-dump-times "Duplicating bb . is a win" 1 "ch2" } } */
>  /* { dg-final { scan-tree-dump-times "Will duplicate bb" 2 "ch2" } } */
>  /* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
> diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
> index 6cdb87a762f..461416e4086 100644
> --- a/gcc/tree-ssa-loop-ch.cc
> +++ b/gcc/tree-ssa-loop-ch.cc
> @@ -176,7 +176,7 @@ enum ch_decision
>    ch_impossible,
>    /* We can copy it if it enables wins.  */
>    ch_possible,
> -  /* We can "cop" it if it enables wins and doing
> +  /* We can "copy" it if it enables wins and doing
>       so will introduce no new code.  */
>    ch_possible_zero_cost,
>    /* We want to copy.  */
> @@ -464,7 +464,7 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>       TODO: Even if duplication costs some size we may opt to do so in case
>       exit probability is significant enough (do partial peeling).  */
>    if (static_exit)
> -    return code_size_cost ? ch_possible_zero_cost : ch_win;
> +    return !code_size_cost ? ch_possible_zero_cost : ch_possible;
>  
>    /* We was not able to prove that conditional will be eliminated.  */
>    int insns = estimate_num_insns (last, &eni_size_weights);
> @@ -824,6 +824,7 @@ ch_base::copy_headers (function *fun)
>        int last_win_nheaders = 0;
>        bool last_win_invariant_exit = false;
>        ch_decision ret;
> +      auto_vec <ch_decision, 32> decision;
>        hash_set <edge> *invariant_exits = new hash_set <edge>;
>        hash_set <edge> *static_exits = new hash_set <edge>;
>        while ((ret = should_duplicate_loop_header_p (header, loop, ranger,
> @@ -833,6 +834,7 @@ ch_base::copy_headers (function *fun)
>  	     != ch_impossible)
>  	{
>  	  nheaders++;
> +	  decision.safe_push (ret);
>  	  if (ret >= ch_win)
>  	    {
>  	      last_win_nheaders = nheaders;
> @@ -841,20 +843,6 @@ ch_base::copy_headers (function *fun)
>  		fprintf (dump_file, "    Duplicating bb %i is a win\n",
>  			 header->index);
>  	    }
> -	  /* Duplicate BB if has zero cost but be sure it will not
> -	     imply duplication of other BBs.  */
> -	  else if (ret == ch_possible_zero_cost
> -		   && (last_win_nheaders == nheaders - 1
> -		       || (last_win_nheaders == nheaders - 2
> -			   && last_win_invariant_exit)))
> -	    {
> -	      last_win_nheaders = nheaders;
> -	      last_win_invariant_exit = false;
> -	      if (dump_file && (dump_flags & TDF_DETAILS))
> -		fprintf (dump_file,
> -			 "    Duplicating bb %i is a win; it has zero cost\n",
> -			 header->index);
> -	    }
>  	  else
>  	    if (dump_file && (dump_flags & TDF_DETAILS))
>  	      fprintf (dump_file, "    May duplicate bb %i\n", header->index);
> @@ -884,6 +872,16 @@ ch_base::copy_headers (function *fun)
>  	    fprintf (dump_file,
>  		     "    Duplicating header BB to obtain do-while loop\n");
>  	}
> +      /* "Duplicate" all BBs with zero cost following last basic blocks we
> +	 decided to copy.  */
> +      while (last_win_nheaders < (int)decision.length ()
> +	     && decision[last_win_nheaders] == ch_possible_zero_cost)
> +	{
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    fprintf (dump_file,
> +		     "    Duplicating extra bb is a win; it has zero cost\n");
> +	  last_win_nheaders++;
> +	}
>  
>        if (last_win_nheaders)
>  	candidates.safe_push ({loop, last_win_nheaders,
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop-ch improvements, part 3
  2023-07-20  7:09 loop-ch " Jan Hubicka
@ 2023-07-20 13:03 ` Richard Biener
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Biener @ 2023-07-20 13:03 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches

On Thu, Jul 20, 2023 at 9:10 AM Jan Hubicka via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
> this patch makes tree-ssa-loop-ch to understand if-combined conditionals (which
> are quite common) and remove the IV-derived heuristics.  That heuristics is
> quite dubious because every variable with PHI in header of integral or pointer
> type is seen as IV, so in the first basic block we match all loop invariants as
> invariants and everything that chagnes in loop as IV-like.
>
> I think the heuristics was mostly there to make header duplication happen when
> the exit conditional is constant false in the first iteration and with ranger
> we can work this out in good enough precision.
>
> The patch adds notion of "combined exit" which has conditional that is
> and/or/xor of loop invariant exit and exit known to be false in first
> iteration.  Copying these is a win since the loop conditional will simplify
> in both copies.
>
> It seems that those are usual bit or/and/xor and the code size accounting is
> true only when the values have at most one bit set or when the static constant
> and invariant versions are simple (such as all zeros).  I am not testing this,
> so the code may be optimistic here.  I think it is not common enough to matter
> and I can not think of correct condition that is not quite complex.
>
> I also improved code size estimate not accounting non-conditionals that are
> know to be constant in peeled copy and improved debug output.
>
> This requires testsuite compensaiton.  uninit-pred-loop-1.c.C does:
>
> /* { dg-do compile } */
> /* { dg-options "-Wuninitialized -O2 -std=c++98" } */
>
> extern int bar();
> int foo(int n, int m)
> {
>  for (;;) {
>    int err = ({int _err;
>      for (int i = 0; i < 16; ++i) {
>        if (m+i > n)
>           break;
>        _err = 17;
>        _err = bar();
>      }
>      _err;
>    });
>
>    if (err == 0) return 17;
> }
>
> Before path we duplicate
>        if (m+i > n)
> which makes maybe-uninitialized warning to not be output.  I do not quite see
> why copying this out would be a win, since it won't simlify.  Also I think the
> warning is correct.  if m>n the loop will bail out before initializing _err and
> it will be used unitialized.  I think it is bug elsewhere that header
> duplication supresses this.
>
> copy headers does:
> int is_sorted(int *a, int n, int m, int k)
> {
>   for (int i = 0; i < n - 1 && m && k > i; i++)
>     if (a[i] > a[i + 1])
>       return 0;
>   return 1;
> }
>
> it tests that all three for statement conditionals are duplicaed.  With patch
> we no longer do k>i since it is not going to simplify.  So I added test
> ensuring that k is positive.  Also the tests requires disabling if-combining and
> vrp to avoid conditionals becoming combined ones. So I aded new version of test
> that we now behave correctly aslo with if-combine.
>
> ivopt_mult_2.c and ivopt_mult_1.c seems to require loop header
> duplication for ivopts to behave particular way, so I also ensured by value
> range that the header is duplicated.
>
> Bootstrapped/regtested x86_64-linux, OK?
>
> gcc/ChangeLog:
>
>         * tree-ssa-loop-ch.cc (edge_range_query): Rename to ...
>         (get_range_query): ... this one; do
>         (static_loop_exit): Add query parametr, turn ranger to reference.
>         (loop_static_stmt_p): New function.
>         (loop_static_op_p): New function.
>         (loop_iv_derived_p): Remove.
>         (loop_combined_static_and_iv_p): New function.
>         (should_duplicate_loop_header_p): Discover combined onditionals;
>         do not track iv derived; improve dumps.
>         (pass_ch::execute): Fix whitespace.
>
> gcc/testsuite/ChangeLog:
>
>         * g++.dg/uninit-pred-loop-1_c.C: Allow warning.
>         * gcc.dg/tree-ssa/copy-headers-7.c: Add tests so exit conditition is
>         static; update template.
>         * gcc.dg/tree-ssa/ivopt_mult_1.c: Add test so exit condition is static.
>         * gcc.dg/tree-ssa/ivopt_mult_2.c: Add test so exit condition is static.
>         * gcc.dg/tree-ssa/copy-headers-8.c: New test.
>
> diff --git a/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C b/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
> index 711812aae1b..1ee1615526f 100644
> --- a/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
> +++ b/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
> @@ -15,7 +15,7 @@ int foo(int n, int m)
>       _err;
>     });
>
> -   if (err == 0) return 17;
> +   if (err == 0) return 17;    /* { dg-warning "uninitialized" "warning" } */
>   }
>
>   return 18;
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
> index 3c9b3807041..e2a6c75f2e9 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
> @@ -3,9 +3,10 @@
>
>  int is_sorted(int *a, int n, int m, int k)
>  {
> -  for (int i = 0; i < n - 1 && m && k > i; i++)
> -    if (a[i] > a[i + 1])
> -      return 0;
> +  if (k > 0)
> +    for (int i = 0; i < n - 1 && m && k > i; i++)
> +      if (a[i] > a[i + 1])
> +       return 0;
>    return 1;
>  }
>
> @@ -13,4 +14,8 @@ int is_sorted(int *a, int n, int m, int k)
>     the invariant test, not the alternate exit test.  */
>
>  /* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
> +/* { dg-final { scan-tree-dump-times "Conditional combines static and invariant" 0 "ch2" } } */
> +/* { dg-final { scan-tree-dump-times "Will elliminate invariant exit" 1 "ch2" } } */
> +/* { dg-final { scan-tree-dump-times "Will eliminate peeled conditional" 1 "ch2" } } */
> +/* { dg-final { scan-tree-dump-times "Not duplicating bb .: condition based on non-IV loop variant." 1 "ch2" } } */
>  /* { dg-final { scan-tree-dump-times "Will duplicate bb" 3 "ch2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-8.c b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-8.c
> new file mode 100644
> index 00000000000..8b4b5e7ea81
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-8.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-ch2-details" } */
> +
> +int is_sorted(int *a, int n, int m, int k)
> +{
> +  if (k > 0)
> +    for (int i = 0; i < n - 1 && m && k > i; i++)
> +      if (a[i] > a[i + 1])
> +       return 0;
> +  return 1;
> +}
> +
> +/* Verify we apply loop header copying but only copy the IV tests and
> +   the invariant test, not the alternate exit test.  */
> +
> +/* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
> +/* { dg-final { scan-tree-dump-times "Conditional combines static and invariant" 1 "ch2" } } */
> +/* { dg-final { scan-tree-dump-times "Will duplicate bb" 2 "ch2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_1.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_1.c
> index adfe371c7ce..faed9114f6f 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_1.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_1.c
> @@ -9,6 +9,7 @@ long foo(long* p, long* p2, int N1, int N2)
>    long* p_limit = p + N1;
>    long* p_limit2 = p2 + N2;
>    long s = 0;
> +  if (p2 <= p_limit2)
>    while (p  <= p_limit)
>      {
>        p++;
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_2.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_2.c
> index 50d0cc5d2ae..6a82aeb0268 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_2.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_2.c
> @@ -8,15 +8,16 @@ long foo(long* p, long* p2, int N1, int N2)
>    int i = 0;
>    long* p_limit2 = p2 + N2;
>    long s = 0;
> -  while (i < N1)
> -    {
> -       p++;
> -       p2++;
> -       i++;
> -       if (p2 > p_limit2)
> -         break;
> -       s += (*p);
> -    }
> +  if (p2 <= p_limit2)
> +    while (i < N1)
> +      {
> +        p++;
> +        p2++;
> +        i++;
> +        if (p2 > p_limit2)
> +          break;
> +        s += (*p);
> +      }
>
>    return s;
>  }
> diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
> index e0139cb432c..f3dc3d998e3 100644
> --- a/gcc/tree-ssa-loop-ch.cc
> +++ b/gcc/tree-ssa-loop-ch.cc
> @@ -38,34 +38,33 @@ along with GCC; see the file COPYING3.  If not see
>  #include "value-range.h"
>  #include "gimple-range.h"
>  #include "gimple-range-path.h"
> +#include "gimple-pretty-print.h"
>  #include "cfganal.h"
>
> -/* Duplicates headers of loops if they are small enough, so that the statements
> -   in the loop body are always executed when the loop is entered.  This
> -   increases effectiveness of code motion optimizations, and reduces the need
> -   for loop preconditioning.  */
> +/* Return path query insteance for testing ranges of statements
> +   in headers of LOOP contained in basic block BB.
> +   Use RANGER instance.  */
>
> -/* Given a path through edge E, whose last statement is COND, return
> -   the range of the solved conditional in R.  */
> -
> -static void
> -edge_range_query (irange &r, class loop *loop, gcond *cond, gimple_ranger &ranger)
> +static path_range_query *
> +get_range_query (class loop *loop,
> +                basic_block bb,
> +                gimple_ranger &ranger)
>  {
>    auto_vec<basic_block, 8> path;
> -  for (basic_block bb = gimple_bb (cond); bb != loop->header; bb = single_pred_edge (bb)->src)
> +  for (; bb != loop->header; bb = single_pred_edge (bb)->src)
>      path.safe_push (bb);
>    path.safe_push (loop->header);
>    path.safe_push (loop_preheader_edge (loop)->src);
> -  path_range_query query (ranger, path);
> -  if (!query.range_of_stmt (r, cond))
> -    r.set_varying (boolean_type_node);
> +  return new path_range_query (ranger, path);
>  }
>
>  /* Return edge that is true in the first iteration of the loop
> -   and NULL otherwise.  */
> +   and NULL otherwise.
> +   Formulate corrent ranger query to RANGER.  */
>
>  static edge
> -static_loop_exit (class loop *l, basic_block bb, gimple_ranger *ranger)
> +static_loop_exit (class loop *l, basic_block bb, gimple_ranger &ranger,
> +                 path_range_query *&query)
>  {
>    gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (bb));
>    edge ret_e;
> @@ -83,21 +82,48 @@ static_loop_exit (class loop *l, basic_block bb, gimple_ranger *ranger)
>
>    int_range<1> desired_static_range;
>    if (loop_exit_edge_p (l, true_e))
> -   {
> +    {
>        desired_static_range = range_false ();
>        ret_e = true_e;
> -   }
> +    }
>    else
> -  {
> -    desired_static_range = range_true ();
> -    ret_e = false_e;
> -  }
> +   {
> +      desired_static_range = range_true ();
> +      ret_e = false_e;
> +   }
> +
> +  if (!query)
> +    query = get_range_query (l, gimple_bb (last), ranger);
>
>    int_range<2> r;
> -  edge_range_query (r, l, last, *ranger);
> +  if (!query->range_of_stmt (r, last))
> +    return NULL;
>    return r == desired_static_range ? ret_e : NULL;
>  }
>
> +/* Return true if STMT is static in LOOP.  This means that its value
> +   is constant in the first iteration.
> +   Use RANGER and formulate query cached in QUERY.  */
> +
> +static bool
> +loop_static_stmt_p (class loop *loop,
> +                   gimple_ranger &ranger,
> +                   path_range_query *&query,
> +                   gimple *stmt)
> +{
> +  tree type = gimple_range_type (stmt);
> +  if (!type || !Value_Range::supports_type_p (type))
> +    return false;
> +
> +  if (!query)
> +    query = get_range_query (loop, gimple_bb (stmt), ranger);
> +
> +  Value_Range r (gimple_range_type (stmt));
> +  if (!query->range_of_stmt (r, stmt))
> +    return false;
> +  return r.singleton_p ();
> +}
> +
>  /* Return true if OP is invariant.  */
>
>  static bool
> @@ -109,21 +135,37 @@ loop_invariant_op_p (class loop *loop,
>    if (SSA_NAME_IS_DEFAULT_DEF (op)
>        || !flow_bb_inside_loop_p (loop, gimple_bb (SSA_NAME_DEF_STMT (op))))
>      return true;
> +  return gimple_uid (SSA_NAME_DEF_STMT (op)) & 1;
> +}
> +
> +/* Return true if OP combines outcome of static and
> +   loop invariant conditional.  */
> +
> +static bool
> +loop_static_op_p (class loop *loop, tree op)
> +{
> +  /* Always check for invariant first.  */
> +  gcc_checking_assert (!is_gimple_min_invariant (op)
> +                      && !SSA_NAME_IS_DEFAULT_DEF (op)
> +                      && flow_bb_inside_loop_p (loop,
> +                              gimple_bb (SSA_NAME_DEF_STMT (op))));
>    return gimple_uid (SSA_NAME_DEF_STMT (op)) & 2;
>  }
>
> -/* Return true if OP looks like it is derived from IV.  */
> +
> +/* Return true if OP combines outcome of static and
> +   loop invariant conditional.  */
>
>  static bool
> -loop_iv_derived_p (class loop *loop,
> -                   tree op)
> +loop_combined_static_and_iv_p (class loop *loop,
> +                              tree op)
>  {
>    /* Always check for invariant first.  */
>    gcc_checking_assert (!is_gimple_min_invariant (op)
>                        && !SSA_NAME_IS_DEFAULT_DEF (op)
>                        && flow_bb_inside_loop_p (loop,
>                                gimple_bb (SSA_NAME_DEF_STMT (op))));
> -  return gimple_uid (SSA_NAME_DEF_STMT (op)) & 1;
> +  return gimple_uid (SSA_NAME_DEF_STMT (op)) & 4;
>  }
>
>  /* Check whether we should duplicate HEADER of LOOP.  At most *LIMIT
> @@ -182,25 +224,18 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>        return false;
>      }
>
> +  path_range_query *query = NULL;
>    for (gphi_iterator psi = gsi_start_phis (header); !gsi_end_p (psi);
>         gsi_next (&psi))
> -    /* If this is actual loop header PHIs indicate that the SSA_NAME
> -       may be IV.  Otherwise just give up.  */
> -    if (header == loop->header)
> +    if (!virtual_operand_p (gimple_phi_result (psi.phi ())))
>        {
> -       gphi *phi = psi.phi ();
> -       tree res = gimple_phi_result (phi);
> -       if (INTEGRAL_TYPE_P (TREE_TYPE (res))
> -           || POINTER_TYPE_P (TREE_TYPE (res)))
> -         gimple_set_uid (phi, 1 /* IV */);
> -       else
> -         gimple_set_uid (phi, 0);
> +       bool static_p = loop_static_stmt_p (loop, *ranger,
> +                                           query, psi.phi ());
> +       gimple_set_uid (psi.phi (), static_p ? 2 : 0);
>        }
> -    else
> -      gimple_set_uid (psi.phi (), 0);
>
>    /* Count number of instructions and punt on calls.
> -     Populate stmts INV/IV flag to later apply heuristics to the
> +     Populate stmts INV flag to later apply heuristics to the
>       kind of conditions we want to copy.  */
>    for (bsi = gsi_start_bb (header); !gsi_end_p (bsi); gsi_next (&bsi))
>      {
> @@ -215,6 +250,12 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>        if (gimple_code (last) == GIMPLE_COND)
>         break;
>
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "  Analyzing: ");
> +         print_gimple_stmt (dump_file, last, 0, TDF_SLIM);
> +       }
> +
>        if (gimple_code (last) == GIMPLE_CALL
>           && (!gimple_inexpensive_call_p (as_a <gcall *> (last))
>               /* IFN_LOOP_DIST_ALIAS means that inner loop is distributed
> @@ -225,53 +266,152 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>             fprintf (dump_file,
>                      "  Not duplicating bb %i: it contains call.\n",
>                      header->index);
> +         if (query)
> +           delete query;
>           return false;
>         }
>
> -      /* Classify the stmt based on whether its computation is based
> -         on a IV or whether it is invariant in the loop.  */
> +      /* Classify the stmt is invariant in the loop.  */
>        gimple_set_uid (last, 0);
>        if (!gimple_vuse (last)
>           && gimple_code (last) != GIMPLE_ASM
>           && (gimple_code (last) != GIMPLE_CALL
>               || gimple_call_flags (last) & ECF_CONST))
>         {
> -         bool inv = true;
> -         bool iv = true;
> +         bool inv = true, static_p = false;
>           ssa_op_iter i;
>           tree op;
>           FOR_EACH_SSA_TREE_OPERAND (op, last, i, SSA_OP_USE)
>             if (!loop_invariant_op_p (loop, op))
> -             {
> -               if (!loop_iv_derived_p (loop, op))
> -                 {
> -                   inv = false;
> -                   iv = false;
> -                   break;
> -                 }
> -               else
> -                 inv = false;
> -             }
> -         gimple_set_uid (last, (iv ? 1 : 0) | (inv ? 2 : 0));
> +             inv = false;
> +         /* Assume that code is reasonably optimized and invariant
> +            constants are already identified.  */
> +         if (!inv)
> +           static_p = loop_static_stmt_p (loop, *ranger, query, last);
> +         gimple_set_uid (last, (inv ? 1 : 0) | (static_p ? 2 : 0));
> +         if (dump_file && (dump_flags & TDF_DETAILS))
> +           {
> +             if (inv)
> +               fprintf (dump_file, "    Stmt is loop invariant\n");
> +             if (static_p)
> +               fprintf (dump_file, "    Stmt is static "
> +                        "(constant in the first iteration)\n");
> +           }
>           /* Loop invariants will be optimized out in loop body after
>              duplication; do not account invariant computation in code
> -            size costs.  */
> -         if (inv)
> +            size costs.
> +
> +            Similarly static computations will be optimized out in the
> +            duplicatd header.  */
> +         if (inv || static_p)
>             continue;
> +
> +         /* Match the following:
> +            _1 = i_1 < 10   <- static condtion
> +            _2 = n != 0     <- loop invariant condition
> +            _3 = _1 & _2    <- combined static and iv statement.  */
> +         if (gimple_code (last) == GIMPLE_ASSIGN
> +             && (gimple_assign_rhs_code (last) == TRUTH_AND_EXPR
> +                 || gimple_assign_rhs_code (last) == TRUTH_OR_EXPR
> +                 || gimple_assign_rhs_code (last) == TRUTH_XOR_EXPR
> +                 || gimple_assign_rhs_code (last) == BIT_AND_EXPR
> +                 || gimple_assign_rhs_code (last) == BIT_IOR_EXPR
> +                 || gimple_assign_rhs_code (last) == BIT_XOR_EXPR))

Please make this cheaper by doing sth like

    gassign *last_ass = dyn_cast <gassign *> (last);
    enum tree_code code;
    if (last_ass && ((code = gimple_asssign_rhs_code (last_ass), true))
        && (code == TRUTH_ ....))

and use last_ass in the following block.  You save checking code for
each call and esp. gimple_assign_rhs_code isn't very cheap (though
it should be all inlined and optimized eventually).

Note TRUTH_* can never appear here, those are not valid GIMPLE codes.

Otherwise looks good to me.

Thanks,
Richard.

> +           {
> +             tree op1 = gimple_assign_rhs1 (last);
> +             tree op2 = gimple_assign_rhs2 (last);
> +
> +             if ((loop_invariant_op_p (loop, op1)
> +                  || loop_combined_static_and_iv_p (loop, op1)
> +                  || loop_static_op_p (loop, op1))
> +                 && (loop_invariant_op_p (loop, op2)
> +                     || loop_combined_static_and_iv_p (loop, op2)
> +                     || loop_static_op_p (loop, op2)))
> +               {
> +                 /* Duplicating loop header with combned conditional will
> +                    remove this statement in each copy.  But we account for
> +                    that later when seeing that condition.
> +
> +                    Note that this may be overly optimistic for bit operations
> +                    where the static parameter may still result in non-trivial
> +                    bit operation.  */
> +                 gimple_set_uid (last, 4);
> +                 if (dump_file && (dump_flags & TDF_DETAILS))
> +                   fprintf (dump_file,
> +                            "    Stmt combines static and invariant op\n");
> +                 continue;
> +               }
> +           }
>         }
>
> -      *limit -= estimate_num_insns (last, &eni_size_weights);
> +      int insns = estimate_num_insns (last, &eni_size_weights);
> +      *limit -= insns;
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file,
> +                "    Acconting stmt as %i insns\n", insns);
>        if (*limit < 0)
>         {
>           if (dump_file && (dump_flags & TDF_DETAILS))
>             fprintf (dump_file,
>                      "  Not duplicating bb %i contains too many insns.\n",
>                      header->index);
> +         if (query)
> +           delete query;
>           return false;
>         }
>      }
>
> -  edge static_exit = static_loop_exit (loop, header, ranger);
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "  Analyzing: ");
> +      print_gimple_stmt (dump_file, last, 0, TDF_SLIM);
> +    }
> +
> +  /* If the condition tests a non-IV loop variant we do not want to rotate
> +     the loop further.  Unless this is the original loop header.  */
> +  tree lhs = gimple_cond_lhs (last);
> +  tree rhs = gimple_cond_rhs (last);
> +  bool lhs_invariant = loop_invariant_op_p (loop, lhs);
> +  bool rhs_invariant = loop_invariant_op_p (loop, rhs);
> +
> +  /* Combined conditional is a result of if combining:
> +
> +     _1 = i_1 < 10   <- static condtion
> +     _2 = n != 0     <- loop invariant condition
> +     _3 = _1 & _2    <- combined static and iv statement
> +     if (_3 != 0)    <- combined conditional
> +
> +     Combined conditionals will not be optimized out in either copy.
> +     However duplicaed header simplifies to:
> +
> +     if (n < 10)
> +
> +     while loop body to
> +
> +     if (i_1 < 10)
> +
> +     So effectively the resulting code sequence will be of same length as
> +     the original code.
> +
> +     Combined conditionals are never static or invariant, so save some work
> +     below.  */
> +  if (lhs_invariant != rhs_invariant
> +      && (lhs_invariant
> +         || loop_combined_static_and_iv_p (loop, lhs))
> +      && (rhs_invariant
> +         || loop_combined_static_and_iv_p (loop, rhs)))
> +   {
> +     if (query)
> +       delete query;
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file,
> +                "  Conditional combines static and invariant op.\n");
> +     return true;
> +   }
> +
> +  edge static_exit = static_loop_exit (loop, header, *ranger, query);
> +  if (query)
> +    delete query;
>
>    if (static_exit && static_exits)
>      {
> @@ -282,13 +422,6 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>                  static_exit->src->index);
>        /* Still look for invariant exits; exit may be both.  */
>      }
> -
> -  /* If the condition tests a non-IV loop variant we do not want to rotate
> -     the loop further.  Unless this is the original loop header.  */
> -  tree lhs = gimple_cond_lhs (last);
> -  tree rhs = gimple_cond_rhs (last);
> -  bool lhs_invariant = loop_invariant_op_p (loop, lhs);
> -  bool rhs_invariant = loop_invariant_op_p (loop, rhs);
>    if (lhs_invariant && rhs_invariant)
>      {
>        if (invariant_exits)
> @@ -312,7 +445,11 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>      return true;
>
>    /* We was not able to prove that conditional will be eliminated.  */
> -  *limit -= estimate_num_insns (last, &eni_size_weights);
> +  int insns = estimate_num_insns (last, &eni_size_weights);
> +  *limit -= insns;
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file,
> +            "    Acconting stmt as %i insns\n", insns);
>    if (*limit < 0)
>      {
>        if (dump_file && (dump_flags & TDF_DETAILS))
> @@ -322,12 +459,6 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
>        return false;
>      }
>
> -  /* TODO: This is heuristics that claims that IV based ocnditionals will
> -     likely be optimized out in duplicated header.  We could use ranger
> -     query instead to tell this more precisely.  */
> -  if ((lhs_invariant || loop_iv_derived_p (loop, lhs))
> -      && (rhs_invariant || loop_iv_derived_p (loop, rhs)))
> -    return true;
>    if (header != loop->header)
>      {
>        if (dump_file && (dump_flags & TDF_DETAILS))
> @@ -550,7 +681,7 @@ public:
>
>    /* opt_pass methods: */
>    bool gate (function *) final override { return flag_tree_ch != 0; }
> -
> +
>    /* Initialize and finalize loop structures, copying headers inbetween.  */
>    unsigned int execute (function *) final override;
>
> @@ -590,7 +721,7 @@ public:
>      return flag_tree_ch != 0
>            && (flag_tree_loop_vectorize != 0 || fun->has_force_vectorize_loops);
>    }
> -
> +
>    /* Just copy headers, no initialization/finalization of loop structures.  */
>    unsigned int execute (function *) final override;
>
> @@ -973,7 +1104,7 @@ pass_ch::execute (function *fun)
>  /* Assume an earlier phase has already initialized all the loop structures that
>     we need here (and perhaps others too), and that these will be finalized by
>     a later phase.  */
> -
> +
>  unsigned int
>  pass_ch_vect::execute (function *fun)
>  {

^ permalink raw reply	[flat|nested] 9+ messages in thread

* loop-ch improvements, part 3
@ 2023-07-20  7:09 Jan Hubicka
  2023-07-20 13:03 ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Hubicka @ 2023-07-20  7:09 UTC (permalink / raw)
  To: gcc-patches

Hi,
this patch makes tree-ssa-loop-ch to understand if-combined conditionals (which
are quite common) and remove the IV-derived heuristics.  That heuristics is
quite dubious because every variable with PHI in header of integral or pointer
type is seen as IV, so in the first basic block we match all loop invariants as
invariants and everything that chagnes in loop as IV-like.

I think the heuristics was mostly there to make header duplication happen when
the exit conditional is constant false in the first iteration and with ranger
we can work this out in good enough precision.

The patch adds notion of "combined exit" which has conditional that is
and/or/xor of loop invariant exit and exit known to be false in first
iteration.  Copying these is a win since the loop conditional will simplify
in both copies.

It seems that those are usual bit or/and/xor and the code size accounting is
true only when the values have at most one bit set or when the static constant
and invariant versions are simple (such as all zeros).  I am not testing this,
so the code may be optimistic here.  I think it is not common enough to matter
and I can not think of correct condition that is not quite complex.

I also improved code size estimate not accounting non-conditionals that are
know to be constant in peeled copy and improved debug output.

This requires testsuite compensaiton.  uninit-pred-loop-1.c.C does:

/* { dg-do compile } */
/* { dg-options "-Wuninitialized -O2 -std=c++98" } */

extern int bar();
int foo(int n, int m)
{
 for (;;) {
   int err = ({int _err; 
     for (int i = 0; i < 16; ++i) {
       if (m+i > n)
          break;
       _err = 17;
       _err = bar();
     }
     _err; 
   }); 

   if (err == 0) return 17;
}

Before path we duplicate
       if (m+i > n)
which makes maybe-uninitialized warning to not be output.  I do not quite see
why copying this out would be a win, since it won't simlify.  Also I think the
warning is correct.  if m>n the loop will bail out before initializing _err and
it will be used unitialized.  I think it is bug elsewhere that header
duplication supresses this.

copy headers does:
int is_sorted(int *a, int n, int m, int k)
{
  for (int i = 0; i < n - 1 && m && k > i; i++)
    if (a[i] > a[i + 1])
      return 0;
  return 1;
}

it tests that all three for statement conditionals are duplicaed.  With patch
we no longer do k>i since it is not going to simplify.  So I added test
ensuring that k is positive.  Also the tests requires disabling if-combining and
vrp to avoid conditionals becoming combined ones. So I aded new version of test
that we now behave correctly aslo with if-combine.

ivopt_mult_2.c and ivopt_mult_1.c seems to require loop header
duplication for ivopts to behave particular way, so I also ensured by value
range that the header is duplicated.

Bootstrapped/regtested x86_64-linux, OK?

gcc/ChangeLog:

	* tree-ssa-loop-ch.cc (edge_range_query): Rename to ...
	(get_range_query): ... this one; do 
	(static_loop_exit): Add query parametr, turn ranger to reference.
	(loop_static_stmt_p): New function.
	(loop_static_op_p): New function.
	(loop_iv_derived_p): Remove.
	(loop_combined_static_and_iv_p): New function.
	(should_duplicate_loop_header_p): Discover combined onditionals;
	do not track iv derived; improve dumps.
	(pass_ch::execute): Fix whitespace.

gcc/testsuite/ChangeLog:

	* g++.dg/uninit-pred-loop-1_c.C: Allow warning.
	* gcc.dg/tree-ssa/copy-headers-7.c: Add tests so exit conditition is
	static; update template.
	* gcc.dg/tree-ssa/ivopt_mult_1.c: Add test so exit condition is static.
	* gcc.dg/tree-ssa/ivopt_mult_2.c: Add test so exit condition is static.
	* gcc.dg/tree-ssa/copy-headers-8.c: New test.

diff --git a/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C b/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
index 711812aae1b..1ee1615526f 100644
--- a/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
+++ b/gcc/testsuite/g++.dg/uninit-pred-loop-1_c.C
@@ -15,7 +15,7 @@ int foo(int n, int m)
      _err; 
    }); 
 
-   if (err == 0) return 17;
+   if (err == 0) return 17;	/* { dg-warning "uninitialized" "warning" } */
  }
 
  return 18;
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
index 3c9b3807041..e2a6c75f2e9 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-7.c
@@ -3,9 +3,10 @@
 
 int is_sorted(int *a, int n, int m, int k)
 {
-  for (int i = 0; i < n - 1 && m && k > i; i++)
-    if (a[i] > a[i + 1])
-      return 0;
+  if (k > 0)
+    for (int i = 0; i < n - 1 && m && k > i; i++)
+      if (a[i] > a[i + 1])
+	return 0;
   return 1;
 }
 
@@ -13,4 +14,8 @@ int is_sorted(int *a, int n, int m, int k)
    the invariant test, not the alternate exit test.  */
 
 /* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Conditional combines static and invariant" 0 "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Will elliminate invariant exit" 1 "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Will eliminate peeled conditional" 1 "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Not duplicating bb .: condition based on non-IV loop variant." 1 "ch2" } } */
 /* { dg-final { scan-tree-dump-times "Will duplicate bb" 3 "ch2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-8.c b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-8.c
new file mode 100644
index 00000000000..8b4b5e7ea81
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/copy-headers-8.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ch2-details" } */
+
+int is_sorted(int *a, int n, int m, int k)
+{
+  if (k > 0)
+    for (int i = 0; i < n - 1 && m && k > i; i++)
+      if (a[i] > a[i + 1])
+	return 0;
+  return 1;
+}
+
+/* Verify we apply loop header copying but only copy the IV tests and
+   the invariant test, not the alternate exit test.  */
+
+/* { dg-final { scan-tree-dump "is now do-while loop" "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Conditional combines static and invariant" 1 "ch2" } } */
+/* { dg-final { scan-tree-dump-times "Will duplicate bb" 2 "ch2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_1.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_1.c
index adfe371c7ce..faed9114f6f 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_1.c
@@ -9,6 +9,7 @@ long foo(long* p, long* p2, int N1, int N2)
   long* p_limit = p + N1;
   long* p_limit2 = p2 + N2;
   long s = 0;
+  if (p2 <= p_limit2)
   while (p  <= p_limit)
     {
       p++;
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_2.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_2.c
index 50d0cc5d2ae..6a82aeb0268 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_2.c
@@ -8,15 +8,16 @@ long foo(long* p, long* p2, int N1, int N2)
   int i = 0;
   long* p_limit2 = p2 + N2;
   long s = 0;
-  while (i < N1)
-    {
-       p++;
-       p2++;
-       i++;
-       if (p2 > p_limit2)
-         break;
-       s += (*p);
-    }
+  if (p2 <= p_limit2)
+    while (i < N1)
+      {
+	 p++;
+	 p2++;
+	 i++;
+	 if (p2 > p_limit2)
+	   break;
+	 s += (*p);
+      }
 
   return s;
 }
diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index e0139cb432c..f3dc3d998e3 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -38,34 +38,33 @@ along with GCC; see the file COPYING3.  If not see
 #include "value-range.h"
 #include "gimple-range.h"
 #include "gimple-range-path.h"
+#include "gimple-pretty-print.h"
 #include "cfganal.h"
 
-/* Duplicates headers of loops if they are small enough, so that the statements
-   in the loop body are always executed when the loop is entered.  This
-   increases effectiveness of code motion optimizations, and reduces the need
-   for loop preconditioning.  */
+/* Return path query insteance for testing ranges of statements
+   in headers of LOOP contained in basic block BB.
+   Use RANGER instance.  */
 
-/* Given a path through edge E, whose last statement is COND, return
-   the range of the solved conditional in R.  */
-
-static void
-edge_range_query (irange &r, class loop *loop, gcond *cond, gimple_ranger &ranger)
+static path_range_query *
+get_range_query (class loop *loop,
+		 basic_block bb,
+		 gimple_ranger &ranger)
 {
   auto_vec<basic_block, 8> path;
-  for (basic_block bb = gimple_bb (cond); bb != loop->header; bb = single_pred_edge (bb)->src)
+  for (; bb != loop->header; bb = single_pred_edge (bb)->src)
     path.safe_push (bb);
   path.safe_push (loop->header);
   path.safe_push (loop_preheader_edge (loop)->src);
-  path_range_query query (ranger, path);
-  if (!query.range_of_stmt (r, cond))
-    r.set_varying (boolean_type_node);
+  return new path_range_query (ranger, path);
 }
 
 /* Return edge that is true in the first iteration of the loop
-   and NULL otherwise.  */
+   and NULL otherwise.
+   Formulate corrent ranger query to RANGER.  */
 
 static edge
-static_loop_exit (class loop *l, basic_block bb, gimple_ranger *ranger)
+static_loop_exit (class loop *l, basic_block bb, gimple_ranger &ranger,
+		  path_range_query *&query)
 {
   gcond *last = safe_dyn_cast <gcond *> (*gsi_last_bb (bb));
   edge ret_e;
@@ -83,21 +82,48 @@ static_loop_exit (class loop *l, basic_block bb, gimple_ranger *ranger)
 
   int_range<1> desired_static_range;
   if (loop_exit_edge_p (l, true_e))
-   {
+    {
       desired_static_range = range_false ();
       ret_e = true_e;
-   }
+    }
   else
-  {
-    desired_static_range = range_true ();
-    ret_e = false_e;
-  }
+   {
+      desired_static_range = range_true ();
+      ret_e = false_e;
+   }
+
+  if (!query)
+    query = get_range_query (l, gimple_bb (last), ranger);
 
   int_range<2> r;
-  edge_range_query (r, l, last, *ranger);
+  if (!query->range_of_stmt (r, last))
+    return NULL;
   return r == desired_static_range ? ret_e : NULL;
 }
 
+/* Return true if STMT is static in LOOP.  This means that its value
+   is constant in the first iteration.
+   Use RANGER and formulate query cached in QUERY.  */
+
+static bool
+loop_static_stmt_p (class loop *loop,
+		    gimple_ranger &ranger,
+		    path_range_query *&query,
+		    gimple *stmt)
+{
+  tree type = gimple_range_type (stmt);
+  if (!type || !Value_Range::supports_type_p (type))
+    return false;
+
+  if (!query)
+    query = get_range_query (loop, gimple_bb (stmt), ranger);
+
+  Value_Range r (gimple_range_type (stmt));
+  if (!query->range_of_stmt (r, stmt))
+    return false;
+  return r.singleton_p ();
+}
+
 /* Return true if OP is invariant.  */
 
 static bool
@@ -109,21 +135,37 @@ loop_invariant_op_p (class loop *loop,
   if (SSA_NAME_IS_DEFAULT_DEF (op)
       || !flow_bb_inside_loop_p (loop, gimple_bb (SSA_NAME_DEF_STMT (op))))
     return true;
+  return gimple_uid (SSA_NAME_DEF_STMT (op)) & 1;
+}
+
+/* Return true if OP combines outcome of static and
+   loop invariant conditional.  */
+
+static bool
+loop_static_op_p (class loop *loop, tree op)
+{
+  /* Always check for invariant first.  */
+  gcc_checking_assert (!is_gimple_min_invariant (op)
+		       && !SSA_NAME_IS_DEFAULT_DEF (op)
+		       && flow_bb_inside_loop_p (loop,
+			       gimple_bb (SSA_NAME_DEF_STMT (op))));
   return gimple_uid (SSA_NAME_DEF_STMT (op)) & 2;
 }
 
-/* Return true if OP looks like it is derived from IV.  */
+
+/* Return true if OP combines outcome of static and
+   loop invariant conditional.  */
 
 static bool
-loop_iv_derived_p (class loop *loop,
-		    tree op)
+loop_combined_static_and_iv_p (class loop *loop,
+			       tree op)
 {
   /* Always check for invariant first.  */
   gcc_checking_assert (!is_gimple_min_invariant (op)
 		       && !SSA_NAME_IS_DEFAULT_DEF (op)
 		       && flow_bb_inside_loop_p (loop,
 			       gimple_bb (SSA_NAME_DEF_STMT (op))));
-  return gimple_uid (SSA_NAME_DEF_STMT (op)) & 1;
+  return gimple_uid (SSA_NAME_DEF_STMT (op)) & 4;
 }
 
 /* Check whether we should duplicate HEADER of LOOP.  At most *LIMIT
@@ -182,25 +224,18 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
       return false;
     }
 
+  path_range_query *query = NULL;
   for (gphi_iterator psi = gsi_start_phis (header); !gsi_end_p (psi);
        gsi_next (&psi))
-    /* If this is actual loop header PHIs indicate that the SSA_NAME
-       may be IV.  Otherwise just give up.  */
-    if (header == loop->header)
+    if (!virtual_operand_p (gimple_phi_result (psi.phi ())))
       {
-	gphi *phi = psi.phi ();
-	tree res = gimple_phi_result (phi);
-	if (INTEGRAL_TYPE_P (TREE_TYPE (res))
-	    || POINTER_TYPE_P (TREE_TYPE (res)))
-	  gimple_set_uid (phi, 1 /* IV */);
-	else
-	  gimple_set_uid (phi, 0);
+	bool static_p = loop_static_stmt_p (loop, *ranger,
+					    query, psi.phi ());
+	gimple_set_uid (psi.phi (), static_p ? 2 : 0);
       }
-    else
-      gimple_set_uid (psi.phi (), 0);
 
   /* Count number of instructions and punt on calls.
-     Populate stmts INV/IV flag to later apply heuristics to the
+     Populate stmts INV flag to later apply heuristics to the
      kind of conditions we want to copy.  */
   for (bsi = gsi_start_bb (header); !gsi_end_p (bsi); gsi_next (&bsi))
     {
@@ -215,6 +250,12 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
       if (gimple_code (last) == GIMPLE_COND)
 	break;
 
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "  Analyzing: ");
+	  print_gimple_stmt (dump_file, last, 0, TDF_SLIM);
+	}
+
       if (gimple_code (last) == GIMPLE_CALL
 	  && (!gimple_inexpensive_call_p (as_a <gcall *> (last))
 	      /* IFN_LOOP_DIST_ALIAS means that inner loop is distributed
@@ -225,53 +266,152 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
 	    fprintf (dump_file,
 		     "  Not duplicating bb %i: it contains call.\n",
 		     header->index);
+	  if (query)
+	    delete query;
 	  return false;
 	}
 
-      /* Classify the stmt based on whether its computation is based
-         on a IV or whether it is invariant in the loop.  */
+      /* Classify the stmt is invariant in the loop.  */
       gimple_set_uid (last, 0);
       if (!gimple_vuse (last)
 	  && gimple_code (last) != GIMPLE_ASM
 	  && (gimple_code (last) != GIMPLE_CALL
 	      || gimple_call_flags (last) & ECF_CONST))
 	{
-	  bool inv = true;
-	  bool iv = true;
+	  bool inv = true, static_p = false;
 	  ssa_op_iter i;
 	  tree op;
 	  FOR_EACH_SSA_TREE_OPERAND (op, last, i, SSA_OP_USE)
 	    if (!loop_invariant_op_p (loop, op))
-	      {
-		if (!loop_iv_derived_p (loop, op))
-		  {
-		    inv = false;
-		    iv = false;
-		    break;
-		  }
-		else
-		  inv = false;
-	      }
-	  gimple_set_uid (last, (iv ? 1 : 0) | (inv ? 2 : 0));
+	      inv = false;
+	  /* Assume that code is reasonably optimized and invariant
+	     constants are already identified.  */
+	  if (!inv)
+	    static_p = loop_static_stmt_p (loop, *ranger, query, last);
+	  gimple_set_uid (last, (inv ? 1 : 0) | (static_p ? 2 : 0));
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      if (inv)
+		fprintf (dump_file, "    Stmt is loop invariant\n");
+	      if (static_p)
+		fprintf (dump_file, "    Stmt is static "
+			 "(constant in the first iteration)\n");
+	    }
 	  /* Loop invariants will be optimized out in loop body after
 	     duplication; do not account invariant computation in code
-	     size costs.  */
-	  if (inv)
+	     size costs.
+
+	     Similarly static computations will be optimized out in the
+	     duplicatd header.  */
+	  if (inv || static_p)
 	    continue;
+
+	  /* Match the following:
+	     _1 = i_1 < 10   <- static condtion
+	     _2 = n != 0     <- loop invariant condition
+	     _3 = _1 & _2    <- combined static and iv statement.  */
+	  if (gimple_code (last) == GIMPLE_ASSIGN
+	      && (gimple_assign_rhs_code (last) == TRUTH_AND_EXPR
+		  || gimple_assign_rhs_code (last) == TRUTH_OR_EXPR
+		  || gimple_assign_rhs_code (last) == TRUTH_XOR_EXPR
+		  || gimple_assign_rhs_code (last) == BIT_AND_EXPR
+		  || gimple_assign_rhs_code (last) == BIT_IOR_EXPR
+		  || gimple_assign_rhs_code (last) == BIT_XOR_EXPR))
+	    {
+	      tree op1 = gimple_assign_rhs1 (last);
+	      tree op2 = gimple_assign_rhs2 (last);
+
+	      if ((loop_invariant_op_p (loop, op1)
+		   || loop_combined_static_and_iv_p (loop, op1)
+		   || loop_static_op_p (loop, op1))
+		  && (loop_invariant_op_p (loop, op2)
+		      || loop_combined_static_and_iv_p (loop, op2)
+		      || loop_static_op_p (loop, op2)))
+		{
+		  /* Duplicating loop header with combned conditional will
+		     remove this statement in each copy.  But we account for
+		     that later when seeing that condition.
+		     
+		     Note that this may be overly optimistic for bit operations
+		     where the static parameter may still result in non-trivial
+		     bit operation.  */
+		  gimple_set_uid (last, 4);
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file,
+			     "    Stmt combines static and invariant op\n");
+		  continue;
+		}
+	    }
 	}
 
-      *limit -= estimate_num_insns (last, &eni_size_weights);
+      int insns = estimate_num_insns (last, &eni_size_weights);
+      *limit -= insns;
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "    Acconting stmt as %i insns\n", insns);
       if (*limit < 0)
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,
 		     "  Not duplicating bb %i contains too many insns.\n",
 		     header->index);
+	  if (query)
+	    delete query;
 	  return false;
 	}
     }
 
-  edge static_exit = static_loop_exit (loop, header, ranger);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "  Analyzing: ");
+      print_gimple_stmt (dump_file, last, 0, TDF_SLIM);
+    }
+
+  /* If the condition tests a non-IV loop variant we do not want to rotate
+     the loop further.  Unless this is the original loop header.  */
+  tree lhs = gimple_cond_lhs (last);
+  tree rhs = gimple_cond_rhs (last);
+  bool lhs_invariant = loop_invariant_op_p (loop, lhs);
+  bool rhs_invariant = loop_invariant_op_p (loop, rhs);
+
+  /* Combined conditional is a result of if combining:
+
+     _1 = i_1 < 10   <- static condtion
+     _2 = n != 0     <- loop invariant condition
+     _3 = _1 & _2    <- combined static and iv statement
+     if (_3 != 0)    <- combined conditional
+
+     Combined conditionals will not be optimized out in either copy.
+     However duplicaed header simplifies to:
+
+     if (n < 10)
+
+     while loop body to
+
+     if (i_1 < 10)
+
+     So effectively the resulting code sequence will be of same length as
+     the original code.
+
+     Combined conditionals are never static or invariant, so save some work
+     below.  */
+  if (lhs_invariant != rhs_invariant
+      && (lhs_invariant
+	  || loop_combined_static_and_iv_p (loop, lhs))
+      && (rhs_invariant
+	  || loop_combined_static_and_iv_p (loop, rhs)))
+   {
+     if (query)
+       delete query;
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "  Conditional combines static and invariant op.\n");
+     return true;
+   }
+
+  edge static_exit = static_loop_exit (loop, header, *ranger, query);
+  if (query)
+    delete query;
 
   if (static_exit && static_exits)
     {
@@ -282,13 +422,6 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
 		 static_exit->src->index);
       /* Still look for invariant exits; exit may be both.  */
     }
-
-  /* If the condition tests a non-IV loop variant we do not want to rotate
-     the loop further.  Unless this is the original loop header.  */
-  tree lhs = gimple_cond_lhs (last);
-  tree rhs = gimple_cond_rhs (last);
-  bool lhs_invariant = loop_invariant_op_p (loop, lhs);
-  bool rhs_invariant = loop_invariant_op_p (loop, rhs);
   if (lhs_invariant && rhs_invariant)
     {
       if (invariant_exits)
@@ -312,7 +445,11 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
     return true;
 
   /* We was not able to prove that conditional will be eliminated.  */
-  *limit -= estimate_num_insns (last, &eni_size_weights);
+  int insns = estimate_num_insns (last, &eni_size_weights);
+  *limit -= insns;
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file,
+	     "    Acconting stmt as %i insns\n", insns);
   if (*limit < 0)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -322,12 +459,6 @@ should_duplicate_loop_header_p (basic_block header, class loop *loop,
       return false;
     }
 
-  /* TODO: This is heuristics that claims that IV based ocnditionals will
-     likely be optimized out in duplicated header.  We could use ranger
-     query instead to tell this more precisely.  */
-  if ((lhs_invariant || loop_iv_derived_p (loop, lhs))
-      && (rhs_invariant || loop_iv_derived_p (loop, rhs)))
-    return true;
   if (header != loop->header)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -550,7 +681,7 @@ public:
 
   /* opt_pass methods: */
   bool gate (function *) final override { return flag_tree_ch != 0; }
-  
+
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   unsigned int execute (function *) final override;
 
@@ -590,7 +721,7 @@ public:
     return flag_tree_ch != 0
 	   && (flag_tree_loop_vectorize != 0 || fun->has_force_vectorize_loops);
   }
-  
+
   /* Just copy headers, no initialization/finalization of loop structures.  */
   unsigned int execute (function *) final override;
 
@@ -973,7 +1104,7 @@ pass_ch::execute (function *fun)
 /* Assume an earlier phase has already initialized all the loop structures that
    we need here (and perhaps others too), and that these will be finalized by
    a later phase.  */
-   
+
 unsigned int
 pass_ch_vect::execute (function *fun)
 {

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-08-23  8:58 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-14 12:22 Loop-ch improvements, part 3 Jan Hubicka
2023-07-17  9:17 ` Richard Biener
2023-08-22  5:24   ` Hongtao Liu
2023-08-22  7:53     ` Richard Biener
2023-08-22 12:03       ` Jan Hubicka
2023-08-23  8:34       ` Jan Hubicka
2023-08-23  8:58         ` Richard Biener
2023-07-20  7:09 loop-ch " Jan Hubicka
2023-07-20 13:03 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).