public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Refactor back_threader_profitability
@ 2022-08-16 14:04 Richard Biener
  0 siblings, 0 replies; 11+ messages in thread
From: Richard Biener @ 2022-08-16 14:04 UTC (permalink / raw)
  To: gcc-patches

The following refactors profitable_path_p in the backward threader,
splitting out parts that can be computed once the exit block is known
and parts that are continuously updated and can be checked while still
allowing the path to be later identified as FSM with larger limits,
possibly_profitable_path_p, plus final checks done when the whole
path is known, profitable_path_p.

I've removed the back_threader_profitability instance from the
back_threader class and instead instantiate it once per path
discovery.  I've kept the size compute non-incremental to simplify
the patch and not worry about unwinding.

There are key changes to previous behavior - namely we apply
the param_max_jump_thread_duplication_stmts limit early only when
we know the path cannot become an FSM one (multiway + thread through
latch) but make sure to elide the path query when we didn't
yet discover that but are over this limit.  Similarly the
speed limit is now used even when we did not yet discover a
hot BB on the path.  Basically the idea is to only stop path
discovery when we know the path will never become profitable
but avoid the expensive path range query when we know it's
currently not.

I've done a few cleanups, merging functions, on the way.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Statistics show an overall slight increase in threading but
looking at different files there's noise up and down.  That's
somewhat expected since we now apply the "more correct"
limits in the end.  Unless I made big mistakes of course.

The next thing cost-wise would be to raise the backwards
threading limit to the limit of DOM so we don't get
artificially high counts for that.

OK?

Thanks,
Richard.

	* tree-ssa-threadbackward.cc
	(back_threader_profitability): Split profitable_path_p
	into possibly_profitable_path_p and itself, keep state
	as new members.
	(back_threader::m_profit): Remove.
	(back_threader::find_paths): Likewise.
	(back_threader::maybe_register_path): Take profitability
	instance as parameter.
	(back_threader::find_paths_to_names): Likewise.  Use
	possibly_profitable_path_p and avoid the path range query
	when the path is currently too large.
	(back_threader::find_paths): Fold into ...
	(back_threader::maybe_thread_block): ... this.
	(get_gimple_control_stmt): Remove.
	(back_threader_profitability::possibly_profitable_path_p):
	Split out from profitable_path_p, do early profitability
	checks.
	(back_threader_profitability::profitable_path_p): Do final
	profitability checks after the taken edge has been determined.
---
 gcc/tree-ssa-threadbackward.cc | 350 ++++++++++++++++++---------------
 1 file changed, 192 insertions(+), 158 deletions(-)

diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
index 1c362839fab..077a767e73d 100644
--- a/gcc/tree-ssa-threadbackward.cc
+++ b/gcc/tree-ssa-threadbackward.cc
@@ -61,15 +61,33 @@ public:
 class back_threader_profitability
 {
 public:
-  back_threader_profitability (bool speed_p)
-    : m_speed_p (speed_p)
-  { }
-  bool profitable_path_p (const vec<basic_block> &, tree name, edge taken,
-			  bool *irreducible_loop = NULL);
+  back_threader_profitability (bool speed_p, gimple *stmt);
+  bool possibly_profitable_path_p (const vec<basic_block> &, tree, bool *);
+  bool profitable_path_p (const vec<basic_block> &,
+			  edge taken, bool *irreducible_loop);
 private:
   const bool m_speed_p;
+  int m_exit_jump_benefit;
+  bool m_threaded_multiway_branch;
+  // The following are computed by possibly_profitable_path_p
+  bool m_threaded_through_latch;
+  bool m_multiway_branch_in_path;
+  bool m_contains_hot_bb;
+  int m_n_insns;
 };
 
+back_threader_profitability::back_threader_profitability (bool speed_p,
+							  gimple *last)
+  : m_speed_p (speed_p)
+{
+  m_threaded_multiway_branch = (gimple_code (last) == GIMPLE_SWITCH
+				|| gimple_code (last) == GIMPLE_GOTO);
+  // The forward threader has estimate_threading_killed_stmts, in
+  // particular it estimates further DCE from eliminating the exit
+  // control stmt.
+  m_exit_jump_benefit = estimate_num_insns (last, &eni_size_weights);
+}
+
 // Back threader flags.
 #define BT_NONE 0
 // Generate fast code at the expense of code size.
@@ -86,11 +104,11 @@ public:
   unsigned thread_blocks ();
 private:
   void maybe_thread_block (basic_block bb);
-  void find_paths (basic_block bb, tree name);
   bool debug_counter ();
-  edge maybe_register_path ();
+  edge maybe_register_path (back_threader_profitability &);
   void maybe_register_path_dump (edge taken_edge);
-  void find_paths_to_names (basic_block bb, bitmap imports, unsigned);
+  void find_paths_to_names (basic_block bb, bitmap imports, unsigned,
+			    back_threader_profitability &);
   edge find_taken_edge (const vec<basic_block> &path);
   edge find_taken_edge_cond (const vec<basic_block> &path, gcond *);
   edge find_taken_edge_switch (const vec<basic_block> &path, gswitch *);
@@ -98,7 +116,6 @@ private:
   virtual void dump (FILE *out);
 
   back_threader_registry m_registry;
-  back_threader_profitability m_profit;
   path_range_query *m_solver;
 
   // Current path being analyzed.
@@ -131,8 +148,7 @@ private:
 const edge back_threader::UNREACHABLE_EDGE = (edge) -1;
 
 back_threader::back_threader (function *fun, unsigned flags, bool first)
-  : m_profit (flags & BT_SPEED),
-    m_first (first)
+  : m_first (first)
 {
   if (flags & BT_SPEED)
     loop_optimizer_init (LOOPS_HAVE_PREHEADERS | LOOPS_HAVE_SIMPLE_LATCHES);
@@ -226,7 +242,7 @@ back_threader::maybe_register_path_dump (edge taken)
 // unreachable.
 
 edge
-back_threader::maybe_register_path ()
+back_threader::maybe_register_path (back_threader_profitability &profit)
 {
   edge taken_edge = find_taken_edge (m_path);
 
@@ -241,8 +257,7 @@ back_threader::maybe_register_path ()
       else
 	{
 	  bool irreducible = false;
-	  if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
-					  &irreducible)
+	  if (profit.profitable_path_p (m_path, taken_edge, &irreducible)
 	      && debug_counter ()
 	      && m_registry.register_path (m_path, taken_edge))
 	    {
@@ -342,17 +357,21 @@ back_threader::find_taken_edge_cond (const vec<basic_block> &path,
 
 void
 back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
-				    unsigned overall_paths)
+				    unsigned overall_paths,
+				    back_threader_profitability &profit)
 {
   if (m_visited_bbs.add (bb))
     return;
 
   m_path.safe_push (bb);
 
-  // Try to resolve the path without looking back.
+  // Try to resolve the path without looking back.  Avoid resolving paths
+  // we know are large but not (yet) FSM.
+  bool large_non_fsm;
   if (m_path.length () > 1
-      && (!m_profit.profitable_path_p (m_path, m_name, NULL)
-	  || maybe_register_path ()))
+      && (!profit.possibly_profitable_path_p (m_path, m_name, &large_non_fsm)
+	  || (!large_non_fsm
+	      && maybe_register_path (profit))))
     ;
 
   // The backwards thread copier cannot copy blocks that do not belong
@@ -474,7 +493,8 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
 			}
 		    }
 		}
-	      find_paths_to_names (e->src, new_interesting, overall_paths);
+	      find_paths_to_names (e->src, new_interesting, overall_paths,
+				   profit);
 	      // Restore new_interesting.
 	      for (int def : unwind)
 		bitmap_clear_bit (new_interesting, def);
@@ -499,66 +519,6 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
   m_visited_bbs.remove (bb);
 }
 
-// Search backwards from BB looking for paths where the final
-// conditional out of BB can be determined.  NAME is the LHS of the
-// final conditional.  Register such paths for jump threading.
-
-void
-back_threader::find_paths (basic_block bb, tree name)
-{
-  gimple *stmt = last_stmt (bb);
-  if (!stmt
-      || (gimple_code (stmt) != GIMPLE_COND
-	  && gimple_code (stmt) != GIMPLE_SWITCH))
-    return;
-
-  if (EDGE_COUNT (bb->succs) > 1)
-    {
-      m_last_stmt = stmt;
-      m_visited_bbs.empty ();
-      m_path.truncate (0);
-      m_name = name;
-
-      // We compute imports of the path during discovery starting
-      // just with names used in the conditional.
-      bitmap_clear (m_imports);
-      ssa_op_iter iter;
-      FOR_EACH_SSA_TREE_OPERAND (name, stmt, iter, SSA_OP_USE)
-	{
-	  if (!gimple_range_ssa_p (name))
-	    return;
-	  bitmap_set_bit (m_imports, SSA_NAME_VERSION (name));
-	}
-
-      // Interesting is the set of imports we still not have see
-      // the definition of.  So while imports only grow, the
-      // set of interesting defs dwindles and once empty we can
-      // stop searching.
-      auto_bitmap interesting;
-      bitmap_copy (interesting, m_imports);
-      find_paths_to_names (bb, interesting, 1);
-    }
-}
-
-// Simple helper to get the last statement from BB, which is assumed
-// to be a control statement.  Return NULL if the last statement is
-// not a control statement.
-
-static gimple *
-get_gimple_control_stmt (basic_block bb)
-{
-  gimple_stmt_iterator gsi = gsi_last_nondebug_bb (bb);
-
-  if (gsi_end_p (gsi))
-    return NULL;
-
-  gimple *stmt = gsi_stmt (gsi);
-  enum gimple_code code = gimple_code (stmt);
-  if (code == GIMPLE_COND || code == GIMPLE_SWITCH || code == GIMPLE_GOTO)
-    return stmt;
-  return NULL;
-}
-
 // Search backwards from BB looking for paths where the final
 // conditional maybe threaded to a successor block.  Record such paths
 // for jump threading.
@@ -566,7 +526,10 @@ get_gimple_control_stmt (basic_block bb)
 void
 back_threader::maybe_thread_block (basic_block bb)
 {
-  gimple *stmt = get_gimple_control_stmt (bb);
+  if (EDGE_COUNT (bb->succs) <= 1)
+    return;
+
+  gimple *stmt = last_stmt (bb);
   if (!stmt)
     return;
 
@@ -576,12 +539,33 @@ back_threader::maybe_thread_block (basic_block bb)
     name = gimple_switch_index (as_a <gswitch *> (stmt));
   else if (code == GIMPLE_COND)
     name = gimple_cond_lhs (stmt);
-  else if (code == GIMPLE_GOTO)
-    name = gimple_goto_dest (stmt);
   else
-    name = NULL;
+    return;
 
-  find_paths (bb, name);
+  m_last_stmt = stmt;
+  m_visited_bbs.empty ();
+  m_path.truncate (0);
+  m_name = name;
+
+  // We compute imports of the path during discovery starting
+  // just with names used in the conditional.
+  bitmap_clear (m_imports);
+  ssa_op_iter iter;
+  FOR_EACH_SSA_TREE_OPERAND (name, stmt, iter, SSA_OP_USE)
+    {
+      if (!gimple_range_ssa_p (name))
+	return;
+      bitmap_set_bit (m_imports, SSA_NAME_VERSION (name));
+    }
+
+  // Interesting is the set of imports we have not yet seen
+  // the definition of.  So while imports only grow, the
+  // set of interesting defs dwindles and once empty we can
+  // stop searching.
+  auto_bitmap interesting;
+  bitmap_copy (interesting, m_imports);
+  back_threader_profitability profit (m_flags & BT_SPEED, stmt);
+  find_paths_to_names (bb, interesting, 1, profit);
 }
 
 DEBUG_FUNCTION void
@@ -615,26 +599,21 @@ back_threader::debug ()
   dump (stderr);
 }
 
-/* Examine jump threading path PATH and return TRUE if it is profitable to
-   thread it, otherwise return FALSE.
+/* Examine jump threading path PATH and return TRUE if it is possibly
+   profitable to thread it, otherwise return FALSE.  Indicate in
+   *LARGE_NON_FSM whether the thread is too large for a non-fsm thread
+   but would be OK if we discover a backedge on our way.
 
    NAME is the SSA_NAME of the variable we found to have a constant
    value on PATH.  If unknown, SSA_NAME is NULL.
 
-   If the taken edge out of the path is known ahead of time it is passed in
-   TAKEN_EDGE, otherwise it is NULL.
-
-   CREATES_IRREDUCIBLE_LOOP, if non-null is set to TRUE if threading this path
-   would create an irreducible loop.
-
    ?? It seems we should be able to loosen some of the restrictions in
    this function after loop optimizations have run.  */
 
 bool
-back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
-						tree name,
-						edge taken_edge,
-						bool *creates_irreducible_loop)
+back_threader_profitability::possibly_profitable_path_p
+				  (const vec<basic_block> &m_path, tree name,
+				   bool *large_non_fsm)
 {
   gcc_checking_assert (!m_path.is_empty ());
 
@@ -650,13 +629,15 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
   if (m_path.length () <= 1)
       return false;
 
-  int n_insns = 0;
   gimple_stmt_iterator gsi;
   loop_p loop = m_path[0]->loop_father;
-  bool threaded_through_latch = false;
-  bool multiway_branch_in_path = false;
-  bool threaded_multiway_branch = false;
-  bool contains_hot_bb = false;
+
+  // We recompute the following; when we rewrite possibly_profitable_path_p
+  // to work incrementally on added BBs we have to unwind them on backtracking.
+  m_n_insns = 0;
+  m_threaded_through_latch = false;
+  m_multiway_branch_in_path = false;
+  m_contains_hot_bb = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Checking profitability of path (backwards): ");
@@ -679,7 +660,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	 it ends in a multiway branch.  */
       if (j < m_path.length () - 1)
 	{
-	  int orig_n_insns = n_insns;
+	  int orig_n_insns = m_n_insns;
 	  /* PHIs in the path will create degenerate PHIS in the
 	     copied path which will then get propagated away, so
 	     looking at just the duplicate path the PHIs would
@@ -714,12 +695,12 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 		      && (SSA_NAME_VAR (dst) != SSA_NAME_VAR (name)
 			  || !SSA_NAME_VAR (dst))
 		      && !virtual_operand_p (dst))
-		    ++n_insns;
+		    ++m_n_insns;
 		}
 	    }
 
-	  if (!contains_hot_bb && m_speed_p)
-	    contains_hot_bb |= optimize_bb_for_speed_p (bb);
+	  if (!m_contains_hot_bb && m_speed_p)
+	    m_contains_hot_bb |= optimize_bb_for_speed_p (bb);
 	  for (gsi = gsi_after_labels (bb);
 	       !gsi_end_p (gsi);
 	       gsi_next_nondebug (&gsi))
@@ -735,10 +716,10 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	      /* Do not count empty statements and labels.  */
 	      if (gimple_code (stmt) != GIMPLE_NOP
 		  && !is_gimple_debug (stmt))
-		n_insns += estimate_num_insns (stmt, &eni_size_weights);
+		m_n_insns += estimate_num_insns (stmt, &eni_size_weights);
 	    }
 	  if (dump_file && (dump_flags & TDF_DETAILS))
-	    fprintf (dump_file, " (%i insns)", n_insns-orig_n_insns);
+	    fprintf (dump_file, " (%i insns)", m_n_insns-orig_n_insns);
 
 	  /* We do not look at the block with the threaded branch
 	     in this loop.  So if any block with a last statement that
@@ -747,14 +728,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 
 	     The block in PATH[0] is special, it's the block were we're
 	     going to be able to eliminate its branch.  */
-	  gimple *last = last_stmt (bb);
-	  if (last && (gimple_code (last) == GIMPLE_SWITCH
-		       || gimple_code (last) == GIMPLE_GOTO))
+	  if (j > 0)
 	    {
-	      if (j == 0)
-		threaded_multiway_branch = true;
-	      else
-		multiway_branch_in_path = true;
+	      gimple *last = last_stmt (bb);
+	      if (last
+		  && (gimple_code (last) == GIMPLE_SWITCH
+		      || gimple_code (last) == GIMPLE_GOTO))
+		m_multiway_branch_in_path = true;
 	    }
 	}
 
@@ -763,50 +743,29 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	 through the loop latch.  */
       if (loop->latch == bb)
 	{
-	  threaded_through_latch = true;
+	  m_threaded_through_latch = true;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, " (latch)");
 	}
     }
 
-  gimple *stmt = get_gimple_control_stmt (m_path[0]);
-  gcc_assert (stmt);
-
   /* We are going to remove the control statement at the end of the
      last block in the threading path.  So don't count it against our
      statement count.  */
-
-  int stmt_insns = estimate_num_insns (stmt, &eni_size_weights);
-  n_insns-= stmt_insns;
+  m_n_insns -= m_exit_jump_benefit;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "\n  Control statement insns: %i\n"
 	     "  Overall: %i insns\n",
-	     stmt_insns, n_insns);
-
-  if (creates_irreducible_loop)
-    {
-      /* If this path threaded through the loop latch back into the
-	 same loop and the destination does not dominate the loop
-	 latch, then this thread would create an irreducible loop.  */
-      *creates_irreducible_loop = false;
-      if (taken_edge
-	  && threaded_through_latch
-	  && loop == taken_edge->dest->loop_father
-	  && (determine_bb_domination_status (loop, taken_edge->dest)
-	      == DOMST_NONDOMINATING))
-	*creates_irreducible_loop = true;
-    }
+	     m_exit_jump_benefit, m_n_insns);
 
   /* Threading is profitable if the path duplicated is hot but also
      in a case we separate cold path from hot path and permit optimization
      of the hot path later.  Be on the agressive side here. In some testcases,
      as in PR 78407 this leads to noticeable improvements.  */
-  if (m_speed_p
-      && ((taken_edge && optimize_edge_for_speed_p (taken_edge))
-	  || contains_hot_bb))
+  if (m_speed_p)
     {
-      if (n_insns >= param_max_fsm_thread_path_insns)
+      if (m_n_insns >= param_max_fsm_thread_path_insns)
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
@@ -814,29 +773,104 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 		     "exceeds PARAM_MAX_FSM_THREAD_PATH_INSNS.\n");
 	  return false;
 	}
-      if (taken_edge && probably_never_executed_edge_p (cfun, taken_edge))
+      edge entry = find_edge (m_path[m_path.length () - 1],
+			      m_path[m_path.length () - 2]);
+      if (probably_never_executed_edge_p (cfun, entry))
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
-		     "path leads to probably never executed edge.\n");
+		     "path entry is probably never executed.\n");
 	  return false;
 	}
-      edge entry = find_edge (m_path[m_path.length () - 1],
-			      m_path[m_path.length () - 2]);
-      if (probably_never_executed_edge_p (cfun, entry))
+    }
+  else if (m_n_insns > 1)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
+		 "duplication of %i insns is needed and optimizing for size.\n",
+		 m_n_insns);
+      return false;
+    }
+
+  /* The generic copier used by the backthreader does not re-use an
+     existing threading path to reduce code duplication.  So for that
+     case, drastically reduce the number of statements we are allowed
+     to copy.  We don't know yet whether we will thread through the latch
+     so we have to be permissive and continue threading, but indicate
+     to the caller the thread, if final, wouldn't be profitable.  */
+  if ((!m_threaded_multiway_branch
+       || !loop->latch
+       || loop->latch->index == EXIT_BLOCK)
+      && (m_n_insns * param_fsm_scale_path_stmts
+	  >= param_max_jump_thread_duplication_stmts))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "  FAIL: Did not thread around loop and would copy too "
+		 "many statements.\n");
+      return false;
+    }
+  *large_non_fsm = (!(m_threaded_through_latch && m_threaded_multiway_branch)
+		    && (m_n_insns * param_fsm_scale_path_stmts
+			>= param_max_jump_thread_duplication_stmts));
+
+  return true;
+}
+
+/* Examine jump threading path PATH and return TRUE if it is profitable to
+   thread it, otherwise return FALSE.
+
+   The taken edge out of the path is TAKEN_EDGE.
+
+   CREATES_IRREDUCIBLE_LOOP is set to TRUE if threading this path
+   would create an irreducible loop.
+
+   ?? It seems we should be able to loosen some of the restrictions in
+   this function after loop optimizations have run.  */
+
+bool
+back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
+						edge taken_edge,
+						bool *creates_irreducible_loop)
+{
+  // We can assume that possibly_profitable_path_p holds here
+
+  loop_p loop = m_path[0]->loop_father;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "Checking profitability of path (backwards): ");
+
+  /* If this path threaded through the loop latch back into the
+     same loop and the destination does not dominate the loop
+     latch, then this thread would create an irreducible loop.  */
+  *creates_irreducible_loop = false;
+  if (m_threaded_through_latch
+      && loop == taken_edge->dest->loop_father
+      && (determine_bb_domination_status (loop, taken_edge->dest)
+	  == DOMST_NONDOMINATING))
+    *creates_irreducible_loop = true;
+
+  /* Threading is profitable if the path duplicated is hot but also
+     in a case we separate cold path from hot path and permit optimization
+     of the hot path later.  Be on the agressive side here. In some testcases,
+     as in PR 78407 this leads to noticeable improvements.  */
+  if (m_speed_p
+      && (optimize_edge_for_speed_p (taken_edge) || m_contains_hot_bb))
+    {
+      if (probably_never_executed_edge_p (cfun, taken_edge))
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
-		     "path entry is probably never executed.\n");
+		     "path leads to probably never executed edge.\n");
 	  return false;
 	}
     }
-  else if (n_insns > 1)
+  else if (m_n_insns > 1)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
 		 "duplication of %i insns is needed and optimizing for size.\n",
-		 n_insns);
+		 m_n_insns);
       return false;
     }
 
@@ -849,10 +883,9 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
      the path -- in that case there's little the traditional loop
      optimizer would have done anyway, so an irreducible loop is not
      so bad.  */
-  if (!threaded_multiway_branch
-      && creates_irreducible_loop
+  if (!m_threaded_multiway_branch
       && *creates_irreducible_loop
-      && (n_insns * (unsigned) param_fsm_scale_path_stmts
+      && (m_n_insns * (unsigned) param_fsm_scale_path_stmts
 	  > (m_path.length () *
 	     (unsigned) param_fsm_scale_path_blocks)))
 
@@ -861,6 +894,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	fprintf (dump_file,
 		 "  FAIL: Would create irreducible loop without threading "
 		 "multiway branch.\n");
+      /* We compute creates_irreducible_loop only late.  */
       return false;
     }
 
@@ -868,8 +902,8 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
      existing threading path to reduce code duplication.  So for that
      case, drastically reduce the number of statements we are allowed
      to copy.  */
-  if (!(threaded_through_latch && threaded_multiway_branch)
-      && (n_insns * param_fsm_scale_path_stmts
+  if (!(m_threaded_through_latch && m_threaded_multiway_branch)
+      && (m_n_insns * param_fsm_scale_path_stmts
 	  >= param_max_jump_thread_duplication_stmts))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -883,7 +917,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
      explode the CFG due to duplicating the edges for that multi-way
      branch.  So like above, only allow a multi-way branch on the path
      if we actually thread a multi-way branch.  */
-  if (!threaded_multiway_branch && multiway_branch_in_path)
+  if (!m_threaded_multiway_branch && m_multiway_branch_in_path)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
@@ -896,20 +930,20 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
      the latch.  This could alter the loop form sufficiently to cause
      loop optimizations to fail.  Disable these threads until after
      loop optimizations have run.  */
-  if ((threaded_through_latch
-       || (taken_edge && taken_edge->dest == loop->latch))
+  if ((m_threaded_through_latch || taken_edge->dest == loop->latch)
       && !(cfun->curr_properties & PROP_loop_opts_done)
       && empty_block_p (loop->latch))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
-		 "  FAIL: Thread through latch before loop opts would create non-empty latch\n");
+		 "  FAIL: Thread through latch before loop opts would create "
+		 "non-empty latch\n");
       return false;
-
     }
   return true;
 }
 
+
 /* The current path PATH is a vector of blocks forming a jump threading
    path in reverse order.  TAKEN_EDGE is the edge taken from path[0].
 
-- 
2.35.3

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Refactor back_threader_profitability
  2022-08-17  7:31 ` Aldy Hernandez
  2022-08-17  7:53   ` Richard Biener
@ 2022-08-19 16:02   ` Jeff Law
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff Law @ 2022-08-19 16:02 UTC (permalink / raw)
  To: gcc-patches



On 8/17/2022 1:31 AM, Aldy Hernandez via Gcc-patches wrote:
> I just have a few high level comments.
>
> On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
>> The following refactors profitable_path_p in the backward threader,
>> splitting out parts that can be computed once the exit block is known,
>> parts that contiguously update and that can be checked allowing
>> for the path to be later identified as FSM with larger limits,
>> possibly_profitable_path_p, and final checks done when the whole
>> path is known, profitable_path_p.
> I thought we were removing references to FSM, as they were leftovers
> from some previous incarnation.  For that matter, I don't think I ever
> understood what they are, so if we're gonna keep them, could you
> comment what makes FSM threads different from other threads?
FSM refers to the initial implementation from Steve E. IIRC.  It was 
designed to handle threading backedges in a loop where the path out of 
the current iteration would tell us where a multi-way branch in the next 
loop iteration would go.

During the integration of Steve's work it was recognized that the 
backwards walk was generally a better model and we started moving to the 
backwards based threader with the goal of removing the forward threader.

There should be tests in the testsuite which validate that we haven't 
lost the key transformation.  ssa-thread-backedge.c and pr77445-2.c

> The DOM threader has limits?  I thought most of those limits were just
> due to the fact that it couldn't determine long enough paths?  Either
> way, I like that we're merging the necessary forward threader bits
> here, in preparation for its demise ;-).
The forward threader has structural limitations due to its custom block 
copier and CFG update code as well as profitability limitations.

jeff



* Re: [PATCH] Refactor back_threader_profitability
  2022-08-17  8:59           ` Richard Biener
@ 2022-08-17  9:27             ` Aldy Hernandez
  0 siblings, 0 replies; 11+ messages in thread
From: Aldy Hernandez @ 2022-08-17  9:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On Wed, Aug 17, 2022 at 10:59 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 17 Aug 2022, Aldy Hernandez wrote:
>
> > On Wed, Aug 17, 2022 at 10:38 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> > >
> > > > On Wed, Aug 17, 2022 at 9:54 AM Richard Biener <rguenther@suse.de> wrote:
> > > > >
> > > > > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> > > > >
> > > > > > I just have a few high level comments.
> > > > > >
> > > > > > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
> > > > > > >
> > > > > > > The following refactors profitable_path_p in the backward threader,
> > > > > > > splitting out parts that can be computed once the exit block is known,
> > > > > > > parts that contiguously update and that can be checked allowing
> > > > > > > for the path to be later identified as FSM with larger limits,
> > > > > > > possibly_profitable_path_p, and final checks done when the whole
> > > > > > > path is known, profitable_path_p.
> > > > > >
> > > > > > I thought we were removing references to FSM, as they were leftovers
> > > > > > from some previous incarnation.  For that matter, I don't think I ever
> > > > > > understood what they are, so if we're gonna keep them, could you
> > > > > > comment what makes FSM threads different from other threads?
> > > > >
> > > > > I don't know exactly what 'FSM' stands for but the FSM threader
> > > > > specifically tried to cover
> > > >
> > > > It probably stands for finite state machine then?
> > > >
> > > > >
> > > > > for (;;)
> > > > >  {
> > > > >    switch (state)
> > > > >     {
> > > > >     case 1:
> > > > >       /* sth */
> > > > >       state = 3;
> > > > >       break;
> > > > >     ...
> > > > >     case 3:
> > > > >       ...
> > > > >     }
> > > > > }
> > > > >
> > > > > so optimizing state machine transitions.  That is, these are all
> > > > > multiway (switch or goto, but goto support has been removed with the
> > > > > ranger rewrite it seems) and the thread path going through the
> > > >
> > > > ISTR going through the computed goto path in the old code, and it
> > > > never got triggered.  I just didn't get around to removing the
> > > > references to GIMPLE_GOTO.  At least, the old threader not once got a
> > > > thread we didn't get with the new code, even through a full Fedora
> > > > build.  I could be wrong though, but that's my recollection.
> > >
> > > When one massages the threader to even consider gotos we eventually
> > > run into find_taken_edge not handling them because range_of_expr
> > > computes the range of 'state' as
> > >
> > > [irange] void * [1, +INF]$4 = void
> > >
> > > rather than &&E.
> > >
> > > A testcase would be
> > >
> > > unsigned g;
> > > void FSM (int start)
> > > {
> > >   void *states[] = { &&A, &&B, &&C, &&E };
> > >   void *state = states[start];
> > >
> > >   do {
> > >   goto *state;
> > >
> > > A:
> > >   g += 1;
> > >   state = g & 1 ? &&B : &&E;
> > >   continue;
> > >
> > > B:
> > >   g += 2;
> > >   state = &&C;
> > >   continue;
> > >
> > > C:
> > >   g += 3;
> > >   state = states[g & 3];
> > >   continue;
> > >
> > > E:
> > >   break;
> > >   } while (1);
> > > }
> > >
> > > where we'd expect to thread the B -> C state transition.  I checked
> > > gcc 7 and it doesn't do that - not sure if it broke somewhen
> > > on the way or if it was just never working.  I'll file a bugreport,
> > > OTOH &label and goto *p are both GNU extensions, so not sure if it
> > > is worth optimizing.  I'll attach the "simple" enabler I have.
> >
> > Yeah, that's what I thought.  I seem to remember tracing through the
> > old code and never finding a path that would handle computed gotos.
> > Either way, thanks for distilling a testcase.  If you could file a PR
> > and CC me on it, that'd be great.
> >
> > When I contribute prange (irange for pointers), the plan is to have it
> > handle pointer equivalences, and we should be able to get the address
> > from the prange itself.
>
> Well, ISTR seeing &decl in integer ranges so I wondered why it doesn't
> just work ;)  tracing ranger shows

&decl in integer ranges are leftovers from legacy evrp/VRP.  Iranges
are constant only, but we make an exception when
irange::legacy_mode_p() is true so legacy VRP can fit in whatever it
needs.  But ranger will never put a non-integer in an irange.  There's
probably an assert to make sure a symbolic never slips into a "modern"
irange.

The proof of concept I have for prange still has integer endpoints
(well, zero, nonzero, since that's what we care about).  But range-ops
will also keep track of equivalences that we can query with
prange::get_equiv() or some such.

So yeah, ranger will never give you &decl with irange, but prange may
give us what we need.
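
As a toy illustration (this is not GCC code -- the class name, its
members, and get_equiv() are hypothetical stand-ins, per the "or some
such" above), a pointer range that only tracks null/non-null but
carries an optional address equivalence might look like:

```cpp
#include <optional>

// Toy sketch only: "toy_prange" and get_equiv() are hypothetical
// stand-ins for the prange idea discussed above, not real GCC
// classes.  The numeric state is just null / non-null, but the range
// can additionally record that the pointer is known equal to one
// specific address (an opaque id standing in for e.g. &&C).
class toy_prange
{
public:
  enum kind { VARYING, NONNULL, NULLP };

  toy_prange () : m_kind (VARYING) {}

  void set_nonnull () { m_kind = NONNULL; }
  void set_null () { m_kind = NULLP; m_equiv.reset (); }

  // Record that the pointer equals one specific address, e.g. a
  // label address; a label address is never null.
  void set_equiv (int addr_id) { m_kind = NONNULL; m_equiv = addr_id; }

  // What find_taken_edge would want to ask: is there a single known
  // address this pointer must be equal to?
  std::optional<int> get_equiv () const { return m_equiv; }
  bool nonnull_p () const { return m_kind == NONNULL; }

private:
  kind m_kind;
  std::optional<int> m_equiv;
};
```

find_taken_edge could then ask get_equiv() for the single known label
address instead of settling for the plain non-null range it gets
today.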

Aldy

>
> Checking profitability of path (backwards):  bb:11 (0 insns) bb:4 (latch)
>   Control statement insns: 0
>   Overall: 0 insns
> 1        range_of_expr(state_9) at stmt state_9 = PHI <&E(6), state_23(9),
> &C(8), &B(5)>
> 2          range_of_stmt (state_9) at stmt state_9 = PHI <&E(6),
> state_23(9), &C(8), &B(5)>
> 3            ROS dependence fill
>                ROS dep fill (state_9) at stmt state_9 = PHI <&E(6),
> state_23(9), &C(8), &B(5)>
>              FALSE : (3) ROS  (state_9)
> 4            range_on_edge (&E) on edge 6->4
>              TRUE : (4) range_on_edge (&E) [irange] void * [1, +INF]
>
> so why's the range not &E but non-NULL instead?  Seems to be
> range_query::get_tree_range's doing.
>
> Richard.
>
> > Aldy
> >
> > >
> > > > > loop backedge.  This is why FSM has different size limits because
> > > > > it was thought those threadings are especially worthwhile and
> > > > > they tend to be larger.  To make discovery cheaper a TODO item
> > > > > would be to skip to the loop backedge when we reach the regular
> > > > > thread limit and only continue the real search there (and
> > > > > pick the "smallest" ways through any diamonds on the way).
> > > >
> > > > Ah, thanks.  If you could, add some similar blurb, it'd be great.  I'm
> > > > sure we'll all forget it at some time (well, I will :)).
> > >
> > > Sure ;)
> > >
> > > Richard.
> > >
> > > > Aldy
> > > >
> > > > >
> > > > > > In your possibly_profitable_path_p function, could you document a bit
> > > > > > better what's the difference between profitable_path_p and
> > > > > > possibly_profitable_path_p?
> > > > >
> > > > > Sure.
> > > > >
> > > > > > >
> > > > > > > I've removed the back_threader_profitability instance from the
> > > > > > > back_threader class and instead instantiate it once per path
> > > > > > > discovery.  I've kept the size compute non-incremental to simplify
> > > > > > > the patch and not worry about unwinding.
> > > > > > >
> > > > > > > There are key changes to previous behavior - namely we apply
> > > > > > > the param_max_jump_thread_duplication_stmts early only when
> > > > > > > we know the path cannot become an FSM one (multiway + thread through
> > > > > > > latch) but make sure to elide the path query when we didn't
> > > > > > > yet discover that but are over this limit.  Similarly the
> > > > > > > speed limit is now used even when we did not yet discover a
> > > > > > > hot BB on the path.  Basically the idea is to only stop path
> > > > > > > discovery when we know the path will never become profitable
> > > > > > > but avoid the expensive path range query when we know it's
> > > > > > > currently not.
> > > > > > >
> > > > > > > I've done a few cleanups, merging functions, on the way.
> > > > > > >
> > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > >
> > > > > > > Statistics show an overall slight increase in threading but
> > > > > > > looking at different files there's noise up and down.  That's
> > > > > > > somewhat expected since we now are applying the "more correct"
> > > > > > > limits in the end.  Unless I made big mistakes of course.
> > > > > > >
> > > > > > > The next thing cost-wise would be to raise the backwards
> > > > > > > threading limit to the limit of DOM so we don't get
> > > > > > > artificially high counts for that.
> > > > > >
> > > > > > The DOM threader has limits?  I thought most of those limits were just
> > > > > > due to the fact that it couldn't determine long enough paths?  Either
> > > > > > way, I like that we're merging the necessary forward threader bits
> > > > > > here, in preparation for its demise ;-).
> > > > >
> > > > > Both use param_max_jump_thread_duplication_stmts, but the backwards
> > > > > threader applies this limit to non-FSM threads with
> > > > >
> > > > >   /* The generic copier used by the backthreader does not re-use an
> > > > >      existing threading path to reduce code duplication.  So for that
> > > > >      case, drastically reduce the number of statements we are allowed
> > > > >      to copy.  */
> > > > >   if (!(threaded_through_latch && threaded_multiway_branch)
> > > > >       && (n_insns * param_fsm_scale_path_stmts
> > > > >           >= param_max_jump_thread_duplication_stmts))
> > > > >
> > > > > and param_fsm_scale_path_stmts happens to be two.  I have no idea
> > > > > why we apply this scaling; the scaling is otherwise used in
> > > > >
> > > > >   /* We avoid creating irreducible inner loops unless we thread through
> > > > >      a multiway branch, in which case we have deemed it worth losing
> > > > >      other loop optimizations later.
> > > > >
> > > > >      We also consider it worth creating an irreducible inner loop if
> > > > >      the number of copied statement is low relative to the length of
> > > > >      the path -- in that case there's little the traditional loop
> > > > >      optimizer would have done anyway, so an irreducible loop is not
> > > > >      so bad.  */
> > > > >   if (!threaded_multiway_branch
> > > > >       && creates_irreducible_loop
> > > > >       && *creates_irreducible_loop
> > > > >       && (n_insns * (unsigned) param_fsm_scale_path_stmts
> > > > >           > (m_path.length () *
> > > > >              (unsigned) param_fsm_scale_path_blocks)))
> > > > >
> > > > > so my idea is to drop the scaling from applying the
> > > > > param_max_jump_thread_duplication_stmts limit which raises the
> > > > > effective default limit from 15 / 2 to 15, just like what DOM
> > > > > applies (where DOM also estimates some optimization on the path,
> > > > > reducing the stmt count).
> > > > >
> > > > > > Looks good.
> > > > >
> > > > > Thanks - I'll adjust the commentary and push.
> > > > >
> > > > > Richard.
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> > > HRB 36809 (AG Nuernberg)
> > >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> HRB 36809 (AG Nuernberg)
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Refactor back_threader_profitability
  2022-08-17  8:46         ` Aldy Hernandez
@ 2022-08-17  8:59           ` Richard Biener
  2022-08-17  9:27             ` Aldy Hernandez
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2022-08-17  8:59 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: gcc-patches

On Wed, 17 Aug 2022, Aldy Hernandez wrote:

> On Wed, Aug 17, 2022 at 10:38 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> >
> > > On Wed, Aug 17, 2022 at 9:54 AM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> > > >
> > > > > I just have a few high level comments.
> > > > >
> > > > > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
> > > > > >
> > > > > > The following refactors profitable_path_p in the backward threader,
> > > > > > splitting out parts that can be computed once the exit block is known,
> > > > > > parts that contiguously update and that can be checked allowing
> > > > > > for the path to be later identified as FSM with larger limits,
> > > > > > possibly_profitable_path_p, and final checks done when the whole
> > > > > > path is known, profitable_path_p.
> > > > >
> > > > > I thought we were removing references to FSM, as they were leftovers
> > > > > from some previous incarnation.  For that matter, I don't think I ever
> > > > > understood what they are, so if we're gonna keep them, could you
> > > > > comment what makes FSM threads different from other threads?
> > > >
> > > > I don't know exactly what 'FSM' stands for but the FSM threader
> > > > specifically tried to cover
> > >
> > > It probably stands for finite state machine then?
> > >
> > > >
> > > > for (;;)
> > > >  {
> > > >    switch (state)
> > > >     {
> > > >     case 1:
> > > >       /* sth */
> > > >       state = 3;
> > > >       break;
> > > >     ...
> > > >     case 3:
> > > >       ...
> > > >     }
> > > > }
> > > >
> > > > so optimizing state machine transitions.  That is, these are all
> > > > multiway (switch or goto, but goto support has been removed with the
> > > > ranger rewrite it seems) and the thread path going through the
> > >
> > > ISTR going through the computed goto path in the old code, and it
> > > never got triggered.  I just didn't get around to removing the
> > > references to GIMPLE_GOTO.  At least, the old threader not once got a
> > > thread we didn't get with the new code, even through a full Fedora
> > > build.  I could be wrong, but that's my recollection.
> >
> > When one massages the threader to even consider gotos, we eventually
> > run into find_taken_edge not handling them because range_of_expr
> > computes the range of 'state' as
> >
> > [irange] void * [1, +INF]$4 = void
> >
> > rather than &&E.
> >
> > A testcase would be
> >
> > unsigned g;
> > void FSM (int start)
> > {
> >   void *states[] = { &&A, &&B, &&C, &&E };
> >   void *state = states[start];
> >
> >   do {
> >   goto *state;
> >
> > A:
> >   g += 1;
> >   state = g & 1 ? &&B : &&E;
> >   continue;
> >
> > B:
> >   g += 2;
> >   state = &&C;
> >   continue;
> >
> > C:
> >   g += 3;
> >   state = states[g & 3];
> >   continue;
> >
> > E:
> >   break;
> >   } while (1);
> > }
> >
> > where we'd expect to thread the B -> C state transition.  I checked
> > gcc 7 and it doesn't do that - not sure if it broke at some point
> > along the way or if it was just never working.  I'll file a bug report,
> > OTOH &label and goto *p are both GNU extensions, so not sure if it
> > is worth optimizing.  I'll attach the "simple" enabler I have.
> 
> Yeah, that's what I thought.  I seem to remember tracing through the
> old code and never finding a path that would handle computed gotos.
> Either way, thanks for distilling a testcase.  If you could file a PR
> and CC me on it, that'd be great.
> 
> When I contribute prange (irange for pointers), the plan is to have it
> handle pointer equivalences, and we should be able to get the address
> from the prange itself.

Well, ISTR seeing &decl in integer ranges so I wondered why it doesn't
just work ;)  Tracing ranger shows

Checking profitability of path (backwards):  bb:11 (0 insns) bb:4 (latch)
  Control statement insns: 0
  Overall: 0 insns
1        range_of_expr(state_9) at stmt state_9 = PHI <&E(6), state_23(9), 
&C(8), &B(5)>
2          range_of_stmt (state_9) at stmt state_9 = PHI <&E(6), 
state_23(9), &C(8), &B(5)>
3            ROS dependence fill
               ROS dep fill (state_9) at stmt state_9 = PHI <&E(6), 
state_23(9), &C(8), &B(5)>
             FALSE : (3) ROS  (state_9)
4            range_on_edge (&E) on edge 6->4
             TRUE : (4) range_on_edge (&E) [irange] void * [1, +INF]

so why's the range not &E but non-NULL instead?  Seems to be
range_query::get_tree_range's doing.

Richard.

> Aldy
> 
> >
> > > > loop backedge.  This is why FSM has different size limits because
> > > > it was thought those threadings are especially worthwhile and
> > > > they tend to be larger.  To make discovery cheaper a TODO item
> > > > would be to skip to the loop backedge when we reach the regular
> > > > thread limit and only continue the real search there (and
> > > > pick the "smallest" ways through any diamonds on the way).
> > >
> > > Ah, thanks.  If you could, add some similar blurb, it'd be great.  I'm
> > > sure we'll all forget it at some time (well, I will :)).
> >
> > Sure ;)
> >
> > Richard.
> >
> > > Aldy
> > >
> > > >
> > > > > In your possibly_profitable_path_p function, could you document a bit
> > > > > better what's the difference between profitable_path_p and
> > > > > possibly_profitable_path_p?
> > > >
> > > > Sure.
> > > >
> > > > > >
> > > > > > I've removed the back_threader_profitability instance from the
> > > > > > back_threader class and instead instantiate it once per path
> > > > > > discovery.  I've kept the size compute non-incremental to simplify
> > > > > > the patch and not worry about unwinding.
> > > > > >
> > > > > > There are key changes to previous behavior - namely we apply
> > > > > > the param_max_jump_thread_duplication_stmts early only when
> > > > > > we know the path cannot become an FSM one (multiway + thread through
> > > > > > latch) but make sure to elide the path query when we didn't
> > > > > > yet discover that but are over this limit.  Similarly the
> > > > > > speed limit is now used even when we did not yet discover a
> > > > > > hot BB on the path.  Basically the idea is to only stop path
> > > > > > discovery when we know the path will never become profitable
> > > > > > but avoid the expensive path range query when we know it's
> > > > > > currently not.
> > > > > >
> > > > > > I've done a few cleanups, merging functions, on the way.
> > > > > >
> > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > >
> > > > > > Statistics show an overall slight increase in threading but
> > > > > > looking at different files there's noise up and down.  That's
> > > > > > somewhat expected since we now are applying the "more correct"
> > > > > > limits in the end.  Unless I made big mistakes of course.
> > > > > >
> > > > > > The next thing cost-wise would be to raise the backwards
> > > > > > threading limit to the limit of DOM so we don't get
> > > > > > artificially high counts for that.
> > > > >
> > > > > The DOM threader has limits?  I thought most of those limits were just
> > > > > due to the fact that it couldn't determine long enough paths?  Either
> > > > > way, I like that we're merging the necessary forward threader bits
> > > > > here, in preparation for its demise ;-).
> > > >
> > > > Both use param_max_jump_thread_duplication_stmts, but the backwards
> > > > threader applies this limit to non-FSM threads with
> > > >
> > > >   /* The generic copier used by the backthreader does not re-use an
> > > >      existing threading path to reduce code duplication.  So for that
> > > >      case, drastically reduce the number of statements we are allowed
> > > >      to copy.  */
> > > >   if (!(threaded_through_latch && threaded_multiway_branch)
> > > >       && (n_insns * param_fsm_scale_path_stmts
> > > >           >= param_max_jump_thread_duplication_stmts))
> > > >
> > > > and param_fsm_scale_path_stmts happens to be two.  I have no idea
> > > > why we apply this scaling; the scaling is otherwise used in
> > > >
> > > >   /* We avoid creating irreducible inner loops unless we thread through
> > > >      a multiway branch, in which case we have deemed it worth losing
> > > >      other loop optimizations later.
> > > >
> > > >      We also consider it worth creating an irreducible inner loop if
> > > >      the number of copied statement is low relative to the length of
> > > >      the path -- in that case there's little the traditional loop
> > > >      optimizer would have done anyway, so an irreducible loop is not
> > > >      so bad.  */
> > > >   if (!threaded_multiway_branch
> > > >       && creates_irreducible_loop
> > > >       && *creates_irreducible_loop
> > > >       && (n_insns * (unsigned) param_fsm_scale_path_stmts
> > > >           > (m_path.length () *
> > > >              (unsigned) param_fsm_scale_path_blocks)))
> > > >
> > > > so my idea is to drop the scaling from applying the
> > > > param_max_jump_thread_duplication_stmts limit which raises the
> > > > effective default limit from 15 / 2 to 15, just like what DOM
> > > > applies (where DOM also estimates some optimization on the path,
> > > > reducing the stmt count).
> > > >
> > > > > Looks good.
> > > >
> > > > Thanks - I'll adjust the commentary and push.
> > > >
> > > > Richard.
> > > >
> > >
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> > HRB 36809 (AG Nuernberg)
> >
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


* Re: [PATCH] Refactor back_threader_profitability
  2022-08-17  8:38       ` Richard Biener
@ 2022-08-17  8:46         ` Aldy Hernandez
  2022-08-17  8:59           ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Aldy Hernandez @ 2022-08-17  8:46 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On Wed, Aug 17, 2022 at 10:38 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 17 Aug 2022, Aldy Hernandez wrote:
>
> > On Wed, Aug 17, 2022 at 9:54 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> > >
> > > > I just have a few high level comments.
> > > >
> > > > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
> > > > >
> > > > > The following refactors profitable_path_p in the backward threader,
> > > > > splitting out parts that can be computed once the exit block is known,
> > > > > parts that contiguously update and that can be checked allowing
> > > > > for the path to be later identified as FSM with larger limits,
> > > > > possibly_profitable_path_p, and final checks done when the whole
> > > > > path is known, profitable_path_p.
> > > >
> > > > I thought we were removing references to FSM, as they were leftovers
> > > > from some previous incarnation.  For that matter, I don't think I ever
> > > > understood what they are, so if we're gonna keep them, could you
> > > > comment what makes FSM threads different from other threads?
> > >
> > > I don't know exactly what 'FSM' stands for but the FSM threader
> > > specifically tried to cover
> >
> > It probably stands for finite state machine then?
> >
> > >
> > > for (;;)
> > >  {
> > >    switch (state)
> > >     {
> > >     case 1:
> > >       /* sth */
> > >       state = 3;
> > >       break;
> > >     ...
> > >     case 3:
> > >       ...
> > >     }
> > > }
> > >
> > > so optimizing state machine transitions.  That is, these are all
> > > multiway (switch or goto, but goto support has been removed with the
> > > ranger rewrite it seems) and the thread path going through the
> >
> > ISTR going through the computed goto path in the old code, and it
> > never got triggered.  I just didn't get around to removing the
> > references to GIMPLE_GOTO.  At least, the old threader not once got a
> > thread we didn't get with the new code, even through a full Fedora
> > build.  I could be wrong, but that's my recollection.
>
> When one massages the threader to even consider gotos, we eventually
> run into find_taken_edge not handling them because range_of_expr
> computes the range of 'state' as
>
> [irange] void * [1, +INF]$4 = void
>
> rather than &&E.
>
> A testcase would be
>
> unsigned g;
> void FSM (int start)
> {
>   void *states[] = { &&A, &&B, &&C, &&E };
>   void *state = states[start];
>
>   do {
>   goto *state;
>
> A:
>   g += 1;
>   state = g & 1 ? &&B : &&E;
>   continue;
>
> B:
>   g += 2;
>   state = &&C;
>   continue;
>
> C:
>   g += 3;
>   state = states[g & 3];
>   continue;
>
> E:
>   break;
>   } while (1);
> }
>
> where we'd expect to thread the B -> C state transition.  I checked
> gcc 7 and it doesn't do that - not sure if it broke at some point
> along the way or if it was just never working.  I'll file a bug report,
> OTOH &label and goto *p are both GNU extensions, so not sure if it
> is worth optimizing.  I'll attach the "simple" enabler I have.

Yeah, that's what I thought.  I seem to remember tracing through the
old code and never finding a path that would handle computed gotos.
Either way, thanks for distilling a testcase.  If you could file a PR
and CC me on it, that'd be great.

When I contribute prange (irange for pointers), the plan is to have it
handle pointer equivalences, and we should be able to get the address
from the prange itself.

Aldy

>
> > > loop backedge.  This is why FSM has different size limits because
> > > it was thought those threadings are especially worthwhile and
> > > they tend to be larger.  To make discovery cheaper a TODO item
> > > would be to skip to the loop backedge when we reach the regular
> > > thread limit and only continue the real search there (and
> > > pick the "smallest" ways through any diamonds on the way).
> >
> > Ah, thanks.  If you could, add some similar blurb, it'd be great.  I'm
> > sure we'll all forget it at some time (well, I will :)).
>
> Sure ;)
>
> Richard.
>
> > Aldy
> >
> > >
> > > > In your possibly_profitable_path_p function, could you document a bit
> > > > better what's the difference between profitable_path_p and
> > > > possibly_profitable_path_p?
> > >
> > > Sure.
> > >
> > > > >
> > > > > I've removed the back_threader_profitability instance from the
> > > > > back_threader class and instead instantiate it once per path
> > > > > discovery.  I've kept the size compute non-incremental to simplify
> > > > > the patch and not worry about unwinding.
> > > > >
> > > > > There are key changes to previous behavior - namely we apply
> > > > > the param_max_jump_thread_duplication_stmts early only when
> > > > > we know the path cannot become an FSM one (multiway + thread through
> > > > > latch) but make sure to elide the path query when we didn't
> > > > > yet discover that but are over this limit.  Similarly the
> > > > > speed limit is now used even when we did not yet discover a
> > > > > hot BB on the path.  Basically the idea is to only stop path
> > > > > discovery when we know the path will never become profitable
> > > > > but avoid the expensive path range query when we know it's
> > > > > currently not.
> > > > >
> > > > > I've done a few cleanups, merging functions, on the way.
> > > > >
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > >
> > > > > Statistics show an overall slight increase in threading but
> > > > > looking at different files there's noise up and down.  That's
> > > > > somewhat expected since we now are applying the "more correct"
> > > > > limits in the end.  Unless I made big mistakes of course.
> > > > >
> > > > > The next thing cost-wise would be to raise the backwards
> > > > > threading limit to the limit of DOM so we don't get
> > > > > artificially high counts for that.
> > > >
> > > > The DOM threader has limits?  I thought most of those limits were just
> > > > due to the fact that it couldn't determine long enough paths?  Either
> > > > way, I like that we're merging the necessary forward threader bits
> > > > here, in preparation for its demise ;-).
> > >
> > > Both use param_max_jump_thread_duplication_stmts, but the backwards
> > > threader applies this limit to non-FSM threads with
> > >
> > >   /* The generic copier used by the backthreader does not re-use an
> > >      existing threading path to reduce code duplication.  So for that
> > >      case, drastically reduce the number of statements we are allowed
> > >      to copy.  */
> > >   if (!(threaded_through_latch && threaded_multiway_branch)
> > >       && (n_insns * param_fsm_scale_path_stmts
> > >           >= param_max_jump_thread_duplication_stmts))
> > >
> > > and param_fsm_scale_path_stmts happens to be two.  I have no idea
> > > why we apply this scaling; the scaling is otherwise used in
> > >
> > >   /* We avoid creating irreducible inner loops unless we thread through
> > >      a multiway branch, in which case we have deemed it worth losing
> > >      other loop optimizations later.
> > >
> > >      We also consider it worth creating an irreducible inner loop if
> > >      the number of copied statement is low relative to the length of
> > >      the path -- in that case there's little the traditional loop
> > >      optimizer would have done anyway, so an irreducible loop is not
> > >      so bad.  */
> > >   if (!threaded_multiway_branch
> > >       && creates_irreducible_loop
> > >       && *creates_irreducible_loop
> > >       && (n_insns * (unsigned) param_fsm_scale_path_stmts
> > >           > (m_path.length () *
> > >              (unsigned) param_fsm_scale_path_blocks)))
> > >
> > > so my idea is to drop the scaling from applying the
> > > param_max_jump_thread_duplication_stmts limit which raises the
> > > effective default limit from 15 / 2 to 15, just like what DOM
> > > applies (where DOM also estimates some optimization on the path,
> > > reducing the stmt count).
> > >
> > > > Looks good.
> > >
> > > Thanks - I'll adjust the commentary and push.
> > >
> > > Richard.
> > >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> HRB 36809 (AG Nuernberg)
>



* Re: [PATCH] Refactor back_threader_profitability
  2022-08-17  8:04     ` Aldy Hernandez
@ 2022-08-17  8:38       ` Richard Biener
  2022-08-17  8:46         ` Aldy Hernandez
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2022-08-17  8:38 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: gcc-patches

On Wed, 17 Aug 2022, Aldy Hernandez wrote:

> On Wed, Aug 17, 2022 at 9:54 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> >
> > > I just have a few high level comments.
> > >
> > > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > > The following refactors profitable_path_p in the backward threader,
> > > > splitting out parts that can be computed once the exit block is known,
> > > > parts that contiguously update and that can be checked allowing
> > > > for the path to be later identified as FSM with larger limits,
> > > > possibly_profitable_path_p, and final checks done when the whole
> > > > path is known, profitable_path_p.
> > >
> > > I thought we were removing references to FSM, as they were leftovers
> > > from some previous incarnation.  For that matter, I don't think I ever
> > > understood what they are, so if we're gonna keep them, could you
> > > comment what makes FSM threads different from other threads?
> >
> > I don't know exactly what 'FSM' stands for but the FSM threader
> > specifically tried to cover
> 
> It probably stands for finite state machine then?
> 
> >
> > for (;;)
> >  {
> >    switch (state)
> >     {
> >     case 1:
> >       /* sth */
> >       state = 3;
> >       break;
> >     ...
> >     case 3:
> >       ...
> >     }
> > }
> >
> > so optimizing state machine transitions.  That is, these are all
> > multiway (switch or goto, but goto support has been removed with the
> > ranger rewrite it seems) and the thread path going through the
> 
> ISTR going through the computed goto path in the old code, and it
> never got triggered.  I just didn't get around to removing the
> references to GIMPLE_GOTO.  At least, the old threader not once got a
> thread we didn't get with the new code, even through a full Fedora
> build.  I could be wrong, but that's my recollection.

When one massages the threader to even consider gotos, we eventually
run into find_taken_edge not handling them because range_of_expr
computes the range of 'state' as

[irange] void * [1, +INF]$4 = void

rather than &&E.

A testcase would be

unsigned g;
void FSM (int start)
{
  void *states[] = { &&A, &&B, &&C, &&E };
  void *state = states[start];

  do {
  goto *state;

A:
  g += 1;
  state = g & 1 ? &&B : &&E;
  continue;

B:
  g += 2;
  state = &&C;
  continue;

C:
  g += 3;
  state = states[g & 3];
  continue;

E:
  break;
  } while (1);
}

where we'd expect to thread the B -> C state transition.  I checked
gcc 7 and it doesn't do that - not sure if it broke at some point
along the way or if it was just never working.  I'll file a bug report,
OTOH &label and goto *p are both GNU extensions, so not sure if it
is worth optimizing.  I'll attach the "simple" enabler I have.
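
For contrast, here is a sketch of the same state machine in the switch
form the FSM threader was written for (a hypothetical rewrite of the
testcase above, not part of the patch).  Here the B -> C transition
assigns a constant integer the path query can resolve; the "fuel"
counter is only there to make the sketch safely runnable, since the
original testcase can loop forever for some inputs:

```cpp
unsigned g;

// Hypothetical switch-based rewrite of the computed-goto testcase.
// States: 0 = A, 1 = B, 2 = C, 3 = E.
void FSM_switch (int start)
{
  int state = start;
  for (int fuel = 100; fuel > 0; --fuel)
    {
      switch (state)
        {
        case 0:                         // A
          g += 1;
          state = (g & 1) ? 1 : 3;
          continue;
        case 1:                         // B
          g += 2;
          state = 2;                    // constant transition: B always goes to C
          continue;
        case 2:                         // C
          g += 3;
          state = g & 3;
          continue;
        case 3:                         // E
          return;
        }
    }
}
```

In this form 'state' has a known singleton integer range on the B -> C
path, which is what the existing irange machinery handles, whereas the
computed-goto version needs the range of a label address.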

> > loop backedge.  This is why FSM has different size limits because
> > it was thought those threadings are especially worthwhile and
> > they tend to be larger.  To make discovery cheaper a TODO item
> > would be to skip to the loop backedge when we reach the regular
> > thread limit and only continue the real search there (and
> > pick the "smallest" ways through any diamonds on the way).
> 
> Ah, thanks.  If you could, add some similar blurb, it'd be great.  I'm
> sure we'll all forget it at some time (well, I will :)).

Sure ;)

Richard.

> Aldy
> 
> >
> > > In your possibly_profitable_path_p function, could you document a bit
> > > better what's the difference between profitable_path_p and
> > > possibly_profitable_path_p?
> >
> > Sure.
> >
> > > >
> > > > I've removed the back_threader_profitability instance from the
> > > > back_threader class and instead instantiate it once per path
> > > > discovery.  I've kept the size compute non-incremental to simplify
> > > > the patch and not worry about unwinding.
> > > >
> > > > There are key changes to previous behavior - namely we apply
> > > > the param_max_jump_thread_duplication_stmts early only when
> > > > we know the path cannot become an FSM one (multiway + thread through
> > > > latch) but make sure to elide the path query when we didn't
> > > > yet discover that but are over this limit.  Similarly the
> > > > speed limit is now used even when we did not yet discover a
> > > > hot BB on the path.  Basically the idea is to only stop path
> > > > discovery when we know the path will never become profitable
> > > > but avoid the expensive path range query when we know it's
> > > > currently not.
> > > >
> > > > I've done a few cleanups, merging functions, on the way.
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > >
> > > > Statistics show an overall slight increase in threading but
> > > > looking at different files there's noise up and down.  That's
> > > > somewhat expected since we now are applying the "more correct"
> > > > limits in the end.  Unless I made big mistakes of course.
> > > >
> > > > The next thing cost-wise would be to raise the backwards
> > > > threading limit to the limit of DOM so we don't get
> > > > artificially high counts for that.
> > >
> > > The DOM threader has limits?  I thought most of those limits were just
> > > due to the fact that it couldn't determine long enough paths?  Either
> > > way, I like that we're merging the necessary forward threader bits
> > > here, in preparation for its demise ;-).
> >
> > Both use param_max_jump_thread_duplication_stmts, but the backwards
> > threader applies this limit to non-FSM threads with
> >
> >   /* The generic copier used by the backthreader does not re-use an
> >      existing threading path to reduce code duplication.  So for that
> >      case, drastically reduce the number of statements we are allowed
> >      to copy.  */
> >   if (!(threaded_through_latch && threaded_multiway_branch)
> >       && (n_insns * param_fsm_scale_path_stmts
> >           >= param_max_jump_thread_duplication_stmts))
> >
> > and param_fsm_scale_path_stmts happens to be two.  I have no idea
> > why we apply this scaling; the scaling is otherwise used in
> >
> >   /* We avoid creating irreducible inner loops unless we thread through
> >      a multiway branch, in which case we have deemed it worth losing
> >      other loop optimizations later.
> >
> >      We also consider it worth creating an irreducible inner loop if
> >      the number of copied statement is low relative to the length of
> >      the path -- in that case there's little the traditional loop
> >      optimizer would have done anyway, so an irreducible loop is not
> >      so bad.  */
> >   if (!threaded_multiway_branch
> >       && creates_irreducible_loop
> >       && *creates_irreducible_loop
> >       && (n_insns * (unsigned) param_fsm_scale_path_stmts
> >           > (m_path.length () *
> >              (unsigned) param_fsm_scale_path_blocks)))
> >
> > so my idea is to drop the scaling from applying the
> > param_max_jump_thread_duplication_stmts limit which raises the
> > effective default limit from 15 / 2 to 15, just like what DOM
> > applies (where DOM also estimates some optimization on the path,
> > reducing the stmt count).
> >
> > > Looks good.
> >
> > Thanks - I'll adjust the commentary and push.
> >
> > Richard.
> >
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Refactor back_threader_profitability
  2022-08-17  7:53   ` Richard Biener
@ 2022-08-17  8:04     ` Aldy Hernandez
  2022-08-17  8:38       ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Aldy Hernandez @ 2022-08-17  8:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On Wed, Aug 17, 2022 at 9:54 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 17 Aug 2022, Aldy Hernandez wrote:
>
> > I just have a few high level comments.
> >
> > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > The following refactors profitable_path_p in the backward threader,
> > > splitting out parts that can be computed once the exit block is known,
> > > parts that contiguously update and that can be checked allowing
> > > for the path to be later identified as FSM with larger limits,
> > > possibly_profitable_path_p, and final checks done when the whole
> > > path is known, profitable_path_p.
> >
> > I thought we were removing references to FSM, as they were leftovers
> > from some previous incarnation.  For that matter, I don't think I ever
> > understood what they are, so if we're gonna keep them, could you
> > comment what makes FSM threads different from other threads?
>
> I don't know exactly what 'FSM' stands for but the FSM threader
> specifically tried to cover

It probably stands for finite state machine then?

>
> for (;;)
>  {
>    switch (state)
>     {
>     case 1:
>       /* sth */
>       state = 3;
>       break;
>     ...
>     case 3:
>       ...
>     }
> }
>
> so optimizing state machine transitions.  That is, these are all
> multiway (switch or goto, but goto support has been removed with the
> ranger rewrite, it seems) and the thread path goes through the

ISTR going through the computed goto path in the old code, and it
never got triggered.  I just didn't get around to removing the
references to GIMPLE_GOTO.  At least, the old threader not once got a
thread we didn't get with the new code, even through a full Fedora
build.  I could be wrong though, but that's my recollection.

> loop backedge.  This is why FSM has different size limits because
> it was thought those threadings are especially worthwhile and
> they tend to be larger.  To make discovery cheaper a TODO item
> would be to skip to the loop backedge when we reach the regular
> thread limit and only continue the real search there (and
> pick the "smallest" ways through any diamonds on the way).

Ah, thanks.  If you could, add some similar blurb, it'd be great.  I'm
sure we'll all forget it at some time (well, I will :)).

Aldy

>
> > In your possibly_profitable_path_p function, could you document a bit
> > better what's the difference between profitable_path_p and
> > possibly_profitable_path_p?
>
> Sure.
>
> > >
> > > I've removed the back_threader_profitability instance from the
> > > back_threader class and instead instantiate it once per path
> > > discovery.  I've kept the size compute non-incremental to simplify
> > > the patch and not worry about unwinding.
> > >
> > > There are key changes to previous behavior - namely we apply
> > > the param_max_jump_thread_duplication_stmts early only when
> > > we know the path cannot become an FSM one (multiway + thread through
> > > latch) but make sure to elide the path query when we didn't
> > > yet discover that but are over this limit.  Similarly the
> > > speed limit is now used even when we did not yet discover a
> > > hot BB on the path.  Basically the idea is to only stop path
> > > discovery when we know the path will never become profitable
> > > but avoid the expensive path range query when we know it's
> > > currently not.
> > >
> > > I've done a few cleanups, merging functions, on the way.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > Statistics show an overall slight increase in threading but
> > > looking at different files there's noise up and down.  That's
> > > somewhat expected since we now are applying the "more correct"
> > > limits in the end.  Unless I made big mistakes of course.
> > >
> > > The next thing cost-wise would be to raise the backwards
> > > threading limit to the limit of DOM so we don't get
> > > artificially high counts for that.
> >
> > The DOM threader has limits?  I thought most of those limits were just
> > due to the fact that it couldn't determine long enough paths?  Either
> > way, I like that we're merging the necessary forward threader bits
> > here, in preparation for its demise ;-).
>
> Both use param_max_jump_thread_duplication_stmts, but the backwards
> threader applies this limit to non-FSM threads with
>
>   /* The generic copier used by the backthreader does not re-use an
>      existing threading path to reduce code duplication.  So for that
>      case, drastically reduce the number of statements we are allowed
>      to copy.  */
>   if (!(threaded_through_latch && threaded_multiway_branch)
>       && (n_insns * param_fsm_scale_path_stmts
>           >= param_max_jump_thread_duplication_stmts))
>
> and param_fsm_scale_path_stmts happens to be two.  I have no idea
> why we apply this scaling; the scaling is otherwise used in
>
>   /* We avoid creating irreducible inner loops unless we thread through
>      a multiway branch, in which case we have deemed it worth losing
>      other loop optimizations later.
>
>      We also consider it worth creating an irreducible inner loop if
>      the number of copied statement is low relative to the length of
>      the path -- in that case there's little the traditional loop
>      optimizer would have done anyway, so an irreducible loop is not
>      so bad.  */
>   if (!threaded_multiway_branch
>       && creates_irreducible_loop
>       && *creates_irreducible_loop
>       && (n_insns * (unsigned) param_fsm_scale_path_stmts
>           > (m_path.length () *
>              (unsigned) param_fsm_scale_path_blocks)))
>
> so my idea is to drop the scaling from applying the
> param_max_jump_thread_duplication_stmts limit which raises the
> effective default limit from 15 / 2 to 15, just like what DOM
> applies (where DOM also estimates some optimization on the path,
> reducing the stmt count).
>
> > Looks good.
>
> Thanks - I'll adjust the commentary and push.
>
> Richard.
>



* Re: [PATCH] Refactor back_threader_profitability
  2022-08-17  7:31 ` Aldy Hernandez
@ 2022-08-17  7:53   ` Richard Biener
  2022-08-17  8:04     ` Aldy Hernandez
  2022-08-19 16:02   ` Jeff Law
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Biener @ 2022-08-17  7:53 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: gcc-patches

On Wed, 17 Aug 2022, Aldy Hernandez wrote:

> I just have a few high level comments.
> 
> On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > The following refactors profitable_path_p in the backward threader,
> > splitting out parts that can be computed once the exit block is known,
> > parts that contiguously update and that can be checked allowing
> > for the path to be later identified as FSM with larger limits,
> > possibly_profitable_path_p, and final checks done when the whole
> > path is known, profitable_path_p.
> 
> I thought we were removing references to FSM, as they were leftovers
> from some previous incarnation.  For that matter, I don't think I ever
> understood what they are, so if we're gonna keep them, could you
> comment what makes FSM threads different from other threads?

I don't know exactly what 'FSM' stands for but the FSM threader
specifically tried to cover

for (;;)
 {
   switch (state)
    {
    case 1:
      /* sth */
      state = 3;
      break;
    ...
    case 3:
      ...
    }
}

so optimizing state machine transitions.  That is, these are all
multiway (switch or goto, but goto support has been removed with the
ranger rewrite, it seems) and the thread path goes through the
loop backedge.  This is why FSM has different size limits because
it was thought those threadings are especially worthwhile and
they tend to be larger.  To make discovery cheaper a TODO item
would be to skip to the loop backedge when we reach the regular
thread limit and only continue the real search there (and
pick the "smallest" ways through any diamonds on the way).
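
For readers unfamiliar with the transform, here is a minimal,
hypothetical illustration (standalone C, not GCC code) of what an FSM
thread effectively buys for such a state-machine loop: once `state`
is assigned a constant, the path through the loop backedge can be
duplicated so the next iteration skips the re-dispatch on `state`.

```c
#include <assert.h>

/* Original state machine: each iteration dispatches through one
   switch on `state`.  */
static int run_original (void)
{
  int state = 1, sum = 0;
  for (int i = 0; i < 4; i++)
    switch (state)
      {
      case 1: sum += 10; state = 3; break;
      case 3: sum += 1; state = 1; break;
      }
  return sum;
}

/* What an FSM thread effectively produces: after `state = 3` is
   assigned, the duplicated path through the loop backedge jumps
   straight to the `case 3` work without re-evaluating the switch.  */
static int run_threaded (void)
{
  int sum = 0;
  for (int i = 0; i < 4; i += 2)
    {
      sum += 10;   /* former case 1; threads directly to...  */
      sum += 1;    /* ...former case 3, no switch re-dispatch.  */
    }
  return sum;
}
```

Both versions compute the same result; the threaded one merely trades
code size for the eliminated dispatches, which is why FSM threads get
their own, larger size limits.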

> In your possibly_profitable_path_p function, could you document a bit
> better what's the difference between profitable_path_p and
> possibly_profitable_path_p?

Sure.

> >
> > I've removed the back_threader_profitability instance from the
> > back_threader class and instead instantiate it once per path
> > discovery.  I've kept the size compute non-incremental to simplify
> > the patch and not worry about unwinding.
> >
> > There are key changes to previous behavior - namely we apply
> > the param_max_jump_thread_duplication_stmts early only when
> > we know the path cannot become an FSM one (multiway + thread through
> > latch) but make sure to elide the path query when we didn't
> > yet discover that but are over this limit.  Similarly the
> > speed limit is now used even when we did not yet discover a
> > hot BB on the path.  Basically the idea is to only stop path
> > discovery when we know the path will never become profitable
> > but avoid the expensive path range query when we know it's
> > currently not.
> >
> > I've done a few cleanups, merging functions, on the way.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > Statistics show an overall slight increase in threading but
> > looking at different files there's noise up and down.  That's
> > somewhat expected since we now are applying the "more correct"
> > limits in the end.  Unless I made big mistakes of course.
> >
> > The next thing cost-wise would be to raise the backwards
> > threading limit to the limit of DOM so we don't get
> > artificially high counts for that.
> 
> The DOM threader has limits?  I thought most of those limits were just
> due to the fact that it couldn't determine long enough paths?  Either
> way, I like that we're merging the necessary forward threader bits
> here, in preparation for its demise ;-).

Both use param_max_jump_thread_duplication_stmts, but the backwards
threader applies this limit to non-FSM threads with

  /* The generic copier used by the backthreader does not re-use an
     existing threading path to reduce code duplication.  So for that
     case, drastically reduce the number of statements we are allowed
     to copy.  */
  if (!(threaded_through_latch && threaded_multiway_branch)
      && (n_insns * param_fsm_scale_path_stmts
          >= param_max_jump_thread_duplication_stmts))

and param_fsm_scale_path_stmts happens to be two.  I have no idea
why we apply this scaling; the scaling is otherwise used in

  /* We avoid creating irreducible inner loops unless we thread through
     a multiway branch, in which case we have deemed it worth losing
     other loop optimizations later.

     We also consider it worth creating an irreducible inner loop if
     the number of copied statement is low relative to the length of
     the path -- in that case there's little the traditional loop
     optimizer would have done anyway, so an irreducible loop is not
     so bad.  */
  if (!threaded_multiway_branch
      && creates_irreducible_loop
      && *creates_irreducible_loop
      && (n_insns * (unsigned) param_fsm_scale_path_stmts
          > (m_path.length () *
             (unsigned) param_fsm_scale_path_blocks)))

so my idea is to drop the scaling from applying the
param_max_jump_thread_duplication_stmts limit which raises the
effective default limit from 15 / 2 to 15, just like what DOM
applies (where DOM also estimates some optimization on the path,
reducing the stmt count).
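
To make the arithmetic concrete, here is a small standalone sketch
(hypothetical code, using the default parameter values quoted above:
param_fsm_scale_path_stmts == 2, param_max_jump_thread_duplication_stmts
== 15) of how dropping the scaling raises the effective non-FSM limit
from 7 to 14 statements:

```cpp
#include <cassert>

// Default parameter values as discussed in the thread.
const int scale_path_stmts = 2;   // param_fsm_scale_path_stmts
const int max_dup_stmts = 15;     // param_max_jump_thread_duplication_stmts

// Current check: a non-FSM path is rejected once
// n_insns * scale >= max, i.e. only paths of up to 7 insns pass.
static bool old_limit_ok (int n_insns)
{
  return !(n_insns * scale_path_stmts >= max_dup_stmts);
}

// Without the scaling, matching what DOM applies: paths of up to
// 14 insns pass.
static bool new_limit_ok (int n_insns)
{
  return !(n_insns >= max_dup_stmts);
}
```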

> Looks good.

Thanks - I'll adjust the commentary and push.

Richard.



* Re: [PATCH] Refactor back_threader_profitability
       [not found] <81970.122081610045900133@us-mta-26.us.mimecast.lan>
@ 2022-08-17  7:31 ` Aldy Hernandez
  2022-08-17  7:53   ` Richard Biener
  2022-08-19 16:02   ` Jeff Law
  0 siblings, 2 replies; 11+ messages in thread
From: Aldy Hernandez @ 2022-08-17  7:31 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

I just have a few high level comments.

On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <rguenther@suse.de> wrote:
>
> The following refactors profitable_path_p in the backward threader,
> splitting out parts that can be computed once the exit block is known,
> parts that contiguously update and that can be checked allowing
> for the path to be later identified as FSM with larger limits,
> possibly_profitable_path_p, and final checks done when the whole
> path is known, profitable_path_p.

I thought we were removing references to FSM, as they were leftovers
from some previous incarnation.  For that matter, I don't think I ever
understood what they are, so if we're gonna keep them, could you
comment what makes FSM threads different from other threads?

In your possibly_profitable_path_p function, could you document a bit
better what's the difference between profitable_path_p and
possibly_profitable_path_p?

>
> I've removed the back_threader_profitability instance from the
> back_threader class and instead instantiate it once per path
> discovery.  I've kept the size compute non-incremental to simplify
> the patch and not worry about unwinding.
>
> There are key changes to previous behavior - namely we apply
> the param_max_jump_thread_duplication_stmts early only when
> we know the path cannot become an FSM one (multiway + thread through
> latch) but make sure to elide the path query when we didn't
> yet discover that but are over this limit.  Similarly the
> speed limit is now used even when we did not yet discover a
> hot BB on the path.  Basically the idea is to only stop path
> discovery when we know the path will never become profitable
> but avoid the expensive path range query when we know it's
> currently not.
>
> I've done a few cleanups, merging functions, on the way.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Statistics show an overall slight increase in threading but
> looking at different files there's noise up and down.  That's
> somewhat expected since we now are applying the "more correct"
> limits in the end.  Unless I made big mistakes of course.
>
> The next thing cost-wise would be to raise the backwards
> threading limit to the limit of DOM so we don't get
> artificially high counts for that.

The DOM threader has limits?  I thought most of those limits were just
due to the fact that it couldn't determine long enough paths?  Either
way, I like that we're merging the necessary forward threader bits
here, in preparation for its demise ;-).

Looks good.

Thanks.
Aldy



* [PATCH] Refactor back_threader_profitability
@ 2022-08-16 14:04 Richard Biener
  0 siblings, 0 replies; 11+ messages in thread
From: Richard Biener @ 2022-08-16 14:04 UTC (permalink / raw)
  To: gcc-patches

The following refactors profitable_path_p in the backward threader,
splitting out parts that can be computed once the exit block is known,
parts that contiguously update and that can be checked allowing
for the path to be later identified as FSM with larger limits,
possibly_profitable_path_p, and final checks done when the whole
path is known, profitable_path_p.

I've removed the back_threader_profitability instance from the
back_threader class and instead instantiate it once per path
discovery.  I've kept the size compute non-incremental to simplify
the patch and not worry about unwinding.

There are key changes to previous behavior - namely we apply
the param_max_jump_thread_duplication_stmts early only when
we know the path cannot become an FSM one (multiway + thread through
latch) but make sure to elide the path query when we didn't
yet discover that but are over this limit.  Similarly the
speed limit is now used even when we did not yet discover a
hot BB on the path.  Basically the idea is to only stop path
discovery when we know the path will never become profitable
but avoid the expensive path range query when we know it's
currently not.
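
The discovery-time decision can be sketched as follows (a simplified,
hypothetical model with made-up names, not the code in the patch):

```cpp
#include <cassert>

// Simplified model of the decision described above: keep growing a
// path that is over the small limit as long as it may still become
// FSM; run the (expensive) path range query only when the path
// currently fits the limits.
struct path_state
{
  bool over_non_fsm_limit;  // size over the non-FSM duplication limit
  bool can_become_fsm;      // multiway branch + may still reach the latch
};

enum action { STOP_DISCOVERY, GROW_WITHOUT_QUERY, QUERY_AND_MAYBE_REGISTER };

static action decide (const path_state &p)
{
  if (p.over_non_fsm_limit && !p.can_become_fsm)
    return STOP_DISCOVERY;           // will never become profitable
  if (p.over_non_fsm_limit)
    return GROW_WITHOUT_QUERY;       // currently not profitable; elide query
  return QUERY_AND_MAYBE_REGISTER;   // within limits; try to resolve now
}
```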

I've done a few cleanups, merging functions, on the way.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Statistics show an overall slight increase in threading but
looking at different files there's noise up and down.  That's
somewhat expected since we now are applying the "more correct"
limits in the end.  Unless I made big mistakes of course.

The next thing cost-wise would be to raise the backwards
threading limit to the limit of DOM so we don't get
artificially high counts for that.

OK?

Thanks,
Richard.

	* tree-ssa-threadbackward.cc
	(back_threader_profitability): Split profitable_path_p
	into possibly_profitable_path_p and itself, keep state
	as new members.
	(back_threader::m_profit): Remove.
	(back_threader::find_paths): Likewise.
	(back_threader::maybe_register_path): Take profitability
	instance as parameter.
	(back_threader::find_paths_to_names): Likewise.  Use
	possibly_profitable_path_p and avoid the path range query
	when the path is currently too large.
	(back_threader::find_paths): Fold into ...
	(back_threader::maybe_thread_block): ... this.
	(get_gimple_control_stmt): Remove.
	(back_threader_profitability::possibly_profitable_path_p):
	Split out from profitable_path_p, do early profitability
	checks.
	(back_threader_profitability::profitable_path_p): Do final
	profitability checks after the taken edge has been determined.
---
 gcc/tree-ssa-threadbackward.cc | 350 ++++++++++++++++++---------------
 1 file changed, 192 insertions(+), 158 deletions(-)

diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
index 1c362839fab..077a767e73d 100644
--- a/gcc/tree-ssa-threadbackward.cc
+++ b/gcc/tree-ssa-threadbackward.cc
@@ -61,15 +61,33 @@ public:
 class back_threader_profitability
 {
 public:
-  back_threader_profitability (bool speed_p)
-    : m_speed_p (speed_p)
-  { }
-  bool profitable_path_p (const vec<basic_block> &, tree name, edge taken,
-			  bool *irreducible_loop = NULL);
+  back_threader_profitability (bool speed_p, gimple *stmt);
+  bool possibly_profitable_path_p (const vec<basic_block> &, tree, bool *);
+  bool profitable_path_p (const vec<basic_block> &,
+			  edge taken, bool *irreducible_loop);
 private:
   const bool m_speed_p;
+  int m_exit_jump_benefit;
+  bool m_threaded_multiway_branch;
+  // The following are computed by possibly_profitable_path_p
+  bool m_threaded_through_latch;
+  bool m_multiway_branch_in_path;
+  bool m_contains_hot_bb;
+  int m_n_insns;
 };
 
+back_threader_profitability::back_threader_profitability (bool speed_p,
+							  gimple *last)
+  : m_speed_p (speed_p)
+{
+  m_threaded_multiway_branch = (gimple_code (last) == GIMPLE_SWITCH
+				|| gimple_code (last) == GIMPLE_GOTO);
+  // The forward threader has estimate_threading_killed_stmts, in
+  // particular it estimates further DCE from eliminating the exit
+  // control stmt.
+  m_exit_jump_benefit = estimate_num_insns (last, &eni_size_weights);
+}
+
 // Back threader flags.
 #define BT_NONE 0
 // Generate fast code at the expense of code size.
@@ -86,11 +104,11 @@ public:
   unsigned thread_blocks ();
 private:
   void maybe_thread_block (basic_block bb);
-  void find_paths (basic_block bb, tree name);
   bool debug_counter ();
-  edge maybe_register_path ();
+  edge maybe_register_path (back_threader_profitability &);
   void maybe_register_path_dump (edge taken_edge);
-  void find_paths_to_names (basic_block bb, bitmap imports, unsigned);
+  void find_paths_to_names (basic_block bb, bitmap imports, unsigned,
+			    back_threader_profitability &);
   edge find_taken_edge (const vec<basic_block> &path);
   edge find_taken_edge_cond (const vec<basic_block> &path, gcond *);
   edge find_taken_edge_switch (const vec<basic_block> &path, gswitch *);
@@ -98,7 +116,6 @@ private:
   virtual void dump (FILE *out);
 
   back_threader_registry m_registry;
-  back_threader_profitability m_profit;
   path_range_query *m_solver;
 
   // Current path being analyzed.
@@ -131,8 +148,7 @@ private:
 const edge back_threader::UNREACHABLE_EDGE = (edge) -1;
 
 back_threader::back_threader (function *fun, unsigned flags, bool first)
-  : m_profit (flags & BT_SPEED),
-    m_first (first)
+  : m_first (first)
 {
   if (flags & BT_SPEED)
     loop_optimizer_init (LOOPS_HAVE_PREHEADERS | LOOPS_HAVE_SIMPLE_LATCHES);
@@ -226,7 +242,7 @@ back_threader::maybe_register_path_dump (edge taken)
 // unreachable.
 
 edge
-back_threader::maybe_register_path ()
+back_threader::maybe_register_path (back_threader_profitability &profit)
 {
   edge taken_edge = find_taken_edge (m_path);
 
@@ -241,8 +257,7 @@ back_threader::maybe_register_path ()
       else
 	{
 	  bool irreducible = false;
-	  if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
-					  &irreducible)
+	  if (profit.profitable_path_p (m_path, taken_edge, &irreducible)
 	      && debug_counter ()
 	      && m_registry.register_path (m_path, taken_edge))
 	    {
@@ -342,17 +357,21 @@ back_threader::find_taken_edge_cond (const vec<basic_block> &path,
 
 void
 back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
-				    unsigned overall_paths)
+				    unsigned overall_paths,
+				    back_threader_profitability &profit)
 {
   if (m_visited_bbs.add (bb))
     return;
 
   m_path.safe_push (bb);
 
-  // Try to resolve the path without looking back.
+  // Try to resolve the path without looking back.  Avoid resolving paths
+  // we know are large but not (yet) FSM.
+  bool large_non_fsm;
   if (m_path.length () > 1
-      && (!m_profit.profitable_path_p (m_path, m_name, NULL)
-	  || maybe_register_path ()))
+      && (!profit.possibly_profitable_path_p (m_path, m_name, &large_non_fsm)
+	  || (!large_non_fsm
+	      && maybe_register_path (profit))))
     ;
 
   // The backwards thread copier cannot copy blocks that do not belong
@@ -474,7 +493,8 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
 			}
 		    }
 		}
-	      find_paths_to_names (e->src, new_interesting, overall_paths);
+	      find_paths_to_names (e->src, new_interesting, overall_paths,
+				   profit);
 	      // Restore new_interesting.
 	      for (int def : unwind)
 		bitmap_clear_bit (new_interesting, def);
@@ -499,66 +519,6 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
   m_visited_bbs.remove (bb);
 }
 
-// Search backwards from BB looking for paths where the final
-// conditional out of BB can be determined.  NAME is the LHS of the
-// final conditional.  Register such paths for jump threading.
-
-void
-back_threader::find_paths (basic_block bb, tree name)
-{
-  gimple *stmt = last_stmt (bb);
-  if (!stmt
-      || (gimple_code (stmt) != GIMPLE_COND
-	  && gimple_code (stmt) != GIMPLE_SWITCH))
-    return;
-
-  if (EDGE_COUNT (bb->succs) > 1)
-    {
-      m_last_stmt = stmt;
-      m_visited_bbs.empty ();
-      m_path.truncate (0);
-      m_name = name;
-
-      // We compute imports of the path during discovery starting
-      // just with names used in the conditional.
-      bitmap_clear (m_imports);
-      ssa_op_iter iter;
-      FOR_EACH_SSA_TREE_OPERAND (name, stmt, iter, SSA_OP_USE)
-	{
-	  if (!gimple_range_ssa_p (name))
-	    return;
-	  bitmap_set_bit (m_imports, SSA_NAME_VERSION (name));
-	}
-
-      // Interesting is the set of imports we still not have see
-      // the definition of.  So while imports only grow, the
-      // set of interesting defs dwindles and once empty we can
-      // stop searching.
-      auto_bitmap interesting;
-      bitmap_copy (interesting, m_imports);
-      find_paths_to_names (bb, interesting, 1);
-    }
-}
-
-// Simple helper to get the last statement from BB, which is assumed
-// to be a control statement.  Return NULL if the last statement is
-// not a control statement.
-
-static gimple *
-get_gimple_control_stmt (basic_block bb)
-{
-  gimple_stmt_iterator gsi = gsi_last_nondebug_bb (bb);
-
-  if (gsi_end_p (gsi))
-    return NULL;
-
-  gimple *stmt = gsi_stmt (gsi);
-  enum gimple_code code = gimple_code (stmt);
-  if (code == GIMPLE_COND || code == GIMPLE_SWITCH || code == GIMPLE_GOTO)
-    return stmt;
-  return NULL;
-}
-
 // Search backwards from BB looking for paths where the final
 // conditional maybe threaded to a successor block.  Record such paths
 // for jump threading.
@@ -566,7 +526,10 @@ get_gimple_control_stmt (basic_block bb)
 void
 back_threader::maybe_thread_block (basic_block bb)
 {
-  gimple *stmt = get_gimple_control_stmt (bb);
+  if (EDGE_COUNT (bb->succs) <= 1)
+    return;
+
+  gimple *stmt = last_stmt (bb);
   if (!stmt)
     return;
 
@@ -576,12 +539,33 @@ back_threader::maybe_thread_block (basic_block bb)
     name = gimple_switch_index (as_a <gswitch *> (stmt));
   else if (code == GIMPLE_COND)
     name = gimple_cond_lhs (stmt);
-  else if (code == GIMPLE_GOTO)
-    name = gimple_goto_dest (stmt);
   else
-    name = NULL;
+    return;
 
-  find_paths (bb, name);
+  m_last_stmt = stmt;
+  m_visited_bbs.empty ();
+  m_path.truncate (0);
+  m_name = name;
+
+  // We compute imports of the path during discovery starting
+  // just with names used in the conditional.
+  bitmap_clear (m_imports);
+  ssa_op_iter iter;
+  FOR_EACH_SSA_TREE_OPERAND (name, stmt, iter, SSA_OP_USE)
+    {
+      if (!gimple_range_ssa_p (name))
+	return;
+      bitmap_set_bit (m_imports, SSA_NAME_VERSION (name));
+    }
+
+  // Interesting is the set of imports whose definition we have
+  // not yet seen.  So while imports only grow, the set of
+  // interesting defs dwindles and once empty we can stop
+  // searching.
+  auto_bitmap interesting;
+  bitmap_copy (interesting, m_imports);
+  back_threader_profitability profit (m_flags & BT_SPEED, stmt);
+  find_paths_to_names (bb, interesting, 1, profit);
 }
 
 DEBUG_FUNCTION void
@@ -615,26 +599,21 @@ back_threader::debug ()
   dump (stderr);
 }
 
-/* Examine jump threading path PATH and return TRUE if it is profitable to
-   thread it, otherwise return FALSE.
+/* Examine jump threading path PATH and return TRUE if it is possibly
+   profitable to thread it, otherwise return FALSE.  Indicate in
+   *LARGE_NON_FSM whether the thread is too large for a non-fsm thread
+   but would be OK if we discover a backedge on our way.
 
    NAME is the SSA_NAME of the variable we found to have a constant
    value on PATH.  If unknown, SSA_NAME is NULL.
 
-   If the taken edge out of the path is known ahead of time it is passed in
-   TAKEN_EDGE, otherwise it is NULL.
-
-   CREATES_IRREDUCIBLE_LOOP, if non-null is set to TRUE if threading this path
-   would create an irreducible loop.
-
    ?? It seems we should be able to loosen some of the restrictions in
    this function after loop optimizations have run.  */
 
 bool
-back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
-						tree name,
-						edge taken_edge,
-						bool *creates_irreducible_loop)
+back_threader_profitability::possibly_profitable_path_p
+				  (const vec<basic_block> &m_path, tree name,
+				   bool *large_non_fsm)
 {
   gcc_checking_assert (!m_path.is_empty ());
 
@@ -650,13 +629,15 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
   if (m_path.length () <= 1)
       return false;
 
-  int n_insns = 0;
   gimple_stmt_iterator gsi;
   loop_p loop = m_path[0]->loop_father;
-  bool threaded_through_latch = false;
-  bool multiway_branch_in_path = false;
-  bool threaded_multiway_branch = false;
-  bool contains_hot_bb = false;
+
+  // We recompute the following; when we rewrite possibly_profitable_path_p
+  // to work incrementally on added BBs we will have to unwind them on
+  m_n_insns = 0;
+  m_threaded_through_latch = false;
+  m_multiway_branch_in_path = false;
+  m_contains_hot_bb = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Checking profitability of path (backwards): ");
@@ -679,7 +660,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	 it ends in a multiway branch.  */
       if (j < m_path.length () - 1)
 	{
-	  int orig_n_insns = n_insns;
+	  int orig_n_insns = m_n_insns;
 	  /* PHIs in the path will create degenerate PHIS in the
 	     copied path which will then get propagated away, so
 	     looking at just the duplicate path the PHIs would
@@ -714,12 +695,12 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 		      && (SSA_NAME_VAR (dst) != SSA_NAME_VAR (name)
 			  || !SSA_NAME_VAR (dst))
 		      && !virtual_operand_p (dst))
-		    ++n_insns;
+		    ++m_n_insns;
 		}
 	    }
 
-	  if (!contains_hot_bb && m_speed_p)
-	    contains_hot_bb |= optimize_bb_for_speed_p (bb);
+	  if (!m_contains_hot_bb && m_speed_p)
+	    m_contains_hot_bb |= optimize_bb_for_speed_p (bb);
 	  for (gsi = gsi_after_labels (bb);
 	       !gsi_end_p (gsi);
 	       gsi_next_nondebug (&gsi))
@@ -735,10 +716,10 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	      /* Do not count empty statements and labels.  */
 	      if (gimple_code (stmt) != GIMPLE_NOP
 		  && !is_gimple_debug (stmt))
-		n_insns += estimate_num_insns (stmt, &eni_size_weights);
+		m_n_insns += estimate_num_insns (stmt, &eni_size_weights);
 	    }
 	  if (dump_file && (dump_flags & TDF_DETAILS))
-	    fprintf (dump_file, " (%i insns)", n_insns-orig_n_insns);
+	    fprintf (dump_file, " (%i insns)", m_n_insns-orig_n_insns);
 
 	  /* We do not look at the block with the threaded branch
 	     in this loop.  So if any block with a last statement that
@@ -747,14 +728,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 
 	     The block in PATH[0] is special, it's the block were we're
 	     going to be able to eliminate its branch.  */
-	  gimple *last = last_stmt (bb);
-	  if (last && (gimple_code (last) == GIMPLE_SWITCH
-		       || gimple_code (last) == GIMPLE_GOTO))
+	  if (j > 0)
 	    {
-	      if (j == 0)
-		threaded_multiway_branch = true;
-	      else
-		multiway_branch_in_path = true;
+	      gimple *last = last_stmt (bb);
+	      if (last
+		  && (gimple_code (last) == GIMPLE_SWITCH
+		      || gimple_code (last) == GIMPLE_GOTO))
+		m_multiway_branch_in_path = true;
 	    }
 	}
 
@@ -763,50 +743,29 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	 through the loop latch.  */
       if (loop->latch == bb)
 	{
-	  threaded_through_latch = true;
+	  m_threaded_through_latch = true;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, " (latch)");
 	}
     }
 
-  gimple *stmt = get_gimple_control_stmt (m_path[0]);
-  gcc_assert (stmt);
-
   /* We are going to remove the control statement at the end of the
      last block in the threading path.  So don't count it against our
      statement count.  */
-
-  int stmt_insns = estimate_num_insns (stmt, &eni_size_weights);
-  n_insns-= stmt_insns;
+  m_n_insns -= m_exit_jump_benefit;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "\n  Control statement insns: %i\n"
 	     "  Overall: %i insns\n",
-	     stmt_insns, n_insns);
-
-  if (creates_irreducible_loop)
-    {
-      /* If this path threaded through the loop latch back into the
-	 same loop and the destination does not dominate the loop
-	 latch, then this thread would create an irreducible loop.  */
-      *creates_irreducible_loop = false;
-      if (taken_edge
-	  && threaded_through_latch
-	  && loop == taken_edge->dest->loop_father
-	  && (determine_bb_domination_status (loop, taken_edge->dest)
-	      == DOMST_NONDOMINATING))
-	*creates_irreducible_loop = true;
-    }
+	     m_exit_jump_benefit, m_n_insns);
 
   /* Threading is profitable if the path duplicated is hot but also
      in a case we separate cold path from hot path and permit optimization
      of the hot path later.  Be on the agressive side here. In some testcases,
      as in PR 78407 this leads to noticeable improvements.  */
-  if (m_speed_p
-      && ((taken_edge && optimize_edge_for_speed_p (taken_edge))
-	  || contains_hot_bb))
+  if (m_speed_p)
     {
-      if (n_insns >= param_max_fsm_thread_path_insns)
+      if (m_n_insns >= param_max_fsm_thread_path_insns)
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
@@ -814,29 +773,104 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 		     "exceeds PARAM_MAX_FSM_THREAD_PATH_INSNS.\n");
 	  return false;
 	}
-      if (taken_edge && probably_never_executed_edge_p (cfun, taken_edge))
+      edge entry = find_edge (m_path[m_path.length () - 1],
+			      m_path[m_path.length () - 2]);
+      if (probably_never_executed_edge_p (cfun, entry))
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
-		     "path leads to probably never executed edge.\n");
+		     "path entry is probably never executed.\n");
 	  return false;
 	}
-      edge entry = find_edge (m_path[m_path.length () - 1],
-			      m_path[m_path.length () - 2]);
-      if (probably_never_executed_edge_p (cfun, entry))
+    }
+  else if (m_n_insns > 1)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
+		 "duplication of %i insns is needed and optimizing for size.\n",
+		 m_n_insns);
+      return false;
+    }
+
+  /* The generic copier used by the backthreader does not re-use an
+     existing threading path to reduce code duplication.  So for that
+     case, drastically reduce the number of statements we are allowed
+     to copy.  We don't know yet whether we will thread through the latch
+     so we have to be permissive and continue threading, but indicate
+     to the caller the thread, if final, wouldn't be profitable.  */
+  if ((!m_threaded_multiway_branch
+       || !loop->latch
+       || loop->latch->index == EXIT_BLOCK)
+      && (m_n_insns * param_fsm_scale_path_stmts
+	  >= param_max_jump_thread_duplication_stmts))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "  FAIL: Did not thread around loop and would copy too "
+		 "many statements.\n");
+      return false;
+    }
+  *large_non_fsm = (!(m_threaded_through_latch && m_threaded_multiway_branch)
+		    && (m_n_insns * param_fsm_scale_path_stmts
+			>= param_max_jump_thread_duplication_stmts));
+
+  return true;
+}
+
+/* Examine jump threading path PATH and return TRUE if it is profitable to
+   thread it, otherwise return FALSE.
+
+   The taken edge out of the path is TAKEN_EDGE.
+
+   CREATES_IRREDUCIBLE_LOOP is set to TRUE if threading this path
+   would create an irreducible loop.
+
+   ?? It seems we should be able to loosen some of the restrictions in
+   this function after loop optimizations have run.  */
+
+bool
+back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
+						edge taken_edge,
+						bool *creates_irreducible_loop)
+{
+  // We can assume that possibly_profitable_path_p holds here
+
+  loop_p loop = m_path[0]->loop_father;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "Checking profitability of path (backwards): ");
+
+  /* If this path threaded through the loop latch back into the
+     same loop and the destination does not dominate the loop
+     latch, then this thread would create an irreducible loop.  */
+  *creates_irreducible_loop = false;
+  if (m_threaded_through_latch
+      && loop == taken_edge->dest->loop_father
+      && (determine_bb_domination_status (loop, taken_edge->dest)
+	  == DOMST_NONDOMINATING))
+    *creates_irreducible_loop = true;
+
+  /* Threading is profitable if the path duplicated is hot but also
+     in a case we separate cold path from hot path and permit optimization
+     of the hot path later.  Be on the aggressive side here. In some testcases,
+     as in PR 78407 this leads to noticeable improvements.  */
+  if (m_speed_p
+      && (optimize_edge_for_speed_p (taken_edge) || m_contains_hot_bb))
+    {
+      if (probably_never_executed_edge_p (cfun, taken_edge))
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
-		     "path entry is probably never executed.\n");
+		     "path leads to probably never executed edge.\n");
 	  return false;
 	}
     }
-  else if (n_insns > 1)
+  else if (m_n_insns > 1)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
 		 "duplication of %i insns is needed and optimizing for size.\n",
-		 n_insns);
+		 m_n_insns);
       return false;
     }
 
@@ -849,10 +883,9 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
      the path -- in that case there's little the traditional loop
      optimizer would have done anyway, so an irreducible loop is not
      so bad.  */
-  if (!threaded_multiway_branch
-      && creates_irreducible_loop
+  if (!m_threaded_multiway_branch
       && *creates_irreducible_loop
-      && (n_insns * (unsigned) param_fsm_scale_path_stmts
+      && (m_n_insns * (unsigned) param_fsm_scale_path_stmts
 	  > (m_path.length () *
 	     (unsigned) param_fsm_scale_path_blocks)))
 
@@ -861,6 +894,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	fprintf (dump_file,
 		 "  FAIL: Would create irreducible loop without threading "
 		 "multiway branch.\n");
+      /* We compute creates_irreducible_loop only late.  */
       return false;
     }
 
@@ -868,8 +902,8 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
      existing threading path to reduce code duplication.  So for that
      case, drastically reduce the number of statements we are allowed
      to copy.  */
-  if (!(threaded_through_latch && threaded_multiway_branch)
-      && (n_insns * param_fsm_scale_path_stmts
+  if (!(m_threaded_through_latch && m_threaded_multiway_branch)
+      && (m_n_insns * param_fsm_scale_path_stmts
 	  >= param_max_jump_thread_duplication_stmts))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -883,7 +917,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
      explode the CFG due to duplicating the edges for that multi-way
      branch.  So like above, only allow a multi-way branch on the path
      if we actually thread a multi-way branch.  */
-  if (!threaded_multiway_branch && multiway_branch_in_path)
+  if (!m_threaded_multiway_branch && m_multiway_branch_in_path)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
@@ -896,20 +930,20 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
      the latch.  This could alter the loop form sufficiently to cause
      loop optimizations to fail.  Disable these threads until after
      loop optimizations have run.  */
-  if ((threaded_through_latch
-       || (taken_edge && taken_edge->dest == loop->latch))
+  if ((m_threaded_through_latch || taken_edge->dest == loop->latch)
       && !(cfun->curr_properties & PROP_loop_opts_done)
       && empty_block_p (loop->latch))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
-		 "  FAIL: Thread through latch before loop opts would create non-empty latch\n");
+		 "  FAIL: Thread through latch before loop opts would create "
+		 "non-empty latch\n");
       return false;
-
     }
   return true;
 }
 
+
 /* The current path PATH is a vector of blocks forming a jump threading
    path in reverse order.  TAKEN_EDGE is the edge taken from path[0].
 
-- 
2.35.3

+  if ((m_threaded_through_latch || taken_edge->dest == loop->latch)
       && !(cfun->curr_properties & PROP_loop_opts_done)
       && empty_block_p (loop->latch))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
-		 "  FAIL: Thread through latch before loop opts would create non-empty latch\n");
+		 "  FAIL: Thread through latch before loop opts would create "
+		 "non-empty latch\n");
       return false;
-
     }
   return true;
 }
 
+
 /* The current path PATH is a vector of blocks forming a jump threading
    path in reverse order.  TAKEN_EDGE is the edge taken from path[0].
 
-- 
2.35.3
