public inbox for gcc-patches@gcc.gnu.org
* [PATCH 1/4] cgraph: Do not warn about caller count mismatches of removed functions
  2021-08-24 11:48 [PATCH 0/4] IPA-CP profile feedback handling fixes Martin Jambor
@ 2021-08-20 17:43 ` Martin Jambor
  2021-09-16 15:10   ` Martin Jambor
  2021-08-20 17:43 ` [PATCH 2/4] ipa-cp: Propagation boost for recursion generated values Martin Jambor
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Martin Jambor @ 2021-08-20 17:43 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jan Hubicka, Xionghu Luo

To verify other changes in the patch series, I have been searching for
the "Invalid sum of caller counts" string in the symtab dump but found
that it gives false warnings about functions which have had their body
removed because they are now unreachable.  Such warnings are of course
invalid, so this patch avoids checking such cgraph_nodes.

gcc/ChangeLog:

2021-08-20  Martin Jambor  <mjambor@suse.cz>

	* cgraph.c (cgraph_node::dump): Do not check caller count sums if
	the body has been removed.  Remove trailing whitespace.
---
 gcc/cgraph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 8f3af003f2a..de078653781 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2236,7 +2236,7 @@ cgraph_node::dump (FILE *f)
     }
   fprintf (f, "\n");
 
-  if (count.ipa ().initialized_p ())
+  if (!body_removed && count.ipa ().initialized_p ())
     {
       bool ok = true;
       bool min = false;
@@ -2245,7 +2245,7 @@ cgraph_node::dump (FILE *f)
       FOR_EACH_ALIAS (this, ref)
 	if (dyn_cast <cgraph_node *> (ref->referring)->count.initialized_p ())
 	  sum += dyn_cast <cgraph_node *> (ref->referring)->count.ipa ();
-  
+
       if (inlined_to
 	  || (symtab->state < EXPANSION
 	      && ultimate_alias_target () == this && only_called_directly_p ()))
-- 
2.32.0



* [PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation
  2021-08-24 11:48 [PATCH 0/4] IPA-CP profile feedback handling fixes Martin Jambor
  2021-08-20 17:43 ` [PATCH 1/4] cgraph: Do not warn about caller count mismatches of removed functions Martin Jambor
  2021-08-20 17:43 ` [PATCH 2/4] ipa-cp: Propagation boost for recursion generated values Martin Jambor
@ 2021-08-20 17:43 ` Martin Jambor
  2021-10-08 11:31   ` Jan Hubicka
  2021-08-23 18:49 ` [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on Martin Jambor
  3 siblings, 1 reply; 16+ messages in thread
From: Martin Jambor @ 2021-08-20 17:43 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jan Hubicka, Xionghu Luo

IPA-CP does not do a reasonable job when it is updating profile counts
after it has created clones of recursive functions.  This patch
addresses that by:

1. Only updating counts for special-context clones.  When a clone is
created for all contexts, the original is going to be dead and the
cgraph machinery has copied counts to the new node which is the right
thing to do.  Therefore updating counts has been moved from
create_specialized_node to decide_about_value and
decide_whether_version_node.

2. The current profile updating code artificially increased the assumed
old count when the sum of counts of incoming edges to both the
original and the new node was bigger than the count of the original
node.  This always happened when the self-recursive edge from the
clone was also redirected to the clone, because both the original edge
and its clone retained their original high counts.  This crutch has
been removed and replaced by the next point.

3. When cloning also redirects the self-recursive edge to the clone
itself, new logic has been added to divide the counts brought by such
recursive edges between the original node and the clone.  This is
impossible to do well without special knowledge about the function and
about which non-recursive entry calls are responsible for what portion
of the recursion depth, so the approach taken is rather crude.

For non-local nodes which can have unknown callers, the algorithm just
takes half of the counts - we may decide that taking just a third or
some other portion is more reasonable, but I do not think we can
attempt anything more clever.

For local nodes, we detect the case when the original node is never
called (in the training run at least) with another value and, if so,
steal all its counts as if it were dead.  If that is not the case, we
try to divide the count brought by recursive edges (or rather the
count not brought by direct edges) proportionally to the counts
brought by non-recursive edges - but with artificial limits in place
so that we do not take too many or too few, because that was happening
with detrimental effect in mcf_r.
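
The clamping can be sketched roughly like this (an illustrative model
only, with plain integers and made-up names; the real code operates on
profile_count and lives in the rewritten update_profiling_info below):

  /* UNEXP is the count not explained by non-recursive edges, NEW_SUM and
     ORIG_SUM are the counts brought in by non-recursive callers of the clone
     and of the original node, NEW_NONREC and ORIG_NONREC the numbers of such
     callers.  Returns the portion of UNEXP claimed for the clone; the rest
     stays with the original node.  Illustrative only.  */
  static long
  claim_count_for_clone (long unexp, long new_sum, long orig_sum,
                         int new_nonrec, int orig_nonrec)
  {
    int limit_den = 2 * (orig_nonrec + new_nonrec);
    long part = unexp * new_sum / (new_sum + orig_sum);
    long upper = unexp * (limit_den - new_nonrec) / limit_den;
    long lower = unexp * new_nonrec / limit_den;
    if (part > upper)
      part = upper;
    if (part < lower)
      part = lower;
    return part;
  }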

4. When cloning creates extra clones for values brought by a formerly
self-recursive edge with an arithmetic pass-through jump function on
it, as it does in exchange2_r, all such clones are processed at once
rather than one after another.  The counts of all such nodes are
distributed evenly, except that each clone additionally keeps the
counts which can be attributed directly to it through non-recursive
edges, and the whole situation is then fixed up so that the edge
counts fit.  This is what the new function
update_counts_for_self_gen_clones does.

5. Values brought by a formerly self-recursive edge with an arithmetic
pass-through jump function on it are evaluated by a heuristic which
assumes that the vast majority of node counts are the result of
recursive calls, and so we simply divide the node count by the number
of clones there would be if we created another one.
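
For example, if three such clones already exist and the node count is
N, a newly considered value of this kind is evaluated as if it brought
in N/4 through recursive edges.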

6. The mechanisms in init_caller_stats, gather_caller_stats and
get_info_about_necessary_edges were enhanced to gather the data
required for the above, and a missing check not to count dead incoming
edges was also added.

gcc/ChangeLog:

2021-08-23  Martin Jambor  <mjambor@suse.cz>

	* ipa-cp.c (struct caller_statistics): New fields rec_count_sum,
	n_nonrec_calls and itself, document all fields.
	(init_caller_stats): Initialize the above new fields.
	(gather_caller_stats): Gather self-recursive counts and calls number.
	(get_info_about_necessary_edges): Gather counts of self-recursive and
	other edges bringing in the requested value separately.
	(dump_profile_updates): Rework to dump info about a single node only.
	(lenient_count_portion_handling): New function.
	(struct gather_other_count_struct): New type.
	(gather_count_of_non_rec_edges): New function.
	(struct desc_incoming_count_struct): New type.
	(analyze_clone_icoming_counts): New function.
	(adjust_clone_incoming_counts): Likewise.
	(update_counts_for_self_gen_clones): Likewise.
	(update_profiling_info): Rewritten.
	(update_specialized_profile): Adjust call to dump_profile_updates.
	(create_specialized_node): Do not update profiling info.
	(decide_about_value): New parameter self_gen_clones, either push new
	clones into it or update their profile counts.  For self-recursively
	generated values, use a portion of the node count instead of count
	from self-recursive edges to estimate goodness.
	(decide_whether_version_node): Gather clones for self-generated values
	in a new vector, update their profiles at once at the end.
---
 gcc/ipa-cp.c | 543 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 457 insertions(+), 86 deletions(-)

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index b987d975793..53cca7aa804 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -701,20 +701,36 @@ ipcp_versionable_function_p (struct cgraph_node *node)
 
 struct caller_statistics
 {
+  /* If requested (see below), self-recursive call counts are summed into this
+     field.  */
+  profile_count rec_count_sum;
+  /* The sum of all ipa counts of all the other (non-recursive) calls.  */
   profile_count count_sum;
+  /* Sum of all frequencies for all calls.  */
   sreal freq_sum;
+  /* Number of calls and hot calls respectively.  */
   int n_calls, n_hot_calls;
+  /* If itself is set up, also count the number of non-self-recursive
+     calls.  */
+  int n_nonrec_calls;
+  /* If non-NULL, this is the node itself and calls from it should have their
+     counts included in rec_count_sum and not count_sum.  */
+  cgraph_node *itself;
 };
 
-/* Initialize fields of STAT to zeroes.  */
+/* Initialize fields of STAT to zeroes and optionally set it up so that edges
+   from IGNORED_CALLER are not counted.  */
 
 static inline void
-init_caller_stats (struct caller_statistics *stats)
+init_caller_stats (caller_statistics *stats, cgraph_node *itself = NULL)
 {
+  stats->rec_count_sum = profile_count::zero ();
   stats->count_sum = profile_count::zero ();
   stats->n_calls = 0;
   stats->n_hot_calls = 0;
+  stats->n_nonrec_calls = 0;
   stats->freq_sum = 0;
+  stats->itself = itself;
 }
 
 /* Worker callback of cgraph_for_node_and_aliases accumulating statistics of
@@ -729,10 +745,22 @@ gather_caller_stats (struct cgraph_node *node, void *data)
   for (cs = node->callers; cs; cs = cs->next_caller)
     if (!cs->caller->thunk)
       {
-        if (cs->count.ipa ().initialized_p ())
-	  stats->count_sum += cs->count.ipa ();
+	ipa_node_params *info = ipa_node_params_sum->get (cs->caller);
+	if (info && info->node_dead)
+	  continue;
+
+	if (cs->count.ipa ().initialized_p ())
+	  {
+	    if (stats->itself && stats->itself == cs->caller)
+	      stats->rec_count_sum += cs->count.ipa ();
+	    else
+	      stats->count_sum += cs->count.ipa ();
+	  }
 	stats->freq_sum += cs->sreal_frequency ();
 	stats->n_calls++;
+	if (stats->itself && stats->itself != cs->caller)
+	  stats->n_nonrec_calls++;
+
 	if (cs->maybe_hot_p ())
 	  stats->n_hot_calls ++;
       }
@@ -4202,19 +4230,22 @@ get_next_cgraph_edge_clone (struct cgraph_edge *cs)
 
 /* Given VAL that is intended for DEST, iterate over all its sources and if any
    of them is viable and hot, return true.  In that case, for those that still
-   hold, add their edge frequency and their number into *FREQUENCY and
-   *CALLER_COUNT respectively.  */
+   hold, add their edge frequency and their number and cumulative profile
+   counts of self-recursive and other edges into *FREQUENCY, *CALLER_COUNT,
+   REC_COUNT_SUM and NONREC_COUNT_SUM respectively.  */
 
 template <typename valtype>
 static bool
 get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
-				sreal *freq_sum, profile_count *count_sum,
-				int *caller_count)
+				sreal *freq_sum, int *caller_count,
+				profile_count *rec_count_sum,
+				profile_count *nonrec_count_sum)
 {
   ipcp_value_source<valtype> *src;
   sreal freq = 0;
   int count = 0;
-  profile_count cnt = profile_count::zero ();
+  profile_count rec_cnt = profile_count::zero ();
+  profile_count nonrec_cnt = profile_count::zero ();
   bool hot = false;
   bool non_self_recursive = false;
 
@@ -4227,11 +4258,15 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
 	    {
 	      count++;
 	      freq += cs->sreal_frequency ();
-	      if (cs->count.ipa ().initialized_p ())
-	        cnt += cs->count.ipa ();
 	      hot |= cs->maybe_hot_p ();
 	      if (cs->caller != dest)
-		non_self_recursive = true;
+		{
+		  non_self_recursive = true;
+		  if (cs->count.ipa ().initialized_p ())
+		    rec_cnt += cs->count.ipa ();
+		}
+	      else if (cs->count.ipa ().initialized_p ())
+	        nonrec_cnt += cs->count.ipa ();
 	    }
 	  cs = get_next_cgraph_edge_clone (cs);
 	}
@@ -4243,8 +4278,9 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
     return false;
 
   *freq_sum = freq;
-  *count_sum = cnt;
   *caller_count = count;
+  *rec_count_sum = rec_cnt;
+  *nonrec_count_sum = nonrec_cnt;
 
   if (!hot && ipa_node_params_sum->get (dest)->node_within_scc)
     {
@@ -4349,112 +4385,408 @@ get_replacement_map (class ipa_node_params *info, tree value, int parm_num,
   return replace_map;
 }
 
-/* Dump new profiling counts */
+/* Dump new profiling counts of NODE.  SPEC is true when NODE is a specialized
+   one, otherwise it will be referred to as the original node.  */
 
 static void
-dump_profile_updates (struct cgraph_node *orig_node,
-		      struct cgraph_node *new_node)
+dump_profile_updates (cgraph_node *node, bool spec)
 {
-  struct cgraph_edge *cs;
+  if (spec)
+    fprintf (dump_file, "     setting count of the specialized node %s to ",
+	     node->dump_name ());
+  else
+    fprintf (dump_file, "     setting count of the original node %s to ",
+	     node->dump_name ());
 
-  fprintf (dump_file, "    setting count of the specialized node to ");
-  new_node->count.dump (dump_file);
+  node->count.dump (dump_file);
   fprintf (dump_file, "\n");
-  for (cs = new_node->callees; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
     {
-      fprintf (dump_file, "      edge to %s has count ",
-	       cs->callee->dump_name ());
-      cs->count.dump (dump_file);
-      fprintf (dump_file, "\n");
-    }
-
-  fprintf (dump_file, "    setting count of the original node to ");
-  orig_node->count.dump (dump_file);
-  fprintf (dump_file, "\n");
-  for (cs = orig_node->callees; cs; cs = cs->next_callee)
-    {
-      fprintf (dump_file, "      edge to %s is left with ",
+      fprintf (dump_file, "       edge to %s has count ",
 	       cs->callee->dump_name ());
       cs->count.dump (dump_file);
       fprintf (dump_file, "\n");
     }
 }
 
+/* With partial train run we do not want to assume that original's count is
+   zero whenever we redirect all executed edges to clone.  Simply drop profile
+   to local one in this case.  In any case, return the new value.  ORIG_NODE
+   is the original node and its count has not been updated yet.  */
+
+profile_count
+lenient_count_portion_handling (profile_count remainder, cgraph_node *orig_node)
+{
+  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
+      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
+      && opt_for_fn (orig_node->decl, flag_profile_partial_training))
+    remainder = remainder.guessed_local ();
+
+  return remainder;
+}
+
+/* Structure to sum counts coming from nodes other than the original node and
+   its clones.  */
+
+struct gather_other_count_struct
+{
+  cgraph_node *orig;
+  profile_count other_count;
+};
+
+/* Worker callback of call_for_symbol_thunks_and_aliases summing up profile
+   counts that come from non-self-recursive calls.  */
+
+static bool
+gather_count_of_non_rec_edges (cgraph_node *node, void *data)
+{
+  gather_other_count_struct *desc = (gather_other_count_struct *) data;
+  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
+    if (cs->caller != desc->orig && cs->caller->clone_of != desc->orig)
+      desc->other_count += cs->count.ipa ();
+  return false;
+}
+
+/* Structure to help analyze if we need to boost counts of some clones of some
+   non-recursive edges to match the new callee count.  */
+
+struct desc_incoming_count_struct
+{
+  cgraph_node *orig;
+  hash_set <cgraph_edge *> *processed_edges;
+  profile_count count;
+  unsigned unproc_orig_rec_edges;
+};
+
+/* Go over edges calling NODE and its thunks and gather information about
+   incoming counts so that we know if we need to make any adjustments.  */
+
+static void
+analyze_clone_icoming_counts (cgraph_node *node,
+			      desc_incoming_count_struct *desc)
+{
+  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
+    if (cs->caller->thunk)
+      {
+	analyze_clone_icoming_counts (cs->caller, desc);
+	continue;
+      }
+    else
+      {
+	if (cs->count.initialized_p ())
+	  desc->count += cs->count.ipa ();
+	if (!desc->processed_edges->contains (cs)
+	    && cs->caller->clone_of == desc->orig)
+	  desc->unproc_orig_rec_edges++;
+      }
+}
+
+/* If caller edge counts of a clone created for a self-recursive arithmetic jump
+   function must be adjusted, do so. NODE is the node or its thunk.  */
+
+static void
+adjust_clone_incoming_counts (cgraph_node *node,
+			      desc_incoming_count_struct *desc)
+{
+  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
+    if (cs->caller->thunk)
+      {
+	adjust_clone_incoming_counts (cs->caller, desc);
+	profile_count sum = profile_count::zero ();
+	for (cgraph_edge *e = cs->caller->callers; e; e = e->next_caller)
+	  if (e->count.initialized_p ())
+	    sum += e->count.ipa ();
+	cs->count = cs->count.combine_with_ipa_count (sum);
+      }
+    else if (!desc->processed_edges->contains (cs)
+	     && cs->caller->clone_of == desc->orig)
+      {
+	cs->count += desc->count;
+	if (dump_file)
+	  {
+	    fprintf (dump_file, "       Adjusted count of an incoming edge of "
+		     "a clone %s -> %s to ", cs->caller->dump_name (),
+		     cs->callee->dump_name ());
+	    cs->count.dump (dump_file);
+	    fprintf (dump_file, "\n");
+	  }
+      }
+}
+
+/* When ORIG_NODE has been cloned for values which have been generated for a
+   self-recursive call as a result of an arithmetic pass-through
+   jump function, adjust its count together with counts of all such clones in
+   SELF_GEN_CLONES which also at this point contains ORIG_NODE itself.
+
+   The function sums the counts of the original node and all its clones that
+   cannot be attributed to a specific clone because it comes from a
+   non-recursive edge.  This sum is then evenly divided between the clones and
+   on top of that each one gets all the counts which can be attributed directly
+   to it.  */
+
+static void
+update_counts_for_self_gen_clones (cgraph_node *orig_node,
+				   const vec<cgraph_node *> &self_gen_clones)
+{
+  profile_count redist_sum = orig_node->count.ipa ();
+  if (!(redist_sum > profile_count::zero ()))
+    return;
+
+  if (dump_file)
+    fprintf (dump_file, "     Updating profile of self recursive clone "
+	     "series\n");
+
+  gather_other_count_struct gocs;
+  gocs.orig = orig_node;
+  gocs.other_count = profile_count::zero ();
+
+  auto_vec <profile_count, 8> other_edges_count;
+  for (cgraph_node *n : self_gen_clones)
+    {
+      gocs.other_count = profile_count::zero ();
+      n->call_for_symbol_thunks_and_aliases (gather_count_of_non_rec_edges,
+					     &gocs, false);
+      other_edges_count.safe_push (gocs.other_count);
+      redist_sum -= gocs.other_count;
+    }
+
+  hash_set<cgraph_edge *> processed_edges;
+  unsigned i = 0;
+  for (cgraph_node *n : self_gen_clones)
+    {
+      profile_count orig_count = n->count;
+      profile_count new_count
+	= (redist_sum.apply_scale (1, self_gen_clones.length ())
+	   + other_edges_count[i]);
+      new_count = lenient_count_portion_handling (new_count, orig_node);
+      n->count = new_count;
+      profile_count::adjust_for_ipa_scaling (&new_count, &orig_count);
+      for (cgraph_edge *cs = n->callees; cs; cs = cs->next_callee)
+	{
+	  cs->count = cs->count.apply_scale (new_count, orig_count);
+	  processed_edges.add (cs);
+	}
+      for (cgraph_edge *cs = n->indirect_calls; cs; cs = cs->next_callee)
+	cs->count = cs->count.apply_scale (new_count, orig_count);
+
+      i++;
+    }
+
+  /* There are still going to be edges to ORIG_NODE that have one or more
+     clones from another clone in SELF_GEN_CLONES to ORIG_NODE and which we
+     scaled by the same amount, which means that the total incoming sum of
+     counts to ORIG_NODE will be too high, scale such edges back.  */
+  for (cgraph_edge *cs = orig_node->callees; cs; cs = cs->next_callee)
+    {
+      if (cs->callee->ultimate_alias_target () == orig_node)
+	{
+	  unsigned den = 0;
+	  for (cgraph_edge *e = cs; e; e = get_next_cgraph_edge_clone (e))
+	    if (e->callee->ultimate_alias_target () == orig_node
+		&& processed_edges.contains (e))
+	      den++;
+	  if (den > 0)
+	    for (cgraph_edge *e = cs; e; e = get_next_cgraph_edge_clone (e))
+	      if (e->callee->ultimate_alias_target () == orig_node
+		  && processed_edges.contains (e))
+		e->count = e->count.apply_scale (1, den);
+	}
+    }
+
+  /* Edges from the seeds of the values generated for arithmetic jump
+     functions along self-recursive edges are likely to have fairly low
+     counts and so edges from them to nodes in the self_gen_clones do not
+     correspond to the artificially distributed count of the nodes; the total
+     sum of incoming edges to some clones might be too low.  Detect this
+     situation and correct it.  */
+  for (cgraph_node *n : self_gen_clones)
+    {
+      if (!(n->count.ipa () > profile_count::zero ()))
+	continue;
+
+      desc_incoming_count_struct desc;
+      desc.orig = orig_node;
+      desc.processed_edges = &processed_edges;
+      desc.count = profile_count::zero ();
+      desc.unproc_orig_rec_edges = 0;
+      analyze_clone_icoming_counts (n, &desc);
+
+      if (n->count.differs_from_p (desc.count))
+	{
+	  if (n->count > desc.count
+	      && desc.unproc_orig_rec_edges > 0)
+	    {
+	      desc.count = n->count - desc.count;
+	      desc.count
+		= desc.count.apply_scale (1, desc.unproc_orig_rec_edges);
+	      adjust_clone_incoming_counts (n, &desc);
+	    }
+	  else if (dump_file)
+	    fprintf (dump_file,
+		     "       Unable to fix up incoming counts for %s.\n",
+		     n->dump_name ());
+	}
+    }
+
+  if (dump_file)
+    for (cgraph_node *n : self_gen_clones)
+      dump_profile_updates (n, n != orig_node);
+  return;
+}
+
 /* After a specialized NEW_NODE version of ORIG_NODE has been created, update
-   their profile information to reflect this.  */
+   their profile information to reflect this.  This function should not be used
+   for clones generated for arithmetic pass-through jump functions on a
+   self-recursive call graph edge, that situation is handled by
+   update_counts_for_self_gen_clones.  */
 
 static void
 update_profiling_info (struct cgraph_node *orig_node,
 		       struct cgraph_node *new_node)
 {
-  struct cgraph_edge *cs;
   struct caller_statistics stats;
-  profile_count new_sum, orig_sum;
-  profile_count remainder, orig_node_count = orig_node->count;
-  profile_count orig_new_node_count = new_node->count;
+  profile_count new_sum;
+  profile_count remainder, orig_node_count = orig_node->count.ipa ();
 
-  if (!(orig_node_count.ipa () > profile_count::zero ()))
+  if (!(orig_node_count > profile_count::zero ()))
     return;
 
-  init_caller_stats (&stats);
-  orig_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, &stats,
-					       false);
-  orig_sum = stats.count_sum;
-  init_caller_stats (&stats);
+  if (dump_file)
+    {
+      fprintf (dump_file, "     Updating profile from original count: ");
+      orig_node_count.dump (dump_file);
+      fprintf (dump_file, "\n");
+    }
+
+  init_caller_stats (&stats, new_node);
   new_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, &stats,
 					      false);
   new_sum = stats.count_sum;
 
-  if (orig_node_count < orig_sum + new_sum)
-    {
-      if (dump_file)
-	{
-	  fprintf (dump_file, "    Problem: node %s has too low count ",
-		   orig_node->dump_name ());
-	  orig_node_count.dump (dump_file);
-	  fprintf (dump_file, "while the sum of incoming count is ");
-	  (orig_sum + new_sum).dump (dump_file);
-	  fprintf (dump_file, "\n");
-	}
 
-      orig_node_count = (orig_sum + new_sum).apply_scale (12, 10);
-      if (dump_file)
+  if (new_sum > orig_node_count)
+    {
+      /* TODO: Perhaps this should be gcc_unreachable ()?  */
+      remainder = profile_count::zero ().guessed_local ();
+    }
+  else if (stats.rec_count_sum.nonzero_p ())
+    {
+      int new_nonrec_calls = stats.n_nonrec_calls;
+      /* There are self-recursive edges which are likely to bring in the
+	 majority of calls but which we must divide in between the original and
+	 new node.  */
+      init_caller_stats (&stats, orig_node);
+      orig_node->call_for_symbol_thunks_and_aliases (gather_caller_stats,
+						     &stats, false);
+      if (orig_node->local)
 	{
-	  fprintf (dump_file, "      proceeding by pretending it was ");
-	  orig_node_count.dump (dump_file);
-	  fprintf (dump_file, "\n");
+	  /* If ORIG_NODE is local, divide all "unexplained" counts roughly
+	     proportionally to the sums of counts of non-recursive calls,
+	     unless orig_node does not seem to have any counts from elsewhere,
+	     in which case assume it is dead.  */
+	  if (!stats.count_sum.nonzero_p ())
+	    {
+	      if (dump_file)
+		fprintf (dump_file, "       The original is local and only "
+			 "has self-recursive edges with counts from non-dead "
+			 "callers, assuming it is dead too.\n");
+	      new_sum = orig_node_count;
+	      if (opt_for_fn (orig_node->decl, flag_profile_partial_training))
+		remainder = profile_count::zero ().guessed_local ();
+	      else
+		{
+		  /* The NEW_NODE count and counts of all its outgoing edges
+		     are still unmodified copies of ORIG_NODE's.  Just clear
+		     the latter and bail out.  */
+		  orig_node->count = profile_count::zero ().guessed_local ();
+		  for (cgraph_edge *cs = orig_node->callees;
+		       cs;
+		       cs = cs->next_callee)
+		    cs->count = profile_count::zero ().guessed_local ();
+		  for (cgraph_edge *cs = orig_node->indirect_calls;
+		       cs;
+		       cs = cs->next_callee)
+		    cs->count = profile_count::zero ().guessed_local ();
+		  return;
+		}
+	    }
+	  else
+	    {
+	      profile_count unexp = orig_node_count - stats.count_sum;
+	      /* We put rather arbitrary limits on how many counts we claim
+		 because the number of non-self-recursive incoming count is
+		 only a rough guideline and there are cases (such as mcf) where
+		 using it blindly just takes too many.  And if lattices are
+		 considered in the opposite order we could also take too
+		 few.  */
+	      int limit_den = 2 * (stats.n_nonrec_calls + new_nonrec_calls);
+	      profile_count new_part
+		= MAX(MIN (unexp.apply_scale (new_sum,
+					      new_sum + stats.count_sum),
+			   unexp.apply_scale (limit_den - new_nonrec_calls,
+					      limit_den)),
+		      unexp.apply_scale (new_nonrec_calls, limit_den));
+
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "       Claiming ");
+		  new_part.dump (dump_file);
+		  fprintf (dump_file, " of unexplained ");
+		  unexp.dump (dump_file);
+		  fprintf (dump_file, " counts because of self-recursive "
+			   "calls\n");
+		}
+	      new_sum += new_part;
+	      unexp -= new_part;
+	      remainder = lenient_count_portion_handling (stats.count_sum
+							  + unexp,
+							  orig_node);
+	    }
+	}
+      else
+	{
+	  /* If ORIG_NODE is not local, take a wild guess and simply claim a
+	     half of all unexplained counts.  */
+	  profile_count half
+	    = (orig_node_count - stats.count_sum).apply_scale (1, 2);
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "       Claiming half of unexplained "
+		       "counts (");
+	      half.dump (dump_file);
+	      fprintf (dump_file, ") because of self-recursive calls\n");
+	    }
+	  new_sum += half;
+	  remainder = lenient_count_portion_handling (stats.count_sum + half,
+						      orig_node);
 	}
     }
-
-  remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa ()
-						      - new_sum.ipa ());
-
-  /* With partial train run we do not want to assume that original's
-     count is zero whenever we redurect all executed edges to clone.
-     Simply drop profile to local one in this case.  */
-  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
-      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
-      && flag_profile_partial_training)
-    remainder = remainder.guessed_local ();
+  else
+    remainder = lenient_count_portion_handling (orig_node_count - new_sum,
+						orig_node);
 
   new_sum = orig_node_count.combine_with_ipa_count (new_sum);
   new_node->count = new_sum;
   orig_node->count = remainder;
 
+  profile_count orig_new_node_count = orig_node_count;
   profile_count::adjust_for_ipa_scaling (&new_sum, &orig_new_node_count);
-  for (cs = new_node->callees; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = new_node->callees; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (new_sum, orig_new_node_count);
-  for (cs = new_node->indirect_calls; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = new_node->indirect_calls; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (new_sum, orig_new_node_count);
 
   profile_count::adjust_for_ipa_scaling (&remainder, &orig_node_count);
-  for (cs = orig_node->callees; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = orig_node->callees; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (remainder, orig_node_count);
-  for (cs = orig_node->indirect_calls; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = orig_node->indirect_calls; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (remainder, orig_node_count);
 
   if (dump_file)
-    dump_profile_updates (orig_node, new_node);
+    {
+      dump_profile_updates (new_node, true);
+      dump_profile_updates (orig_node, false);
+    }
 }
 
 /* Update the respective profile of specialized NEW_NODE and the original
@@ -4495,7 +4827,10 @@ update_specialized_profile (struct cgraph_node *new_node,
     }
 
   if (dump_file)
-    dump_profile_updates (orig_node, new_node);
+    {
+      dump_profile_updates (new_node, true);
+      dump_profile_updates (orig_node, false);
+    }
 }
 
 static void adjust_references_in_caller (cgraph_edge *cs,
@@ -4795,8 +5130,7 @@ create_specialized_node (struct cgraph_node *node,
       if (aggvals)
 	ipa_dump_agg_replacement_values (dump_file, aggvals);
     }
-  ipa_check_create_node_params ();
-  update_profiling_info (node, new_node);
+
   new_info = ipa_node_params_sum->get (new_node);
   new_info->ipcp_orig_node = node;
   new_node->ipcp_clone = true;
@@ -5621,17 +5955,20 @@ ipcp_val_agg_replacement_ok_p (ipa_agg_replacement_value *,
 /* Decide whether to create a special version of NODE for value VAL of
    parameter at the given INDEX.  If OFFSET is -1, the value is for the
    parameter itself, otherwise it is stored at the given OFFSET of the
-   parameter.  AVALS describes the other already known values.  */
+   parameter.  AVALS describes the other already known values.  SELF_GEN_CLONES
+   is a vector which contains clones created for self-recursive calls with an
+   arithmetic pass-through jump function.  */
 
 template <typename valtype>
 static bool
 decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
-		    ipcp_value<valtype> *val, ipa_auto_call_arg_values *avals)
+		    ipcp_value<valtype> *val, ipa_auto_call_arg_values *avals,
+		    vec<cgraph_node *> *self_gen_clones)
 {
   struct ipa_agg_replacement_value *aggvals;
   int caller_count;
   sreal freq_sum;
-  profile_count count_sum;
+  profile_count count_sum, rec_count_sum;
   vec<cgraph_edge *> callers;
 
   if (val->spec_node)
@@ -5647,13 +5984,31 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
 		 val->local_size_cost + overall_size);
       return false;
     }
-  else if (!get_info_about_necessary_edges (val, node, &freq_sum, &count_sum,
-					    &caller_count))
+  else if (!get_info_about_necessary_edges (val, node, &freq_sum, &caller_count,
+					    &rec_count_sum, &count_sum))
     return false;
 
   if (!dbg_cnt (ipa_cp_values))
     return false;
 
+  if (val->self_recursion_generated_p ())
+    {
+      /* The edge counts in this case might not have been adjusted yet.
+	 Nevertheless, even if they were it would be only a guesswork which we
+	 can do now.  The recursive part of the counts can be derived from the
+	 count of the original node anyway.  */
+      if (node->count.ipa ().nonzero_p ())
+	{
+	  unsigned dem = self_gen_clones->length () + 1;
+	  rec_count_sum = node->count.ipa ().apply_scale (1, dem);
+	}
+      else
+	rec_count_sum = profile_count::zero ();
+    }
+
+  /* get_info_about_necessary_edges only sums up ipa counts.  */
+  count_sum += rec_count_sum;
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, " - considering value ");
@@ -5694,6 +6049,12 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
 						      offset, val->value));
   val->spec_node = create_specialized_node (node, known_csts, known_contexts,
 					    aggvals, callers);
+
+  if (val->self_recursion_generated_p ())
+    self_gen_clones->safe_push (val->spec_node);
+  else
+    update_profiling_info (node, val->spec_node);
+
   callers.release ();
   overall_size += val->local_size_cost;
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -5722,6 +6083,7 @@ decide_whether_version_node (struct cgraph_node *node)
     fprintf (dump_file, "\nEvaluating opportunities for %s.\n",
 	     node->dump_name ());
 
+  auto_vec <cgraph_node *, 9> self_gen_clones;
   ipa_auto_call_arg_values avals;
   gather_context_independent_values (info, &avals, false, NULL);
 
@@ -5736,7 +6098,8 @@ decide_whether_version_node (struct cgraph_node *node)
 	{
 	  ipcp_value<tree> *val;
 	  for (val = lat->values; val; val = val->next)
-	    ret |= decide_about_value (node, i, -1, val, &avals);
+	    ret |= decide_about_value (node, i, -1, val, &avals,
+				       &self_gen_clones);
 	}
 
       if (!plats->aggs_bottom)
@@ -5750,7 +6113,8 @@ decide_whether_version_node (struct cgraph_node *node)
 		&& (plats->aggs_contain_variable
 		    || !aglat->is_single_const ()))
 	      for (val = aglat->values; val; val = val->next)
-		ret |= decide_about_value (node, i, aglat->offset, val, &avals);
+		ret |= decide_about_value (node, i, aglat->offset, val, &avals,
+					   &self_gen_clones);
 	}
 
       if (!ctxlat->bottom
@@ -5758,10 +6122,17 @@ decide_whether_version_node (struct cgraph_node *node)
 	{
 	  ipcp_value<ipa_polymorphic_call_context> *val;
 	  for (val = ctxlat->values; val; val = val->next)
-	    ret |= decide_about_value (node, i, -1, val, &avals);
+	    ret |= decide_about_value (node, i, -1, val, &avals,
+				       &self_gen_clones);
 	}
     }
 
+  if (!self_gen_clones.is_empty ())
+    {
+      self_gen_clones.safe_push (node);
+      update_counts_for_self_gen_clones (node, self_gen_clones);
+    }
+
   if (info->do_clone_for_all_contexts)
     {
       if (!dbg_cnt (ipa_cp_values))
-- 
2.32.0



* [PATCH 2/4] ipa-cp: Propagation boost for recursion generated values
  2021-08-24 11:48 [PATCH 0/4] IPA-CP profile feedback handling fixes Martin Jambor
  2021-08-20 17:43 ` [PATCH 1/4] cgraph: Do not warn about caller count mismatches of removed functions Martin Jambor
@ 2021-08-20 17:43 ` Martin Jambor
  2021-10-06 15:49   ` Jan Hubicka
  2021-08-20 17:43 ` [PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation Martin Jambor
  2021-08-23 18:49 ` [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on Martin Jambor
  3 siblings, 1 reply; 16+ messages in thread
From: Martin Jambor @ 2021-08-20 17:43 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jan Hubicka, Xionghu Luo

Recursive call graph edges, even when they are hot and important for
the compiled program, can never have a frequency bigger than one, even
when the actual time savings they enable are not realized just once
but repeatedly, depending on the depth of recursion.  The current
IPA-CP effect propagation code did not take that into account and just
used the frequency, thus severely underestimating the effect.

This patch artificially boosts values taking part in such calls.  If a
value feeds into itself through a recursive call, the frequency of the
edge is multiplied by a parameter with default value of 6, basically
assuming that the recursion will take place 6 times.  This value can
of course be subject to change.

Moreover, values which do not feed into themselves but which were
generated for a self-recursive call with an arithmetic pass-through
jump function (aka the 548.exchange "hack", which is however generally
applicable to recursive functions which count the recursion depth in a
parameter) have the edge frequency multiplied as many times as there
are generated values in the chain.  In essence, we will assume they
are all useful.
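
For example, if ipa-cp-max-recursive-depth is D, the value generated
by the first application of the arithmetic operation gets the
frequency of the self-recursive edge multiplied by D, the next one by
D-1 and so on, which in effect assumes that a value corresponding to a
shallow recursion depth stays useful for all the deeper levels as
well.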

This patch partially fixes the current situation in which we fail to
optimize 548.exchange with PGO.  In the benchmark, one recursive edge
count overwhelmingly dominates all other counts in the program and so
we fail to perform the first cloning (for the non-recursive entry
call) because it looks totally insignificant.

gcc/ChangeLog:

2021-07-16  Martin Jambor  <mjambor@suse.cz>

	* params.opt (ipa-cp-recursive-freq-factor): New.
	* ipa-cp.c (ipcp_value): Switch to inline initialization.  New members
	scc_no, self_recursion_generated_level, same_scc and
	self_recursion_generated_p.
	(ipcp_lattice::add_value): Replaced parameter unlimited with
	same_lat_gen_level, use it to determine the limit of values and store
	it to the value.
	(ipcp_lattice<valtype>::print): Dump the new fields.
	(allocate_and_init_ipcp_value): Take same_lat_gen_level as a new
	parameter and store it to the new value.
	(self_recursively_generated_p): Removed.
	(propagate_vals_across_arith_jfunc): Use self_recursion_generated_p
	instead of self_recursively_generated_p, store self generation level
	to such values.
	(value_topo_info<valtype>::add_val): Set scc_no.
	(value_topo_info<valtype>::propagate_effects): Multiply frequencies of
	recursively feeding values and self generated values by appropriate
	new factors.
---
 gcc/ipa-cp.c   | 161 ++++++++++++++++++++++++-------------------------
 gcc/params.opt |   4 ++
 2 files changed, 84 insertions(+), 81 deletions(-)

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index 55b9216337f..b987d975793 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -184,30 +184,52 @@ public:
   /* The actual value for the given parameter.  */
   valtype value;
   /* The list of sources from which this value originates.  */
-  ipcp_value_source <valtype> *sources;
+  ipcp_value_source <valtype> *sources = nullptr;
   /* Next pointers in a linked list of all values in a lattice.  */
-  ipcp_value *next;
+  ipcp_value *next = nullptr;
   /* Next pointers in a linked list of values in a strongly connected component
      of values. */
-  ipcp_value *scc_next;
+  ipcp_value *scc_next = nullptr;
   /* Next pointers in a linked list of SCCs of values sorted topologically
      according their sources.  */
-  ipcp_value  *topo_next;
+  ipcp_value  *topo_next = nullptr;
   /* A specialized node created for this value, NULL if none has been (so far)
      created.  */
-  cgraph_node *spec_node;
+  cgraph_node *spec_node = nullptr;
   /* Depth first search number and low link for topological sorting of
      values.  */
-  int dfs, low_link;
+  int dfs = 0;
+  int low_link = 0;
+  /* SCC number to identify values which recursively feed into each other.
+     Values in the same SCC have the same SCC number.  */
+  int scc_no = 0;
+  /* Non-zero if the value is generated from another value in the same lattice
+     for a self-recursive call; the actual number is how many times the
+     operation has been performed.  In the unlikely event of the value being
+     present in two self-recursive value generation chains, it is the
+     maximum.  */
+  unsigned self_recursion_generated_level = 0;
   /* True if this value is currently on the topo-sort stack.  */
-  bool on_stack;
-
-  ipcp_value()
-    : sources (0), next (0), scc_next (0), topo_next (0),
-      spec_node (0), dfs (0), low_link (0), on_stack (false) {}
+  bool on_stack = false;
 
   void add_source (cgraph_edge *cs, ipcp_value *src_val, int src_idx,
 		   HOST_WIDE_INT offset);
+
+  /* Return true if both THIS value and O feed into each other.  */
+
+  bool same_scc (const ipcp_value<valtype> *o)
+  {
+    return o->scc_no == scc_no;
+  }
+
+/* Return true if this value has been generated for a self-recursive call as
+   a result of an arithmetic pass-through jump-function acting on a value in
+   the same lattice.  */
+
+  bool self_recursion_generated_p ()
+  {
+    return self_recursion_generated_level > 0;
+  }
 };
 
 /* Lattice describing potential values of a formal parameter of a function, or
@@ -239,7 +261,7 @@ public:
 		  ipcp_value<valtype> *src_val = NULL,
 		  int src_idx = 0, HOST_WIDE_INT offset = -1,
 		  ipcp_value<valtype> **val_p = NULL,
-		  bool unlimited = false);
+		  unsigned same_lat_gen_level = 0);
   void print (FILE * f, bool dump_sources, bool dump_benefits);
 };
 
@@ -498,7 +520,11 @@ ipcp_lattice<valtype>::print (FILE * f, bool dump_sources, bool dump_benefits)
 	{
 	  ipcp_value_source<valtype> *s;
 
-	  fprintf (f, " [from:");
+	  if (val->self_recursion_generated_p ())
+	    fprintf (f, " [self_gen(%i), from:",
+		     val->self_recursion_generated_level);
+	  else
+	    fprintf (f, " [scc: %i, from:", val->scc_no);
 	  for (s = val->sources; s; s = s->next)
 	    fprintf (f, " %i(%f)", s->cs->caller->order,
 		     s->cs->sreal_frequency ().to_double ());
@@ -1837,12 +1863,13 @@ ipcp_value<valtype>::add_source (cgraph_edge *cs, ipcp_value *src_val,
    SOURCE and clear all other fields.  */
 
 static ipcp_value<tree> *
-allocate_and_init_ipcp_value (tree source)
+allocate_and_init_ipcp_value (tree cst, unsigned same_lat_gen_level)
 {
   ipcp_value<tree> *val;
 
   val = new (ipcp_cst_values_pool.allocate ()) ipcp_value<tree>();
-  val->value = source;
+  val->value = cst;
+  val->self_recursion_generated_level = same_lat_gen_level;
   return val;
 }
 
@@ -1850,14 +1877,15 @@ allocate_and_init_ipcp_value (tree source)
    value to SOURCE and clear all other fields.  */
 
 static ipcp_value<ipa_polymorphic_call_context> *
-allocate_and_init_ipcp_value (ipa_polymorphic_call_context source)
+allocate_and_init_ipcp_value (ipa_polymorphic_call_context ctx,
+			      unsigned same_lat_gen_level)
 {
   ipcp_value<ipa_polymorphic_call_context> *val;
 
-  // TODO
   val = new (ipcp_poly_ctx_values_pool.allocate ())
     ipcp_value<ipa_polymorphic_call_context>();
-  val->value = source;
+  val->value = ctx;
+  val->self_recursion_generated_level = same_lat_gen_level;
   return val;
 }
 
@@ -1865,8 +1893,12 @@ allocate_and_init_ipcp_value (ipa_polymorphic_call_context source)
    SRC_VAL SRC_INDEX and OFFSET are meant for add_source and have the same
    meaning.  OFFSET -1 means the source is scalar and not a part of an
    aggregate.  If non-NULL, VAL_P records address of existing or newly added
-   ipcp_value.  UNLIMITED means whether value count should not exceed the limit
-   given by PARAM_IPA_CP_VALUE_LIST_SIZE.  */
+   ipcp_value.
+
+   If the value is generated for a self-recursive call as a result of an
+   arithmetic pass-through jump-function acting on a value in the same lattice,
+   SAME_LAT_GEN_LEVEL must be the length of such chain, otherwise it must be
+   zero.  If it is non-zero, PARAM_IPA_CP_VALUE_LIST_SIZE limit is ignored.  */
 
 template <typename valtype>
 bool
@@ -1874,7 +1906,7 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
 				  ipcp_value<valtype> *src_val,
 				  int src_idx, HOST_WIDE_INT offset,
 				  ipcp_value<valtype> **val_p,
-				  bool unlimited)
+				  unsigned same_lat_gen_level)
 {
   ipcp_value<valtype> *val, *last_val = NULL;
 
@@ -1890,6 +1922,9 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
 	if (val_p)
 	  *val_p = val;
 
+	if (val->self_recursion_generated_level < same_lat_gen_level)
+	  val->self_recursion_generated_level = same_lat_gen_level;
+
 	if (ipa_edge_within_scc (cs))
 	  {
 	    ipcp_value_source<valtype> *s;
@@ -1904,7 +1939,7 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
 	return false;
       }
 
-  if (!unlimited && values_count == opt_for_fn (cs->caller->decl,
+  if (!same_lat_gen_level && values_count == opt_for_fn (cs->caller->decl,
 						param_ipa_cp_value_list_size))
     {
       /* We can only free sources, not the values themselves, because sources
@@ -1923,7 +1958,7 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
     }
 
   values_count++;
-  val = allocate_and_init_ipcp_value (newval);
+  val = allocate_and_init_ipcp_value (newval, same_lat_gen_level);
   val->add_source (cs, src_val, src_idx, offset);
   val->next = NULL;
 
@@ -1940,60 +1975,6 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
   return true;
 }
 
-/* Return true, if a ipcp_value VAL is orginated from parameter value of
-   self-feeding recursive function via some kind of pass-through jump
-   function.  */
-
-static bool
-self_recursively_generated_p (ipcp_value<tree> *val)
-{
-  class ipa_node_params *info = NULL;
-
-  for (ipcp_value_source<tree> *src = val->sources; src; src = src->next)
-    {
-      cgraph_edge *cs = src->cs;
-
-      if (!src->val || cs->caller != cs->callee->function_symbol ())
-	return false;
-
-      if (src->val == val)
-	continue;
-
-      if (!info)
-	info = ipa_node_params_sum->get (cs->caller);
-
-      class ipcp_param_lattices *plats = ipa_get_parm_lattices (info,
-								src->index);
-      ipcp_lattice<tree> *src_lat;
-      ipcp_value<tree> *src_val;
-
-      if (src->offset == -1)
-	src_lat = &plats->itself;
-      else
-	{
-	  struct ipcp_agg_lattice *src_aglat;
-
-	  for (src_aglat = plats->aggs; src_aglat; src_aglat = src_aglat->next)
-	    if (src_aglat->offset == src->offset)
-	      break;
-
-	  if (!src_aglat)
-	    return false;
-
-	  src_lat = src_aglat;
-	}
-
-      for (src_val = src_lat->values; src_val; src_val = src_val->next)
-	if (src_val == val)
-	  break;
-
-      if (!src_val)
-	return false;
-    }
-
-  return true;
-}
-
 /* A helper function that returns result of operation specified by OPCODE on
    the value of SRC_VAL.  If non-NULL, OPND1_TYPE is expected type for the
    value of SRC_VAL.  If the operation is binary, OPND2 is a constant value
@@ -2068,7 +2049,7 @@ propagate_vals_across_arith_jfunc (cgraph_edge *cs,
 	     source, this is absolutely conservative, but could avoid explosion
 	     of lattice's value space, especially when one recursive function
 	     calls another recursive.  */
-	  if (self_recursively_generated_p (src_val))
+	  if (src_val->self_recursion_generated_p ())
 	    {
 	      ipcp_value_source<tree> *s;
 
@@ -2096,7 +2077,7 @@ propagate_vals_across_arith_jfunc (cgraph_edge *cs,
 		break;
 
 	      ret |= dest_lat->add_value (cstval, cs, src_val, src_idx,
-					  src_offset, &src_val, true);
+					  src_offset, &src_val, j);
 	      gcc_checking_assert (src_val);
 	    }
 	}
@@ -2108,7 +2089,7 @@ propagate_vals_across_arith_jfunc (cgraph_edge *cs,
 	/* Now we do not use self-recursively generated value as propagation
 	   source, otherwise it is easy to make value space of normal lattice
 	   overflow.  */
-	if (self_recursively_generated_p (src_val))
+	if (src_val->self_recursion_generated_p ())
 	  {
 	    ret |= dest_lat->set_contains_variable ();
 	    continue;
@@ -3732,6 +3713,7 @@ value_topo_info<valtype>::add_val (ipcp_value<valtype> *cur_val)
 	  v = stack;
 	  stack = v->topo_next;
 	  v->on_stack = false;
+	  v->scc_no = cur_val->dfs;
 
 	  v->scc_next = scc_list;
 	  scc_list = v;
@@ -3905,8 +3887,25 @@ value_topo_info<valtype>::propagate_effects ()
 		    else
 		      continue;
 		  }
+
+		int special_factor = 1;
+		if (val->same_scc (src->val))
+		  special_factor
+		    = opt_for_fn(src->cs->caller->decl,
+				 param_ipa_cp_recursive_freq_factor);
+		else if (val->self_recursion_generated_p ()
+			 && (src->cs->callee->function_symbol ()
+			     == src->cs->caller))
+		  {
+		    int max_recur_gen_depth
+		      = opt_for_fn(src->cs->caller->decl,
+				   param_ipa_cp_max_recursive_depth);
+		    special_factor = max_recur_gen_depth
+		      - val->self_recursion_generated_level + 1;
+		  }
+
 		src->val->prop_time_benefit
-		  += time * src->cs->sreal_frequency ();
+		  += time * special_factor * src->cs->sreal_frequency ();
 	      }
 
 	  if (size < INT_MAX)
diff --git a/gcc/params.opt b/gcc/params.opt
index 92b003e38cb..8d772309407 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -266,6 +266,10 @@ Maximum depth of recursive cloning for self-recursive function.
 Common Joined UInteger Var(param_ipa_cp_min_recursive_probability) Init(2) Param Optimization
 Recursive cloning only when the probability of call being executed exceeds the parameter.
 
+-param=ipa-cp-recursive-freq-factor=
+Common Joined UInteger Var(param_ipa_cp_recursive_freq_factor) Init(6) Param Optimization
+When propagating IPA-CP effect estimates, multiply frequencies of recursive edges that bring back an unchanged value by this factor.
+
 -param=ipa-cp-recursion-penalty=
 Common Joined UInteger Var(param_ipa_cp_recursion_penalty) Init(40) IntegerRange(0, 100) Param Optimization
 Percentage penalty the recursive functions will receive when they are evaluated for cloning.
-- 
2.32.0



* [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on
  2021-08-24 11:48 [PATCH 0/4] IPA-CP profile feedback handling fixes Martin Jambor
                   ` (2 preceding siblings ...)
  2021-08-20 17:43 ` [PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation Martin Jambor
@ 2021-08-23 18:49 ` Martin Jambor
  2021-10-06 15:33   ` Jan Hubicka
  2021-10-27 13:20   ` Martin Jambor
  3 siblings, 2 replies; 16+ messages in thread
From: Martin Jambor @ 2021-08-23 18:49 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jan Hubicka, Xionghu Luo

When profile feedback is available, IPA-CP takes the count of the
hottest node and then evaluates all call contexts relative to it.
This means that typically almost no clones for specialized contexts
are ever created, because the maximum is some special function which
is called from everywhere (and is likely to get inlined anyway), and
all the examined edges look cold compared to it.

This patch changes the selection.  It simply sorts counts of all edges
eligible for cloning in a vector and then picks the count at the 90th
percentile (the actual position is configurable via a parameter).
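
For example, with the default value of 10 for the new
ipa-cp-profile-count-base parameter and 1000 edges with a non-zero IPA
count leading to versionable functions, the counts are sorted in
descending order and the one at position 100 - i.e. hotter than
roughly 90% of the eligible edges - becomes the base_count which
good_cloning_opportunity_p then compares candidate count sums against.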

I also tried more complex approaches which summed the counts and
picked the edge which, together with all hotter edges, accounted for a
given portion of the total sum of all edge counts.  But first, it was
not at all clear to me that they make more logical sense than the
simple method, and in practice I always also had to ignore a few
percent of the hottest edges with really extreme counts (looking at
bash and python).  And when I had to do that anyway, it seemed simpler
to just "ignore" more and take the first non-ignored count as the
base.

Nevertheless, if people think some more sophisticated method should be
used anyway, I am willing to be persuaded.  But this patch is a clear
improvement over the current situation.

gcc/ChangeLog:

2021-08-23  Martin Jambor  <mjambor@suse.cz>

	* params.opt (param_ipa_cp_profile_count_base): New parameter.
	* ipa-cp.c (max_count): Replace with base_count, replace all
	occurrences too, unless otherwise stated.
	(ipcp_cloning_candidate_p): Identify mostly-directly called
	functions based on their counts, not max_count.
	(compare_edge_profile_counts): New function.
	(ipcp_propagate_stage): Instead of setting max_count, find the
	appropriate edge count in a sorted vector of counts of eligible
	edges and make it the base_count.
---
 gcc/ipa-cp.c   | 82 +++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/params.opt |  4 +++
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index 53cca7aa804..6ab74f61e83 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -400,9 +400,9 @@ object_allocator<ipcp_value_source<tree> > ipcp_sources_pool
 object_allocator<ipcp_agg_lattice> ipcp_agg_lattice_pool
   ("IPA_CP aggregate lattices");
 
-/* Maximal count found in program.  */
+/* Base count to use in heuristics when using profile feedback.  */
 
-static profile_count max_count;
+static profile_count base_count;
 
 /* Original overall size of the program.  */
 
@@ -809,7 +809,8 @@ ipcp_cloning_candidate_p (struct cgraph_node *node)
   /* When profile is available and function is hot, propagate into it even if
      calls seems cold; constant propagation can improve function's speed
      significantly.  */
-  if (max_count > profile_count::zero ())
+  if (stats.count_sum > profile_count::zero ()
+      && node->count.ipa ().initialized_p ())
     {
       if (stats.count_sum > node->count.ipa ().apply_scale (90, 100))
 	{
@@ -3310,10 +3311,10 @@ good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
 
   ipa_node_params *info = ipa_node_params_sum->get (node);
   int eval_threshold = opt_for_fn (node->decl, param_ipa_cp_eval_threshold);
-  if (max_count > profile_count::zero ())
+  if (base_count > profile_count::zero ())
     {
 
-      sreal factor = count_sum.probability_in (max_count).to_sreal ();
+      sreal factor = count_sum.probability_in (base_count).to_sreal ();
       sreal evaluation = (time_benefit * factor) / size_cost;
       evaluation = incorporate_penalties (node, info, evaluation);
       evaluation *= 1000;
@@ -3950,6 +3951,21 @@ value_topo_info<valtype>::propagate_effects ()
     }
 }
 
+/* Callback for qsort to sort counts of all edges.  */
+
+static int
+compare_edge_profile_counts (const void *a, const void *b)
+{
+  const profile_count *cnt1 = (const profile_count *) a;
+  const profile_count *cnt2 = (const profile_count *) b;
+
+  if (*cnt1 < *cnt2)
+    return 1;
+  if (*cnt1 > *cnt2)
+    return -1;
+  return 0;
+}
+
 
 /* Propagate constants, polymorphic contexts and their effects from the
    summaries interprocedurally.  */
@@ -3962,8 +3978,10 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
   if (dump_file)
     fprintf (dump_file, "\n Propagating constants:\n\n");
 
-  max_count = profile_count::uninitialized ();
+  base_count = profile_count::uninitialized ();
 
+  bool compute_count_base = false;
+  unsigned base_count_pos_percent = 0;
   FOR_EACH_DEFINED_FUNCTION (node)
   {
     if (node->has_gimple_body_p ()
@@ -3981,9 +3999,57 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
     ipa_size_summary *s = ipa_size_summaries->get (node);
     if (node->definition && !node->alias && s != NULL)
       overall_size += s->self_size;
-    max_count = max_count.max (node->count.ipa ());
+    if (node->count.ipa ().initialized_p ())
+      {
+	compute_count_base = true;
+	unsigned pos_percent = opt_for_fn (node->decl,
+					   param_ipa_cp_profile_count_base);
+	base_count_pos_percent = MAX (base_count_pos_percent, pos_percent);
+      }
   }
 
+  if (compute_count_base)
+    {
+      auto_vec<profile_count> all_edge_counts;
+      all_edge_counts.reserve_exact (symtab->edges_count);
+      FOR_EACH_DEFINED_FUNCTION (node)
+	for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
+	  {
+	    profile_count count = cs->count.ipa ();
+	    if (!(count > profile_count::zero ()))
+	      continue;
+
+	    enum availability avail;
+	    cgraph_node *tgt
+	      = cs->callee->function_or_virtual_thunk_symbol (&avail);
+	    ipa_node_params *info = ipa_node_params_sum->get (tgt);
+	    if (info && info->versionable)
+	      all_edge_counts.quick_push (count);
+	  }
+
+      if (!all_edge_counts.is_empty ())
+	{
+	  gcc_assert (base_count_pos_percent <= 100);
+	  all_edge_counts.qsort (compare_edge_profile_counts);
+
+	  unsigned base_count_pos
+	    = ((all_edge_counts.length () * (base_count_pos_percent)) / 100);
+	  base_count = all_edge_counts[base_count_pos];
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "\nSelected base_count from %u edges at "
+		       "position %u, arriving at: ", all_edge_counts.length (),
+		       base_count_pos);
+	      base_count.dump (dump_file);
+	      fprintf (dump_file, "\n");
+	    }
+	}
+      else if (dump_file)
+	fprintf (dump_file, "\nNo candidates with non-zero call count found, "
+		 "continuing as if without profile feedback.\n");
+    }
+
   orig_overall_size = overall_size;
 
   if (dump_file)
@@ -6576,7 +6642,7 @@ make_pass_ipa_cp (gcc::context *ctxt)
 void
 ipa_cp_c_finalize (void)
 {
-  max_count = profile_count::uninitialized ();
+  base_count = profile_count::uninitialized ();
   overall_size = 0;
   orig_overall_size = 0;
   ipcp_free_transformation_sum ();
diff --git a/gcc/params.opt b/gcc/params.opt
index 8d772309407..5223f784bf0 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -290,6 +290,10 @@ The size of translation unit that IPA-CP pass considers large.
 Common Joined UInteger Var(param_ipa_cp_value_list_size) Init(8) Param Optimization
 Maximum size of a list of values associated with each parameter for interprocedural constant propagation.
 
+-param=ipa-cp-profile-count-base=
+Common Joined UInteger Var(param_ipa_cp_profile_count_base) Init(10) IntegerRange(0, 100) Param Optimization
+When using profile feedback, use the edge at this percentage position in the frequency histogram as the basis for IPA-CP heuristics.
+
 -param=ipa-jump-function-lookups=
 Common Joined UInteger Var(param_ipa_jump_function_lookups) Init(8) Param Optimization
 Maximum number of statements visited during jump function offset discovery.
-- 
2.32.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 0/4] IPA-CP profile feedback handling fixes
@ 2021-08-24 11:48 Martin Jambor
  2021-08-20 17:43 ` [PATCH 1/4] cgraph: Do not warn about caller count mismatches of removed functions Martin Jambor
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Martin Jambor @ 2021-08-24 11:48 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jan Hubicka, Xionghu Luo

Hi,

this patch set addresses a number of shortcomings of IPA-CP when it
has profile feedback data at its disposal.  While at this point it is
mostly RFC material, because I expect Honza will correct many of the
places where I use a wrong profile_count method and should be using a
slightly different one, I do want to turn it into material I can
push to master rather quickly.

Most of the changes were motivated by the SPEC 2017 exchange2 benchmark,
which exposes the problems nicely: it is currently 22% slower with profile
feedback, and this patch set fixes that.  Overall, the patch set does not
have any effect on SPEC 2017 FPrate.  SPEC 2017 INTrate results, as
quickly gathered on my znver2 desktop overnight (1 run only), are:

PGO only:

  | Benchmark       | Trunk | Rate | Patch |      % | Rate |
  |-----------------+-------+------+-------+--------+------|
  | 500.perlbench_r |   236 | 6.74 |   239 |  +1.27 | 6.67 |
  | 502.gcc_r       |   160 | 8.85 |   159 |  -0.62 | 8.89 |
  | 505.mcf_r       |   227 | 7.11 |   228 |  +0.44 | 7.08 |
  | 520.omnetpp_r   |   314 | 4.18 |   311 |  -0.96 | 4.21 |
  | 523.xalancbmk_r |   195 | 5.41 |   199 |  +2.05 | 5.32 |
  | 525.x264_r      |   129 | 13.6 |   131 |  +1.55 | 13.4 |
  | 531.deepsjeng_r |   230 | 4.98 |   230 |  +0.00 | 4.98 |
  | 541.leela_r     |   353 | 4.70 |   353 |  +0.00 | 4.69 |
  | 548.exchange2_r |   249 | 10.5 |   189 | -24.10 | 13.8 |
  | 557.xz_r        |   246 | 4.39 |   248 |  +0.81 | 4.36 |
  |-----------------+-------+------+-------+--------+------|
  | Geomean         |       | 6.53 |       |        | 6.68 |

I have re-run 523.xalancbmk_r and the regression seems to be noise.

PGO+LTO:

| Benchmark       | Trunk | Rate | Patch |      % |  Rate |
|-----------------+-------+------+-------+--------+-------|
| 500.perlbench_r |   231 | 6.88 |   230 |  -0.43 |  6.93 |
| 502.gcc_r       |   149 | 9.51 |   149 |  +0.00 |  9.53 |
| 505.mcf_r       |   208 | 7.76 |   202 |  -2.88 |  7.98 |
| 520.omnetpp_r   |   282 | 4.64 |   282 |  +0.00 |  4.65 |
| 523.xalancbmk_r |   185 | 5.70 |   188 |  +1.62 |  5.63 |
| 525.x264_r      |   133 | 13.1 |   134 |  +0.75 | 13.00 |
| 531.deepsjeng_r |   190 | 6.04 |   185 |  -2.63 |  6.20 |
| 541.leela_r     |   298 | 5.56 |   298 |  +0.00 |  5.57 |
| 548.exchange2_r |   247 | 10.6 |   193 | -21.86 | 13.60 |
| 557.xz_r        |   250 | 4.32 |   251 |  +0.40 |  4.31 |
|-----------------+-------+------+-------+--------+-------|
| Geomean         |       | 6.97 |       |        |  7.18 |

I have re-run 531.deepsjeng_r and 505.mcf_r and while the former
improvement seems to be noise, the latter is consistent and even
explainable by more cloning of spec_qsort, which is the result of the
last patch and saner updates of counts of call graph edges from these
clones.

In both cases the exchange2 improvement is achieved by:

1) The second patch which makes sure that IPA-CP creates a clone for
   the first value, even though the non-recursive edge bringing the
   value is quite cold, because it enables specializing for much
   hotter contexts, and

2) the third patch which changes how values resulting from arithmetic
   jump functions on self-recursive edges are evaluated and then
   modifies the profile count of the whole resulting call graph part.

The final patch is not necessary to address the exchange2 regression.

I have bootstrapped and LTO-profile-bootstrapped and tested the whole
patch series on x86_64-linux without any issues.  As written above,
I'll be happy to address any comments/concerns so that something like
this can be pushed to master soon.

Thanks,

Martin


Martin Jambor (4):
  cgraph: Do not warn about caller count mismatches of removed functions
  ipa-cp: Propagation boost for recursion generated values
  ipa-cp: Fix updating of profile counts and self-gen value evaluation
  ipa-cp: Select saner profile count to base heuristics on

 gcc/cgraph.c   |   4 +-
 gcc/ipa-cp.c   | 786 ++++++++++++++++++++++++++++++++++++++-----------
 gcc/params.opt |   8 +
 3 files changed, 621 insertions(+), 177 deletions(-)

-- 
2.32.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/4] cgraph: Do not warn about caller count mismatches of removed functions
  2021-08-20 17:43 ` [PATCH 1/4] cgraph: Do not warn about caller count mismatches of removed functions Martin Jambor
@ 2021-09-16 15:10   ` Martin Jambor
  0 siblings, 0 replies; 16+ messages in thread
From: Martin Jambor @ 2021-09-16 15:10 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jan Hubicka

Hi,

On Fri, Aug 20 2021, Martin Jambor wrote:
> To verify other changes in the patch series, I have been searching for
> "Invalid sum of caller counts" string in symtab dump but found that
> there are false warnings about functions which have their body removed
> because they are now unreachable.  Those are of course invalid and so
> this patches avoids checking such cgraph_nodes.
>
> gcc/ChangeLog:
>
> 2021-08-20  Martin Jambor  <mjambor@suse.cz>
>
> 	* cgraph.c (cgraph_node::dump): Do not check caller count sums if
> 	the body has been removed.  Remove trailing whitespace.

I have pushed this patch as obvious but would like to ping the rest of the
series.

Thanks,

Martin


> ---
>  gcc/cgraph.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/cgraph.c b/gcc/cgraph.c
> index 8f3af003f2a..de078653781 100644
> --- a/gcc/cgraph.c
> +++ b/gcc/cgraph.c
> @@ -2236,7 +2236,7 @@ cgraph_node::dump (FILE *f)
>      }
>    fprintf (f, "\n");
>  
> -  if (count.ipa ().initialized_p ())
> +  if (!body_removed && count.ipa ().initialized_p ())
>      {
>        bool ok = true;
>        bool min = false;
> @@ -2245,7 +2245,7 @@ cgraph_node::dump (FILE *f)
>        FOR_EACH_ALIAS (this, ref)
>  	if (dyn_cast <cgraph_node *> (ref->referring)->count.initialized_p ())
>  	  sum += dyn_cast <cgraph_node *> (ref->referring)->count.ipa ();
> -  
> +
>        if (inlined_to
>  	  || (symtab->state < EXPANSION
>  	      && ultimate_alias_target () == this && only_called_directly_p ()))
> -- 
> 2.32.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on
  2021-08-23 18:49 ` [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on Martin Jambor
@ 2021-10-06 15:33   ` Jan Hubicka
  2021-10-18 17:10     ` Martin Jambor
  2021-10-27 13:20   ` Martin Jambor
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Hubicka @ 2021-10-06 15:33 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches, Xionghu Luo

> 2021-08-23  Martin Jambor  <mjambor@suse.cz>
> 
> 	* params.opt (param_ipa_cp_profile_count_base): New parameter.
> 	* ipa-cp.c (max_count): Replace with base_count, replace all
> 	occurrences too, unless otherwise stated.
> 	(ipcp_cloning_candidate_p): identify mostly-directly called
> 	functions based on their counts, not max_count.
> 	(compare_edge_profile_counts): New function.
> 	(ipcp_propagate_stage): Instead of setting max_count, find the
> 	appropriate edge count in a sorted vector of counts of eligible
> 	edges and make it the base_count.
> ---
>  gcc/ipa-cp.c   | 82 +++++++++++++++++++++++++++++++++++++++++++++-----
>  gcc/params.opt |  4 +++
>  2 files changed, 78 insertions(+), 8 deletions(-)
> 
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> index 53cca7aa804..6ab74f61e83 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -400,9 +400,9 @@ object_allocator<ipcp_value_source<tree> > ipcp_sources_pool
>  object_allocator<ipcp_agg_lattice> ipcp_agg_lattice_pool
>    ("IPA_CP aggregate lattices");
>  
> -/* Maximal count found in program.  */
> +/* Base count to use in heuristics when using profile feedback.  */
>  
> -static profile_count max_count;
> +static profile_count base_count;
>  
>  /* Original overall size of the program.  */
>  
> @@ -809,7 +809,8 @@ ipcp_cloning_candidate_p (struct cgraph_node *node)
>    /* When profile is available and function is hot, propagate into it even if
>       calls seems cold; constant propagation can improve function's speed
>       significantly.  */
> -  if (max_count > profile_count::zero ())
> +  if (stats.count_sum > profile_count::zero ()
> +      && node->count.ipa ().initialized_p ())
>      {
>        if (stats.count_sum > node->count.ipa ().apply_scale (90, 100))
>  	{
> @@ -3310,10 +3311,10 @@ good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
>  
>    ipa_node_params *info = ipa_node_params_sum->get (node);
>    int eval_threshold = opt_for_fn (node->decl, param_ipa_cp_eval_threshold);
> -  if (max_count > profile_count::zero ())
> +  if (base_count > profile_count::zero ())
>      {
>  
> -      sreal factor = count_sum.probability_in (max_count).to_sreal ();
> +      sreal factor = count_sum.probability_in (base_count).to_sreal ();

If you have part of the program built with -fprofile-use and part built
without it, this will disable cloning for functions called only from places
w/o a profile.
I think we want to count frequencies when the IPA profile is uninitialized
and then pass the cloning opportunity if either the count or the frequencies
say it is a good idea.
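
Roughly something like this, perhaps (only a sketch; the surrounding variable
names such as freq_sum are made up here, not taken from the patch):

      bool have_count = base_count > profile_count::zero ()
			&& count_sum.initialized_p ();
      sreal count_eval = 0;
      if (have_count)
	{
	  sreal factor = count_sum.probability_in (base_count).to_sreal ();
	  count_eval = (time_benefit * factor) / size_cost * 1000;
	}
      sreal freq_eval = (time_benefit * freq_sum) / size_cost * 1000;
      /* Consider the cloning opportunity good if either metric clears the
	 threshold.  */
      if ((have_count && count_eval.to_int () >= eval_threshold)
	  || freq_eval.to_int () >= eval_threshold)
	...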

>        sreal evaluation = (time_benefit * factor) / size_cost;
>        evaluation = incorporate_penalties (node, info, evaluation);
>        evaluation *= 1000;
> @@ -3950,6 +3951,21 @@ value_topo_info<valtype>::propagate_effects ()
>      }
>  }
>  
> +/* Callback for qsort to sort counts of all edges.  */
> +
> +static int
> +compare_edge_profile_counts (const void *a, const void *b)
> +{
> +  const profile_count *cnt1 = (const profile_count *) a;
> +  const profile_count *cnt2 = (const profile_count *) b;
> +
> +  if (*cnt1 < *cnt2)
> +    return 1;
> +  if (*cnt1 > *cnt2)
> +    return -1;
> +  return 0;
> +}
> +
>  
>  /* Propagate constants, polymorphic contexts and their effects from the
>     summaries interprocedurally.  */
> @@ -3962,8 +3978,10 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
>    if (dump_file)
>      fprintf (dump_file, "\n Propagating constants:\n\n");
>  
> -  max_count = profile_count::uninitialized ();
> +  base_count = profile_count::uninitialized ();
>  
> +  bool compute_count_base = false;
> +  unsigned base_count_pos_percent = 0;
>    FOR_EACH_DEFINED_FUNCTION (node)
>    {
>      if (node->has_gimple_body_p ()
> @@ -3981,9 +3999,57 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
>      ipa_size_summary *s = ipa_size_summaries->get (node);
>      if (node->definition && !node->alias && s != NULL)
>        overall_size += s->self_size;
> -    max_count = max_count.max (node->count.ipa ());
> +    if (node->count.ipa ().initialized_p ())
> +      {
> +	compute_count_base = true;
> +	unsigned pos_percent = opt_for_fn (node->decl,
> +					   param_ipa_cp_profile_count_base);
> +	base_count_pos_percent = MAX (base_count_pos_percent, pos_percent);
> +      }
>    }
>  
> +  if (compute_count_base)
> +    {
> +      auto_vec<profile_count> all_edge_counts;
> +      all_edge_counts.reserve_exact (symtab->edges_count);
> +      FOR_EACH_DEFINED_FUNCTION (node)
> +	for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
> +	  {
> +	    profile_count count = cs->count.ipa ();
> +	    if (!(count > profile_count::zero ()))
> +	      continue;
> +
> +	    enum availability avail;
> +	    cgraph_node *tgt
> +	      = cs->callee->function_or_virtual_thunk_symbol (&avail);
> +	    ipa_node_params *info = ipa_node_params_sum->get (tgt);
> +	    if (info && info->versionable)
> +	      all_edge_counts.quick_push (count);
> +	  }

I wonder how big those tables get for a whole program, but probably it is
safe since it is heap allocated and temporary.
Not sure if reserving exact is going to give a good idea of what the usual
count is - I think if the program is built w/o profile feedback we may end
up with a few functions with zero or 1 count (called once and unlikely).
> +
> +      if (!all_edge_counts.is_empty ())
> +	{
> +	  gcc_assert (base_count_pos_percent <= 100);
> +	  all_edge_counts.qsort (compare_edge_profile_counts);
> +
> +	  unsigned base_count_pos
> +	    = ((all_edge_counts.length () * (base_count_pos_percent)) / 100);
> +	  base_count = all_edge_counts[base_count_pos];

It may make more sense to sum all the counts and then do the given
percentile of function invocations, but we can see how well this fares.
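
For the record, roughly what I mean (just a sketch, untested):

	  profile_count total = profile_count::zero ();
	  for (const profile_count &c : all_edge_counts)
	    total += c;
	  /* Pick the count at the point where the hottest edges cover the
	     given percentage of all invocations rather than of all edges.  */
	  profile_count covered = profile_count::zero ();
	  unsigned base_count_pos;
	  for (base_count_pos = 0;
	       base_count_pos < all_edge_counts.length () - 1;
	       base_count_pos++)
	    {
	      covered += all_edge_counts[base_count_pos];
	      if (covered.to_gcov_type () * 100
		  >= total.to_gcov_type () * base_count_pos_percent)
		break;
	    }
	  base_count = all_edge_counts[base_count_pos];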

The patch is OK (it is an improvement over what we have) but we should get
the combined FDO/non-FDO case right.  It also matters when we lose profile
feedback, e.g. due to comdat function merging.

Honza

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] ipa-cp: Propagation boost for recursion generated values
  2021-08-20 17:43 ` [PATCH 2/4] ipa-cp: Propagation boost for recursion generated values Martin Jambor
@ 2021-10-06 15:49   ` Jan Hubicka
  2021-10-07 14:59     ` Martin Jambor
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Hubicka @ 2021-10-06 15:49 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches, Xionghu Luo

> Recursive call graph edges, even when they are hot and important for
> the compiled program, can never have frequency bigger than one, even
> when the actual time savings in the next recursion call are not
> realized just once but depend on the depth of recursion.  The current
> IPA-CP effect propagation code did not take that into account and just
> used the frequency, thus severely underestimating the effect.
> 
> This patch artificially boosts values taking part in such calls.  If a
> value feeds into itself through a recursive call, the frequency of the
> edge is multiplied by a parameter with default value of 6, basically
> assuming that the recursion will take place 6 times.  This value can
> of course be subject to change.
> 
> Moreover, values which do not feed into themselves but which were
> generated for a self-recursive call with an arithmetic
> pass-function (aka the 548.exchange "hack" which however is generally
> applicable for recursive functions which count the recursion depth in
> a parameter) have the edge frequency multiplied as many times as there
> are generated values in the chain.  In essence, we will assume they
> are all useful.
> 
> This patch partially fixes the current situation when we fail to
> optimize 548.exchange with PGO.  In the benchmark one recursive edge
> count overwhelmingly dominates all other counts in the program and so
> we fail to perform the first cloning (for the nonrecursive entry call)
> because it looks totally insignificant.
> 
> gcc/ChangeLog:
> 
> 2021-07-16  Martin Jambor  <mjambor@suse.cz>
> 
> 	* params.opt (ipa-cp-recursive-freq-factor): New.
> 	* ipa-cp.c (ipcp_value): Switch to inline initialization.  New members
> 	scc_no, self_recursion_generated_level, same_scc and
> 	self_recursion_generated_p.
> 	(ipcp_lattice::add_value): Replaced parameter unlimited with
> 	same_lat_gen_level, usit it determine limit of values and store it to
> 	the value.
> 	(ipcp_lattice<valtype>::print): Dump the new fileds.
> 	(allocate_and_init_ipcp_value): Take same_lat_gen_level as a new
> 	parameter and store it to the new value.
> 	(self_recursively_generated_p): Removed.
> 	(propagate_vals_across_arith_jfunc): Use self_recursion_generated_p
> 	instead of self_recursively_generated_p, store self generation level
> 	to such values.
> 	(value_topo_info<valtype>::add_val): Set scc_no.
> 	(value_topo_info<valtype>::propagate_effects): Multiply frequencies of
> 	recursively feeding values and self generated values by appropriate
> 	new factors.

If you boost every self-fed value by a factor of 6, I wonder how quickly
we run into an exponential explosion of the cost (since the frequency
should be close to 1 and 6^9 = 10077696...).

I think it would be more robust to simply assume that the job will
distribute evenly across the clones.  How hard is it to implement that?

Honza
> ---
>  gcc/ipa-cp.c   | 161 ++++++++++++++++++++++++-------------------------
>  gcc/params.opt |   4 ++
>  2 files changed, 84 insertions(+), 81 deletions(-)
> 
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> index 55b9216337f..b987d975793 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -184,30 +184,52 @@ public:
>    /* The actual value for the given parameter.  */
>    valtype value;
>    /* The list of sources from which this value originates.  */
> -  ipcp_value_source <valtype> *sources;
> +  ipcp_value_source <valtype> *sources = nullptr;
>    /* Next pointers in a linked list of all values in a lattice.  */
> -  ipcp_value *next;
> +  ipcp_value *next = nullptr;
>    /* Next pointers in a linked list of values in a strongly connected component
>       of values. */
> -  ipcp_value *scc_next;
> +  ipcp_value *scc_next = nullptr;
>    /* Next pointers in a linked list of SCCs of values sorted topologically
>       according their sources.  */
> -  ipcp_value  *topo_next;
> +  ipcp_value  *topo_next = nullptr;
>    /* A specialized node created for this value, NULL if none has been (so far)
>       created.  */
> -  cgraph_node *spec_node;
> +  cgraph_node *spec_node = nullptr;
>    /* Depth first search number and low link for topological sorting of
>       values.  */
> -  int dfs, low_link;
> +  int dfs = 0;
> +  int low_link = 0;
> +  /* SCC number to identify values which recursively feed into each other.
> +     Values in the same SCC have the same SCC number.  */
> +  int scc_no = 0;
> +  /* Non zero if the value is generated from another value in the same lattice
> +     for a self-recursive call, the actual number is how many times the
> +     operation has been performed.  In the unlikely event of the value being
> +     present in two chains fo self-recursive value generation chains, it is the
> +     maximum.  */
> +  unsigned self_recursion_generated_level = 0;
>    /* True if this value is currently on the topo-sort stack.  */
> -  bool on_stack;
> -
> -  ipcp_value()
> -    : sources (0), next (0), scc_next (0), topo_next (0),
> -      spec_node (0), dfs (0), low_link (0), on_stack (false) {}
> +  bool on_stack = false;
>  
>    void add_source (cgraph_edge *cs, ipcp_value *src_val, int src_idx,
>  		   HOST_WIDE_INT offset);
> +
> +  /* Return true if both THIS value and O feed into each other.  */
> +
> +  bool same_scc (const ipcp_value<valtype> *o)
> +  {
> +    return o->scc_no == scc_no;
> +  }
> +
> +/* Return true, if a this value has been generated for a self-recursive call as
> +   a result of an arithmetic pass-through jump-function acting on a value in
> +   the same lattice function.  */
> +
> +  bool self_recursion_generated_p ()
> +  {
> +    return self_recursion_generated_level > 0;
> +  }
>  };
>  
>  /* Lattice describing potential values of a formal parameter of a function, or
> @@ -239,7 +261,7 @@ public:
>  		  ipcp_value<valtype> *src_val = NULL,
>  		  int src_idx = 0, HOST_WIDE_INT offset = -1,
>  		  ipcp_value<valtype> **val_p = NULL,
> -		  bool unlimited = false);
> +		  unsigned same_lat_gen_level = 0);
>    void print (FILE * f, bool dump_sources, bool dump_benefits);
>  };
>  
> @@ -498,7 +520,11 @@ ipcp_lattice<valtype>::print (FILE * f, bool dump_sources, bool dump_benefits)
>  	{
>  	  ipcp_value_source<valtype> *s;
>  
> -	  fprintf (f, " [from:");
> +	  if (val->self_recursion_generated_p ())
> +	    fprintf (f, " [self_gen(%i), from:",
> +		     val->self_recursion_generated_level);
> +	  else
> +	    fprintf (f, " [scc: %i, from:", val->scc_no);
>  	  for (s = val->sources; s; s = s->next)
>  	    fprintf (f, " %i(%f)", s->cs->caller->order,
>  		     s->cs->sreal_frequency ().to_double ());
> @@ -1837,12 +1863,13 @@ ipcp_value<valtype>::add_source (cgraph_edge *cs, ipcp_value *src_val,
>     SOURCE and clear all other fields.  */
>  
>  static ipcp_value<tree> *
> -allocate_and_init_ipcp_value (tree source)
> +allocate_and_init_ipcp_value (tree cst, unsigned same_lat_gen_level)
>  {
>    ipcp_value<tree> *val;
>  
>    val = new (ipcp_cst_values_pool.allocate ()) ipcp_value<tree>();
> -  val->value = source;
> +  val->value = cst;
> +  val->self_recursion_generated_level = same_lat_gen_level;
>    return val;
>  }
>  
> @@ -1850,14 +1877,15 @@ allocate_and_init_ipcp_value (tree source)
>     value to SOURCE and clear all other fields.  */
>  
>  static ipcp_value<ipa_polymorphic_call_context> *
> -allocate_and_init_ipcp_value (ipa_polymorphic_call_context source)
> +allocate_and_init_ipcp_value (ipa_polymorphic_call_context ctx,
> +			      unsigned same_lat_gen_level)
>  {
>    ipcp_value<ipa_polymorphic_call_context> *val;
>  
> -  // TODO
>    val = new (ipcp_poly_ctx_values_pool.allocate ())
>      ipcp_value<ipa_polymorphic_call_context>();
> -  val->value = source;
> +  val->value = ctx;
> +  val->self_recursion_generated_level = same_lat_gen_level;
>    return val;
>  }
>  
> @@ -1865,8 +1893,12 @@ allocate_and_init_ipcp_value (ipa_polymorphic_call_context source)
>     SRC_VAL SRC_INDEX and OFFSET are meant for add_source and have the same
>     meaning.  OFFSET -1 means the source is scalar and not a part of an
>     aggregate.  If non-NULL, VAL_P records address of existing or newly added
> -   ipcp_value.  UNLIMITED means whether value count should not exceed the limit
> -   given by PARAM_IPA_CP_VALUE_LIST_SIZE.  */
> +   ipcp_value.
> +
> +   If the value is generated for a self-recursive call as a result of an
> +   arithmetic pass-through jump-function acting on a value in the same lattice,
> +   SAME_LAT_GEN_LEVEL must be the length of such chain, otherwise it must be
> +   zero.  If it is non-zero, PARAM_IPA_CP_VALUE_LIST_SIZE limit is ignored.  */
>  
>  template <typename valtype>
>  bool
> @@ -1874,7 +1906,7 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
>  				  ipcp_value<valtype> *src_val,
>  				  int src_idx, HOST_WIDE_INT offset,
>  				  ipcp_value<valtype> **val_p,
> -				  bool unlimited)
> +				  unsigned same_lat_gen_level)
>  {
>    ipcp_value<valtype> *val, *last_val = NULL;
>  
> @@ -1890,6 +1922,9 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
>  	if (val_p)
>  	  *val_p = val;
>  
> +	if (val->self_recursion_generated_level < same_lat_gen_level)
> +	  val->self_recursion_generated_level = same_lat_gen_level;
> +
>  	if (ipa_edge_within_scc (cs))
>  	  {
>  	    ipcp_value_source<valtype> *s;
> @@ -1904,7 +1939,7 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
>  	return false;
>        }
>  
> -  if (!unlimited && values_count == opt_for_fn (cs->caller->decl,
> +  if (!same_lat_gen_level && values_count == opt_for_fn (cs->caller->decl,
>  						param_ipa_cp_value_list_size))
>      {
>        /* We can only free sources, not the values themselves, because sources
> @@ -1923,7 +1958,7 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
>      }
>  
>    values_count++;
> -  val = allocate_and_init_ipcp_value (newval);
> +  val = allocate_and_init_ipcp_value (newval, same_lat_gen_level);
>    val->add_source (cs, src_val, src_idx, offset);
>    val->next = NULL;
>  
> @@ -1940,60 +1975,6 @@ ipcp_lattice<valtype>::add_value (valtype newval, cgraph_edge *cs,
>    return true;
>  }
>  
> -/* Return true, if a ipcp_value VAL is orginated from parameter value of
> -   self-feeding recursive function via some kind of pass-through jump
> -   function.  */
> -
> -static bool
> -self_recursively_generated_p (ipcp_value<tree> *val)
> -{
> -  class ipa_node_params *info = NULL;
> -
> -  for (ipcp_value_source<tree> *src = val->sources; src; src = src->next)
> -    {
> -      cgraph_edge *cs = src->cs;
> -
> -      if (!src->val || cs->caller != cs->callee->function_symbol ())
> -	return false;
> -
> -      if (src->val == val)
> -	continue;
> -
> -      if (!info)
> -	info = ipa_node_params_sum->get (cs->caller);
> -
> -      class ipcp_param_lattices *plats = ipa_get_parm_lattices (info,
> -								src->index);
> -      ipcp_lattice<tree> *src_lat;
> -      ipcp_value<tree> *src_val;
> -
> -      if (src->offset == -1)
> -	src_lat = &plats->itself;
> -      else
> -	{
> -	  struct ipcp_agg_lattice *src_aglat;
> -
> -	  for (src_aglat = plats->aggs; src_aglat; src_aglat = src_aglat->next)
> -	    if (src_aglat->offset == src->offset)
> -	      break;
> -
> -	  if (!src_aglat)
> -	    return false;
> -
> -	  src_lat = src_aglat;
> -	}
> -
> -      for (src_val = src_lat->values; src_val; src_val = src_val->next)
> -	if (src_val == val)
> -	  break;
> -
> -      if (!src_val)
> -	return false;
> -    }
> -
> -  return true;
> -}
> -
>  /* A helper function that returns result of operation specified by OPCODE on
>     the value of SRC_VAL.  If non-NULL, OPND1_TYPE is expected type for the
>     value of SRC_VAL.  If the operation is binary, OPND2 is a constant value
> @@ -2068,7 +2049,7 @@ propagate_vals_across_arith_jfunc (cgraph_edge *cs,
>  	     source, this is absolutely conservative, but could avoid explosion
>  	     of lattice's value space, especially when one recursive function
>  	     calls another recursive.  */
> -	  if (self_recursively_generated_p (src_val))
> +	  if (src_val->self_recursion_generated_p ())
>  	    {
>  	      ipcp_value_source<tree> *s;
>  
> @@ -2096,7 +2077,7 @@ propagate_vals_across_arith_jfunc (cgraph_edge *cs,
>  		break;
>  
>  	      ret |= dest_lat->add_value (cstval, cs, src_val, src_idx,
> -					  src_offset, &src_val, true);
> +					  src_offset, &src_val, j);
>  	      gcc_checking_assert (src_val);
>  	    }
>  	}
> @@ -2108,7 +2089,7 @@ propagate_vals_across_arith_jfunc (cgraph_edge *cs,
>  	/* Now we do not use self-recursively generated value as propagation
>  	   source, otherwise it is easy to make value space of normal lattice
>  	   overflow.  */
> -	if (self_recursively_generated_p (src_val))
> +	if (src_val->self_recursion_generated_p ())
>  	  {
>  	    ret |= dest_lat->set_contains_variable ();
>  	    continue;
> @@ -3732,6 +3713,7 @@ value_topo_info<valtype>::add_val (ipcp_value<valtype> *cur_val)
>  	  v = stack;
>  	  stack = v->topo_next;
>  	  v->on_stack = false;
> +	  v->scc_no = cur_val->dfs;
>  
>  	  v->scc_next = scc_list;
>  	  scc_list = v;
> @@ -3905,8 +3887,25 @@ value_topo_info<valtype>::propagate_effects ()
>  		    else
>  		      continue;
>  		  }
> +
> +		int special_factor = 1;
> +		if (val->same_scc (src->val))
> +		  special_factor
> +		    = opt_for_fn(src->cs->caller->decl,
> +				 param_ipa_cp_recursive_freq_factor);
> +		else if (val->self_recursion_generated_p ()
> +			 && (src->cs->callee->function_symbol ()
> +			     == src->cs->caller))
> +		  {
> +		    int max_recur_gen_depth
> +		      = opt_for_fn(src->cs->caller->decl,
> +				   param_ipa_cp_max_recursive_depth);
> +		    special_factor = max_recur_gen_depth
> +		      - val->self_recursion_generated_level + 1;
> +		  }
> +
>  		src->val->prop_time_benefit
> -		  += time * src->cs->sreal_frequency ();
> +		  += time * special_factor * src->cs->sreal_frequency ();
>  	      }
>  
>  	  if (size < INT_MAX)
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 92b003e38cb..8d772309407 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -266,6 +266,10 @@ Maximum depth of recursive cloning for self-recursive function.
>  Common Joined UInteger Var(param_ipa_cp_min_recursive_probability) Init(2) Param Optimization
>  Recursive cloning only when the probability of call being executed exceeds the parameter.
>  
> +-param=ipa-cp-recursive-freq-factor=
> +Common Joined UInteger Var(param_ipa_cp_recursive_freq_factor) Init(6) Param Optimization
> +When propagating IPA-CP effect estimates, multiply frequencies of recursive edges that that bring back an unchanged value by this factor.
> +
>  -param=ipa-cp-recursion-penalty=
>  Common Joined UInteger Var(param_ipa_cp_recursion_penalty) Init(40) IntegerRange(0, 100) Param Optimization
>  Percentage penalty the recursive functions will receive when they are evaluated for cloning.
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] ipa-cp: Propagation boost for recursion generated values
  2021-10-06 15:49   ` Jan Hubicka
@ 2021-10-07 14:59     ` Martin Jambor
  2021-10-07 15:25       ` Jan Hubicka
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Jambor @ 2021-10-07 14:59 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Xionghu Luo

Hi,

On Wed, Oct 06 2021, Jan Hubicka wrote:
>> Recursive call graph edges, even when they are hot and important for
>> the compiled program, can never have frequency bigger than one, even
>> when the actual time savings in the next recursion call are not
>> realized just once but depend on the depth of recursion.  The current
>> IPA-CP effect propagation code did not take that into account and just
>> used the frequency, thus severely underestimating the effect.
>> 
>> This patch artificially boosts values taking part in such calls.  If a
>> value feeds into itself through a recursive call, the frequency of the
>> edge is multiplied by a parameter with default value of 6, basically
>> assuming that the recursion will take place 6 times.  This value can
>> of course be subject to change.
>> 
>> Moreover, values which do not feed into themselves but which were
>> generated for a self-recursive call with an arithmetic
>> pass-function (aka the 548.exchange "hack" which however is generally
>> applicable for recursive functions which count the recursion depth in
>> a parameter) have the edge frequency multiplied as many times as there
>> are generated values in the chain.  In essence, we will assume they
>> are all useful.
>> 
>> This patch partially fixes the current situation when we fail to
>> optimize 548.exchange with PGO.  In the benchmark one recursive edge
>> count overwhelmingly dominates all other counts in the program and so
>> we fail to perform the first cloning (for the nonrecursive entry call)
>> because it looks totally insignificant.
>> 
>> gcc/ChangeLog:
>> 
>> 2021-07-16  Martin Jambor  <mjambor@suse.cz>
>> 
>> 	* params.opt (ipa-cp-recursive-freq-factor): New.
>> 	* ipa-cp.c (ipcp_value): Switch to inline initialization.  New members
>> 	scc_no, self_recursion_generated_level, same_scc and
>> 	self_recursion_generated_p.
>> 	(ipcp_lattice::add_value): Replaced parameter unlimited with
>> 	same_lat_gen_level, usit it determine limit of values and store it to
>> 	the value.
>> 	(ipcp_lattice<valtype>::print): Dump the new fileds.
>> 	(allocate_and_init_ipcp_value): Take same_lat_gen_level as a new
>> 	parameter and store it to the new value.
>> 	(self_recursively_generated_p): Removed.
>> 	(propagate_vals_across_arith_jfunc): Use self_recursion_generated_p
>> 	instead of self_recursively_generated_p, store self generation level
>> 	to such values.
>> 	(value_topo_info<valtype>::add_val): Set scc_no.
>> 	(value_topo_info<valtype>::propagate_effects): Multiply frequencies of
>> 	recursively feeding values and self generated values by appropriate
>> 	new factors.
>
> If you boost every self fed value by factor of 6, I wonder how quickly
> we run into exponential explosion of the cost (since the frequency
> should be close to 1 and 6^9=10077696....

The factor of six is applied once for an entire SCC, so we'd reach this
huge number only if there was a chain of nine different recursive
functions - with this patch we assume each one will recurse six times,
so the result is indeed a huge execution count estimate.

The constant is not used for the "self generated" values like those in
exchange; those are handled by the else branch below.  For those we
expect the recursion to happen 8 times, because that is how many values we
generate, but the boost is different depending on the recursion depth.

>
> I think it would be more robust to simply assume that the job will
>distribute evenly across the clones.  How hard is to implement that?

This is not an update of counters.  The code tries to estimate the execution
time improvement that will be possible in callees if we clone for
this particular value and so is based on call graph edge frequencies (so
that if in a callee we can save 5 units of time and the frequency is 5,
we estimate we will save 25).  The code has the advantage that it is
universal for both situations, when profile feedback is and is not
available.
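
To put numbers on it: with the default ipa-cp-recursive-freq-factor of 6, a
self-recursive edge with frequency close to 1 and a callee in which the value
saves 5 units of time, the propagation step does roughly

  prop_time_benefit += 5 (time) * 6 (factor) * 1 (frequency) = 30

instead of the 5 units the raw edge frequency alone would suggest.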

Martin


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] ipa-cp: Propagation boost for recursion generated values
  2021-10-07 14:59     ` Martin Jambor
@ 2021-10-07 15:25       ` Jan Hubicka
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Hubicka @ 2021-10-07 15:25 UTC (permalink / raw)
  To: Martin Jambor; +Cc: Xionghu Luo, GCC Patches

> Hi,
> >
> > If you boost every self fed value by factor of 6, I wonder how quickly
> > we run into exponential explosion of the cost (since the frequency
> > should be close to 1 and 6^9=10077696....
> 
> The factor of six is applied once for an entire SCC, so we'd reach this
> huge number only if there was a chain of nine different recursive
> functions - with this patch we assume each one will recurse six times,
> so the result is indeed a huge execution count estimate.
> 
> The constant is not used for the "self generated" values like those in
> exchange, those are handled by the else branch below.  For those we
> expect the recursion happens 8 times, because that is how many values we
> generate, but the boost is different depending on the recursion depth.
> 
> >
> > I think it would be more robust to simply assume that the job will
> >distribute evenly across the clones.  How hard is to implement that?
> 
> This is not an update of counters.  The code tries to estimate execution
> time improvement that is will be possible in callees if we clone for
> this particular value and so is based on call graph edge frequencies (so
> that if in a callee we can save 5 units of time and the frequency is 5,
> we estimate we will save 25).  The code has the advantage that it is
> universal for both situations when profile feedback is and is not
> available.

I guess the patch is OK then.

Thanks,
Honza
> 
> Martin
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation
  2021-08-20 17:43 ` [PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation Martin Jambor
@ 2021-10-08 11:31   ` Jan Hubicka
  2021-10-18 16:56     ` Martin Jambor
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Hubicka @ 2021-10-08 11:31 UTC (permalink / raw)
  To: Martin Jambor; +Cc: GCC Patches, Xionghu Luo

> For non-local nodes which can have unknown callers, the algorithm just
> takes half of the counts - we may decide that taking just a third or
> some other portion is more reasonable, but I do not think we can
> attempt anything more clever.

Can't you just sum the counts of the calling edges and subtract that from
the callee's count?
> 2021-08-23  Martin Jambor  <mjambor@suse.cz>
> 
> 	* ipa-cp.c (struct caller_statistics): New fields rec_count_sum,
> 	n_nonrec_calls and itself, document all fields.
> 	(init_caller_stats): Initialize the above new fields.
> 	(gather_caller_stats): Gather self-recursive counts and calls number.
> 	(get_info_about_necessary_edges): Gather counts of self-recursive and
> 	other edges bringing in the requested value separately.
> 	(dump_profile_updates): Rework to dump info about a single node only.
> 	(lenient_count_portion_handling): New function.
> 	(struct gather_other_count_struct): New type.
> 	(gather_count_of_non_rec_edges): New function.
> 	(struct desc_incoming_count_struct): New type.
> 	(analyze_clone_icoming_counts): New function.
> 	(adjust_clone_incoming_counts): Likewise.
> 	(update_counts_for_self_gen_clones): Likewise.
> 	(update_profiling_info): Rewritten.
> 	(update_specialized_profile): Adjust call to dump_profile_updates.
> 	(create_specialized_node): Do not update profiling info.
> 	(decide_about_value): New parameter self_gen_clones, either push new
> 	clones into it or updat their profile counts.  For self-recursively
> 	generated values, use a portion of the node count instead of count
> 	from self-recursive edges to estimate goodness.
> 	(decide_whether_version_node): Gather clones for self-generated values
> 	in a new vector, update their profiles at once at the end.
> ---
>  gcc/ipa-cp.c | 543 +++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 457 insertions(+), 86 deletions(-)
> 
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> index b987d975793..53cca7aa804 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -701,20 +701,36 @@ ipcp_versionable_function_p (struct cgraph_node *node)
>  
>  struct caller_statistics
>  {
> +  /* If requested (see below), self-recursive call counts are summed into this
> +     field.  */
> +  profile_count rec_count_sum;
> +  /* The sum of all ipa counts of all the other (non-recursive) calls.  */
>    profile_count count_sum;
> +  /* Sum of all frequencies for all calls.  */
>    sreal freq_sum;
> +  /* Number of calls and hot calls respectively.  */
>    int n_calls, n_hot_calls;
> +  /* If itself is set up, also count the number of non-self-recursive
> +     calls.  */
> +  int n_nonrec_calls;
> +  /* If non-NULL, this is the node itself and calls from it should have their
> +     counts included in rec_count_sum and not count_sum.  */
> +  cgraph_node *itself;
>  };
>  
> +/* With partial train run we do not want to assume that original's count is
> +   zero whenever we redurect all executed edges to clone.  Simply drop profile
> +   to local one in this case.  In eany case, return the new value.  ORIG_NODE
> +   is the original node and its count has not been updaed yet.  */
> +
> +profile_count
> +lenient_count_portion_handling (profile_count remainder, cgraph_node *orig_node)
> +{
> +  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
> +      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
> +      && opt_for_fn (orig_node->decl, flag_profile_partial_training))
> +    remainder = remainder.guessed_local ();

I do not think you need the partial training flag here.  You should see that
the IPA profile is missing simply by testing the ipa_p predicate on the
relevant counts.
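I.e. keep the ipa_p/nonzero_p tests and simply drop the
flag_profile_partial_training condition, roughly:

  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ())
    remainder = remainder.guessed_local ();
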
> +
> +/* If caller edge counts of a clone created for a self-recursive arithmetic jump
> +   function must be adjusted, do so. NODE is the node or its thunk.  */

I would add a comment on why it needs to be adjusted and how.
> +
> +static void
> +adjust_clone_incoming_counts (cgraph_node *node,
> +			      desc_incoming_count_struct *desc)
> +{
> +  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
> +    if (cs->caller->thunk)
> +      {
> +	adjust_clone_incoming_counts (cs->caller, desc);
> +	profile_count sum = profile_count::zero ();
> +	for (cgraph_edge *e = cs->caller->callers; e; e = e->next_caller)
> +	  if (e->count.initialized_p ())
> +	    sum += e->count.ipa ();
> +	cs->count = cs->count.combine_with_ipa_count (sum);
> +      }
> +    else if (!desc->processed_edges->contains (cs)
> +	     && cs->caller->clone_of == desc->orig)
> +      {
> +	cs->count += desc->count;
> +	if (dump_file)
> +	  {
> +	    fprintf (dump_file, "       Adjusted count of an incoming edge of "
> +		     "a clone %s -> %s to ", cs->caller->dump_name (),
> +		     cs->callee->dump_name ());
> +	    cs->count.dump (dump_file);
> +	    fprintf (dump_file, "\n");
> +	  }
> +      }
> +}

Otherwise the patch looks OK.

Honza

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation
  2021-10-08 11:31   ` Jan Hubicka
@ 2021-10-18 16:56     ` Martin Jambor
  2021-10-27 13:18       ` Martin Jambor
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Jambor @ 2021-10-18 16:56 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Xionghu Luo

Hi,

On Fri, Oct 08 2021, Jan Hubicka wrote:
>> For non-local nodes which can have unknown callers, the algorithm just
>> takes half of the counts - we may decide that taking just a third or
>> some other portion is more reasonable, but I do not think we can
>> attempt anything more clever.
>
> Can't you just sum the calling edges and subtract it from callee's
> count?

I guess I can.  The patch below changes the handling of this scenario; it now
behaves as if there was another hidden caller with the count of calls
calculated as you suggested.
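
In pseudo-code, the idea is roughly the following (a sketch only; the names
are approximate, not the exact hunk from the patch below):

  /* Whatever the known callers do not bring in is treated as coming from an
     unknown, hidden caller, and that portion stays with the original node.  */
  profile_count hidden = orig_node->count.ipa ();
  hidden -= stats.rec_count_sum + stats.count_sum;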

[...]

>> +/* With partial train run we do not want to assume that original's count is
>> +   zero whenever we redurect all executed edges to clone.  Simply drop profile
>> +   to local one in this case.  In eany case, return the new value.  ORIG_NODE
>> +   is the original node and its count has not been updaed yet.  */
>> +
>> +profile_count
>> +lenient_count_portion_handling (profile_count remainder, cgraph_node *orig_node)
>> +{
>> +  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
>> +      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
>> +      && opt_for_fn (orig_node->decl, flag_profile_partial_training))
>> +    remainder = remainder.guessed_local ();
>
> I do not think you need partial training flag here.  You should see IPA
> profile is mising by simply testing ipa_p predicate on relevant counts.

I can take that test out, that is not a problem, but I have not done
that yet because I am still wondering whether it is the right thing to
do.  The code attempts to do the same thing that is currently in
update_profiling_info, which you added and which does test the flag.

If I understand the snippet I'm replacing correctly, the code is
supposed to make sure that, when a clone "steals" all counts from the
original node, the original node is not left with an "adjusted" zero count
but with a "locally guessed" zero count when the partial training flag
is provided, which should not make it cold but rather switch back
to reasoning about it as if we did not have a profile at all.

That is why I kept the flag check there, but if you really want me to remove
it, I'll do that.

>> +
>> +/* If caller edge counts of a clone created for a self-recursive arithmetic jump
>> +   function must be adjusted, do so. NODE is the node or its thunk.  */
>
> I would add comment on why it needs to be adjusted and how.

Done.  The adjusted patch which I am testing (but which has already
passed an LTO profiled bootstrap and testing) is below.  Let me know what
you think.

Thanks,

Martin


IPA-CP does not do a reasonable job when it is updating profile counts
after it has created clones of recursive functions.  This patch
addresses that by:

1. Only updating counts for special-context clones.  When a clone is
created for all contexts, the original is going to be dead and the
cgraph machinery has copied counts to the new node which is the right
thing to do.  Therefore updating counts has been moved from
create_specialized_node to decide_about_value and
decide_whether_version_node.

2. The current profile updating code artificially increased the assumed
old count when the sum of counts of incoming edges to both the
original and new node were bigger than the count of the original
node.  This always happened when the self-recursive edge from the clone
was also redirected to the clone because both the original edge and
its clone had high original counts.  This crutch was removed and
replaced by the next point.

3. When cloning also redirects a self-recursive clone to the clone
itself, new logic has been added to divide the counts brought by such
recursive edges between the original node and the clone.  This is
impossible to do well without special knowledge about the function and
which non-recursive entry calls are responsible for what portion of
recursion depth, so the approach taken is rather crude.

For local nodes, we detect the case when the original node is never
called (in the training run at least) with another value and if so,
steal all its counts as if it was dead.  If that is not the case, we
try to divide the count brought by recursive edges (or rather not
brought by direct edges) proportionally to the counts brought by
non-recursive edges - but with artificial limits in place so that we
do not take too much or too little, because that was happening with
a detrimental effect in mcf_r.

4. When cloning creates extra clones for values brought by a formerly
self-recursive edge with an arithmetic pass-through jump function on
it, such as it does in exchange2_r, all such clones are processed at
once rather than one after another.  The counts of all such nodes are
distributed evenly (modulo even-formerly-non-recursive-edges) and the
whole situation is then fixed up so that the edge counts fit.  This is
what new function update_counts_for_self_gen_clones does.

5. Values brought by a formerly self-recursive edge with an
arithmetic pass-through jump function on it are evaluated by
heuristics which assume that the vast majority of node counts are the
result of recursive calls, and so we simply divide those counts by the
number of clones there would be if we created another one.
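
(For illustration only: if the node count is 900 and the chain of
self-generated values will result in 9 such clones, each candidate value is
evaluated as if it received a count of 900 / 9 = 100.)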

6. The mechanisms in init_caller_stats, gather_caller_stats and
get_info_about_necessary_edges were enhanced to gather the data required
for the above, and a missing check not to count dead incoming edges was
also added.

gcc/ChangeLog:

2021-10-15  Martin Jambor  <mjambor@suse.cz>

	* ipa-cp.c (struct caller_statistics): New fields rec_count_sum,
	n_nonrec_calls and itself, document all fields.
	(init_caller_stats): Initialize the above new fields.
	(gather_caller_stats): Gather self-recursive counts and calls number.
	(get_info_about_necessary_edges): Gather counts of self-recursive and
	other edges bringing in the requested value separately.
	(dump_profile_updates): Rework to dump info about a single node only.
	(lenient_count_portion_handling): New function.
	(struct gather_other_count_struct): New type.
	(gather_count_of_non_rec_edges): New function.
	(struct desc_incoming_count_struct): New type.
	(analyze_clone_icoming_counts): New function.
	(adjust_clone_incoming_counts): Likewise.
	(update_counts_for_self_gen_clones): Likewise.
	(update_profiling_info): Rewritten.
	(update_specialized_profile): Adjust call to dump_profile_updates.
	(create_specialized_node): Do not update profiling info.
	(decide_about_value): New parameter self_gen_clones, either push new
	clones into it or update their profile counts.  For self-recursively
	generated values, use a portion of the node count instead of count
	from self-recursive edges to estimate goodness.
	(decide_whether_version_node): Gather clones for self-generated values
	in a new vector, update their profiles at once at the end.
---
 gcc/ipa-cp.c | 534 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 448 insertions(+), 86 deletions(-)

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index b987d975793..b254b9b9de6 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -701,20 +701,36 @@ ipcp_versionable_function_p (struct cgraph_node *node)
 
 struct caller_statistics
 {
+  /* If requested (see below), self-recursive call counts are summed into this
+     field.  */
+  profile_count rec_count_sum;
+  /* The sum of all ipa counts of all the other (non-recursive) calls.  */
   profile_count count_sum;
+  /* Sum of all frequencies for all calls.  */
   sreal freq_sum;
+  /* Number of calls and hot calls respectively.  */
   int n_calls, n_hot_calls;
+  /* If itself is set up, also count the number of non-self-recursive
+     calls.  */
+  int n_nonrec_calls;
+  /* If non-NULL, this is the node itself and calls from it should have their
+     counts included in rec_count_sum and not count_sum.  */
+  cgraph_node *itself;
 };
 
-/* Initialize fields of STAT to zeroes.  */
+/* Initialize fields of STAT to zeroes and optionally set it up so that
+   self-recursive calls from ITSELF are summed separately.  */
 
 static inline void
-init_caller_stats (struct caller_statistics *stats)
+init_caller_stats (caller_statistics *stats, cgraph_node *itself = NULL)
 {
+  stats->rec_count_sum = profile_count::zero ();
   stats->count_sum = profile_count::zero ();
   stats->n_calls = 0;
   stats->n_hot_calls = 0;
+  stats->n_nonrec_calls = 0;
   stats->freq_sum = 0;
+  stats->itself = itself;
 }
 
 /* Worker callback of cgraph_for_node_and_aliases accumulating statistics of
@@ -729,10 +745,22 @@ gather_caller_stats (struct cgraph_node *node, void *data)
   for (cs = node->callers; cs; cs = cs->next_caller)
     if (!cs->caller->thunk)
       {
-        if (cs->count.ipa ().initialized_p ())
-	  stats->count_sum += cs->count.ipa ();
+	ipa_node_params *info = ipa_node_params_sum->get (cs->caller);
+	if (info && info->node_dead)
+	  continue;
+
+	if (cs->count.ipa ().initialized_p ())
+	  {
+	    if (stats->itself && stats->itself == cs->caller)
+	      stats->rec_count_sum += cs->count.ipa ();
+	    else
+	      stats->count_sum += cs->count.ipa ();
+	  }
 	stats->freq_sum += cs->sreal_frequency ();
 	stats->n_calls++;
+	if (stats->itself && stats->itself != cs->caller)
+	  stats->n_nonrec_calls++;
+
 	if (cs->maybe_hot_p ())
 	  stats->n_hot_calls ++;
       }
@@ -4202,19 +4230,22 @@ get_next_cgraph_edge_clone (struct cgraph_edge *cs)
 
 /* Given VAL that is intended for DEST, iterate over all its sources and if any
    of them is viable and hot, return true.  In that case, for those that still
-   hold, add their edge frequency and their number into *FREQUENCY and
-   *CALLER_COUNT respectively.  */
+   hold, add their edge frequency and their number and cumulative profile
+   counts of self-recursive and other edges into *FREQ_SUM, *CALLER_COUNT,
+   *REC_COUNT_SUM and *NONREC_COUNT_SUM respectively.  */
 
 template <typename valtype>
 static bool
 get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
-				sreal *freq_sum, profile_count *count_sum,
-				int *caller_count)
+				sreal *freq_sum, int *caller_count,
+				profile_count *rec_count_sum,
+				profile_count *nonrec_count_sum)
 {
   ipcp_value_source<valtype> *src;
   sreal freq = 0;
   int count = 0;
-  profile_count cnt = profile_count::zero ();
+  profile_count rec_cnt = profile_count::zero ();
+  profile_count nonrec_cnt = profile_count::zero ();
   bool hot = false;
   bool non_self_recursive = false;
 
@@ -4227,11 +4258,15 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
 	    {
 	      count++;
 	      freq += cs->sreal_frequency ();
-	      if (cs->count.ipa ().initialized_p ())
-	        cnt += cs->count.ipa ();
 	      hot |= cs->maybe_hot_p ();
 	      if (cs->caller != dest)
-		non_self_recursive = true;
+		{
+		  non_self_recursive = true;
+		  if (cs->count.ipa ().initialized_p ())
+		    nonrec_cnt += cs->count.ipa ();
+		}
+	      else if (cs->count.ipa ().initialized_p ())
+	        rec_cnt += cs->count.ipa ();
 	    }
 	  cs = get_next_cgraph_edge_clone (cs);
 	}
@@ -4243,8 +4278,9 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
     return false;
 
   *freq_sum = freq;
-  *count_sum = cnt;
   *caller_count = count;
+  *rec_count_sum = rec_cnt;
+  *nonrec_count_sum = nonrec_cnt;
 
   if (!hot && ipa_node_params_sum->get (dest)->node_within_scc)
     {
@@ -4349,112 +4385,399 @@ get_replacement_map (class ipa_node_params *info, tree value, int parm_num,
   return replace_map;
 }
 
-/* Dump new profiling counts */
+/* Dump new profiling counts of NODE.  SPEC is true when NODE is a specialized
+   one, otherwise it will be referred to as the original node.  */
 
 static void
-dump_profile_updates (struct cgraph_node *orig_node,
-		      struct cgraph_node *new_node)
+dump_profile_updates (cgraph_node *node, bool spec)
 {
-  struct cgraph_edge *cs;
+  if (spec)
+    fprintf (dump_file, "     setting count of the specialized node %s to ",
+	     node->dump_name ());
+  else
+    fprintf (dump_file, "     setting count of the original node %s to ",
+	     node->dump_name ());
 
-  fprintf (dump_file, "    setting count of the specialized node to ");
-  new_node->count.dump (dump_file);
+  node->count.dump (dump_file);
   fprintf (dump_file, "\n");
-  for (cs = new_node->callees; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
     {
-      fprintf (dump_file, "      edge to %s has count ",
-	       cs->callee->dump_name ());
-      cs->count.dump (dump_file);
-      fprintf (dump_file, "\n");
-    }
-
-  fprintf (dump_file, "    setting count of the original node to ");
-  orig_node->count.dump (dump_file);
-  fprintf (dump_file, "\n");
-  for (cs = orig_node->callees; cs; cs = cs->next_callee)
-    {
-      fprintf (dump_file, "      edge to %s is left with ",
+      fprintf (dump_file, "       edge to %s has count ",
 	       cs->callee->dump_name ());
       cs->count.dump (dump_file);
       fprintf (dump_file, "\n");
     }
 }
 
+/* With a partial training run we do not want to assume that the original's
+   count is zero whenever we redirect all executed edges to the clone.  Simply
+   drop the profile to a local one in this case.  In any case, return the new
+   value.  ORIG_NODE is the original node; its count has not been updated yet.  */
+
+profile_count
+lenient_count_portion_handling (profile_count remainder, cgraph_node *orig_node)
+{
+  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
+      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
+      && opt_for_fn (orig_node->decl, flag_profile_partial_training))
+    remainder = remainder.guessed_local ();
+
+  return remainder;
+}
+
+/* Structure to sum counts coming from nodes other than the original node and
+   its clones.  */
+
+struct gather_other_count_struct
+{
+  cgraph_node *orig;
+  profile_count other_count;
+};
+
+/* Worker callback of call_for_symbol_thunks_and_aliases summing the counts
+   that come from non-self-recursive calls.  */
+
+static bool
+gather_count_of_non_rec_edges (cgraph_node *node, void *data)
+{
+  gather_other_count_struct *desc = (gather_other_count_struct *) data;
+  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
+    if (cs->caller != desc->orig && cs->caller->clone_of != desc->orig)
+      desc->other_count += cs->count.ipa ();
+  return false;
+}
+
+/* Structure to help analyze if we need to boost counts of some clones of some
+   non-recursive edges to match the new callee count.  */
+
+struct desc_incoming_count_struct
+{
+  cgraph_node *orig;
+  hash_set <cgraph_edge *> *processed_edges;
+  profile_count count;
+  unsigned unproc_orig_rec_edges;
+};
+
+/* Go over edges calling NODE and its thunks and gather information about
+   incoming counts so that we know if we need to make any adjustments.  */
+
+static void
+analyze_clone_icoming_counts (cgraph_node *node,
+			      desc_incoming_count_struct *desc)
+{
+  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
+    if (cs->caller->thunk)
+      {
+	analyze_clone_icoming_counts (cs->caller, desc);
+	continue;
+      }
+    else
+      {
+	if (cs->count.initialized_p ())
+	  desc->count += cs->count.ipa ();
+	if (!desc->processed_edges->contains (cs)
+	    && cs->caller->clone_of == desc->orig)
+	  desc->unproc_orig_rec_edges++;
+      }
+}
+
+/* If caller edge counts of a clone created for a self-recursive arithmetic
+   jump function must be adjusted because it is coming from the "seed" clone
+   for the first value and so has been excessively scaled back as if it was not
+   a recursive call, adjust it so that the incoming counts of NODE match its
+   count. NODE is the node or its thunk.  */
+
+static void
+adjust_clone_incoming_counts (cgraph_node *node,
+			      desc_incoming_count_struct *desc)
+{
+  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
+    if (cs->caller->thunk)
+      {
+	adjust_clone_incoming_counts (cs->caller, desc);
+	profile_count sum = profile_count::zero ();
+	for (cgraph_edge *e = cs->caller->callers; e; e = e->next_caller)
+	  if (e->count.initialized_p ())
+	    sum += e->count.ipa ();
+	cs->count = cs->count.combine_with_ipa_count (sum);
+      }
+    else if (!desc->processed_edges->contains (cs)
+	     && cs->caller->clone_of == desc->orig)
+      {
+	cs->count += desc->count;
+	if (dump_file)
+	  {
+	    fprintf (dump_file, "       Adjusted count of an incoming edge of "
+		     "a clone %s -> %s to ", cs->caller->dump_name (),
+		     cs->callee->dump_name ());
+	    cs->count.dump (dump_file);
+	    fprintf (dump_file, "\n");
+	  }
+      }
+}
+
+/* When ORIG_NODE has been cloned for values which have been generated for a
+   self-recursive call as a result of an arithmetic pass-through
+   jump function, adjust its count together with counts of all such clones in
+   SELF_GEN_CLONES, which at this point also contains ORIG_NODE itself.
+
+   The function sums up the portion of the counts of the original node and
+   all its clones that cannot be attributed to any specific clone through a
+   non-recursive edge.  This sum is then evenly divided between the clones and
+   on top of that each one gets all the counts which can be attributed directly
+   to it.  */
+
+static void
+update_counts_for_self_gen_clones (cgraph_node *orig_node,
+				   const vec<cgraph_node *> &self_gen_clones)
+{
+  profile_count redist_sum = orig_node->count.ipa ();
+  if (!(redist_sum > profile_count::zero ()))
+    return;
+
+  if (dump_file)
+    fprintf (dump_file, "     Updating profile of self recursive clone "
+	     "series\n");
+
+  gather_other_count_struct gocs;
+  gocs.orig = orig_node;
+  gocs.other_count = profile_count::zero ();
+
+  auto_vec <profile_count, 8> other_edges_count;
+  for (cgraph_node *n : self_gen_clones)
+    {
+      gocs.other_count = profile_count::zero ();
+      n->call_for_symbol_thunks_and_aliases (gather_count_of_non_rec_edges,
+					     &gocs, false);
+      other_edges_count.safe_push (gocs.other_count);
+      redist_sum -= gocs.other_count;
+    }
+
+  hash_set<cgraph_edge *> processed_edges;
+  unsigned i = 0;
+  for (cgraph_node *n : self_gen_clones)
+    {
+      profile_count orig_count = n->count;
+      profile_count new_count
+	= (redist_sum.apply_scale (1, self_gen_clones.length ())
+	   + other_edges_count[i]);
+      new_count = lenient_count_portion_handling (new_count, orig_node);
+      n->count = new_count;
+      profile_count::adjust_for_ipa_scaling (&new_count, &orig_count);
+      for (cgraph_edge *cs = n->callees; cs; cs = cs->next_callee)
+	{
+	  cs->count = cs->count.apply_scale (new_count, orig_count);
+	  processed_edges.add (cs);
+	}
+      for (cgraph_edge *cs = n->indirect_calls; cs; cs = cs->next_callee)
+	cs->count = cs->count.apply_scale (new_count, orig_count);
+
+      i++;
+    }
+
+  /* There are still going to be edges to ORIG_NODE that have one or more
+     clones coming from another node clone in SELF_GEN_CLONES and which we
+     scaled by the same amount, which means that the total incoming sum of
+     counts to ORIG_NODE will be too high, scale such edges back.  */
+  for (cgraph_edge *cs = orig_node->callees; cs; cs = cs->next_callee)
+    {
+      if (cs->callee->ultimate_alias_target () == orig_node)
+	{
+	  unsigned den = 0;
+	  for (cgraph_edge *e = cs; e; e = get_next_cgraph_edge_clone (e))
+	    if (e->callee->ultimate_alias_target () == orig_node
+		&& processed_edges.contains (e))
+	      den++;
+	  if (den > 0)
+	    for (cgraph_edge *e = cs; e; e = get_next_cgraph_edge_clone (e))
+	      if (e->callee->ultimate_alias_target () == orig_node
+		  && processed_edges.contains (e))
+		e->count = e->count.apply_scale (1, den);
+	}
+    }
+
+  /* Edges from the seeds of the values generated for arithmetic jump
+     functions along self-recursive edges are likely to have fairly low
+     counts, and so edges from them to nodes in self_gen_clones do not
+     correspond to the artificially distributed counts of those nodes.  As a
+     result, the total sum of incoming edge counts to some clones might be
+     too low.  Detect this situation and correct it.  */
+  for (cgraph_node *n : self_gen_clones)
+    {
+      if (!(n->count.ipa () > profile_count::zero ()))
+	continue;
+
+      desc_incoming_count_struct desc;
+      desc.orig = orig_node;
+      desc.processed_edges = &processed_edges;
+      desc.count = profile_count::zero ();
+      desc.unproc_orig_rec_edges = 0;
+      analyze_clone_icoming_counts (n, &desc);
+
+      if (n->count.differs_from_p (desc.count))
+	{
+	  if (n->count > desc.count
+	      && desc.unproc_orig_rec_edges > 0)
+	    {
+	      desc.count = n->count - desc.count;
+	      desc.count
+		= desc.count.apply_scale (1, desc.unproc_orig_rec_edges);
+	      adjust_clone_incoming_counts (n, &desc);
+	    }
+	  else if (dump_file)
+	    fprintf (dump_file,
+		     "       Unable to fix up incoming counts for %s.\n",
+		     n->dump_name ());
+	}
+    }
+
+  if (dump_file)
+    for (cgraph_node *n : self_gen_clones)
+      dump_profile_updates (n, n != orig_node);
+  return;
+}
+
 /* After a specialized NEW_NODE version of ORIG_NODE has been created, update
-   their profile information to reflect this.  */
+   their profile information to reflect this.  This function should not be used
+   for clones generated for arithmetic pass-through jump functions on a
+   self-recursive call graph edge; that situation is handled by
+   update_counts_for_self_gen_clones.  */
 
 static void
 update_profiling_info (struct cgraph_node *orig_node,
 		       struct cgraph_node *new_node)
 {
-  struct cgraph_edge *cs;
   struct caller_statistics stats;
-  profile_count new_sum, orig_sum;
-  profile_count remainder, orig_node_count = orig_node->count;
-  profile_count orig_new_node_count = new_node->count;
+  profile_count new_sum;
+  profile_count remainder, orig_node_count = orig_node->count.ipa ();
 
-  if (!(orig_node_count.ipa () > profile_count::zero ()))
+  if (!(orig_node_count > profile_count::zero ()))
     return;
 
-  init_caller_stats (&stats);
-  orig_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, &stats,
-					       false);
-  orig_sum = stats.count_sum;
-  init_caller_stats (&stats);
+  if (dump_file)
+    {
+      fprintf (dump_file, "     Updating profile from original count: ");
+      orig_node_count.dump (dump_file);
+      fprintf (dump_file, "\n");
+    }
+
+  init_caller_stats (&stats, new_node);
   new_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, &stats,
 					      false);
   new_sum = stats.count_sum;
 
-  if (orig_node_count < orig_sum + new_sum)
+  if (new_sum > orig_node_count)
     {
-      if (dump_file)
-	{
-	  fprintf (dump_file, "    Problem: node %s has too low count ",
-		   orig_node->dump_name ());
-	  orig_node_count.dump (dump_file);
-	  fprintf (dump_file, "while the sum of incoming count is ");
-	  (orig_sum + new_sum).dump (dump_file);
-	  fprintf (dump_file, "\n");
-	}
-
-      orig_node_count = (orig_sum + new_sum).apply_scale (12, 10);
-      if (dump_file)
-	{
-	  fprintf (dump_file, "      proceeding by pretending it was ");
-	  orig_node_count.dump (dump_file);
-	  fprintf (dump_file, "\n");
-	}
+      /* TODO: Perhaps this should be gcc_unreachable ()?  */
+      remainder = profile_count::zero ().guessed_local ();
     }
+  else if (stats.rec_count_sum.nonzero_p ())
+    {
+      int new_nonrec_calls = stats.n_nonrec_calls;
+      /* There are self-recursive edges which are likely to bring in the
+	 majority of calls but which we must divide between the original and
+	 the new node.  */
+      init_caller_stats (&stats, orig_node);
+      orig_node->call_for_symbol_thunks_and_aliases (gather_caller_stats,
+						     &stats, false);
+      int orig_nonrec_calls = stats.n_nonrec_calls;
+      profile_count orig_nonrec_call_count = stats.count_sum;
 
-  remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa ()
-						      - new_sum.ipa ());
+      if (orig_node->local)
+	{
+	  if (!orig_nonrec_call_count.nonzero_p ())
+	    {
+	      if (dump_file)
+		fprintf (dump_file, "       The original is local and the only "
+			 "incoming edges from non-dead callers with nonzero "
+			 "counts are self-recursive, assuming it is cold.\n");
+	      /* The NEW_NODE count and counts of all its outgoing edges
+		 are still unmodified copies of ORIG_NODE's.  Just clear
+		 the latter and bail out.  */
+	      profile_count zero;
+              if (opt_for_fn (orig_node->decl, flag_profile_partial_training))
+                zero = profile_count::zero ().guessed_local ();
+	      else
+		zero = profile_count::adjusted_zero ();
+	      orig_node->count = zero;
+	      for (cgraph_edge *cs = orig_node->callees;
+		   cs;
+		   cs = cs->next_callee)
+		cs->count = zero;
+	      for (cgraph_edge *cs = orig_node->indirect_calls;
+		   cs;
+		   cs = cs->next_callee)
+		cs->count = zero;
+	      return;
+	    }
+	}
+      else
+	{
+	  /* Let's behave as if there was another caller that accounts for all
+	     the calls that were either indirect or from other compilation
+	     units. */
+	  orig_nonrec_calls++;
+	  profile_count pretend_caller_count
+	    = (orig_node_count - new_sum - orig_nonrec_call_count
+	       - stats.rec_count_sum);
+	  orig_nonrec_call_count += pretend_caller_count;
+	}
 
-  /* With partial train run we do not want to assume that original's
-     count is zero whenever we redurect all executed edges to clone.
-     Simply drop profile to local one in this case.  */
-  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
-      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
-      && flag_profile_partial_training)
-    remainder = remainder.guessed_local ();
+      /* Divide all "unexplained" counts roughly proportionally to sums of
+	 counts of non-recursive calls.
+
+	 We put rather arbitrary limits on how many counts we claim because the
+	 sum of non-self-recursive incoming counts is only a rough guideline
+	 and there are cases (such as mcf) where using it blindly just takes
+	 too many.  And if lattices are considered in the opposite order we
+	 could also take too few.  */
+      profile_count unexp = orig_node_count - new_sum - orig_nonrec_call_count;
+
+      int limit_den = 2 * (orig_nonrec_calls + new_nonrec_calls);
+      profile_count new_part
+	= MAX(MIN (unexp.apply_scale (new_sum,
+				      new_sum + orig_nonrec_call_count),
+		   unexp.apply_scale (limit_den - 1, limit_den)),
+	      unexp.apply_scale (new_nonrec_calls, limit_den));
+      if (dump_file)
+	{
+	  fprintf (dump_file, "       Claiming ");
+	  new_part.dump (dump_file);
+	  fprintf (dump_file, " of unexplained ");
+	  unexp.dump (dump_file);
+	  fprintf (dump_file, " counts because of self-recursive "
+		   "calls\n");
+	}
+      new_sum += new_part;
+      remainder = lenient_count_portion_handling (orig_node_count - new_sum,
+						  orig_node);
+    }
+  else
+    remainder = lenient_count_portion_handling (orig_node_count - new_sum,
+						orig_node);
 
   new_sum = orig_node_count.combine_with_ipa_count (new_sum);
   new_node->count = new_sum;
   orig_node->count = remainder;
 
+  profile_count orig_new_node_count = orig_node_count;
   profile_count::adjust_for_ipa_scaling (&new_sum, &orig_new_node_count);
-  for (cs = new_node->callees; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = new_node->callees; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (new_sum, orig_new_node_count);
-  for (cs = new_node->indirect_calls; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = new_node->indirect_calls; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (new_sum, orig_new_node_count);
 
   profile_count::adjust_for_ipa_scaling (&remainder, &orig_node_count);
-  for (cs = orig_node->callees; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = orig_node->callees; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (remainder, orig_node_count);
-  for (cs = orig_node->indirect_calls; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = orig_node->indirect_calls; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (remainder, orig_node_count);
 
   if (dump_file)
-    dump_profile_updates (orig_node, new_node);
+    {
+      dump_profile_updates (new_node, true);
+      dump_profile_updates (orig_node, false);
+    }
 }
 
 /* Update the respective profile of specialized NEW_NODE and the original
@@ -4495,7 +4818,10 @@ update_specialized_profile (struct cgraph_node *new_node,
     }
 
   if (dump_file)
-    dump_profile_updates (orig_node, new_node);
+    {
+      dump_profile_updates (new_node, true);
+      dump_profile_updates (orig_node, false);
+    }
 }
 
 static void adjust_references_in_caller (cgraph_edge *cs,
@@ -4795,8 +5121,7 @@ create_specialized_node (struct cgraph_node *node,
       if (aggvals)
 	ipa_dump_agg_replacement_values (dump_file, aggvals);
     }
-  ipa_check_create_node_params ();
-  update_profiling_info (node, new_node);
+
   new_info = ipa_node_params_sum->get (new_node);
   new_info->ipcp_orig_node = node;
   new_node->ipcp_clone = true;
@@ -5621,17 +5946,20 @@ ipcp_val_agg_replacement_ok_p (ipa_agg_replacement_value *,
 /* Decide whether to create a special version of NODE for value VAL of
    parameter at the given INDEX.  If OFFSET is -1, the value is for the
    parameter itself, otherwise it is stored at the given OFFSET of the
-   parameter.  AVALS describes the other already known values.  */
+   parameter.  AVALS describes the other already known values.  SELF_GEN_CLONES
+   is a vector which contains clones created for self-recursive calls with an
+   arithmetic pass-through jump function.  */
 
 template <typename valtype>
 static bool
 decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
-		    ipcp_value<valtype> *val, ipa_auto_call_arg_values *avals)
+		    ipcp_value<valtype> *val, ipa_auto_call_arg_values *avals,
+		    vec<cgraph_node *> *self_gen_clones)
 {
   struct ipa_agg_replacement_value *aggvals;
   int caller_count;
   sreal freq_sum;
-  profile_count count_sum;
+  profile_count count_sum, rec_count_sum;
   vec<cgraph_edge *> callers;
 
   if (val->spec_node)
@@ -5647,13 +5975,31 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
 		 val->local_size_cost + overall_size);
       return false;
     }
-  else if (!get_info_about_necessary_edges (val, node, &freq_sum, &count_sum,
-					    &caller_count))
+  else if (!get_info_about_necessary_edges (val, node, &freq_sum, &caller_count,
+					    &rec_count_sum, &count_sum))
     return false;
 
   if (!dbg_cnt (ipa_cp_values))
     return false;
 
+  if (val->self_recursion_generated_p ())
+    {
+      /* The edge counts in this case might not have been adjusted yet.
+	 Nevertheless, even if they were, it would only be guesswork, which we
+	 can do just as well now.  The recursive part of the counts can be
+	 derived from the count of the original node anyway.  */
+      if (node->count.ipa ().nonzero_p ())
+	{
+	  unsigned dem = self_gen_clones->length () + 1;
+	  rec_count_sum = node->count.ipa ().apply_scale (1, dem);
+	}
+      else
+	rec_count_sum = profile_count::zero ();
+    }
+
+  /* get_info_about_necessary_edges only sums up ipa counts.  */
+  count_sum += rec_count_sum;
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, " - considering value ");
@@ -5694,6 +6040,12 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
 						      offset, val->value));
   val->spec_node = create_specialized_node (node, known_csts, known_contexts,
 					    aggvals, callers);
+
+  if (val->self_recursion_generated_p ())
+    self_gen_clones->safe_push (val->spec_node);
+  else
+    update_profiling_info (node, val->spec_node);
+
   callers.release ();
   overall_size += val->local_size_cost;
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -5722,6 +6074,7 @@ decide_whether_version_node (struct cgraph_node *node)
     fprintf (dump_file, "\nEvaluating opportunities for %s.\n",
 	     node->dump_name ());
 
+  auto_vec <cgraph_node *, 9> self_gen_clones;
   ipa_auto_call_arg_values avals;
   gather_context_independent_values (info, &avals, false, NULL);
 
@@ -5736,7 +6089,8 @@ decide_whether_version_node (struct cgraph_node *node)
 	{
 	  ipcp_value<tree> *val;
 	  for (val = lat->values; val; val = val->next)
-	    ret |= decide_about_value (node, i, -1, val, &avals);
+	    ret |= decide_about_value (node, i, -1, val, &avals,
+				       &self_gen_clones);
 	}
 
       if (!plats->aggs_bottom)
@@ -5750,7 +6104,8 @@ decide_whether_version_node (struct cgraph_node *node)
 		&& (plats->aggs_contain_variable
 		    || !aglat->is_single_const ()))
 	      for (val = aglat->values; val; val = val->next)
-		ret |= decide_about_value (node, i, aglat->offset, val, &avals);
+		ret |= decide_about_value (node, i, aglat->offset, val, &avals,
+					   &self_gen_clones);
 	}
 
       if (!ctxlat->bottom
@@ -5758,10 +6113,17 @@ decide_whether_version_node (struct cgraph_node *node)
 	{
 	  ipcp_value<ipa_polymorphic_call_context> *val;
 	  for (val = ctxlat->values; val; val = val->next)
-	    ret |= decide_about_value (node, i, -1, val, &avals);
+	    ret |= decide_about_value (node, i, -1, val, &avals,
+				       &self_gen_clones);
 	}
     }
 
+  if (!self_gen_clones.is_empty ())
+    {
+      self_gen_clones.safe_push (node);
+      update_counts_for_self_gen_clones (node, self_gen_clones);
+    }
+
   if (info->do_clone_for_all_contexts)
     {
       if (!dbg_cnt (ipa_cp_values))
-- 
2.33.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on
  2021-10-06 15:33   ` Jan Hubicka
@ 2021-10-18 17:10     ` Martin Jambor
  2021-10-27 13:22       ` Martin Jambor
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Jambor @ 2021-10-18 17:10 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Xionghu Luo

Hi,

On Wed, Oct 06 2021, Jan Hubicka wrote:
>> 2021-08-23  Martin Jambor  <mjambor@suse.cz>
>> 
>> 	* params.opt (param_ipa_cp_profile_count_base): New parameter.
>> 	* ipa-cp.c (max_count): Replace with base_count, replace all
>> 	occurrences too, unless otherwise stated.
>> 	(ipcp_cloning_candidate_p): identify mostly-directly called
>> 	functions based on their counts, not max_count.
>> 	(compare_edge_profile_counts): New function.
>> 	(ipcp_propagate_stage): Instead of setting max_count, find the
>> 	appropriate edge count in a sorted vector of counts of eligible
>> 	edges and make it the base_count.
>> ---
>>  gcc/ipa-cp.c   | 82 +++++++++++++++++++++++++++++++++++++++++++++-----
>>  gcc/params.opt |  4 +++
>>  2 files changed, 78 insertions(+), 8 deletions(-)
>> 
>> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
>> index 53cca7aa804..6ab74f61e83 100644
>> --- a/gcc/ipa-cp.c
>> +++ b/gcc/ipa-cp.c
>> @@ -400,9 +400,9 @@ object_allocator<ipcp_value_source<tree> > ipcp_sources_pool
>>  object_allocator<ipcp_agg_lattice> ipcp_agg_lattice_pool
>>    ("IPA_CP aggregate lattices");
>>  
>> -/* Maximal count found in program.  */
>> +/* Base count to use in heuristics when using profile feedback.  */
>>  
>> -static profile_count max_count;
>> +static profile_count base_count;
>>  
>>  /* Original overall size of the program.  */
>>  
>> @@ -809,7 +809,8 @@ ipcp_cloning_candidate_p (struct cgraph_node *node)
>>    /* When profile is available and function is hot, propagate into it even if
>>       calls seems cold; constant propagation can improve function's speed
>>       significantly.  */
>> -  if (max_count > profile_count::zero ())
>> +  if (stats.count_sum > profile_count::zero ()
>> +      && node->count.ipa ().initialized_p ())
>>      {
>>        if (stats.count_sum > node->count.ipa ().apply_scale (90, 100))
>>  	{
>> @@ -3310,10 +3311,10 @@ good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
>>  
>>    ipa_node_params *info = ipa_node_params_sum->get (node);
>>    int eval_threshold = opt_for_fn (node->decl, param_ipa_cp_eval_threshold);
>> -  if (max_count > profile_count::zero ())
>> +  if (base_count > profile_count::zero ())
>>      {
>>  
>> -      sreal factor = count_sum.probability_in (max_count).to_sreal ();
>> +      sreal factor = count_sum.probability_in (base_count).to_sreal ();
>
> If you have part of program built with -fprofile-use and part built without this will
> disable cloning for functions called only from places w/o profile.
> I think we want to count frequencies when ipa profile is uninitialized
> and then pass the cloning opportunity if either count or freqs says it is
> a good idea.

OK, I would like to address that by a separate follow-up patch below.

>
>>        sreal evaluation = (time_benefit * factor) / size_cost;
>>        evaluation = incorporate_penalties (node, info, evaluation);
>>        evaluation *= 1000;
>> @@ -3950,6 +3951,21 @@ value_topo_info<valtype>::propagate_effects ()
>>      }
>>  }
>>  
>> +/* Callback for qsort to sort counts of all edges.  */
>> +
>> +static int
>> +compare_edge_profile_counts (const void *a, const void *b)
>> +{
>> +  const profile_count *cnt1 = (const profile_count *) a;
>> +  const profile_count *cnt2 = (const profile_count *) b;
>> +
>> +  if (*cnt1 < *cnt2)
>> +    return 1;
>> +  if (*cnt1 > *cnt2)
>> +    return -1;
>> +  return 0;
>> +}
>> +
>>  
>>  /* Propagate constants, polymorphic contexts and their effects from the
>>     summaries interprocedurally.  */
>> @@ -3962,8 +3978,10 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
>>    if (dump_file)
>>      fprintf (dump_file, "\n Propagating constants:\n\n");
>>  
>> -  max_count = profile_count::uninitialized ();
>> +  base_count = profile_count::uninitialized ();
>>  
>> +  bool compute_count_base = false;
>> +  unsigned base_count_pos_percent = 0;
>>    FOR_EACH_DEFINED_FUNCTION (node)
>>    {
>>      if (node->has_gimple_body_p ()
>> @@ -3981,9 +3999,57 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
>>      ipa_size_summary *s = ipa_size_summaries->get (node);
>>      if (node->definition && !node->alias && s != NULL)
>>        overall_size += s->self_size;
>> -    max_count = max_count.max (node->count.ipa ());
>> +    if (node->count.ipa ().initialized_p ())
>> +      {
>> +	compute_count_base = true;
>> +	unsigned pos_percent = opt_for_fn (node->decl,
>> +					   param_ipa_cp_profile_count_base);
>> +	base_count_pos_percent = MAX (base_count_pos_percent, pos_percent);
>> +      }
>>    }
>>  
>> +  if (compute_count_base)
>> +    {
>> +      auto_vec<profile_count> all_edge_counts;
>> +      all_edge_counts.reserve_exact (symtab->edges_count);
>> +      FOR_EACH_DEFINED_FUNCTION (node)
>> +	for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
>> +	  {
>> +	    profile_count count = cs->count.ipa ();
>> +	    if (!(count > profile_count::zero ()))
>> +	      continue;
>> +
>> +	    enum availability avail;
>> +	    cgraph_node *tgt
>> +	      = cs->callee->function_or_virtual_thunk_symbol (&avail);
>> +	    ipa_node_params *info = ipa_node_params_sum->get (tgt);
>> +	    if (info && info->versionable)
>> +	      all_edge_counts.quick_push (count);
>> +	  }
>
> I wonder how big those tables get for the whole program, but probably it is
> safe since it is heap allocated and temporary.
> Not sure if reserving exact is going to give a good idea what the usual
> count is - I think if a program is built w/o profile feedback we may end
> up with a few functions with zero or 1 count (called once and unlikely).

My reasoning was that rather than iterating over all edges twice (once
to count how big an array I need and once to fill it in), I'd just
allocate the known maximum.  If I happen to be wrong, it is easy to
change the code not to allocate an element of the array for edges with
zero or one count.  I can of course change it before pushing to master.
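
For illustration, a minimal sketch of that variant (hypothetical, not part of
the posted patch) would just need a stronger filter in the gathering loop and
safe_push instead of pre-reserving one slot per edge:

  profile_count count = cs->count.ipa ();
  /* Hypothetical: also skip edges with a count of one so that they do not
     take up slots in the vector.  */
  if (!(count > profile_count::zero ())
      || count.to_gcov_type () <= 1)
    continue;
  /* ... availability and versionability checks as in the patch ... */
  all_edge_counts.safe_push (count);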

>> +
>> +      if (!all_edge_counts.is_empty ())
>> +	{
>> +	  gcc_assert (base_count_pos_percent <= 100);
>> +	  all_edge_counts.qsort (compare_edge_profile_counts);
>> +
>> +	  unsigned base_count_pos
>> +	    = ((all_edge_counts.length () * (base_count_pos_percent)) / 100);
>> +	  base_count = all_edge_counts[base_count_pos];
>
> It may make more sense to sum all the counts and then do the given
> percentile of function invocations, but we can see how well this fares.

I hope my approach will have similar results for big applications and
also work for small programs (and, well, benchmarks) where very few or
even one edge dominates the program.
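
To make the comparison concrete, here is a rough sketch of the summing variant
(hypothetical; it reuses all_edge_counts and base_count from the patch, while
percent_of_invocations is a made-up knob, not an existing parameter) that
accumulates the sorted counts until the hottest edges cover the requested
share of the total:

  /* Hypothetical knob: how much of the total invocation count the hottest
     edges must cover before we take the current count as the base.  */
  const int percent_of_invocations = 90;
  profile_count total = profile_count::zero ();
  for (const profile_count &c : all_edge_counts)
    total += c;
  profile_count limit = total.apply_scale (percent_of_invocations, 100);
  profile_count covered = profile_count::zero ();
  /* all_edge_counts is sorted in descending order.  */
  for (const profile_count &c : all_edge_counts)
    {
      covered += c;
      base_count = c;
      if (covered >= limit)
	break;
    }

The outliers with really extreme counts in bash and python would likely still
need to be ignored in this variant too.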

Below is the aforementioned follow-up fix, which has passed LTO
profiled-bootstrap on x86_64-linux.

Is it OK too?

Thanks,

Martin


Subject: [PATCH 4/4] ipa-cp: Use profile counters (or not) based on local availability

This is a follow-up small patch to address Honza's review of my
previous patch to select saner profile count to base heuristics on.
Currently the IPA-CP heuristics switch to PGO-mode only if there are
PGO counters available for any part of the call graph.  This change
makes it switch to the PGO mode only if any of the incoming edges
bringing in the constant in question had any ipa-quality counts on
them.  Consequently, if a part of the program is built with
-fprofile-use and another part without, IPA-CP will use
estimated-frequency-based heuristics for the latter.

I still wonder whether this should only happen with
flag_profile_partial_training on.  It seems like we're behaving as if
it was always on.

gcc/ChangeLog:

2021-10-18  Martin Jambor  <mjambor@suse.cz>

	* ipa-cp.c (good_cloning_opportunity_p): Decide whether to use
	profile feedback depending on their local availability.
---
 gcc/ipa-cp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index 4d07a6d0a58..703541d15cc 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -3311,9 +3311,9 @@ good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
 
   ipa_node_params *info = ipa_node_params_sum->get (node);
   int eval_threshold = opt_for_fn (node->decl, param_ipa_cp_eval_threshold);
-  if (base_count > profile_count::zero ())
+  if (count_sum > profile_count::zero ())
     {
-
+      gcc_assert (base_count > profile_count::zero ());
       sreal factor = count_sum.probability_in (base_count).to_sreal ();
       sreal evaluation = (time_benefit * factor) / size_cost;
       evaluation = incorporate_penalties (node, info, evaluation);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation
  2021-10-18 16:56     ` Martin Jambor
@ 2021-10-27 13:18       ` Martin Jambor
  0 siblings, 0 replies; 16+ messages in thread
From: Martin Jambor @ 2021-10-27 13:18 UTC (permalink / raw)
  To: GCC Patches

On Mon, Oct 18 2021, Martin Jambor wrote:
>
[...]
>
> IPA-CP does not do a reasonable job when it is updating profile counts
> after it has created clones of recursive functions.  This patch
> addresses that by:
>
> 1. Only updating counts for special-context clones.  When a clone is
> created for all contexts, the original is going to be dead and the
> cgraph machinery has copied counts to the new node which is the right
> thing to do.  Therefore updating counts has been moved from
> create_specialized_node to decide_about_value and
> decide_whether_version_node.
>
> 2. The current profile updating code artificially increased the assumed
> old count when the sum of counts of incoming edges to both the
> original and new node were bigger than the count of the original
> node.  This always happened when self-recursive edge from the clone
> was also redirected to the clone because both the original edge and
> its clone had original high counts.  This clutch was removed and
> replaced by the next point.
>
> 3. When cloning also redirects a self-recursive clone to the clone
> itself, new logic has been added to divide the counts brought by such
> recursive edges between the original node and the clone.  This is
> impossible to do well without special knowledge about the function and
> which non-recursive entry calls are responsible for what portion of
> recursion depth, so the approach taken is rather crude.
>
> For local nodes, we detect the case when the original node is never
> called (in the training run at least) with another value and if so,
> steal all its counts like if it was dead.  If that is not the case, we
> try to divide the count brought by recursive edges (or rather not
> brought by direct edges) proportionally to the counts brought by
> non-recursive edges - but with artificial limits in place so that we
> do not take too many or too few, because that was happening with
> detrimental effect in mcf_r.
>
> 4. When cloning creates extra clones for values brought by a formerly
> self-recursive edge with an arithmetic pass-through jump function on
> it, such as it does in exchange2_r, all such clones are processed at
> once rather than one after another.  The counts of all such nodes are
> distributed evenly (modulo even-formerly-non-recursive-edges) and the
> whole situation is then fixed up so that the edge counts fit.  This is
> what new function update_counts_for_self_gen_clones does.
>
> 5. When values brought by a formerly self-recursive edge with an
> arithmetic pass-through jump function on it are evaluated by
> heuristics which assumes vast majority of node counts are result of
> recursive calls and so we simply divide those with the number of
> clones there would be if we created another one.
>
> 6. The mechanisms in init_caller_stats and gather_caller_stats and
> get_info_about_necessary_edges was enhanced to gather data required
> for the above and a missing check not to count dead incoming edges was
> also added.
>
> gcc/ChangeLog:
>
> 2021-10-15  Martin Jambor  <mjambor@suse.cz>
>
> 	* ipa-cp.c (struct caller_statistics): New fields rec_count_sum,
> 	n_nonrec_calls and itself, document all fields.
> 	(init_caller_stats): Initialize the above new fields.
> 	(gather_caller_stats): Gather self-recursive counts and calls number.
> 	(get_info_about_necessary_edges): Gather counts of self-recursive and
> 	other edges bringing in the requested value separately.
> 	(dump_profile_updates): Rework to dump info about a single node only.
> 	(lenient_count_portion_handling): New function.
> 	(struct gather_other_count_struct): New type.
> 	(gather_count_of_non_rec_edges): New function.
> 	(struct desc_incoming_count_struct): New type.
> 	(analyze_clone_icoming_counts): New function.
> 	(adjust_clone_incoming_counts): Likewise.
> 	(update_counts_for_self_gen_clones): Likewise.
> 	(update_profiling_info): Rewritten.
> 	(update_specialized_profile): Adjust call to dump_profile_updates.
> 	(create_specialized_node): Do not update profiling info.
> 	(decide_about_value): New parameter self_gen_clones, either push new
> 	clones into it or update their profile counts.  For self-recursively
> 	generated values, use a portion of the node count instead of count
> 	from self-recursive edges to estimate goodness.
> 	(decide_whether_version_node): Gather clones for self-generated values
> 	in a new vector, update their profiles at once at the end.


Honza approved the patch in a private conversation and I have pushed it
to master as commit d1e2e4f9ce4df50564f1244dcea9befc3066faa8.

Thanks,

Martin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on
  2021-08-23 18:49 ` [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on Martin Jambor
  2021-10-06 15:33   ` Jan Hubicka
@ 2021-10-27 13:20   ` Martin Jambor
  1 sibling, 0 replies; 16+ messages in thread
From: Martin Jambor @ 2021-10-27 13:20 UTC (permalink / raw)
  To: GCC Patches

Hi,

On Mon, Aug 23 2021, Martin Jambor wrote:
> When profile feedback is available, IPA-CP takes the count of the
> hottest node and then evaluates all call contexts relative to it.
> This means that typically almost no clones for specialized contexts
> are ever created because the maximum is some special function, called
> from everywhere (that is likely to get inlined anyway) and all the
> examined edges look cold compared to it.
>
> This patch changes the selection.  It simply sorts counts of all edges
> eligible for cloning in a vector and then picks the count in 90th
> percentile (the actual number is configurable via a parameter).
>
> I also tried more complex approaches which were summing the counts and
> picking the edge which together with all hotter edges accounted for a
> given portion of the total sum of all edge counts.  But first it was
> not apparently clear to me that they make more logical sense than the
> simple method and practically I always also had to ignore a few
> percent of the hottest edges with really extreme counts (looking at
> bash and python).  And when I had to do that anyway, it seemed simpler
> to just "ignore" more and take the first non-ignored count as the
> base.
>
> Nevertheless, if people think some more sophisticated method should be
> used anyway, I am willing to be persuaded.  But this patch is a clear
> improvement over the current situation.
>
> gcc/ChangeLog:
>
> 2021-08-23  Martin Jambor  <mjambor@suse.cz>
>
> 	* params.opt (param_ipa_cp_profile_count_base): New parameter.
> 	* ipa-cp.c (max_count): Replace with base_count, replace all
> 	occurrences too, unless otherwise stated.
> 	(ipcp_cloning_candidate_p): identify mostly-directly called
> 	functions based on their counts, not max_count.
> 	(compare_edge_profile_counts): New function.
> 	(ipcp_propagate_stage): Instead of setting max_count, find the
> 	appropriate edge count in a sorted vector of counts of eligible
> 	edges and make it the base_count.

Honza approved this patch in a private conversation but then I noticed I
forgot to add an entry for the new parameter into invoke.texi, so I
fixed that problem (and checked the result with make info and make pdf)
and pushed the patch to master as commit
ab1008255e37b5b51a433ed69e04c06300543799.

Thanks,

Martin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on
  2021-10-18 17:10     ` Martin Jambor
@ 2021-10-27 13:22       ` Martin Jambor
  0 siblings, 0 replies; 16+ messages in thread
From: Martin Jambor @ 2021-10-27 13:22 UTC (permalink / raw)
  To: GCC Patches

Hi,

On Mon, Oct 18 2021, Martin Jambor wrote:
>
[...]
>
>
> This is a follow-up small patch to address Honza's review of my
> previous patch to select saner profile count to base heuristics on.
> Currently the IPA-CP heuristics switch to PGO-mode only if there are
> PGO counters available for any part of the call graph.  This change
> makes it switch to the PGO mode only if any of the incoming edges
> bringing in the constant in question had any ipa-quality counts on
> them.  Consequently, if a part of the program is built with
> -fprofile-use and another part without, IPA-CP will use
> estimated-frequency-based heuristics for the latter.
>
> I still wonder whether this should only happen with
> flag_profile_partial_training on.  It seems like we're behaving as if
> it was always on.
>

Honza approved this patch in a private conversation and so I have pushed
it to master as commit ab810952eb7c061e37054ddd1dfe0aa033365131.

Thanks,

Martin


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-10-27 13:22 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-24 11:48 [PATCH 0/4] IPA-CP profile feedback handling fixes Martin Jambor
2021-08-20 17:43 ` [PATCH 1/4] cgraph: Do not warn about caller count mismatches of removed functions Martin Jambor
2021-09-16 15:10   ` Martin Jambor
2021-08-20 17:43 ` [PATCH 2/4] ipa-cp: Propagation boost for recursion generated values Martin Jambor
2021-10-06 15:49   ` Jan Hubicka
2021-10-07 14:59     ` Martin Jambor
2021-10-07 15:25       ` Jan Hubicka
2021-08-20 17:43 ` [PATCH 3/4] ipa-cp: Fix updating of profile counts and self-gen value evaluation Martin Jambor
2021-10-08 11:31   ` Jan Hubicka
2021-10-18 16:56     ` Martin Jambor
2021-10-27 13:18       ` Martin Jambor
2021-08-23 18:49 ` [PATCH 4/4] ipa-cp: Select saner profile count to base heuristics on Martin Jambor
2021-10-06 15:33   ` Jan Hubicka
2021-10-18 17:10     ` Martin Jambor
2021-10-27 13:22       ` Martin Jambor
2021-10-27 13:20   ` Martin Jambor

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).