public inbox for gcc-patches@gcc.gnu.org
* COMDAT missing profile handling (was Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition)
@ 2013-08-28  5:43 Teresa Johnson
  2013-08-30  9:37 ` Jan Hubicka
  0 siblings, 1 reply; 3+ messages in thread
From: Teresa Johnson @ 2013-08-28  5:43 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Wed, Aug 21, 2013 at 8:25 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >
>> > Because an offline COMDAT function will be produced for every COMDAT used, I think
>> > it is bad to produce any COMDAT (or any function reachable via calls with a non-0
>> > count) that has an empty profile (either because it got lost in COMDAT merging
>> > or because of a read mismatch).
>>
>> The approach this patch takes is to simply treat those functions the
>> same as we would if we didn't feed back profile data in the first
>> place, by using the frequencies. This is sufficient except when one is
>> inlined, which is why I have the special handling in the inliner
>> itself.
>
> Yes, my original plan was to have a per-function profile_status that
> specifies if the profile is read, guessed or absent, and to make function-local
> decisions sanely with each setting.
>
> Here we read the function, so we set its profile to READ (with all counts being 0).
> We should drop it to GUESSED when we see that there are non-0 count edges
> calling the function in question, and probably we should check if it is obviously
> hot (i.e. reachable by a hot call) and promote its function profile to HOT
> then, to make code placement less bad...
>> >
>> > Since new direct calls can be discovered later, the inliner may want to do that
>> > again each time it inlines a non-0 count call of a COMDAT with a 0 count...
>>
>> How about an approach like this:
>> - Invoke init_and_estimate_bb_frequencies as I am doing to guess the
>> profiles at profile read time for functions with 0 counts.
>
> I see, here we are out of sync.
> We always used to go with estimated frequencies for functions with 0 counts,
> but it seems that this code broke when prediction was moved before profiling.
> (we also should keep edge probabilities from predict.c in that case)
>
> The estimated profile is already there before reading the profile in, so we
> just do not want to overwrite it.  Does the following work for you?

It does get the estimated frequencies on the bbs.

>
> Index: tree-profile.c
> ===================================================================
> --- tree-profile.c      (revision 201838)
> +++ tree-profile.c      (working copy)
> @@ -604,6 +604,34 @@
>
>        pop_cfun ();
>      }
> +  /* See if 0 count function has non-0 count callers.  In this case we
> +     lost some profile.  Drop its function profile to PROFILE_GUESSED.  */
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    {
> +      struct cgraph_edge *e;
> +      bool called = false;
> +      if (node->count)
> +       continue;
> +      for (e = node->callers; e; e = e->next_caller)
> +       {
> +         if (e->count)
> +           called = true;
> +         if (cgraph_maybe_hot_edge_p (e))
> +           break;
> +       }
> +      if ((e || called)
> +         && profile_status_for_function
> +             (DECL_STRUCT_FUNCTION (node->symbol.decl)) == PROFILE_READ)
> +       {
> +         if (dump_file)
> +           fprintf (dump_file, "Dropping 0 profile for %s/%i. %s based on calls.\n",
> +                    cgraph_node_name (node), node->symbol.order,
> +                    e ? "Function is hot" : "Function is normal");
> +         profile_status_for_function (DECL_STRUCT_FUNCTION (node->symbol.decl))
> +           = (flag_guess_branch_prob ? PROFILE_GUESSED : PROFILE_ABSENT);
> +         node->frequency = e ? NODE_FREQUENCY_HOT : NODE_FREQUENCY_NORMAL;
> +       }
> +    }
>
>    del_node_map();
>    return 0;
> Index: predict.c
> ===================================================================
> --- predict.c   (revision 201838)
> +++ predict.c   (working copy)
> @@ -2715,6 +2715,9 @@
>    gcov_type count_max, true_count_max = 0;
>    basic_block bb;
>
> +  if (!ENTRY_BLOCK_PTR->count)
> +    return 0;
> +
>    FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
>      true_count_max = MAX (bb->count, true_count_max);
>
>
>> - At inline time, invoke some kind of freqs_to_counts routine for any
>> 0-count routine that is reached by non-zero call edges. It would take
>
> We should not need that since frequencies ought to be there.
>
>> the sum of all incoming call edge counts and synthesize counts for the
>> bbs using the guessed profile frequencies applied earlier by
>> init_and_estimate_bb_frequencies. Then the inliner can do its normal
>> bb count scaling.
>
> Yes, I guess we should go this way.  Still we will need to watch for overly large values.
> Recursive inlining can probably easily produce quite some nonsense here.
>
> We will also need to solve the problem that in this case cgraph edges will have 0 profile.
> We probably want to play the game there and just do the scaling for the edge count,
> since IPA passes probably do not want to care about partial profiles.

The problem I am having is that my new freqs_to_counts routine takes
the sum of the incoming edge counts and computes the bb counts by
scaling by the estimated bb frequency. But in the case where the
caller was also missing profile data the edge count will have 0
profile as you note above, so the counts remain 0. I am not quite sure
how to handle this. Instead of using the sum of the incoming profile
counts, I could just pick a default profile count to use as the entry
bb count. The synthesized counts should get scaled down appropriately
during inlining.

Teresa

>>
>> Does that seem like a reasonable approach?
>>
>> There is one other fix in this patch:
>> - The clone_inlined_nodes/update_noncloned_frequencies changes below
>> are handling a different case: a 0-count call edge in this module, with
>> a non-zero callee node count due to calls from other modules. It will
>> allow update_noncloned_frequencies to scale down the edge counts in
>> the callee's cloned call tree. This was a fix I made for the
>> callgraph-based linker plugin function reordering, not for splitting
>> (since reordering uses both the node and edge weights to make ordering
>> decisions). Here's a description of the issue when I was debugging it:
>
> Yes, it seems reasonable.  I did not really care about comdats at the time
> I was writing this function.
>> >> Index: ipa-inline-transform.c
>> >> ===================================================================
>> >> --- ipa-inline-transform.c      (revision 201644)
>> >> +++ ipa-inline-transform.c      (working copy)
>> >> @@ -51,7 +51,7 @@ int nfunctions_inlined;
>> >>
>> >>  static void
>> >>  update_noncloned_frequencies (struct cgraph_node *node,
>> >> -                             int freq_scale)
>> >> +                             gcov_type count_scale, int freq_scale)
>> >>  {
>> >>    struct cgraph_edge *e;
>> >>
>> >> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>> >>      freq_scale = 1;
>> >>    for (e = node->callees; e; e = e->next_callee)
>> >>      {
>> >> +      e->count = apply_scale (e->count, count_scale);
>> >>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>> >>        if (e->frequency > CGRAPH_FREQ_MAX)
>> >>          e->frequency = CGRAPH_FREQ_MAX;
>> >>        if (!e->inline_failed)
>> >> -        update_noncloned_frequencies (e->callee, freq_scale);
>> >> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>> >>      }
>> >>    for (e = node->indirect_calls; e; e = e->next_callee)
>> >>      {
>> >> +      e->count = apply_scale (e->count, count_scale);
>> >>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>> >>        if (e->frequency > CGRAPH_FREQ_MAX)
>> >>          e->frequency = CGRAPH_FREQ_MAX;
>> >> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>> >>             }
>> >>           duplicate = false;
>> >>           e->callee->symbol.externally_visible = false;
>> >> -          update_noncloned_frequencies (e->callee, e->frequency);
>> >> +          // In the case of a COMDAT, the callee's count may be from other
>> >> +          // modules, and we need to scale it for the current module's calls
>> >> +          // (e.g. e->count may be 0 despite e->callee->count > 0).
>> >> +          gcov_type count_scale = REG_BR_PROB_BASE;
>> >> +          if (e->callee->count > e->count)
>> >> +            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
>> >> +          update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>
> Please fix the comment to the usual style and go ahead with this change.
>
> Thanks,
> Honza
>> >>         }
>> >>        else
>> >>         {



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


* Re: COMDAT missing profile handling (was Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition)
  2013-08-28  5:43 COMDAT missing profile handling (was Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition) Teresa Johnson
@ 2013-08-30  9:37 ` Jan Hubicka
  2013-08-30 16:13   ` Teresa Johnson
From: Jan Hubicka @ 2013-08-30  9:37 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: Jan Hubicka, gcc-patches, marxin.liska

> > The estimated profile is already there before reading the profile in, so we
> > just do not want to overwrite it.  Does the following work for you?
> 
> It does get the estimated frequencies on the bbs.

Good.

> > We will also need to solve the problem that in this case cgraph edges will have 0 profile.
> > We probably want to play the game there and just do the scaling for the edge count,
> > since IPA passes probably do not want to care about partial profiles.
> 
> The problem I am having is that my new freqs_to_counts routine takes
> the sum of the incoming edge counts and computes the bb counts by
> scaling by the estimated bb frequency. But in the case where the
> caller was also missing profile data the edge count will have 0
> profile as you note above, so the counts remain 0. I am not quite sure
> how to handle this. Instead of using the sum of the incoming profile
> counts, I could just pick a default profile count to use as the entry
> bb count. The synthesized counts should get scaled down appropriately
> during inlining.

I am having a problem thinking of what default we should pick here.

Thinking about the whole problem, I think the following sounds like a possible
solution:

1) use the patch above to get frequencies computed where counts are 0.
   Make sure that branch probabilities are guessed for functions with non-0
   counts, for each branch that has count 0.
2) At the point we see an inconsistency in the IPA profile (i.e. we find
   that we have direct calls with non-0 counts to a comdat with 0 count),
   drop the profile_status for them to PROFILE_GUESSED and let the backend
   optimize them as if there was no FDO.
   I do not think we can do better for those.

   If we do not see this problem, we can keep status as READ so the function
   will get size optimized.  Here we will lose for indirect calls, but I
   do not see how to handle these short of making every 0 count COMDAT guessed.
   Perhaps you can try how these variants work for you performance/code size
   wise.
3) at a time we inline function with guessed profile into function with read
   profile, we will use your trick to feed the counts in based on frequencies.

   We will do that in tree-inline when inlining. I.e. we will never feed fake
   counts into non-inline functions
4) we will want to sanitize the callgraph, too.  This means looking for GUESSED
   profile functions and feeding in their counts/edge counts based on the sum of
   the incoming counts.

   We can do this as part of the ipa-profile pass (I will move it to a separate file
   as it is getting more complex).  Here we will have the propagation issue you
   mention above.  Either we can ignore it or we can iterate.

3) and 4) will absolutely need some capping to be sure we won't make the
synthesized counts bigger than other counts in the program.

I think we have code for 1-3. If the plan looks sane to you, I think we can
start getting it in?

Of course we want to eliminate most of the problems by getting the runtime to
do the merging... (but the merging will not always be possible due to
cfg differences coming from different optimization flags, anyway).

Honza


* Re: COMDAT missing profile handling (was Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition)
  2013-08-30  9:37 ` Jan Hubicka
@ 2013-08-30 16:13   ` Teresa Johnson
From: Teresa Johnson @ 2013-08-30 16:13 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Fri, Aug 30, 2013 at 2:31 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > The estimated profile is already there before reading the profile in, so we
>> > just do not want to overwrite it.  Does the following work for you?
>>
>> It does get the estimated frequencies on the bbs.
>
> Good.

I ended up making some slight modifications, see below.

>
>> > We will also need to solve the problem that in this case cgraph edges will have 0 profile.
>> > We probably want to play the game there and just do the scaling for the edge count,
>> > since IPA passes probably do not want to care about partial profiles.
>>
>> The problem I am having is that my new freqs_to_counts routine takes
>> the sum of the incoming edge counts and computes the bb counts by
>> scaling by the estimated bb frequency. But in the case where the
>> caller was also missing profile data the edge count will have 0
>> profile as you note above, so the counts remain 0. I am not quite sure
>> how to handle this. Instead of using the sum of the incoming profile
>> counts, I could just pick a default profile count to use as the entry
>> bb count. The synthesized counts should get scaled down appropriately
>> during inlining.
>
> I am having a problem thinking of what default we should pick here.

What I tried was: if the sum of the incoming counts was 0, then use
BB_FREQ_MAX as the fake incoming profile count and apportion that
to the bbs based on the guessed profile frequencies. Rather than
iterating, I am simply doing this for all 0-count functions. I put an
assert in the inliner when testing to ensure that we never inline a
0-count body. However, I realized after reading your proposed solution
below that this is not good, since it will cause all of these (possibly
truly cold) functions to not be placed in the unlikely text section.

As an aside, the patch I am testing is hitting an issue in a
profiledbootstrap that I think is an unrelated latent bug exposed by
my changes:

../../gcc_trunk_7/gcc/config/i386/i386.c:33752:34: error: array
subscript is above array bounds [-Werror=array-bounds]
     if (ipar[i] + 1 != ipar[i + 1])

which seems to be a bogus error (probably related to PR 56456 and
other specific instances of this warning being bogus that have been
filed). Not sure what to do about this. But I am guessing it will not
be triggered once I rework the patch in any case...


>
> Thinking about the whole problem, I think the following sounds like a possible
> solution:
>
> 1) use the patch above to get frequencies computed where counts are 0.

Do you mean my original patch, where I am doing this via new function
init_and_estimate_bb_frequencies called from branch_prob()? I assume
so - that will give us frequencies on the branches but no counts yet,
and leave the profile status as READ.

>    Make sure that branch probabilities are guessed for functions with non-0
>    counts, for each branch that has count 0.

Not sure about the second sentence above. Do you mean also guessing
probabilities within functions that have other non-zero counts in
them? Or do you mean those reached via non-zero cgraph call counts?
The init_and_estimate_bb_frequencies routine will only guess
frequencies for routines whose counts are all 0. I thought that the
probabilities on branches that have count 0 would be 0, which seems
correct for functions with other non-zero counts?

> 2) At the point we see an inconsistency in the IPA profile (i.e. we find
>    that we have direct calls with non-0 counts to a comdat with 0 count),
>    drop the profile_status for them to PROFILE_GUESSED and let the backend
>    optimize them as if there was no FDO.
>    I do not think we can do better for those.

I assume you mean sometime later than where we are currently doing
this (when reading counts). Maybe during ipa-profile. That way it can
be done iteratively, to handle the case where a 0-count comdat calls
another 0-count comdat and both are going to be reached via a
non-zero call further up.

>
>    If we do not see this problem, we can keep status as READ so the function
>    will get size optimized.  Here we will lose for indirect calls, but I
>    do not see how to handle these short of making every 0 count COMDAT guessed.
>    Perhaps you can try how these variants work for you performance/code size
>    wise.

Yes, I think you are right. We will lose a bit on the out-of-line
copy, but still should get this handled properly when inlined due to
3) below.

> 3) at a time we inline function with guessed profile into function with read
>    profile, we will use your trick to feed the counts in based on frequencies.
>
>    We will do that in tree-inline when inlining. I.e. we will never feed fake
>    counts into non-inline functions

Right, so invoke a new routine like the following when inlining,
before we invoke copy_body, etc., which scales the counts. The count
used can be the current sum of incoming cgraph edge counts
(presumably non-zero at this point, since we decided to inline):

+/* Convert estimated frequencies into counts.  */
+
+void
+freqs_to_counts (struct cgraph_node *node, gcov_type count)
+{
+  basic_block bb;
+  edge_iterator ei;
+  edge e;
+  struct function *fn = DECL_STRUCT_FUNCTION (node->symbol.decl);
+
+  FOR_ALL_BB_FN(bb, fn)
+  {
+    bb->count = apply_scale (count,
+                             GCOV_COMPUTE_SCALE (bb->frequency, BB_FREQ_MAX));
+    FOR_EACH_EDGE (e, ei, bb->succs)
+      e->count = apply_probability (e->src->count, e->probability);
+  }
+}
+

> 4) we will want to sanitize the callgraph, too.  This means looking for GUESSED
>    profile functions and feeding in their counts/edge counts based on the sum of
>    the incoming counts.

I assume here the sum of the incoming counts to use for the cgraph
edge is obtained from the bb counts of calls to those functions?

>
>    We can do this as part of the ipa-profile pass (I will move it to a separate file
>    as it is getting more complex).  Here we will have the propagation issue you
>    mention above.  Either we can ignore it or we can iterate.
>
> 3) and 4) will absolutely need some capping to be sure we won't make the
> synthesized counts bigger than other counts in the program.

How do we end up with a synthesized count that is too big? I.e. for 3
we would use the sum of the incoming cgraph edges.

Thanks,
Teresa

>
> I think we have code for 1-3. If the plan looks sane to you, I think we can
> start getting it in?
>
> Of course we want to eliminate most of the problems by getting the runtime to
> do the merging... (but the merging will not always be possible due to
> cfg differences coming from different optimization flags, anyway).
>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
