public inbox for gcc-patches@gcc.gnu.org
* [PATCH] ipa-inline: Improve growth accumulation for recursive calls
@ 2020-08-12 10:19 Xionghu Luo
  2020-08-12 17:53 ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: Xionghu Luo @ 2020-08-12 10:19 UTC (permalink / raw)
  To: gcc-patches
  Cc: segher, dje.gcc, wschmidt, guojiufu, linkw, hubicka, mliska,
	mjambor, fxue, Xiong Hu Luo

From: Xiong Hu Luo <luoxhu@linux.ibm.com>

For SPEC2017 exchange2, the large recursive function digits_2 (function
size 1300) generates specialized nodes digits_2.1 through digits_2.8 with
the added build options:

--param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80

The ipa-inline pass will consider inlining these nodes since each is
called only once, but inlining these large functions too deeply causes
serious register spilling and the performance drops shown below.

inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8

Performance diff:
inlineB is ~25% faster than inlineA;
inlineC is ~20% faster than inlineB;
inlineD is ~30% faster than inlineC.

Current GCC master generates an inline sequence like inlineB; this patch
makes the ipa-inline pass behave like inlineD by:
 1) Accumulating growth for recursive calls: add the growth data to the
edge when the edge's caller is inlined into another function, to avoid
inlining too deeply;
 2) Treating an edge as recursive also when its caller and callee are
both specialized from the same node.

SPEC2017 testing shows a GEOMEAN improvement of +2.75% in total (+0.56%
without exchange2).
Any comments?  Thanks.

523.xalancbmk_r +1.32%
541.leela_r     +1.51%
548.exchange2_r +31.87%
507.cactuBSSN_r +0.80%
526.blender_r   +1.25%
538.imagick_r   +1.82%

gcc/ChangeLog:

2020-08-12  Xionghu Luo  <luoxhu@linux.ibm.com>

	* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
	callee are specialized from the same node.
	* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
	inlined_to growth to edge whose caller is inlined.
---
 gcc/cgraph.h              | 2 ++
 gcc/ipa-inline-analysis.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 0211f08964f..11903ac1960 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
   cgraph_node *c = callee->ultimate_alias_target ();
   if (caller->inlined_to)
     return caller->inlined_to->decl == c->decl;
+  else if (caller->clone_of && c->clone_of)
+    return caller->clone_of->decl == c->clone_of->decl;
   else
     return caller->decl == c->decl;
 }
diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
index 148efbc09ef..ba0cf836364 100644
--- a/gcc/ipa-inline-analysis.c
+++ b/gcc/ipa-inline-analysis.c
@@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
 	  continue;
 	}
       d->growth += estimate_edge_growth (e);
+      if (e->caller->inlined_to)
+	d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
+
       if (d->growth > d->cap)
 	return true;
     }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-12 10:19 [PATCH] ipa-inline: Improve growth accumulation for recursive calls Xionghu Luo
@ 2020-08-12 17:53 ` Jan Hubicka
  2020-08-12 19:03   ` Richard Biener
                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jan Hubicka @ 2020-08-12 17:53 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches, segher, wschmidt, linkw, dje.gcc

Hello,
with Martin we spent some time looking into exchange2 and my
understanding of the problem is the following:

There is the self-recursive function digits_2 with the property that it
has 10 nested loops and calls itself from the innermost one.
Now we do not do an amazing job of guessing the profile since it is quite
atypical. The first observation is that the recursive call frequency
needs to be less than 1, otherwise the program never terminates; however,
with 10 nested loops one needs to predict every loop to iterate just a
few times and the conditionals guarding them as not very likely. For that
we added PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it
yesterday (it caused a regression in exchange2 since the bad profile
turned out to disable some harmful vectorization), and I also now added a
cap to the self-recursive frequency so things do not get mispropagated by
ipa-cp.

Now if ipa-cp decides to duplicate digits_2 a few times we have a new
problem.  The tree of recursion is organized in a way that the depth is
bounded by 10 (which GCC does not know) and moreover most of the time is
not spent at very deep levels of the recursion.

For that you have the patch which increases frequencies of recursively
cloned nodes, however it still seems to me a very specific hack for
exchange2: I do not see how to guess where most of the time is spent.
Even for very regular trees, by the master theorem, very small differences
in the estimates of the recursion frequency determine whether most of the
time is spent at the top of the tree, at the bottom, or balanced between
the two.

With backtracking algorithms like exchange2, the likelihood of recursion
decreases with deeper recursion levels, but we do not know how quickly,
nor what the maximum level is.

> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> 
> For SPEC2017 exchange2, there is a large recursive functiondigits_2(function
> size 1300) generates specialized node from digits_2.1 to digits_2.8 with added
> build option:
> 
> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> 
> ipa-inline pass will consider inline these nodes called only once, but these
> large functions inlined too deeply will cause serious register spill and
> performance down as followed.
> 
> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8
> 
> Performance diff:
> inlineB is ~25% faster than inlineA;
> inlineC is ~20% faster than inlineB;
> inlineD is ~30% faster than inlineC.
> 
> The master GCC code now generates inline sequence like inlineB, this patch
> makes the ipa-inline pass behavior like inlineD by:
>  1) The growth acumulation for recursive calls by adding the growth data
> to the edge when edge's caller is inlined into another function to avoid
> inline too deeply;
>  2) And if caller and callee are both specialized from same node, the edge
> should also be considered as recursive edge.
> 
> SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without exchange2).
> Any comments?  Thanks.
> 
> 523.xalancbmk_r +1.32%
> 541.leela_r     +1.51%
> 548.exchange2_r +31.87%
> 507.cactuBSSN_r +0.80%
> 526.blender_r   +1.25%
> 538.imagick_r   +1.82%
> 
> gcc/ChangeLog:
> 
> 2020-08-12  Xionghu Luo  <luoxhu@linux.ibm.com>
> 
> 	* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
> 	callee and specialized from same node.
> 	* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
> 	inlined_to growth to edge whose caller is inlined.
> ---
>  gcc/cgraph.h              | 2 ++
>  gcc/ipa-inline-analysis.c | 3 +++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> index 0211f08964f..11903ac1960 100644
> --- a/gcc/cgraph.h
> +++ b/gcc/cgraph.h
> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>    cgraph_node *c = callee->ultimate_alias_target ();
>    if (caller->inlined_to)
>      return caller->inlined_to->decl == c->decl;
> +  else if (caller->clone_of && c->clone_of)
> +    return caller->clone_of->decl == c->clone_of->decl;
>    else
>      return caller->decl == c->decl;

If you clone the function so it is no longer self-recursive, it does not
make much sense to lie to the optimizers that the function is still
recursive.

The inlining would be harmful even if the programmer did the cloning by
hand.  I guess the main problem is the extreme register pressure from
combining a loop depth of 10 in the caller with a loop depth of 10 in the
callee just because the function is called once.

The negative effect is most likely also due to the wrong profile estimate,
which drives IRA to optimize the wrong spot.  But I wonder if we simply
want to teach inlining of functions called once not to construct large
loop depths?  Something like: do not inline if the combined caller &
callee loop depth is over 3 or so?
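As a rough sketch of what I mean (hypothetical code, not a proposed patch;
the function name, both parameters, and the cap are made up, and real GCC
would have to pull the depths out of its IPA summaries):

```c
/* Hypothetical gate: refuse to inline a function called once when the
   loop depth at the call site plus the callee's own maximum loop depth
   exceeds a small cap.  */

#define MAX_COMBINED_LOOP_DEPTH 3

static int
want_inline_function_called_once_p (int call_site_loop_depth,
                                    int callee_max_loop_depth)
{
  /* Inlining a digits_2-style callee (loop depth 10) into a depth-10
     call site would construct a loop nest of depth 20; reject that.  */
  if (call_site_loop_depth + callee_max_loop_depth
      > MAX_COMBINED_LOOP_DEPTH)
    return 0;
  return 1;
}
```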

Honza
>  }
> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
> index 148efbc09ef..ba0cf836364 100644
> --- a/gcc/ipa-inline-analysis.c
> +++ b/gcc/ipa-inline-analysis.c
> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
>  	  continue;
>  	}
>        d->growth += estimate_edge_growth (e);
> +      if (e->caller->inlined_to)
> +	d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
> +
>        if (d->growth > d->cap)
>  	return true;
>      }
> -- 
> 2.25.1
> 


* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-12 17:53 ` Jan Hubicka
@ 2020-08-12 19:03   ` Richard Biener
  2020-08-13  0:29     ` Segher Boessenkool
  2020-08-13  3:59   ` Feng Xue OS
  2020-08-13  6:51   ` luoxhu
  2 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2020-08-12 19:03 UTC (permalink / raw)
  To: gcc-patches, Jan Hubicka, Xionghu Luo; +Cc: wschmidt, dje.gcc, linkw, segher

On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka <hubicka@ucw.cz> wrote:
>Hello,
>with Martin we spent some time looking into exchange2 and my
>understanding of the problem is the following:
>
>There is the self recursive function digits_2 with the property that it
>has 10 nested loops and calls itself from the innermost.
>Now we do not do amazing job on guessing the profile since it is quite
>atypical. First observation is that the callback frequencly needs to be
>less than 1 otherwise the program never terminates, however with 10
>nested loops one needs to predict every loop to iterate just few times
>and conditionals guarding them as not very likely. For that we added
>PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
>(causing regression in exhange since the bad profile turned out to
>disable some harmful vectorization) and I also now added a cap to the
>self recursive frequency so things to not get mispropagated by ipa-cp.
>
>Now if ipa-cp decides to duplicate digits few times we have a new
>problem.  The tree of recursion is orgnaized in a way that the depth is
>bounded by 10 (which GCC does not know) and moreover most time is not
>spent on very deep levels of recursion.
>
>For that you have the patch which increases frequencies of recursively
>cloned nodes, however it still seems to me as very specific hack for
>exchange: I do not see how to guess where most of time is spent. 
>Even for very regular trees, by master theorem, it depends on very
>little differences in the estimates of recursion frequency whether most
>of time is spent on the top of tree, bottom or things are balanced.
>
>With algorithms doing backtracing, like exhchange, the likelyness of
>recusion reduces with deeper recursion level, but we do not know how
>quickly and what the level is.
>
>> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>> 
>> For SPEC2017 exchange2, there is a large recursive
>functiondigits_2(function
>> size 1300) generates specialized node from digits_2.1 to digits_2.8
>with added
>> build option:
>> 
>> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>> 
>> ipa-inline pass will consider inline these nodes called only once,
>but these
>> large functions inlined too deeply will cause serious register spill
>and
>> performance down as followed.
>> 
>> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5
>(inline 2.6, 2.7, 2.8)
>> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4
>(inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) ->
>2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
>> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4
>-> 2.5 -> 2.6 -> 2.7 -> 2.8
>> 
>> Performance diff:
>> inlineB is ~25% faster than inlineA;
>> inlineC is ~20% faster than inlineB;
>> inlineD is ~30% faster than inlineC.
>> 
>> The master GCC code now generates inline sequence like inlineB, this
>patch
>> makes the ipa-inline pass behavior like inlineD by:
>>  1) The growth acumulation for recursive calls by adding the growth
>data
>> to the edge when edge's caller is inlined into another function to
>avoid
>> inline too deeply;
>>  2) And if caller and callee are both specialized from same node, the
>edge
>> should also be considered as recursive edge.
>> 
>> SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without
>exchange2).
>> Any comments?  Thanks.
>> 
>> 523.xalancbmk_r +1.32%
>> 541.leela_r     +1.51%
>> 548.exchange2_r +31.87%
>> 507.cactuBSSN_r +0.80%
>> 526.blender_r   +1.25%
>> 538.imagick_r   +1.82%
>> 
>> gcc/ChangeLog:
>> 
>> 2020-08-12  Xionghu Luo  <luoxhu@linux.ibm.com>
>> 
>> 	* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>> 	callee and specialized from same node.
>> 	* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>> 	inlined_to growth to edge whose caller is inlined.
>> ---
>>  gcc/cgraph.h              | 2 ++
>>  gcc/ipa-inline-analysis.c | 3 +++
>>  2 files changed, 5 insertions(+)
>> 
>> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>> index 0211f08964f..11903ac1960 100644
>> --- a/gcc/cgraph.h
>> +++ b/gcc/cgraph.h
>> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>>    cgraph_node *c = callee->ultimate_alias_target ();
>>    if (caller->inlined_to)
>>      return caller->inlined_to->decl == c->decl;
>> +  else if (caller->clone_of && c->clone_of)
>> +    return caller->clone_of->decl == c->clone_of->decl;
>>    else
>>      return caller->decl == c->decl;
>
>If you clone the function so it is no longer self recursive, it does
>not
>make much sense to lie to optimizers that the function is still
>recursive.
>
>The inlining would be harmful even if the programer did cloning by
>hand.
>I guess main problem is the extreme register pressure issue combining
>loop depth of 10 in caller with loop depth of 10 in callee just because
>the function is called once.
>
>The negative effect is most likely also due to wrong profile estimate
>which drives IRA to optimize wrong spot.  But I wonder if we simply
>don't want to teach inlining function called once to not construct
>large
>loop depths?  Something like do not inline if caller&callee loop depth
>is over 3 or so?

I don't think that's good by itself (consider leaf functions and the x86 xmm register ABI across calls). Even with a large loop depth, abstraction penalty removal can make inlining worthwhile. For the testcase, the recursiveness is what looks special (recursion from a deeper loop nest level).

Richard. 

>Honza
>>  }
>> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
>> index 148efbc09ef..ba0cf836364 100644
>> --- a/gcc/ipa-inline-analysis.c
>> +++ b/gcc/ipa-inline-analysis.c
>> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node,
>void *data)
>>  	  continue;
>>  	}
>>        d->growth += estimate_edge_growth (e);
>> +      if (e->caller->inlined_to)
>> +	d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
>> +
>>        if (d->growth > d->cap)
>>  	return true;
>>      }
>> -- 
>> 2.25.1
>> 



* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-12 19:03   ` Richard Biener
@ 2020-08-13  0:29     ` Segher Boessenkool
  2020-08-13  7:53       ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: Segher Boessenkool @ 2020-08-13  0:29 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, Jan Hubicka, Xionghu Luo, wschmidt, dje.gcc, linkw

Hi!

On Wed, Aug 12, 2020 at 09:03:35PM +0200, Richard Biener wrote:
> On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> >> 523.xalancbmk_r +1.32%
> >> 541.leela_r     +1.51%
> >> 548.exchange2_r +31.87%
> >> 507.cactuBSSN_r +0.80%
> >> 526.blender_r   +1.25%
> >> 538.imagick_r   +1.82%

> >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> >> index 0211f08964f..11903ac1960 100644
> >> --- a/gcc/cgraph.h
> >> +++ b/gcc/cgraph.h
> >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
> >>    cgraph_node *c = callee->ultimate_alias_target ();
> >>    if (caller->inlined_to)
> >>      return caller->inlined_to->decl == c->decl;
> >> +  else if (caller->clone_of && c->clone_of)
> >> +    return caller->clone_of->decl == c->clone_of->decl;
> >>    else
> >>      return caller->decl == c->decl;
> >
> >If you clone the function so it is no longer self recursive, it does
> >not
> >make much sense to lie to optimizers that the function is still
> >recursive.

Like Richard says below (if I understand him right, sorry if not), the
function still *is* recursive in its group of clones.

> >The inlining would be harmful even if the programer did cloning by
> >hand.
> >I guess main problem is the extreme register pressure issue combining
> >loop depth of 10 in caller with loop depth of 10 in callee just because
> >the function is called once.
> >
> >The negative effect is most likely also due to wrong profile estimate
> >which drives IRA to optimize wrong spot.  But I wonder if we simply
> >don't want to teach inlining function called once to not construct
> >large
> >loop depths?  Something like do not inline if caller&callee loop depth
> >is over 3 or so?
> 
> I don't think that's good by itself (consider leaf functions and x86 xmm reg ABI across calls). Even with large loop depth abstraction penalty removal can make inlining worth it. For the testcase the recursiveness is what looks special (recursion from a deeper loop nest level). 

Yes, the loop stuff / register pressure issues might help for the
exchange result, but what about the other five above?


Segher


* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-12 17:53 ` Jan Hubicka
  2020-08-12 19:03   ` Richard Biener
@ 2020-08-13  3:59   ` Feng Xue OS
  2020-08-13  6:51   ` luoxhu
  2 siblings, 0 replies; 21+ messages in thread
From: Feng Xue OS @ 2020-08-13  3:59 UTC (permalink / raw)
  To: Jan Hubicka, Xionghu Luo; +Cc: wschmidt, dje.gcc, gcc-patches, linkw, segher

> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
> 
> There is the self recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do amazing job on guessing the profile since it is quite
> atypical. First observation is that the callback frequencly needs to be
> less than 1 otherwise the program never terminates, however with 10
> nested loops one needs to predict every loop to iterate just few times
> and conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
> (causing regression in exhange since the bad profile turned out to
> disable some harmful vectorization) and I also now added a cap to the
> self recursive frequency so things to not get mispropagated by ipa-cp.

With the default setting of PRED_LOOP_GUARD_WITH_RECURSION, static
profile estimation for exchange2 is far from accurate: the hottest
recursive function is predicted as infrequent.  However, this low
execution estimate works fine with IRA.  I've tried to tweak the
likelihood of the predictor, same as you, and performance degraded as the
estimated profile increased.  This regression was also found to be
correlated with IRA, which produces many more register spills than with
the default.  In the presence of deep loops and high register pressure,
IRA is more sensitive to profile estimation, and this exhibits an
unwanted property of the current IRA algorithm.  I've described it in a
tracker (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174).

Feng

> 
> Now if ipa-cp decides to duplicate digits few times we have a new
> problem.  The tree of recursion is orgnaized in a way that the depth is
> bounded by 10 (which GCC does not know) and moreover most time is not
> spent on very deep levels of recursion.
> 
> For that you have the patch which increases frequencies of recursively
> cloned nodes, however it still seems to me as very specific hack for
> exchange: I do not see how to guess where most of time is spent.
> Even for very regular trees, by master theorem, it depends on very
> little differences in the estimates of recursion frequency whether most
> of time is spent on the top of tree, bottom or things are balanced.
> 
> With algorithms doing backtracing, like exhchange, the likelyness of
> recusion reduces with deeper recursion level, but we do not know how
> quickly and what the level is.
> 
>> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>> 
>>  For SPEC2017 exchange2, there is a large recursive functiondigits_2(function
>>  size 1300) generates specialized node from digits_2.1 to digits_2.8 with added
>>  build option:
>> 
>>  --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>> 
>>  ipa-inline pass will consider inline these nodes called only once, but these
>>  large functions inlined too deeply will cause serious register spill and
>>  performance down as followed.
>> 
>>  inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
>>  inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>>  inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
>>  inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8
>> 
>>  Performance diff:
>>  inlineB is ~25% faster than inlineA;
>>  inlineC is ~20% faster than inlineB;
>>  inlineD is ~30% faster than inlineC.
>> 
>>  The master GCC code now generates inline sequence like inlineB, this patch
>>  makes the ipa-inline pass behavior like inlineD by:
>>   1) The growth acumulation for recursive calls by adding the growth data
>>  to the edge when edge's caller is inlined into another function to avoid
>>  inline too deeply;
>>   2) And if caller and callee are both specialized from same node, the edge
>>  should also be considered as recursive edge.
>> 
>>  SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without exchange2).
>>  Any comments?  Thanks.
>> 
>>  523.xalancbmk_r +1.32%
>>  541.leela_r     +1.51%
>>  548.exchange2_r +31.87%
>>  507.cactuBSSN_r +0.80%
>>  526.blender_r   +1.25%
>>  538.imagick_r   +1.82%
>> 
>>  gcc/ChangeLog:
>> 
>>  2020-08-12  Xionghu Luo  <luoxhu@linux.ibm.com>
>> 
>>        * cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>>        callee and specialized from same node.
>>        * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>>        inlined_to growth to edge whose caller is inlined.
>>  ---
>>   gcc/cgraph.h              | 2 ++
>>   gcc/ipa-inline-analysis.c | 3 +++
>>   2 files changed, 5 insertions(+)
>> 
>>  diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>>  index 0211f08964f..11903ac1960 100644
>>  --- a/gcc/cgraph.h
>>  +++ b/gcc/cgraph.h
>>  @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>>     cgraph_node *c = callee->ultimate_alias_target ();
>>     if (caller->inlined_to)
>>       return caller->inlined_to->decl == c->decl;
>>  +  else if (caller->clone_of && c->clone_of)
>>  +    return caller->clone_of->decl == c->clone_of->decl;
>>     else
>>       return caller->decl == c->decl;
> 
> If you clone the function so it is no longer self recursive, it does not
> make much sense to lie to optimizers that the function is still
> recursive.
> 
> The inlining would be harmful even if the programer did cloning by hand.
> I guess main problem is the extreme register pressure issue combining
> loop depth of 10 in caller with loop depth of 10 in callee just because
> the function is called once.
> 
> The negative effect is most likely also due to wrong profile estimate
> which drives IRA to optimize wrong spot.  But I wonder if we simply
> don't want to teach inlining function called once to not construct large
> loop depths?  Something like do not inline if caller&callee loop depth
> is over 3 or so?
> 
> Honza
>>   }
>>  diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
>>  index 148efbc09ef..ba0cf836364 100644
>>  --- a/gcc/ipa-inline-analysis.c
>>  +++ b/gcc/ipa-inline-analysis.c
>>  @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
>>          continue;
>>        }
>>         d->growth += estimate_edge_growth (e);
>>  +      if (e->caller->inlined_to)
>>  +     d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
>>  +
>>         if (d->growth > d->cap)
>>        return true;
>>       }
>>  --
>>  2.25.1
>> 


* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-12 17:53 ` Jan Hubicka
  2020-08-12 19:03   ` Richard Biener
  2020-08-13  3:59   ` Feng Xue OS
@ 2020-08-13  6:51   ` luoxhu
  2020-08-13 12:52     ` Jan Hubicka
  2 siblings, 1 reply; 21+ messages in thread
From: luoxhu @ 2020-08-13  6:51 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, segher, wschmidt, linkw, dje.gcc

Hi,

On 2020/8/13 01:53, Jan Hubicka wrote:
> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
> 
> There is the self recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do amazing job on guessing the profile since it is quite
> atypical. First observation is that the callback frequencly needs to be
> less than 1 otherwise the program never terminates, however with 10
> nested loops one needs to predict every loop to iterate just few times
> and conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
> (causing regression in exhange since the bad profile turned out to
> disable some harmful vectorization) and I also now added a cap to the
> self recursive frequency so things to not get mispropagated by ipa-cp.

Thanks for the information :)  Tamar replied that there is another
regression *of 11% on exchange2*.  I've also rebased my code and
confirmed it really gets even slower than before (reverting the patch
pulls the performance back)...

> 
> Now if ipa-cp decides to duplicate digits few times we have a new
> problem.  The tree of recursion is orgnaized in a way that the depth is
> bounded by 10 (which GCC does not know) and moreover most time is not
> spent on very deep levels of recursion.
> 
> For that you have the patch which increases frequencies of recursively
> cloned nodes, however it still seems to me as very specific hack for
> exchange: I do not see how to guess where most of time is spent.
> Even for very regular trees, by master theorem, it depends on very
> little differences in the estimates of recursion frequency whether most
> of time is spent on the top of tree, bottom or things are balanced.

The build is not PGO, so I am not clear how the profile count will affect
the ipa-cp and ipa-inline decisions.
Since there are no other callers outside of these specialized nodes, the
guessed profile counts should all be equal?  The perf tool shows that
even though each specialized node is called only once, they do not take
the same time per call:

  40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.4
  16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.3
  10.91%  exchange2_gcc.o  libgfortran.so.5.0.0     [.] _gfortran_mminloc0_4_i4
   5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.6
   4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
   3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.5
   1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.7
   0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.0
   0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.2
   0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.1
   0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_covered.constprop.0
   0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_reflected.constprop.0
   0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.1


digits_2.constprop.4 & digits_2.constprop.3 take most of the execution
time, so profile count and frequency seem not very helpful for this case?

> 
> With algorithms doing backtracing, like exhchange, the likelyness of
> recusion reduces with deeper recursion level, but we do not know how
> quickly and what the level is.


> 
>> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>>
>> For SPEC2017 exchange2, there is a large recursive functiondigits_2(function
>> size 1300) generates specialized node from digits_2.1 to digits_2.8 with added
>> build option:
>>
>> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>>
>> ipa-inline pass will consider inline these nodes called only once, but these
>> large functions inlined too deeply will cause serious register spill and
>> performance down as followed.
>>
>> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
>> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
>> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8
>>
>> Performance diff:
>> inlineB is ~25% faster than inlineA;
>> inlineC is ~20% faster than inlineB;
>> inlineD is ~30% faster than inlineC.
>>
>> The master GCC code now generates inline sequence like inlineB, this patch
>> makes the ipa-inline pass behavior like inlineD by:
>>   1) The growth acumulation for recursive calls by adding the growth data
>> to the edge when edge's caller is inlined into another function to avoid
>> inline too deeply;
>>   2) And if caller and callee are both specialized from same node, the edge
>> should also be considered as recursive edge.
>>
>> SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without exchange2).
>> Any comments?  Thanks.
>>
>> 523.xalancbmk_r +1.32%
>> 541.leela_r     +1.51%
>> 548.exchange2_r +31.87%
>> 507.cactuBSSN_r +0.80%
>> 526.blender_r   +1.25%
>> 538.imagick_r   +1.82%
>>
>> gcc/ChangeLog:
>>
>> 2020-08-12  Xionghu Luo  <luoxhu@linux.ibm.com>
>>
>> 	* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>> 	callee and specialized from same node.
>> 	* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>> 	inlined_to growth to edge whose caller is inlined.
>> ---
>>   gcc/cgraph.h              | 2 ++
>>   gcc/ipa-inline-analysis.c | 3 +++
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>> index 0211f08964f..11903ac1960 100644
>> --- a/gcc/cgraph.h
>> +++ b/gcc/cgraph.h
>> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>>     cgraph_node *c = callee->ultimate_alias_target ();
>>     if (caller->inlined_to)
>>       return caller->inlined_to->decl == c->decl;
>> +  else if (caller->clone_of && c->clone_of)
>> +    return caller->clone_of->decl == c->clone_of->decl;
>>     else
>>       return caller->decl == c->decl;
> 
> If you clone the function so it is no longer self recursive, it does not
> make much sense to lie to optimizers that the function is still
> recursive.
> 
> The inlining would be harmful even if the programmer did the cloning by hand.
> I guess the main problem is the extreme register pressure issue of combining
> loop depth of 10 in the caller with loop depth of 10 in the callee just
> because the function is called once.
> 
> The negative effect is most likely also due to a wrong profile estimate
> which drives IRA to optimize the wrong spot.  But I wonder whether we simply
> want to teach the inlining of functions called once not to construct large
> loop depths?  Something like: do not inline if caller & callee loop depth
> is over 3 or so?

Will try this.  Though the test results show that no inlining is better than
inlining one, which is better than inlining two or more, here.  Moreover,
sorry that I am also not quite sure whether Richard Biener is positive
about this or not...

> 
> Honza
>>   }
>> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
>> index 148efbc09ef..ba0cf836364 100644
>> --- a/gcc/ipa-inline-analysis.c
>> +++ b/gcc/ipa-inline-analysis.c
>> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
>>   	  continue;
>>   	}
>>         d->growth += estimate_edge_growth (e);
>> +      if (e->caller->inlined_to)
>> +	d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
>> +

What about this change - is it reasonable to add the inlined_to caller's
growth to the current edge?  It is somewhat independent of the previous code.


Thanks,
Xionghu 

>>         if (d->growth > d->cap)
>>   	return true;
>>       }
>> -- 
>> 2.25.1
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-13  0:29     ` Segher Boessenkool
@ 2020-08-13  7:53       ` Jan Hubicka
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2020-08-13  7:53 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, Xionghu Luo, wschmidt, linkw, gcc-patches, dje.gcc

> Hi!
> 
> On Wed, Aug 12, 2020 at 09:03:35PM +0200, Richard Biener wrote:
> > On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka <hubicka@ucw.cz> wrote:
> > >> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> > >> 523.xalancbmk_r +1.32%
> > >> 541.leela_r     +1.51%
> > >> 548.exchange2_r +31.87%
> > >> 507.cactuBSSN_r +0.80%
> > >> 526.blender_r   +1.25%
> > >> 538.imagick_r   +1.82%
> 
> > >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> > >> index 0211f08964f..11903ac1960 100644
> > >> --- a/gcc/cgraph.h
> > >> +++ b/gcc/cgraph.h
> > >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
> > >>    cgraph_node *c = callee->ultimate_alias_target ();
> > >>    if (caller->inlined_to)
> > >>      return caller->inlined_to->decl == c->decl;
> > >> +  else if (caller->clone_of && c->clone_of)
> > >> +    return caller->clone_of->decl == c->clone_of->decl;
> > >>    else
> > >>      return caller->decl == c->decl;
> > >
> > >If you clone the function so it is no longer self recursive, it does
> > >not
> > >make much sense to lie to optimizers that the function is still
> > >recursive.
> 
> Like Richard says below (if I understand him right, sorry if not), the
> function still *is* recursive in its group of clones.

The test above is not a heuristic.  Its purpose is to determine when the
offline body will be eliminated after inlining the call.  This happens
when:
 1) this is the last call to the function;
 2) the function is not used otherwise (exported or visible);
 3) the call is not self-recursive.
In case 3 the problem is that inlining will introduce a new call to the
function itself, so the offline copy will still be needed.

Here we see a chain of clones calling
 clone1->clone2->clone3->clone4...->clone9->clone1
Inlining the clone2->clone3 call will eliminate the offline copy of clone3
and will reduce code size.

The inliner has an analysis which intends to model code size/time accurately,
and a heuristics part.  The patch makes the size/time model produce wrong
results, while this needs to be solved in the heuristics part.

Note that with a bit of optimization we should be able to eliminate the call
clone9->clone1 because it is dead (I believe we don't do that only
because the recursion limit is set one off).  This, however, does not make
inlining clone2->clone3 any better an idea.  The same is true if the user
wrote the clones explicitly.
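To make the hand-written-clones case concrete, here is a minimal standalone sketch (function names invented for illustration): three static functions form the same kind of ring as clone1->...->clone9->clone1, so inlining any one call edge lets the compiler drop that callee's offline body - a size win by the inliner's model - yet the combined body is still part of a recursive ring.

```c
#include <assert.h>

static int level1 (int n);

/* Deepest "clone": closes the cycle by calling back to level1,
   just like the clone9->clone1 edge in the chain above.  */
static int
level3 (int n)
{
  return n <= 0 ? 0 : 1 + level1 (n - 1);
}

/* level2's only caller is level1, so inlining the level1->level2
   call eliminates level2's offline copy (conditions 1-3 above all
   hold for that edge), but the recursion through level3 remains.  */
static int
level2 (int n)
{
  return n <= 0 ? 0 : 1 + level3 (n - 1);
}

static int
level1 (int n)
{
  return n <= 0 ? 0 : 1 + level2 (n - 1);
}

int
chain_depth (int n)
{
  return level1 (n);
}
```

The size/time model correctly reports such an inline as shrinking the program; the harm described in the thread comes from the heuristics side (register pressure of the flattened ring), not from the size accounting.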


> 
> > >The inlining would be harmful even if the programmer did cloning by
> > >hand.
> > >I guess main problem is the extreme register pressure issue combining
> > >loop depth of 10 in caller with loop depth of 10 in callee just because
> > >the function is called once.
> > >
> > >The negative effect is most likely also due to wrong profile estimate
> > >which drives IRA to optimize wrong spot.  But I wonder if we simply
> > >don't want to teach inlining function called once to not construct
> > >large
> > >loop depths?  Something like do not inline if caller&callee loop depth
> > >is over 3 or so?
> > 
> > I don't think that's good by itself (consider leaf functions and x86 xmm reg ABI across calls). Even with large loop depth abstraction penalty removal can make inlining worth it. For the testcase the recursiveness is what looks special (recursion from a deeper loop nest level). 
> 
> Yes, the loop stuff / register pressure issues might help for the
> exchange result, but what about the other five above?

I think such large loop nest (20 if you combine caller+callee) rarely
happens in practice, so it may be good heuristics. I guess it depends
how much exchange2 only hacks we want in compiler.  We should aim to
make changes that helps other codebases too.

The ipa-cp recursive clonning machinery IMO has good potential to help
other code which has fixed cap on depth of recursion.
Could consider following code:

int array[1000];

static int
fact (int a)
{
  int ret = 1;
  if (a > 1)
    ret = fact (a - 1) * a;
  for (int i = 0; i < a; i++)
    array[i + (a - 1) * (a - 1) / 2] = 0;
  return ret;
}

int
test (void)
{
  return fact (10);
}

This computes a factorial and at the same time clears memory (just to
avoid the tail call happening and also to make the function a bit bigger).

Here it is a win to produce all 10 clones, inline them together,
precompute the factorial and combine the memsets.

We have a few problems visible on this testcase:
 1) ipa-cp needs a reduction of its evaluation thresholds;
 2) the estimated profile is bogus (since we estimate the recursion with
 probability 1/3, after 10 levels of recursion the inner factorial
 gets probability almost 0);
 3) the inliner will still happily flatten the function, not noticing that
 it has to be cold.  This is because it does that pairwise and never sees
 the combined probability.

Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-13  6:51   ` luoxhu
@ 2020-08-13 12:52     ` Jan Hubicka
  2020-08-14  9:04       ` luoxhu
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2020-08-13 12:52 UTC (permalink / raw)
  To: luoxhu; +Cc: gcc-patches, segher, wschmidt, linkw, dje.gcc

> 
> Thanks for the information :)  Tamar replied that there is another
> regression on exchange2 of 11%.  I've also rebased my code and confirmed
> it really gets even slower than before (reverting the patch pulls the
> performance back)...

Yep, we need to figure out how to fix this - the summary on IRA issue is
interesting.  I was aware of it, but never really looked into place what
IRA does wrong.

Basically what happened historically is that when exchange2 was newly
added to SPEC, Martin and I looked at it and noticed the issue with the
loop nest being predicted very aggressively toward the innermost loop,
which led the loop optimizer to do funny things on loops that really
must not iterate too many times, since we know that the frequency of the
recursive call is strictly less than 1.

We spent some time tuning the inliner for low-trip-count loops and also
added the LOOP_GUARD_WITH_RECURSION heuristic, which was meant to reduce
the probability of entering the loop that contains the recursive call - this
should be a pattern in all similar backtrack-like algorithms.
The conditions terminating the walk should be likely or the program
would never finish.
> 
> > 
> > Now if ipa-cp decides to duplicate digits a few times we have a new
> > problem.  The tree of recursion is organized in a way that the depth is
> > bounded by 10 (which GCC does not know) and moreover most time is not
> > spent on the very deep levels of recursion.
> > 
> > For that you have the patch which increases the frequencies of recursively
> > cloned nodes, however it still seems to me like a very specific hack for
> > exchange: I do not see how to guess where most of the time is spent.
> > Even for very regular trees, by the master theorem, it depends on very
> > small differences in the estimates of recursion frequency whether most
> > of the time is spent at the top of the tree, at the bottom, or balanced.
> 
> The build is not PGO, so I am not clear how the profile count will affect
> the ipa-cp and ipa-inline decisions.

Even without PGO the counts are used.  predict.c first estimates the
branch probabilities, and these are then propagated to estimated counts
of basic blocks, which are used throughout the compiler to drive
optimization decisions (so ipa-inline computes the estimated runtime
effects of the inlining, all weighted by the counts; ipa-cp does
similarly).

What we ended up with was a bug in the patch adding LOOP_GUARD_WITH_RECURSION
which resulted in the loop guards in exchange being predicted by both
LOOP_GUARD_WITH_RECURSION and LOOP_GUARD.
Since the first claims an 85% chance that the loop will not be entered and
the second 75%, the combined outcome got over 90% probability, and combining
10 conditions resulted in a very small frequency for the recursive edge.
This did help IRA to allocate sanely, but not for good reasons,
so we ended up with exchange improvements and did not notice the
bug (this is the one fixed by the patch above).

For some releases PGO performance is slower than non-PGO
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/options
which I think is a combination of IRA and bad decisions in some loop
opts.  The other problem is that vectorizer tends to blow up the
register pressure too.

> Since there are no other callers outside of these specialized nodes, the
> guessed profile counts should all be equal?  The perf tool shows that even
> though each specialized node is called only once, none of them takes the
> same time per call:
> 
>   40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.4
>   16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.3
>   10.91%  exchange2_gcc.o  libgfortran.so.5.0.0     [.] _gfortran_mminloc0_4_i4
>    5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.6
>    4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
>    3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.5
>    1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.7
>    0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.0
>    0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.2
>    0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.1
>    0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_covered.constprop.0
>    0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_reflected.constprop.0
>    0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.1
> 
> 
> digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time,
> so profile count and frequency seem not very helpful for this case?

Yep, you cannot really determine the time spent on each of the recursion
levels from the recursive edge probability, since you cannot assume it
to be the same on each level of recursion (here we know it is 0 at level
10 and it slowly drops as the recursion increases because there
are fewer possibilities to fill up the partly filled sudoku :).
Especially you cannot assume it after ipa-cp did the work to figure
out that there is a recursion depth counter that affects the function.

Thinking of the ipa-cp profile updating changes, I did not consider the
approach of adding extra weight to the recursive call safe.  The problem
is that the recursive call frequency estimate, when coming from static
profile estimation, is likely completely off, just as the static profile
estimator cannot be used to realistically predict the number of
iterations of loops.  However, even with profile feedback this is hard to
interpret.

I was wondering if we cannot simply detect this scenario and avoid
using the recursive edge probability in the profile updating.
We could simply scale by the number of copies.
This means that if we produce 10 clones, we could just set each clone to
1/10th of the frequency.  This implies that the hottest spot will not be
underestimated exponentially much, as can happen now, but just 10 times
at worst, which is probably still OK.  We play similar games in the loop
optimizers: static profile estimators expect every loop to trip around 3
times, so the unroller/vectorizer/etc. would make no sense on them.

By scaling down according to the number of copies we will keep the
frequencies of calls to functions called from the clones "sane".  We will
still run into problems when we inline one clone into another (since its
profile will get scaled by the recursive edge frequency), but perhaps
that can be special-cased in the profiler, or ipa-cp can adjust the
recursive edges to get a frequency close to 1 as a hack.

This is not pretty, but at least it looks safer to me than the original
profile-updating patch adding extra weight to the frequency.

Honza
> 
> > 
> > With algorithms doing backtracking, like exchange, the likelihood of
> > recursion reduces with deeper recursion levels, but we do not know how
> > quickly, nor what the level is.
> 
> 
> > 
> >> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> >>
> >> For SPEC2017 exchange2, there is a large recursive function digits_2
> >> (function size ~1300) that generates specialized nodes digits_2.1 through
> >> digits_2.8 with the added build options:
> >>
> >> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> >>
> >> The ipa-inline pass will consider inlining these nodes called only once,
> >> but these large functions inlined too deeply will cause serious register
> >> spills and the performance drops shown below.
> >>
> >> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
> >> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
> >> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
> >> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8
> >>
> >> Performance diff:
> >> inlineB is ~25% faster than inlineA;
> >> inlineC is ~20% faster than inlineB;
> >> inlineD is ~30% faster than inlineC.
> >>
> >> Master GCC currently generates an inline sequence like inlineB; this patch
> >> makes the ipa-inline pass behave like inlineD by:
> >>   1) accumulating growth for recursive calls: the growth data is added to
> >> the edge when the edge's caller is inlined into another function, to avoid
> >> inlining too deeply;
> >>   2) treating the edge as a recursive edge when caller and callee are both
> >> specialized from the same node.
> >>
> >> SPEC2017 testing shows a GEOMEAN improvement of +2.75% in total (+0.56% without exchange2).
> >> Any comments?  Thanks.
> >>
> >> 523.xalancbmk_r +1.32%
> >> 541.leela_r     +1.51%
> >> 548.exchange2_r +31.87%
> >> 507.cactuBSSN_r +0.80%
> >> 526.blender_r   +1.25%
> >> 538.imagick_r   +1.82%
> >>
> >> gcc/ChangeLog:
> >>
> >> 2020-08-12  Xionghu Luo  <luoxhu@linux.ibm.com>
> >>
> >> 	* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
> >> 	callee are specialized from the same node.
> >> 	* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
> >> 	inlined_to growth to edge whose caller is inlined.
> >> ---
> >>   gcc/cgraph.h              | 2 ++
> >>   gcc/ipa-inline-analysis.c | 3 +++
> >>   2 files changed, 5 insertions(+)
> >>
> >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> >> index 0211f08964f..11903ac1960 100644
> >> --- a/gcc/cgraph.h
> >> +++ b/gcc/cgraph.h
> >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
> >>     cgraph_node *c = callee->ultimate_alias_target ();
> >>     if (caller->inlined_to)
> >>       return caller->inlined_to->decl == c->decl;
> >> +  else if (caller->clone_of && c->clone_of)
> >> +    return caller->clone_of->decl == c->clone_of->decl;
> >>     else
> >>       return caller->decl == c->decl;
> > 
> > If you clone the function so it is no longer self recursive, it does not
> > make much sense to lie to optimizers that the function is still
> > recursive.
> > 
> > The inlining would be harmful even if the programmer did cloning by hand.
> > I guess main problem is the extreme register pressure issue combining
> > loop depth of 10 in caller with loop depth of 10 in callee just because
> > the function is called once.
> > 
> > The negative effect is most likely also due to wrong profile estimate
> > which drives IRA to optimize wrong spot.  But I wonder if we simply
> > don't want to teach inlining function called once to not construct large
> > loop depths?  Something like do not inline if caller&callee loop depth
> > is over 3 or so?
> 
> Will try this.  Though the test results show that no inlining is better than
> inlining one, which is better than inlining two or more, here.  Moreover,
> sorry that I am also not quite sure whether Richard Biener is positive
> about this or not...
> 
> > 
> > Honza
> >>   }
> >> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
> >> index 148efbc09ef..ba0cf836364 100644
> >> --- a/gcc/ipa-inline-analysis.c
> >> +++ b/gcc/ipa-inline-analysis.c
> >> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
> >>   	  continue;
> >>   	}
> >>         d->growth += estimate_edge_growth (e);
> >> +      if (e->caller->inlined_to)
> >> +	d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
> >> +
> 
> What about this change - is it reasonable to add the inlined_to caller's
> growth to the current edge?  It is somewhat independent of the previous code.
> 
> 
> Thanks,
> Xionghu 
> 
> >>         if (d->growth > d->cap)
> >>   	return true;
> >>       }
> >> -- 
> >> 2.25.1
> >>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-13 12:52     ` Jan Hubicka
@ 2020-08-14  9:04       ` luoxhu
  2020-08-20  9:31         ` Richard Sandiford
  0 siblings, 1 reply; 21+ messages in thread
From: luoxhu @ 2020-08-14  9:04 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, segher, wschmidt, linkw, dje.gcc

Hi,

On 2020/8/13 20:52, Jan Hubicka wrote:
>> Since there are no other callers outside of these specialized nodes, the
>> guessed profile counts should all be equal?  The perf tool shows that even
>> though each specialized node is called only once, none of them takes the
>> same time per call:
>>
>>    40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.4
>>    16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.3
>>    10.91%  exchange2_gcc.o  libgfortran.so.5.0.0     [.] _gfortran_mminloc0_4_i4
>>     5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.6
>>     4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
>>     3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.5
>>     1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.7
>>     0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.0
>>     0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.2
>>     0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.1
>>     0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_covered.constprop.0
>>     0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_reflected.constprop.0
>>     0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.1
>>
>>
>> digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time,
>> so profile count and frequency seem not very helpful for this case?
> Yep, you cannot really determine the time spent on each of the recursion
> levels from the recursive edge probability, since you cannot assume it
> to be the same on each level of recursion (here we know it is 0 at level
> 10 and it slowly drops as the recursion increases because there
> are fewer possibilities to fill up the partly filled sudoku :).
> Especially you cannot assume it after ipa-cp did the work to figure
> out that there is a recursion depth counter that affects the function.
> 
> Thinking of the ipa-cp profile updating changes, I did not consider the
> approach of adding extra weight to the recursive call safe.  The problem
> is that the recursive call frequency estimate, when coming from static
> profile estimation, is likely completely off, just as the static profile
> estimator cannot be used to realistically predict the number of
> iterations of loops.  However, even with profile feedback this is hard to
> interpret.
> 
> I was wondering if we cannot simply detect this scenario and avoid
> using the recursive edge probability in the profile updating.
> We could simply scale by the number of copies.
> This means that if we produce 10 clones, we could just set each clone to
> 1/10th of the frequency.  This implies that the hottest spot will not be
> underestimated exponentially much, as can happen now, but just 10 times
> at worst, which is probably still OK.  We play similar games in the loop
> optimizers: static profile estimators expect every loop to trip around 3
> times, so the unroller/vectorizer/etc. would make no sense on them.
> 
> By scaling down according to the number of copies we will keep the
> frequencies of calls to functions called from the clones "sane".  We will
> still run into problems when we inline one clone into another (since its
> profile will get scaled by the recursive edge frequency), but perhaps
> that can be special-cased in the profiler, or ipa-cp can adjust the
> recursive edges to get a frequency close to 1 as a hack.
> 
> This is not pretty, but at least it looks safer to me than the original
> profile-updating patch adding extra weight to the frequency.


Really appreciate your detailed explanation.  BTW, my previous patch
for the PGO build of exchange2 took this same approach, setting each cloned
node to 1/10th of the frequency, several months ago :)

https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html

Xionghu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-14  9:04       ` luoxhu
@ 2020-08-20  9:31         ` Richard Sandiford
  2020-08-20 10:49           ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Jan Hubicka
  2020-08-20 12:04           ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Martin Jambor
  0 siblings, 2 replies; 21+ messages in thread
From: Richard Sandiford @ 2020-08-20  9:31 UTC (permalink / raw)
  To: luoxhu via Gcc-patches
  Cc: Jan Hubicka, luoxhu, wschmidt, dje.gcc, linkw, segher

Xionghu, thanks for working on fixing the exchange regression.

luoxhu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On 2020/8/13 20:52, Jan Hubicka wrote:
>>> Since there are no other callers outside of these specialized nodes, the
>>> guessed profile counts should all be equal?  The perf tool shows that even
>>> though each specialized node is called only once, none of them takes the
>>> same time per call:
>>>
>>>    40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.4
>>>    16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.3
>>>    10.91%  exchange2_gcc.o  libgfortran.so.5.0.0     [.] _gfortran_mminloc0_4_i4
>>>     5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.6
>>>     4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
>>>     3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.5
>>>     1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.7
>>>     0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.0
>>>     0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.2
>>>     0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.1
>>>     0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_covered.constprop.0
>>>     0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_reflected.constprop.0
>>>     0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.1
>>>
>>>
>>> digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time,
>>> so profile count and frequency seem not very helpful for this case?
>> Yep, you cannot really determine the time spent on each of the recursion
>> levels from the recursive edge probability, since you cannot assume it
>> to be the same on each level of recursion (here we know it is 0 at level
>> 10 and it slowly drops as the recursion increases because there
>> are fewer possibilities to fill up the partly filled sudoku :).
>> Especially you cannot assume it after ipa-cp did the work to figure
>> out that there is a recursion depth counter that affects the function.
>> 
>> Thinking of the ipa-cp profile updating changes, I did not consider the
>> approach of adding extra weight to the recursive call safe.  The problem
>> is that the recursive call frequency estimate, when coming from static
>> profile estimation, is likely completely off, just as the static profile
>> estimator cannot be used to realistically predict the number of
>> iterations of loops.  However, even with profile feedback this is hard to
>> interpret.
>> 
>> I was wondering if we cannot simply detect this scenario and avoid
>> using the recursive edge probability in the profile updating.
>> We could simply scale by the number of copies.
>> This means that if we produce 10 clones, we could just set each clone to
>> 1/10th of the frequency.  This implies that the hottest spot will not be
>> underestimated exponentially much, as can happen now, but just 10 times
>> at worst, which is probably still OK.  We play similar games in the loop
>> optimizers: static profile estimators expect every loop to trip around 3
>> times, so the unroller/vectorizer/etc. would make no sense on them.
>> 
>> By scaling down according to the number of copies we will keep the
>> frequencies of calls to functions called from the clones "sane".  We will
>> still run into problems when we inline one clone into another (since its
>> profile will get scaled by the recursive edge frequency), but perhaps
>> that can be special-cased in the profiler, or ipa-cp can adjust the
>> recursive edges to get a frequency close to 1 as a hack.
>> 
>> This is not pretty, but at least it looks safer to me than the original
>> profile-updating patch adding extra weight to the frequency.
>
>
> Really appreciate your detailed explanation.  BTW, my previous patch
> for the PGO build of exchange2 took this same approach, setting each cloned
> node to 1/10th of the frequency, several months ago :)
>
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html

Does it seem likely that we'll reach a resolution on this soon?
I take the point that the patch that introduced the exchange regression
[https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html]
was just uncovering a latent issue, but still, this is a large regression
in an important benchmark to be carrying around.  For those of us doing
regular benchmark CI, the longer the performance trough gets, the harder
it is to spot other unrelated regressions in the “properly optimised” code.

So if we don't have a specific plan for fixing the regression soon,
I think we should consider reverting the patch until we have something
that avoids the exchange regression, even though the patch wasn't
technically wrong.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-20  9:31         ` Richard Sandiford
@ 2020-08-20 10:49           ` Jan Hubicka
  2020-08-20 12:04           ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Martin Jambor
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2020-08-20 10:49 UTC (permalink / raw)
  To: luoxhu via Gcc-patches, luoxhu, wschmidt, dje.gcc, linkw, segher,
	richard.sandiford, mjambor

> >
> > Really appreciate for your detailed explanation.  BTW, My previous patch
> > for PGO build on exchange2 takes this similar method by setting each cloned
> > node to 1/10th of the frequency several month agao :)
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html

I was looking again into the patch and I think we could adjust it for
mainline.  I wanted to discuss this with Martin Jambor (added to CC)
after he returns from vacation.

I think the idea of scaling all copies by the number of copies is a
reasonable thing to do (as discussed in the earlier mail) since generally we
do not have information about where the hot recursion levels lie.
By the master theorem it is easy to see that this is actually a very subtle
problem and our profile info is not precise enough to be useful here.
Making all copies even is thus a good conservative solution.  I did not
quite understand this point earlier, so the patch did not seem to make
much sense to me and I saw it as an odd exchange-only hack.  I apologize
for that.

What I think is wrong about the patch is that the scale is not set
according to the number of actual copies produced, but according to the
--param controlling the maximal recursion depth (since with exchange we
manage to meet it).  We do not need to cap the recursion depth as low as
the patch does (as shown by the factorial example, which seems easy enough
that we should eventually handle it by default).

I think we need to work out how to scale by the actual number of clones.
The problem is that we clone one by one and each time we update the
profile, so we do not really know how many times we will clone at that
point.  I was thinking of simply keeping the chain of recursively cloned
nodes and each time rescaling the profile of all of them so the total stays
the same.  It would cause some roundoff issues and would also affect the
ipa-cp decisions themselves, since ipa-cp would use the intermediate values.
This is probably not too bad, but perhaps Martin has a better idea.

Honza
> 
> Does it seem likely that we'll reach a resolution on this soon?
> I take the point that the patch that introduced the exchange regression
> [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html]
> was just uncovering a latent issue, but still, this is a large regression
> in an important benchmark to be carrying around.  For those of us doing
> regular benchmark CI, the longer the performance trough gets, the harder
> it is to spot other unrelated regressions in the “properly optimised” code.
> 
> So if we don't have a specific plan for fixing the regression soon,
> I think we should consider reverting the patch until we have something
> that avoids the exchange regression, even though the patch wasn't
> technically wrong.

The problem here is that IRA needs to be lied to to get good results.
One direction of attack is to ignore this problem, get IPA-CP to work by
default, and convince the inliner not to inline at large loop depths.
This seems to give good performance on x86-64; I did not test ARM.

We may fix the IRA issue, but Vladimir does not seem to know what to do,
according to the PR log.

We may lie to the compiler again and produce a wrong profile.

diff --git a/gcc/predict.def b/gcc/predict.def
index 2d8d4958b6d..d8e502152b8 100644
--- a/gcc/predict.def
+++ b/gcc/predict.def
@@ -177,7 +177,7 @@ DEF_PREDICTOR (PRED_LOOP_GUARD, "loop guard", HITRATE (73), 0)
 
 /* Same but for loops containing recursion.  */
 DEF_PREDICTOR (PRED_LOOP_GUARD_WITH_RECURSION, "loop guard with recursion",
-	       HITRATE (85), 0)
+	       HITRATE (95), 0)
 
 /* The following predictors are used in Fortran. */
 
This gets performance back for me.
I do not really like this alternative.  One option I would still like to
look into is that IRA seems to work better with profile feedback than
with the currently guessed profile.  It works even better with the
wrongly guessed profile from the patch above, but perhaps we could fix
the profile estimate somewhere else to make the profile more realistic
and IRA happier.

Honza
> 
> Thanks,
> Richard


* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-20  9:31         ` Richard Sandiford
  2020-08-20 10:49           ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Jan Hubicka
@ 2020-08-20 12:04           ` Martin Jambor
  2020-08-21  9:17             ` Tamar Christina
  1 sibling, 1 reply; 21+ messages in thread
From: Martin Jambor @ 2020-08-20 12:04 UTC (permalink / raw)
  To: Richard Sandiford, luoxhu via Gcc-patches
  Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc

Hi,

On Thu, Aug 20 2020, Richard Sandiford wrote:
>>
>>
>> Really appreciate your detailed explanation.  BTW, my previous patch
>> for a PGO build of exchange2 took a similar approach, setting each cloned
>> node to 1/10th of the frequency, several months ago :)
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html
>
> Does it seem likely that we'll reach a resolution on this soon?
> I take the point that the patch that introduced the exchange regression
> [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html]
> was just uncovering a latent issue, but still, this is a large regression
> in an important benchmark to be carrying around.  For those of us doing
> regular benchmark CI, the longer the performance trough gets, the harder
> it is to spot other unrelated regressions in the “properly optimised” code.
>
> So if we don't have a specific plan for fixing the regression soon,
> I think we should consider reverting the patch until we have something
> that avoids the exchange regression, even though the patch wasn't
> technically wrong.

Honza's changes have been motivated to a big extent as an enabler for
IPA-CP heuristics changes to actually speed up 548.exchange2_r.

On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two
weeks ago; this week it is 403, but with my WIP (and so far untested)
patch below it is just 276 seconds, faster than one built with GCC 8,
which needs 283 seconds.

I'll be interested in knowing if it also works this well on other
architectures.

The patch still needs a bit of a cleanup.  The change of the default
value of ipa-cp-unit-growth needs to be done only for small compilation
units (like inlining does).  I should experiment with whether the value
of param_ipa_cp_loop_hint_bonus should be changed or not.  And last but
not least, I also want to clean up the interfaces between ipa-fnsummary.c
and ipa-cp.c a bit.  I am working on all of this and hope to finish the
patch set in a few (working) days.

The bottom line is that there is a plan to address this regression.

Martin



diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index e4910a04ffa..0d44310503a 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -3190,11 +3190,23 @@ devirtualization_time_bonus (struct cgraph_node *node,
 /* Return time bonus incurred because of HINTS.  */
 
 static int
-hint_time_bonus (cgraph_node *node, ipa_hints hints)
+hint_time_bonus (cgraph_node *node, ipa_hints hints, sreal known_iter_freq,
+		 sreal known_strides_freq)
 {
   int result = 0;
-  if (hints & (INLINE_HINT_loop_iterations | INLINE_HINT_loop_stride))
-    result += opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus);
+  sreal bonus_for_one = opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus);
+
+  if (hints & INLINE_HINT_loop_iterations)
+    {
+      /* FIXME: This should probably be way more nuanced.  */
+      result += (known_iter_freq * bonus_for_one).to_int ();
+    }
+  if (hints & INLINE_HINT_loop_stride)
+    {
+      /* FIXME: And this as well.  */
+      result += (known_strides_freq * bonus_for_one).to_int ();
+    }
+
   return result;
 }
 
@@ -3395,12 +3407,13 @@ perform_estimation_of_a_value (cgraph_node *node, vec<tree> known_csts,
 			       int est_move_cost, ipcp_value_base *val)
 {
   int size, time_benefit;
-  sreal time, base_time;
+  sreal time, base_time, known_iter_freq, known_strides_freq;
   ipa_hints hints;
 
   estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts,
 				     known_aggs, &size, &time,
-				     &base_time, &hints);
+				     &base_time, &hints, &known_iter_freq,
+				     &known_strides_freq);
   base_time -= time;
   if (base_time > 65535)
     base_time = 65535;
@@ -3414,7 +3427,7 @@ perform_estimation_of_a_value (cgraph_node *node, vec<tree> known_csts,
     time_benefit = base_time.to_int ()
       + devirtualization_time_bonus (node, known_csts, known_contexts,
 				     known_aggs)
-      + hint_time_bonus (node, hints)
+      + hint_time_bonus (node, hints, known_iter_freq, known_strides_freq)
       + removable_params_cost + est_move_cost;
 
   gcc_checking_assert (size >=0);
@@ -3476,7 +3489,7 @@ estimate_local_effects (struct cgraph_node *node)
     {
       struct caller_statistics stats;
       ipa_hints hints;
-      sreal time, base_time;
+      sreal time, base_time, known_iter_freq, known_strides_freq;
       int size;
 
       init_caller_stats (&stats);
@@ -3484,9 +3497,11 @@ estimate_local_effects (struct cgraph_node *node)
 					      false);
       estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts,
 					 known_aggs, &size, &time,
-					 &base_time, &hints);
+					 &base_time, &hints, &known_iter_freq,
+					 &known_strides_freq);
       time -= devirt_bonus;
-      time -= hint_time_bonus (node, hints);
+      time -= hint_time_bonus (node, hints, known_iter_freq,
+			       known_strides_freq);
       time -= removable_params_cost;
       size -= stats.n_calls * removable_params_cost;
 
diff --git a/gcc/ipa-fnsummary.c b/gcc/ipa-fnsummary.c
index 2cfab40156e..29fac5db19f 100644
--- a/gcc/ipa-fnsummary.c
+++ b/gcc/ipa-fnsummary.c
@@ -310,6 +310,35 @@ set_hint_predicate (predicate **p, predicate new_predicate)
     }
 }
 
+/* Find if NEW_PREDICATE is already in V and if so, increment its freq.
+   Otherwise add a new item to the vector with this predicate and freq equal
+   to add_freq.  */
+
+static void
+add_freqcounting_predicate (vec<ipa_freqcounting_predicate, va_gc> **v,
+			    const predicate &new_predicate, sreal add_freq)
+{
+  if (new_predicate == false || new_predicate == true)
+    return;
+  ipa_freqcounting_predicate *f;
+  for (int i = 0; vec_safe_iterate (*v, i, &f); i++)
+    if (new_predicate == f->predicate)
+      {
+	f->freq += add_freq;
+	return;
+      }
+  /* FIXME: Make this a parameter.  */
+  if (vec_safe_length (*v) >= 32)
+    /* Too many different predicates to account for.  */
+    return;
+
+  ipa_freqcounting_predicate fcp;
+  fcp.predicate = NULL;
+  set_hint_predicate (&fcp.predicate, new_predicate);
+  fcp.freq = add_freq;
+  vec_safe_push (*v, fcp);
+  return;
+}
 
 /* Compute what conditions may or may not hold given information about
    parameters.  RET_CLAUSE returns truths that may hold in a specialized copy,
@@ -722,10 +751,12 @@ ipa_call_summary::~ipa_call_summary ()
 
 ipa_fn_summary::~ipa_fn_summary ()
 {
-  if (loop_iterations)
-    edge_predicate_pool.remove (loop_iterations);
-  if (loop_stride)
-    edge_predicate_pool.remove (loop_stride);
+  unsigned len = vec_safe_length (loop_iterations);
+  for (unsigned i = 0; i < len; i++)
+    edge_predicate_pool.remove ((*loop_iterations)[i].predicate);
+  len = vec_safe_length (loop_strides);
+  for (unsigned i = 0; i < len; i++)
+    edge_predicate_pool.remove ((*loop_strides)[i].predicate);
   vec_free (conds);
   vec_free (size_time_table);
   vec_free (call_size_time_table);
@@ -741,24 +772,33 @@ ipa_fn_summary_t::remove_callees (cgraph_node *node)
     ipa_call_summaries->remove (e);
 }
 
-/* Same as remap_predicate_after_duplication but handle hint predicate *P.
-   Additionally care about allocating new memory slot for updated predicate
-   and set it to NULL when it becomes true or false (and thus uninteresting).
- */
+/* Duplicate predicates in loop hint vector, allocating memory for them and
+   remove and deallocate any uninteresting (true or false) ones.  Return the
+   result.  */
 
-static void
-remap_hint_predicate_after_duplication (predicate **p,
-					clause_t possible_truths)
+static vec<ipa_freqcounting_predicate, va_gc> *
+remap_freqcounting_preds_after_dup (vec<ipa_freqcounting_predicate, va_gc> *v,
+				    clause_t possible_truths)
 {
-  predicate new_predicate;
+  if (vec_safe_length (v) == 0)
+    return NULL;
 
-  if (!*p)
-    return;
+  vec<ipa_freqcounting_predicate, va_gc> *res = v->copy ();
+  int len = res->length();
+  for (int i = len - 1; i >= 0; i--)
+    {
+      predicate new_predicate
+	= (*res)[i].predicate->remap_after_duplication (possible_truths);
+      /* We do not want to free previous predicate; it is used by node
+	 origin.  */
+      (*res)[i].predicate = NULL;
+      set_hint_predicate (&(*res)[i].predicate, new_predicate);
+
+      if (!(*res)[i].predicate)
+	res->unordered_remove (i);
+    }
 
-  new_predicate = (*p)->remap_after_duplication (possible_truths);
-  /* We do not want to free previous predicate; it is used by node origin.  */
-  *p = NULL;
-  set_hint_predicate (p, new_predicate);
+  return res;
 }
 
 
@@ -874,9 +914,11 @@ ipa_fn_summary_t::duplicate (cgraph_node *src,
 	    optimized_out_size += es->call_stmt_size * ipa_fn_summary::size_scale;
 	  edge_set_predicate (edge, &new_predicate);
 	}
-      remap_hint_predicate_after_duplication (&info->loop_iterations,
+      info->loop_iterations
+	= remap_freqcounting_preds_after_dup (info->loop_iterations,
 					      possible_truths);
-      remap_hint_predicate_after_duplication (&info->loop_stride,
+      info->loop_strides
+	= remap_freqcounting_preds_after_dup (info->loop_strides,
 					      possible_truths);
 
       /* If inliner or someone after inliner will ever start producing
@@ -888,17 +930,21 @@ ipa_fn_summary_t::duplicate (cgraph_node *src,
   else
     {
       info->size_time_table = vec_safe_copy (info->size_time_table);
-      if (info->loop_iterations)
+      info->loop_iterations = vec_safe_copy (info->loop_iterations);
+      info->loop_strides = vec_safe_copy (info->loop_strides);
+
+      ipa_freqcounting_predicate *f;
+      for (int i = 0; vec_safe_iterate (info->loop_iterations, i, &f); i++)
 	{
-	  predicate p = *info->loop_iterations;
-	  info->loop_iterations = NULL;
-	  set_hint_predicate (&info->loop_iterations, p);
+	  predicate p = *f->predicate;
+	  f->predicate = NULL;
+	  set_hint_predicate (&f->predicate, p);
 	}
-      if (info->loop_stride)
+      for (int i = 0; vec_safe_iterate (info->loop_strides, i, &f); i++)
 	{
-	  predicate p = *info->loop_stride;
-	  info->loop_stride = NULL;
-	  set_hint_predicate (&info->loop_stride, p);
+	  predicate p = *f->predicate;
+	  f->predicate = NULL;
+	  set_hint_predicate (&f->predicate, p);
 	}
     }
   if (!dst->inlined_to)
@@ -1057,15 +1103,28 @@ ipa_dump_fn_summary (FILE *f, struct cgraph_node *node)
 		}
 	      fprintf (f, "\n");
 	    }
-	  if (s->loop_iterations)
+	  ipa_freqcounting_predicate *fcp;
+	  bool first_fcp = true;
+	  for (int i = 0; vec_safe_iterate (s->loop_iterations, i, &fcp); i++)
 	    {
-	      fprintf (f, "  loop iterations:");
-	      s->loop_iterations->dump (f, s->conds);
+	      if (first_fcp)
+		{
+		  fprintf (f, "  loop iterations:");
+		  first_fcp = false;
+		}
+	      fprintf (f, "  %3.2f for ", fcp->freq.to_double ());
+	      fcp->predicate->dump (f, s->conds);
 	    }
-	  if (s->loop_stride)
+	  first_fcp = true;
+	  for (int i = 0; vec_safe_iterate (s->loop_strides, i, &fcp); i++)
 	    {
-	      fprintf (f, "  loop stride:");
-	      s->loop_stride->dump (f, s->conds);
+	      if (first_fcp)
+		{
+		  fprintf (f, "  loop strides:");
+		  first_fcp = false;
+		}
+	      fprintf (f, "  %3.2f for :", fcp->freq.to_double ());
+	      fcp->predicate->dump (f, s->conds);
 	    }
 	  fprintf (f, "  calls:\n");
 	  dump_ipa_call_summary (f, 4, node, s);
@@ -2514,12 +2573,13 @@ analyze_function_body (struct cgraph_node *node, bool early)
 
   if (fbi.info)
     compute_bb_predicates (&fbi, node, info, params_summary);
+  const profile_count entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
   order = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
   nblocks = pre_and_rev_post_order_compute (NULL, order, false);
   for (n = 0; n < nblocks; n++)
     {
       bb = BASIC_BLOCK_FOR_FN (cfun, order[n]);
-      freq = bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count);
+      freq = bb->count.to_sreal_scale (entry_count);
       if (clobber_only_eh_bb_p (bb))
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
@@ -2758,23 +2818,27 @@ analyze_function_body (struct cgraph_node *node, bool early)
 
   if (nonconstant_names.exists () && !early)
     {
+      ipa_fn_summary *s = ipa_fn_summaries->get (node);
       class loop *loop;
-      predicate loop_iterations = true;
-      predicate loop_stride = true;
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	flow_loops_dump (dump_file, NULL, 0);
       scev_initialize ();
       FOR_EACH_LOOP (loop, 0)
 	{
+	  predicate loop_iterations = true;
+	  sreal header_freq;
 	  vec<edge> exits;
 	  edge ex;
 	  unsigned int j;
 	  class tree_niter_desc niter_desc;
 	  if (loop->header->aux)
-	    bb_predicate = *(predicate *) loop->header->aux;
+	    {
+	      bb_predicate = *(predicate *) loop->header->aux;
+	      header_freq = loop->header->count.to_sreal_scale (entry_count);
+	    }
 	  else
-	    bb_predicate = false;
+	    continue;
 
 	  exits = get_loop_exit_edges (loop);
 	  FOR_EACH_VEC_ELT (exits, j, ex)
@@ -2790,10 +2854,10 @@ analyze_function_body (struct cgraph_node *node, bool early)
 		will_be_nonconstant = bb_predicate & will_be_nonconstant;
 	      if (will_be_nonconstant != true
 		  && will_be_nonconstant != false)
-		/* This is slightly inprecise.  We may want to represent each
-		   loop with independent predicate.  */
 		loop_iterations &= will_be_nonconstant;
 	    }
+	  add_freqcounting_predicate (&s->loop_iterations, loop_iterations,
+				      header_freq);
 	  exits.release ();
 	}
 
@@ -2803,14 +2867,20 @@ analyze_function_body (struct cgraph_node *node, bool early)
       for (loop = loops_for_fn (cfun)->tree_root->inner;
 	   loop != NULL; loop = loop->next)
 	{
+	  predicate loop_stride = true;
+	  sreal bb_freq;
 	  basic_block *body = get_loop_body (loop);
 	  for (unsigned i = 0; i < loop->num_nodes; i++)
 	    {
 	      gimple_stmt_iterator gsi;
 	      if (body[i]->aux)
-		bb_predicate = *(predicate *) body[i]->aux;
+		{
+		  bb_predicate = *(predicate *) body[i]->aux;
+		  bb_freq = body[i]->count.to_sreal_scale (entry_count);
+		}
 	      else
-		bb_predicate = false;
+		continue;
+
 	      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi);
 		   gsi_next (&gsi))
 		{
@@ -2839,16 +2909,13 @@ analyze_function_body (struct cgraph_node *node, bool early)
 		    will_be_nonconstant = bb_predicate & will_be_nonconstant;
 		  if (will_be_nonconstant != true
 		      && will_be_nonconstant != false)
-		    /* This is slightly inprecise.  We may want to represent
-		       each loop with independent predicate.  */
 		    loop_stride = loop_stride & will_be_nonconstant;
 		}
 	    }
+	  add_freqcounting_predicate (&s->loop_strides, loop_stride,
+				      bb_freq);
 	  free (body);
 	}
-      ipa_fn_summary *s = ipa_fn_summaries->get (node);
-      set_hint_predicate (&s->loop_iterations, loop_iterations);
-      set_hint_predicate (&s->loop_stride, loop_stride);
       scev_finalize ();
     }
   FOR_ALL_BB_FN (bb, my_function)
@@ -3640,14 +3707,32 @@ ipa_call_context::estimate_size_and_time (int *ret_size,
   if (time > nonspecialized_time)
     time = nonspecialized_time;
 
+  m_loops_with_known_iterations = 0;
+  ipa_freqcounting_predicate *fcp;
+  for (i = 0; vec_safe_iterate (info->loop_iterations, i, &fcp); i++)
+    {
+      gcc_assert (fcp->predicate);
+      if (!fcp->predicate->evaluate (m_possible_truths))
+	{
+	  if (ret_hints)
+	    hints |= INLINE_HINT_loop_iterations;
+	  m_loops_with_known_iterations += fcp->freq;
+	}
+    }
+  m_loops_with_known_strides = 0;
+  for (i = 0; vec_safe_iterate (info->loop_strides, i, &fcp); i++)
+    {
+      gcc_assert (fcp->predicate);
+      if (!fcp->predicate->evaluate (m_possible_truths))
+	{
+	  if (ret_hints)
+	    hints |= INLINE_HINT_loop_stride;
+	  m_loops_with_known_strides += fcp->freq;
+	}
+    }
+
   if (ret_hints)
     {
-      if (info->loop_iterations
-	  && !info->loop_iterations->evaluate (m_possible_truths))
-	hints |= INLINE_HINT_loop_iterations;
-      if (info->loop_stride
-	  && !info->loop_stride->evaluate (m_possible_truths))
-	hints |= INLINE_HINT_loop_stride;
       if (info->scc_no)
 	hints |= INLINE_HINT_in_scc;
       if (DECL_DECLARED_INLINE_P (m_node->decl))
@@ -3687,7 +3772,9 @@ estimate_ipcp_clone_size_and_time (struct cgraph_node *node,
 				   vec<ipa_agg_value_set> known_aggs,
 				   int *ret_size, sreal *ret_time,
 				   sreal *ret_nonspec_time,
-				   ipa_hints *hints)
+				   ipa_hints *hints,
+				   sreal *loops_with_known_iterations,
+				   sreal *loops_with_known_strides)
 {
   clause_t clause, nonspec_clause;
 
@@ -3699,6 +3786,8 @@ estimate_ipcp_clone_size_and_time (struct cgraph_node *node,
 		        known_aggs, vNULL);
   ctx.estimate_size_and_time (ret_size, NULL, ret_time,
 			      ret_nonspec_time, hints);
+  *loops_with_known_iterations = ctx.m_loops_with_known_iterations;
+  *loops_with_known_strides = ctx.m_loops_with_known_strides;
 }
 
 /* Return stack frame offset where frame of NODE is supposed to start inside
@@ -3857,32 +3946,31 @@ remap_edge_summaries (struct cgraph_edge *inlined_edge,
     }
 }
 
-/* Same as remap_predicate, but set result into hint *HINT.  */
+/* Run remap_after_inlining on each predicate in V.  */
 
 static void
-remap_hint_predicate (class ipa_fn_summary *info,
-		      class ipa_node_params *params_summary,
-		      class ipa_fn_summary *callee_info,
-		      predicate **hint,
-		      vec<int> operand_map,
-		      vec<int> offset_map,
-		      clause_t possible_truths,
-		      predicate *toplev_predicate)
-{
-  predicate p;
+remap_freqcounting_predicate (class ipa_fn_summary *info,
+			      class ipa_node_params *params_summary,
+			      class ipa_fn_summary *callee_info,
+			      vec<ipa_freqcounting_predicate, va_gc> *v,
+			      vec<int> operand_map,
+			      vec<int> offset_map,
+			      clause_t possible_truths,
+			      predicate *toplev_predicate)
 
-  if (!*hint)
-    return;
-  p = (*hint)->remap_after_inlining
-			 (info, params_summary, callee_info,
-			  operand_map, offset_map,
-			  possible_truths, *toplev_predicate);
-  if (p != false && p != true)
+{
+  ipa_freqcounting_predicate *fcp;
+  for (int i = 0; vec_safe_iterate (v, i, &fcp); i++)
     {
-      if (!*hint)
-	set_hint_predicate (hint, p);
-      else
-	**hint &= p;
+      predicate p
+	= fcp->predicate->remap_after_inlining (info, params_summary,
+						callee_info, operand_map,
+						offset_map, possible_truths,
+						*toplev_predicate);
+      if (p != false && p != true)
+	/* FIXME: Is this really supposed to be &= and not a plain
+	   assignment?  */
+	*fcp->predicate &= p;
     }
 }
 
@@ -3992,12 +4080,12 @@ ipa_merge_fn_summary_after_inlining (struct cgraph_edge *edge)
   remap_edge_summaries (edge, edge->callee, info, params_summary,
 		 	callee_info, operand_map,
 			offset_map, clause, &toplev_predicate);
-  remap_hint_predicate (info, params_summary, callee_info,
-			&callee_info->loop_iterations,
-			operand_map, offset_map, clause, &toplev_predicate);
-  remap_hint_predicate (info, params_summary, callee_info,
-			&callee_info->loop_stride,
-			operand_map, offset_map, clause, &toplev_predicate);
+  remap_freqcounting_predicate (info, params_summary, callee_info,
+				info->loop_iterations, operand_map,
+				offset_map, clause, &toplev_predicate);
+  remap_freqcounting_predicate (info, params_summary, callee_info,
+				info->loop_strides, operand_map,
+				offset_map, clause, &toplev_predicate);
 
   HOST_WIDE_INT stack_frame_offset = ipa_get_stack_frame_offset (edge->callee);
   HOST_WIDE_INT peak = stack_frame_offset + callee_info->estimated_stack_size;
@@ -4322,12 +4410,34 @@ inline_read_section (struct lto_file_decl_data *file_data, const char *data,
 	    info->size_time_table->quick_push (e);
 	}
 
-      p.stream_in (&ib);
-      if (info)
-        set_hint_predicate (&info->loop_iterations, p);
-      p.stream_in (&ib);
-      if (info)
-        set_hint_predicate (&info->loop_stride, p);
+      count2 = streamer_read_uhwi (&ib);
+      for (j = 0; j < count2; j++)
+	{
+	  p.stream_in (&ib);
+	  sreal fcp_freq = sreal::stream_in (&ib);
+	  if (info)
+	    {
+	      ipa_freqcounting_predicate fcp;
+	      fcp.predicate = NULL;
+	      set_hint_predicate (&fcp.predicate, p);
+	      fcp.freq = fcp_freq;
+	      vec_safe_push (info->loop_iterations, fcp);
+	    }
+	}
+      count2 = streamer_read_uhwi (&ib);
+      for (j = 0; j < count2; j++)
+	{
+	  p.stream_in (&ib);
+	  sreal fcp_freq = sreal::stream_in (&ib);
+	  if (info)
+	    {
+	      ipa_freqcounting_predicate fcp;
+	      fcp.predicate = NULL;
+	      set_hint_predicate (&fcp.predicate, p);
+	      fcp.freq = fcp_freq;
+	      vec_safe_push (info->loop_strides, fcp);
+	    }
+	}
       for (e = node->callees; e; e = e->next_callee)
 	read_ipa_call_summary (&ib, e, info != NULL);
       for (e = node->indirect_calls; e; e = e->next_callee)
@@ -4487,14 +4597,19 @@ ipa_fn_summary_write (void)
 	      e->exec_predicate.stream_out (ob);
 	      e->nonconst_predicate.stream_out (ob);
 	    }
-	  if (info->loop_iterations)
-	    info->loop_iterations->stream_out (ob);
- 	  else
-	    streamer_write_uhwi (ob, 0);
-	  if (info->loop_stride)
-	    info->loop_stride->stream_out (ob);
- 	  else
-	    streamer_write_uhwi (ob, 0);
+	  ipa_freqcounting_predicate *fcp;
+	  streamer_write_uhwi (ob, vec_safe_length (info->loop_iterations));
+	  for (i = 0; vec_safe_iterate (info->loop_iterations, i, &fcp); i++)
+	    {
+	      fcp->predicate->stream_out (ob);
+	      fcp->freq.stream_out (ob);
+	    }
+	  streamer_write_uhwi (ob, vec_safe_length (info->loop_strides));
+	  for (i = 0; vec_safe_iterate (info->loop_strides, i, &fcp); i++)
+	    {
+	      fcp->predicate->stream_out (ob);
+	      fcp->freq.stream_out (ob);
+	    }
 	  for (edge = cnode->callees; edge; edge = edge->next_callee)
 	    write_ipa_call_summary (ob, edge);
 	  for (edge = cnode->indirect_calls; edge; edge = edge->next_callee)
diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h
index c6ddc9f3199..d8429afdbef 100644
--- a/gcc/ipa-fnsummary.h
+++ b/gcc/ipa-fnsummary.h
@@ -101,6 +101,19 @@ public:
   }
 };
 
+/* Structure to capture how frequently some interesting events occur given a
+   particular predicate.  The structure is used to estimate how often we
+   encounter loops with known iteration count or stride in various
+   contexts.  */
+
+struct GTY(()) ipa_freqcounting_predicate
+{
+  /* The described event happens with this frequency... */
+  sreal freq;
+  /* ...when this predicate evaluates to false. */
+  class predicate * GTY((skip)) predicate;
+};
+
 /* Function inlining information.  */
 class GTY(()) ipa_fn_summary
 {
@@ -112,8 +125,9 @@ public:
       inlinable (false), single_caller (false),
       fp_expressions (false), estimated_stack_size (false),
       time (0), conds (NULL),
-      size_time_table (NULL), call_size_time_table (NULL), loop_iterations (NULL),
-      loop_stride (NULL), growth (0), scc_no (0)
+      size_time_table (NULL), call_size_time_table (NULL),
+      loop_iterations (NULL), loop_strides (NULL),
+      growth (0), scc_no (0)
   {
   }
 
@@ -125,13 +139,12 @@ public:
     estimated_stack_size (s.estimated_stack_size),
     time (s.time), conds (s.conds), size_time_table (s.size_time_table),
     call_size_time_table (NULL),
-    loop_iterations (s.loop_iterations), loop_stride (s.loop_stride),
+    loop_iterations (s.loop_iterations), loop_strides (s.loop_strides),
     growth (s.growth), scc_no (s.scc_no)
   {}
 
   /* Default constructor.  */
   ~ipa_fn_summary ();
-
   /* Information about the function body itself.  */
 
   /* Minimal size increase after inlining.  */
@@ -164,12 +177,10 @@ public:
   vec<size_time_entry, va_gc> *size_time_table;
   vec<size_time_entry, va_gc> *call_size_time_table;
 
-  /* Predicate on when some loop in the function becomes to have known
-     bounds.   */
-  predicate * GTY((skip)) loop_iterations;
-  /* Predicate on when some loop in the function becomes to have known
-     stride.   */
-  predicate * GTY((skip)) loop_stride;
+  /* Predicates on when some loops in the function can have known bounds.  */
+  vec<ipa_freqcounting_predicate, va_gc> *loop_iterations;
+  /* Predicates on when some loops in the function can have known strides.  */
+  vec<ipa_freqcounting_predicate, va_gc> *loop_strides;
   /* Estimated growth for inlining all copies of the function before start
      of small functions inlining.
      This value will get out of date as the callers are duplicated, but
@@ -316,6 +327,15 @@ public:
   {
     return m_node != NULL;
   }
+
+
+  /* How often loops will have known iterations.  Calculated in
+     estimate_size_and_time.  */
+  sreal m_loops_with_known_iterations;
+  /* How often loops will have known strides.  Calculated in
+     estimate_size_and_time. */
+  sreal m_loops_with_known_strides;
+
 private:
   /* Called function.  */
   cgraph_node *m_node;
@@ -353,7 +373,7 @@ void estimate_ipcp_clone_size_and_time (struct cgraph_node *,
 					vec<ipa_polymorphic_call_context>,
 					vec<ipa_agg_value_set>,
 					int *, sreal *, sreal *,
-				        ipa_hints *);
+				        ipa_hints *, sreal *, sreal *);
 void ipa_merge_fn_summary_after_inlining (struct cgraph_edge *edge);
 void ipa_update_overall_fn_summary (struct cgraph_node *node, bool reset = true);
 void compute_fn_summary (struct cgraph_node *, bool);
diff --git a/gcc/params.opt b/gcc/params.opt
index f39e5d1a012..2a5f3d61727 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -211,7 +211,7 @@ Common Joined UInteger Var(param_ipa_cp_single_call_penalty) Init(15) IntegerRan
 Percentage penalty functions containing a single call to another function will receive when they are evaluated for cloning.
 
 -param=ipa-cp-unit-growth=
-Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(10) Param Optimization
+Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(80) Param Optimization
 How much can given compilation unit grow because of the interprocedural constant propagation (in percent).
 
 -param=ipa-cp-value-list-size=


* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-20 12:04           ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Martin Jambor
@ 2020-08-21  9:17             ` Tamar Christina
  2020-09-08 14:00               ` Martin Jambor
  0 siblings, 1 reply; 21+ messages in thread
From: Tamar Christina @ 2020-08-21  9:17 UTC (permalink / raw)
  To: Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches
  Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc

Hi Martin,

> Hi,
> 
> On Thu, Aug 20 2020, Richard Sandiford wrote:
> >>
> >>
> >> Really appreciate your detailed explanation.  BTW, my previous
> >> patch for a PGO build of exchange2 took a similar approach, setting
> >> each cloned node to 1/10th of the frequency, several months ago :)
> >>
> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html
> >
> > Does it seem likely that we'll reach a resolution on this soon?
> > I take the point that the patch that introduced the exchange
> > regression
> > [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html]
> > was just uncovering a latent issue, but still, this is a large
> > regression in an important benchmark to be carrying around.  For those
> > of us doing regular benchmark CI, the longer the performance trough
> > gets, the harder it is to spot other unrelated regressions in the “properly
> optimised” code.
> >
> > So if we don't have a specific plan for fixing the regression soon, I
> > think we should consider reverting the patch until we have something
> > that avoids the exchange regression, even though the patch wasn't
> > technically wrong.
> 
> Honza's changes have been motivated to a big extent as an enabler for IPA-CP
> heuristics changes to actually speed up 548.exchange2_r.
> 
> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two
> weeks ago, this week it is 403, but with my WIP (and so far untested) patch
> below it is just 276 seconds - faster than one built with GCC 8 which needs
> 283 seconds.
> 
> I'll be interested in knowing if it also works this well on other architectures.
> 

Many thanks for working on this!

I tried this on an AArch64 Neoverse-N1 machine and didn't see any difference.
Do I need any flags for it to work?  The patch was applied on top of
656218ab982cc22b826227045826c92743143af1.

And I tried 3 runs
1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once
2) -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once
3) -mcpu=native -Ofast -fomit-frame-pointer -flto

The first one used to give us the best result; with this patch there's no
difference between 1 and 2 (an 11% regression), and the 3rd one is about 15%
slower on top of that.

If there's anything I can do to help just let me know.

Cheers,
Tamar

> The patch still needs a bit of a cleanup.  The change of the default value of
> ipa-cp-unit-growth needs to be done only for small compilation units (like
> inlining does).  I should experiment if the value of
> param_ipa_cp_loop_hint_bonus should be changed or not.  And last but not
> least, I also want to clean-up the interfaces between ipa-fnsummary.c and
> ipa-cp.c a bit.  I am working on all of this and hope to finish the patch set in a
> few (working) days.
> 
> The bottom line is that there is a plan to address this regression.
> 
> Martin
> 
> 
> 
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index e4910a04ffa..0d44310503a
> 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -3190,11 +3190,23 @@ devirtualization_time_bonus (struct
> cgraph_node *node,
>  /* Return time bonus incurred because of HINTS.  */
> 
>  static int
> -hint_time_bonus (cgraph_node *node, ipa_hints hints)
> +hint_time_bonus (cgraph_node *node, ipa_hints hints, sreal known_iter_freq,
> +		 sreal known_strides_freq)
>  {
>    int result = 0;
> -  if (hints & (INLINE_HINT_loop_iterations | INLINE_HINT_loop_stride))
> -    result += opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus);
> +  sreal bonus_for_one = opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus);
> +
> +  if (hints & INLINE_HINT_loop_iterations)
> +    {
> +      /* FIXME: This should probably be way more nuanced.  */
> +      result += (known_iter_freq * bonus_for_one).to_int ();
> +    }
> +  if (hints & INLINE_HINT_loop_stride)
> +    {
> +      /* FIXME: And this as well.  */
> +      result += (known_strides_freq * bonus_for_one).to_int ();
> +    }
> +
>    return result;
>  }
> 
> @@ -3395,12 +3407,13 @@ perform_estimation_of_a_value (cgraph_node
> *node, vec<tree> known_csts,
>  			       int est_move_cost, ipcp_value_base *val)  {
>    int size, time_benefit;
> -  sreal time, base_time;
> +  sreal time, base_time, known_iter_freq, known_strides_freq;
>    ipa_hints hints;
> 
>    estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts,
>  				     known_aggs, &size, &time,
> -				     &base_time, &hints);
> +				     &base_time, &hints, &known_iter_freq,
> +				     &known_strides_freq);
>    base_time -= time;
>    if (base_time > 65535)
>      base_time = 65535;
> @@ -3414,7 +3427,7 @@ perform_estimation_of_a_value (cgraph_node
> *node, vec<tree> known_csts,
>      time_benefit = base_time.to_int ()
>        + devirtualization_time_bonus (node, known_csts, known_contexts,
>  				     known_aggs)
> -      + hint_time_bonus (node, hints)
> +      + hint_time_bonus (node, hints, known_iter_freq, known_strides_freq)
>        + removable_params_cost + est_move_cost;
> 
>    gcc_checking_assert (size >=0);
> @@ -3476,7 +3489,7 @@ estimate_local_effects (struct cgraph_node *node)
>      {
>        struct caller_statistics stats;
>        ipa_hints hints;
> -      sreal time, base_time;
> +      sreal time, base_time, known_iter_freq, known_strides_freq;
>        int size;
> 
>        init_caller_stats (&stats);
> @@ -3484,9 +3497,11 @@ estimate_local_effects (struct cgraph_node
> *node)
>  					      false);
>        estimate_ipcp_clone_size_and_time (node, known_csts,
> known_contexts,
>  					 known_aggs, &size, &time,
> -					 &base_time, &hints);
> +					 &base_time, &hints,
> &known_iter_freq,
> +					 &known_strides_freq);
>        time -= devirt_bonus;
> -      time -= hint_time_bonus (node, hints);
> +      time -= hint_time_bonus (node, hints, known_iter_freq,
> +			       known_strides_freq);
>        time -= removable_params_cost;
>        size -= stats.n_calls * removable_params_cost;
> 
> diff --git a/gcc/ipa-fnsummary.c b/gcc/ipa-fnsummary.c index
> 2cfab40156e..29fac5db19f 100644
> --- a/gcc/ipa-fnsummary.c
> +++ b/gcc/ipa-fnsummary.c
> @@ -310,6 +310,35 @@ set_hint_predicate (predicate **p, predicate
> new_predicate)
>      }
>  }
> 
> +/* Find if NEW_PREDICATE is already in V and if so, increment its freq.
> +   Otherwise add a new item to the vector with this predicate and freq
> +   equal to add_freq.  */
> +
> +static void
> +add_freqcounting_predicate (vec<ipa_freqcounting_predicate, va_gc> **v,
> +			    const predicate &new_predicate, sreal add_freq)
> +{
> +  if (new_predicate == false || new_predicate == true)
> +    return;
> +  ipa_freqcounting_predicate *f;
> +  for (int i = 0; vec_safe_iterate (*v, i, &f); i++)
> +    if (new_predicate == f->predicate)
> +      {
> +	f->freq += add_freq;
> +	return;
> +      }
> +  /* FIXME: Make this a parameter.  */
> +  if (vec_safe_length (*v) >= 32)
> +    /* Too many different predicates to account for.  */
> +    return;
> +
> +  ipa_freqcounting_predicate fcp;
> +  fcp.predicate = NULL;
> +  set_hint_predicate (&fcp.predicate, new_predicate);
> +  fcp.freq = add_freq;
> +  vec_safe_push (*v, fcp);
> +  return;
> +}
> 
>  /* Compute what conditions may or may not hold given information about
>     parameters.  RET_CLAUSE returns truths that may hold in a specialized copy,
> @@ -722,10 +751,12 @@ ipa_call_summary::~ipa_call_summary ()
> 
>  ipa_fn_summary::~ipa_fn_summary ()
>  {
> -  if (loop_iterations)
> -    edge_predicate_pool.remove (loop_iterations);
> -  if (loop_stride)
> -    edge_predicate_pool.remove (loop_stride);
> +  unsigned len = vec_safe_length (loop_iterations);
> +  for (unsigned i = 0; i < len; i++)
> +    edge_predicate_pool.remove ((*loop_iterations)[i].predicate);
> +  len = vec_safe_length (loop_strides);
> +  for (unsigned i = 0; i < len; i++)
> +    edge_predicate_pool.remove ((*loop_strides)[i].predicate);
>    vec_free (conds);
>    vec_free (size_time_table);
>    vec_free (call_size_time_table);
> @@ -741,24 +772,33 @@ ipa_fn_summary_t::remove_callees (cgraph_node
> *node)
>      ipa_call_summaries->remove (e);
>  }
> 
> -/* Same as remap_predicate_after_duplication but handle hint predicate *P.
> -   Additionally care about allocating new memory slot for updated predicate
> -   and set it to NULL when it becomes true or false (and thus uninteresting).
> - */
> +/* Duplicate predicates in loop hint vector, allocating memory for them and
> +   remove and deallocate any uninteresting (true or false) ones.  Return the
> +   result.  */
> 
> -static void
> -remap_hint_predicate_after_duplication (predicate **p,
> -					clause_t possible_truths)
> +static vec<ipa_freqcounting_predicate, va_gc> *
> +remap_freqcounting_preds_after_dup (vec<ipa_freqcounting_predicate,
> va_gc> *v,
> +				    clause_t possible_truths)
>  {
> -  predicate new_predicate;
> +  if (vec_safe_length (v) == 0)
> +    return NULL;
> 
> -  if (!*p)
> -    return;
> +  vec<ipa_freqcounting_predicate, va_gc> *res = v->copy ();
> +  int len = res->length();
> +  for (int i = len - 1; i >= 0; i--)
> +    {
> +      predicate new_predicate
> +	= (*res)[i].predicate->remap_after_duplication (possible_truths);
> +      /* We do not want to free previous predicate; it is used by node
> +	 origin.  */
> +      (*res)[i].predicate = NULL;
> +      set_hint_predicate (&(*res)[i].predicate, new_predicate);
> +
> +      if (!(*res)[i].predicate)
> +	res->unordered_remove (i);
> +    }
> 
> -  new_predicate = (*p)->remap_after_duplication (possible_truths);
> -  /* We do not want to free previous predicate; it is used by node origin.  */
> -  *p = NULL;
> -  set_hint_predicate (p, new_predicate);
> +  return res;
>  }
> 
> 
> @@ -874,9 +914,11 @@ ipa_fn_summary_t::duplicate (cgraph_node *src,
>  	    optimized_out_size += es->call_stmt_size *
> ipa_fn_summary::size_scale;
>  	  edge_set_predicate (edge, &new_predicate);
>  	}
> -      remap_hint_predicate_after_duplication (&info->loop_iterations,
> +      info->loop_iterations
> +	= remap_freqcounting_preds_after_dup (info->loop_iterations,
>  					      possible_truths);
> -      remap_hint_predicate_after_duplication (&info->loop_stride,
> +      info->loop_strides
> +	= remap_freqcounting_preds_after_dup (info->loop_strides,
>  					      possible_truths);
> 
>        /* If inliner or someone after inliner will ever start producing @@ -888,17
> +930,21 @@ ipa_fn_summary_t::duplicate (cgraph_node *src,
>    else
>      {
>        info->size_time_table = vec_safe_copy (info->size_time_table);
> -      if (info->loop_iterations)
> +      info->loop_iterations = vec_safe_copy (info->loop_iterations);
> +      info->loop_strides = vec_safe_copy (info->loop_strides);
> +
> +      ipa_freqcounting_predicate *f;
> +      for (int i = 0; vec_safe_iterate (info->loop_iterations, i, &f); i++)
>  	{
> -	  predicate p = *info->loop_iterations;
> -	  info->loop_iterations = NULL;
> -	  set_hint_predicate (&info->loop_iterations, p);
> +	  predicate p = *f->predicate;
> +	  f->predicate = NULL;
> +	  set_hint_predicate (&f->predicate, p);
>  	}
> -      if (info->loop_stride)
> +      for (int i = 0; vec_safe_iterate (info->loop_strides, i, &f); i++)
>  	{
> -	  predicate p = *info->loop_stride;
> -	  info->loop_stride = NULL;
> -	  set_hint_predicate (&info->loop_stride, p);
> +	  predicate p = *f->predicate;
> +	  f->predicate = NULL;
> +	  set_hint_predicate (&f->predicate, p);
>  	}
>      }
>    if (!dst->inlined_to)
> @@ -1057,15 +1103,28 @@ ipa_dump_fn_summary (FILE *f, struct
> cgraph_node *node)
>  		}
>  	      fprintf (f, "\n");
>  	    }
> -	  if (s->loop_iterations)
> +	  ipa_freqcounting_predicate *fcp;
> +	  bool first_fcp = true;
> +	  for (int i = 0; vec_safe_iterate (s->loop_iterations, i, &fcp); i++)
>  	    {
> -	      fprintf (f, "  loop iterations:");
> -	      s->loop_iterations->dump (f, s->conds);
> +	      if (first_fcp)
> +		{
> +		  fprintf (f, "  loop iterations:");
> +		  first_fcp = false;
> +		}
> +	      fprintf (f, "  %3.2f for ", fcp->freq.to_double ());
> +	      fcp->predicate->dump (f, s->conds);
>  	    }
> -	  if (s->loop_stride)
> +	  first_fcp = true;
> +	  for (int i = 0; vec_safe_iterate (s->loop_strides, i, &fcp); i++)
>  	    {
> -	      fprintf (f, "  loop stride:");
> -	      s->loop_stride->dump (f, s->conds);
> +	      if (first_fcp)
> +		{
> +		  fprintf (f, "  loop strides:");
> +		  first_fcp = false;
> +		}
> +	      fprintf (f, "  %3.2f for :", fcp->freq.to_double ());
> +	      fcp->predicate->dump (f, s->conds);
>  	    }
>  	  fprintf (f, "  calls:\n");
>  	  dump_ipa_call_summary (f, 4, node, s); @@ -2514,12 +2573,13 @@
> analyze_function_body (struct cgraph_node *node, bool early)
> 
>    if (fbi.info)
>      compute_bb_predicates (&fbi, node, info, params_summary);
> +  const profile_count entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
>    order = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
>    nblocks = pre_and_rev_post_order_compute (NULL, order, false);
>    for (n = 0; n < nblocks; n++)
>      {
>        bb = BASIC_BLOCK_FOR_FN (cfun, order[n]);
> -      freq = bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)-
> >count);
> +      freq = bb->count.to_sreal_scale (entry_count);
>        if (clobber_only_eh_bb_p (bb))
>  	{
>  	  if (dump_file && (dump_flags & TDF_DETAILS)) @@ -2758,23
> +2818,27 @@ analyze_function_body (struct cgraph_node *node, bool early)
> 
>    if (nonconstant_names.exists () && !early)
>      {
> +      ipa_fn_summary *s = ipa_fn_summaries->get (node);
>        class loop *loop;
> -      predicate loop_iterations = true;
> -      predicate loop_stride = true;
> 
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	flow_loops_dump (dump_file, NULL, 0);
>        scev_initialize ();
>        FOR_EACH_LOOP (loop, 0)
>  	{
> +	  predicate loop_iterations = true;
> +	  sreal header_freq;
>  	  vec<edge> exits;
>  	  edge ex;
>  	  unsigned int j;
>  	  class tree_niter_desc niter_desc;
>  	  if (loop->header->aux)
> -	    bb_predicate = *(predicate *) loop->header->aux;
> +	    {
> +	      bb_predicate = *(predicate *) loop->header->aux;
> +	      header_freq = loop->header->count.to_sreal_scale
> (entry_count);
> +	    }
>  	  else
> -	    bb_predicate = false;
> +	    continue;
> 
>  	  exits = get_loop_exit_edges (loop);
>  	  FOR_EACH_VEC_ELT (exits, j, ex)
> @@ -2790,10 +2854,10 @@ analyze_function_body (struct cgraph_node
> *node, bool early)
>  		will_be_nonconstant = bb_predicate & will_be_nonconstant;
>  	      if (will_be_nonconstant != true
>  		  && will_be_nonconstant != false)
> -		/* This is slightly inprecise.  We may want to represent each
> -		   loop with independent predicate.  */
>  		loop_iterations &= will_be_nonconstant;
>  	    }
> +	  add_freqcounting_predicate (&s->loop_iterations, loop_iterations,
> +				      header_freq);
>  	  exits.release ();
>  	}
> 
> @@ -2803,14 +2867,20 @@ analyze_function_body (struct cgraph_node
> *node, bool early)
>        for (loop = loops_for_fn (cfun)->tree_root->inner;
>  	   loop != NULL; loop = loop->next)
>  	{
> +	  predicate loop_stride = true;
> +	  sreal bb_freq;
>  	  basic_block *body = get_loop_body (loop);
>  	  for (unsigned i = 0; i < loop->num_nodes; i++)
>  	    {
>  	      gimple_stmt_iterator gsi;
>  	      if (body[i]->aux)
> -		bb_predicate = *(predicate *) body[i]->aux;
> +		{
> +		  bb_predicate = *(predicate *) body[i]->aux;
> +		  bb_freq = body[i]->count.to_sreal_scale (entry_count);
> +		}
>  	      else
> -		bb_predicate = false;
> +		continue;
> +
>  	      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi);
>  		   gsi_next (&gsi))
>  		{
> @@ -2839,16 +2909,13 @@ analyze_function_body (struct cgraph_node
> *node, bool early)
>  		    will_be_nonconstant = bb_predicate &
> will_be_nonconstant;
>  		  if (will_be_nonconstant != true
>  		      && will_be_nonconstant != false)
> -		    /* This is slightly inprecise.  We may want to represent
> -		       each loop with independent predicate.  */
>  		    loop_stride = loop_stride & will_be_nonconstant;
>  		}
>  	    }
> +	  add_freqcounting_predicate (&s->loop_strides, loop_stride,
> +				      bb_freq);
>  	  free (body);
>  	}
> -      ipa_fn_summary *s = ipa_fn_summaries->get (node);
> -      set_hint_predicate (&s->loop_iterations, loop_iterations);
> -      set_hint_predicate (&s->loop_stride, loop_stride);
>        scev_finalize ();
>      }
>    FOR_ALL_BB_FN (bb, my_function)
> @@ -3640,14 +3707,32 @@ ipa_call_context::estimate_size_and_time (int
> *ret_size,
>    if (time > nonspecialized_time)
>      time = nonspecialized_time;
> 
> +  m_loops_with_known_iterations = 0;
> +  ipa_freqcounting_predicate *fcp;
> +  for (i = 0; vec_safe_iterate (info->loop_iterations, i, &fcp); i++)
> +    {
> +      gcc_assert (fcp->predicate);
> +      if (!fcp->predicate->evaluate (m_possible_truths))
> +	{
> +	  if (ret_hints)
> +	    hints |= INLINE_HINT_loop_iterations;
> +	  m_loops_with_known_iterations += fcp->freq;
> +	}
> +    }
> +  m_loops_with_known_strides = 0;
> +  for (i = 0; vec_safe_iterate (info->loop_strides, i, &fcp); i++)
> +    {
> +      gcc_assert (fcp->predicate);
> +      if (!fcp->predicate->evaluate (m_possible_truths))
> +	{
> +	  if (ret_hints)
> +	    hints |= INLINE_HINT_loop_stride;
> +	  m_loops_with_known_strides += fcp->freq;
> +	}
> +    }
> +
>    if (ret_hints)
>      {
> -      if (info->loop_iterations
> -	  && !info->loop_iterations->evaluate (m_possible_truths))
> -	hints |= INLINE_HINT_loop_iterations;
> -      if (info->loop_stride
> -	  && !info->loop_stride->evaluate (m_possible_truths))
> -	hints |= INLINE_HINT_loop_stride;
>        if (info->scc_no)
>  	hints |= INLINE_HINT_in_scc;
>        if (DECL_DECLARED_INLINE_P (m_node->decl)) @@ -3687,7 +3772,9 @@
> estimate_ipcp_clone_size_and_time (struct cgraph_node *node,
>  				   vec<ipa_agg_value_set> known_aggs,
>  				   int *ret_size, sreal *ret_time,
>  				   sreal *ret_nonspec_time,
> -				   ipa_hints *hints)
> +				   ipa_hints *hints,
> +				   sreal *loops_with_known_iterations,
> +				   sreal *loops_with_known_strides)
>  {
>    clause_t clause, nonspec_clause;
> 
> @@ -3699,6 +3786,8 @@ estimate_ipcp_clone_size_and_time (struct
> cgraph_node *node,
>  		        known_aggs, vNULL);
>    ctx.estimate_size_and_time (ret_size, NULL, ret_time,
>  			      ret_nonspec_time, hints);
> +  *loops_with_known_iterations = ctx.m_loops_with_known_iterations;
> +  *loops_with_known_strides = ctx.m_loops_with_known_strides;
>  }
> 
>  /* Return stack frame offset where frame of NODE is supposed to start
> inside @@ -3857,32 +3946,31 @@ remap_edge_summaries (struct
> cgraph_edge *inlined_edge,
>      }
>  }
> 
> -/* Same as remap_predicate, but set result into hint *HINT.  */
> +/* Run remap_after_inlining on each predicate in V.  */
> 
>  static void
> -remap_hint_predicate (class ipa_fn_summary *info,
> -		      class ipa_node_params *params_summary,
> -		      class ipa_fn_summary *callee_info,
> -		      predicate **hint,
> -		      vec<int> operand_map,
> -		      vec<int> offset_map,
> -		      clause_t possible_truths,
> -		      predicate *toplev_predicate)
> -{
> -  predicate p;
> +remap_freqcounting_predicate (class ipa_fn_summary *info,
> +			      class ipa_node_params *params_summary,
> +			      class ipa_fn_summary *callee_info,
> +			      vec<ipa_freqcounting_predicate, va_gc> *v,
> +			      vec<int> operand_map,
> +			      vec<int> offset_map,
> +			      clause_t possible_truths,
> +			      predicate *toplev_predicate)
> 
> -  if (!*hint)
> -    return;
> -  p = (*hint)->remap_after_inlining
> -			 (info, params_summary, callee_info,
> -			  operand_map, offset_map,
> -			  possible_truths, *toplev_predicate);
> -  if (p != false && p != true)
> +{
> +  ipa_freqcounting_predicate *fcp;
> +  for (int i = 0; vec_safe_iterate (v, i, &fcp); i++)
>      {
> -      if (!*hint)
> -	set_hint_predicate (hint, p);
> -      else
> -	**hint &= p;
> +      predicate p
> +	= fcp->predicate->remap_after_inlining (info, params_summary,
> +						callee_info, operand_map,
> +						offset_map, possible_truths,
> +						*toplev_predicate);
> +      if (p != false && p != true)
> +	/* FIXME: Is this really supposed to be &= and not a plain
> +	   assignment?  */
> +	*fcp->predicate &= p;
>      }
>  }
> 
> @@ -3992,12 +4080,12 @@ ipa_merge_fn_summary_after_inlining (struct
> cgraph_edge *edge)
>    remap_edge_summaries (edge, edge->callee, info, params_summary,
>  		 	callee_info, operand_map,
>  			offset_map, clause, &toplev_predicate);
> -  remap_hint_predicate (info, params_summary, callee_info,
> -			&callee_info->loop_iterations,
> -			operand_map, offset_map, clause,
> &toplev_predicate);
> -  remap_hint_predicate (info, params_summary, callee_info,
> -			&callee_info->loop_stride,
> -			operand_map, offset_map, clause,
> &toplev_predicate);
> +  remap_freqcounting_predicate (info, params_summary, callee_info,
> +				info->loop_iterations, operand_map,
> +				offset_map, clause, &toplev_predicate);
> +  remap_freqcounting_predicate (info, params_summary, callee_info,
> +				info->loop_strides, operand_map,
> +				offset_map, clause, &toplev_predicate);
> 
>    HOST_WIDE_INT stack_frame_offset = ipa_get_stack_frame_offset (edge-
> >callee);
>    HOST_WIDE_INT peak = stack_frame_offset + callee_info-
> >estimated_stack_size;
> @@ -4322,12 +4410,34 @@ inline_read_section (struct lto_file_decl_data
> *file_data, const char *data,
>  	    info->size_time_table->quick_push (e);
>  	}
> 
> -      p.stream_in (&ib);
> -      if (info)
> -        set_hint_predicate (&info->loop_iterations, p);
> -      p.stream_in (&ib);
> -      if (info)
> -        set_hint_predicate (&info->loop_stride, p);
> +      count2 = streamer_read_uhwi (&ib);
> +      for (j = 0; j < count2; j++)
> +	{
> +	  p.stream_in (&ib);
> +	  sreal fcp_freq = sreal::stream_in (&ib);
> +	  if (info)
> +	    {
> +	      ipa_freqcounting_predicate fcp;
> +	      fcp.predicate = NULL;
> +	      set_hint_predicate (&fcp.predicate, p);
> +	      fcp.freq = fcp_freq;
> +	      vec_safe_push (info->loop_iterations, fcp);
> +	    }
> +	}
> +      count2 = streamer_read_uhwi (&ib);
> +      for (j = 0; j < count2; j++)
> +	{
> +	  p.stream_in (&ib);
> +	  sreal fcp_freq = sreal::stream_in (&ib);
> +	  if (info)
> +	    {
> +	      ipa_freqcounting_predicate fcp;
> +	      fcp.predicate = NULL;
> +	      set_hint_predicate (&fcp.predicate, p);
> +	      fcp.freq = fcp_freq;
> +	      vec_safe_push (info->loop_strides, fcp);
> +	    }
> +	}
>        for (e = node->callees; e; e = e->next_callee)
>  	read_ipa_call_summary (&ib, e, info != NULL);
>        for (e = node->indirect_calls; e; e = e->next_callee) @@ -4487,14
> +4597,19 @@ ipa_fn_summary_write (void)
>  	      e->exec_predicate.stream_out (ob);
>  	      e->nonconst_predicate.stream_out (ob);
>  	    }
> -	  if (info->loop_iterations)
> -	    info->loop_iterations->stream_out (ob);
> - 	  else
> -	    streamer_write_uhwi (ob, 0);
> -	  if (info->loop_stride)
> -	    info->loop_stride->stream_out (ob);
> - 	  else
> -	    streamer_write_uhwi (ob, 0);
> +	  ipa_freqcounting_predicate *fcp;
> +	  streamer_write_uhwi (ob, vec_safe_length (info->loop_iterations));
> +	  for (i = 0; vec_safe_iterate (info->loop_iterations, i, &fcp); i++)
> +	    {
> +	      fcp->predicate->stream_out (ob);
> +	      fcp->freq.stream_out (ob);
> +	    }
> +	  streamer_write_uhwi (ob, vec_safe_length (info->loop_strides));
> +	  for (i = 0; vec_safe_iterate (info->loop_strides, i, &fcp); i++)
> +	    {
> +	      fcp->predicate->stream_out (ob);
> +	      fcp->freq.stream_out (ob);
> +	    }
>  	  for (edge = cnode->callees; edge; edge = edge->next_callee)
>  	    write_ipa_call_summary (ob, edge);
>  	  for (edge = cnode->indirect_calls; edge; edge = edge->next_callee)
> diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h index
> c6ddc9f3199..d8429afdbef 100644
> --- a/gcc/ipa-fnsummary.h
> +++ b/gcc/ipa-fnsummary.h
> @@ -101,6 +101,19 @@ public:
>    }
>  };
> 
> +/* Structure to capture how frequently some interesting events occur given
> a
> +   particular predicate.  The structure is used to estimate how often we
> +   encounter loops with known iteration count or stride in various
> +   contexts.  */
> +
> +struct GTY(()) ipa_freqcounting_predicate {
> +  /* The described event happens with this frequency... */
> +  sreal freq;
> +  /* ...when this predicate evaluates to false. */
> +  class predicate * GTY((skip)) predicate;
> +};
> +
>  /* Function inlining information.  */
>  class GTY(()) ipa_fn_summary
>  {
> @@ -112,8 +125,9 @@ public:
>        inlinable (false), single_caller (false),
>        fp_expressions (false), estimated_stack_size (false),
>        time (0), conds (NULL),
> -      size_time_table (NULL), call_size_time_table (NULL), loop_iterations
> (NULL),
> -      loop_stride (NULL), growth (0), scc_no (0)
> +      size_time_table (NULL), call_size_time_table (NULL),
> +      loop_iterations (NULL), loop_strides (NULL),
> +      growth (0), scc_no (0)
>    {
>    }
> 
> @@ -125,13 +139,12 @@ public:
>      estimated_stack_size (s.estimated_stack_size),
>      time (s.time), conds (s.conds), size_time_table (s.size_time_table),
>      call_size_time_table (NULL),
> -    loop_iterations (s.loop_iterations), loop_stride (s.loop_stride),
> +    loop_iterations (s.loop_iterations), loop_strides (s.loop_strides),
>      growth (s.growth), scc_no (s.scc_no)
>    {}
> 
>    /* Default constructor.  */
>    ~ipa_fn_summary ();
> -
>    /* Information about the function body itself.  */
> 
>    /* Minimal size increase after inlining.  */ @@ -164,12 +177,10 @@ public:
>    vec<size_time_entry, va_gc> *size_time_table;
>    vec<size_time_entry, va_gc> *call_size_time_table;
> 
> -  /* Predicate on when some loop in the function becomes to have known
> -     bounds.   */
> -  predicate * GTY((skip)) loop_iterations;
> -  /* Predicate on when some loop in the function becomes to have known
> -     stride.   */
> -  predicate * GTY((skip)) loop_stride;
> +  /* Predicates on when some loops in the function can have known
> +     bounds.  */
> +  vec<ipa_freqcounting_predicate, va_gc> *loop_iterations;
> +  /* Predicates on when some loops in the function can have known
> +     strides.  */
> +  vec<ipa_freqcounting_predicate, va_gc> *loop_strides;
>    /* Estimated growth for inlining all copies of the function before start
>       of small functions inlining.
>       This value will get out of date as the callers are duplicated, but @@ -316,6
> +327,15 @@ public:
>    {
>      return m_node != NULL;
>    }
> +
> +
> +  /* How often loops will have known iterations.  Calculated in
> +     estimate_size_and_time.  */
> +  sreal m_loops_with_known_iterations;
> +  /* How often loops will have known strides.  Calculated in
> +     estimate_size_and_time. */
> +  sreal m_loops_with_known_strides;
> +
>  private:
>    /* Called function.  */
>    cgraph_node *m_node;
> @@ -353,7 +373,7 @@ void estimate_ipcp_clone_size_and_time (struct
> cgraph_node *,
>  					vec<ipa_polymorphic_call_context>,
>  					vec<ipa_agg_value_set>,
>  					int *, sreal *, sreal *,
> -				        ipa_hints *);
> +				        ipa_hints *, sreal *, sreal *);
>  void ipa_merge_fn_summary_after_inlining (struct cgraph_edge *edge);
> void ipa_update_overall_fn_summary (struct cgraph_node *node, bool
> reset = true);  void compute_fn_summary (struct cgraph_node *, bool); diff -
> -git a/gcc/params.opt b/gcc/params.opt index f39e5d1a012..2a5f3d61727
> 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -211,7 +211,7 @@ Common Joined UInteger
> Var(param_ipa_cp_single_call_penalty) Init(15) IntegerRan  Percentage
> penalty functions containing a single call to another function will receive
> when they are evaluated for cloning.
> 
>  -param=ipa-cp-unit-growth=
> -Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(10) Param Optimization
> +Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(80) Param Optimization
>  How much can given compilation unit grow because of the interprocedural
> constant propagation (in percent).
> 
>  -param=ipa-cp-value-list-size=
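
To make the intent of the hint_time_bonus change above concrete, here is a
minimal standalone sketch of the heuristic outside of GCC.  All names are
hypothetical and plain doubles stand in for GCC's sreal type; it only models
the idea of replacing one flat loop-hint bonus with a frequency-weighted sum:

```c
#include <stddef.h>

/* Simplified, hypothetical model of the hint_time_bonus change (not GCC
   code): the old heuristic awarded one fixed bonus whenever any loop got a
   known iteration count; the new one sums bonus * header-frequency over all
   such loops, so hot loops weigh more in the IPA-CP evaluation.  */

static int
flat_hint_bonus (int have_hint, int bonus)
{
  /* Old behaviour: a single fixed bonus, regardless of how many loops
     benefit or how hot they are.  */
  return have_hint ? bonus : 0;
}

static int
weighted_hint_bonus (const double *header_freqs, size_t n, int bonus)
{
  /* New behaviour: accumulate the bonus scaled by how often each loop
     header executes relative to the function entry.  */
  double result = 0.0;
  for (size_t i = 0; i < n; i++)
    result += header_freqs[i] * bonus;
  return (int) result;
}
```

With three loops whose headers run 2, 0.5 and 10 times per invocation and a
per-loop bonus of 64, the flat model yields 64 while the weighted model
yields 800, which is why the new heuristic favours cloning functions whose
hot loops gain known bounds.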

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-21  9:17             ` Tamar Christina
@ 2020-09-08 14:00               ` Martin Jambor
  2020-09-11 10:38                 ` Tamar Christina
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Jambor @ 2020-09-08 14:00 UTC (permalink / raw)
  To: Tamar Christina, Richard Sandiford, luoxhu via Gcc-patches
  Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc

Hi,

On Fri, Aug 21 2020, Tamar Christina wrote:
>> 
>> Honza's changes have been motivated to big extent as an enabler for IPA-CP
>> heuristics changes to actually speed up 548.exchange2_r.
>> 
>> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two
>> weeks ago, this week it is 403, but with my WIP (and so far untested) patch
>> below it is just 276 seconds - faster than one built with GCC 8 which needs
>> 283 seconds.
>> 
>> I'll be interested in knowing if it also works this well on other architectures.
>> 

I have posted the new version of the patch series to the mailing list
yesterday and I have also pushed the branch to the FSF repo as
refs/users/jamborm/heads/ipa-context_and_exchange-200907

>
> Many thanks for working on this!
>
> I tried this on an AArch64 Neoverse-N1 machine and didn't see any difference.
> Do I need any flags for it to work? The patch was applied on top of 656218ab982cc22b826227045826c92743143af1
>

I only have access to a fairly old AMD (Seattle) Opteron A1100, which might
not support some interesting AArch64 ISA extensions, but I can measure a
significant speedup on it (everything with just -Ofast -march=native
-mtune=native, no non-default parameters, without LTO, without any
inlining options):

  GCC 10 branch:              915 seconds
  Master (rev. 995bb851ffe):  989 seconds
  My branch:                  827 seconds

(All numbers are 548.exchange2_r reference run times.)

> And I tried 3 runs
> 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once

This is the first time I have seen -fno-inline-functions-called-once used in
this context.  It seems to indicate we are looking at another problem, one
that at least I have not been aware of yet.  Can you please upload the
inlining WPA dumps with and without the option somewhere?

Similarly, I do not need LTO for the speedup on x86_64.

The patches in the series should also remove the need for --param
ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80.  If you still need
them on my branch, could you please again provide me with (WPA, if with
LTO) ipa-cp dumps with and without them?


> 2) -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once
> 3) -mcpu=native -Ofast -fomit-frame-pointer -flto
>
> The first one used to give us the best result; with this patch there's no difference between 1 and 2 (an 11% regression), and the 3rd one is about another 15% slower on top of that.

OK, so the patch did help (though above you wrote it did not?), but not
enough to be as fast as some previous revision, and on top of that
-fno-inline-functions-called-once helps further but again not enough?

If correct, this looks like we need to examine what goes wrong
specifically in the case of Neoverse-N1 though.

Thanks,

Martin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-09-08 14:00               ` Martin Jambor
@ 2020-09-11 10:38                 ` Tamar Christina
  2020-09-11 12:45                   ` Martin Jambor
  0 siblings, 1 reply; 21+ messages in thread
From: Tamar Christina @ 2020-09-11 10:38 UTC (permalink / raw)
  To: Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches
  Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc

Hi Martin,

> -----Original Message-----
> From: Martin Jambor <mjambor@suse.cz>
> Sent: Tuesday, September 8, 2020 3:01 PM
> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>; luoxhu via Gcc-patches <gcc-
> patches@gcc.gnu.org>
> Cc: segher@kernel.crashing.org; luoxhu <luoxhu@linux.ibm.com>;
> wschmidt@linux.ibm.com; linkw@gcc.gnu.org; Jan Hubicka
> <hubicka@ucw.cz>; dje.gcc@gmail.com
> Subject: RE: [PATCH] ipa-inline: Improve growth accumulation for recursive
> calls
> 
> Hi,
> 
> On Fri, Aug 21 2020, Tamar Christina wrote:
> >>
> >> Honza's changes have been motivated to big extent as an enabler for
> >> IPA-CP heuristics changes to actually speed up 548.exchange2_r.
> >>
> >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds
> two
> >> weeks ago, this week it is 403, but with my WIP (and so far untested)
> >> patch below it is just 276 seconds - faster than one built with GCC 8
> >> which needs
> >> 283 seconds.
> >>
> >> I'll be interested in knowing if it also works this well on other architectures.
> >>
> 
> I have posted the new version of the patch series to the mailing list
> yesterday and I have also pushed the branch to the FSF repo as
> refs/users/jamborm/heads/ipa-context_and_exchange-200907
> 

Thanks! I've pushed it through our CI with a host of different options (see below)

> >
> > Many thanks for working on this!
> >
> > I tried this on an AArch64 Neoverse-N1 machine and didn't see any
> difference.
> > Do I need any flags for it to work? The patch was applied on top of
> > 656218ab982cc22b826227045826c92743143af1
> >
> 
> I only have access to fairly old AMD (Seattle) Opteron 1100 which might not
> support some interesting Aarch64 ISA extensions but I can measure a
> significant speedup on it (everything with just -Ofast -march=native -
> mtune=native, no non-default parameters, without LTO, without any inlining
> options):
> 
>   GCC 10 branch:              915 seconds
>   Master (rev. 995bb851ffe):  989 seconds
>   My branch:                  827 seconds
> 
> (All is 548.exchange_r reference run time.)
> 

All runs additionally use -mcpu=native -Ofast -fomit-frame-pointer; the last
column is the run-time difference relative to the first (GCC 10) configuration:

+----------+-------------------------------------------------------------------------------------------------------+-------------+
| Compiler | Extra flags                                                                                           | diff GCC 10 |
+----------+-------------------------------------------------------------------------------------------------------+-------------+
| GCC 10   | -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | (baseline)  |
| GCC 10   | (none)                                                                                                | -44%        |
| GCC 10   | -flto                                                                                                 | -36%        |
| GCC 11   | -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%        |
| Branch   | -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80                                   | -22%        |
| Branch   | -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%        |
+----------+-------------------------------------------------------------------------------------------------------+-------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |  |  |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |  |  |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+

(Hopefully the table shows up correct)

It looks like your patch definitely does improve the basic cases: there's not much difference between LTO and non-LTO anymore, and it's much
better than GCC 10. However, it still contains the regression introduced by Honza's changes.
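(Side note on reading the table: I'm assuming the "diff GCC 10" column is the relative difference in SPEC rate against the baseline row; if so, the arithmetic is just the sketch below, with made-up example numbers:)

```shell
# Hypothetical example: relative rate difference vs. the baseline row.
# SPEC rates are higher-is-better, so a negative diff means "slower".
base=10.0   # made-up baseline rate
cand=8.8    # made-up candidate rate
awk -v b="$base" -v c="$cand" 'BEGIN { printf "%.0f%%\n", (c / b - 1) * 100 }'
# prints: -12%
```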

> > And I tried 3 runs
> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> > -fno-inline-functions-called-once
> 
> This is the first time I saw -fno-inline-functions-called-once used in this
> context.  This seems to indicate we are looking at another problem that at
> least I have not known about yet.  Can you please upload somewhere the
> inlining WPA dumps with and without the option?

We used it to cover up for the register allocation issue where inlining some large
functions would cause massive spilling.  It looks like it still has an effect now, but even
with it we're still seeing the 12% regression.

Which option is this? -fdump-ipa-cgraph?

Thanks,
Tamar

> 
> Similarly, I do not need LTO for the speedup on x86_64.
> 
> The patches in the series should also remove the need for --param
> ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 If you still need them
> on my branch, could you please again provide me with (WPA, if with
> LTO) ipa-cp dumps with and without them?
> 
> 
> > 2) -mcpu=native -Ofast -fomit-frame-pointer -flto
> > -fno-inline-functions-called-once
> > 3) -mcpu=native -Ofast -fomit-frame-pointer -flto
> >
> > First one used to give us the best result, with this patch there's no
> difference between 1 and 2 (11% regression) and the 3rd one is about 15%
> on top of that.
> 
> OK, so the patch did help (but above you wrote it did not?) but not enough
> to be as fast as some previous revision and on top of that -fno-inline-
> functions-called-once further helps but again not enough?
> 
> If correct, this looks like we need to examine what goes wrong specifically in
> the case of Neoverse-N1 though.
> 
> Thanks,
> 
> Martin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-09-11 10:38                 ` Tamar Christina
@ 2020-09-11 12:45                   ` Martin Jambor
  2020-09-11 17:36                     ` Tamar Christina
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Jambor @ 2020-09-11 12:45 UTC (permalink / raw)
  To: Tamar Christina, Richard Sandiford, luoxhu via Gcc-patches
  Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc

Hi,

On Fri, Sep 11 2020, Tamar Christina wrote:
> Hi Martin,
>
>> On Fri, Aug 21 2020, Tamar Christina wrote:
>> >>
>> >> Honza's changes have been motivated to big extent as an enabler for
>> >> IPA-CP heuristics changes to actually speed up 548.exchange2_r.
>> >>
>> >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds
>> two
>> >> weeks ago, this week it is 403, but with my WIP (and so far untested)
>> >> patch below it is just 276 seconds - faster than one built with GCC 8
>> >> which needs
>> >> 283 seconds.
>> >>
>> >> I'll be interested in knowing if it also works this well on other architectures.
>> >>
>> 
>> I have posted the new version of the patch series to the mailing list
>> yesterday and I have also pushed the branch to the FSF repo as
>> refs/users/jamborm/heads/ipa-context_and_exchange-200907
>> 
>
> Thanks! I've pushed it through our CI with a host of different options (see below)
>
>> >
>> > Many thanks for working on this!
>> >
>> > I tried this on an AArch64 Neoverse-N1 machine and didn't see any
>> difference.
>> > Do I need any flags for it to work? The patch was applied on top of
>> > 656218ab982cc22b826227045826c92743143af1
>> >
>> 
>> I only have access to fairly old AMD (Seattle) Opteron 1100 which might not
>> support some interesting Aarch64 ISA extensions but I can measure a
>> significant speedup on it (everything with just -Ofast -march=native -
>> mtune=native, no non-default parameters, without LTO, without any inlining
>> options):
>> 
>>   GCC 10 branch:              915 seconds
>>   Master (rev. 995bb851ffe):  989 seconds
>>   My branch:                  827 seconds
>> 
>> (All is 548.exchange_r reference run time.)
>> 
>
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | Compiler | Flags                                                                                                                                            | diff GCC 10  |  |  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once |              |  |  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -44%         |  |  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -36%         |  |  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |  |  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80                                   | -22%         |  |  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |  |  |

Can you please confirm that the difference between these two is all due
to the last option, -fno-inline-functions-called-once?  Is LTO
necessary?  I.e., can you run the benchmark also built with the branch
compiler and
-mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?

> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |  |  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |  |  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+

>
> (Hopefully the table shows up correct)

it does show OK for me, thanks.

>
> It looks like your patch definitely does improve the basic cases. So there's not much difference between lto and non-lto anymore and it's much
> Better than GCC 10. However it still contains the regression introduced by Honza's changes.

I assume these are rates, not times, so negative means bad.  But do I
understand it correctly that you're comparing against GCC 10 with the
two parameters set to rather special values?  Because your table seems
to indicate that even for you, the branch is faster than GCC 10 with
just -mcpu=native -Ofast -fomit-frame-pointer.

So is the problem that the best obtainable run-time, even with obscure
options, from the branch is slower than the best obtainable run-time
from GCC 10?

>
>> > And I tried 3 runs
>> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
>> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>> > -fno-inline-functions-called-once
>> 
>> This is the first time I saw -fno-inline-functions-called-once used in this
>> context.  This seems to indicate we are looking at another problem that at
>> least I have not known about yet.  Can you please upload somewhere the
>> inlining WPA dumps with and without the option?
>
> We used it to cover up for the register allocation issue where in lining some large
> functions would cause massive spilling.  Looks like it still has an effect now but even
> with it we're still seeing the 12% regression.
>
> Which option is this? -fdump-ipa-cgraph?

-fdump-ipa-inline-details and -fdump-ipa-cp-details.
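(For completeness, a sketch of how one might collect them; the file names below are hypothetical, and with -flto the dump flags also need to be given at link time so the WPA-stage lto1 writes its dumps:)

```shell
# Without LTO: the IPA dumps are emitted during compilation,
# next to the object (e.g. exchange2.F90.*i.inline).
gfortran -Ofast -mcpu=native \
  -fdump-ipa-inline-details -fdump-ipa-cp-details \
  -c exchange2.F90

# With LTO: pass the dump flags to the link step as well, since
# the ipa-inline/ipa-cp decisions are made during the WPA stage.
gfortran -Ofast -mcpu=native -flto \
  -fdump-ipa-inline-details -fdump-ipa-cp-details \
  exchange2.F90 -o exchange2
```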

It would be nice if the slowdown was all due to the inliner.  But the
predictors changes might of course have quite an impact also on other
optimizations.

Martin



* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-09-11 12:45                   ` Martin Jambor
@ 2020-09-11 17:36                     ` Tamar Christina
  2020-09-16  8:35                       ` Hongyu Wang
  2020-10-16  8:46                       ` Xionghu Luo
  0 siblings, 2 replies; 21+ messages in thread
From: Tamar Christina @ 2020-09-11 17:36 UTC (permalink / raw)
  To: Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches
  Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc

Hi Martin,

> 
> can you please confirm that the difference between these two is all due to
> the last option -fno-inline-functions-called-once ?  Is LTo necessary?  I.e., can
> you run the benchmark also built with the branch compiler and -mcpu=native
> -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
> 

Done, see below.

> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |  |  |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |  |  |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> 
> >
> > (Hopefully the table shows up correct)
> 
> it does show OK for me, thanks.
> 
> >
> > It looks like your patch definitely does improve the basic cases. So
> > there's not much difference between lto and non-lto anymore and it's
> much Better than GCC 10. However it still contains the regression introduced
> by Honza's changes.
> 
> I assume these are rates, not times, so negative means bad.  But do I
> understand it correctly that you're comparing against GCC 10 with the two
> parameters set to rather special values?  Because your table seems to
> indicate that even for you, the branch is faster than GCC 10 with just -
> mcpu=native -Ofast -fomit-frame-pointer.

Yes, these are indeed rates, and I am indeed comparing against the same
options we previously used to get the fastest rates, namely the two
parameters plus the inline flag.

> 
> So is the problem that the best obtainable run-time, even with obscure
> options, from the branch is slower than the best obtainable run-time from
> GCC 10?
> 

Yeah, that's the problem: when we compare the two we're still behind.

I've done the additional two runs

+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| Compiler | Flags                                                                                                                                            | diff GCC 10  |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once |              |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -44%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -36%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80                                   | -22%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once                                                                 | -12%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once                                                                       | -11%         |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+

And this confirms that LTO isn't needed and that the branch without any
options is indeed much better than GCC 10 without any options.

It also confirms that the only remaining difference is due to
-fno-inline-functions-called-once.

> >
> >> > And I tried 3 runs
> >> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
> >> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> >> > -fno-inline-functions-called-once
> >>
> >> This is the first time I saw -fno-inline-functions-called-once used
> >> in this context.  This seems to indicate we are looking at another
> >> problem that at least I have not known about yet.  Can you please
> >> upload somewhere the inlining WPA dumps with and without the option?
> >
> > We used it to cover up for the register allocation issue where in
> > lining some large functions would cause massive spilling.  Looks like
> > it still has an effect now but even with it we're still seeing the 12%
> regression.
> >
> > Which option is this? -fdump-ipa-cgraph?
> 
> -fdump-ipa-inline-details and -fdump-ipa-cp-details.

I've kicked off the CI runs and will get you the dumps on Monday.

Cheers,
Tamar

> 
> It would be nice if the slowdown was all due to the inliner.  But the predictors
> changes might of course have quite an impact also on other optimizations.
> 
> Martin



* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-09-11 17:36                     ` Tamar Christina
@ 2020-09-16  8:35                       ` Hongyu Wang
  2020-10-16  8:46                       ` Xionghu Luo
  1 sibling, 0 replies; 21+ messages in thread
From: Hongyu Wang @ 2020-09-16  8:35 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches, segher,
	luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc

Tamar Christina <Tamar.Christina@arm.com> 于2020年9月12日周六 上午1:39写道:

> Hi Martin,
>
> >
> > can you please confirm that the difference between these two is all due
> to
> > the last option -fno-inline-functions-called-once ?  Is LTo necessary?
> I.e., can
> > you run the benchmark also built with the branch compiler and
> -mcpu=native
> > -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
> >
>
> Done, see below.
>
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |  |  |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |  |  |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
> >
> > >
> > > (Hopefully the table shows up correct)
> >
> > it does show OK for me, thanks.
> >
> > >
> > > It looks like your patch definitely does improve the basic cases. So
> > > there's not much difference between lto and non-lto anymore and it's
> > much Better than GCC 10. However it still contains the regression
> introduced
> > by Honza's changes.
> >
> > I assume these are rates, not times, so negative means bad.  But do I
> > understand it correctly that you're comparing against GCC 10 with the two
> > parameters set to rather special values?  Because your table seems to
> > indicate that even for you, the branch is faster than GCC 10 with just -
> > mcpu=native -Ofast -fomit-frame-pointer.
>
> Yes these are indeed rates, and indeed I am comparing against the same
> options
> we used to get the fastest rates on before which is the two parameters and
> the inline flag.
>
> >
> > So is the problem that the best obtainable run-time, even with obscure
> > options, from the branch is slower than the best obtainable run-time from
> > GCC 10?
> >
>
> Yeah that's the problem, when we compare the two we're still behind.
>
> I've done the additional two runs
>
>
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Compiler | Flags                                                                                                                                            | diff GCC 10  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once |              |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -44%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -36%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80                                   | -22%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once                                                                 | -12%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once                                                                       | -11%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
>
> And this confirms that indeed LTO isn't needed and that the branch
> without any options is indeed much better than it was on GCC 10 without
> any options.
>
> It also confirms that the only remaining difference is in the
> -fno-inline-functions-called-once
>
> > >
> > >> > And I tried 3 runs
> > >> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
> > >> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> > >> > -fno-inline-functions-called-once
> > >>
> > >> This is the first time I saw -fno-inline-functions-called-once used
> > >> in this context.  This seems to indicate we are looking at another
> > >> problem that at least I have not known about yet.  Can you please
> > >> upload somewhere the inlining WPA dumps with and without the option?
> > >
> > > We used it to cover up for the register allocation issue where in
> > > lining some large functions would cause massive spilling.  Looks like
> > > it still has an effect now but even with it we're still seeing the 12%
> > regression.
> > >
> > > Which option is this? -fdump-ipa-cgraph?
> >
> > -fdump-ipa-inline-details and -fdump-ipa-cp-details.
>
> I've kicked off the CI runs and will get you the dumps on Monday.
>
> Cheers,
> Tamar
>
> >
> > It would be nice if the slowdown was all due to the inliner.  But the
> predictors
> > changes might of course have quite an impact also on other optimizations.
> >
> > Martin
>
>
Hi Martin,

Thanks for your work. In case you are interested, here is the exchange2
result for your branch on our Cascadelake server (based on Tamar's test and
our regular configuration):

| Compiler | Flags                                                                                                                                     | single-core diff GCC10 | multi-core diff GCC10 |
|----------|-------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------|
| GCC10.1  | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -                      | -                     |
| GCC10.1  | -march=native -Ofast -funroll-loops                                                                                                       | -32%                   | -37%                  |
| GCC10.1  | -march=native -Ofast -funroll-loops -flto                                                                                                 | -32%                   | -37%                  |
| GCC11    | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80                                   | -39%                   | -28%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto                                                                                                 | -39%                   | -28%                  |
| Branch   | -march=native -Ofast -funroll-loops                                                                                                       | -41%                   | -29%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto -fno-inline-functions-called-once                                                               | -19%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -fno-inline-functions-called-once                                                                     | -20%                   | -13%                  |

For multi-core tests, it can provide better performance without extra ipa
options, but there is still a 12% regression compared with GCC 10's best score.

Also for single-core, there's about a 7% gap between the branch and GCC 10.1.

Regards,
Hongyu Wang


* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-09-11 17:36                     ` Tamar Christina
  2020-09-16  8:35                       ` Hongyu Wang
@ 2020-10-16  8:46                       ` Xionghu Luo
  2021-01-21 14:34                         ` Tamar Christina
  1 sibling, 1 reply; 21+ messages in thread
From: Xionghu Luo @ 2020-10-16  8:46 UTC (permalink / raw)
  To: Tamar Christina, Martin Jambor, Richard Sandiford,
	luoxhu via Gcc-patches
  Cc: segher, wschmidt, linkw, Jan Hubicka, dje.gcc



On 2020/9/12 01:36, Tamar Christina wrote:
> Hi Martin,
> 
>>
>> can you please confirm that the difference between these two is all due to
>> the last option -fno-inline-functions-called-once ?  Is LTo necessary?  I.e., can
>> you run the benchmark also built with the branch compiler and -mcpu=native
>> -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
>>
> 
> Done, see below.
> 
>>> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
>>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |  |  |
>>> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
>>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |  |  |
>>> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+
>>
>>>
>>> (Hopefully the table shows up correct)
>>
>> it does show OK for me, thanks.
>>
>>>
>>> It looks like your patch definitely does improve the basic cases. So
>>> there's not much difference between lto and non-lto anymore and it's
>>> much better than GCC 10. However, it still contains the regression
>>> introduced by Honza's changes.
>>
>> I assume these are rates, not times, so negative means bad.  But do I
>> understand it correctly that you're comparing against GCC 10 with the two
>> parameters set to rather special values?  Because your table seems to
>> indicate that even for you, the branch is faster than GCC 10 with just -
>> mcpu=native -Ofast -fomit-frame-pointer.
> 
> Yes these are indeed rates, and indeed I am comparing against the same options
> we used to get the fastest rates on before which is the two parameters and
> the inline flag.
> 
>>
>> So is the problem that the best obtainable run-time, even with obscure
>> options, from the branch is slower than the best obtainable run-time from
>> GCC 10?
>>
> 
> Yeah that's the problem, when we compare the two we're still behind.
> 
> I've done the additional two runs
> 
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Compiler | Flags                                                                                                                                            | diff GCC 10  |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once |              |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -44%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -36%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80                                   | -22%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once                                                                 | -12%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once                                                                       | -11%         |
> +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> 
> And this confirms that indeed LTO isn't needed and that the branch
> without any options is indeed much better than it was on GCC 10 without any options.
> 
> It also confirms that the only remaining difference is in the -fno-inline-functions-called-once

If -fno-inline-functions-called-once is added, the recursive call
function digits_2 won't be inlined: each digits_2 clone is specialized
and called only once, so the performance recovery is expected. I guess
it is somewhat similar to -fno-inline for this case.

@Jambor @Honza Any progress on this (--param controlling maximal
recursion depth) and the other regression about LOOP_GUARD_WITH_PREDICTION in
PR96825 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96825), please? :)
I tested the current master FSF code, and the regression still exists...


Thanks,
Xionghu




* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-10-16  8:46                       ` Xionghu Luo
@ 2021-01-21 14:34                         ` Tamar Christina
  2021-01-21 15:11                           ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: Tamar Christina @ 2021-01-21 14:34 UTC (permalink / raw)
  To: Xionghu Luo, Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches
  Cc: segher, wschmidt, linkw, Jan Hubicka, dje.gcc, James Greenhalgh

Hi All,

James and I have been investigating this regression and have tracked it down to register allocation.

I have created a new PR with our findings
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782), but unfortunately
we don't know how to proceed.

This does seem like a genuine bug in RA.  It looks like some magic threshold has been crossed, but we're having
trouble determining what this magic number is.

Any help is appreciated.

Thanks,
Tamar

> -----Original Message-----
> From: Xionghu Luo <luoxhu@linux.ibm.com>
> Sent: Friday, October 16, 2020 9:47 AM
> To: Tamar Christina <Tamar.Christina@arm.com>; Martin Jambor
> <mjambor@suse.cz>; Richard Sandiford <Richard.Sandiford@arm.com>;
> luoxhu via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: segher@kernel.crashing.org; wschmidt@linux.ibm.com;
> linkw@gcc.gnu.org; Jan Hubicka <hubicka@ucw.cz>; dje.gcc@gmail.com
> Subject: Re: [PATCH] ipa-inline: Improve growth accumulation for recursive
> calls
> 
> 
> 
> On 2020/9/12 01:36, Tamar Christina wrote:
> > Hi Martin,
> >
> >>
> >> can you please confirm that the difference between these two is all
> >> due to the last option -fno-inline-functions-called-once ?  Is LTo
> >> necessary?  I.e., can you run the benchmark also built with the
> >> branch compiler and -mcpu=native -Ofast -fomit-frame-pointer -fno-
> inline-functions-called-once ?
> >>
> >
> > Done, see below.
> >
> >>> +----------+------------------------------------------------+------+
> >>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
> >>> +----------+------------------------------------------------+------+
> >>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |
> >>> +----------+------------------------------------------------+------+
> >>
> >>>
> >>> (Hopefully the table shows up correct)
> >>
> >> it does show OK for me, thanks.
> >>
> >>>
> >>> It looks like your patch definitely does improve the basic cases. So
> >>> there's not much difference between lto and non-lto anymore and it's
> >>> much better than GCC 10. However, it still contains the regression
> >>> introduced by Honza's changes.
> >>
> >> I assume these are rates, not times, so negative means bad.  But do I
> >> understand it correctly that you're comparing against GCC 10 with the
> >> two parameters set to rather special values?  Because your table
> >> seems to indicate that even for you, the branch is faster than GCC 10
> >> with just - mcpu=native -Ofast -fomit-frame-pointer.
> >
> > Yes these are indeed rates, and indeed I am comparing against the same
> > options we used to get the fastest rates on before which is the two
> > parameters and the inline flag.
> >
> >>
> >> So is the problem that the best obtainable run-time, even with
> >> obscure options, from the branch is slower than the best obtainable
> >> run-time from GCC 10?
> >>
> >
> > Yeah that's the problem, when we compare the two we're still behind.
> >
> > I've done the additional two runs
> >
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | Compiler | Flags                                                                                                                                            | diff GCC 10  |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once |              |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -44%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -36%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80                                   | -22%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once                                                                 | -12%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once                                                                       | -11%         |
> > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> >
> > And this confirms that indeed LTO isn't needed and that the branch
> > without any options is indeed much better than it was on GCC 10 without
> any options.
> >
> > It also confirms that the only remaining difference is in the
> > -fno-inline-functions-called-once
> 
> If -fno-inline-functions-called-once is added, the recursive call function
> digits_2 won't be inlined: each digits_2 clone is specialized and called
> only once, so the performance recovery is expected. I guess it is somewhat
> similar to -fno-inline for this case.
> 
> @Jambor @Honza Any progress about this (--param controlling maximal
> recursion depth) and the other regression about
> LOOP_GUARD_WITH_PREDICTION in
> PR96825(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96825) please? :) I
> tested the current master FSF code, the regression still exists...
> 
> 
> Thanks,
> Xionghu
> 



* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2021-01-21 14:34                         ` Tamar Christina
@ 2021-01-21 15:11                           ` Jan Hubicka
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2021-01-21 15:11 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Xionghu Luo, Martin Jambor, Richard Sandiford,
	luoxhu via Gcc-patches, segher, wschmidt, linkw, dje.gcc,
	James Greenhalgh

> Hi All,
> 
> James and I have been investigating this regression and have tracked it down to register allocation.
> 
> I have created a new PR with our findings
> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782), but unfortunately
> we don't know how to proceed.
> 
> This does seem like a genuine bug in RA.  It looks like some magic threshold has been crossed, but we're having
> trouble determining what this magic number is.

Thank you for the analysis - it was on my TODO list for a very long
time, but the function is large.  I will read it carefully and let's see
if we can come up with something useful.

Honza
> 
> Any help is appreciated.
> 
> Thanks,
> Tamar
> 
> > -----Original Message-----
> > From: Xionghu Luo <luoxhu@linux.ibm.com>
> > Sent: Friday, October 16, 2020 9:47 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>; Martin Jambor
> > <mjambor@suse.cz>; Richard Sandiford <Richard.Sandiford@arm.com>;
> > luoxhu via Gcc-patches <gcc-patches@gcc.gnu.org>
> > Cc: segher@kernel.crashing.org; wschmidt@linux.ibm.com;
> > linkw@gcc.gnu.org; Jan Hubicka <hubicka@ucw.cz>; dje.gcc@gmail.com
> > Subject: Re: [PATCH] ipa-inline: Improve growth accumulation for recursive
> > calls
> > 
> > 
> > 
> > On 2020/9/12 01:36, Tamar Christina wrote:
> > > Hi Martin,
> > >
> > >>
> > >> can you please confirm that the difference between these two is all
> > >> due to the last option -fno-inline-functions-called-once ?  Is LTo
> > >> necessary?  I.e., can you run the benchmark also built with the
> > >> branch compiler and -mcpu=native -Ofast -fomit-frame-pointer -fno-
> > inline-functions-called-once ?
> > >>
> > >
> > > Done, see below.
> > >
> > >>> +----------+------------------------------------------------+------+
> > >>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
> > >>> +----------+------------------------------------------------+------+
> > >>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |
> > >>> +----------+------------------------------------------------+------+
> > >>
> > >>>
> > >>> (Hopefully the table shows up correct)
> > >>
> > >> it does show OK for me, thanks.
> > >>
> > >>>
> > >>> It looks like your patch definitely does improve the basic cases. So
> > >>> there's not much difference between lto and non-lto anymore and it's
> > >>> much better than GCC 10. However, it still contains the regression
> > >>> introduced by Honza's changes.
> > >>
> > >> I assume these are rates, not times, so negative means bad.  But do I
> > >> understand it correctly that you're comparing against GCC 10 with the
> > >> two parameters set to rather special values?  Because your table
> > >> seems to indicate that even for you, the branch is faster than GCC 10
> > >> with just - mcpu=native -Ofast -fomit-frame-pointer.
> > >
> > > Yes these are indeed rates, and indeed I am comparing against the same
> > > options we used to get the fastest rates on before which is the two
> > > parameters and the inline flag.
> > >
> > >>
> > >> So is the problem that the best obtainable run-time, even with
> > >> obscure options, from the branch is slower than the best obtainable
> > >> run-time from GCC 10?
> > >>
> > >
> > > Yeah that's the problem, when we compare the two we're still behind.
> > >
> > > I've done the additional two runs
> > >
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | Compiler | Flags                                                                                                                                            | diff GCC 10  |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once |              |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -44%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -36%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80                                   | -22%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                   | -24%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                         | -26%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once                                                                 | -12%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once                                                                       | -11%         |
> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> > >
> > > And this confirms that indeed LTO isn't needed and that the branch
> > > without any options is indeed much better than it was on GCC 10 without
> > any options.
> > >
> > > It also confirms that the only remaining difference is in the
> > > -fno-inline-functions-called-once
> > 
> > If -fno-inline-functions-called-once is added, the recursive call function
> > digits_2 won't be inlined: each digits_2 clone is specialized and called
> > only once, so the performance recovery is expected. I guess it is somewhat
> > similar to -fno-inline for this case.
> > 
> > @Jambor @Honza Any progress about this (--param controlling maximal
> > recursion depth) and the other regression about
> > LOOP_GUARD_WITH_PREDICTION in
> > PR96825(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96825) please? :) I
> > tested the current master FSF code, the regression still exists...
> > 
> > 
> > Thanks,
> > Xionghu
> > 
> 


end of thread, other threads:[~2021-01-21 15:11 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-12 10:19 [PATCH] ipa-inline: Improve growth accumulation for recursive calls Xionghu Luo
2020-08-12 17:53 ` Jan Hubicka
2020-08-12 19:03   ` Richard Biener
2020-08-13  0:29     ` Segher Boessenkool
2020-08-13  7:53       ` Jan Hubicka
2020-08-13  3:59   ` Feng Xue OS
2020-08-13  6:51   ` luoxhu
2020-08-13 12:52     ` Jan Hubicka
2020-08-14  9:04       ` luoxhu
2020-08-20  9:31         ` Richard Sandiford
2020-08-20 10:49           ` [PATCH] ipa-inline: Improve growth accumulation for recursive callsg Jan Hubicka
2020-08-20 12:04           ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Martin Jambor
2020-08-21  9:17             ` Tamar Christina
2020-09-08 14:00               ` Martin Jambor
2020-09-11 10:38                 ` Tamar Christina
2020-09-11 12:45                   ` Martin Jambor
2020-09-11 17:36                     ` Tamar Christina
2020-09-16  8:35                       ` Hongyu Wang
2020-10-16  8:46                       ` Xionghu Luo
2021-01-21 14:34                         ` Tamar Christina
2021-01-21 15:11                           ` Jan Hubicka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).