* [PATCH] ipa-inline: Improve growth accumulation for recursive calls
@ 2020-08-12 10:19 Xionghu Luo
  2020-08-12 17:53 ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread

From: Xionghu Luo @ 2020-08-12 10:19 UTC (permalink / raw)
  To: gcc-patches
  Cc: segher, dje.gcc, wschmidt, guojiufu, linkw, hubicka, mliska,
      mjambor, fxue, Xiong Hu Luo

From: Xiong Hu Luo <luoxhu@linux.ibm.com>

For SPEC2017 exchange2, there is a large self-recursive function digits_2
(function size 1300) which generates specialized nodes digits_2.1 to
digits_2.8 with the added build options:

  --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80

The ipa-inline pass will consider inlining these nodes since each is called
only once, but inlining these large functions too deeply causes serious
register spilling and performance degradation, as follows:

inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7) -> 2.8
inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8

Performance diff:
inlineB is ~25% faster than inlineA;
inlineC is ~20% faster than inlineB;
inlineD is ~30% faster than inlineC.

The master GCC code now generates an inline sequence like inlineB; this
patch makes the ipa-inline pass behave like inlineD by:
1) accumulating growth for recursive calls: the growth data is added to an
   edge when the edge's caller is inlined into another function, to avoid
   inlining too deeply;
2) also considering an edge as a recursive edge when its caller and callee
   are both specialized from the same node.

SPEC2017 testing shows a GEOMEAN improvement of +2.75% in total (+0.56%
without exchange2).  Any comments?  Thanks.
523.xalancbmk_r  +1.32%
541.leela_r      +1.51%
548.exchange2_r +31.87%
507.cactuBSSN_r  +0.80%
526.blender_r    +1.25%
538.imagick_r    +1.82%

gcc/ChangeLog:

2020-08-12  Xionghu Luo  <luoxhu@linux.ibm.com>

	* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
	callee are specialized from the same node.
	* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
	inlined_to growth to edge whose caller is inlined.
---
 gcc/cgraph.h              | 2 ++
 gcc/ipa-inline-analysis.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 0211f08964f..11903ac1960 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
   cgraph_node *c = callee->ultimate_alias_target ();
   if (caller->inlined_to)
     return caller->inlined_to->decl == c->decl;
+  else if (caller->clone_of && c->clone_of)
+    return caller->clone_of->decl == c->clone_of->decl;
   else
     return caller->decl == c->decl;
 }
diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
index 148efbc09ef..ba0cf836364 100644
--- a/gcc/ipa-inline-analysis.c
+++ b/gcc/ipa-inline-analysis.c
@@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
 	  continue;
 	}
       d->growth += estimate_edge_growth (e);
+      if (e->caller->inlined_to)
+	d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
+
       if (d->growth > d->cap)
 	return true;
     }
-- 
2.25.1

^ permalink raw reply	[flat|nested] 21+ messages in thread
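For readers without the SPEC sources at hand, the shape of the problem can be sketched in C (a hypothetical simplification: the names, loop bounds, and depth cap are illustrative, not the actual exchange2 code):

```c
/* Simplified sketch of a digits_2-like function: a self-recursive
   routine that calls itself from the innermost of several nested
   loops.  ipa-cp can specialize it on LEVEL, producing a chain of
   clones (one per recursion depth), which ipa-inline then considers
   inlining into each other because each clone is called only once.  */

static long work;

static void
search (int level, int max_level)
{
  if (level > max_level)
    return;
  for (int i = 0; i < 3; i++)      /* in digits_2 the nest is 10 deep */
    for (int j = 0; j < 3; j++)
      {
        work++;                    /* stand-in for the real body */
        search (level + 1, max_level);  /* recursion from the innermost loop */
      }
}
```

Inlining the whole clone chain here merges the loop nests of every level into one function body, which is where the extreme register pressure discussed below comes from.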
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-12 10:19 [PATCH] ipa-inline: Improve growth accumulation for recursive calls Xionghu Luo
@ 2020-08-12 17:53 ` Jan Hubicka
  2020-08-12 19:03   ` Richard Biener
                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread

From: Jan Hubicka @ 2020-08-12 17:53 UTC (permalink / raw)
  To: Xionghu Luo; +Cc: gcc-patches, segher, wschmidt, linkw, dje.gcc

Hello,
with Martin we spent some time looking into exchange2, and my
understanding of the problem is the following:

There is the self-recursive function digits_2, with the property that it
has 10 nested loops and calls itself from the innermost one.
We do not do an amazing job of guessing the profile, since it is quite
atypical.  The first observation is that the frequency of the recursive
call needs to be less than 1, otherwise the program never terminates;
however, with 10 nested loops one needs to predict every loop to iterate
just a few times and the conditionals guarding them as not very likely.
For that we added PRED_LOOP_GUARD_WITH_RECURSION some time ago, and I
fixed it yesterday (it caused a regression in exchange2, since the bad
profile had turned out to disable some harmful vectorization), and I also
now added a cap to the self-recursive frequency so that things do not get
mispropagated by ipa-cp.

Now if ipa-cp decides to duplicate digits_2 a few times, we have a new
problem.  The tree of recursion is organized in a way that bounds the
depth by 10 (which GCC does not know), and moreover most time is not
spent on the very deep levels of recursion.

For that you have the patch, which increases the frequencies of
recursively cloned nodes; however, it still seems to me a very
exchange2-specific hack: I do not see how to guess where most of the time
is spent.  Even for very regular trees, by the master theorem, very small
differences in the estimates of the recursion frequency decide whether
most of the time is spent at the top of the tree, at the bottom, or
balanced in between.
With algorithms doing backtracing, like exhchange, the likelyness of recusion reduces with deeper recursion level, but we do not know how quickly and what the level is. > From: Xiong Hu Luo <luoxhu@linux.ibm.com> > > For SPEC2017 exchange2, there is a large recursive functiondigits_2(function > size 1300) generates specialized node from digits_2.1 to digits_2.8 with added > build option: > > --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > > ipa-inline pass will consider inline these nodes called only once, but these > large functions inlined too deeply will cause serious register spill and > performance down as followed. > > inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8) > inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8) > inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8 > inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8 > > Performance diff: > inlineB is ~25% faster than inlineA; > inlineC is ~20% faster than inlineB; > inlineD is ~30% faster than inlineC. > > The master GCC code now generates inline sequence like inlineB, this patch > makes the ipa-inline pass behavior like inlineD by: > 1) The growth acumulation for recursive calls by adding the growth data > to the edge when edge's caller is inlined into another function to avoid > inline too deeply; > 2) And if caller and callee are both specialized from same node, the edge > should also be considered as recursive edge. > > SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without exchange2). > Any comments? Thanks. 
> > 523.xalancbmk_r +1.32% > 541.leela_r +1.51% > 548.exchange2_r +31.87% > 507.cactuBSSN_r +0.80% > 526.blender_r +1.25% > 538.imagick_r +1.82% > > gcc/ChangeLog: > > 2020-08-12 Xionghu Luo <luoxhu@linux.ibm.com> > > * cgraph.h (cgraph_edge::recursive_p): Return true if caller and > callee and specialized from same node. > * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's > inlined_to growth to edge whose caller is inlined. > --- > gcc/cgraph.h | 2 ++ > gcc/ipa-inline-analysis.c | 3 +++ > 2 files changed, 5 insertions(+) > > diff --git a/gcc/cgraph.h b/gcc/cgraph.h > index 0211f08964f..11903ac1960 100644 > --- a/gcc/cgraph.h > +++ b/gcc/cgraph.h > @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void) > cgraph_node *c = callee->ultimate_alias_target (); > if (caller->inlined_to) > return caller->inlined_to->decl == c->decl; > + else if (caller->clone_of && c->clone_of) > + return caller->clone_of->decl == c->clone_of->decl; > else > return caller->decl == c->decl; If you clone the function so it is no longer self recursive, it does not make much sense to lie to optimizers that the function is still recursive. The inlining would be harmful even if the programer did cloning by hand. I guess main problem is the extreme register pressure issue combining loop depth of 10 in caller with loop depth of 10 in callee just because the function is called once. The negative effect is most likely also due to wrong profile estimate which drives IRA to optimize wrong spot. But I wonder if we simply don't want to teach inlining function called once to not construct large loop depths? Something like do not inline if caller&callee loop depth is over 3 or so? 
Honza > } > diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c > index 148efbc09ef..ba0cf836364 100644 > --- a/gcc/ipa-inline-analysis.c > +++ b/gcc/ipa-inline-analysis.c > @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data) > continue; > } > d->growth += estimate_edge_growth (e); > + if (e->caller->inlined_to) > + d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth; > + > if (d->growth > d->cap) > return true; > } > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 21+ messages in thread
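The loop-depth cap Honza floats above could look roughly like the following (my own sketch, not a real GCC patch: the function name and the plain-int parameters stand in for data the real pass would read from its function summaries):

```c
/* Hypothetical predicate for the "do not inline if caller&callee loop
   depth is over 3 or so" idea: refuse to inline a function called once
   when the combined loop depth at the call site would exceed a cap.  */
static int
want_inline_called_once_by_depth_p (int caller_loop_depth,
                                    int callee_loop_depth)
{
  const int max_combined_loop_depth = 3;   /* "3 or so" from the thread */
  return caller_loop_depth + callee_loop_depth <= max_combined_loop_depth;
}
```

For exchange2 the call site sits in a depth-10 nest and the callee adds another 10, so such a check would reject the inline immediately; as the follow-ups note, a blanket cap like this could also reject profitable inlines such as small leaf functions.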
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-12 17:53 ` Jan Hubicka @ 2020-08-12 19:03 ` Richard Biener 2020-08-13 0:29 ` Segher Boessenkool 2020-08-13 3:59 ` Feng Xue OS 2020-08-13 6:51 ` luoxhu 2 siblings, 1 reply; 21+ messages in thread From: Richard Biener @ 2020-08-12 19:03 UTC (permalink / raw) To: gcc-patches, Jan Hubicka, Xionghu Luo; +Cc: wschmidt, dje.gcc, linkw, segher On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka <hubicka@ucw.cz> wrote: >Hello, >with Martin we spent some time looking into exchange2 and my >understanding of the problem is the following: > >There is the self recursive function digits_2 with the property that it >has 10 nested loops and calls itself from the innermost. >Now we do not do amazing job on guessing the profile since it is quite >atypical. First observation is that the callback frequencly needs to be >less than 1 otherwise the program never terminates, however with 10 >nested loops one needs to predict every loop to iterate just few times >and conditionals guarding them as not very likely. For that we added >PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday >(causing regression in exhange since the bad profile turned out to >disable some harmful vectorization) and I also now added a cap to the >self recursive frequency so things to not get mispropagated by ipa-cp. > >Now if ipa-cp decides to duplicate digits few times we have a new >problem. The tree of recursion is orgnaized in a way that the depth is >bounded by 10 (which GCC does not know) and moreover most time is not >spent on very deep levels of recursion. > >For that you have the patch which increases frequencies of recursively >cloned nodes, however it still seems to me as very specific hack for >exchange: I do not see how to guess where most of time is spent. 
>Even for very regular trees, by master theorem, it depends on very >little differences in the estimates of recursion frequency whether most >of time is spent on the top of tree, bottom or things are balanced. > >With algorithms doing backtracing, like exhchange, the likelyness of >recusion reduces with deeper recursion level, but we do not know how >quickly and what the level is. > >> From: Xiong Hu Luo <luoxhu@linux.ibm.com> >> >> For SPEC2017 exchange2, there is a large recursive >functiondigits_2(function >> size 1300) generates specialized node from digits_2.1 to digits_2.8 >with added >> build option: >> >> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 >> >> ipa-inline pass will consider inline these nodes called only once, >but these >> large functions inlined too deeply will cause serious register spill >and >> performance down as followed. >> >> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 >(inline 2.6, 2.7, 2.8) >> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 >(inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8) >> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> >2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8 >> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 >-> 2.5 -> 2.6 -> 2.7 -> 2.8 >> >> Performance diff: >> inlineB is ~25% faster than inlineA; >> inlineC is ~20% faster than inlineB; >> inlineD is ~30% faster than inlineC. >> >> The master GCC code now generates inline sequence like inlineB, this >patch >> makes the ipa-inline pass behavior like inlineD by: >> 1) The growth acumulation for recursive calls by adding the growth >data >> to the edge when edge's caller is inlined into another function to >avoid >> inline too deeply; >> 2) And if caller and callee are both specialized from same node, the >edge >> should also be considered as recursive edge. >> >> SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without >exchange2). >> Any comments? 
Thanks. >> >> 523.xalancbmk_r +1.32% >> 541.leela_r +1.51% >> 548.exchange2_r +31.87% >> 507.cactuBSSN_r +0.80% >> 526.blender_r +1.25% >> 538.imagick_r +1.82% >> >> gcc/ChangeLog: >> >> 2020-08-12 Xionghu Luo <luoxhu@linux.ibm.com> >> >> * cgraph.h (cgraph_edge::recursive_p): Return true if caller and >> callee and specialized from same node. >> * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's >> inlined_to growth to edge whose caller is inlined. >> --- >> gcc/cgraph.h | 2 ++ >> gcc/ipa-inline-analysis.c | 3 +++ >> 2 files changed, 5 insertions(+) >> >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h >> index 0211f08964f..11903ac1960 100644 >> --- a/gcc/cgraph.h >> +++ b/gcc/cgraph.h >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void) >> cgraph_node *c = callee->ultimate_alias_target (); >> if (caller->inlined_to) >> return caller->inlined_to->decl == c->decl; >> + else if (caller->clone_of && c->clone_of) >> + return caller->clone_of->decl == c->clone_of->decl; >> else >> return caller->decl == c->decl; > >If you clone the function so it is no longer self recursive, it does >not >make much sense to lie to optimizers that the function is still >recursive. > >The inlining would be harmful even if the programer did cloning by >hand. >I guess main problem is the extreme register pressure issue combining >loop depth of 10 in caller with loop depth of 10 in callee just because >the function is called once. > >The negative effect is most likely also due to wrong profile estimate >which drives IRA to optimize wrong spot. But I wonder if we simply >don't want to teach inlining function called once to not construct >large >loop depths? Something like do not inline if caller&callee loop depth >is over 3 or so? I don't think that's good by itself (consider leaf functions and x86 xmm reg ABI across calls). Even with large loop depth abstraction penalty removal can make inlining worth it. 
For the testcase the recursiveness is what looks special (recursion from a deeper loop nest level). Richard. >Honza >> } >> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c >> index 148efbc09ef..ba0cf836364 100644 >> --- a/gcc/ipa-inline-analysis.c >> +++ b/gcc/ipa-inline-analysis.c >> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, >void *data) >> continue; >> } >> d->growth += estimate_edge_growth (e); >> + if (e->caller->inlined_to) >> + d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth; >> + >> if (d->growth > d->cap) >> return true; >> } >> -- >> 2.25.1 >> ^ permalink raw reply [flat|nested] 21+ messages in thread
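Abstracting away from the cgraph details, the effect of the do_estimate_growth_1 hunk quoted above can be modeled with a toy calculation (my own illustration; the helper and the numbers are hypothetical): once an edge's caller is itself inlined, the caller's accumulated growth is charged to the edge as well, so the estimated growth snowballs along a chain of clones and hits the cap after a few levels instead of never.

```c
/* Toy model: walk down a chain of identical clones, inlining each one
   into the next.  Without accumulation every step costs EDGE_GROWTH;
   with the patch, a step whose caller is already inlined also pays the
   caller's accumulated growth.  Returns how many levels fit under CAP.  */
static int
chain_depth_within_cap (int edge_growth, int cap, int accumulate)
{
  int total = 0, depth = 0;
  for (;;)
    {
      int step = edge_growth + (accumulate ? total : 0);
      if (total + step > cap)
        break;
      total += step;
      depth++;
    }
  return depth;
}
```

With edge_growth 100 and cap 1000, the plain model inlines 10 levels deep, while the accumulating model stops after 3, which is the inlineD-like behavior the patch aims for.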
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-12 19:03 ` Richard Biener @ 2020-08-13 0:29 ` Segher Boessenkool 2020-08-13 7:53 ` Jan Hubicka 0 siblings, 1 reply; 21+ messages in thread From: Segher Boessenkool @ 2020-08-13 0:29 UTC (permalink / raw) To: Richard Biener Cc: gcc-patches, Jan Hubicka, Xionghu Luo, wschmidt, dje.gcc, linkw Hi! On Wed, Aug 12, 2020 at 09:03:35PM +0200, Richard Biener wrote: > On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka <hubicka@ucw.cz> wrote: > >> From: Xiong Hu Luo <luoxhu@linux.ibm.com> > >> 523.xalancbmk_r +1.32% > >> 541.leela_r +1.51% > >> 548.exchange2_r +31.87% > >> 507.cactuBSSN_r +0.80% > >> 526.blender_r +1.25% > >> 538.imagick_r +1.82% > >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h > >> index 0211f08964f..11903ac1960 100644 > >> --- a/gcc/cgraph.h > >> +++ b/gcc/cgraph.h > >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void) > >> cgraph_node *c = callee->ultimate_alias_target (); > >> if (caller->inlined_to) > >> return caller->inlined_to->decl == c->decl; > >> + else if (caller->clone_of && c->clone_of) > >> + return caller->clone_of->decl == c->clone_of->decl; > >> else > >> return caller->decl == c->decl; > > > >If you clone the function so it is no longer self recursive, it does > >not > >make much sense to lie to optimizers that the function is still > >recursive. Like Richard says below (if I understand him right, sorry if not), the function still *is* recursive in its group of clones. > >The inlining would be harmful even if the programer did cloning by > >hand. > >I guess main problem is the extreme register pressure issue combining > >loop depth of 10 in caller with loop depth of 10 in callee just because > >the function is called once. > > > >The negative effect is most likely also due to wrong profile estimate > >which drives IRA to optimize wrong spot. 
But I wonder if we simply > >don't want to teach inlining function called once to not construct > >large > >loop depths? Something like do not inline if caller&callee loop depth > >is over 3 or so? > > I don't think that's good by itself (consider leaf functions and x86 xmm reg ABI across calls). Even with large loop depth abstraction penalty removal can make inlining worth it. For the testcase the recursiveness is what looks special (recursion from a deeper loop nest level). Yes, the loop stuff / register pressure issues might help for the exchange result, but what about the other five above? Segher ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-13 0:29 ` Segher Boessenkool @ 2020-08-13 7:53 ` Jan Hubicka 0 siblings, 0 replies; 21+ messages in thread From: Jan Hubicka @ 2020-08-13 7:53 UTC (permalink / raw) To: Segher Boessenkool Cc: Richard Biener, Xionghu Luo, wschmidt, linkw, gcc-patches, dje.gcc > Hi! > > On Wed, Aug 12, 2020 at 09:03:35PM +0200, Richard Biener wrote: > > On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka <hubicka@ucw.cz> wrote: > > >> From: Xiong Hu Luo <luoxhu@linux.ibm.com> > > >> 523.xalancbmk_r +1.32% > > >> 541.leela_r +1.51% > > >> 548.exchange2_r +31.87% > > >> 507.cactuBSSN_r +0.80% > > >> 526.blender_r +1.25% > > >> 538.imagick_r +1.82% > > > >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h > > >> index 0211f08964f..11903ac1960 100644 > > >> --- a/gcc/cgraph.h > > >> +++ b/gcc/cgraph.h > > >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void) > > >> cgraph_node *c = callee->ultimate_alias_target (); > > >> if (caller->inlined_to) > > >> return caller->inlined_to->decl == c->decl; > > >> + else if (caller->clone_of && c->clone_of) > > >> + return caller->clone_of->decl == c->clone_of->decl; > > >> else > > >> return caller->decl == c->decl; > > > > > >If you clone the function so it is no longer self recursive, it does > > >not > > >make much sense to lie to optimizers that the function is still > > >recursive. > > Like Richard says below (if I understand him right, sorry if not), the > function still *is* recursive in its group of clones. The test above is not an heuristics. Its purpose is to determine when offline body will be eliminated after inlining the call. This happens when 1) this is last call to the function 2) function is not used otherwise (exported or visible) 3) the call is not self recursive In case of 3 the problem is that inlning will introduce new call to function itself and offline copy will still be needed. 
Here we see a chain of clones calling
clone1 -> clone2 -> clone3 -> ... -> clone9 -> clone1; inlining the
clone2->clone3 call will eliminate the offline copy of clone3 and will
reduce code size.

The inliner has an analysis part, which intends to model code size/time
accurately, and a heuristics part.  The patch makes the size/time model
produce wrong results, while this needs to be solved in the heuristics
part.

Note that with a bit of optimization we should be able to eliminate the
clone9->clone1 call because it is dead (I believe we don't do that only
because the recursion limit is set one off).  This however does not make
inlining clone2->clone3 any better an idea.  The same is true if the user
wrote the clones explicitly.

> > >The inlining would be harmful even if the programer did cloning by
> > >hand.
> > >I guess main problem is the extreme register pressure issue combining
> > >loop depth of 10 in caller with loop depth of 10 in callee just because
> > >the function is called once.
> > >
> > >The negative effect is most likely also due to wrong profile estimate
> > >which drives IRA to optimize wrong spot. But I wonder if we simply
> > >don't want to teach inlining function called once to not construct
> > >large
> > >loop depths? Something like do not inline if caller&callee loop depth
> > >is over 3 or so?
> >
> > I don't think that's good by itself (consider leaf functions and x86 xmm reg ABI across calls). Even with large loop depth abstraction penalty removal can make inlining worth it. For the testcase the recursiveness is what looks special (recursion from a deeper loop nest level).
>
> Yes, the loop stuff / register pressure issues might help for the
> exchange result, but what about the other five above?

I think such a large loop nest (20 if you combine caller+callee) rarely
happens in practice, so it may be a good heuristic.  I guess it depends on
how many exchange2-only hacks we want in the compiler.  We should aim to
make changes that help other codebases too.
The ipa-cp recursive cloning machinery IMO has good potential to help other
code which has a fixed cap on the depth of recursion.  Consider the
following code:

int array[1000];

static int
fact (int a)
{
  int ret = 1;
  if (a > 1)
    ret = fact (a - 1) * a;
  for (int i = 0; i < a; i++)
    array[i + (a - 1) * (a - 1) / 2] = 0;
  return ret;
}

int
test (void)
{
  return fact (10);
}

This computes a factorial and at the same time clears memory (just to keep
a tail call from happening and also to make the functions a bit bigger).
Here it is a win to produce all 10 clones, inline them together, precompute
the factorial, and combine the memsets.

We have a few problems visible on this testcase:
1) ipa-cp needs a reduction of its evaluation thresholds;
2) the estimated profile is bogus (since we estimate the recursion with
   probability 1/3, after 10 levels of recursion the innermost factorial
   gets a probability of almost 0);
3) the inliner will still happily flatten the function, not noticing that
   it has to be cold.  This is because it does that pairwise and never
   sees the combined probability.

Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread
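Honza's point 2) can be made concrete with a small model (my own illustration, not GCC's actual estimator): if each level is guessed to recurse with ratio r (probability for r < 1, or an average branching factor for r > 1), the estimated call count at depth d scales like r^d, so small changes in r flip whether the estimated time concentrates at the top or the bottom of a depth-capped recursion tree.

```c
/* Fraction of the total estimated call count that falls on the deepest
   level of a recursion capped at depth D_MAX, when each level recurses
   with ratio R: r^D_MAX / sum_{d=0}^{D_MAX} r^d.  For r < 1 the top of
   the tree dominates; for r > 1 the bottom does.  */
static double
deepest_fraction (double r, int d_max)
{
  double total = 0.0, level = 1.0, deepest = 1.0;
  for (int d = 0; d <= d_max; d++)
    {
      total += level;     /* running sum of r^d */
      deepest = level;    /* r^d_max after the last iteration */
      level *= r;
    }
  return deepest / total;
}
```

With the guessed probability 1/3 and depth 10 from Honza's description, the deepest clone receives a vanishing share of the estimated time (about 1e-5), matching his "probability almost 0"; with a ratio just above 1 the deepest level would instead dominate.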
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-12 17:53 ` Jan Hubicka 2020-08-12 19:03 ` Richard Biener @ 2020-08-13 3:59 ` Feng Xue OS 2020-08-13 6:51 ` luoxhu 2 siblings, 0 replies; 21+ messages in thread From: Feng Xue OS @ 2020-08-13 3:59 UTC (permalink / raw) To: Jan Hubicka, Xionghu Luo; +Cc: wschmidt, dje.gcc, gcc-patches, linkw, segher > Hello, > with Martin we spent some time looking into exchange2 and my > understanding of the problem is the following: > > There is the self recursive function digits_2 with the property that it > has 10 nested loops and calls itself from the innermost. > Now we do not do amazing job on guessing the profile since it is quite > atypical. First observation is that the callback frequencly needs to be > less than 1 otherwise the program never terminates, however with 10 > nested loops one needs to predict every loop to iterate just few times > and conditionals guarding them as not very likely. For that we added > PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday > (causing regression in exhange since the bad profile turned out to > disable some harmful vectorization) and I also now added a cap to the > self recursive frequency so things to not get mispropagated by ipa-cp. With default setting of PRED_LOOP_GUARD_WITH_RECURSION, static profile estimation for exchange2 is far from accurate, the hottest recursive function is predicted as infrequent. However, this low execution estimation works fine with IRA. I've tried to tweak likelihood of the predictor, same as you, performance was degraded when estimated profile increased. This regression is also found to be correlated with IRA, which produces much more register spills than default. In presence of deep loops and high register pressure, IRA behaves more sensitively to profile estimation, and this exhibits an unwanted property of current IRA algorithm. 
I've described it in a tracker (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174). Feng > > Now if ipa-cp decides to duplicate digits few times we have a new > problem. The tree of recursion is orgnaized in a way that the depth is > bounded by 10 (which GCC does not know) and moreover most time is not > spent on very deep levels of recursion. > > For that you have the patch which increases frequencies of recursively > cloned nodes, however it still seems to me as very specific hack for > exchange: I do not see how to guess where most of time is spent. > Even for very regular trees, by master theorem, it depends on very > little differences in the estimates of recursion frequency whether most > of time is spent on the top of tree, bottom or things are balanced. > > With algorithms doing backtracing, like exhchange, the likelyness of > recusion reduces with deeper recursion level, but we do not know how > quickly and what the level is. > >> From: Xiong Hu Luo <luoxhu@linux.ibm.com> >> >> For SPEC2017 exchange2, there is a large recursive functiondigits_2(function >> size 1300) generates specialized node from digits_2.1 to digits_2.8 with added >> build option: >> >> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 >> >> ipa-inline pass will consider inline these nodes called only once, but these >> large functions inlined too deeply will cause serious register spill and >> performance down as followed. 
>> >> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8) >> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8) >> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8 >> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8 >> >> Performance diff: >> inlineB is ~25% faster than inlineA; >> inlineC is ~20% faster than inlineB; >> inlineD is ~30% faster than inlineC. >> >> The master GCC code now generates inline sequence like inlineB, this patch >> makes the ipa-inline pass behavior like inlineD by: >> 1) The growth acumulation for recursive calls by adding the growth data >> to the edge when edge's caller is inlined into another function to avoid >> inline too deeply; >> 2) And if caller and callee are both specialized from same node, the edge >> should also be considered as recursive edge. >> >> SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without exchange2). >> Any comments? Thanks. >> >> 523.xalancbmk_r +1.32% >> 541.leela_r +1.51% >> 548.exchange2_r +31.87% >> 507.cactuBSSN_r +0.80% >> 526.blender_r +1.25% >> 538.imagick_r +1.82% >> >> gcc/ChangeLog: >> >> 2020-08-12 Xionghu Luo <luoxhu@linux.ibm.com> >> >> * cgraph.h (cgraph_edge::recursive_p): Return true if caller and >> callee and specialized from same node. >> * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's >> inlined_to growth to edge whose caller is inlined. 
>> --- >> gcc/cgraph.h | 2 ++ >> gcc/ipa-inline-analysis.c | 3 +++ >> 2 files changed, 5 insertions(+) >> >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h >> index 0211f08964f..11903ac1960 100644 >> --- a/gcc/cgraph.h >> +++ b/gcc/cgraph.h >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void) >> cgraph_node *c = callee->ultimate_alias_target (); >> if (caller->inlined_to) >> return caller->inlined_to->decl == c->decl; >> + else if (caller->clone_of && c->clone_of) >> + return caller->clone_of->decl == c->clone_of->decl; >> else >> return caller->decl == c->decl; > > If you clone the function so it is no longer self recursive, it does not > make much sense to lie to optimizers that the function is still > recursive. > > The inlining would be harmful even if the programer did cloning by hand. > I guess main problem is the extreme register pressure issue combining > loop depth of 10 in caller with loop depth of 10 in callee just because > the function is called once. > > The negative effect is most likely also due to wrong profile estimate > which drives IRA to optimize wrong spot. But I wonder if we simply > don't want to teach inlining function called once to not construct large > loop depths? Something like do not inline if caller&callee loop depth > is over 3 or so? > > Honza >> } >> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c >> index 148efbc09ef..ba0cf836364 100644 >> --- a/gcc/ipa-inline-analysis.c >> +++ b/gcc/ipa-inline-analysis.c >> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data) >> continue; >> } >> d->growth += estimate_edge_growth (e); >> + if (e->caller->inlined_to) >> + d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth; >> + >> if (d->growth > d->cap) >> return true; >> } >> -- >> 2.25.1 >> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
  2020-08-12 17:53 ` Jan Hubicka
  2020-08-12 19:03   ` Richard Biener
  2020-08-13  3:59   ` Feng Xue OS
@ 2020-08-13  6:51 ` luoxhu
  2020-08-13 12:52   ` Jan Hubicka
  2 siblings, 1 reply; 21+ messages in thread

From: luoxhu @ 2020-08-13 6:51 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, segher, wschmidt, linkw, dje.gcc

Hi,

On 2020/8/13 01:53, Jan Hubicka wrote:
> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
>
> There is the self recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do amazing job on guessing the profile since it is quite
> atypical. First observation is that the callback frequencly needs to be
> less than 1 otherwise the program never terminates, however with 10
> nested loops one needs to predict every loop to iterate just few times
> and conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
> (causing regression in exhange since the bad profile turned out to
> disable some harmful vectorization) and I also now added a cap to the
> self recursive frequency so things to not get mispropagated by ipa-cp.

Thanks for the information :)  Tamar replied that there is another
regression: *on exchange2 it is 11%*.  I've also rebased my code and
confirmed it really gets even slower than before (reverting the patch
pulls the performance back)...

> Now if ipa-cp decides to duplicate digits few times we have a new
> problem. The tree of recursion is orgnaized in a way that the depth is
> bounded by 10 (which GCC does not know) and moreover most time is not
> spent on very deep levels of recursion.
> > For that you have the patch which increases frequencies of recursively > cloned nodes, however it still seems to me as very specific hack for > exchange: I do not see how to guess where most of time is spent. > Even for very regular trees, by master theorem, it depends on very > little differences in the estimates of recursion frequency whether most > of time is spent on the top of tree, bottom or things are balanced. The build is not PGO, so I am not clear how profile count will affect the ipa-cp and ipa-inline decision. Since there are no other callers outside of these specialized nodes, the guessed profile count should be same equal? Perf tool shows that even each specialized node is called only once, none of them take same time for each call: 40.65% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.4 16.31% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.3 10.91% exchange2_gcc.o libgfortran.so.5.0.0 [.] _gfortran_mminloc0_4_i4 5.41% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.6 4.68% exchange2_gcc.o exchange2_gcc.orig.slow [.] __logic_MOD_new_solver 3.76% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.5 1.07% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.7 0.84% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.0 0.47% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.2 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.1 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_covered.constprop.0 0.11% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_reflected.constprop.0 0.00% exchange2_gcc.o exchange2_gcc.orig.slow [.] 
__brute_force_MOD_brute.constprop.1 digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time, so profile count and frequency seem not very helpful for this case? > > With algorithms doing backtracking, like exchange, the likelihood of > recursion decreases with deeper recursion level, but we do not know how > quickly and what the level is. > >> From: Xiong Hu Luo <luoxhu@linux.ibm.com> >> >> For SPEC2017 exchange2, there is a large recursive function digits_2 (function >> size 1300) that generates specialized nodes digits_2.1 to digits_2.8 with the added >> build options: >> >> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 >> >> The ipa-inline pass will consider inlining these nodes called only once, but these >> large functions inlined too deeply will cause serious register spill and >> performance degradation as follows. >> >> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8) >> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8) >> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8 >> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8 >> >> Performance diff: >> inlineB is ~25% faster than inlineA; >> inlineC is ~20% faster than inlineB; >> inlineD is ~30% faster than inlineC. >> >> The master GCC code now generates an inline sequence like inlineB; this patch >> makes the ipa-inline pass behave like inlineD by: >> 1) Accumulating growth for recursive calls, adding the growth data >> to the edge when the edge's caller is inlined into another function, to avoid >> inlining too deeply; >> 2) If caller and callee are both specialized from the same node, the edge >> should also be considered a recursive edge. >> >> SPEC2017 tests show a GEOMEAN improvement of +2.75% in total (+0.56% without exchange2). >> Any comments? Thanks.
>> >> 523.xalancbmk_r +1.32% >> 541.leela_r +1.51% >> 548.exchange2_r +31.87% >> 507.cactuBSSN_r +0.80% >> 526.blender_r +1.25% >> 538.imagick_r +1.82% >> >> gcc/ChangeLog: >> >> 2020-08-12 Xionghu Luo <luoxhu@linux.ibm.com> >> >> * cgraph.h (cgraph_edge::recursive_p): Return true if caller and >> callee are specialized from the same node. >> * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's >> inlined_to growth to edge whose caller is inlined. >> --- >> gcc/cgraph.h | 2 ++ >> gcc/ipa-inline-analysis.c | 3 +++ >> 2 files changed, 5 insertions(+) >> >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h >> index 0211f08964f..11903ac1960 100644 >> --- a/gcc/cgraph.h >> +++ b/gcc/cgraph.h >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void) >> cgraph_node *c = callee->ultimate_alias_target (); >> if (caller->inlined_to) >> return caller->inlined_to->decl == c->decl; >> + else if (caller->clone_of && c->clone_of) >> + return caller->clone_of->decl == c->clone_of->decl; >> else >> return caller->decl == c->decl; > > If you clone the function so it is no longer self recursive, it does not > make much sense to lie to optimizers that the function is still > recursive. > > The inlining would be harmful even if the programmer did the cloning by hand. > I guess the main problem is the extreme register pressure issue combining > loop depth of 10 in the caller with loop depth of 10 in the callee just because > the function is called once. > > The negative effect is most likely also due to a wrong profile estimate > which drives IRA to optimize the wrong spot. But I wonder if we simply > don't want to teach inlining of functions called once not to construct large > loop depths? Something like do not inline if caller&callee loop depth > is over 3 or so? I will try this, though test results show that no inlining is better than inlining one, which is better than inlining two or more here. Moreover, sorry, I am also not quite sure whether Richard Biener is positive about this or not...
> > Honza >> } >> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c >> index 148efbc09ef..ba0cf836364 100644 >> --- a/gcc/ipa-inline-analysis.c >> +++ b/gcc/ipa-inline-analysis.c >> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data) >> continue; >> } >> d->growth += estimate_edge_growth (e); >> + if (e->caller->inlined_to) >> + d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth; >> + What about this change: is it reasonable to add the inlined_to caller's growth to the current edge? It is a bit independent of the previous code. Thanks, Xionghu >> if (d->growth > d->cap) >> return true; >> } >> -- >> 2.25.1 >> ^ permalink raw reply [flat|nested] 21+ messages in thread
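Honza's master-theorem point above (very small differences in the guessed recursive edge frequency decide whether the top or the bottom of the recursion tree dominates the runtime) can be illustrated with a small standalone sketch. This is not GCC code; it assumes a constant re-entry frequency per level and unit work per invocation, which is exactly the simplification a static profile estimator is forced to make:

```c
#include <assert.h>

/* Illustrative sketch only, not GCC code.  If every recursion level
   re-enters the recursion with the same frequency FREQ and does unit
   work per invocation, level D contributes roughly FREQ^D of the total
   time, so the shape of the profile flips depending on whether FREQ is
   slightly below or slightly above 1.  */
static double
level_weight (double freq, int depth)
{
  double w = 1.0;
  for (int i = 0; i < depth; i++)
    w *= freq;
  return w;
}

/* Fraction of the total estimated time spent in the top half of a
   DEPTH-level recursion tree under the constant-frequency assumption.  */
static double
top_half_share (double freq, int depth)
{
  double top = 0.0, total = 0.0;
  for (int d = 0; d < depth; d++)
    {
      double w = level_weight (freq, d);
      total += w;
      if (d < depth / 2)
        top += w;
    }
  return top / total;
}
```

With a per-level frequency of 0.5, more than 95% of the estimated time lands in the top half of a 10-level recursion; with 2.0, more than 90% lands in the bottom half. So a small error in the guessed frequency flips the conclusion about where the hot levels are, which is why the per-edge probability is such a fragile basis for updating the clones' profiles.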
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-13 6:51 ` luoxhu @ 2020-08-13 12:52 ` Jan Hubicka 2020-08-14 9:04 ` luoxhu 0 siblings, 1 reply; 21+ messages in thread From: Jan Hubicka @ 2020-08-13 12:52 UTC (permalink / raw) To: luoxhu; +Cc: gcc-patches, segher, wschmidt, linkw, dje.gcc > > Thanks for the information :) Tamar replied that there is another > regression of *11% on exchange2*; I've also rebased my code and confirmed > it really gets even slower than before (reverting the patch pulls the > performance back)... Yep, we need to figure out how to fix this - the summary of the IRA issue is interesting. I was aware of it, but never really looked into what IRA does wrong. Basically what happened historically is that when exchange2 was newly added to SPEC, we looked at it with Martin and noticed the issue with the loop nest being predicted very aggressively toward the innermost loop, which led the loop optimizer to do funny things on loops that really must not iterate too many times since we know that the frequency of the recursive call is strictly less than 1. We spent some time on tuning the inliner for low-trip-count loops and also added the PRED_LOOP_GUARD_WITH_RECURSION heuristic that was meant to reduce the probability of entering the loop which contains the recursive call - this should be a pattern in all similar backtrack-like algorithms. The conditions terminating the walk should be likely or the program would never finish.
> > Even for very regular trees, by the master theorem, it depends on very > > little differences in the estimates of recursion frequency whether most > > of the time is spent at the top of the tree, at the bottom, or balanced. > > The build is not PGO, so it is not clear to me how the profile count will affect the > ipa-cp and ipa-inline decisions. Even without PGO the counts are used. predict.c first estimates the branch probabilities and then these are propagated to estimated counts of basic blocks, and this is used throughout the compiler to drive the optimization decisions (so ipa-inline computes the estimated runtime effects of the inlining, all weighted by the counts; similarly does ipa-cp). What we ended up with was a bug in the patch adding PRED_LOOP_GUARD_WITH_RECURSION which resulted in the loop guards in exchange being predicted by both PRED_LOOP_GUARD_WITH_RECURSION and PRED_LOOP_GUARD. Since the first claims an 85% chance that the loop will not be entered and the second 75%, the combined outcome got over 90% probability, and combining 10 conditions resulted in a very small frequency of the recursive edge. It did help IRA to allocate sanely, but not for good reasons, so we ended up with exchange improvements and did not notice the bug (this is the one fixed by the patch above). For some releases PGO performance is slower than non-PGO https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/options which I think is a combination of IRA and bad decisions in some loop opts. The other problem is that the vectorizer tends to blow up the register pressure too. > Since there are no other callers outside of these specialized nodes, > shouldn't the guessed profile counts be equal? The perf tool shows that even > though each specialized node is called only once, they do not take the same time > per call: > > 40.65% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.4 > 16.31% exchange2_gcc.o exchange2_gcc.orig.slow [.]
__brute_force_MOD_digits_2.constprop.3 > 10.91% exchange2_gcc.o libgfortran.so.5.0.0 [.] _gfortran_mminloc0_4_i4 > 5.41% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.6 > 4.68% exchange2_gcc.o exchange2_gcc.orig.slow [.] __logic_MOD_new_solver > 3.76% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.5 > 1.07% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.7 > 0.84% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.0 > 0.47% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.2 > 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.1 > 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_covered.constprop.0 > 0.11% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_reflected.constprop.0 > 0.00% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.1 > > > digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time, > so profile count and frequency seem not very helpful for this case? Yep, you cannot really determine the time spent at each recursion level from the recursive edge probability, since you cannot assume it to be the same on each level of recursion (here we know it is 0 at level 10 and it is slowly dropping as the recursion increases because there are fewer possibilities to fill up the partly filled sudoku :). Especially you cannot assume it after ipa-cp did the work to figure out that there is a recursion depth counter that affects the function. Thinking of the ipa-cp profile updating changes, I did not consider the approach of adding extra weight to the recursive call safe.
The problem is that the recursive call frequency estimate, when coming from static profile estimation, is likely completely off, just as the static profile estimator cannot be used to predict realistically the number of iterations of loops. However, even with profile feedback this is hard to interpret. I was wondering if we cannot simply detect this scenario and avoid using the recursive edge probability in the profile updating. We could simply scale by the number of copies. This means that if we produce 10 clones, we could just set each clone to 1/10th of the frequency. This implies that the hottest spot will not be underestimated exponentially much as can happen now, but just 10 times at worst, which is probably still OK. We play similar games in loop optimizers: static profile estimators expect every loop to trip around 3 times, so unroller/vectorizer/etc would make no sense on them. By scaling down according to the number of copies we will keep the frequencies of calls to functions called from clones "sane". We will still run into problems when we inline one clone into another (since its profile will get scaled by the recursive edge frequency), but perhaps that can be special-cased in the profiler, or ipa-cp can adjust the recursive edges to get a frequency close to 1 as a hack. This is not pretty, but at least it looks safer to me than the original profile updating patch adding extra weight to the frequency. Honza > > > > With algorithms doing backtracking, like exchange, the likelihood of > > recursion decreases with deeper recursion level, but we do not know how > > quickly and what the level is.
> > > > > >> From: Xiong Hu Luo <luoxhu@linux.ibm.com> > >> > >> For SPEC2017 exchange2, there is a large recursive functiondigits_2(function > >> size 1300) generates specialized node from digits_2.1 to digits_2.8 with added > >> build option: > >> > >> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > >> > >> ipa-inline pass will consider inline these nodes called only once, but these > >> large functions inlined too deeply will cause serious register spill and > >> performance down as followed. > >> > >> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8) > >> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8) > >> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7 ) -> 2.8 > >> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8 > >> > >> Performance diff: > >> inlineB is ~25% faster than inlineA; > >> inlineC is ~20% faster than inlineB; > >> inlineD is ~30% faster than inlineC. > >> > >> The master GCC code now generates inline sequence like inlineB, this patch > >> makes the ipa-inline pass behavior like inlineD by: > >> 1) The growth acumulation for recursive calls by adding the growth data > >> to the edge when edge's caller is inlined into another function to avoid > >> inline too deeply; > >> 2) And if caller and callee are both specialized from same node, the edge > >> should also be considered as recursive edge. > >> > >> SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without exchange2). > >> Any comments? Thanks. 
> >> > >> 523.xalancbmk_r +1.32% > >> 541.leela_r +1.51% > >> 548.exchange2_r +31.87% > >> 507.cactuBSSN_r +0.80% > >> 526.blender_r +1.25% > >> 538.imagick_r +1.82% > >> > >> gcc/ChangeLog: > >> > >> 2020-08-12 Xionghu Luo <luoxhu@linux.ibm.com> > >> > >> * cgraph.h (cgraph_edge::recursive_p): Return true if caller and > >> callee and specialized from same node. > >> * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's > >> inlined_to growth to edge whose caller is inlined. > >> --- > >> gcc/cgraph.h | 2 ++ > >> gcc/ipa-inline-analysis.c | 3 +++ > >> 2 files changed, 5 insertions(+) > >> > >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h > >> index 0211f08964f..11903ac1960 100644 > >> --- a/gcc/cgraph.h > >> +++ b/gcc/cgraph.h > >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void) > >> cgraph_node *c = callee->ultimate_alias_target (); > >> if (caller->inlined_to) > >> return caller->inlined_to->decl == c->decl; > >> + else if (caller->clone_of && c->clone_of) > >> + return caller->clone_of->decl == c->clone_of->decl; > >> else > >> return caller->decl == c->decl; > > > > If you clone the function so it is no longer self recursive, it does not > > make much sense to lie to optimizers that the function is still > > recursive. > > > > The inlining would be harmful even if the programer did cloning by hand. > > I guess main problem is the extreme register pressure issue combining > > loop depth of 10 in caller with loop depth of 10 in callee just because > > the function is called once. > > > > The negative effect is most likely also due to wrong profile estimate > > which drives IRA to optimize wrong spot. But I wonder if we simply > > don't want to teach inlining function called once to not construct large > > loop depths? Something like do not inline if caller&callee loop depth > > is over 3 or so? > > Will try this. Though test result shows that no-inline is better than inline > one than inline two than more here. 
Moreover, sorry that I am also not quite > following Richard Biener whether he is positive or not about this... > > > > > Honza > >> } > >> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c > >> index 148efbc09ef..ba0cf836364 100644 > >> --- a/gcc/ipa-inline-analysis.c > >> +++ b/gcc/ipa-inline-analysis.c > >> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data) > >> continue; > >> } > >> d->growth += estimate_edge_growth (e); > >> + if (e->caller->inlined_to) > >> + d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth; > >> + > > What about this change, is it reasonable to add the inlined_to's caller growth to > current edge? It is a bit independent with previous code. > > > Thanks, > Xionghu > > >> if (d->growth > d->cap) > >> return true; > >> } > >> -- > >> 2.25.1 > >> ^ permalink raw reply [flat|nested] 21+ messages in thread
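The even-scaling idea Honza proposes in the mail above (give each of the N clones 1/N of the original node's count instead of trusting the guessed recursive edge probability) can be sketched as follows. This is an illustrative model only: plain integers stand in for GCC's profile_count, and the function names are invented for the example:

```c
#include <assert.h>

/* Sketch (not GCC code) of the conservative profile updating proposed
   above: instead of deriving each clone's count from the guessed
   recursive edge probability, split the original node's count evenly
   among the N specialized clones.  */
static long
clone_count (long orig_count, int n_clones)
{
  return orig_count / n_clones;	/* each clone gets 1/N of the total */
}

/* The clones' counts still sum to the original count (up to integer
   rounding), so callees of the clones keep sane call frequencies.  */
static long
total_after_split (long orig_count, int n_clones)
{
  long sum = 0;
  for (int i = 0; i < n_clones; i++)
    sum += clone_count (orig_count, n_clones);
  return sum;
}
```

The point of the scheme is the bound on the error: the hottest clone's share of the time can now be underestimated by at most a factor of N, rather than by a factor that compounds exponentially with recursion depth when a too-small edge frequency is applied level by level.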
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-13 12:52 ` Jan Hubicka @ 2020-08-14 9:04 ` luoxhu 2020-08-20 9:31 ` Richard Sandiford 0 siblings, 1 reply; 21+ messages in thread From: luoxhu @ 2020-08-14 9:04 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc-patches, segher, wschmidt, linkw, dje.gcc Hi, On 2020/8/13 20:52, Jan Hubicka wrote: >> Since there are no other callers outside of these specialized nodes, the >> guessed profile count should be same equal? Perf tool shows that even >> each specialized node is called only once, none of them take same time for >> each call: >> >> 40.65% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.4 >> 16.31% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.3 >> 10.91% exchange2_gcc.o libgfortran.so.5.0.0 [.] _gfortran_mminloc0_4_i4 >> 5.41% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.6 >> 4.68% exchange2_gcc.o exchange2_gcc.orig.slow [.] __logic_MOD_new_solver >> 3.76% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.5 >> 1.07% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.7 >> 0.84% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.0 >> 0.47% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.2 >> 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.1 >> 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_covered.constprop.0 >> 0.11% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_reflected.constprop.0 >> 0.00% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.1 >> >> >> digits_2.constprop.4 & digits_2.constprop.3 takes most of the execution time, >> So profile count and frequency seem not very helpful for this case? 
> Yep, you can not really determine the time spent on each of recursion > levels from the recursive edge probability since you can not assume it > to be the same on each level of recursion (here we know it is 0 at level > 10 and it is slowly dropping as the recursion increases because there > are fewer posiblities to fill up the partly filled sudoku:). > Especially you can not assume it after the ipa-cp did the work to figure > out that there is recursion depth counter that affect the function. > > Thinking of the ipa-cp profile updating changes, I did not consider safe > the approach of adding extra weight to the recursive call. The problem > is that the recursive call frequency estimate, when comming from static > profile stimation, is likely completely off, just like static profile > estimator can not be used to predict realistically the number of > iterations of loops. However even with profile feedback this is hard to > interpret. > > I was wondering if we can not simply detect this scenario and avoid > using the recursive edge probability in the profile updating. > We could simply scale by the number of copies. > This means that if we produce 10 clones, we could just set each clone to > 1/10th of the frequency. This implies that the hottest spot will not be > underestimated expontentially much as can happen now, but just 10 times > at worst, wich is probably still OK. We play similar games in loop > optimizers: static profile estimators expect every loop to trip around 3 > times so unroller/vectorizer/etc would make no sense on them. > > By scaling down according to number of copies we will keep the > frequencies of calls to function called from clones "sane". We will > still run into problems when we inline one clone to another (sine its > proflie will get scaled by the recursive edge frequency), but perhaps > that can be special cases in profiler or make ipa-cp to adjust the > recursive edges to get frequency close to 1 as a hack. 
> > This is not pretty, but at least it looks safer to me than the original > profile updating patch adding extra weight to the frequency. I really appreciate your detailed explanation. BTW, my previous patch for the PGO build of exchange2 took a similar approach, setting each cloned node to 1/10th of the frequency, several months ago :) https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html Xionghu ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-14 9:04 ` luoxhu @ 2020-08-20 9:31 ` Richard Sandiford 2020-08-20 10:49 ` [PATCH] ipa-inline: Improve growth accumulation for recursive callsg Jan Hubicka 2020-08-20 12:04 ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Martin Jambor 0 siblings, 2 replies; 21+ messages in thread From: Richard Sandiford @ 2020-08-20 9:31 UTC (permalink / raw) To: luoxhu via Gcc-patches Cc: Jan Hubicka, luoxhu, wschmidt, dje.gcc, linkw, segher Xionghu, thanks for working on fixing the exchange regression. luoxhu via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On 2020/8/13 20:52, Jan Hubicka wrote: >>> Since there are no other callers outside of these specialized nodes, the >>> guessed profile count should be same equal? Perf tool shows that even >>> each specialized node is called only once, none of them take same time for >>> each call: >>> >>> 40.65% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.4 >>> 16.31% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.3 >>> 10.91% exchange2_gcc.o libgfortran.so.5.0.0 [.] _gfortran_mminloc0_4_i4 >>> 5.41% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.6 >>> 4.68% exchange2_gcc.o exchange2_gcc.orig.slow [.] __logic_MOD_new_solver >>> 3.76% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.5 >>> 1.07% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.7 >>> 0.84% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.0 >>> 0.47% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.2 >>> 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_digits_2.constprop.1 >>> 0.24% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_covered.constprop.0 >>> 0.11% exchange2_gcc.o exchange2_gcc.orig.slow [.] 
__brute_force_MOD_reflected.constprop.0 >>> 0.00% exchange2_gcc.o exchange2_gcc.orig.slow [.] __brute_force_MOD_brute.constprop.1 >>> >>> >>> digits_2.constprop.4 & digits_2.constprop.3 takes most of the execution time, >>> So profile count and frequency seem not very helpful for this case? >> Yep, you can not really determine the time spent on each of recursion >> levels from the recursive edge probability since you can not assume it >> to be the same on each level of recursion (here we know it is 0 at level >> 10 and it is slowly dropping as the recursion increases because there >> are fewer posiblities to fill up the partly filled sudoku:). >> Especially you can not assume it after the ipa-cp did the work to figure >> out that there is recursion depth counter that affect the function. >> >> Thinking of the ipa-cp profile updating changes, I did not consider safe >> the approach of adding extra weight to the recursive call. The problem >> is that the recursive call frequency estimate, when comming from static >> profile stimation, is likely completely off, just like static profile >> estimator can not be used to predict realistically the number of >> iterations of loops. However even with profile feedback this is hard to >> interpret. >> >> I was wondering if we can not simply detect this scenario and avoid >> using the recursive edge probability in the profile updating. >> We could simply scale by the number of copies. >> This means that if we produce 10 clones, we could just set each clone to >> 1/10th of the frequency. This implies that the hottest spot will not be >> underestimated expontentially much as can happen now, but just 10 times >> at worst, wich is probably still OK. We play similar games in loop >> optimizers: static profile estimators expect every loop to trip around 3 >> times so unroller/vectorizer/etc would make no sense on them. 
>> >> By scaling down according to number of copies we will keep the >> frequencies of calls to function called from clones "sane". We will >> still run into problems when we inline one clone to another (sine its >> proflie will get scaled by the recursive edge frequency), but perhaps >> that can be special cases in profiler or make ipa-cp to adjust the >> recursive edges to get frequency close to 1 as a hack. >> >> This is not pretty, but at least it looks safer to me than the original >> profile updating patch adding extra weight to the frequency. > > > Really appreciate for your detailed explanation. BTW, My previous patch > for PGO build on exchange2 takes this similar method by setting each cloned > node to 1/10th of the frequency several month agao :) > > https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html Does it seem likely that we'll reach a resolution on this soon? I take the point that the patch that introduced the exchange regression [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html] was just uncovering a latent issue, but still, this is a large regression in an important benchmark to be carrying around. For those of us doing regular benchmark CI, the longer the performance trough gets, the harder it is to spot other unrelated regressions in the “properly optimised” code. So if we don't have a specific plan for fixing the regression soon, I think we should consider reverting the patch until we have something that avoids the exchange regression, even though the patch wasn't technically wrong. Thanks, Richard ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive callsg 2020-08-20 9:31 ` Richard Sandiford @ 2020-08-20 10:49 ` Jan Hubicka 2020-08-20 12:04 ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Martin Jambor 1 sibling, 0 replies; 21+ messages in thread From: Jan Hubicka @ 2020-08-20 10:49 UTC (permalink / raw) To: luoxhu via Gcc-patches, luoxhu, wschmidt, dje.gcc, linkw, segher, richard.sandiford, mjambor > > > > I really appreciate your detailed explanation. BTW, my previous patch > > for the PGO build of exchange2 took a similar approach, setting each cloned > > node to 1/10th of the frequency, several months ago :) > > > > https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html I was looking again into the patch and I think we could adjust it for mainline. I wanted to discuss this with Martin Jambor (added to CC) after he returns from vacation. I think the idea of scaling all copies by the number of copies is a reasonable thing to do (as discussed in the earlier mail) since generally we do not have information about where the hot recursion levels lie. By the master theorem it is easy to see that this is actually a very subtle problem and our profile info is not precise enough to be useful here. Making all copies even is thus a good conservative solution. I did not quite understand this point earlier, so the patch did not seem to make much sense to me and I saw it as an odd exchange-only hack. I apologize for that. What I think is wrong about the patch is that the scale is not set according to the number of actual copies produced, but according to the --param controlling the maximal recursion depth (since with exchange we manage to meet it). We do not need to cap the recursion depth as low as the patch does (as shown by the factorial example that seems easy enough, so we should handle it eventually by default). I think we need to work out how to scale by the actual number of clones. The problem is that we clone one by one and each time we update the profile.
So we do not really know how many times we will clone at that time. I was thinking of simply keeping the chain of recursively cloned nodes and rescaling the profile of all of them each time, so the total stays the same. It would cause some roundoff issues and will also affect the ipa-cp decisions themselves since they will use the intermediate values. This is probably not too bad, but perhaps Martin may have a better idea. Honza > > Does it seem likely that we'll reach a resolution on this soon? > I take the point that the patch that introduced the exchange regression > [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html] > was just uncovering a latent issue, but still, this is a large regression > in an important benchmark to be carrying around. For those of us doing > regular benchmark CI, the longer the performance trough gets, the harder > it is to spot other unrelated regressions in the “properly optimised” code. > > So if we don't have a specific plan for fixing the regression soon, > I think we should consider reverting the patch until we have something > that avoids the exchange regression, even though the patch wasn't > technically wrong. The problem here is that IRA needs to be lied to to get good results. One direction of attack is to ignore this problem, get IPA-CP to work by default, and convince the inliner not to inline at large loop depths. This seems to get good performance on x86-64; I did not test ARM. We may fix the IRA issue, but Vladimir does not seem to know what to do according to the PR log. We may lie to the compiler again and produce a wrong profile. diff --git a/gcc/predict.def b/gcc/predict.def index 2d8d4958b6d..d8e502152b8 100644 --- a/gcc/predict.def +++ b/gcc/predict.def @@ -177,7 +177,7 @@ DEF_PREDICTOR (PRED_LOOP_GUARD, "loop guard", HITRATE (73), 0) /* Same but for loops containing recursion.
*/ DEF_PREDICTOR (PRED_LOOP_GUARD_WITH_RECURSION, "loop guard with recursion", - HITRATE (85), 0) + HITRATE (95), 0) /* The following predictors are used in Fortran. */ This gets the performance back for me. I do not really like this alternative. One option I would still like to look into is that IRA seems to work better with profile feedback than with the currently guessed profile. It works even better with the wrongly guessed profile with the patch above, but perhaps we could fix the profile estimate somewhere else to make the profile more realistic and IRA happier. Honza > > Thanks, > Richard ^ permalink raw reply [flat|nested] 21+ messages in thread
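The "clone one by one and rescale the whole chain each time" approach Honza mentions in this mail can be sketched like this. Again illustrative only (plain longs instead of GCC's profile_count, invented names): when clone number K+1 is created, the existing K chain members are scaled by K/(K+1), so the chain's total stays equal to the original count up to integer rounding:

```c
#include <assert.h>

#define MAX_CLONES 16

/* Sketch of keeping a chain of recursive clones and rescaling it on
   every new clone, so we never need to know the final number of
   clones in advance.  */
struct clone_chain
{
  long counts[MAX_CLONES];
  int n;		/* current number of chain members */
  long orig;		/* count of the original node */
};

static void
chain_init (struct clone_chain *c, long orig_count)
{
  c->n = 1;
  c->orig = orig_count;
  c->counts[0] = orig_count;
}

/* Add one clone: scale the K existing members by K/(K+1) and give the
   new member 1/(K+1) of the original count, keeping the sum roughly
   equal to the original.  */
static void
chain_add_clone (struct clone_chain *c)
{
  int k = c->n;
  for (int i = 0; i < k; i++)
    c->counts[i] = c->counts[i] * k / (k + 1);
  c->counts[k] = c->orig / (k + 1);
  c->n = k + 1;
}
```

As the mail notes, a real implementation would have to live with roundoff and with ipa-cp seeing the intermediate values; the sketch only shows that the invariant (chain total equals the original count) can be maintained incrementally.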
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-20 9:31 ` Richard Sandiford 2020-08-20 10:49 ` [PATCH] ipa-inline: Improve growth accumulation for recursive callsg Jan Hubicka @ 2020-08-20 12:04 ` Martin Jambor 2020-08-21 9:17 ` Tamar Christina 1 sibling, 1 reply; 21+ messages in thread From: Martin Jambor @ 2020-08-20 12:04 UTC (permalink / raw) To: Richard Sandiford, luoxhu via Gcc-patches Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc Hi, On Thu, Aug 20 2020, Richard Sandiford wrote: >> >> >> I really appreciate your detailed explanation. BTW, my previous patch >> for the PGO build of exchange2 took a similar approach, setting each cloned >> node to 1/10th of the frequency, several months ago :) >> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html > > Does it seem likely that we'll reach a resolution on this soon? > I take the point that the patch that introduced the exchange regression > [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html] > was just uncovering a latent issue, but still, this is a large regression > in an important benchmark to be carrying around. For those of us doing > regular benchmark CI, the longer the performance trough gets, the harder > it is to spot other unrelated regressions in the “properly optimised” code. > > So if we don't have a specific plan for fixing the regression soon, > I think we should consider reverting the patch until we have something > that avoids the exchange regression, even though the patch wasn't > technically wrong. Honza's changes have been motivated to a big extent as an enabler for IPA-CP heuristics changes to actually speed up 548.exchange2_r. On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two weeks ago, this week it is 403, but with my WIP (and so far untested) patch below it is just 276 seconds - faster than the one built with GCC 8, which needs 283 seconds.
I'll be interested in knowing if it also works this well on other architectures. The patch still needs a bit of a cleanup. The change of the default value of ipa-cp-unit-growth needs to be done only for small compilation units (like inlining does). I should experiment if the value of param_ipa_cp_loop_hint_bonus should be changed or not. And last but not least, I also want to clean-up the interfaces between ipa-fnsummary.c and ipa-cp.c a bit. I am working on all of this and hope to finish the patch set in a few (working) days. The bottom line is that there is a plan to address this regression. Martin diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index e4910a04ffa..0d44310503a 100644 --- a/gcc/ipa-cp.c +++ b/gcc/ipa-cp.c @@ -3190,11 +3190,23 @@ devirtualization_time_bonus (struct cgraph_node *node, /* Return time bonus incurred because of HINTS. */ static int -hint_time_bonus (cgraph_node *node, ipa_hints hints) +hint_time_bonus (cgraph_node *node, ipa_hints hints, sreal known_iter_freq, + sreal known_strides_freq) { int result = 0; - if (hints & (INLINE_HINT_loop_iterations | INLINE_HINT_loop_stride)) - result += opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus); + sreal bonus_for_one = opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus); + + if (hints & INLINE_HINT_loop_iterations) + { + /* FIXME: This should probably be way more nuanced. */ + result += (known_iter_freq * bonus_for_one).to_int (); + } + if (hints & INLINE_HINT_loop_stride) + { + /* FIXME: And this as well. 
*/ + result += (known_strides_freq * bonus_for_one).to_int (); + } + return result; } @@ -3395,12 +3407,13 @@ perform_estimation_of_a_value (cgraph_node *node, vec<tree> known_csts, int est_move_cost, ipcp_value_base *val) { int size, time_benefit; - sreal time, base_time; + sreal time, base_time, known_iter_freq, known_strides_freq; ipa_hints hints; estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts, known_aggs, &size, &time, - &base_time, &hints); + &base_time, &hints, &known_iter_freq, + &known_strides_freq); base_time -= time; if (base_time > 65535) base_time = 65535; @@ -3414,7 +3427,7 @@ perform_estimation_of_a_value (cgraph_node *node, vec<tree> known_csts, time_benefit = base_time.to_int () + devirtualization_time_bonus (node, known_csts, known_contexts, known_aggs) - + hint_time_bonus (node, hints) + + hint_time_bonus (node, hints, known_iter_freq, known_strides_freq) + removable_params_cost + est_move_cost; gcc_checking_assert (size >=0); @@ -3476,7 +3489,7 @@ estimate_local_effects (struct cgraph_node *node) { struct caller_statistics stats; ipa_hints hints; - sreal time, base_time; + sreal time, base_time, known_iter_freq, known_strides_freq; int size; init_caller_stats (&stats); @@ -3484,9 +3497,11 @@ estimate_local_effects (struct cgraph_node *node) false); estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts, known_aggs, &size, &time, - &base_time, &hints); + &base_time, &hints, &known_iter_freq, + &known_strides_freq); time -= devirt_bonus; - time -= hint_time_bonus (node, hints); + time -= hint_time_bonus (node, hints, known_iter_freq, + known_strides_freq); time -= removable_params_cost; size -= stats.n_calls * removable_params_cost; diff --git a/gcc/ipa-fnsummary.c b/gcc/ipa-fnsummary.c index 2cfab40156e..29fac5db19f 100644 --- a/gcc/ipa-fnsummary.c +++ b/gcc/ipa-fnsummary.c @@ -310,6 +310,35 @@ set_hint_predicate (predicate **p, predicate new_predicate) } } +/* Find if NEW_PREDICATE is already in V and if so, 
increment its freq. + Otherwise add a new item to the vector with this predicate and frerq equal + to add_freq. */ + +static void +add_freqcounting_predicate (vec<ipa_freqcounting_predicate, va_gc> **v, + const predicate &new_predicate, sreal add_freq) +{ + if (new_predicate == false || new_predicate == true) + return; + ipa_freqcounting_predicate *f; + for (int i = 0; vec_safe_iterate (*v, i, &f); i++) + if (new_predicate == f->predicate) + { + f->freq += add_freq; + return; + } + /* FIXME: Make this a parameter. */ + if (vec_safe_length (*v) >= 32) + /* Too many different predicates to account for. */ + return; + + ipa_freqcounting_predicate fcp; + fcp.predicate = NULL; + set_hint_predicate (&fcp.predicate, new_predicate); + fcp.freq = add_freq; + vec_safe_push (*v, fcp); + return; +} /* Compute what conditions may or may not hold given information about parameters. RET_CLAUSE returns truths that may hold in a specialized copy, @@ -722,10 +751,12 @@ ipa_call_summary::~ipa_call_summary () ipa_fn_summary::~ipa_fn_summary () { - if (loop_iterations) - edge_predicate_pool.remove (loop_iterations); - if (loop_stride) - edge_predicate_pool.remove (loop_stride); + unsigned len = vec_safe_length (loop_iterations); + for (unsigned i = 0; i < len; i++) + edge_predicate_pool.remove ((*loop_iterations)[i].predicate); + len = vec_safe_length (loop_strides); + for (unsigned i = 0; i < len; i++) + edge_predicate_pool.remove ((*loop_strides)[i].predicate); vec_free (conds); vec_free (size_time_table); vec_free (call_size_time_table); @@ -741,24 +772,33 @@ ipa_fn_summary_t::remove_callees (cgraph_node *node) ipa_call_summaries->remove (e); } -/* Same as remap_predicate_after_duplication but handle hint predicate *P. - Additionally care about allocating new memory slot for updated predicate - and set it to NULL when it becomes true or false (and thus uninteresting). 
- */ +/* Duplicate predicates in loop hint vector, allocating memory for them and + remove and deallocate any uninteresting (true or false) ones. Return the + result. */ -static void -remap_hint_predicate_after_duplication (predicate **p, - clause_t possible_truths) +static vec<ipa_freqcounting_predicate, va_gc> * +remap_freqcounting_preds_after_dup (vec<ipa_freqcounting_predicate, va_gc> *v, + clause_t possible_truths) { - predicate new_predicate; + if (vec_safe_length (v) == 0) + return NULL; - if (!*p) - return; + vec<ipa_freqcounting_predicate, va_gc> *res = v->copy (); + int len = res->length(); + for (int i = len - 1; i >= 0; i--) + { + predicate new_predicate + = (*res)[i].predicate->remap_after_duplication (possible_truths); + /* We do not want to free previous predicate; it is used by node + origin. */ + (*res)[i].predicate = NULL; + set_hint_predicate (&(*res)[i].predicate, new_predicate); + + if (!(*res)[i].predicate) + res->unordered_remove (i); + } - new_predicate = (*p)->remap_after_duplication (possible_truths); - /* We do not want to free previous predicate; it is used by node origin. 
*/ - *p = NULL; - set_hint_predicate (p, new_predicate); + return res; } @@ -874,9 +914,11 @@ ipa_fn_summary_t::duplicate (cgraph_node *src, optimized_out_size += es->call_stmt_size * ipa_fn_summary::size_scale; edge_set_predicate (edge, &new_predicate); } - remap_hint_predicate_after_duplication (&info->loop_iterations, + info->loop_iterations + = remap_freqcounting_preds_after_dup (info->loop_iterations, possible_truths); - remap_hint_predicate_after_duplication (&info->loop_stride, + info->loop_strides + = remap_freqcounting_preds_after_dup (info->loop_strides, possible_truths); /* If inliner or someone after inliner will ever start producing @@ -888,17 +930,21 @@ ipa_fn_summary_t::duplicate (cgraph_node *src, else { info->size_time_table = vec_safe_copy (info->size_time_table); - if (info->loop_iterations) + info->loop_iterations = vec_safe_copy (info->loop_iterations); + info->loop_strides = vec_safe_copy (info->loop_strides); + + ipa_freqcounting_predicate *f; + for (int i = 0; vec_safe_iterate (info->loop_iterations, i, &f); i++) { - predicate p = *info->loop_iterations; - info->loop_iterations = NULL; - set_hint_predicate (&info->loop_iterations, p); + predicate p = *f->predicate; + f->predicate = NULL; + set_hint_predicate (&f->predicate, p); } - if (info->loop_stride) + for (int i = 0; vec_safe_iterate (info->loop_strides, i, &f); i++) { - predicate p = *info->loop_stride; - info->loop_stride = NULL; - set_hint_predicate (&info->loop_stride, p); + predicate p = *f->predicate; + f->predicate = NULL; + set_hint_predicate (&f->predicate, p); } } if (!dst->inlined_to) @@ -1057,15 +1103,28 @@ ipa_dump_fn_summary (FILE *f, struct cgraph_node *node) } fprintf (f, "\n"); } - if (s->loop_iterations) + ipa_freqcounting_predicate *fcp; + bool first_fcp = true; + for (int i = 0; vec_safe_iterate (s->loop_iterations, i, &fcp); i++) { - fprintf (f, " loop iterations:"); - s->loop_iterations->dump (f, s->conds); + if (first_fcp) + { + fprintf (f, " loop iterations:"); + 
first_fcp = false; + } + fprintf (f, " %3.2f for ", fcp->freq.to_double ()); + fcp->predicate->dump (f, s->conds); } - if (s->loop_stride) + first_fcp = true; + for (int i = 0; vec_safe_iterate (s->loop_strides, i, &fcp); i++) { - fprintf (f, " loop stride:"); - s->loop_stride->dump (f, s->conds); + if (first_fcp) + { + fprintf (f, " loop strides:"); + first_fcp = false; + } + fprintf (f, " %3.2f for :", fcp->freq.to_double ()); + fcp->predicate->dump (f, s->conds); } fprintf (f, " calls:\n"); dump_ipa_call_summary (f, 4, node, s); @@ -2514,12 +2573,13 @@ analyze_function_body (struct cgraph_node *node, bool early) if (fbi.info) compute_bb_predicates (&fbi, node, info, params_summary); + const profile_count entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count; order = XNEWVEC (int, n_basic_blocks_for_fn (cfun)); nblocks = pre_and_rev_post_order_compute (NULL, order, false); for (n = 0; n < nblocks; n++) { bb = BASIC_BLOCK_FOR_FN (cfun, order[n]); - freq = bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count); + freq = bb->count.to_sreal_scale (entry_count); if (clobber_only_eh_bb_p (bb)) { if (dump_file && (dump_flags & TDF_DETAILS)) @@ -2758,23 +2818,27 @@ analyze_function_body (struct cgraph_node *node, bool early) if (nonconstant_names.exists () && !early) { + ipa_fn_summary *s = ipa_fn_summaries->get (node); class loop *loop; - predicate loop_iterations = true; - predicate loop_stride = true; if (dump_file && (dump_flags & TDF_DETAILS)) flow_loops_dump (dump_file, NULL, 0); scev_initialize (); FOR_EACH_LOOP (loop, 0) { + predicate loop_iterations = true; + sreal header_freq; vec<edge> exits; edge ex; unsigned int j; class tree_niter_desc niter_desc; if (loop->header->aux) - bb_predicate = *(predicate *) loop->header->aux; + { + bb_predicate = *(predicate *) loop->header->aux; + header_freq = loop->header->count.to_sreal_scale (entry_count); + } else - bb_predicate = false; + continue; exits = get_loop_exit_edges (loop); FOR_EACH_VEC_ELT (exits, j, ex) 
@@ -2790,10 +2854,10 @@ analyze_function_body (struct cgraph_node *node, bool early) will_be_nonconstant = bb_predicate & will_be_nonconstant; if (will_be_nonconstant != true && will_be_nonconstant != false) - /* This is slightly inprecise. We may want to represent each - loop with independent predicate. */ loop_iterations &= will_be_nonconstant; } + add_freqcounting_predicate (&s->loop_iterations, loop_iterations, + header_freq); exits.release (); } @@ -2803,14 +2867,20 @@ analyze_function_body (struct cgraph_node *node, bool early) for (loop = loops_for_fn (cfun)->tree_root->inner; loop != NULL; loop = loop->next) { + predicate loop_stride = true; + sreal bb_freq; basic_block *body = get_loop_body (loop); for (unsigned i = 0; i < loop->num_nodes; i++) { gimple_stmt_iterator gsi; if (body[i]->aux) - bb_predicate = *(predicate *) body[i]->aux; + { + bb_predicate = *(predicate *) body[i]->aux; + bb_freq = body[i]->count.to_sreal_scale (entry_count); + } else - bb_predicate = false; + continue; + for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi)) { @@ -2839,16 +2909,13 @@ analyze_function_body (struct cgraph_node *node, bool early) will_be_nonconstant = bb_predicate & will_be_nonconstant; if (will_be_nonconstant != true && will_be_nonconstant != false) - /* This is slightly inprecise. We may want to represent - each loop with independent predicate. 
*/ loop_stride = loop_stride & will_be_nonconstant; } } + add_freqcounting_predicate (&s->loop_strides, loop_stride, + bb_freq); free (body); } - ipa_fn_summary *s = ipa_fn_summaries->get (node); - set_hint_predicate (&s->loop_iterations, loop_iterations); - set_hint_predicate (&s->loop_stride, loop_stride); scev_finalize (); } FOR_ALL_BB_FN (bb, my_function) @@ -3640,14 +3707,32 @@ ipa_call_context::estimate_size_and_time (int *ret_size, if (time > nonspecialized_time) time = nonspecialized_time; + m_loops_with_known_iterations = 0; + ipa_freqcounting_predicate *fcp; + for (i = 0; vec_safe_iterate (info->loop_iterations, i, &fcp); i++) + { + gcc_assert (fcp->predicate); + if (!fcp->predicate->evaluate (m_possible_truths)) + { + if (ret_hints) + hints |= INLINE_HINT_loop_iterations; + m_loops_with_known_iterations += fcp->freq; + } + } + m_loops_with_known_strides = 0; + for (i = 0; vec_safe_iterate (info->loop_strides, i, &fcp); i++) + { + gcc_assert (fcp->predicate); + if (!fcp->predicate->evaluate (m_possible_truths)) + { + if (ret_hints) + hints |= INLINE_HINT_loop_stride; + m_loops_with_known_strides += fcp->freq; + } + } + if (ret_hints) { - if (info->loop_iterations - && !info->loop_iterations->evaluate (m_possible_truths)) - hints |= INLINE_HINT_loop_iterations; - if (info->loop_stride - && !info->loop_stride->evaluate (m_possible_truths)) - hints |= INLINE_HINT_loop_stride; if (info->scc_no) hints |= INLINE_HINT_in_scc; if (DECL_DECLARED_INLINE_P (m_node->decl)) @@ -3687,7 +3772,9 @@ estimate_ipcp_clone_size_and_time (struct cgraph_node *node, vec<ipa_agg_value_set> known_aggs, int *ret_size, sreal *ret_time, sreal *ret_nonspec_time, - ipa_hints *hints) + ipa_hints *hints, + sreal *loops_with_known_iterations, + sreal *loops_with_known_strides) { clause_t clause, nonspec_clause; @@ -3699,6 +3786,8 @@ estimate_ipcp_clone_size_and_time (struct cgraph_node *node, known_aggs, vNULL); ctx.estimate_size_and_time (ret_size, NULL, ret_time, ret_nonspec_time, 
hints); + *loops_with_known_iterations = ctx.m_loops_with_known_iterations; + *loops_with_known_strides = ctx.m_loops_with_known_strides; } /* Return stack frame offset where frame of NODE is supposed to start inside @@ -3857,32 +3946,31 @@ remap_edge_summaries (struct cgraph_edge *inlined_edge, } } -/* Same as remap_predicate, but set result into hint *HINT. */ +/* Run remap_after_inlining on each predicate in V. */ static void -remap_hint_predicate (class ipa_fn_summary *info, - class ipa_node_params *params_summary, - class ipa_fn_summary *callee_info, - predicate **hint, - vec<int> operand_map, - vec<int> offset_map, - clause_t possible_truths, - predicate *toplev_predicate) -{ - predicate p; +remap_freqcounting_predicate (class ipa_fn_summary *info, + class ipa_node_params *params_summary, + class ipa_fn_summary *callee_info, + vec<ipa_freqcounting_predicate, va_gc> *v, + vec<int> operand_map, + vec<int> offset_map, + clause_t possible_truths, + predicate *toplev_predicate) - if (!*hint) - return; - p = (*hint)->remap_after_inlining - (info, params_summary, callee_info, - operand_map, offset_map, - possible_truths, *toplev_predicate); - if (p != false && p != true) +{ + ipa_freqcounting_predicate *fcp; + for (int i = 0; vec_safe_iterate (v, i, &fcp); i++) { - if (!*hint) - set_hint_predicate (hint, p); - else - **hint &= p; + predicate p + = fcp->predicate->remap_after_inlining (info, params_summary, + callee_info, operand_map, + offset_map, possible_truths, + *toplev_predicate); + if (p != false && p != true) + /* FIXME: Is this really supposed to be &= and not a plain + assignment? 
*/ + *fcp->predicate &= p; } } @@ -3992,12 +4080,12 @@ ipa_merge_fn_summary_after_inlining (struct cgraph_edge *edge) remap_edge_summaries (edge, edge->callee, info, params_summary, callee_info, operand_map, offset_map, clause, &toplev_predicate); - remap_hint_predicate (info, params_summary, callee_info, - &callee_info->loop_iterations, - operand_map, offset_map, clause, &toplev_predicate); - remap_hint_predicate (info, params_summary, callee_info, - &callee_info->loop_stride, - operand_map, offset_map, clause, &toplev_predicate); + remap_freqcounting_predicate (info, params_summary, callee_info, + info->loop_iterations, operand_map, + offset_map, clause, &toplev_predicate); + remap_freqcounting_predicate (info, params_summary, callee_info, + info->loop_strides, operand_map, + offset_map, clause, &toplev_predicate); HOST_WIDE_INT stack_frame_offset = ipa_get_stack_frame_offset (edge->callee); HOST_WIDE_INT peak = stack_frame_offset + callee_info->estimated_stack_size; @@ -4322,12 +4410,34 @@ inline_read_section (struct lto_file_decl_data *file_data, const char *data, info->size_time_table->quick_push (e); } - p.stream_in (&ib); - if (info) - set_hint_predicate (&info->loop_iterations, p); - p.stream_in (&ib); - if (info) - set_hint_predicate (&info->loop_stride, p); + count2 = streamer_read_uhwi (&ib); + for (j = 0; j < count2; j++) + { + p.stream_in (&ib); + sreal fcp_freq = sreal::stream_in (&ib); + if (info) + { + ipa_freqcounting_predicate fcp; + fcp.predicate = NULL; + set_hint_predicate (&fcp.predicate, p); + fcp.freq = fcp_freq; + vec_safe_push (info->loop_iterations, fcp); + } + } + count2 = streamer_read_uhwi (&ib); + for (j = 0; j < count2; j++) + { + p.stream_in (&ib); + sreal fcp_freq = sreal::stream_in (&ib); + if (info) + { + ipa_freqcounting_predicate fcp; + fcp.predicate = NULL; + set_hint_predicate (&fcp.predicate, p); + fcp.freq = fcp_freq; + vec_safe_push (info->loop_strides, fcp); + } + } for (e = node->callees; e; e = e->next_callee) 
read_ipa_call_summary (&ib, e, info != NULL); for (e = node->indirect_calls; e; e = e->next_callee) @@ -4487,14 +4597,19 @@ ipa_fn_summary_write (void) e->exec_predicate.stream_out (ob); e->nonconst_predicate.stream_out (ob); } - if (info->loop_iterations) - info->loop_iterations->stream_out (ob); - else - streamer_write_uhwi (ob, 0); - if (info->loop_stride) - info->loop_stride->stream_out (ob); - else - streamer_write_uhwi (ob, 0); + ipa_freqcounting_predicate *fcp; + streamer_write_uhwi (ob, vec_safe_length (info->loop_iterations)); + for (i = 0; vec_safe_iterate (info->loop_iterations, i, &fcp); i++) + { + fcp->predicate->stream_out (ob); + fcp->freq.stream_out (ob); + } + streamer_write_uhwi (ob, vec_safe_length (info->loop_strides)); + for (i = 0; vec_safe_iterate (info->loop_strides, i, &fcp); i++) + { + fcp->predicate->stream_out (ob); + fcp->freq.stream_out (ob); + } for (edge = cnode->callees; edge; edge = edge->next_callee) write_ipa_call_summary (ob, edge); for (edge = cnode->indirect_calls; edge; edge = edge->next_callee) diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h index c6ddc9f3199..d8429afdbef 100644 --- a/gcc/ipa-fnsummary.h +++ b/gcc/ipa-fnsummary.h @@ -101,6 +101,19 @@ public: } }; +/* Structure to capture how frequently some interesting events occur given a + particular predicate. The structure is used to estimate how often we + encounter loops with known iteration count or stride in various + contexts. */ + +struct GTY(()) ipa_freqcounting_predicate +{ + /* The described event happens with this frequency... */ + sreal freq; + /* ...when this predicate evaluates to false. */ + class predicate * GTY((skip)) predicate; +}; + /* Function inlining information. 
*/ class GTY(()) ipa_fn_summary { @@ -112,8 +125,9 @@ public: inlinable (false), single_caller (false), fp_expressions (false), estimated_stack_size (false), time (0), conds (NULL), - size_time_table (NULL), call_size_time_table (NULL), loop_iterations (NULL), - loop_stride (NULL), growth (0), scc_no (0) + size_time_table (NULL), call_size_time_table (NULL), + loop_iterations (NULL), loop_strides (NULL), + growth (0), scc_no (0) { } @@ -125,13 +139,12 @@ public: estimated_stack_size (s.estimated_stack_size), time (s.time), conds (s.conds), size_time_table (s.size_time_table), call_size_time_table (NULL), - loop_iterations (s.loop_iterations), loop_stride (s.loop_stride), + loop_iterations (s.loop_iterations), loop_strides (s.loop_strides), growth (s.growth), scc_no (s.scc_no) {} /* Default constructor. */ ~ipa_fn_summary (); - /* Information about the function body itself. */ /* Minimal size increase after inlining. */ @@ -164,12 +177,10 @@ public: vec<size_time_entry, va_gc> *size_time_table; vec<size_time_entry, va_gc> *call_size_time_table; - /* Predicate on when some loop in the function becomes to have known - bounds. */ - predicate * GTY((skip)) loop_iterations; - /* Predicate on when some loop in the function becomes to have known - stride. */ - predicate * GTY((skip)) loop_stride; + /* Predicates on when some loops in the function can have known bounds. */ + vec<ipa_freqcounting_predicate, va_gc> *loop_iterations; + /* Predicates on when some loops in the function can have known strides. */ + vec<ipa_freqcounting_predicate, va_gc> *loop_strides; /* Estimated growth for inlining all copies of the function before start of small functions inlining. This value will get out of date as the callers are duplicated, but @@ -316,6 +327,15 @@ public: { return m_node != NULL; } + + + /* How often loops will have known iterations. Calculated in + estimate_size_and_time. */ + sreal m_loops_with_known_iterations; + /* How often loops will have known strides. 
Calculated in + estimate_size_and_time. */ + sreal m_loops_with_known_strides; + private: /* Called function. */ cgraph_node *m_node; @@ -353,7 +373,7 @@ void estimate_ipcp_clone_size_and_time (struct cgraph_node *, vec<ipa_polymorphic_call_context>, vec<ipa_agg_value_set>, int *, sreal *, sreal *, - ipa_hints *); + ipa_hints *, sreal *, sreal *); void ipa_merge_fn_summary_after_inlining (struct cgraph_edge *edge); void ipa_update_overall_fn_summary (struct cgraph_node *node, bool reset = true); void compute_fn_summary (struct cgraph_node *, bool); diff --git a/gcc/params.opt b/gcc/params.opt index f39e5d1a012..2a5f3d61727 100644 --- a/gcc/params.opt +++ b/gcc/params.opt @@ -211,7 +211,7 @@ Common Joined UInteger Var(param_ipa_cp_single_call_penalty) Init(15) IntegerRan Percentage penalty functions containing a single call to another function will receive when they are evaluated for cloning. -param=ipa-cp-unit-growth= -Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(10) Param Optimization +Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(80) Param Optimization How much can given compilation unit grow because of the interprocedural constant propagation (in percent). -param=ipa-cp-value-list-size= ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-20 12:04 ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Martin Jambor @ 2020-08-21 9:17 ` Tamar Christina 2020-09-08 14:00 ` Martin Jambor 0 siblings, 1 reply; 21+ messages in thread From: Tamar Christina @ 2020-08-21 9:17 UTC (permalink / raw) To: Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc Hi Martin, > Hi, > > On Thu, Aug 20 2020, Richard Sandiford wrote: > >> > >> > >> Really appreciate for your detailed explanation. BTW, My previous > >> patch for PGO build on exchange2 takes this similar method by setting > >> each cloned node to 1/10th of the frequency several month agao :) > >> > >> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html > > > > Does it seem likely that we'll reach a resolution on this soon? > > I take the point that the patch that introduced the exchange > > regression > > [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html] > > was just uncovering a latent issue, but still, this is a large > > regression in an important benchmark to be carrying around. For those > > of us doing regular benchmark CI, the longer the performance trough > > gets, the harder it is to spot other unrelated regressions in the “properly > optimised” code. > > > > So if we don't have a specific plan for fixing the regression soon, I > > think we should consider reverting the patch until we have something > > that avoids the exchange regression, even though the patch wasn't > > technically wrong. > > Honza's changes have been motivated to big extent as an enabler for IPA-CP > heuristics changes to actually speed up 548.exchange2_r. > > On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two > weeks ago, this week it is 403, but with my WIP (and so far untested) patch > below it is just 276 seconds - faster than one built with GCC 8 which needs > 283 seconds. 
> > I'll be interested in knowing if it also works this well on other architectures. > Many thanks for working on this! I tried this on an AArch64 Neoverse-N1 machine and didn't see any difference. Do I need any flags for it to work? The patch was applied on top of 656218ab982cc22b826227045826c92743143af1, and I tried 3 runs: 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once 2) -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once 3) -mcpu=native -Ofast -fomit-frame-pointer -flto The first one used to give us the best result; with this patch there is no difference between 1 and 2 (an 11% regression), and the third one is about 15% on top of that. If there's anything I can do to help just let me know. Cheers, Tamar > The patch still needs a bit of a cleanup. The change of the default value of > ipa-cp-unit-growth needs to be done only for small compilation units (like > inlining does). I should experiment if the value of > param_ipa_cp_loop_hint_bonus should be changed or not. And last but not > least, I also want to clean-up the interfaces between ipa-fnsummary.c and > ipa-cp.c a bit. I am working on all of this and hope to finish the patch set in a > few (working) days. > > The bottom line is that there is a plan to address this regression. > > Martin > > > > diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index e4910a04ffa..0d44310503a > 100644 > --- a/gcc/ipa-cp.c > +++ b/gcc/ipa-cp.c > @@ -3190,11 +3190,23 @@ devirtualization_time_bonus (struct > cgraph_node *node, > /* Return time bonus incurred because of HINTS.
*/ > > static int > -hint_time_bonus (cgraph_node *node, ipa_hints hints) > +hint_time_bonus (cgraph_node *node, ipa_hints hints, sreal > known_iter_freq, > + sreal known_strides_freq) > { > int result = 0; > - if (hints & (INLINE_HINT_loop_iterations | INLINE_HINT_loop_stride)) > - result += opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus); > + sreal bonus_for_one = opt_for_fn (node->decl, > + param_ipa_cp_loop_hint_bonus); > + > + if (hints & INLINE_HINT_loop_iterations) > + { > + /* FIXME: This should probably be way more nuanced. */ > + result += (known_iter_freq * bonus_for_one).to_int (); > + } > + if (hints & INLINE_HINT_loop_stride) > + { > + /* FIXME: And this as well. */ > + result += (known_strides_freq * bonus_for_one).to_int (); > + } > + > return result; > } > > @@ -3395,12 +3407,13 @@ perform_estimation_of_a_value (cgraph_node > *node, vec<tree> known_csts, > int est_move_cost, ipcp_value_base *val) { > int size, time_benefit; > - sreal time, base_time; > + sreal time, base_time, known_iter_freq, known_strides_freq; > ipa_hints hints; > > estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts, > known_aggs, &size, &time, > - &base_time, &hints); > + &base_time, &hints, &known_iter_freq, > + &known_strides_freq); > base_time -= time; > if (base_time > 65535) > base_time = 65535; > @@ -3414,7 +3427,7 @@ perform_estimation_of_a_value (cgraph_node > *node, vec<tree> known_csts, > time_benefit = base_time.to_int () > + devirtualization_time_bonus (node, known_csts, known_contexts, > known_aggs) > - + hint_time_bonus (node, hints) > + + hint_time_bonus (node, hints, known_iter_freq, > + known_strides_freq) > + removable_params_cost + est_move_cost; > > gcc_checking_assert (size >=0); > @@ -3476,7 +3489,7 @@ estimate_local_effects (struct cgraph_node *node) > { > struct caller_statistics stats; > ipa_hints hints; > - sreal time, base_time; > + sreal time, base_time, known_iter_freq, known_strides_freq; > int size; > > init_caller_stats 
(&stats); > @@ -3484,9 +3497,11 @@ estimate_local_effects (struct cgraph_node > *node) > false); > estimate_ipcp_clone_size_and_time (node, known_csts, > known_contexts, > known_aggs, &size, &time, > - &base_time, &hints); > + &base_time, &hints, > &known_iter_freq, > + &known_strides_freq); > time -= devirt_bonus; > - time -= hint_time_bonus (node, hints); > + time -= hint_time_bonus (node, hints, known_iter_freq, > + known_strides_freq); > time -= removable_params_cost; > size -= stats.n_calls * removable_params_cost; > > diff --git a/gcc/ipa-fnsummary.c b/gcc/ipa-fnsummary.c index > 2cfab40156e..29fac5db19f 100644 > --- a/gcc/ipa-fnsummary.c > +++ b/gcc/ipa-fnsummary.c > @@ -310,6 +310,35 @@ set_hint_predicate (predicate **p, predicate > new_predicate) > } > } > > +/* Find if NEW_PREDICATE is already in V and if so, increment its freq. > + Otherwise add a new item to the vector with this predicate and frerq > equal > + to add_freq. */ > + > +static void > +add_freqcounting_predicate (vec<ipa_freqcounting_predicate, va_gc> **v, > + const predicate &new_predicate, sreal add_freq) { > + if (new_predicate == false || new_predicate == true) > + return; > + ipa_freqcounting_predicate *f; > + for (int i = 0; vec_safe_iterate (*v, i, &f); i++) > + if (new_predicate == f->predicate) > + { > + f->freq += add_freq; > + return; > + } > + /* FIXME: Make this a parameter. */ > + if (vec_safe_length (*v) >= 32) > + /* Too many different predicates to account for. */ > + return; > + > + ipa_freqcounting_predicate fcp; > + fcp.predicate = NULL; > + set_hint_predicate (&fcp.predicate, new_predicate); > + fcp.freq = add_freq; > + vec_safe_push (*v, fcp); > + return; > +} > > /* Compute what conditions may or may not hold given information about > parameters. 
RET_CLAUSE returns truths that may hold in a specialized copy, > @@ -722,10 +751,12 @@ ipa_call_summary::~ipa_call_summary () > > ipa_fn_summary::~ipa_fn_summary () > { > - if (loop_iterations) > - edge_predicate_pool.remove (loop_iterations); > - if (loop_stride) > - edge_predicate_pool.remove (loop_stride); > + unsigned len = vec_safe_length (loop_iterations); for (unsigned i = > + 0; i < len; i++) > + edge_predicate_pool.remove ((*loop_iterations)[i].predicate); > + len = vec_safe_length (loop_strides); for (unsigned i = 0; i < len; > + i++) > + edge_predicate_pool.remove ((*loop_strides)[i].predicate); > vec_free (conds); > vec_free (size_time_table); > vec_free (call_size_time_table); > @@ -741,24 +772,33 @@ ipa_fn_summary_t::remove_callees (cgraph_node > *node) > ipa_call_summaries->remove (e); > } > > -/* Same as remap_predicate_after_duplication but handle hint predicate *P. > - Additionally care about allocating new memory slot for updated predicate > - and set it to NULL when it becomes true or false (and thus uninteresting). > - */ > +/* Duplicate predicates in loop hint vector, allocating memory for them and > + remove and deallocate any uninteresting (true or false) ones. Return the > + result. */ > > -static void > -remap_hint_predicate_after_duplication (predicate **p, > - clause_t possible_truths) > +static vec<ipa_freqcounting_predicate, va_gc> * > +remap_freqcounting_preds_after_dup (vec<ipa_freqcounting_predicate, > va_gc> *v, > + clause_t possible_truths) > { > - predicate new_predicate; > + if (vec_safe_length (v) == 0) > + return NULL; > > - if (!*p) > - return; > + vec<ipa_freqcounting_predicate, va_gc> *res = v->copy (); > + int len = res->length(); > + for (int i = len - 1; i >= 0; i--) > + { > + predicate new_predicate > + = (*res)[i].predicate->remap_after_duplication (possible_truths); > + /* We do not want to free previous predicate; it is used by node > + origin. 
*/ > + (*res)[i].predicate = NULL; > + set_hint_predicate (&(*res)[i].predicate, new_predicate); > + > + if (!(*res)[i].predicate) > + res->unordered_remove (i); > + } > > - new_predicate = (*p)->remap_after_duplication (possible_truths); > - /* We do not want to free previous predicate; it is used by node origin. */ > - *p = NULL; > - set_hint_predicate (p, new_predicate); > + return res; > } > > > @@ -874,9 +914,11 @@ ipa_fn_summary_t::duplicate (cgraph_node *src, > optimized_out_size += es->call_stmt_size * > ipa_fn_summary::size_scale; > edge_set_predicate (edge, &new_predicate); > } > - remap_hint_predicate_after_duplication (&info->loop_iterations, > + info->loop_iterations > + = remap_freqcounting_preds_after_dup (info->loop_iterations, > possible_truths); > - remap_hint_predicate_after_duplication (&info->loop_stride, > + info->loop_strides > + = remap_freqcounting_preds_after_dup (info->loop_strides, > possible_truths); > > /* If inliner or someone after inliner will ever start producing @@ -888,17 > +930,21 @@ ipa_fn_summary_t::duplicate (cgraph_node *src, > else > { > info->size_time_table = vec_safe_copy (info->size_time_table); > - if (info->loop_iterations) > + info->loop_iterations = vec_safe_copy (info->loop_iterations); > + info->loop_strides = vec_safe_copy (info->loop_strides); > + > + ipa_freqcounting_predicate *f; > + for (int i = 0; vec_safe_iterate (info->loop_iterations, i, &f); > + i++) > { > - predicate p = *info->loop_iterations; > - info->loop_iterations = NULL; > - set_hint_predicate (&info->loop_iterations, p); > + predicate p = *f->predicate; > + f->predicate = NULL; > + set_hint_predicate (&f->predicate, p); > } > - if (info->loop_stride) > + for (int i = 0; vec_safe_iterate (info->loop_strides, i, &f); > + i++) > { > - predicate p = *info->loop_stride; > - info->loop_stride = NULL; > - set_hint_predicate (&info->loop_stride, p); > + predicate p = *f->predicate; > + f->predicate = NULL; > + set_hint_predicate (&f->predicate, p); > } 
> } > if (!dst->inlined_to) > @@ -1057,15 +1103,28 @@ ipa_dump_fn_summary (FILE *f, struct > cgraph_node *node) > } > fprintf (f, "\n"); > } > - if (s->loop_iterations) > + ipa_freqcounting_predicate *fcp; > + bool first_fcp = true; > + for (int i = 0; vec_safe_iterate (s->loop_iterations, i, &fcp); i++) > { > - fprintf (f, " loop iterations:"); > - s->loop_iterations->dump (f, s->conds); > + if (first_fcp) > + { > + fprintf (f, " loop iterations:"); > + first_fcp = false; > + } > + fprintf (f, " %3.2f for ", fcp->freq.to_double ()); > + fcp->predicate->dump (f, s->conds); > } > - if (s->loop_stride) > + first_fcp = true; > + for (int i = 0; vec_safe_iterate (s->loop_strides, i, &fcp); i++) > { > - fprintf (f, " loop stride:"); > - s->loop_stride->dump (f, s->conds); > + if (first_fcp) > + { > + fprintf (f, " loop strides:"); > + first_fcp = false; > + } > + fprintf (f, " %3.2f for :", fcp->freq.to_double ()); > + fcp->predicate->dump (f, s->conds); > } > fprintf (f, " calls:\n"); > dump_ipa_call_summary (f, 4, node, s); @@ -2514,12 +2573,13 @@ > analyze_function_body (struct cgraph_node *node, bool early) > > if (fbi.info) > compute_bb_predicates (&fbi, node, info, params_summary); > + const profile_count entry_count = ENTRY_BLOCK_PTR_FOR_FN > + (cfun)->count; > order = XNEWVEC (int, n_basic_blocks_for_fn (cfun)); > nblocks = pre_and_rev_post_order_compute (NULL, order, false); > for (n = 0; n < nblocks; n++) > { > bb = BASIC_BLOCK_FOR_FN (cfun, order[n]); > - freq = bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)- > >count); > + freq = bb->count.to_sreal_scale (entry_count); > if (clobber_only_eh_bb_p (bb)) > { > if (dump_file && (dump_flags & TDF_DETAILS)) @@ -2758,23 > +2818,27 @@ analyze_function_body (struct cgraph_node *node, bool early) > > if (nonconstant_names.exists () && !early) > { > + ipa_fn_summary *s = ipa_fn_summaries->get (node); > class loop *loop; > - predicate loop_iterations = true; > - predicate loop_stride = true; > > if (dump_file 
&& (dump_flags & TDF_DETAILS)) > flow_loops_dump (dump_file, NULL, 0); > scev_initialize (); > FOR_EACH_LOOP (loop, 0) > { > + predicate loop_iterations = true; > + sreal header_freq; > vec<edge> exits; > edge ex; > unsigned int j; > class tree_niter_desc niter_desc; > if (loop->header->aux) > - bb_predicate = *(predicate *) loop->header->aux; > + { > + bb_predicate = *(predicate *) loop->header->aux; > + header_freq = loop->header->count.to_sreal_scale > (entry_count); > + } > else > - bb_predicate = false; > + continue; > > exits = get_loop_exit_edges (loop); > FOR_EACH_VEC_ELT (exits, j, ex) > @@ -2790,10 +2854,10 @@ analyze_function_body (struct cgraph_node > *node, bool early) > will_be_nonconstant = bb_predicate & will_be_nonconstant; > if (will_be_nonconstant != true > && will_be_nonconstant != false) > - /* This is slightly inprecise. We may want to represent each > - loop with independent predicate. */ > loop_iterations &= will_be_nonconstant; > } > + add_freqcounting_predicate (&s->loop_iterations, loop_iterations, > + header_freq); > exits.release (); > } > > @@ -2803,14 +2867,20 @@ analyze_function_body (struct cgraph_node > *node, bool early) > for (loop = loops_for_fn (cfun)->tree_root->inner; > loop != NULL; loop = loop->next) > { > + predicate loop_stride = true; > + sreal bb_freq; > basic_block *body = get_loop_body (loop); > for (unsigned i = 0; i < loop->num_nodes; i++) > { > gimple_stmt_iterator gsi; > if (body[i]->aux) > - bb_predicate = *(predicate *) body[i]->aux; > + { > + bb_predicate = *(predicate *) body[i]->aux; > + bb_freq = body[i]->count.to_sreal_scale (entry_count); > + } > else > - bb_predicate = false; > + continue; > + > for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); > gsi_next (&gsi)) > { > @@ -2839,16 +2909,13 @@ analyze_function_body (struct cgraph_node > *node, bool early) > will_be_nonconstant = bb_predicate & > will_be_nonconstant; > if (will_be_nonconstant != true > && will_be_nonconstant != false) > - /* This is 
slightly inprecise. We may want to represent > - each loop with independent predicate. */ > loop_stride = loop_stride & will_be_nonconstant; > } > } > + add_freqcounting_predicate (&s->loop_strides, loop_stride, > + bb_freq); > free (body); > } > - ipa_fn_summary *s = ipa_fn_summaries->get (node); > - set_hint_predicate (&s->loop_iterations, loop_iterations); > - set_hint_predicate (&s->loop_stride, loop_stride); > scev_finalize (); > } > FOR_ALL_BB_FN (bb, my_function) > @@ -3640,14 +3707,32 @@ ipa_call_context::estimate_size_and_time (int > *ret_size, > if (time > nonspecialized_time) > time = nonspecialized_time; > > + m_loops_with_known_iterations = 0; > + ipa_freqcounting_predicate *fcp; > + for (i = 0; vec_safe_iterate (info->loop_iterations, i, &fcp); i++) > + { > + gcc_assert (fcp->predicate); > + if (!fcp->predicate->evaluate (m_possible_truths)) > + { > + if (ret_hints) > + hints |= INLINE_HINT_loop_iterations; > + m_loops_with_known_iterations += fcp->freq; > + } > + } > + m_loops_with_known_strides = 0; > + for (i = 0; vec_safe_iterate (info->loop_strides, i, &fcp); i++) > + { > + gcc_assert (fcp->predicate); > + if (!fcp->predicate->evaluate (m_possible_truths)) > + { > + if (ret_hints) > + hints |= INLINE_HINT_loop_stride; > + m_loops_with_known_strides += fcp->freq; > + } > + } > + > if (ret_hints) > { > - if (info->loop_iterations > - && !info->loop_iterations->evaluate (m_possible_truths)) > - hints |= INLINE_HINT_loop_iterations; > - if (info->loop_stride > - && !info->loop_stride->evaluate (m_possible_truths)) > - hints |= INLINE_HINT_loop_stride; > if (info->scc_no) > hints |= INLINE_HINT_in_scc; > if (DECL_DECLARED_INLINE_P (m_node->decl)) @@ -3687,7 +3772,9 @@ > estimate_ipcp_clone_size_and_time (struct cgraph_node *node, > vec<ipa_agg_value_set> known_aggs, > int *ret_size, sreal *ret_time, > sreal *ret_nonspec_time, > - ipa_hints *hints) > + ipa_hints *hints, > + sreal *loops_with_known_iterations, > + sreal *loops_with_known_strides) > { > 
clause_t clause, nonspec_clause; > > @@ -3699,6 +3786,8 @@ estimate_ipcp_clone_size_and_time (struct > cgraph_node *node, > known_aggs, vNULL); > ctx.estimate_size_and_time (ret_size, NULL, ret_time, > ret_nonspec_time, hints); > + *loops_with_known_iterations = ctx.m_loops_with_known_iterations; > + *loops_with_known_strides = ctx.m_loops_with_known_strides; > } > > /* Return stack frame offset where frame of NODE is supposed to start > inside @@ -3857,32 +3946,31 @@ remap_edge_summaries (struct > cgraph_edge *inlined_edge, > } > } > > -/* Same as remap_predicate, but set result into hint *HINT. */ > +/* Run remap_after_inlining on each predicate in V. */ > > static void > -remap_hint_predicate (class ipa_fn_summary *info, > - class ipa_node_params *params_summary, > - class ipa_fn_summary *callee_info, > - predicate **hint, > - vec<int> operand_map, > - vec<int> offset_map, > - clause_t possible_truths, > - predicate *toplev_predicate) > -{ > - predicate p; > +remap_freqcounting_predicate (class ipa_fn_summary *info, > + class ipa_node_params *params_summary, > + class ipa_fn_summary *callee_info, > + vec<ipa_freqcounting_predicate, va_gc> *v, > + vec<int> operand_map, > + vec<int> offset_map, > + clause_t possible_truths, > + predicate *toplev_predicate) > > - if (!*hint) > - return; > - p = (*hint)->remap_after_inlining > - (info, params_summary, callee_info, > - operand_map, offset_map, > - possible_truths, *toplev_predicate); > - if (p != false && p != true) > +{ > + ipa_freqcounting_predicate *fcp; > + for (int i = 0; vec_safe_iterate (v, i, &fcp); i++) > { > - if (!*hint) > - set_hint_predicate (hint, p); > - else > - **hint &= p; > + predicate p > + = fcp->predicate->remap_after_inlining (info, params_summary, > + callee_info, operand_map, > + offset_map, possible_truths, > + *toplev_predicate); > + if (p != false && p != true) > + /* FIXME: Is this really supposed to be &= and not a plain > + assignment? 
*/ > + *fcp->predicate &= p; > } > } > > @@ -3992,12 +4080,12 @@ ipa_merge_fn_summary_after_inlining (struct > cgraph_edge *edge) > remap_edge_summaries (edge, edge->callee, info, params_summary, > callee_info, operand_map, > offset_map, clause, &toplev_predicate); > - remap_hint_predicate (info, params_summary, callee_info, > - &callee_info->loop_iterations, > - operand_map, offset_map, clause, > &toplev_predicate); > - remap_hint_predicate (info, params_summary, callee_info, > - &callee_info->loop_stride, > - operand_map, offset_map, clause, > &toplev_predicate); > + remap_freqcounting_predicate (info, params_summary, callee_info, > + info->loop_iterations, operand_map, > + offset_map, clause, &toplev_predicate); > + remap_freqcounting_predicate (info, params_summary, callee_info, > + info->loop_strides, operand_map, > + offset_map, clause, &toplev_predicate); > > HOST_WIDE_INT stack_frame_offset = ipa_get_stack_frame_offset (edge- > >callee); > HOST_WIDE_INT peak = stack_frame_offset + callee_info- > >estimated_stack_size; > @@ -4322,12 +4410,34 @@ inline_read_section (struct lto_file_decl_data > *file_data, const char *data, > info->size_time_table->quick_push (e); > } > > - p.stream_in (&ib); > - if (info) > - set_hint_predicate (&info->loop_iterations, p); > - p.stream_in (&ib); > - if (info) > - set_hint_predicate (&info->loop_stride, p); > + count2 = streamer_read_uhwi (&ib); > + for (j = 0; j < count2; j++) > + { > + p.stream_in (&ib); > + sreal fcp_freq = sreal::stream_in (&ib); > + if (info) > + { > + ipa_freqcounting_predicate fcp; > + fcp.predicate = NULL; > + set_hint_predicate (&fcp.predicate, p); > + fcp.freq = fcp_freq; > + vec_safe_push (info->loop_iterations, fcp); > + } > + } > + count2 = streamer_read_uhwi (&ib); > + for (j = 0; j < count2; j++) > + { > + p.stream_in (&ib); > + sreal fcp_freq = sreal::stream_in (&ib); > + if (info) > + { > + ipa_freqcounting_predicate fcp; > + fcp.predicate = NULL; > + set_hint_predicate (&fcp.predicate, p); > 
+ fcp.freq = fcp_freq; > + vec_safe_push (info->loop_strides, fcp); > + } > + } > for (e = node->callees; e; e = e->next_callee) > read_ipa_call_summary (&ib, e, info != NULL); > for (e = node->indirect_calls; e; e = e->next_callee) @@ -4487,14 > +4597,19 @@ ipa_fn_summary_write (void) > e->exec_predicate.stream_out (ob); > e->nonconst_predicate.stream_out (ob); > } > - if (info->loop_iterations) > - info->loop_iterations->stream_out (ob); > - else > - streamer_write_uhwi (ob, 0); > - if (info->loop_stride) > - info->loop_stride->stream_out (ob); > - else > - streamer_write_uhwi (ob, 0); > + ipa_freqcounting_predicate *fcp; > + streamer_write_uhwi (ob, vec_safe_length (info->loop_iterations)); > + for (i = 0; vec_safe_iterate (info->loop_iterations, i, &fcp); i++) > + { > + fcp->predicate->stream_out (ob); > + fcp->freq.stream_out (ob); > + } > + streamer_write_uhwi (ob, vec_safe_length (info->loop_strides)); > + for (i = 0; vec_safe_iterate (info->loop_strides, i, &fcp); i++) > + { > + fcp->predicate->stream_out (ob); > + fcp->freq.stream_out (ob); > + } > for (edge = cnode->callees; edge; edge = edge->next_callee) > write_ipa_call_summary (ob, edge); > for (edge = cnode->indirect_calls; edge; edge = edge->next_callee) > diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h index > c6ddc9f3199..d8429afdbef 100644 > --- a/gcc/ipa-fnsummary.h > +++ b/gcc/ipa-fnsummary.h > @@ -101,6 +101,19 @@ public: > } > }; > > +/* Structure to capture how frequently some interesting events occur given > a > + particular predicate. The structure is used to estimate how often we > + encounter loops with known iteration count or stride in various > + contexts. */ > + > +struct GTY(()) ipa_freqcounting_predicate { > + /* The described event happens with this frequency... */ > + sreal freq; > + /* ...when this predicate evaluates to false. */ > + class predicate * GTY((skip)) predicate; }; > + > /* Function inlining information. 
*/ > class GTY(()) ipa_fn_summary > { > @@ -112,8 +125,9 @@ public: > inlinable (false), single_caller (false), > fp_expressions (false), estimated_stack_size (false), > time (0), conds (NULL), > - size_time_table (NULL), call_size_time_table (NULL), loop_iterations > (NULL), > - loop_stride (NULL), growth (0), scc_no (0) > + size_time_table (NULL), call_size_time_table (NULL), > + loop_iterations (NULL), loop_strides (NULL), > + growth (0), scc_no (0) > { > } > > @@ -125,13 +139,12 @@ public: > estimated_stack_size (s.estimated_stack_size), > time (s.time), conds (s.conds), size_time_table (s.size_time_table), > call_size_time_table (NULL), > - loop_iterations (s.loop_iterations), loop_stride (s.loop_stride), > + loop_iterations (s.loop_iterations), loop_strides (s.loop_strides), > growth (s.growth), scc_no (s.scc_no) > {} > > /* Default constructor. */ > ~ipa_fn_summary (); > - > /* Information about the function body itself. */ > > /* Minimal size increase after inlining. */ @@ -164,12 +177,10 @@ public: > vec<size_time_entry, va_gc> *size_time_table; > vec<size_time_entry, va_gc> *call_size_time_table; > > - /* Predicate on when some loop in the function becomes to have known > - bounds. */ > - predicate * GTY((skip)) loop_iterations; > - /* Predicate on when some loop in the function becomes to have known > - stride. */ > - predicate * GTY((skip)) loop_stride; > + /* Predicates on when some loops in the function can have known > + bounds. */ vec<ipa_freqcounting_predicate, va_gc> *loop_iterations; > + /* Predicates on when some loops in the function can have known > + strides. */ vec<ipa_freqcounting_predicate, va_gc> *loop_strides; > /* Estimated growth for inlining all copies of the function before start > of small functions inlining. > This value will get out of date as the callers are duplicated, but @@ -316,6 > +327,15 @@ public: > { > return m_node != NULL; > } > + > + > + /* How often loops will have known iterations. 
Calculated in > + estimate_size_and_time. */ > + sreal m_loops_with_known_iterations; > + /* How often loops will have known strides. Calculated in > + estimate_size_and_time. */ > + sreal m_loops_with_known_strides; > + > private: > /* Called function. */ > cgraph_node *m_node; > @@ -353,7 +373,7 @@ void estimate_ipcp_clone_size_and_time (struct > cgraph_node *, > vec<ipa_polymorphic_call_context>, > vec<ipa_agg_value_set>, > int *, sreal *, sreal *, > - ipa_hints *); > + ipa_hints *, sreal *, sreal *); > void ipa_merge_fn_summary_after_inlining (struct cgraph_edge *edge); > void ipa_update_overall_fn_summary (struct cgraph_node *node, bool > reset = true); void compute_fn_summary (struct cgraph_node *, bool); diff - > -git a/gcc/params.opt b/gcc/params.opt index f39e5d1a012..2a5f3d61727 > 100644 > --- a/gcc/params.opt > +++ b/gcc/params.opt > @@ -211,7 +211,7 @@ Common Joined UInteger > Var(param_ipa_cp_single_call_penalty) Init(15) IntegerRan Percentage > penalty functions containing a single call to another function will receive > when they are evaluated for cloning. > > -param=ipa-cp-unit-growth= > -Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(10) Param > Optimization > +Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(80) Param > +Optimization > How much can given compilation unit grow because of the interprocedural > constant propagation (in percent). > > -param=ipa-cp-value-list-size= ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-08-21 9:17 ` Tamar Christina @ 2020-09-08 14:00 ` Martin Jambor 2020-09-11 10:38 ` Tamar Christina 0 siblings, 1 reply; 21+ messages in thread From: Martin Jambor @ 2020-09-08 14:00 UTC (permalink / raw) To: Tamar Christina, Richard Sandiford, luoxhu via Gcc-patches Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc Hi, On Fri, Aug 21 2020, Tamar Christina wrote: >> >> Honza's changes have been motivated to big extent as an enabler for IPA-CP >> heuristics changes to actually speed up 548.exchange2_r. >> >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two >> weeks ago, this week it is 403, but with my WIP (and so far untested) patch >> below it is just 276 seconds - faster than one built with GCC 8 which needs >> 283 seconds. >> >> I'll be interested in knowing if it also works this well on other architectures. >> I have posted the new version of the patch series to the mailing list yesterday and I have also pushed the branch to the FSF repo as refs/users/jamborm/heads/ipa-context_and_exchange-200907 > > Many thanks for working on this! > > I tried this on an AArch64 Neoverse-N1 machine and didn't see any difference. > Do I need any flags for it to work? The patch was applied on top of 656218ab982cc22b826227045826c92743143af1 > I only have access to a fairly old AMD (Seattle) Opteron A1100 which might not support some interesting AArch64 ISA extensions, but I can measure a significant speedup on it (everything with just -Ofast -march=native -mtune=native, no non-default parameters, without LTO, without any inlining options): GCC 10 branch: 915 seconds Master (rev. 995bb851ffe): 989 seconds My branch: 827 seconds (All numbers are 548.exchange2_r reference run times.)
> And I tried 3 runs > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once This is the first time I have seen -fno-inline-functions-called-once used in this context. This seems to indicate we are looking at another problem, at least one that I have not been aware of yet. Can you please upload somewhere the inlining WPA dumps with and without the option? Similarly, I do not need LTO for the speedup on x86_64. The patches in the series should also remove the need for --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80. If you still need them on my branch, could you please again provide me with (WPA, if with LTO) ipa-cp dumps with and without them? > 2) -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once > 3) -mcpu=native -Ofast -fomit-frame-pointer -flto > > First one used to give us the best result, with this patch there's no difference between 1 and 2 (11% regression) and the 3rd one is about 15% on top of that. OK, so the patch did help (but above you wrote it did not?), just not enough to be as fast as some previous revision, and on top of that -fno-inline-functions-called-once helps further but again not enough? If correct, this looks like we need to examine what goes wrong specifically in the case of Neoverse-N1. Thanks, Martin ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-09-08 14:00 ` Martin Jambor @ 2020-09-11 10:38 ` Tamar Christina 2020-09-11 12:45 ` Martin Jambor 0 siblings, 1 reply; 21+ messages in thread From: Tamar Christina @ 2020-09-11 10:38 UTC (permalink / raw) To: Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc Hi Martin, > -----Original Message----- > From: Martin Jambor <mjambor@suse.cz> > Sent: Tuesday, September 8, 2020 3:01 PM > To: Tamar Christina <Tamar.Christina@arm.com>; Richard Sandiford > <Richard.Sandiford@arm.com>; luoxhu via Gcc-patches <gcc- > patches@gcc.gnu.org> > Cc: segher@kernel.crashing.org; luoxhu <luoxhu@linux.ibm.com>; > wschmidt@linux.ibm.com; linkw@gcc.gnu.org; Jan Hubicka > <hubicka@ucw.cz>; dje.gcc@gmail.com > Subject: RE: [PATCH] ipa-inline: Improve growth accumulation for recursive > calls > > Hi, > > On Fri, Aug 21 2020, Tamar Christina wrote: > >> > >> Honza's changes have been motivated to big extent as an enabler for > >> IPA-CP heuristics changes to actually speed up 548.exchange2_r. > >> > >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds > two > >> weeks ago, this week it is 403, but with my WIP (and so far untested) > >> patch below it is just 276 seconds - faster than one built with GCC 8 > >> which needs > >> 283 seconds. > >> > >> I'll be interested in knowing if it also works this well on other architectures. > >> > > I have posted the new version of the patch series to the mailing list > yesterday and I have also pushed the branch to the FSF repo as > refs/users/jamborm/heads/ipa-context_and_exchange-200907 > Thanks! I've pushed it through our CI with a host of different options (see below) > > > > Many thanks for working on this! > > > > I tried this on an AArch64 Neoverse-N1 machine and didn't see any > difference. > > Do I need any flags for it to work? 
The patch was applied on top of > > 656218ab982cc22b826227045826c92743143af1 > > > > I only have access to fairly old AMD (Seattle) Opteron 1100 which might not > support some interesting Aarch64 ISA extensions but I can measure a > significant speedup on it (everything with just -Ofast -march=native - > mtune=native, no non-default parameters, without LTO, without any inlining > options): > > GCC 10 branch: 915 seconds > Master (rev. 995bb851ffe): 989 seconds > My branch: 827 seconds > > (All is 548.exchange_r reference run time.) > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | Compiler | Flags | diff GCC 10 | | | +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | | | | +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer | -44% | | | +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% | | | +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | GCC 11 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% | | | 
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% | | | +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% | | | +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% | | | +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ | Branch | -mcpu=native -Ofast -fomit-frame-pointer | -26% | | | +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ (Hopefully the table shows up correct) It looks like your patch definitely does improve the basic cases. So there's not much difference between lto and non-lto anymore and it's much Better than GCC 10. However it still contains the regression introduced by Honza's changes. > > And I tried 3 runs > > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param > > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > > -fno-inline-functions-called-once > > This is the first time I saw -fno-inline-functions-called-once used in this > context. This seems to indicate we are looking at another problem that at > least I have not known about yet. 
Can you please upload somewhere the > inlining WPA dumps with and without the option? We used it to work around the register allocation issue where inlining some large functions would cause massive spilling. Looks like it still has an effect now, but even with it we're still seeing the 12% regression. Which option is this? -fdump-ipa-cgraph? Thanks, Tamar > > Similarly, I do not need LTO for the speedup on x86_64. > > The patches in the series should also remove the need for --param > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 If you still need them > on my branch, could you please again provide me with (WPA, if with > LTO) ipa-cp dumps with and without them? > > > > 2) -mcpu=native -Ofast -fomit-frame-pointer -flto > > -fno-inline-functions-called-once > > 3) -mcpu=native -Ofast -fomit-frame-pointer -flto > > > > First one used to give us the best result, with this patch there's no > difference between 1 and 2 (11% regression) and the 3rd one is about 15% > on top of that. > > OK, so the patch did help (but above you wrote it did not?) but not enough > to be as fast as some previous revision and on top of that -fno-inline- > functions-called-once further helps but again not enough? > > If correct, this looks like we need to examine what goes wrong specifically in > the case of Neoverse-N1 though. > > Thanks, > > Martin ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-09-11 10:38 ` Tamar Christina @ 2020-09-11 12:45 ` Martin Jambor 2020-09-11 17:36 ` Tamar Christina 0 siblings, 1 reply; 21+ messages in thread From: Martin Jambor @ 2020-09-11 12:45 UTC (permalink / raw) To: Tamar Christina, Richard Sandiford, luoxhu via Gcc-patches Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc Hi, On Fri, Sep 11 2020, Tamar Christina wrote: > Hi Martin, > >> On Fri, Aug 21 2020, Tamar Christina wrote: >> >> >> >> Honza's changes have been motivated to big extent as an enabler for >> >> IPA-CP heuristics changes to actually speed up 548.exchange2_r. >> >> >> >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds >> two >> >> weeks ago, this week it is 403, but with my WIP (and so far untested) >> >> patch below it is just 276 seconds - faster than one built with GCC 8 >> >> which needs >> >> 283 seconds. >> >> >> >> I'll be interested in knowing if it also works this well on other architectures. >> >> >> >> I have posted the new version of the patch series to the mailing list >> yesterday and I have also pushed the branch to the FSF repo as >> refs/users/jamborm/heads/ipa-context_and_exchange-200907 >> > > Thanks! I've pushed it through our CI with a host of different options (see below) > >> > >> > Many thanks for working on this! >> > >> > I tried this on an AArch64 Neoverse-N1 machine and didn't see any >> difference. >> > Do I need any flags for it to work? The patch was applied on top of >> > 656218ab982cc22b826227045826c92743143af1 >> > >> >> I only have access to fairly old AMD (Seattle) Opteron 1100 which might not >> support some interesting Aarch64 ISA extensions but I can measure a >> significant speedup on it (everything with just -Ofast -march=native - >> mtune=native, no non-default parameters, without LTO, without any inlining >> options): >> >> GCC 10 branch: 915 seconds >> Master (rev. 
995bb851ffe): 989 seconds >> My branch: 827 seconds >> >> (All is 548.exchange_r reference run time.) >> > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | Compiler | Flags | diff GCC 10 | | | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | | | | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer | -44% | | | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% | | | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | GCC 11 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% | | | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% | | | > 
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% | | | can you please confirm that the difference between these two is all due to the last option -fno-inline-functions-called-once ? Is LTo necessary? I.e., can you run the benchmark also built with the branch compiler and -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ? > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% | | | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer | -26% | | | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+--+--+ > > (Hopefully the table shows up correct) it does show OK for me, thanks. > > It looks like your patch definitely does improve the basic cases. So there's not much difference between lto and non-lto anymore and it's much > Better than GCC 10. However it still contains the regression introduced by Honza's changes. I assume these are rates, not times, so negative means bad. But do I understand it correctly that you're comparing against GCC 10 with the two parameters set to rather special values? Because your table seems to indicate that even for you, the branch is faster than GCC 10 with just -mcpu=native -Ofast -fomit-frame-pointer. 
So is the problem that the best obtainable run-time, even with obscure options, from the branch is slower than the best obtainable run-time from GCC 10? > >> > And I tried 3 runs >> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param >> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 >> > -fno-inline-functions-called-once >> >> This is the first time I saw -fno-inline-functions-called-once used in this >> context. This seems to indicate we are looking at another problem that at >> least I have not known about yet. Can you please upload somewhere the >> inlining WPA dumps with and without the option? > > We used it to cover up for the register allocation issue where in lining some large > functions would cause massive spilling. Looks like it still has an effect now but even > with it we're still seeing the 12% regression. > > Which option is this? -fdump-ipa-cgraph? -fdump-ipa-inline-details and -fdump-ipa-cp-details. It would be nice if the slowdown was all due to the inliner. But the predictors changes might of course have quite an impact also on other optimizations. Martin ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-09-11 12:45 ` Martin Jambor @ 2020-09-11 17:36 ` Tamar Christina 2020-09-16 8:35 ` Hongyu Wang 2020-10-16 8:46 ` Xionghu Luo 0 siblings, 2 replies; 21+ messages in thread From: Tamar Christina @ 2020-09-11 17:36 UTC (permalink / raw) To: Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches Cc: segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc Hi Martin, > > can you please confirm that the difference between these two is all due to > the last option -fno-inline-functions-called-once ? Is LTo necessary? I.e., can > you run the benchmark also built with the branch compiler and -mcpu=native > -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ? > Done, see below. > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+--+--+ > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto > | -24% | | | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+--+--+ > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer > | -26% | | | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+--+--+ > > > > > (Hopefully the table shows up correct) > > it does show OK for me, thanks. > > > > > It looks like your patch definitely does improve the basic cases. So > > there's not much difference between lto and non-lto anymore and it's > much Better than GCC 10. However it still contains the regression introduced > by Honza's changes. > > I assume these are rates, not times, so negative means bad. 
> But do I understand it correctly that you're comparing against GCC 10 with the two parameters set to rather special values? Because your table seems to indicate that even for you, the branch is faster than GCC 10 with just -mcpu=native -Ofast -fomit-frame-pointer.

Yes, these are indeed rates, and indeed I am comparing against the same options we used to get the fastest rates before, which are the two parameters and the inline flag.

> So is the problem that the best obtainable run-time, even with obscure options, from the branch is slower than the best obtainable run-time from GCC 10?

Yeah, that's the problem: when we compare the two we're still behind.

I've done the additional two runs:

| Compiler | Flags | diff GCC 10 |
|----------|-------|-------------|
| GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | (baseline) |
| GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer | -44% |
| GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% |
| GCC 11 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |
| Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% |
| Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |
| Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
| Branch | -mcpu=native -Ofast -fomit-frame-pointer | -26% |
| Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once | -12% |
| Branch | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once | -11% |

And this confirms that indeed LTO isn't needed and that the branch without any
options is indeed much better than it was on GCC 10 without any options. It also confirms that the only remaining difference is in the -fno-inline-functions-called-once > > > >> > And I tried 3 runs > >> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param > >> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > >> > -fno-inline-functions-called-once > >> > >> This is the first time I saw -fno-inline-functions-called-once used > >> in this context. This seems to indicate we are looking at another > >> problem that at least I have not known about yet. Can you please > >> upload somewhere the inlining WPA dumps with and without the option? > > > > We used it to cover up for the register allocation issue where in > > lining some large functions would cause massive spilling. Looks like > > it still has an effect now but even with it we're still seeing the 12% > regression. > > > > Which option is this? -fdump-ipa-cgraph? > > -fdump-ipa-inline-details and -fdump-ipa-cp-details. I've kicked off the CI runs and will get you the dumps on Monday. Cheers, Tamar > > It would be nice if the slowdown was all due to the inliner. But the predictors > changes might of course have quite an impact also on other optimizations. > > Martin ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-09-11 17:36 ` Tamar Christina @ 2020-09-16 8:35 ` Hongyu Wang 2020-10-16 8:46 ` Xionghu Luo 1 sibling, 0 replies; 21+ messages in thread From: Hongyu Wang @ 2020-09-16 8:35 UTC (permalink / raw) To: Tamar Christina Cc: Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches, segher, luoxhu, wschmidt, linkw, Jan Hubicka, dje.gcc Tamar Christina <Tamar.Christina@arm.com> 于2020年9月12日周六 上午1:39写道: > Hi Martin, > > > > > can you please confirm that the difference between these two is all due > to > > the last option -fno-inline-functions-called-once ? Is LTo necessary? > I.e., can > > you run the benchmark also built with the branch compiler and > -mcpu=native > > -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ? > > > > Done, see below. > > > > > +----------+------------------------------------------------------------------------------ > > > --------------------------------------------------------------------+--------------+--+--+ > > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto > > | -24% | | | > > > > +----------+------------------------------------------------------------------------------ > > > --------------------------------------------------------------------+--------------+--+--+ > > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer > > | -26% | | | > > > > +----------+------------------------------------------------------------------------------ > > > --------------------------------------------------------------------+--------------+--+--+ > > > > > > > > (Hopefully the table shows up correct) > > > > it does show OK for me, thanks. > > > > > > > > It looks like your patch definitely does improve the basic cases. So > > > there's not much difference between lto and non-lto anymore and it's > > much Better than GCC 10. However it still contains the regression > introduced > > by Honza's changes. 
> > > > I assume these are rates, not times, so negative means bad. But do I > > understand it correctly that you're comparing against GCC 10 with the two > > parameters set to rather special values? Because your table seems to > > indicate that even for you, the branch is faster than GCC 10 with just - > > mcpu=native -Ofast -fomit-frame-pointer. > > Yes these are indeed rates, and indeed I am comparing against the same > options > we used to get the fastest rates on before which is the two parameters and > the inline flag. > > > > > So is the problem that the best obtainable run-time, even with obscure > > options, from the branch is slower than the best obtainable run-time from > > GCC 10? > > > > Yeah that's the problem, when we compare the two we're still behind. > > I've done the additional two runs > > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Compiler | Flags > > | diff GCC 10 | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > -fno-inline-functions-called-once | | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer > > | -44% | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto > > | -36% | > > 
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | GCC 11 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > -fno-inline-functions-called-once | -12% | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > | -22% | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > -fno-inline-functions-called-once | -12% | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto > > | -24% | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer > > | -26% | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto > -fno-inline-functions-called-once > | -12% | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native 
-Ofast -fomit-frame-pointer > -fno-inline-functions-called-once > | -11% | > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > > And this confirms that indeed LTO isn't needed and that the branch > without any options is indeed much better than it was on GCC 10 without > any options. > > It also confirms that the only remaining difference is in the > -fno-inline-functions-called-once > > > > > > >> > And I tried 3 runs > > >> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param > > >> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > > >> > -fno-inline-functions-called-once > > >> > > >> This is the first time I saw -fno-inline-functions-called-once used > > >> in this context. This seems to indicate we are looking at another > > >> problem that at least I have not known about yet. Can you please > > >> upload somewhere the inlining WPA dumps with and without the option? > > > > > > We used it to cover up for the register allocation issue where in > > > lining some large functions would cause massive spilling. Looks like > > > it still has an effect now but even with it we're still seeing the 12% > > regression. > > > > > > Which option is this? -fdump-ipa-cgraph? > > > > -fdump-ipa-inline-details and -fdump-ipa-cp-details. > > I've kicked off the CI runs and will get you the dumps on Monday. > > Cheers, > Tamar > > > > > It would be nice if the slowdown was all due to the inliner. But the > predictors > > changes might of course have quite an impact also on other optimizations. > > > > Martin > > Hi Martin, Thanks for your work. 
In case you are interested, here is the exchange2 result for your branch on our Cascade Lake server (based on Tamar's test and our regular configuration):

| Compiler | Flags | single-core diff GCC10 | multi-core diff GCC10 |
|----------|-------|------------------------|-----------------------|
| GCC10.1 | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | - | - |
| GCC10.1 | -march=native -Ofast -funroll-loops | -32% | -37% |
| GCC10.1 | -march=native -Ofast -funroll-loops -flto | -32% | -37% |
| GCC11 | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20% | -13% |
| Branch | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -39% | -28% |
| Branch | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20% | -13% |
| Branch | -march=native -Ofast -funroll-loops -flto | -39% | -28% |
| Branch | -march=native -Ofast -funroll-loops | -41% | -29% |
| Branch | -march=native -Ofast -funroll-loops -flto -fno-inline-functions-called-once | -19% | -13% |
| Branch | -march=native -Ofast -funroll-loops -fno-inline-functions-called-once | -20% | -13% |

For the multi-core tests, the branch provides better performance without the extra IPA options, but there is still a 12% regression compared with GCC10's best score. For single-core, there is about a 7% gap between the branch and GCC10.1.

Regards,
Hongyu Wang ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-09-11 17:36 ` Tamar Christina 2020-09-16 8:35 ` Hongyu Wang @ 2020-10-16 8:46 ` Xionghu Luo 2021-01-21 14:34 ` Tamar Christina 1 sibling, 1 reply; 21+ messages in thread From: Xionghu Luo @ 2020-10-16 8:46 UTC (permalink / raw) To: Tamar Christina, Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches Cc: segher, wschmidt, linkw, Jan Hubicka, dje.gcc On 2020/9/12 01:36, Tamar Christina wrote: > Hi Martin, > >> >> can you please confirm that the difference between these two is all due to >> the last option -fno-inline-functions-called-once ? Is LTo necessary? I.e., can >> you run the benchmark also built with the branch compiler and -mcpu=native >> -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ? >> > > Done, see below. > >>> +----------+------------------------------------------------------------------------------ >> --------------------------------------------------------------------+--------------+--+--+ >>> | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto >> | -24% | | | >>> +----------+------------------------------------------------------------------------------ >> --------------------------------------------------------------------+--------------+--+--+ >>> | Branch | -mcpu=native -Ofast -fomit-frame-pointer >> | -26% | | | >>> +----------+------------------------------------------------------------------------------ >> --------------------------------------------------------------------+--------------+--+--+ >> >>> >>> (Hopefully the table shows up correct) >> >> it does show OK for me, thanks. >> >>> >>> It looks like your patch definitely does improve the basic cases. So >>> there's not much difference between lto and non-lto anymore and it's >> much Better than GCC 10. However it still contains the regression introduced >> by Honza's changes. >> >> I assume these are rates, not times, so negative means bad. 
But do I >> understand it correctly that you're comparing against GCC 10 with the two >> parameters set to rather special values? Because your table seems to >> indicate that even for you, the branch is faster than GCC 10 with just - >> mcpu=native -Ofast -fomit-frame-pointer. > > Yes these are indeed rates, and indeed I am comparing against the same options > we used to get the fastest rates on before which is the two parameters and > the inline flag. > >> >> So is the problem that the best obtainable run-time, even with obscure >> options, from the branch is slower than the best obtainable run-time from >> GCC 10? >> > > Yeah that's the problem, when we compare the two we're still behind. > > I've done the additional two runs > > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Compiler | Flags | diff GCC 10 | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer | -44% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | GCC 11 | -mcpu=native -Ofast 
-fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer | -26% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once | -12% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once | -11% | > +----------+--------------------------------------------------------------------------------------------------------------------------------------------------+--------------+ 
> And this confirms that indeed LTO isn't needed and that the branch without any options is indeed much better than it was on GCC 10 without any options.
>
> It also confirms that the only remaining difference is in the -fno-inline-functions-called-once

If -fno-inline-functions-called-once is added, the recursive function digits_2 won't be inlined: each digits_2 clone is specialized and called only once, so getting the performance back is expected. I guess it is somewhat similar to -fno-inline for this case.

@Jambor @Honza Any progress on this (a --param controlling maximal recursion depth) and on the other regression about LOOP_GUARD_WITH_PREDICTION in PR96825 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96825)? :) I tested the current master FSF code and the regression still exists...

Thanks,
Xionghu ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2020-10-16 8:46 ` Xionghu Luo @ 2021-01-21 14:34 ` Tamar Christina 2021-01-21 15:11 ` Jan Hubicka 0 siblings, 1 reply; 21+ messages in thread From: Tamar Christina @ 2021-01-21 14:34 UTC (permalink / raw) To: Xionghu Luo, Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches Cc: segher, wschmidt, linkw, Jan Hubicka, dje.gcc, James Greenhalgh

Hi All,

James and I have been investigating this regression and have tracked it down to register allocation.

I have created a new PR with our findings, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782, but unfortunately we don't know how to proceed. This does seem like a genuine bug in RA. It looks like some magic threshold has been crossed, but we're having trouble determining what this magic number is.

Any help is appreciated.

Thanks,
Tamar

> -----Original Message-----
> From: Xionghu Luo <luoxhu@linux.ibm.com>
> Sent: Friday, October 16, 2020 9:47 AM
> To: Tamar Christina <Tamar.Christina@arm.com>; Martin Jambor <mjambor@suse.cz>; Richard Sandiford <Richard.Sandiford@arm.com>; luoxhu via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: segher@kernel.crashing.org; wschmidt@linux.ibm.com; linkw@gcc.gnu.org; Jan Hubicka <hubicka@ucw.cz>; dje.gcc@gmail.com
> Subject: Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
>
> On 2020/9/12 01:36, Tamar Christina wrote:
> > Hi Martin,
> >
> >> can you please confirm that the difference between these two is all due to the last option -fno-inline-functions-called-once ? Is LTO necessary? I.e., can you run the benchmark also built with the branch compiler and -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
> >
> > Done, see below.
> > > >>> +----------+-------------------------------------------------------- > >>> +----------+---------------------- > >> --------------------------------------------------------------------+--------------+--+- > -+ > >>> | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto > >> | -24% | | | > >>> +----------+-------------------------------------------------------- > >>> +----------+---------------------- > >> --------------------------------------------------------------------+--------------+--+- > -+ > >>> | Branch | -mcpu=native -Ofast -fomit-frame-pointer > >> | -26% | | | > >>> +----------+-------------------------------------------------------- > >>> +----------+---------------------- > >> --------------------------------------------------------------------+--------------+--+- > -+ > >> > >>> > >>> (Hopefully the table shows up correct) > >> > >> it does show OK for me, thanks. > >> > >>> > >>> It looks like your patch definitely does improve the basic cases. So > >>> there's not much difference between lto and non-lto anymore and it's > >> much Better than GCC 10. However it still contains the regression > >> introduced by Honza's changes. > >> > >> I assume these are rates, not times, so negative means bad. But do I > >> understand it correctly that you're comparing against GCC 10 with the > >> two parameters set to rather special values? Because your table > >> seems to indicate that even for you, the branch is faster than GCC 10 > >> with just - mcpu=native -Ofast -fomit-frame-pointer. > > > > Yes these are indeed rates, and indeed I am comparing against the same > > options we used to get the fastest rates on before which is the two > > parameters and the inline flag. > > > >> > >> So is the problem that the best obtainable run-time, even with > >> obscure options, from the branch is slower than the best obtainable > >> run-time from GCC 10? > >> > > > > Yeah that's the problem, when we compare the two we're still behind. 
> > > > I've done the additional two runs > > > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | Compiler | Flags > | diff GCC 10 | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp- > eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions- > called-once | | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer > | -44% | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | GCC 10 | -mcpu=native -Ofast -fomit-frame-pointer -flto > | -36% | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | GCC 11 | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp- > eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions- > called-once | -12% | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp- > eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% > | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | Branch | 
-mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp- > eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions- > called-once | -12% | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto > | -24% | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer > | -26% | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline- > functions-called-once | -12% | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline- > functions-called-once | -11% | > > +----------+------------------------------------------------------------------------------ > --------------------------------------------------------------------+--------------+ > > > > And this confirms that indeed LTO isn't needed and that the branch > > without any options is indeed much better than it was on GCC 10 without > any options. > > > > It also confirms that the only remaining difference is in the > > -fno-inline-functions-called-once > > If -fno-inline-functions-called-once is added, the recursive call function > digits_2 won't be inlined, as each digits_2 is specialized to clone nodes and > called once only, so performance back is expected, I guess it is somewhat > similar to -fno-inline for this case. 
> > @Jambor @Honza Any progress about this (--param controlling maximal > recursion depth) and the other regression about > LOOP_GUARD_WITH_PREDICTION in > PR96825(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96825) please? :) I > tested the current master FSF code, the regression still exists... > > > Thanks, > Xionghu > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls 2021-01-21 14:34 ` Tamar Christina @ 2021-01-21 15:11 ` Jan Hubicka 0 siblings, 0 replies; 21+ messages in thread From: Jan Hubicka @ 2021-01-21 15:11 UTC (permalink / raw) To: Tamar Christina Cc: Xionghu Luo, Martin Jambor, Richard Sandiford, luoxhu via Gcc-patches, segher, wschmidt, linkw, dje.gcc, James Greenhalgh

> Hi All,
>
> James and I have been investigating this regression and have tracked it down to register allocation.
>
> I have created a new PR with our findings https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 but unfortunately we don't know how to proceed.
>
> This does seem like a genuine bug in RA. It looks like some magic threshold has been crossed, but we're having trouble determining what this magic number is.

Thank you for the analysis - it was on my TODO list for a very long time, but the function is large. I will read it carefully, and let's see if we can come up with something useful.

Honza

> Any help is appreciated.
>
> Thanks,
> Tamar
>
> > -----Original Message-----
> > From: Xionghu Luo <luoxhu@linux.ibm.com>
> > Sent: Friday, October 16, 2020 9:47 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>; Martin Jambor <mjambor@suse.cz>; Richard Sandiford <Richard.Sandiford@arm.com>; luoxhu via Gcc-patches <gcc-patches@gcc.gnu.org>
> > Cc: segher@kernel.crashing.org; wschmidt@linux.ibm.com; linkw@gcc.gnu.org; Jan Hubicka <hubicka@ucw.cz>; dje.gcc@gmail.com
> > Subject: Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
> >
> > On 2020/9/12 01:36, Tamar Christina wrote:
> > > Hi Martin,
> > >
> > >> can you please confirm that the difference between these two is all due to the last option -fno-inline-functions-called-once ? Is LTO necessary? I.e., can you run the benchmark also built with the branch compiler and -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
> > >>
> > >
> > > Done, see below.
> > >
> > >>> +--------+------------------------------------------------+------+
> > >>> | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
> > >>> +--------+------------------------------------------------+------+
> > >>> | Branch | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |
> > >>> +--------+------------------------------------------------+------+
> > >>>
> > >>> (Hopefully the table shows up correctly.)
> > >>
> > >> it does show OK for me, thanks.
> > >>
> > >>> It looks like your patch definitely does improve the basic cases.
> > >>> So there's not much difference between LTO and non-LTO anymore,
> > >>> and it's much better than GCC 10.  However, it still contains the
> > >>> regression introduced by Honza's changes.
> > >>
> > >> I assume these are rates, not times, so negative means bad.  But do I
> > >> understand it correctly that you're comparing against GCC 10 with the
> > >> two parameters set to rather special values?  Because your table
> > >> seems to indicate that even for you, the branch is faster than GCC 10
> > >> with just -mcpu=native -Ofast -fomit-frame-pointer.
> > >
> > > Yes, these are indeed rates, and indeed I am comparing against the
> > > same options we used to get the fastest rates on before, which is
> > > the two parameters and the inline flag.
> > >
> > >> So is the problem that the best obtainable run-time, even with
> > >> obscure options, from the branch is slower than the best obtainable
> > >> run-time from GCC 10?
> > >
> > > Yeah, that's the problem: when we compare the two, we're still behind.
> > >
> > > I've done the additional two runs:
> > >
> > > +----------+------------------------------------------------------------------+-------------+
> > > | Compiler | Flags                                                            | diff GCC 10 |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                   |             |
> > > |          |   --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80  |             |
> > > |          |   -fno-inline-functions-called-once                              |             |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                         | -44%        |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                   | -36%        |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto                   | -12%        |
> > > |          |   --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80  |             |
> > > |          |   -fno-inline-functions-called-once                              |             |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                   | -22%        |
> > > |          |   --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80  |             |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                   | -12%        |
> > > |          |   --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80  |             |
> > > |          |   -fno-inline-functions-called-once                              |             |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                   | -24%        |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                         | -26%        |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                   | -12%        |
> > > |          |   -fno-inline-functions-called-once                              |             |
> > > +----------+------------------------------------------------------------------+-------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                         | -11%        |
> > > |          |   -fno-inline-functions-called-once                              |             |
> > > +----------+------------------------------------------------------------------+-------------+
> > >
> > > And this confirms that indeed LTO isn't needed, and that the branch
> > > without any options is indeed much better than it was on GCC 10
> > > without any options.
> > > It also confirms that the only remaining difference is in
> > > -fno-inline-functions-called-once.
> >
> > If -fno-inline-functions-called-once is added, the recursive call
> > function digits_2 won't be inlined: digits_2 is specialized into
> > clone nodes that are each called only once, so the performance
> > coming back is expected.  I guess it is somewhat similar to
> > -fno-inline for this case.
> >
> > @Jambor @Honza Any progress on this (the --param controlling maximal
> > recursion depth) and on the other regression about
> > LOOP_GUARD_WITH_PREDICTION in PR96825
> > (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96825), please? :)  I
> > tested the current master FSF code, and the regression still exists...
> >
> > Thanks,
> > Xionghu

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2021-01-21 15:11 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-12 10:19 [PATCH] ipa-inline: Improve growth accumulation for recursive calls Xionghu Luo
2020-08-12 17:53 ` Jan Hubicka
2020-08-12 19:03 ` Richard Biener
2020-08-13  0:29 ` Segher Boessenkool
2020-08-13  7:53 ` Jan Hubicka
2020-08-13  3:59 ` Feng Xue OS
2020-08-13  6:51 ` luoxhu
2020-08-13 12:52 ` Jan Hubicka
2020-08-14  9:04 ` luoxhu
2020-08-20  9:31 ` Richard Sandiford
2020-08-20 10:49 ` [PATCH] ipa-inline: Improve growth accumulation for recursive callsg Jan Hubicka
2020-08-20 12:04 ` [PATCH] ipa-inline: Improve growth accumulation for recursive calls Martin Jambor
2020-08-21  9:17 ` Tamar Christina
2020-09-08 14:00 ` Martin Jambor
2020-09-11 10:38 ` Tamar Christina
2020-09-11 12:45 ` Martin Jambor
2020-09-11 17:36 ` Tamar Christina
2020-09-16  8:35 ` Hongyu Wang
2020-10-16  8:46 ` Xionghu Luo
2021-01-21 14:34 ` Tamar Christina
2021-01-21 15:11 ` Jan Hubicka
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).