* On-Demand range technology [4/5] - Performance results
@ 2019-05-23 1:29 Andrew MacLeod
2019-05-23 13:29 ` Richard Biener
0 siblings, 1 reply; 3+ messages in thread
From: Andrew MacLeod @ 2019-05-23 1:29 UTC (permalink / raw)
To: GCC; +Cc: Jeff Law, Aldy Hernandez
We have done extensive performance analysis to help address concerns
about the nature of an on-demand model. LLVM made an attempt at
something similar, but suffered from significant performance issues
they could not solve with their approach. This approach is not the same,
and we have seen no sign of pathological cases which are problematic.
I have trolled bugzilla looking for large problematic test cases, and
have tried a couple of test cases from LLVM which dberlin pointed out
they found to be problematic. To date, I have found nothing that shows
any kind of performance problem. I am more than happy to entertain
whatever cases y'all might want to throw my way.
For a test of robustness, we have built a complete Fedora distribution
consisting of 9174 packages with the ranger branch. All packages except
3 build successfully and pass the relevant regression tests. It appears
2 of them are related to RVRP and are still under analysis.
Our primary performance testbase to date has been the compiler itself.
We compile 242 .ii files from a stage1 compiler build, and compare times
to EVRP. VRP is quite a bit slower, and although we do have an iterative
update infrastructure, comparisons aren't quite fair since we don't do
all the equivalence and bitmask operations it does yet.
Before going into the numbers, I would like to visit a minor issue we
have with switches. RVRP works from the bottom up, so in order to
evaluate a range, it begins by getting the constant range for the LHS of
the branch from the edge. For a conditional it is trivially [0,0] or
[1,1] depending on TRUE or FALSE edge.
For a switch, it turns out the GCC GIMPLE representation has no simple way
to figure this out. As a result, we need to loop over every edge in the
switch and then union together all the cases which share that edge, or,
in the default case, intersect out all the other cases. This turns out
to be *very* time consuming in test cases with very large switches,
somewhere in the vicinity of O(n^3). Ugh. So the ranger incurs a fair
amount of overhead trying to evaluate, and re-evaluate, these constant
edges.
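To make the cost concrete, here is a minimal Python sketch (not GCC code; the case-label and edge representation is a hypothetical stand-in for the GIMPLE switch described above) of recovering the range implied by a switch edge: union the case spans sharing the edge, or, for the default edge, subtract every explicit case from the full type range.

```python
def subtract(span, cut):
    """Remove the closed interval `cut` from `span`; returns 0-2 spans."""
    (lo, hi), (clo, chi) = span, cut
    out = []
    if clo > lo:
        out.append((lo, min(hi, clo - 1)))
    if chi < hi:
        out.append((max(lo, chi + 1), hi))
    # Spans entirely swallowed by the cut vanish.
    return [s for s in out if s[0] <= s[1]]

def edge_range(cases, edge, type_range):
    """cases: list of ((low, high), dest_edge) pairs; type_range: (lo, hi).
    Returns the multi-range the switch operand must lie in on `edge`."""
    covered = [span for span, dest in cases if dest == edge]
    if covered:
        # Union every case label that branches to this edge
        # (left unmerged here for brevity).
        return covered
    # Default edge: intersect out every explicit case label.
    remaining = [type_range]
    for span, _ in cases:
        remaining = [p for r in remaining for p in subtract(r, span)]
    return remaining
```

Note that answering this per edge re-walks the entire case list, so repeating it for every edge of a large switch is where the quadratic-and-worse behavior mentioned above creeps in.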
There are ways to address this… but for now we will present performance
numbers for each of the 5 switch configurations listed here:
   1 - Calculating ranges outright from the stock branch.
   2 - Timers adjusted to exclude the switch edge calculation code (i.e.
pretend the range is available on the edge like it is with TRUE and FALSE).
   3 - Do not process switches. We spend extra time on switches
because we always attempt to calculate ranges very precisely as if we
had infinite precision. There is room for trimming outcomes here, but we
have made no attempt yet.
   4 - Just like EVRP, RVRP currently includes building the
dominators and integrating calls into SCEV at the beginning of each
block to see if there are any loop range refinements. The ranger has no
need for this to operate, and many passes will not care. So we produce
a 4th number for RVRP where we don't build any of the infrastructure it
doesn't need.
   5 - RVRP can also run in conditional-only mode. Rather than walking
the entire IL trying to resolve every range, it can simply look at the
last statement of every block and ask whether the branch can be folded.
This gets a lot of what VRP gets that affects the CFG, and could be
utilized at either lower optimization levels, or as VRP if we can push
all the other activities it does into appropriate optimizations (such as
making CCP range aware). **NOTE**: This *still* calculates switch
ranges, so it includes that slowdown.
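The conditional-only mode can be sketched roughly like this (plain Python, with toy block and ranger types standing in for GCC's internals; this illustrates the control flow, not the actual API): only each block's terminator is examined, and only one on-demand range query is made per block.

```python
from dataclasses import dataclass, field

@dataclass
class Stmt:
    kind: str             # e.g. "assign" or "cond"
    condition: str = ""   # name controlling the branch, if kind == "cond"

@dataclass
class Block:
    statements: list = field(default_factory=list)

class ToyRanger:
    """Hypothetical on-demand range oracle: condition name -> (lo, hi)."""
    def __init__(self, known):
        self.known = known
    def range_of(self, cond):
        return self.known.get(cond, (0, 1))   # varying [0,1] by default

def fold_conditionals(blocks, ranger):
    folded = []
    for bb in blocks:
        if not bb.statements:
            continue
        last = bb.statements[-1]          # look only at the terminator
        if last.kind != "cond":
            continue
        r = ranger.range_of(last.condition)   # the sole query per block
        if r == (1, 1):
            folded.append((bb, True))     # branch always taken
        elif r == (0, 0):
            folded.append((bb, False))    # branch never taken
    return folded
```

The intermediate statements are never visited; their ranges are computed only if the terminator's query transitively demands them, which is the essence of the on-demand model.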
All times are with a release configured compiler. Out of the 242 files,
pretty much across the board in all 4 sets of figures, RVRP was faster
in about 90% of the cases, and slower in the other 10%, resulting in the
following cumulative totals.
                     Overall (242 files)
1 - Raw RVRP                                   22% slower
2 - No edge calculation (1)                     4.5% slower
3 - No switch processing (2)                    9.5% faster
4 - No dominators (3)                          16% faster
5 - Conditionals (including switches) only      4% faster
These numbers indicate that large switches are the primary cause when
RVRP is slower. We have various approaches we could use to address
this. Removing the time spent building unnecessary dominators shows a
significant improvement as well.
We also have the results for the time spent in the passes we converted:
             Overall (242 files)
1 - -wrestrict        19% faster
2 - -wprintf          95% faster
3 - -walloca           1% slower
-wrestrict has the dominator walk removed since it is no longer needed,
and simply calls into a ranger to get the range.
-wprintf has had the dominator build removed, as well as the EVRP range
walk. It really benefits from very few ranges being requested… so the
on-demand approach is a big win here, since we only calculate what we
actually need to answer a small number of questions.
-walloca is a touch slower because we are doing more work. The original
pass simply queried for global range information. We replaced calls to
SSA_NAME_RANGE_INFO() with calls to the ranger to get accurate ranges.
So this overhead is the amount required to calculate those ranges. The
Walloca pass now handles more things (for instance, we fix
gcc.dg/Walloca-6.c), and we catch more issues on a typical bootstrap.
For example, we find a possible out-of-bounds VLA in libgomp/task.c.
We are also working on integrating the on-demand ranges with the
backwards threader. It queries ranges over each potential path to see
if a collapsible branch can be found. It then uses that information to
find better threading opportunities than is currently possible. Results
thus far have been very promising.
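A hedged sketch of the kind of path query described here (the constraint and range model is a deliberately tiny stand-in, not the threader's real representation): intersect each edge's constraint along a candidate path, then ask whether the final branch folds to a constant.

```python
def path_collapses(path_constraints, final_test, start=(-10**9, 10**9)):
    """path_constraints: (lo, hi) bounds each edge on the path imposes on
    the controlling name.  final_test: maps a range to True / False / None
    (None = branch does not fold).  Returns the folded verdict, or None."""
    lo, hi = start
    for clo, chi in path_constraints:
        lo, hi = max(lo, clo), min(hi, chi)   # intersect edge constraint
        if lo > hi:
            return None                        # path is unreachable
    return final_test((lo, hi))

# Does `x > 0` fold along a path implying x in [0,100] then x in [5,20]?
gt_zero = lambda r: True if r[0] > 0 else (False if r[1] <= 0 else None)
verdict = path_collapses([(0, 100), (5, 20)], gt_zero)   # True
```

If the verdict is True or False, the threader knows the branch collapses on that path and can duplicate the path to thread past it; a None verdict means the path offers no opportunity.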
These numbers all indicate that this approach is viable compared with
the existing one, and is quite frequently faster. It already has
iterative back-edge resolution integrated, and it can handle cases that
the existing VRP and EVRP approaches have difficulty with.
Comments and feedback always welcome!
Thanks
Andrew
* Re: On-Demand range technology [4/5] - Performance results
2019-05-23 1:29 On-Demand range technology [4/5] - Performance results Andrew MacLeod
@ 2019-05-23 13:29 ` Richard Biener
2019-05-24 15:51 ` Andrew MacLeod
0 siblings, 1 reply; 3+ messages in thread
From: Richard Biener @ 2019-05-23 13:29 UTC (permalink / raw)
To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez
On Thu, May 23, 2019 at 3:29 AM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> We have done extensive performance analysis to help address concerns
> about the nature of an on-demand model. LLVM made an attempt at
> something similar, but suffered from significant performance issues
> they could not solve with their approach. This approach is not the same,
> and we have seen no sign of pathological cases which are problematic.
>
> I have trolled bugzilla looking for large problematic test cases, and
> have tried a couple of test cases from LLVM which dberlin pointed out
> they found to be problematic. To date, I have found nothing that shows
> any kind of performance problem. I am more than happy to entertain
> whatever cases y’all might want to throw my way.
>
> For a test of robustness, we have built a complete Fedora distribution
> consisting of 9174 packages with the ranger branch. All packages except
> 3 build successfully and pass the relevant regression tests. It appears
> 2 of them are related to RVRP and are still under analysis.
>
> Our primary performance testbase to date has been the compiler itself.
> We compile 242 .ii files from a stage1 compiler build, and compare times
> to EVRP. VRP is quite a bit slower, and although we do have an iterative
> update infrastructure, comparisons aren’t quite fair since we don’t do
> all equivalence and bitmask operations it does yet.
>
> Before going into the numbers, I would like to visit a minor issue we
> have with switches. RVRP works from the bottom up, so in order to
> evaluate a range, it begins by getting the constant range for the LHS of
> the branch from the edge. For a conditional it is trivially [0,0] or
> [1,1] depending on TRUE or FALSE edge.
>
> For a switch, it turns out GCC gimple representation has no simple way
> to figure this out. As a result, we need to loop over every edge in the
> switch and then union together all the cases which share that edge, or
> in the default case, intersect out all the other cases. This turns out
> to be *very* time consuming in test cases with very large switches,
> somewhere in the vicinity of O(n^3). Ugh. So the ranger incurs a fair
> amount of overhead trying to evaluate, and re-evaluate these constant
> edges.
>
> There are ways to address this… but for now we will present performance
> numbers with different switch configurations, each of the 5
> configurations listed here:
> 1 - Calculating ranges outright from the stock branch.
> 2 - Timers adjusted to exclude switch edge calculation code (i.e.
> pretend the range is available on the edge like it is with TRUE and FALSE).
> 3 - Do not process switches. We spend extra time on switches
> because we always attempt to calculate ranges very precisely as if we
> had infinite precision. There is room for trimming outcomes here, but we
> have made no attempt yet.
> 4 - Just like EVRP, RVRP currently includes building the
> dominators and integrating calls into SCEV at the beginning of each
> block to see if there are any loop range refinements. The ranger has no
> need for this to operate, and many passes will not care. So we produce
> a 4th number for RVRP where we don’t build any of the infrastructure it
> doesn’t need.
> 5 - RVRP can also run in conditional-only mode. Rather than walking
> the entire IL trying to resolve every range, it can simply look at the
> last statement of every block asking if the branch can be folded. This
> gets a lot of what vrp gets that affects the CFG and could be utilized
> in either lower optimization levels, or as VRP if we can push all the
> other activities it does into appropriate optimizations (such as making
> CCP range aware). **NOTE**: This *still* calculates switch ranges,
> so includes that slow down.
>
> All times are with a release configured compiler. Out of the 242 files,
> pretty much across the board in all 4 sets of figures, RVRP was faster
> in about 90% of the cases, and slower in the other 10%, resulting in the
> following cumulative totals.
>
> Overall (242 files)
> 1 - Raw RVRP 22% slower (**a**)
> 2 - No edge calculation(1) 4.5% slower
> 3 - No switch processing(2) 9.5% faster
> 4 - No dominators(3) 16% faster (**b**)
> 5 - Conditionals (including switches) only 4% faster
I don't understand a vs. b here. Does this mean dominators
cost 38% performance?!
Does 4 imply 3 and 2 maybe?
Dominators cannot be this expensive.
> These numbers indicate that large switches are the primary cause when
> RVRP is slower. We have various approaches we could use to address
> this. Removing the time spent building unnecessary dominators shows a
> significant improvement as well.
>
> We also have the results for the time spent in the passes we converted:
>
> Overall (242 files)
> 1 - -wrestrict 19% faster
> 2 - -wprintf 95% faster
> 3 - -walloca 1% slower
>
> -wrestrict has the dominator walk removed since it is no longer needed,
> and simply calls into a ranger to get the range.
>
> -wprintf has had the dominator build removed, as well as the EVRP range
> walk. It really benefits from very few ranges being requested… so the on
> demand approach is a big win here since we only calculate what we
> actually need to answer a small number of questions.
>
> -walloca is a touch slower because we are doing more work. The original
> pass simply queried for global range information. We replaced calls to
> SSA_NAME_RANGE_INFO() with calls to the ranger to get accurate ranges.
> So this overhead is the amount required to calculate those ranges. The
> Walloca pass now handles more things (for instance we fix
> gcc.dg/Walloca-6.c), and we catch more issues on a typical bootstrap.
> For example, we find a possible out of bounds VLA in libgomp/task.c.
>
> We are also working on integrating the on-demand ranges with the
> backwards threader. It queries ranges over each potential path to see
> if a collapsible branch can be found. It then uses that information to
> find better threading opportunities than is currently possible. Results
> thus far have been very promising.
>
> These numbers are all indicative that this approach is viable versus the
> existing approach, and is quite frequently faster. It already has
> iterative back-edge resolution integrated, and we can get cases that the
> existing VRP and EVRP approach have difficulty with.
>
> Comments and feedback always welcome!
> Thanks
> Andrew
>
* Re: On-Demand range technology [4/5] - Performance results
2019-05-23 13:29 ` Richard Biener
@ 2019-05-24 15:51 ` Andrew MacLeod
0 siblings, 0 replies; 3+ messages in thread
From: Andrew MacLeod @ 2019-05-24 15:51 UTC (permalink / raw)
To: Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez
On 5/23/19 9:29 AM, Richard Biener wrote:
> On Thu, May 23, 2019 at 3:29 AM Andrew MacLeod <amacleod@redhat.com> wrote:
>> All times are with a release configured compiler. Out of the 242 files,
>> pretty much across the board in all 4 sets of figures, RVRP was faster
>> in about 90% of the cases, and slower in the other 10%, resulting in the
>> following cumulative totals.
>>
>> Overall (242 files)
>> 1 - Raw RVRP 22% slower (**a**)
>> 2 - No edge calculation(1) 4.5% slower
>> 3 - No switch processing(2) 9.5% faster
>> 4 - No dominators(3) 16% faster (**b**)
>> 5 - Conditionals (including switches) only 4% faster
> I don't understand a vs. b here. Does this mean dominators
> cost 38% performance?!
>
> Does 4 imply 3 and 2 maybe?
>
> Dominators cannot be this expensive.
>
I believe 4 also has no switch processing, so yes, it implies 3, which
also includes 2. So we go from 10% faster to 16% faster without
building the dominators. Probably more the kind of number you were
expecting :-)
Andrew