public inbox for gcc@gcc.gnu.org
* On-Demand range technology [4/5] - Performance results
@ 2019-05-23  1:29 Andrew MacLeod
  2019-05-23 13:29 ` Richard Biener
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew MacLeod @ 2019-05-23  1:29 UTC (permalink / raw)
  To: GCC; +Cc: Jeff Law, Aldy Hernandez

We have done extensive performance analysis to help address concerns 
about the nature of an on-demand model. LLVM made an attempt at 
something similar, but suffered from significant performance issues 
they could not solve with their approach. Our approach is not the same, 
and we have seen no sign of problematic pathological cases.

I have trawled Bugzilla looking for large problematic test cases, and 
have tried a couple of test cases from LLVM which dberlin pointed out 
as problematic for them.  To date, I have found nothing that shows 
any kind of performance problem.  I am more than happy to entertain 
whatever cases y’all might want to throw my way.

For a test of robustness, we have built a complete Fedora distribution 
consisting of 9174 packages with the ranger branch. All but 3 packages 
build successfully and pass the relevant regression tests; 2 of the 
failures appear to be related to RVRP and are still under analysis.

Our primary performance testbase to date has been the compiler itself. 
We compile 242 .ii files from a stage1 compiler build and compare times 
against EVRP. Full VRP is quite a bit slower, and although we do have 
an iterative update infrastructure, comparisons against it aren’t quite 
fair, since we don’t yet perform all the equivalence and bitmask 
operations it does.

Before going into the numbers, I would like to visit a minor issue we 
have with switches. RVRP works from the bottom up, so in order to 
evaluate a range, it begins by getting the constant range for the LHS 
of the branch from the edge. For a conditional this is trivially [1,1] 
or [0,0], depending on whether it is the TRUE or FALSE edge.
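
For instance, given a branch 'if (x_1 > 5)', the TRUE edge carries 
[1,1] for the condition, from which the ranger can derive 
x_1 == [6, +INF] on that edge, while the FALSE edge carries [0,0], 
yielding x_1 == [-INF, 5].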

For a switch, it turns out GCC’s GIMPLE representation has no simple 
way to provide this. As a result, we need to loop over every edge of 
the switch and union together all the cases which share that edge, or, 
for the default edge, intersect out all the other cases. This turns out 
to be *very* time consuming in test cases with very large switches, 
somewhere in the vicinity of O(n^3). Ugh.  So the ranger incurs a fair 
amount of overhead evaluating, and re-evaluating, these constant edge 
ranges.
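
Roughly, the walk looks like this.  It is a simplified sketch, not the 
branch’s actual code; the range class with its union, intersect and 
invert members is a stand-in for the real range API:

   /* Sketch: compute the range implied by taking edge E out of
      switch SW.  Simplified; not the branch's actual code.  */
   static void
   range_on_switch_edge (irange &r, gswitch *sw, edge e)
   {
     tree type = TREE_TYPE (gimple_switch_index (sw));
     bool is_default = (e == gimple_switch_default_edge (cfun, sw));

     /* Default edge: start varying and intersect out every case.
        Case edge: start undefined and union in matching cases.  */
     if (is_default)
       r.set_varying (type);
     else
       r.set_undefined ();

     /* Walk every case label (index 0 is the default, so skip it);
        large switches can have thousands of labels.  */
     for (unsigned i = 1; i < gimple_switch_num_labels (sw); i++)
       {
         tree label = gimple_switch_label (sw, i);
         tree low = CASE_LOW (label);
         tree high = CASE_HIGH (label) ? CASE_HIGH (label) : low;
         basic_block dest = label_to_block (cfun, CASE_LABEL (label));
         irange case_range (low, high);

         if (is_default)
           {
             case_range.invert ();   /* Intersect out this case.  */
             r.intersect (case_range);
           }
         else if (find_edge (gimple_bb (sw), dest) == e)
           r.union_ (case_range);
       }
   }

Called once per outgoing edge, a walk like this alone is 
O(edges * cases), and the union/intersect operations on multi-subrange 
results add roughly another factor, which is presumably where the 
O(n^3)-vicinity behavior comes from.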

There are ways to address this… but for now we will present performance 
numbers under each of the 5 switch-handling configurations listed here:
     1 - Calculating ranges outright from the stock branch.
     2 - Timers adjusted to exclude the switch edge calculation code 
(i.e., pretend the range is available on the edge, as it is for the 
TRUE and FALSE edges).
     3 - Do not process switches.  We spend extra time on switches 
because we always attempt to calculate ranges very precisely, as if we 
had infinite precision.  There is room for trimming outcomes here, but 
we have made no attempt yet.
     4 - Just like EVRP, RVRP currently includes building the 
dominators and calling into SCEV at the beginning of each block to see 
if there are any loop range refinements.  The ranger has no need for 
this to operate, and many passes will not care, so we produce a 4th 
number for RVRP where we don’t build any of the infrastructure it 
doesn’t need.
     5 - RVRP can also run in conditional-only mode.  Rather than 
walking the entire IL trying to resolve every range, it simply looks at 
the last statement of every block and asks whether the branch can be 
folded (see the sketch after this list).  This gets a lot of what VRP 
gets that affects the CFG, and could be utilized either at lower 
optimization levels, or as VRP itself if we can push all the other 
activities it does into appropriate optimizations (such as making CCP 
range-aware).  **NOTE**: This *still* calculates switch ranges, so it 
includes that slowdown.
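
The conditional-only walk is roughly the following; fold_stmt_p is a 
hypothetical name standing in for the ranger’s actual folding query:

   /* Conditional-only mode (sketch): visit just the controlling
      statement of each block and ask whether it folds.  */
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
     {
       gimple *last = last_stmt (bb);
       /* Switch ranges are still computed, hence the NOTE above.  */
       if (!last || (gimple_code (last) != GIMPLE_COND
                     && gimple_code (last) != GIMPLE_SWITCH))
         continue;
       /* Only ranges feeding this one statement get calculated.  */
       if (ranger.fold_stmt_p (last))
         {
           /* ... fold the branch; CFG cleanup removes dead edges.  */
         }
     }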

All times are with a release-configured compiler.  Out of the 242 
files, pretty much across the board in all sets of figures, RVRP was 
faster in about 90% of the cases and slower in the other 10%, resulting 
in the following cumulative totals.

                                              Overall (242 files)
    1 - Raw RVRP                                  22% slower
    2 - No edge calculation (1)                  4.5% slower
    3 - No switch processing (2)                 9.5% faster
    4 - No dominators (3)                         16% faster
    5 - Conditionals (including switches) only     4% faster

These numbers indicate that large switches are the primary cause when 
RVRP is slower.  We have various approaches we could use to address 
this.  Removing the time spent building unnecessary dominators shows a 
significant improvement as well.

We also have the results for the time spent in the passes we converted:

                          Overall (242 files)
    1 - -Wrestrict            19% faster
    2 - -Wprintf              95% faster
    3 - -Walloca               1% slower

-Wrestrict has the dominator walk removed, since it is no longer 
needed, and simply calls into a ranger to get the range.

-Wprintf has had the dominator build removed, as well as the EVRP range 
walk. It really benefits from very few ranges being requested… so the 
on-demand approach is a big win here, since we only calculate what we 
actually need to answer a small number of questions.

-Walloca is a touch slower because we are doing more work.  The 
original pass simply queried the global range information; we replaced 
calls to SSA_NAME_RANGE_INFO() with calls to the ranger to get accurate 
ranges, so this overhead is the amount required to calculate those 
ranges.  The Walloca pass now handles more things (for instance, we fix 
gcc.dg/Walloca-6.c), and we catch more issues on a typical bootstrap.  
For example, we find a possible out-of-bounds VLA in libgomp/task.c.
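
The shape of that change is roughly as follows.  This is a hedged 
sketch: range_of_expr stands in for the branch’s actual query routine, 
and check_alloca_bound is a hypothetical consumer:

   /* Before: only the stored global range (SSA_NAME_RANGE_INFO)
      was visible to the pass.  */
   wide_int min, max;
   if (TREE_CODE (len) == SSA_NAME
       && get_range_info (len, &min, &max) == VR_RANGE)
     check_alloca_bound (min, max);   /* hypothetical helper */

   /* After (sketch): ask the ranger for a range valid at the
      alloca call itself; range_of_expr is a stand-in name.  */
   int_range_max r;
   if (ranger.range_of_expr (r, len, stmt) && !r.varying_p ())
     check_alloca_bound (r.lower_bound (), r.upper_bound ());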

We are also working on integrating the on-demand ranges with the 
backwards threader.  It queries ranges over each potential path to see 
if a collapsible branch can be found, and then uses that information to 
find better threading opportunities than are currently possible.  
Results thus far have been very promising.

These numbers all indicate that this approach is viable versus the 
existing one, and is quite frequently faster. It already has iterative 
back-edge resolution integrated, and we can handle cases that the 
existing VRP and EVRP approaches have difficulty with.

Comments and feedback always welcome!
Thanks
Andrew


* Re: On-Demand range technology [4/5] - Performance results
  2019-05-23  1:29 On-Demand range technology [4/5] - Performance results Andrew MacLeod
@ 2019-05-23 13:29 ` Richard Biener
  2019-05-24 15:51   ` Andrew MacLeod
  0 siblings, 1 reply; 3+ messages in thread
From: Richard Biener @ 2019-05-23 13:29 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

On Thu, May 23, 2019 at 3:29 AM Andrew MacLeod <amacleod@redhat.com> wrote:
> [...]
>
> All times are with a release-configured compiler.  Out of the 242
> files, pretty much across the board in all sets of figures, RVRP was
> faster in about 90% of the cases and slower in the other 10%,
> resulting in the following cumulative totals.
>
>                                               Overall (242 files)
>     1 - Raw RVRP                                  22% slower   (**a**)
>     2 - No edge calculation (1)                  4.5% slower
>     3 - No switch processing (2)                 9.5% faster
>     4 - No dominators (3)                         16% faster   (**b**)
>     5 - Conditionals (including switches) only     4% faster

I don't understand a vs. b here.  Does this mean dominators
cost 38% performance?!

Does 4 imply 3 and 2 maybe?

Dominators cannot be this expensive.


* Re: On-Demand range technology [4/5] - Performance results
  2019-05-23 13:29 ` Richard Biener
@ 2019-05-24 15:51   ` Andrew MacLeod
  0 siblings, 0 replies; 3+ messages in thread
From: Andrew MacLeod @ 2019-05-24 15:51 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez

On 5/23/19 9:29 AM, Richard Biener wrote:
> On Thu, May 23, 2019 at 3:29 AM Andrew MacLeod <amacleod@redhat.com> wrote:
>> All times are with a release-configured compiler.  Out of the 242
>> files, pretty much across the board in all sets of figures, RVRP was
>> faster in about 90% of the cases and slower in the other 10%,
>> resulting in the following cumulative totals.
>>
>>                                               Overall (242 files)
>>     1 - Raw RVRP                                  22% slower   (**a**)
>>     2 - No edge calculation (1)                  4.5% slower
>>     3 - No switch processing (2)                 9.5% faster
>>     4 - No dominators (3)                         16% faster   (**b**)
>>     5 - Conditionals (including switches) only     4% faster
> I don't understand a vs. b here.  Does this mean dominators
> cost 38% performance?!
>
> Does 4 imply 3 and 2 maybe?
>
> Dominators cannot be this expensive.

I believe 4 also has no switch processing, so yes, it implies 3, which 
in turn includes 2.  So we go from 9.5% faster to 16% faster without 
building the dominators.  Probably more the kind of number you were 
expecting :-)

Andrew

