Subject: Re: Project Ranger
To: Richard Biener
Cc: GCC Development, Aldy Hernandez, Jeff Law
From: Andrew MacLeod
Date: Fri, 01 Jun 2018 20:38:00 -0000

On 06/01/2018 05:48 AM, Richard Biener wrote:
> On Wed, May 30, 2018 at 1:53 AM Andrew MacLeod wrote:
>
>> This allows queries for a range on an edge, on entry to a block, as an
>> operand on a specific
>> statement, or to calculate the range of the result of a statement.
>> There are no prerequisites to use it; simply create a path ranger and
>> start using the API.  There is even an available function which can be
>> lightly called and doesn't require knowledge of the ranger:

> So what the wiki pages do not show is how contextual information
> is handled (and how that's done w/o dominators as the wiki page
> claims).  That is, path_range_on_edge suggests that range information
> provided by the (all) controlling stmts of that edge are used when
> determining the range for 'name'.

The wiki doesn't talk about it, but the document itself goes into great
detail about how the dynamic process works.

> That's probably true for the 'stmt' variant as well.
>
> This leads me to guess that either the cache has to be
> O(number of SSA uses) size or a simple VRP pass implemented
> using Ranger querying ranges of all SSA ops for each stmt
> will require O(number of SSA uses) times analysis?

The current cache is not particularly efficient; its size is
#ssa_name * #BB, assuming you actually go and look at every basic
block, which many passes do not.  It also only puts a range in there
when it has to.  I'm not sure why it really matters unless we find a
pathological case that kills compile time which cannot be addressed.
This would all fall under performance analysis and tweaking.  It's the
end performance that matters.

When a range request is made for NAME in a block, it requests a range
on each incoming edge, unions those ranges, and caches the result.
This is the range-on-entry cache.  The next time a range request for
NAME is made on entry to that block, it is simply picked up from the
cache.

The request for a range on an edge uses the gori_map summary to decide
whether the source block creates a range for NAME or not.  If the
block does not generate a range, it simply requests the range-on-entry
to that block.
If it does generate a range, then it examines the statements defining
the condition, gets ranges for those, and calculates the range that
would be generated on that edge.  It also requests ranges for any of
those names in the defining chain which originate outside this block,
so it queries all the names which feed other names, and so on.  The
queries end when a statement is found that doesn't generate a useful
range.  Sometimes it has to go hunt quite a way, but usually they are
nearby, and caching the range-on-entry values prevents the lookups
from doing the same work over and over.

Once a range has been calculated, we don't spend much more time
calculating it again.  Any other ssa-name which uses that name also
doesn't need to recalculate it.  For the most part, we walk the def
chains for any given ssa-name and its feeding names once, and future
requests pretty much come from the cache.

So yes, there is also a lot of recursion in here at the moment: calls
for ranges spawning other calls before they can be resolved.  I
suspect sometimes this call chain is deep.  I don't know; we haven't
performed any analysis on it yet.

> One of the advantages of a DOM-style or SSA-with-ASSERT_EXPR
> pass is that the number of evaluations and the size of the cache
> isn't that way "unbound".  On the WIKI page you mention edge
> info isn't cached yet - whatever that means ;)
>
> So I wonder what's the speed for doing

The on-entry caches prevent it from being "unbound", and the merging
of incoming edge calculations performs the equivalent combining of
ranges that DOM-based processing gives you.

  if (x > 10)
    A
  else
    B
  C

Block A has x [11, MAX].  Block B has x [MIN, 10].  In block C, at the
bottom of the diamond, the 2 incoming edges union those 2 ranges back
to [MIN, MAX], and we know nothing about the range of x again.  This
accomplishes the same thing dominators would tell you.
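To make the mechanism above concrete, here is a toy sketch (my own
illustrative model, not the actual ranger classes or API) of the
range-on-entry computation: union the range coming in on each incoming
edge, cache the result per block, and answer repeat queries from the
cache.  The diamond example follows directly from it.

```cpp
#include <algorithm>
#include <climits>
#include <map>
#include <vector>

// Toy model of a single integer sub-range (the real irange holds
// several sub-ranges).
struct range { int lo, hi; };

// Conservatively union two ranges into one covering range.
static range range_union (const range &a, const range &b)
{
  return { std::min (a.lo, b.lo), std::max (a.hi, b.hi) };
}

// Hypothetical sketch of the range-on-entry cache: compute the union
// of the ranges supplied by the incoming edges once, then serve every
// later request for this block from the cache.
struct toy_ranger
{
  std::map<int, range> on_entry_cache;   // block id -> cached range

  range range_on_entry (int bb, const std::vector<range> &incoming)
  {
    auto it = on_entry_cache.find (bb);
    if (it != on_entry_cache.end ())
      return it->second;                 // cache hit: no recomputation

    range r = incoming[0];
    for (size_t i = 1; i < incoming.size (); ++i)
      r = range_union (r, incoming[i]);  // merge at the CFG join

    on_entry_cache[bb] = r;
    return r;
  }
};
```

With the diamond above, the edge from A contributes [11, MAX] and the
edge from B contributes [MIN, 10]; the union at entry to C is
[MIN, MAX], i.e. nothing is known about x anymore.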
The edge info that "isn't cached yet" (and may not ever be) is simply
the 'static' info calculated by evaluating the few statements at the
bottom of the block feeding the condition.  Any values coming into the
block are cached.  In this case it's simply looking at x > 10 and
calculating the range on whichever edge.  One of the original design
elements called for this to be cached, but in practice I don't think
it needs to be.  But we haven't analyzed the performance yet... so it
can only get quicker :-)

> FOR_EACH_BB_FN (bb)
>   FOR_EACH_STMT (stmt)
>     FOR_EACH_SSA_USE (stmt)
>       path_range_on_stmt (..., SSA-USE, stmt);
>
> assuming that it takes into account controlling predicates.
> That's actually what EVRP does (within a DOM walk) given
> it tries to simplify each stmt based on range info of the ops.

  FOR_EACH_BB_FN (bb)
    FOR_EACH_STMT (stmt)
      path_range_on_stmt (..., stmt);

would be sufficient.  In order to calculate the range for the
definition of the statement, the Ranger will query the range of each
SSA operand.

I had Aldy run his cumulative .ii times with a patch added to EVRP
which adds a ranger and makes a call to path_range_stmt on each PHI
and every stmt as evrp processes it:

  evrp original time:  2025055 usec
  evrp + ranger:       3787678 usec
  -------
  ranger time:         1762623 usec

To be fair, we aren't actually doing anything and evrp is, but the
actual folding of statements is not a big time killer, and that is the
only additional thing EVRP is doing.  This indicates that for a
full-on stmt-by-stmt walk and process, we're certainly in the same
time ballpark.

A significant point is that the stmt-by-stmt walk is also not the
intended target.  That's old school now :-)  Many clients only care
about a few ranges.  The "quick and dirty" RVRP pass only looks at the
last condition in the block and catches the vast majority of what the
stmt-by-stmt walk accomplishes.
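As a rough illustration of what "only look at the last condition in
the block" buys you, here is a toy sketch (my own model, not the RVRP
implementation): given the range of an operand on entry to the block,
decide whether the controlling condition is already statically decided
and the branch can be folded.

```cpp
#include <climits>

// Toy single sub-range, as elsewhere in this discussion.
struct range { int lo, hi; };

enum branch_fold { FOLD_NONE, FOLD_TRUE, FOLD_FALSE };

// For a controlling condition 'x > c', with 'entry' being x's range on
// entry to the block, report whether the branch folds.  This is the
// kind of check a condition-only VRP walk would perform at the bottom
// of each block, skipping all the arithmetic statements that can never
// fold anyway.
static branch_fold fold_gt (const range &entry, int c)
{
  if (entry.lo > c)
    return FOLD_TRUE;    // every possible value satisfies x > c
  if (entry.hi <= c)
    return FOLD_FALSE;   // no possible value satisfies x > c
  return FOLD_NONE;      // both edges remain reachable
}
```

For example, if the on-entry range of x is [11, MAX], the condition
x > 10 from the earlier diamond folds to true and the false edge
becomes unreachable.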
If timings for the RVRP pass are any indication, I would say it
walks/calculates 60% of the statements in the program to calculate
flow conditions.  Processing a lot of arithmetic statements that can
never result in a fold seems kinda pointless; leave those to constant
prop (which could use the ranger as a client).  The extra stmt-by-stmt
opportunities found mostly appear to be PHIs that can be folded and
the odd condition that doesn't feed a branch.

EVRP also *has* to walk each statement because that is the model it
uses to find ranges: build them from the top down.  The ranger model
does not have this need.  If uses of the ranger were more prolific in
other passes, I think a lot of the things the existing VRP passes get
on these statements would be caught in the normal processing the other
passes do.  We did put a stmt-by-stmt RVRP on the todo list, mostly so
we could get a better apples-to-apples time comparison and to make
sure we would eventually have full coverage of VRP's abilities.

I would also note again that we have made no attempt to optimize
things yet.  We don't really know how much extra work we're doing that
doesn't need to be done.  We just know the current speed is a pretty
good starting point, and we're seeing big improvements in the passes
that are changed to use it.  We will get to finding the inefficiencies
and try to address them.

> More generally my remarks are still the same as with the irange
> introduction attempt.
>
> It's a no-no to add yet another set of range op primitives given we
> already have them implemented (with trees, yes...) in tree-vrp.c.
> The existing ones are already factored out to some extent so I expect
> you will continue that and split out the wide-int cases from those
> existing routines.  Yes - that will not be very C++-ish - but who
> cares.
> Simply implement C-style workers, for example be inspired by the
> tree-ssa-ccp.c:bit_value_* routines implementing a kind of
> non-zero-bits operation primitives ontop of wide_ints (and also
> wrapping those with tree variants).

Sharing code that does the same work was covered as something we will
get to.  Aldy is working on that so there is no dual maintenance.  I
had started it with the wide-int-aux.[ch] file in the branch but never
got very far.  Yeah, it's not super pretty, but it is what it is.

It won't be perfect, however, as many of those routines are written
with the assumption that ranges are either a single range or an
anti-range.  We no longer have that restriction, and we will diverge
in those cases.  Extricating the code from the symbolic handling is
also not straightforward sometimes, as there are big switches which
mix and match various bits of each case.  We will, however, try to
distill it down to the wide-int aspects that are common.  I would also
argue that long term I see value_range disappearing completely... and
then there won't be any dual maintenance :-)

> For most of your on-demand uses I would expect a _much_ simpler and
> less heavy-weight approach to be extending the recently merged
> determine_value_range (I've pointed you to this multiple times in the
> past) to walk SSA def-use chains using the existing SSA on-the-side
> info as cache (yes, w/o use-granular contextual information).

I think what we have is pretty lightweight and more powerful than that
would be.  I started from scratch so we could get all the ideas in one
place working together and not suffer from limits imposed by existing
assumptions or information not easily available.  I think our results
confirm those benefits.  It's fast, easy to use, and it's already
starting to get things EVRP does not get.  It is also showing signs of
finding things VRP itself has trouble with.  When we handle
equivalencies I would expect we will exceed current capabilities.
> I also think that purely restricting yourself to wide-ints is a
> mistake - we do very well want to have range information for
> REAL_CSTs (or composite integers / reals).  How do you think of
> extending Ranger to handle different kinds of types?  Eric already
> raised the issue of symbolic ranges.

This is in fact one of the main reasons we wrote everything new,
including a distinct range class.  We expected to want to make this
work on other types.  Irange and range-ops is a class to handle
integer ranges.  We have a very well defined API for ranges now, and
the range engine itself simply uses this API.  If we write a class
'real_range' to manipulate and store real ranges, and implement
'real_range_ops' to teach the compiler how real ranges are extracted
from expressions, the entire ranger engine will work on reals.

There are 2 options to proceed with this:

a) We'd either template the ranger on the range class instead of using
class irange directly, or at least extract out the common code and
template that.  Then we'd have the same on-demand ranger for "real"
ranges, or "complex" or whatever.  So instead of invoking it with
'path_ranger r', it'd be something like 'path_ranger<irange> r' or
'path_ranger<real_range> r'.  I like that less because I hate
templates, but it is an option.  The more likely case I planned for
is:

b) Adjust class irange to derive from a base range class with the API
as virtual functions, and inherit irange and real_range from that for
the ranger to call.  Then it would handle both simultaneously.  We'd
have some tweaking to do where the ranger currently checks the type of
things to be integral, but nothing too serious.  The entire structure
of the range engine will still apply, and we could process both
simultaneously.

The second model was also intended from the beginning to allow us to
adjust the number of sub-ranges dynamically.  Right now we set it to 3
sub-ranges at compile time, but the intention is to allow this to vary
if we want.
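Option (b) might look roughly like the following toy sketch.  The
class and member names here (vrange, toy_irange, toy_real_range,
union_, singleton_p) are illustrative assumptions for this email, not
the branch's actual declarations; the point is only that the engine
can drive both representations through one virtual API.

```cpp
#include <algorithm>

// Hypothetical base class: the range API the ranger engine would call,
// expressed as virtual functions.
class vrange
{
public:
  virtual ~vrange () {}
  virtual void union_ (const vrange &r) = 0;   // in-place union
  virtual bool singleton_p () const = 0;       // exactly one value?
};

// Integer ranges: roughly what irange provides today, one sub-range.
class toy_irange : public vrange
{
public:
  long lo, hi;
  toy_irange (long l, long h) : lo (l), hi (h) {}
  void union_ (const vrange &r) override
  {
    const toy_irange &o = static_cast<const toy_irange &> (r);
    lo = std::min (lo, o.lo);
    hi = std::max (hi, o.hi);
  }
  bool singleton_p () const override { return lo == hi; }
};

// Real ranges: same interface, different representation.  The engine
// would not need to know which one it is manipulating.
class toy_real_range : public vrange
{
public:
  double lo, hi;
  toy_real_range (double l, double h) : lo (l), hi (h) {}
  void union_ (const vrange &r) override
  {
    const toy_real_range &o = static_cast<const toy_real_range &> (r);
    lo = std::min (lo, o.lo);
    hi = std::max (hi, o.hi);
  }
  bool singleton_p () const override { return lo == hi; }
};
```

Engine code written against vrange & would then process integer and
real ranges with the same control flow, which is the appeal of this
model over the template variant.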
There are cases where some optimization might be willing to spend
extra time/memory in order to get really accurate range info.  Maybe
switch analysis could use an irange which allowed up to 1000
sub-ranges so it could very precisely map what the value is at any
given point.  The general case doesn't need that extra memory usage,
however; it just gives us flexibility we don't currently have.  We
also expect to do some tests and find out what the "optimum" default
number of ranges is.  I chose 3 because it was handling cases I wanted
that we couldn't get before.  Maybe a different number is better; we'd
have to run some experiments to see how often we could use more
ranges.

> Yes, I do like having wide-int workers for ranges.
>
> Yes, I do like the idea of making ranges not restricted to single
> sub-ranges (didn't review the implementation).
>
> But those things should have been done on the existing code, not
> sticking sth completely new into GCC that cannot subsume the existing
> stuff.  That's the road to maintenance hell which we unfortunately
> walk way too often with GCC :/

I argue that this will subsume all the existing code related to it.
Sure, we can't do that right this moment; it's not a small task.  I
think our pathway out of maintenance hell involves removing the
existing value_range code and the various incarnations of VRP we have,
and using something modern designed from scratch to solve these
problems.  I would also predict we could do it in the release after
this next one.  So we'd have one release with the ranger and the VRP
brothers both present, and after that it would be just the Ranger.

> Btw, the original goal of EVRP was to get rid of the
> ASSERT_EXPR-style VRP pass once the threading bits it has are
> subsumed by the backwards threader.  Does ranger at least allow that
> - get rid of the threading inside VRP without regressions?

The ranger calculates ranges.  It does not implement a complete VRP
replacement.  We will implement a full VRP in time.
One of the original goals was to get this working so we can remove all
the micro-optimizations that VRP does because range info isn't
available anywhere else.  The clean goal is that all the threading
related range work should be done in threading via calls to the
ranger.

So, to more directly answer the question: I do not know the state
today.  I'm not the threader dude :-)  Aldy enhanced the backward
threader to do some things we couldn't do before, and converted those
other 3 passes.  I think Jeff has on his plate to remove the threading
parts from VRP and put them back in the threader, so he can clarify
this.  By the end of the stage I would expect that threader work to be
complete and out of VRP.

> Just as a final note - open source development doesn't work like
> this, citing you "I'd like to introduce a project we've been working
> on for the past year and a half." - this simply provokes "reviews"
> like the above.

Well, it is what it is.  I see nothing wrong with the "review"; quite
frankly, it's pretty much as expected.  I think it would have been
almost identical if I'd talked about it last year.

And why can't it work like this?  We aren't suggesting we shove this
into trunk right now.  We've been experimenting with the technology,
and now we think it's developed to the point where it can be of
benefit, and we can demonstrate it works.  It's on a branch, and if
anyone else wants to help, awesome.  We have a to-do list.  It'll
probably be months before we're even ready to be considered for a
merge.  It has now advanced to the point where we have tangible things
to talk about for the first time.

Sure, we could replace irange with value_range, and range-ops with
some serious reconfiguration of the range extraction code, and the
ranger would still work; it just needs the same range API.  I happen
to think irange and range-ops offer benefits that value_range can't,
and so to do that would be a step backwards.
I chose to create irange so we had the flexibility to enhance its
abilities in the future while offering some improvements now.
Range-ops was a desire to unite the range processing code in one
place, and I needed to add the ability to do algebraic rearrangements
that we currently don't do in VRP... the op1_irange and op2_irange
code that derives ranges for operands given a LHS and the other
operand.  Yeah, it's not in one place yet, but it will be.

So at what point do we decide community involvement is appropriate?
As soon as I get the idea?  Maybe after I had the third prototype
sort-of working, indicating the approach was feasible?  It was pretty
ugly then; I wouldn't want anyone to have seen that :-)  All the exact
same questions would come up then as now: we need to handle symbolics;
too expensive; reuse the existing range code.  Rather than waving my
hands to answer and not really accomplishing anything, I can now point
to tangible things (mostly :-).  I would have still pushed for irange
and range-ops because I think they are the right strategic solution.
Maybe someone else could have helped advance things a little faster,
but I was a pretty big bottleneck on design and reworking things; ask
Aldy.

The only thing that we ever had working that was even worth
considering submitting was the irange work last July, and that did not
get very far.  We could also have waited another year until we have a
complete VRP replacement working.  Then it's a drop-in solution that
removes a lot of the arguments about code sharing and not re-using
value_range.  That might have been even easier!  I do think it's
reached the point where there is a lot of benefit to be derived now,
and so here it is; let's figure out what we can do with it.

Anyway, the "lateness" of the introduction is purely on me.  I like to
get things working before talking about them, especially when they
start with half-baked ideas.
I think this is a good strategic direction to go; it demonstrates
benefit now, and has a path to a better and easier to maintain VRP
with a compiler-wide range resource.

Andrew