From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Date: Mon, 27 May 2019 13:03:00 -0000
Subject: Re: On-Demand range technology [2/5] - Major Components : How it works
To: Andrew MacLeod
Cc: GCC, Jeff Law, Aldy Hernandez
X-SW-Source: 2019-05/txt/msg00236.txt.bz2

On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod wrote:
>
> On 5/23/19 8:55 AM, Richard Biener wrote:
> > On Thu, May 23, 2019 at 3:28 AM Andrew MacLeod wrote:
> >>
> >> 2 * GORI
> >> ------------
> >> The second component is the "Generates Outgoing Range Info" engine.
> >> This is a basic-block oriented component which determines what ssa-names
> >> have ranges created on outgoing edges, and how to calculate those ranges.
> >>
> >> The last statement in the block is examined to see if it is a branch
> >> which generates range info, and then determines which ssa-names in the
> >> block can have ranges calculated. It quickly walks the use-def chain
> >> from the branch to find other definitions within the block that can
> >> impact the branch and could also have their ranges calculated. This
> >> summary is then cached for future use.
> >>
> >> The GORI component also uses this summary info to perform the
> >> calculations necessary to determine the outgoing range for any ssa-name
> >> which can be determined. For example:
> >>      c_3 = foo ()
> >>      a_2 = b_1 - 20
> >>      if (a_2 < 40)
> >> The summary information would indicate that b_1 and a_2 can have their
> >> outgoing ranges calculated for this block, and uses the cached
> >> information to quickly calculate them when required.
> >> The API consists of 2 basic methods, query and calculate:
> >>    - bool has_edge_range_p (edge, name);
> >>    - range outgoing_edge_range (edge, name);
> >> If a query is made for any other ssa-name, it simply returns false and
> >> says this block does not generate a range for that name. This is a key
> >> rationale for the summary: we only spend time processing names for
> >> blocks that actually have ranges generated.
> > So the current code (in a few cases) derives ranges for SSA uses when
> > they see for example
> >
> >    _3 = _4 / _5;
> >
> > here _5 != 0 for the stmts following this one (OK, this is a bad example
> > because for divisions we don't do this, but we do it for derefs and ~[0,0]).
> >
> > The above suggests that iff this is done at all it is not in GORI because
> > those are not conditional stmts or ranges from feeding those. The
> > machinery doing the use-def walking from stmt context also cannot
> > come along these so I have the suspicion that Ranger cannot handle
> > telling us that for the stmt following above, for example
> >
> >    if (_5 != 0)
> >
> > that _5 is not zero?
> >
> > Can you clarify?
>
> So there are 2 aspects to this. The range-ops code for DIV_EXPR, if
> asked for the range of op2 (), would return ~[0,0] for _5.
> But you are also correct in that the walk backwards would not find this.
>
> This is similar functionality to how null derefs are currently handled,
> and in fact could probably be done simultaneously using the same code
> base. I didn't bring null derefs up, but this is a good time :-)
>
> There is a separate class used by the gori-cache which tracks the
> non-nullness property at the block level. It has a single API:
> non_null_deref_p (name, bb), which determines whether there is a
> dereference of NAME in any BB, which indicates whether the range has an
> implicit ~[0,0] range in that basic block or not.

So when we then have

 _1 = *_2;     // after this _2 is non-NULL
 _3 = _1 + 1;  // _3 is non-NULL
 _4 = *_3;
 ...

when an on-demand user asks whether _3 is non-NULL at the point of
_4 = *_3 we don't have this information? Since the per-BB caching
will only say _1 is non-NULL after the BB. I'm also not sure whether
_3 ever gets non-NULL during non-NULL processing of the block since
walking immediate uses doesn't really help here?

So this seems to be a fundamental limitation [to the caching scheme],
not sure if it is bad in practice. Or am I missing something?

> It is implemented by caching a bitmap which has a 1 set for any block
> which contains a deref of that name, so after the first call, it is
> simply a matter of returning that bit.
> The first time the API is
> called, it does an immediate-use walk for NAME and sets the bit for any
> block which contains a statement for which we can infer a non-null result.
>
> We'd want the exact same functionality for divide by zero: we can infer
> ~[0,0] from the DIV_EXPR for the block that statement is in. The call is
> already made; it just returns false right now if it isn't a pointer type.
>
> I'll add that and make sure it works.
>
> >> 3 * GORI cache
> >> ---------------------
> >> The third component builds on the basic block GORI component, and adds
> >> the ability to traverse the CFG and combine the various ranges provided
> >> by each basic block into a cache.
> >>
> >> It provides both a global-range cache and a range-on-entry cache
> >> summarizing the range of an ssa-name on entry to the block.
> >>
> >> The range-on-entry cache is self-filling, and iteratively evaluates back
> >> edges. There is a single API entry point which asks for the range on
> >> entry, and if the data has not been calculated/cached already, it spawns
> >> the following process to calculate the data:
> >>      * Walk the CFG from the current block to the definition block,
> >> and/or any block which has range-on-entry information already
> >> calculated. These end points are the source blocks.
> >>      * Any non-source blocks visited are added to a worklist to be updated.
> >>      * Starting with the source blocks, push the known outgoing range
> >> from each block to each successor and update their live-on-entry values
> >> when they are visited. If the incoming range changes, mark this block's
> >> successors as requiring updating.
> >>      * Continue iterating until no more updates are required.
> >>
> >> The ordering is planned by the ranger such that if there are no back
> >> edges, it is typically a straightforward single pass. When back edges
> >> are involved, the amount of iterating is typically very small.
> >> The updating is restricted to a single ssa-name, meaning it doesn't get into
> >> feedback loops with other names nor PHIs. It usually converges very
> >> quickly.
> >>
> >> It is important to note that this works exclusively with static ranges
> >> of only a single ssa-name at a time, i.e., ranges which are implicitly
> >> exposed in the IL, and only the name being examined. The values returned by
> >> these queries are not dependent on changes in other ssa-names, which is
> >> how the iteration process never gets out of control.
> >>
> >> The ranger making the calls to fill this cache has a higher level
> >> overview, and requests these ranges in definition order such that any
> >> ssa-names feeding the definition of a name having its cache filled are
> >> resolved first, providing the best possible results the first time.
> >>      a_2 = b_1 - 20
> >>      if (a_2 < 40)
> >> If the range for a_2 is requested on the true side, the ranger will
> >> first calculate the range of b_1 on entry to the block, then use this to
> >> calculate the global range of a_2, and finally the outgoing range on
> >> the desired edge.
> >>
> >> If at some later point it is discovered that the incoming range of b_1
> >> has changed in such a way that it has an impact on the outgoing range of
> >> a_2, the iterative update process can be reapplied by the ranger to
> >> update the relevant cache entries. This is usually only required in
> >> cases where multiple ssa-names are affected by back edges and feed each
> >> other.
> >>
> >> 4 * Ranger
> >> ----------------
> >> The final component is the Ranger which provides a simple API to clients
> >> where they can make various simple requests:
> >>     - Range of an ssa-name at a statement location
> >>     - Range of the result of a stmt
> >>     - Range of an ssa-name on an edge
> >>     - Range of an ssa-name after the last statement in a block
> >>     - Range of an ssa-name on entry to a block
> >>
> >> The ranger is responsible for processing the IL, walking use/def chains
> >> and coordinating the various calls into the GORI components to
> >> pre-satisfy as many conditions as possible before any cache filling is
> >> performed. It is also responsible for triggering any additional updates
> >> which may be required due to newly discovered ranges. We in fact don't
> >> do this yet, but it is rarely required as it turns out.
> >>
> >> Summary
> >> ---------------
> >> All evaluations are done on-demand. If no queries are made, there is no
> >> cost. The Ranger has no prerequisites other than a valid IL and CFG. It
> >> does not need dominators, nor does it require any other changes within
> >> an optimization pass in order to be used.
> > Does it need fully up-to-date SSA form or does it only rely on appropriate
> > SSA uses on stmts in the use-def chain of the SSA name we query
> > range-info for? Does it use immediate uses or stmt operands at all
> > or just use-def chains?
>
> The range-ops component works purely on stmt operands.
> The gori component which calculates ranges coming out of the block
> requires the SSA_NAME_DEF_STMT to be correct in order to look at the
> previous stmt in the def chain and to determine if it is in the same
> basic block or not.
> Immediate uses are used for the non-null/zero property.
> If they are not
> up to date, this information would be stale unless it were
> calculated up front.
>
> > I'm asking because some passes leave parts of the CFG not updated
> > while working on other parts to reduce the number of update_ssa calls
> > (ideally to a single one).
>
> Due to the on-demand nature, any changes made would be reflected
> immediately in any lookup, for better or worse. At the moment, we don't
> encourage changing the CFG while working with ranges. Any CFG changes
> made may invalidate some of the caches, or require them to grow... and
> we'd need to hook into the CFG machinery in order to make sure we either
> look at, or clear, any affected caches.
>
> It's clearly safer to queue up the changes in that sort of environment,
> but we'd have to identify which kinds of operations are not safe. Our
> implementation of RVRP does everything on the fly... simplifying and
> eliminating statements as it goes. To the best of my knowledge we are
> not currently using the on-demand engine in places where we are taking
> some things out of SSA and then putting them back in, so I have no
> practical experience with that. Aldy can correct me if he's doing
> anything in the threader, but I doubt it.
>
> So other than the non-null/zero property, we don't need much SSA
> infrastructure, just "correct" IL/CFG.
>
> >> When queries are made, only the minimum effort required is made to
> >> satisfy the request. This is a big win when it comes to passes which
> >> only need occasional range information, such as warning passes. Some of
> >> these passes actually instantiate a new ranger each time they make a
> >> request, as a demonstration of the low overhead.
> >>
> >> As for passes such as VRP which require complete range information, the
> >> caching process prevents the same information from being calculated more
> >> than once.
> >> This means a similar amount of work is done with this
> >> approach as with the traditional top-down approach currently being
> >> used, except we also process the back edges and iteratively solve for them.
> > What's the storage requirement for the cache when doing VRP-pass-like
> > processing? What's the compile-time complexity? I read you are saying
> > "similar amount of work" but I guess that's from empirical data comparing
> > to the existing VRP implementations (SSA VRP and EVRP) rather than
> > theoretical bounds?
>
> Yes, compile-time complexity is from empirical speed timings and
> theory-crafting from the algorithms, and from the fact that the on-entry
> cache prevents multiple passes over things.
>
> We have not done a memory analysis yet, nor done anything to see if we
> can improve it.
> It makes very heavy use of bitmaps, which are typically quite sparse.
> The on-entry cache is a vector of pointers to a range, initially 0, and
> we allocate ranges as needed. There will be an on-entry range entry
> for any ssa-name which has been queried between the query point and the
> definition.

So that's similar to LIVE-on-entry, thus an SSA name with a range will
have an on-entry range entry on each BB it dominates? That makes the
storage requirement quadratic in the worst case.

> My belief is this is typically not large, but we have done
> no analysis as yet. This does mean that once an ssa-name goes
> "out of scope", i.e. has no more uses in the program, it will not be in
> the cache for any of those blocks, simply because it's never queried
> past its end of life.
>
> > Not sure where to put it so let me put it here: I'd like to see the SSA
> > based VRP pass die (in fact I'd like all ssa-propagator based
> > passes to die...).
> > EVRP is the future here and I can see EVRP making use of the
> > reverse part of Range-Ops to replace the ad-hoc pattern matching
> > in register_edge_assert_for (badly named function that now gets you
> > back a list of assertions that hold on an edge).
> > So I hope to see VRP

> Does register_edge_assert cache anything? Or walk an assert chain? Or is
> it basically just a "range-on-this-edge" call, with the result combined
> with what is known at that point?

register_edge_assert pattern-matches the stmts feeding a conditional
stmt on an edge. It doesn't cache anything (so does quite some redundant
work when processing all outgoing edges). It's not intended to be called
multiple times but rather when the DOM walker enters a block and we look
at the controlling statement to derive conditional ranges. For example
for

 _1 = (unsigned) _2;
 if (_1 > 1)
   ...

it derives a vector of two relations on the true edge:

 _1 > 1 == true
 (unsigned) _2 > 1 == true

this is actually what we end up building ASSERT_EXPRs for in traditional
VRP and what we extract ranges from, valid only in a dominated context,
in EVRP.

> > yanked (it performing jump threading is the major obstacle) and
> > Range-Ops being used as common infrastructure for ranges (that leaves
> > EVRP). I don't really want something "additional" leaving things for
>
> Although EVRP can be modified to use the range-ops work, and we can
> probably make VRP go away, I'm not convinced that EVRP has to be the
> future. It's merely an alternate implementation that has its own set of
> benefits and drawbacks.
>
> > other people to cleanup / merge - honestly you don't have the very best
> > track record in this area (it might be that your main area of work is
> > not GCC, still the @redhat tag on your mail is an obligation here).
>
> That's a curious assertion over a 20 year span. To what are you referring?

> >
> > Now I've said the above let me continue with 3/n.
> >
> > Richard.
> >
> >> Comments and feedback always welcome! Thanks
> >> Andrew