From: Richard Biener
Date: Wed, 05 Jun 2019 09:59:00 -0000
Subject: Re: On-Demand range technology [2/5] - Major Components : How it works
To: Andrew MacLeod
Cc: GCC, Jeff Law, Aldy Hernandez
On Tue, Jun 4, 2019 at 9:55 PM Andrew MacLeod wrote:
>
> On 6/4/19 1:07 PM, Richard Biener wrote:
> > On June 4, 2019 6:50:07 PM GMT+02:00, Andrew MacLeod wrote:
> >> On 6/4/19 11:37 AM, Richard Biener wrote:
> >>
> >>> where the single(!) query of the range of i on the alloca call
> >>> will populate N BBs' caches with the Nth having ranges for
> >>> all SSA defs of i, running into quadraticness in N.
> >> well, not really.  It's still linear, since the 'i' used in each foo()
> >> will be dead, so the next block will contain no range for it
> >>
> >>   try {
> >>     long i_2 = _1;
> >>     i_3 += foo(i_2);
> >>     i_4 += foo(i_3);
> >>     i_5 += foo(i_4);
> >>     i_6 += foo(i_5);
> >>     ... repeat N times ...
> >>     alloca(i_6); // query range of i
> >>
> >> once i_2 is 'dead' after the first call to foo(), it will no longer be
> >> in the on-entry cache of any other block.
> >> The walk back never sees i_2 until the first call to foo(), so none of
> > It's actually
> >
> >   utemp_2 = foo(i_1);
> >
> >   bb:
> >   i_3 = i_1 + (long) utemp_2;
> >   utemp_4 = foo(i_3);
> >
> > eliding the separate stmt for the conversion.  From your description of
> > the cache it sounded like getting incoming values is a take-all
> > operation.  Otherwise you'd need negative entries as well (I thought the
> > cache might not contain varying ranges, for example).
>
> ok, utemp_2 and i_1 will be live, so there are 2 ranges on entry.  The
> next block will have only utemp_4 and i_3 live on entry.  So it'll be 2
> ranges per block.
>
> It currently does have varying in the table, but the varying value is
> cached for the entire ssa_name in the cache, so 500 BBs with a varying
> range will only consume a single range.
>
> By negative entries I assume you mean something like the bottom of a
> diamond?
>
>   if (x_3 > 20)
>     blah1 (x_3);
>   else
>     blah2 (x_3);
>   blah3 (x_3);
>
> It does nothing special to come up with the range of x_3 at blah3 ().
> It simply unions all the incoming ranges, which is [21,MAX] union
> [MIN,20], to produce the value [MIN, MAX], which is then put into the
> cache.  We don't use anything special to represent varying.
> varying_p() merely checks if the range is [MIN, MAX].  And when putting
> a value into the cache, if it is varying_p() the cache simply uses its
> unique copy of [MIN, MAX] rather than create a new one every time it is
> needed.
>
> I had originally considered not caching ranges spanning blocks in which
> the range never changes...  The trade-off is an increase in calculation
> time as you walk the CFG to find the last block in which it did change.
> First I figured we'd see if the less error-prone full cache causes any
> real problems or not.  It's always an option to implement later, since
> the cache is self-contained and can resolve its queries internally
> however it wants.
>
> >> blocks after that have a need for its range, so it will not be in
> >> their caches.
> >>
> >> The on-entry cache will only get a single range entry for each of
> >> those blocks: the incoming value of i_x, and that's it.  The
> >> implicitly known live-on-exit attribute is handy here.
> >>
> >> Now, that's not to say we can't construct cases where there are a lot
> >> of ranges being populated.  If we find the cache is really a problem,
> >> we can start sharing ranges, so that if all these i_x's were live
> >> somehow, there would only be one range for each in this case, and the
> >> caches would simply all point to the same range.
> > I suppose using i in the catch block would create sth like this, a phi
> > with a large in-degree, and thus the switch case you already know
> > about?
>
> It would be a PHI.  None of the ranges coming into the block are
> considered live on entry, so they don't get into the cache.  The PHI
> calculation asks for the incoming range on the edge for each argument,
> and we get a global range for the PHI definition calculated, and that is
> it.
Thanks for all the clarifications.

> However, that's not actually the switch problem :-)  The switch problem
> isn't due to a large number of incoming edges, but rather the difficulty
> of calculating case ranges on the outgoing edges from the switch itself.
>
> Stepping back slightly, branches are handled generically the same way
> any expression is: the outgoing edge you want information about provides
> a constant range for the implicit LHS of the branch.
>
> So for an if:
>
>   c_2 = x_3 > 20
>   if (c_2 != 0)
>
> the TRUE edge has a range of bool [1,1], fed back as the implicit LHS
> of the branch expression:
>
>   [1,1] = (c_2 != 0), which range-ops for '!=' says c_2 must be [1,1]
>   in order to satisfy the equation.
>   [1,1] = x_3 > 20, which range-ops for '>' says x_3 must have a range
>   of [21, MAX].
>
> If you want the range on the FALSE edge, the edge starts with the range
> [0,0] into the equation solver, and out pops [MIN, 20].
>
>   switch (i_3)
>
> For a switch, it works the same way, except the edge has an integral
> value, not a boolean one.  This range is fed back into the implicit
> branch expression:
>
>   [edge range] = i_3
>
> so i_3 has whatever value the edge has, since it amounts to a copy.
> Calculating this edge constant value is the problem.
>
>   BB2:
>     switch (i_3) {
>   BB3:
>     case 4:
>     case 5:
>     case 9:
>     case 10:
>       foo (i_3);
>
> In order to evaluate the case values on the edge from BB2 to BB3 as
> [4,5][9,10], we have to loop over all the cases that have a label at
> the beginning of the block and union them together.  Calculating the
> default value is even worse: start with varying and intersect out each
> and every case.
>
> The gimple representation does not make evaluating this easy or cheap,
> since it merely has a list of case LOW/HIGHs with labels...  So you have
> to loop through that entire list for each edge being evaluated to see
> if the label is associated with this edge, and then union together all
> those that match.
This is linear and can't be shortcut.  Multiply it by the
> number of edges, since you probably want a range on each edge, and it's
> suddenly quadratic.  The result is it can be *very* time consuming
> if the switch is very, very large.
>
> And it's a constant value that never changes, so it really could be
> cheap.  We've just never needed to ask this exact question before.
> We've considered multiple ways to address this, but we won't worry
> about any of them just yet.  All in time :-)

Hmm, shouldn't it be

  range edge_ranges[number_of_succs];
  for (int i = 0; i < gimple_switch_num_labels (s); ++i)
    {
      tree casel = gimple_switch_label (s, i);
      basic_block target = label_to_block (CASE_LABEL (casel));
      edge e = find_edge (gimple_bb (s), target); // OK, this needs to be optimized, linear
      union_range (edge_ranges[e->src_index], casel_range); // similarly, we don't have src_index
    }

where the missing detail is precomputing the find_edge + src index things
in a hash_map.  That's possible by a simple linear walk of switch
successor edges.  So computing the edge ranges is linear.  The default
case needs special-casing and its edge range computation delayed until
we know the union of all case ranges.

But yes, I agree the GIMPLE switch representation isn't exactly
helpful...  But I didn't figure out a good improvement yet.  One
possibility is to do sth similar to PHIs and split the case labels into
groups associated with edges, indexed by the index of the edge in the
src->succs vector (there's "conveniently" CASE_CHAIN, which I think is
unused).  Doing this requires CFG hook adjustments similar to how we
handle PHIs, since we have to adjust the edge-index-based case group
vector.

Btw, there's the function-level {start,end}_recording_cases
functionality which gives you get_cases_for_edge, which is overall still
quadratic since it uses find_edge as in my pseudo-code above.
Re-implementing it to compute all in advance would solve this.

Richard.
> >
> >> so far we haven't really run across anything that appears to trigger
> >> significant concerns.
> > Sure.  It's still worth thinking about worst-case behavior and how
> > you'd deal with that, given that once ranger is actually used we will
> > eventually run into these cases.  When a forward evrp walk is then
> > faster than a single on-demand query, that would be concerning.
> >
> > Richard.
>
> Absolutely.  I've been concerned about the worst case since the get-go.
> That's how it's evolved to this design.  We're trying to make sure the
> worst-case scenario is no worse than doing a normal walk.
>
> We can currently run the ranger version of RVRP across blocks in
> dominator order, reverse dominator order, linearly BB1 -> BBN, and
> finally BBN -> BB1.  There are fluctuations in timings, but nothing of
> too much significance.  They all run in similar times and pass all the
> tests.
>
> I have thoughts on how the various pieces can fit together with EVRP;
> I'll stick that in another email.
>
> Andrew