public inbox for gcc@gcc.gnu.org
* On-Demand range technology [2/5] - Major Components : How it works
@ 2019-05-23  1:28 Andrew MacLeod
  2019-05-23 12:55 ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew MacLeod @ 2019-05-23  1:28 UTC (permalink / raw)
  To: GCC; +Cc: Jeff Law, Aldy Hernandez

This note will talk about the 4 major components of the prototype and 
explain how they work together.   I will be fairly light on detail, just 
to give an overview; we can delve into whatever details are needed.
- Range-ops : Range operations at the statement level
- GORI - Generates Outgoing Range Info : Ranges generated by basic blocks
- GORI Cache - combining and caching basic-block range info
- Ranger - API and query control

1 * Range-ops
--------------------
The first major component is the centralized range operation database. 
This is where range operations for tree-codes are implemented.  The 
compiler currently has code which can fold ranges, but the new mechanism 
is a general purpose solver which can solve for other operands. If there 
are N input/output operands and we have ranges for N-1, it is often 
possible to derive the missing range, e.g.:
     lhs = op1 + op2
The existing code in the compiler can solve for LHS given ranges for OP1 
and OP2. This has been extended so that we can also sometimes solve for 
either op1 or op2, e.g.:
     [20,40] = op1 + [5, 10]
...can calculate that op1 has the range [10, 35].
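(To check the arithmetic: solving for op1 simply inverts the addition, so 
op1 = lhs - op2 = [20,40] - [5,10] = [20 - 10, 40 - 5] = [10, 35].)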

This ability to solve for the other operands lets us calculate ranges 
in the reverse of the order we are accustomed to, and is key 
to enabling the on-demand range approach.
     a_2 = b_1 - 20
     if (a_2 < 40)
  A conditional jump has an implicit boolean LHS depending on which edge 
is taken. To evaluate ranges on the TRUE edge of the branch, the LHS is 
[1,1]. To calculate the range for a_2 we simply solve the equation:
     [1,1] = a_2 < 40
which provides the answer as [0,39] (assuming an unsigned type).  
Furthermore, substituting this range for a_2 as the LHS of its 
definition statement:
     [0,39] = b_1 - 20
the same evaluation mechanism can calculate a result for b_1 on the
true edge as [20,59]. This is the key feature which allows the rest of 
the algorithm to work in a general way.

All operations are driven from this component, and the only thing 
required to add support for another tree code is to implement one or 
more methods for it here, and the rest of the range mechanism will 
simply work with it.
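As a rough sketch of what one of these entries might look like (the 
class, type and method names here are illustrative, not the actual 
prototype code):

     class operator_plus : public range_operator
     {
       // Forward: fold lhs = op1 + op2 (overflow handling omitted).
       irange fold_range (const irange &op1, const irange &op2)
       {
         return irange (op1.lower () + op2.lower (),
                        op1.upper () + op2.upper ());
       }
       // Reverse: solve for op1 as lhs - op2.
       irange op1_range (const irange &lhs, const irange &op2)
       {
         return irange (lhs.lower () - op2.upper (),
                        lhs.upper () - op2.lower ());
       }
       // op2_range is symmetric for a commutative operation like this.
     };

Registering such a class for a tree code is all that is needed; every 
query the ranger makes for that code then goes through these methods.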


2 * GORI
------------
The second component is the “Generates Outgoing Range Info” engine.  
This is a basic-block oriented component which determines what ssa-names 
have ranges created on outgoing edges, and how to calculate those ranges.

The last statement in the block is examined to see if it is a branch 
which generates range info; if so, GORI determines which ssa-names in the 
block can have ranges calculated.  It quickly walks the use-def chain 
from the branch to find other definitions within the block that can 
impact the branch and could also have their ranges calculated. This 
summary is then cached for future use.

The GORI component also uses this summary info to perform the 
calculations necessary to determine the outgoing range for any ssa_name 
which can be determined.  For example:
     c_3 = foo ()
     a_2 = b_1 - 20
     if (a_2 < 40)
The summary information would indicate that b_1 and a_2 can have their 
outgoing ranges calculated for this block, and GORI uses the cached 
information to quickly calculate them when required.
The API consists of 2 basic methods, query and calculate:
  - bool has_edge_range_p (edge, name);
  - range outgoing_edge_range (edge, name);
If a query is made for any other ssa-name, has_edge_range_p simply 
returns false, saying this block does not generate a range for that 
name.  This is a key rationale for the summary: we only spend time 
processing names for blocks that actually have ranges generated.
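As a hedged illustration, a client holding a GORI engine object (here 
called 'gori'; the usage is assumed from the description above) would do:

     if (gori.has_edge_range_p (true_edge, a_2))
       {
         irange r = gori.outgoing_edge_range (true_edge, a_2);
         // For the block above, r is [0,39] on the true edge.
       }
     // A query for c_3 gets false from has_edge_range_p: the branch
     // tells us nothing about it, so no calculation is attempted.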


3 * GORI cache
---------------------
The third component builds on the basic block GORI component, and adds 
the ability to traverse the CFG and combine the various ranges provided 
by each basic block into a cache.

It provides both a global-range cache and a range-on-entry cache 
summarizing the range of an ssa_name on entry to the block.

The range-on-entry cache is self-filling, and iteratively evaluates back 
edges. There is a single API entry point which asks for the range on 
entry; if the data has not already been calculated and cached, it spawns 
the following process to calculate the data (sketched in code after this 
list):
     * Walk the CFG from the current block to the definition block, 
and/or any block which has range-on-entry information already 
calculated. These end points are the source blocks.
     * Any non-source blocks visited are added to a worklist to be updated.
     * Starting with the source blocks, push the known outgoing range 
from each block to each successor and update their live-on-entry values 
when they are visited.  If the incoming range changes, mark this block’s 
successors as requiring updating.
     * Continue iterating until no more updates are required.
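In rough pseudocode (all helper names below are assumptions for 
illustration, not the prototype's actual routines), the process is:

     void fill_range_on_entry (tree name, basic_block query_bb)
     {
       // Collect the non-source blocks between the query point and
       // the definition (or already-cached blocks) needing updates.
       auto_vec<basic_block> worklist;
       collect_update_blocks (name, query_bb, &worklist);
       while (!worklist.is_empty ())
         {
           basic_block bb = worklist.pop ();
           // Union the outgoing ranges pushed from the predecessors.
           irange r = union_of_incoming_ranges (name, bb);
           // If the on-entry value changed, successors need a revisit.
           if (set_range_on_entry (name, bb, r))
             {
               edge e;
               edge_iterator ei;
               FOR_EACH_EDGE (e, ei, bb->succs)
                 worklist.safe_push (e->dest);
             }
         }
     }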

The ordering is planned by the ranger such that if there are no back 
edges, it is typically a straightforward single pass.  When back edges 
are involved, the amount of iterating is typically very small. The 
updating is restricted to a single ssa-name, meaning it doesn’t get into 
feedback loops with other names or PHIs, and it usually converges very 
quickly.

It is important to note that this works exclusively with static ranges 
of only a single ssa-name at a time, i.e. ranges which are implicitly 
exposed in the IL for the one name being examined. The values returned by 
these queries are not dependent on changes in other ssa-names, which is 
how the iteration process never gets out of control.

The ranger making the calls to fill this cache has a higher level 
overview, and requests these ranges in definition order such that any 
ssa-names feeding the definition of a name having its cache filled are 
resolved first, providing the best possible results the first time.
     a_2 = b_1 - 20
     if (a_2 < 40)
If the range for a_2 is requested on the true side, the ranger will 
first calculate the range of b_1 on entry to the block, then use this to 
calculate the global range of a_2, and finally the outgoing range on 
the desired edge.
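Spelled out as the rough order of internal queries (names illustrative, 
not the actual calls):

     irange rb = range_on_entry (b_1, bb);        // feeding name first
     irange ra = fold_stmt (a_2_def, rb);         // global range of a_2
     irange re = outgoing_edge_range (true_edge, a_2); // then the edge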

If at some later point, it is discovered that the incoming range of b_1 
has changed in such a way that it has an impact on the outgoing range of 
a_2, the iterative update process can be reapplied by the ranger to 
update the relevant cache entries. This is usually only required in 
cases where multiple ssa-names are affected by back edges and feed each 
other.

4 * Ranger
----------------
The final component is the Ranger, which provides a simple API to clients 
where they can make various simple requests (illustrated with hypothetical 
signatures after this list):
     - Range of an ssa-name at a statement location
     - Range of the result of a stmt
     - Range of an ssa-name on an edge
     - Range of an ssa-name after the last statement in a block
     - Range of an ssa name on entry to a block
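In hypothetical signatures (the spelling here is mine, not necessarily 
the prototype's), that list maps to something like:

     class ranger
     {
     public:
       irange range_of_expr (tree name, gimple *stmt);   // at a stmt
       irange range_of_stmt (gimple *stmt);              // stmt result
       irange range_on_edge (edge e, tree name);
       irange range_after_block (basic_block bb, tree name);
       irange range_on_entry (basic_block bb, tree name);
     };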

The ranger is responsible for processing the IL, walking use/def chains  
and coordinating the various calls into the GORI components to 
pre-satisfy as many conditions as possible before any cache filling is 
performed.  It is also responsible for triggering any additional updates 
which may be required due to newly discovered ranges. We don’t in fact 
do this yet, but it turns out to be rarely required.

Summary
---------------
All evaluations are done on-demand. If no queries are made, there is no 
cost. The Ranger has no prerequisites other than a valid IL and CFG. It 
does not need dominators, nor does it require any other changes within 
an optimization pass in order to be used.

When queries are made, only the minimum effort required is made to 
satisfy the request. This is a big win when it comes to passes which 
only need occasional range information such as warning passes.  Some of 
these passes actually instantiate a new ranger each time they make a 
request, as a demonstration of the low overhead.
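For instance, a warning pass might make a one-off query like the 
following sketch (the names and surrounding details are assumed):

     // Instantiate, query, and discard: no up-front analysis is done.
     ranger r;
     irange idx = r.range_of_expr (index_name, use_stmt);
     if (idx.upper () >= array_length)
       warning_at (gimple_location (use_stmt), 0,
                   "array index may be out of bounds");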

As for passes such as VRP which require complete range information, the 
caching process prevents the same information from being calculated more 
than once.  This means a similar amount of work is done with this 
approach as with the traditional top-down approach currently being 
used, except we also process the back edges and iteratively solve for them.

Comments and feedback always welcome!  Thanks,
Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-23  1:28 On-Demand range technology [2/5] - Major Components : How it works Andrew MacLeod
@ 2019-05-23 12:55 ` Richard Biener
  2019-05-24 15:50   ` Andrew MacLeod
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2019-05-23 12:55 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

On Thu, May 23, 2019 at 3:28 AM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> This note will talk about the 4 major components of the prototype and
> explain how they work together.   I will be fairly light on detail, just
> to give an overview; we can delve into whatever details are needed.
> - Range-ops : Range operations at the statement level
> - GORI - Generates Outgoing Range Info : Ranges generated by basic blocks
> - GORI Cache - combining and caching basic-block range info
> - Ranger - API and query control
>
> 1 * Range-ops
> --------------------
> The first major component is the centralized range operation database.
> This is where range operations for tree-codes are implemented.  The
> compiler currently has code which can fold ranges, but the new mechanism
> is a general purpose solver which can solve for other operands. If there
> are N input/output operands and we have ranges for N-1, it is often
> possible to derive the missing range, e.g.:
>      lhs = op1 + op2
> The existing code in the compiler can solve for LHS given ranges for OP1
> and OP2. This has been extended so that we can also sometimes solve for
> either op1 or op2, e.g.:
>      [20,40] = op1 + [5, 10]
> ...can calculate that op1 has the range [10, 35].
>
> This ability to solve for the other operands provides the ability to
> calculate ranges in the reverse order we are accustomed to, and is key
> to enabling the on-demand range approach.
>      a_2 = b_1 - 20
>      if (a_2 < 40)
>   A conditional jump has an implicit boolean LHS depending on which edge
> is taken. To evaluate ranges on the TRUE edge of the branch, the LHS is
> [1,1]. To calculate the range for a_2 we simply solve the equation:
>      [1,1] = a_2 < 40
> which provides the answer as [0,39].  Furthermore, substituting this
> range for a_2 as the LHS of its definition statement:
>      [0,39] = b_1 - 20
> The same evaluation  mechanism can calculate a result for b_1 on the
> true edge as [20,59]. This is the key feature which allows the rest of
> the algorithm to work in a general way.
>
> All operations are driven from this component, and the only thing
> required to add support for another tree code is to implement one or
> more methods for it here, and the rest of the range mechanism will
> simply work with it.
>
>
> 2 * GORI
> ------------
> The second component is the “Generates Outgoing Range Info” engine.
> This is a basic-block oriented component which determines what ssa-names
> have ranges created on outgoing edges, and how to calculate those ranges.
>
> The last statement in the block is examined to see if it is a branch
> which generates range info; if so, GORI determines which ssa-names in the
> block can have ranges calculated.  It quickly walks the use-def chain
> from the branch to find other definitions within the block that can
> impact the branch and could also have their ranges calculated. This
> summary is then cached for future use.
>
> The GORI component also uses this summary info to perform the
> calculations necessary to determine the outgoing range for any ssa_name
> which can be determined.  For example:
>      c_3 = foo ()
>      a_2 = b_1 - 20
>      if (a_2 < 40)
> The summary information would indicate that b_1 and a_2 can have their
> outgoing ranges calculated for this block, and GORI uses the cached
> information to quickly calculate them when required.
> The API consists of 2 basic methods, query and calculate:
>   - bool has_edge_range_p (edge, name);
>   - range outgoing_edge_range (edge, name);
> If a query is made for any other ssa-name, it simply returns false and
> says this block does not generate a range for that name.  This is a key
> rationale for the summary so we only spend time processing names for
> blocks that actually have ranges generated.

So the current code (in a few cases) derives ranges for SSA uses when
it sees, for example,

 _3 = _4 / _5;

here _5 != 0 for the stmts following this one (OK, this is a bad example
because for divisions we don't do this, but we do it for derefs and ~[0,0]).

The above suggests that iff this is done at all it is not in GORI because
those are not conditional stmts or ranges from feeding those.  The
machinery doing the use-def walking from stmt context also cannot
come along these, so I have the suspicion that Ranger cannot handle
telling us that for the stmt following above, for example

 if (_5 != 0)

that _5 is not zero?

Can you clarify?

>
> 3 * GORI cache
> ---------------------
> The third component builds on the basic block GORI component, and adds
> the ability to traverse the CFG and combine the various ranges provided
> by each basic block into a cache.
>
> It provides both a global-range cache and a range-on-entry cache
> summarizing the range of an ssa_name on entry to the block.
>
> The range-on-entry cache is self-filling, and iteratively evaluates back
> edges. There is a single API entry point which asks for the range on
> entry; if the data has not already been calculated and cached, it spawns
> the following process to calculate the data:
>      * Walk the CFG from the current block to the definition block,
> and/or any block which has range-on-entry information already
> calculated. These end points are the source blocks.
>      * Any non-source blocks visited are added to a worklist to be updated.
>      * Starting with the source blocks, push the known outgoing range
> from each block to each successor and update their live-on-entry values
> when they are visited.  If the incoming range changes, mark this block’s
> successors as requiring updating.
>      * Continue iterating until no more updates are required.
>
> The ordering is planned by the ranger such that if there are no back
> edges, it is typically a straightforward single pass.  When back edges
> are involved, the amount of iterating is typically very small. The
> updating is restricted to a single ssa-name, meaning it doesn’t get into
> feedback loops with other names or PHIs, and it usually converges very
> quickly.
>
> It is important to note that this works exclusively with static ranges
> of only a single ssa-name at a time, i.e. ranges which are implicitly
> exposed in the IL for the one name being examined. The values returned by
> these queries are not dependent on changes in other ssa-names, which is
> how the iteration process never gets out of control.
>
> The ranger making the calls to fill this cache has a higher level
> overview, and requests these ranges in definition order such that any
> ssa-names feeding the definition of a name having its cache filled are
> resolved first, providing the best possible results the first time.
>      a_2 = b_1 - 20
>      if (a_2 < 40)
> If the range for a_2 is requested on the true side, the ranger will
> first calculate the range of b_1 on entry to the block, then use this to
> calculate the global range of a_2, and finally the outgoing range on
> the desired edge.
>
> If at some later point, it is discovered that the incoming range of b_1
> has changed in such a way that it has an impact on the outgoing range of
> a_2, the iterative update process can be reapplied by the ranger to
> update the relevant cache entries. This is usually only required in
> cases where multiple ssa-names are affected by back edges and feed each
> other.
>
> 4 * Ranger
> ----------------
> The final component is the Ranger which provides a simple API to clients
> where they can make various simple requests:
>      - Range of an ssa-name at a statement location
>      - Range of the result of a stmt
>      - Range of an ssa-name on an edge
>      - Range of an ssa-name after the last statement in a block
>      - Range of an ssa name on entry to a block
>
> The ranger is responsible for processing the IL, walking use/def chains
> and coordinating the various calls into the GORI components to
> pre-satisfy as many conditions as possible before any cache filling is
> performed.  It is also responsible for triggering any additional updates
> which may be required due to newly discovered ranges. We don’t in fact
> do this yet, but it turns out to be rarely required.
>
> Summary
> ---------------
> All evaluations are done on-demand. If no queries are made, there is no
> cost. The Ranger has no prerequisites other than a valid IL and CFG. It
> does not need dominators, nor does it require any other changes within
> an optimization pass in order to be used.

Does it need fully up-to-date SSA form or does it only rely on appropriate
SSA uses on stmts in the use-def chain of the SSA name we query
range-info for?  Does it use immediate uses or stmt operands at all
or just use-def chains?

I'm asking because some passes leave parts of the CFG not updated
while working on other parts to reduce the number of update_ssa calls
(ideally to a single one).

> When queries are made, only the minimum effort required is made to
> satisfy the request. This is a big win when it comes to passes which
> only need occasional range information such as warning passes.  Some of
> these passes actually instantiate a new ranger each time they make a
> request, as a demonstration of the low overhead.
>
> As for passes such as VRP which require complete range information, the
> caching process prevents the same information from being calculated more
> than once.  This means a similar amount of work is done with this
> approach, as with the traditional top-down approach currently being
> used, except we also process the back edges and iteratively solve for them.

What's the storage requirement for the cache when doing VRP pass like
processing?  What's the compile-time complexity?  I read you are saying
"similar amount of work" but I guess that's from empirical data comparing
to the existing VRP implementations (SSA VRP and EVRP) rather than
theoretical bounds?

Not sure where to put it so let me put it here:  I'd like to see the SSA
based VRP pass to die (in fact I'd like all ssa-propagator based
passes to die...).
EVRP is the future here and I can see EVRP making use of the
reverse part of Range-Ops to replace the ad-hoc pattern matching
in register_edge_assert_for (badly named function that now gets you
back a list of assertions that hold on an edge).  So I hope to see VRP
yanked (it performing jump threading is the major obstacle) and
Range-Ops being used as common infrastructure for ranges (that leaves
EVRP).  I don't really want something "additional" leaving things for
other people to clean up / merge - honestly you don't have the very best
track record in this area (it might be that your main area of work is not
GCC; still, the @redhat tag on your mail is an obligation here).

Now I've said the above let me continue with 3/n.

Richard.

> Comments and feedback always welcome!  Thanks,
> Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-23 12:55 ` Richard Biener
@ 2019-05-24 15:50   ` Andrew MacLeod
  2019-05-27 13:03     ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew MacLeod @ 2019-05-24 15:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez

On 5/23/19 8:55 AM, Richard Biener wrote:
> On Thu, May 23, 2019 at 3:28 AM Andrew MacLeod <amacleod@redhat.com> wrote:
>>
>> 2 * GORI
>> ------------
>> The second component is the “Generates Outgoing Range Info” engine.
>> This is a basic-block oriented component which determines what ssa-names
>> have ranges created on outgoing edges, and how to calculate those ranges.
>>
>> The last statement in the block is examined to see if it is a branch
>> which generates range info; if so, GORI determines which ssa-names in the
>> block can have ranges calculated.  It quickly walks the use-def chain
>> from the branch to find other definitions within the block that can
>> impact the branch and could also have their ranges calculated. This
>> summary is then cached for future use.
>>
>> The GORI component also uses this summary info to perform the
>> calculations necessary to determine the outgoing range for any ssa_name
>> which can be determined.  For example:
>>       c_3 = foo ()
>>       a_2 = b_1 - 20
>>       if (a_2 < 40)
>> The summary information would indicate that b_1 and a_2 can have their
>> outgoing ranges calculated for this block, and GORI uses the cached
>> information to quickly calculate them when required.
>> The API consists of 2 basic methods, query and calculate:
>>    - bool has_edge_range_p (edge, name);
>>    - range outgoing_edge_range (edge, name);
>> If a query is made for any other ssa-name, it simply returns false and
>> says this block does not generate a range for that name.  This is a key
>> rationale for the summary so we only spend time processing names for
>> blocks that actually have ranges generated.
> So the current code (in a few cases) derives ranges for SSA uses when
> it sees, for example,
>
>   _3 = _4 / _5;
>
> here _5 != 0 for the stmts following this one (OK, this is a bad example
> because for divisions we don't do this, but we do it for derefs and ~[0,0]).
>
> The above suggests that iff this is done at all it is not in GORI because
> those are not conditional stmts or ranges from feeding those.  The
> machinery doing the use-def walking from stmt context also cannot
> come along these so I have the suspicion that Ranger cannot handle
> telling us that for the stmt following above, for example
>
>   if (_5 != 0)
>
> that _5 is not zero?
>
> Can you clarify?

So there are 2 aspects to this.  The range-ops code for DIV_EXPR, if 
asked for the range of op2 () would return ~[0,0] for _5.
But you are also correct in that the walk backwards would not find this.

This is similar functionality to how null_derefs are currently handled, 
and in fact could probably be done simultaneously using the same code 
base.   I didn't bring null derefs up, but this is a good time :-)

There is a separate class used by the gori-cache which tracks the 
non-nullness property at the block level.    It has a single API: 
non_null_deref_p (name, bb), which determines whether there is a 
dereference in any BB for NAME, which indicates whether the name has an 
implicit ~[0,0] range in that basic block or not.

It is implemented by caching a bitmap which has a 1 set for any block 
which contains a deref of that name, so after the first call, it is 
simply a matter of returning that bit.  The first time the API is 
called, it does an immediate-use walk for name and sets the bit for any 
block which contains a statement for which we can infer a non-null result.
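Something along these lines (a minimal sketch; the member and helper 
names are made up):

     bool non_null_deref_p (tree name, basic_block bb)
     {
       bitmap b = m_block_map[SSA_NAME_VERSION (name)];
       if (!b)
         // One-time immediate-use walk: set a bit for every block
         // containing a statement that infers non-null for NAME.
         b = fill_deref_map (name);
       return bitmap_bit_p (b, bb->index);
     }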

We'd want the exact same functionality for divide by zero: we can infer 
~[0,0] from the DIV_EXPR for the block that statement is in. The call is 
already made; it just returns false right now if it isn't a pointer type.


I'll add that and make sure it works.



>
>> 3 * GORI cache
>> ---------------------
>> The third component builds on the basic block GORI component, and adds
>> the ability to traverse the CFG and combine the various ranges provided
>> by each basic block into a cache.
>>
>> It provides both a global-range cache and a range-on-entry cache
>> summarizing the range of an ssa_name on entry to the block.
>>
>> The range-on-entry cache is self-filling, and iteratively evaluates back
>> edges. There is a single API entry point which asks for the range on
>> entry; if the data has not already been calculated and cached, it spawns
>> the following process to calculate the data:
>>       * Walk the CFG from the current block to the definition block,
>> and/or any block which has range-on-entry information already
>> calculated. These end points are the source blocks.
>>       * Any non-source blocks visited are added to a worklist to be updated.
>>       * Starting with the source blocks, push the known outgoing range
>> from each block to each successor and update their live-on-entry values
>> when they are visited.  If the incoming range changes, mark this block’s
>> successors as requiring updating.
>>       * Continue iterating until no more updates are required.
>>
>> The ordering is planned by the ranger such that if there are no back
>> edges, it is typically a straightforward single pass.  When back edges
>> are involved, the amount of iterating is typically very small. The
>> updating is restricted to a single ssa-name, meaning it doesn’t get into
>> feedback loops with other names or PHIs, and it usually converges very
>> quickly.
>>
>> It is important to note that this works exclusively with static ranges
>> of only a single ssa-name at a time, i.e. ranges which are implicitly
>> exposed in the IL for the one name being examined. The values returned by
>> these queries are not dependent on changes in other ssa-names, which is
>> how the iteration process never gets out of control.
>>
>> The ranger making the calls to fill this cache has a higher level
>> overview, and requests these ranges in definition order such that any
>> ssa-names feeding the definition of a name having its cache filled are
>> resolved first, providing the best possible results the first time.
>>       a_2 = b_1 - 20
>>       if (a_2 < 40)
>> If the range for a_2 is requested on the true side, the ranger will
>> first calculate the range of b_1 on entry to the block, then use this to
>> calculate the global range of a_2, and finally the outgoing range on
>> the desired edge.
>>
>> If at some later point, it is discovered that the incoming range of b_1
>> has changed in such a way that it has an impact on the outgoing range of
>> a_2, the iterative update process can be reapplied by the ranger to
>> update the relevant cache entries. This is usually only required in
>> cases where multiple ssa-names are affected by back edges and feed each
>> other.
>>
>> 4 * Ranger
>> ----------------
>> The final component is the Ranger which provides a simple API to clients
>> where they can make various simple requests:
>>       - Range of an ssa-name at a statement location
>>       - Range of the result of a stmt
>>       - Range of an ssa-name on an edge
>>       - Range of an ssa-name after the last statement in a block
>>       - Range of an ssa name on entry to a block
>>
>> The ranger is responsible for processing the IL, walking use/def chains
>> and coordinating the various calls into the GORI components to
>> pre-satisfy as many conditions as possible before any cache filling is
>> performed.  It is also responsible for triggering any additional updates
>> which may be required due to newly discovered ranges. We don’t in fact
>> do this yet, but it turns out to be rarely required.
>>
>> Summary
>> ---------------
>> All evaluations are done on-demand. If no queries are made, there is no
>> cost. The Ranger has no prerequisites other than a valid IL and CFG. It
>> does not need dominators, nor does it require any other changes within
>> an optimization pass in order to be used.
> Does it need fully up-to-date SSA form or does it only rely on appropriate
> SSA uses on stmts in the use-def chain of the SSA name we query
> range-info for?  Does it use immediate uses or stmt operands at all
> or just use-def chains?

The range-ops component works purely on stmt operands.
The gori component which calculates ranges coming out of the block 
requires the SSA_NAME_DEF_STMT to be correct in order to look at the 
previous stmt in the def chain and to determine if it is in the same 
basic block or not.
Immediate uses are used for the non-null/zero property.  If they are not 
up to date, this information would be stale or out of date unless it were 
calculated up front.


>
> I'm asking because some passes leave parts of the CFG not updated
> while working on other parts to reduce the number of update_ssa calls
> (ideally to a single one).
Due to the on-demand nature, any changes made would be reflected 
immediately in any lookup, for better or worse.  At the moment, we don't 
encourage changing the CFG while working with ranges. Any CFG changes 
made may invalidate some of the caches, or require them to grow...  and 
we'd need to hook into the CFG machinery in order to make sure we either 
look at, or clear any affected caches.

It's clearly safer to queue up the changes in that sort of environment, 
but we'd have to identify which kinds of operations are not safe. Our 
implementation of RVRP does everything on the fly... simplifying and 
eliminating statements as it goes.  To the best of my knowledge we are 
not currently using the on-demand engine in places where we are taking 
some things out of SSA and then putting them back in, so I have no 
practical experience with that.  Aldy can correct me if he's doing 
anything in the threader, but I doubt it.

So other than the non-null/zero property, we don't need much SSA 
infrastructure, just "correct" IL/CFG.

>> When queries are made, only the minimum effort required is made to
>> satisfy the request. This is a big win when it comes to passes which
>> only need occasional range information such as warning passes.  Some of
>> these passes actually instantiate a new ranger each time they make a
>> request, as a demonstration of the low overhead.
>>
>> As for passes such as VRP which require complete range information, the
>> caching process prevents the same information from being calculated more
>> than once.  This means a similar amount of work is done with this
>> approach, as with the traditional top-down approach currently being
>> used, except we also process the back edges and iteratively solve for them.
> What's the storage requirement for the cache when doing VRP pass like
> processing?  What's the compile-time complexity?  I read you are saying
> "similar amount of work" but I guess that's from empirical data comparing
> to the existing VRP implementations (SSA VRP and EVRP) rather than
> theoretical bounds?
Yes, compile-time complexity is from empirical speed timings and 
theory-crafting from the algorithms, plus the fact that the on-entry 
cache prevents multiple passes over things.

We have not done a memory analysis yet, nor done anything to see if we 
can improve it.
It makes very heavy use of bitmaps, which are typically quite sparse.
The on-entry cache is a vector of pointers to a range, initially 0, and 
we allocate ranges as needed.   There will be an on-entry range entry 
for any ssa-name which has been queried between the query point and the 
definition.  My belief is this is typically not large, but we have done 
no analysis as yet.  This does mean that once an ssa-name goes 
"out-of-scope", ie no  more uses in the program, it will not be in the 
cache for any of those blocks simply because its never queried past its 
end of life.
>
> Not sure where to put it so let me put it here:  I'd like to see the SSA
> based VRP pass to die (in fact I'd like all ssa-propagator based
> passes to die...).
> EVRP is the future here and I can see EVRP making use of the
> reverse part of Range-Ops to replace the ad-hoc pattern matching
> in register_edge_assert_for (badly named function that now gets you
> back a list of assertions that hold on an edge).  So I hope to see VRP
Does register_edge_assert cache anything? Or walk an assert chain? Or is 
it basically just a "range-on-this-edge" call, with the result combined 
with what is known at that point?
> yanked (it performing jump threading is the major obstacle) and
> Range-Ops being used as common infrastructure for ranges (that leaves
> EVRP).  I don't really want something "additional" leaving things for

Although EVRP can be modified to use the range-ops work, and we can 
probably make VRP go away, I'm not convinced that EVRP has to be the 
future.  Its merely an alternate implementation that has its own set of 
benefits and drawbacks.


> other people to clean up / merge - honestly you don't have the very best
> track record in this area (it might be that your main area of work is not
> GCC; still, the @redhat tag on your mail is an obligation here).
That's a curious assertion over a 20 year span.  To what are you referring?

> Now I've said the above let me continue with 3/n.
>
> Richard.
>
>> Comments and feedback always welcome!  Thanks,
>> Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-24 15:50   ` Andrew MacLeod
@ 2019-05-27 13:03     ` Richard Biener
  2019-05-28 14:17       ` Andrew MacLeod
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2019-05-27 13:03 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> On 5/23/19 8:55 AM, Richard Biener wrote:
> > On Thu, May 23, 2019 at 3:28 AM Andrew MacLeod <amacleod@redhat.com> wrote:
> >>
> >> 2 * GORI
> >> ------------
> >> The second component is the “Generates Outgoing Range Info” engine.
> >> This is a basic-block oriented component which determines what ssa-names
> >> have ranges created on outgoing edges, and how to calculate those ranges.
> >>
> >> The last statement in the block is examined to see if it is a branch
> >> which generates range info; if so, GORI determines which ssa-names in the
> >> block can have ranges calculated.  It quickly walks the use-def chain
> >> from the branch to find other definitions within the block that can
> >> impact the branch and could also have their ranges calculated. This
> >> summary is then cached for future use.
> >>
> >> The GORI component also uses this summary info to perform the
> >> calculations necessary to determine the outgoing range for any ssa_name
> >> which can be determined.  For example:
> >>       c_3 = foo ()
> >>       a_2 = b_1 - 20
> >>       if (a_2 < 40)
> >> The summary information would indicate that b_1 and a_2 can have their
> >> outgoing ranges calculated for this block, and GORI uses the cached
> >> information to quickly calculate them when required.
> >> The API consists of 2 basic methods, query and calculate:
> >>    - bool has_edge_range_p (edge, name);
> >>    - range outgoing_edge_range (edge, name);
> >> If a query is made for any other ssa-name, it simply returns false and
> >> says this block does not generate a range for that name.  This is a key
> >> rationale for the summary so we only spend time processing names for
> >> blocks that actually have ranges generated.
> > So the current code (in a few cases) derives ranges for SSA uses when
> > it sees, for example,
> >
> >   _3 = _4 / _5;
> >
> > here _5 != 0 for the stmts following this one (OK, this is a bad example
> > because for divisions we don't do this, but we do it for derefs and ~[0,0]).
> >
> > The above suggests that iff this is done at all it is not in GORI because
> > those are not conditional stmts or ranges from feeding those.  The
> > machinery doing the use-def walking from stmt context also cannot
> > come along these so I have the suspicion that Ranger cannot handle
> > telling us that for the stmt following above, for example
> >
> >   if (_5 != 0)
> >
> > that _5 is not zero?
> >
> > Can you clarify?
>
> So there are 2 aspects to this.  The range-ops code for DIV_EXPR, if
> asked for the range of op2 () would return ~[0,0] for _5.
> But you are also correct in that the walk backwards would not find this.
>
> This is similar functionality to how null_derefs are currently handled,
> and in fact could probably be done simultaneously using the same code
> base.   I didn't bring null derefs up, but this is a good time :-)
>
> There is a separate class used by the gori-cache which tracks the
> non-nullness property at the block level.    It has a single API:
> non_null_deref_p (name, bb), which determines whether there is a
> dereference in any BB for NAME, which indicates whether the name has an
> implicit ~[0,0] range in that basic block or not.

So when we then have

 _1 = *_2; // after this _2 is non-NULL
 _3 = _1 + 1; // _3 is non-NULL
 _4 = *_3;
...

when an on-demand user asks whether _3 is non-NULL at the
point of _4 = *_3 we don't have this information?  Since the
per-BB caching will only say _1 is non-NULL after the BB.
I'm also not sure whether _3 ever gets non-NULL during
non-NULL processing of the block since walking immediate uses
doesn't really help here?

So this seems to be a fundamental limitation [to the caching scheme],
not sure if it is bad in practice.

Or am I missing something?

> It is implemented by caching a bitmap which has a 1 set for any block
> which contains a deref of that name, so after the first call, it is
> simply a matter of returning that bit.  The first time the API is
> called, it does an immediate-use walk for name and sets the bit for any
> block which contains a statement for which we can infer a non-null result.
>
> We'd want the exact same functionality for divide by zero: we can infer
> ~[0,0] from the DIV_EXPR for the block that statement is in. The call is
> already made; it just returns false right now if it isn't a pointer type.
>
>
> I'll add that and make sure it works.
>
>
>
> >
> >> 3 * GORI cache
> >> ---------------------
> >> The third component builds on the basic block GORI component, and adds
> >> the ability to traverse the CFG and combine the various ranges provided
> >> by each basic block into a cache.
> >>
> >> It provides both a global-range cache and a range-on-entry cache
> >> summarizing the range of an ssa_name on entry to the block.
> >>
> >> The range-on-entry cache is self-filling, and iteratively evaluates back
> >> edges. There is a single API entry point which asks for the range on
> >> entry; if the data has not already been calculated and cached, it spawns
> >> the following process to calculate the data:
> >>       * Walk the CFG from the current block to the definition block,
> >> and/or any block which has range-on-entry information already
> >> calculated. These end points are the source blocks.
> >>       * Any non-source blocks visited are added to a worklist to be updated.
> >>       * Starting with the source blocks, push the known outgoing range
> >> from each block to each successor and update their live-on-entry values
> >> when they are visited.  If the incoming range changes, mark this block’s
> >> successors as requiring updating.
> >>       * Continue iterating until no more updates are required.
> >>
> >> The ordering is planned by the ranger such that if there are no back
> >> edges, it is typically a straightforward single pass.  When back edges
> >> are involved, the amount of iterating is typically very small. The
> >> updating is restricted to a single ssa-name, meaning it doesn’t get into
> >> feedback loops with other names or PHIs, and it usually converges very
> >> quickly.
> >>
> >> It is important to note that this works exclusively with static ranges
> >> of only a single ssa-name at a time, i.e. ranges which are implicitly
> >> exposed in the IL for the one name being examined. The values returned by
> >> these queries are not dependent on changes in other ssa-names, which is
> >> how the iteration process never gets out of control.
> >>
> >> The ranger making the calls to fill this cache has a higher level
> >> overview, and requests these ranges in definition order such that any
> >> ssa-names feeding the definition of a name having its cache filled are
> >> resolved first, providing the best possible results the first time.
> >>       a_2 = b_1 - 20
> >>       if (a_2 < 40)
> >> If the range for a_2 is requested on the true side, the ranger will
> >> first calculate the range of b_1 on entry to the block, then use this to
> >> calculate the global range of a_2, and finally the outgoing range on
> >> the desired edge.
> >>
> >> If at some later point, it is discovered that the incoming range of b_1
> >> has changed in such a way that it has an impact on the outgoing range of
> >> a_2, the iterative update process can be reapplied by the ranger to
> >> update the relevant cache entries. This is usually only required in
> >> cases where multiple ssa-names are affected by back edges and feed each
> >> other.
> >>
> >> 4 * Ranger
> >> ----------------
> >> The final component is the Ranger which provides a simple API to clients
> >> where they can make various simple requests:
> >>       - Range of an ssa-name at a statement location
> >>       - Range of the result of a stmt
> >>       - Range of an ssa-name on an edge
> >>       - Range of an ssa-name after the last statement in a block
> >>       - Range of an ssa name on entry to a block
> >>
> >> The ranger is responsible for processing the IL, walking use/def chains
> >> and coordinating the various calls into the GORI components to
> >> pre-satisfy as many conditions as possible before any cache filling is
> >> performed.  It is also responsible for triggering any additional updates
> >> which may be required due to newly discovered ranges. We don’t in fact
> >> do this yet, but it turns out to be rarely required.
> >>
> >> Summary
> >> ---------------
> >> All evaluations are done on-demand. If no queries are made, there is no
> >> cost. The Ranger has no prerequisites other than a valid IL and CFG. It
> >> does not need dominators, nor does it require any other changes within
> >> an optimization pass in order to be used.
> > Does it need fully up-to-date SSA form or does it only rely on appropriate
> > SSA uses on stmts in the use-def chain of the SSA name we query
> > range-info for?  Does it use immediate uses or stmt operands at all
> > or just use-def chains?
>
> The range-ops component works purely on stmt operands.
> The gori component which calculates ranges coming out of the block
> requires the SSA_NAME_DEF_STMT to be correct in order to look at the
> previous stmt in the def chain and to determine if it is in the same
> basic block or not.
> Immediate uses are used for the non-null/zero property.  If they are not
> up to date, this information would be stale or out of date unless it were
> calculated up front.
>
>
> >
> > I'm asking because some passes leave parts of the CFG not updated
> > while working on other parts to reduce the number of update_ssa calls
> > (ideally to a single one).
> Due to the on-demand nature, any changes made would be reflected
> immediately in any lookup, for better or worse.  At the moment, we don't
> encourage changing the CFG while working with ranges. Any CFG changes
> made may invalidate some of the caches, or require them to grow...  and
> we'd need to hook into the CFG machinery in order to make sure we either
> look at, or clear any affected caches.
>
> It's clearly safer to queue up the changes in that sort of environment,
> but we'd have to identify which kinds of operations are not safe. Our
> implementation of RVRP does everything on the fly... simplifying and
> eliminating statements as it goes.  To the best of my knowledge we are
> not currently using the on-demand engine in places where we are taking
> some things out of SSA and then putting them back in, so I have no
> practical experience with that.  Aldy can correct me if he's doing
> anything in the threader, but I doubt it.
>
> So other than the non-null/zero property, we don't need much SSA
> infrastructure, just "correct" IL/CFG.
>
> >> When queries are made, only the minimum effort required is made to
> >> satisfy the request. This is a big win when it comes to passes which
> >> only need occasional range information such as warning passes.  Some of
> >> these passes actually instantiate a new ranger each time they make a
> >> request, as a demonstration of the low overhead.
> >>
> >> As for passes such as VRP which require complete range information, the
> >> caching process prevents the same information from being calculated more
> >> than once.  This means a similar amount of work is done with this
> >> approach, as with the traditional top-down approach currently being
> >> used, except we also process the back edges and iteratively solve for them.
> > What's the storage requirement for the cache when doing VRP pass like
> > processing?  What's the compile-time complexity?  I read you are saying
> > "similar amount of work" but I guess that's from empirical data comparing
> > to the existing VRP implementations (SSA VRP and EVRP) rather than
> > theoretical bounds?
> Yes, compile-time complexity is from empirical speed timings and
> theory-crafting from the algorithms, plus the fact that the on-entry
> cache prevents multiple passes over things.
>
> We have not done a memory analysis yet, nor done anything to see if we
> can improve it.
> It makes very heavy use of bitmaps, which are typically quite sparse.
> The on-entry cache is a vector of pointers to a range, initially 0, and
> we allocate ranges as needed.   There will be an on-entry range entry
> for any ssa-name which has been queried between the query point and the
> definition.

So that's similar to LIVE-on-entry, thus an SSA name with a range
will have an on-entry range entry on each BB it dominates?
That makes the storage requirement quadratic in the worst case.

>  My belief is this is typically not large, but we have done
> no analysis as yet.  This does mean that once an ssa-name goes
> "out-of-scope", ie no  more uses in the program, it will not be in the
> cache for any of those blocks simply because its never queried past its
> end of life.
> >
> > Not sure where to put it so let me put it here:  I'd like to see the SSA
> > based VRP pass to die (in fact I'd like all ssa-propagator based
> > passes to die...).
> > EVRP is the future here and I can see EVRP making use of the
> > reverse part of Range-Ops to replace the ad-hoc pattern matching
> > in register_edge_assert_for (badly named function that now gets you
> > back a list of assertions that hold on an edge).  So I hope to see VRP
> Does register_edge_assert cache anything? Or walk an assert chain? Or is
> it basically just a "range-on-this-edge" call, with the result combined
> with what is known at that point?

register_edge_assert pattern-matches the stmts feeding a conditional
stmt on an edge.  It doesn't cache anything (so does quite some redundant
work when processing all outgoing edges).  It's not intended to be called
multiple times but rather when the DOM walker enters a block and we
look at the controlling statement to derive conditional ranges.  For
example for

 _1 = (unsigned) _2;
 if (_1 > 1)
  ...

it derives a vector of two relations on the true edge:

 _1 > 1 == true
 (unsigned) _2 > 1 == true

this is actually what we end up building ASSERT_EXPRs for in traditional
VRP, and what we extract ranges from, valid only in a dominated context, in EVRP.

> > yanked (it performing jump threading is the major obstacle) and
> > Range-Ops being used as common infrastructure for ranges (that leaves
> > EVRP).  I don't really want something "additional" leaving things for
>
> Although EVRP can be modified to use the range-ops work, and we can
> probably make VRP go away, I'm not convinced that EVRP has to be the
> future.  Its merely an alternate implementation that has its own set of
> benefits and drawbacks.
>
> > other people to clean up / merge - honestly you don't have the very best
> > track record in this area (it might be that your main area of work is not
> > GCC; still, the @redhat tag on your mail is an obligation here).
> That's a curious assertion over a 20 year span.  To what are you referring?
>
> > Now I've said the above let me continue with 3/n.
> >
> > Richard.
> >
> >> Comments and feedback always welcome!  Thanks,
> >> Andrew
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-27 13:03     ` Richard Biener
@ 2019-05-28 14:17       ` Andrew MacLeod
  2019-05-29 11:16         ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew MacLeod @ 2019-05-28 14:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez

On 5/27/19 9:02 AM, Richard Biener wrote:
> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>>
>>> The above suggests that iff this is done at all it is not in GORI because
>>> those are not conditional stmts or ranges from feeding those.  The
>>> machinery doing the use-def walking from stmt context also cannot
>>> come along these so I have the suspicion that Ranger cannot handle
>>> telling us that for the stmt following above, for example
>>>
>>>    if (_5 != 0)
>>>
>>> that _5 is not zero?
>>>
>>> Can you clarify?
>> So there are 2 aspects to this.  The range-ops code for DIV_EXPR, if
>> asked for the range of op2 () would return ~[0,0] for _5.
>> But you are also correct in that the walk backwards would not find this.
>>
>> This is similar functionality to how null_derefs are currently handled,
>> and in fact could probably be done simultaneously using the same code
>> base.   I didn't bring null derefs up, but this is a good time :-)
>>
>> There is a separate class used by the gori-cache which tracks the
>> non-nullness property at the block level.    It has a single API:
>> non_null_deref_p (name, bb), which determines whether there is a
>> dereference in any BB for NAME, which indicates whether the name has an
>> implicit ~[0,0] range in that basic block or not.
> So when we then have
>
>   _1 = *_2; // after this _2 is non-NULL
>   _3 = _1 + 1; // _3 is non-NULL
>   _4 = *_3;
> ...
>
> when an on-demand user asks whether _3 is non-NULL at the
> point of _4 = *_3 we don't have this information?  Since the
> per-BB caching will only say _1 is non-NULL after the BB.
> I'm also not sure whether _3 ever gets non-NULL during
> non-NULL processing of the block since walking immediate uses
> doesn't really help here?
presumably _3 is globally non-null due to the definition being (pointer 
+ x)... i.e., _3 has a global range of ~[0,0]?
>
> So this seems to be a fundamental limitation [to the caching scheme],
> not sure if it is bad in practice.
>
> Or am I missing something?
>

Not missing anything.  The non-nullness property is maintained globally at 
the basic block level.  Both _1 and _3 are flagged as being non-null in 
the block.  Upon exit, it's a bit check.   If the global information does 
not have the non-nullness property, then when a request is made for 
non-nullness and the def and the use are both within the same block, 
and it's flagged as being non-null in that block, then the request is 
forced back to a quick walk between the def and the use to see if there 
is any non-nullness introduced in between.   Yes, that makes it a linear 
walk, but it's infrequent, and often short, to the best of our knowledge 
at this point anyway :-)
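Roughly, the fallback is (a sketch with assumed helper names; the real 
walk details may differ):

     bool non_null_at_use_p (tree name, gimple *use_stmt)
     {
       basic_block bb = gimple_bb (use_stmt);
       if (!non_null_deref_p (name, bb))
         return false;    // block never infers non-null for NAME
       // Def and use in the same block: walk def..use looking for an
       // intervening statement that infers non-null.
       for (gimple_stmt_iterator gsi
              = gsi_for_stmt (SSA_NAME_DEF_STMT (name));
            gsi_stmt (gsi) != use_stmt; gsi_next (&gsi))
         if (stmt_infers_nonnull_p (gsi_stmt (gsi), name))
           return true;
       return false;
     }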





>>
>> Yes, compile-time complexity is from empirical speed timings and
>> theory-crafting from the algorithms, plus the fact that the on-entry
>> cache prevents multiple passes over things.
>>
>> We have not done a memory analysis yet, nor done anything to see if we
>> can improve it.
>> It makes very heavy use of bitmaps, which are typically quite sparse.
>> The on-entry cache is a vector of pointers to a range, initially 0, and
>> we allocate ranges as needed.   There will be an on-entry range entry
>> for any ssa-name which has been queried between the query point and the
>> definition.
> So that's similar to LIVE-on-entry, thus an SSA name with a range
> will have an on-entry range entry on each BB it dominates?
> That makes the storage requirement quadratic in the worst case.
Yes, assuming a request has been made for all ssa-names everywhere they 
are live.


>
>>   My belief is this is typically not large, but we have done
>> no analysis as yet.  This does mean that once an ssa-name goes
>> "out-of-scope", ie no  more uses in the program, it will not be in the
>> cache for any of those blocks simply because its never queried past its
>> end of life.
>>> Not sure where to put it so let me put it here:  I'd like to see the SSA
>>> based VRP pass to die (in fact I'd like all ssa-propagator based
>>> passes to die...).
>>> EVRP is the future here and I can see EVRP making use of the
>>> reverse part of Range-Ops to replace the ad-hoc pattern matching
>>> in register_edge_assert_for (badly named function that now gets you
>>> back a list of assertions hold on an edge).  So I hope to see VRP
>> Does register_edge_assert cache anything? Or walk an assert chain? Or is
>> it basically just a "range-on-this-edge" call, with the result combined
>> with what is known at that point?
> register_edge_assert pattern-matches the stmts feeding a conditional
> stmt on an edge.  It doesn't cache anything (so does quite some redundant
> work when processing all outgoing edges).  It's not intended to be called
> multiple times but rather when the DOM walker enters a block and we
> look at the controlling statement to derive conditional ranges.  For
> example for
>
>   _1 = (unsigned) _2;
>   if (_1 > 1)
>    ...
>
> it derives a vector of two relations on the true edge:
>
>   _1 > 1 == true
>   (unsigned) _2 > 1 == true
>
> this is actually what we end up building ASSERT_EXPRs for in traditional
> VRP, and what we extract ranges from, valid only in a dominated context, in EVRP.

and that gets mapped to the actual range [2, MAX] when?  That is what 
range_on_edge will produce for both  _1 and _2 on this edge.

If we were to go to the effort of handling symbolics, and the expression was
   if (_1 > _2)
then we'd get ranges _1 = [_2 + 1, MAX] and _2 = [MIN, _1 - 1] from 
range_on_edge.


Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-28 14:17       ` Andrew MacLeod
@ 2019-05-29 11:16         ` Richard Biener
  2019-05-31 15:40           ` Andrew MacLeod
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2019-05-29 11:16 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> On 5/27/19 9:02 AM, Richard Biener wrote:
> > On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> >>
> >>> The above suggests that iff this is done at all it is not in GORI because
> >>> those are not conditional stmts or ranges from feeding those.  The
> >>> machinery doing the use-def walking from stmt context also cannot
> >>> come along these so I have the suspicion that Ranger cannot handle
> >>> telling us that for the stmt following above, for example
> >>>
> >>>    if (_5 != 0)
> >>>
> >>> that _5 is not zero?
> >>>
> >>> Can you clarify?
> >> So there are 2 aspects to this.  The range-ops code for DIV_EXPR, if
> >> asked for the range of op2 () would return ~[0,0] for _5.
> >> But you are also correct in that the walk backwards would not find this.
> >>
> >> This is similar functionality to how null_derefs are currently handled,
> >> and in fact could probably be done simultaneously using the same code
> >> base.   I didn't bring null derefs up, but this is a good time :-)
> >>
> >> There is a separate class used by the gori-cache which tracks the
> >> non-nullness property at the block level.    It has a single API:
> >> non_null_deref_p (name, bb), which determines whether there is a
> >> dereference in any BB for NAME, which indicates whether the name has an
> >> implicit ~[0,0] range in that basic block or not.
> > So when we then have
> >
> >   _1 = *_2; // after this _2 is non-NULL
> >   _3 = _1 + 1; // _3 is non-NULL
> >   _4 = *_3;
> > ...
> >
> > when an on-demand user asks whether _3 is non-NULL at the
> > point of _4 = *_3 we don't have this information?  Since the
> > per-BB caching will only say _1 is non-NULL after the BB.
> > I'm also not sure whether _3 ever gets non-NULL during
> > non-NULL processing of the block since walking immediate uses
> > doesn't really help here?
> presumably _3 is globally non-null due to the definition being (pointer
> + x)  ... ie, _3 has a global range of ~[0,0] ?

No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.

> >
> > So this seems to be a fundamental limitation [to the caching scheme],
> > not sure if it is bad in practice.
> >
> > Or am I missing something?
> >
>
> Not missing anything.  The non-nullness property is maintained globally at
> the basic block level.  both _1 and _3 are flagged as being non-null in
> the block.  Upon exit, its a bit check.   If the global information does
> not have the non-nullness property, then when a request is made for
> non-nullness  and the def and the use are both within the same block,
> and its flagged as being non-null in that block, then the request is
> forced back to a quick walk between the def and the use to see if there
> is any non-nullness introduced in between.   Yes, that makes it a linear
> walk, but its infrequent, and often short.  to the best of our knowledge
> at this point anyway :-)

So with the clarification above do we ever see that _3 is non-NULL?
I suppose the worker processing _3 = _1 + 1 would ask for
_1 non-nullness but we do not record any non-NULL-ness of _1 in
this basic-block (but only at its end).  Consider stmts

 _4 = (uintptr_t) _2;
 _5 = _6 / _4;
 _1 = *_2;
...

here at _1 we know _2 is not NULL.  But if we ask for non-NULLness
of _2 at the definition of _4 we may not compute ~[0, 0] and thus
conclude that _6 / _4 does not trap.

stmt-level tracking of ranges is sometimes important.  This is
something the machinery cannot provide - correct?  At least not
optimistically enough with ranges derived about uses.

> >>
> >> yes, compile-time complexity is from empirical speed timings and
> >> theory-crafting from the algorithms,  and that the on-entry cache
> >> prevents multiple passes over things.
> >>
> >> we have not done a memory analysis yet, not done anything to see if we
> >> can improve it.
> >> It makes very heavy use of bitmaps, which are typically quite sparse.
> >> The on-entry cache is a vector of pointers to a range, initially 0, and
> >> we allocate ranges as needed.   There will be an on-entry range entry
> >> for any ssa-name which has been queried between the query point and the
> >> definition.
> > So that's similar to LIVE-on-entry thus a SSA name with a range
> > will have an on-entry range entry on each BB it dominates?
> > That makes storage requirement quadratic in the worst case.
> Yes, assuming a request has been made for all ssa-names everywhere they
> are live.

You did propose to replace [E]VRP with a walk over the whole function
querying ranges and simplifying stmts, didn't you?

> >>   My belief is this is typically not large, but we have done
> >> no analysis as yet.  This does mean that once an ssa-name goes
> >> "out-of-scope", ie no  more uses in the program, it will not be in the
> >> cache for any of those blocks simply because its never queried past its
> >> end of life.
> >>> Not sure where to put it so let me put it here:  I'd like to see the SSA
> >>> based VRP pass to die (in fact I'd like all ssa-propagator based
> >>> passes to die...).
> >>> EVRP is the future here and I can see EVRP making use of the
> >>> reverse part of Range-Ops to replace the ad-hoc pattern matching
> >>> in register_edge_assert_for (badly named function that now gets you
> >>> back a list of assertions that hold on an edge).  So I hope to see VRP
> >> Does register_edge_assert cache anything? or walk an assert chain? or is
> >> it basically just a "range-on-this-edge"  call? and the result combined
> >> with what is known at that point?
> > register_edge_assert pattern-matches the stmts feeding a conditional
> > stmt on an edge.  It doesn't cache anything (so does quite some redundant
> > work when processing all outgoing edges).  It's not intended to be called
> > multiple times but rather when the DOM walker enters a block and we
> > look at the controlling statement to derive conditional ranges.  For
> > example for
> >
> >   _1 = (unsigned) _2;
> >   if (_1 > 1)
> >    ...
> >
> > it derives a vector of two relations on the true edge:
> >
> >   _1 > 1 == true
> >   (unsigned) _2 > 1 == true
> >
> > this is actually what we end up building ASSERT_EXPRs for in traditional
> > VRP and what we extract ranges from only valid in dominated context from EVRP.
>
> and that gets mapped to the actual range [2, MAX] when?  That is what
> range_on_edge will produce for both  _1 and _2 on this edge.

In EVRP immediately.  In VRP when the later propagation stage visits the
newly inserted ASSERT_EXPRs.

> If we were to go to the effort of handling symbolics, and the expression was
>    if (_1 > _2)
> then we'd get ranges _1 = [_2 + 1, MAX] and _2 = [MIN, _1 - 1] from
> range_on_edge
>
>
> Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-29 11:16         ` Richard Biener
@ 2019-05-31 15:40           ` Andrew MacLeod
  2019-05-31 18:16             ` Jeff Law
                               ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Andrew MacLeod @ 2019-05-31 15:40 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez

On 5/29/19 7:15 AM, Richard Biener wrote:
> On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>> On 5/27/19 9:02 AM, Richard Biener wrote:
>>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>>>>> The above suggests that iff this is done at all it is not in GORI because
>>>>> those are not conditional stmts or ranges from feeding those.  The
>>>>> machinery doing the use-def walking from stmt context also cannot
>>>>> come along these so I have the suspicion that Ranger cannot handle
>>>>> telling us that for the stmt following above, for example
>>>>>
>>>>>     if (_5 != 0)
>>>>>
>>>>> that _5 is not zero?
>>>>>
>>>>> Can you clarify?
>>>> So there are 2 aspects to this.    the range-ops code for DIV_EXPR, if
>>>> asked for the range of op2 () would return ~[0,0] for _5.
>>>> But you are also correct in that the walk backwards would not find this.
>>>>
>>>> This is similar functionality to how null_derefs are currently handled,
>>>> and in fact could probably be done simultaneously using the same code
>>>> base.   I didn't bring null derefs up, but this is a good time :-)
>>>>
>>>> There is a separate class used by the gori-cache which tracks the
>>>> non-nullness property at the block level.    It has a single API:
>>>> non_null_deref_p (name, bb)    which determines whether there is a
>>>> dereference in any BB for NAME, which indicates whether the range has an
>>>> implicit ~[0,0] range in that basic block or not.
>>> So when we then have
>>>
>>>    _1 = *_2; // after this _2 is non-NULL
>>>    _3 = _1 + 1; // _3 is non-NULL
>>>    _4 = *_3;
>>> ...
>>>
>>> when an on-demand user asks whether _3 is non-NULL at the
>>> point of _4 = *_3 we don't have this information?  Since the
>>> per-BB caching will only say _1 is non-NULL after the BB.
>>> I'm also not sure whether _3 ever gets non-NULL during
>>> non-NULL processing of the block since walking immediate uses
>>> doesn't really help here?
>> presumably _3 is globally non-null due to the definition being (pointer
>> + x)  ... ie, _3 has a global range of ~[0,0] ?
> No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
> you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.

I'm confused.

_1 was loaded from _2 (thus asserting _2 is non-NULL), but we have no 
idea what the range of _1 is, so how do you assert _1 is ~[0,0]?
The only way I see to determine _3 is non-NULL is through the _4 = *_3 
statement.


>
>>> So this seems to be a fundamental limitation [to the caching scheme],
>>> not sure if it is bad in practice.
>>>
>>> Or am I missing something?
>>>
>> Not missing anything.  The non-nullness property is maintained globally at
>> the basic block level.  both _1 and _3 are flagged as being non-null in
>> the block.  Upon exit, its a bit check.   If the global information does
>> not have the non-nullness property, then when a request is made for
>> non-nullness  and the def and the use are both within the same block,
>> and its flagged as being non-null in that block, then the request is
>> forced back to a quick walk between the def and the use to see if there
>> is any non-nullness introduced in between.   Yes, that makes it a linear
>> walk, but its infrequent, and often short.  to the best of our knowledge
>> at this point anyway :-)
> So with the clarification above do we ever see that _3 is non-NULL?
> I suppose the worker processing _3 = _1 + 1 would ask for
> _1 non-nullness but we do not record any non-NULL-ness of _1 in
> this basic-block (but only at its end).  Consider stmts
>
>   _4 = (uintptr_t) _2;
>   _5 = _6 / _4;
>   _1 = *_2;
> ...
>
> here at _1 we know _2 is not NULL.  But if we ask for non-NULLness
> of _2 at the definition of _4 we may not compute ~[0, 0] and thus
> conclude that _6 / _4 does not trap.
EVRP must look backwards to figure this out, since the forward walk will 
process _5 = _6 / _4 before it sees the dereference of _2... so how does 
it know that _4 is non-zero without looking backwards at things after it 
sees the dereference?  Does it actually do this?

>
> stmt-level tracking of ranges are sometimes important.  This is
> something the machinery cannot provide - correct?  At least not
> optimistically enough with ranges derived about uses.

Maybe I'm the one missing something, but in the absence of statement-level 
exception throwing via 'can_throw_non_call_exceptions' being true,
any assertion made anywhere in the block about an ssa_name applies to the 
entire block, does it not?   ie it doesn't matter if the dereference 
happens first thing in the block or last thing, it's not going to change 
its value within the block... it's going to be non-null throughout the 
entire block.

So if one statement in the block asserts that references to _2 are 
non-null, we can assert that all references to _2 in the block are 
non-null.  Meaning we get all these cases by knowing that the specified 
name is non-zero throughout the block.  This also means we could know 
things earlier in the block than a forward walk would provide.

So with the 2 examples:

   _1 = *_2; // after this _2 is non-NULL
   _3 = _1 + 1;
   _4 = *_3;

both _2 and _3 are flagged as non-null in the block due to the 
dereferences.  I'm not sure what we know about _1 from above, but we do 
know
   ~[0,0] = _1 + 1
so whatever that tells you, if anything, about _1 we know.  It seems to 
me _1 is ~[-1,-1] based on that...
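
As a hedged sketch of that last step (anti_range and solve_op1_plus are 
made-up stand-ins, not GCC's classes): when the LHS excludes exactly one 
value, solving for op1 in lhs = op1 + c just shifts the excluded value.

    #include <cassert>

    struct anti_range { long excluded; };  // ~[x,x]: anything but x

    // Reverse-solve op1 in "lhs = op1 + c": op1 + c can be anything
    // except lhs.excluded, so op1 can be anything except excluded - c.
    anti_range solve_op1_plus (anti_range lhs, long c)
    {
      return { lhs.excluded - c };
    }

    int main ()
    {
      anti_range r = solve_op1_plus ({0}, 1);  // ~[0,0] = _1 + 1
      assert (r.excluded == -1);               // _1 is ~[-1,-1]
    }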

Regardless, I think we know all the same things EVRP does.

Likewise

 _4 = (uintptr_t) _2;
 _5 = _6 / _4;
 _1 = *_2;

_2 will be non-null in the entire block, so _4 must also be non-null 
and we can conclude that the divide does not trap.


Now, when we set the 'can_throw_non_call_exceptions' flag, we'd 
have to resort to statement walks, and we cannot determine that _5 does 
not trap anyway.   EVRP is in the same boat... it doesn't know it's not 
going to trap either because we may never get to the *_2.

>>>> yes, compile-time complexity is from empirical speed timings and
>>>> theory-crafting from the algorithms,  and that the on-entry cache
>>>> prevents multiple passes over things.
>>>>
>>>> we have not done a memory analysis yet, not done anything to see if we
>>>> can improve it.
>>>> It makes very heavy use of bitmaps, which are typically quite sparse.
>>>> The on-entry cache is a vector of pointers to a range, initially 0, and
>>>> we allocate ranges as needed.   There will be an on-entry range entry
>>>> for any ssa-name which has been queried between the query point and the
>>>> definition.
>>> So that's similar to LIVE-on-entry thus a SSA name with a range
>>> will have an on-entry range entry on each BB it dominates?
>>> That makes storage requirement quadratic in the worst case.
>> Yes, assuming a request has been made for all ssa-names everywhere they
>> are live.
> You did propose to replace [E]VRP with a walk over the whole function
> querying ranges and simplifying stmts, did you?
Yes, for the case of EVRP.  But not all use cases query every range 
everywhere.

It's also a bit lower than that, since we cache certain types of ranges.  
We cache a range for varying (or range-for-type if you prefer) of 
whatever the ssa-name type is, so if the range is varying everywhere, 
we actually only have one instance of the range rather than N of them.  
So any name that doesn't have a range reduction anywhere will only 
create a single range instance for the entire CFG.  I think that's the 
most common value, so that should reduce a large number of them.   I've 
also considered caching ranges like we cache tree constants... but I 
haven't delved into that.  I figured if memory turns out to be a 
problem, then we'll look at it then.
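
A rough sketch of that sharing idea (range_cache and its members are 
hypothetical names, not the prototype's classes): the per-block cache 
stores pointers, and every "varying" entry for a given type points at 
one shared instance.

    #include <cassert>
    #include <map>

    struct irange { long lo, hi; };

    struct range_cache
    {
      std::map<int, irange *> varying_per_type;  // keyed by a type id

      irange *varying (int type_id, long min, long max)
      {
        irange *&r = varying_per_type[type_id];
        if (!r)
          r = new irange { min, max };   // allocated once per type
        return r;                        // same pointer ever after
      }
    };

    int main ()
    {
      range_cache cache;
      irange *a = cache.varying (0, -2147483648L, 2147483647L);
      irange *b = cache.varying (0, -2147483648L, 2147483647L);
      assert (a == b);   // N blocks share one instance, not N copies
    }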

Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-31 15:40           ` Andrew MacLeod
@ 2019-05-31 18:16             ` Jeff Law
  2019-05-31 20:27               ` Andrew MacLeod
  2019-06-04 15:27             ` Richard Biener
  2019-06-04 20:04             ` Martin Sebor
  2 siblings, 1 reply; 21+ messages in thread
From: Jeff Law @ 2019-05-31 18:16 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener; +Cc: GCC, Aldy Hernandez

On 5/31/19 9:40 AM, Andrew MacLeod wrote:
> On 5/29/19 7:15 AM, Richard Biener wrote:
>> On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com>
>> wrote:
>>> On 5/27/19 9:02 AM, Richard Biener wrote:
>>>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com>
>>>> wrote:
>>>>>> The above suggests that iff this is done at all it is not in GORI
>>>>>> because
>>>>>> those are not conditional stmts or ranges from feeding those.  The
>>>>>> machinery doing the use-def walking from stmt context also cannot
>>>>>> come along these so I have the suspicion that Ranger cannot handle
>>>>>> telling us that for the stmt following above, for example
>>>>>>
>>>>>>     if (_5 != 0)
>>>>>>
>>>>>> that _5 is not zero?
>>>>>>
>>>>>> Can you clarify?
>>>>> So there are 2 aspects to this.    the range-ops code for DIV_EXPR, if
>>>>> asked for the range of op2 () would return ~[0,0] for _5.
>>>>> But you are also correct in that the walk backwards would not find
>>>>> this.
>>>>>
>>>>> This is similar functionality to how null_derefs are currently
>>>>> handled,
>>>>> and in fact could probably be done simultaneously using the same code
>>>>> base.   I didn't bring null derefs up, but this is a good time :-)
>>>>>
>>>>> There is a separate class used by the gori-cache which tracks the
>>>>> non-nullness property at the block level.    It has a single API:
>>>>> non_null_deref_p (name, bb)    which determines whether there is a
>>>>> dereference in any BB for NAME, which indicates whether the range
>>>>> has an
>>>>> implicit ~[0,0] range in that basic block or not.
>>>> So when we then have
>>>>
>>>>    _1 = *_2; // after this _2 is non-NULL
>>>>    _3 = _1 + 1; // _3 is non-NULL
>>>>    _4 = *_3;
>>>> ...
>>>>
>>>> when an on-demand user asks whether _3 is non-NULL at the
>>>> point of _4 = *_3 we don't have this information?  Since the
>>>> per-BB caching will only say _1 is non-NULL after the BB.
>>>> I'm also not sure whether _3 ever gets non-NULL during
>>>> non-NULL processing of the block since walking immediate uses
>>>> doesn't really help here?
>>> presumably _3 is globally non-null due to the definition being (pointer
>>> + x)  ... ie, _3 has a global range of ~[0,0] ?
>> No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
>> you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.
> 
> I'm confused.
> 
> _1 was loaded from _2 (thus asserting _2 is non-NULL).  but we have no
> idea what the range of _1 is, so  how do you assert _1 is ~[0,0] ?
> The only way I see to determine _3 is non-NULL  is through the _4 = *_3
> statement.
Likewise.  I don't see how we get ~[0,0] for _1, except at the point
after the dereference of _3.


>>>> So this seems to be a fundamental limitation [to the caching scheme],
>>>> not sure if it is bad in practice.
>>>>
>>>> Or am I missing something?
>>>>
>>> Not missing anything.  The non-nullness property is maintained globally at
>>> the basic block level.  both _1 and _3 are flagged as being non-null in
>>> the block.  Upon exit, its a bit check.   If the global information does
>>> not have the non-nullness property, then when a request is made for
>>> non-nullness  and the def and the use are both within the same block,
>>> and its flagged as being non-null in that block, then the request is
>>> forced back to a quick walk between the def and the use to see if there
>>> is any non-nullness introduced in between.   Yes, that makes it a linear
>>> walk, but its infrequent, and often short.  to the best of our knowledge
>>> at this point anyway :-)
>> So with the clarification above do we ever see that _3 is non-NULL?
>> I suppose the worker processing _3 = _1 + 1 would ask for
>> _1 non-nullness but we do not record any non-NULL-ness of _1 in
>> this basic-block (but only at its end).  Consider stmts
>>
>>   _4 = (uintptr_t) _2;
>>   _5 = _6 / _4;
>>   _1 = *_2;
>> ...
>>
>> here at _1 we know _2 is not NULL.  But if we ask for non-NULLness
>> of _2 at the definition of _4 we may not compute ~[0, 0] and thus
>> conclude that _6 / _4 does not trap.
> EVRP must look backwards to figure this out since the forward walk will
> process _5 = _6 / _4 before it sees the dereference to _2... so how does
> it know that _4 is non-zero without looking backwards at things after it
> sees the dereference??  Does it actually do this?
During the forward walk we process the assignment to _5, which is _6 /
_4.  We can infer that _4 is nonzero because division by zero is
undefined behavior.  But I'm not sure how EVRP would go back and then
make a determination about _4 unless it's doing so via an equivalence.
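
FWIW, here is a minimal sketch of that direct inference (infer_op2 is a 
hypothetical helper, not EVRP's actual API): given that a statement 
executed without undefined behavior, we can record a range for one of 
its operands.

    #include <cassert>
    #include <optional>

    struct anti_range { long excluded; };   // ~[x,x]

    enum stmt_code { DIV_EXPR, MULT_EXPR };

    // What executing "lhs = op1 <code> op2" without UB implies about
    // op2: for a division, op2 cannot have been zero.
    std::optional<anti_range> infer_op2 (stmt_code code)
    {
      if (code == DIV_EXPR)
        return anti_range { 0 };   // op2 is ~[0,0] after the stmt
      return std::nullopt;         // nothing inferred otherwise
    }

    int main ()
    {
      assert (infer_op2 (DIV_EXPR)->excluded == 0);
      assert (!infer_op2 (MULT_EXPR).has_value ());
    }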



> 
>>
>> stmt-level tracking of ranges are sometimes important.  This is
>> something the machinery cannot provide - correct?  At least not
>> optimistically enough with ranges derived about uses.
> 
> Maybe I'm the one missing something, but in the absence of statement
> level exception throwing via 'can_throw_non_call_exceptions' being true,
> any assertion made anywhere in the block to an ssa_name applies to the
> entire block does it not?   ie it doesn't matter if the dereference
> happens first thing in the block or last thing, its not going to change
> its value within the block.. its going to be non-null throughout the
> entire block.
No, I don't think it can hold for the entire block.

Consider

  x = p ? 10 : 20;
  foo (x)
  *p = whatever

We don't know p is non-null because foo might not return.  If by way of
the dereference we were to assert p is non-null for the entire block,
then we'd pass the wrong value to foo().
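
In source form the hazard looks something like this (foo and bar are 
placeholder names; the "may never return" property is the assumption):

    extern void foo (int);   // assumed: may never return

    void bar (int *p)
    {
      int x = p ? 10 : 20;   // must still see the p == NULL case here
      foo (x);               // may not return, so *p may never execute
      *p = 1;                // implies p != NULL only past this point
    }

If the dereference were allowed to imply p != NULL for the whole block, 
the test would fold to x = 10, and foo would see the wrong argument 
whenever p is actually NULL.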

Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-31 18:16             ` Jeff Law
@ 2019-05-31 20:27               ` Andrew MacLeod
  2019-05-31 22:00                 ` Jeff Law
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew MacLeod @ 2019-05-31 20:27 UTC (permalink / raw)
  To: Jeff Law, Richard Biener; +Cc: GCC, Aldy Hernandez

On 5/31/19 2:16 PM, Jeff Law wrote:
> On 5/31/19 9:40 AM, Andrew MacLeod wrote:
>> On 5/29/19 7:15 AM, Richard Biener wrote:
>>> On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com>
>>> wrote:
>>>> On 5/27/19 9:02 AM, Richard Biener wrote:
>>>>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com>
>>>>> wrote:
>>>>>>> The above suggests that iff this is done at all it is not in GORI
>>>>>>> because
>>>>>>> those are not conditional stmts or ranges from feeding those.  The
>>>>>>> machinery doing the use-def walking from stmt context also cannot
>>>>>>> come along these so I have the suspicion that Ranger cannot handle
>>>>>>> telling us that for the stmt following above, for example
>>>>>>>
>>>>>>>      if (_5 != 0)
>>>>>>>
>>>>>>> that _5 is not zero?
>>>>>>>
>>>>>>> Can you clarify?
>>>>>> So there are 2 aspects to this.    the range-ops code for DIV_EXPR, if
>>>>>> asked for the range of op2 () would return ~[0,0] for _5.
>>>>>> But you are also correct in that the walk backwards would not find
>>>>>> this.
>>>>>>
>>>>>> This is similar functionality to how null_derefs are currently
>>>>>> handled,
>>>>>> and in fact could probably be done simultaneously using the same code
>>>>>> base.   I didn't bring null derefs up, but this is a good time :-)
>>>>>>
>>>>>> There is a separate class used by the gori-cache which tracks the
>>>>>> non-nullness property at the block level.    It has a single API:
>>>>>> non_null_deref_p (name, bb)    which determines whether there is a
>>>>>> dereference in any BB for NAME, which indicates whether the range
>>>>>> has an
>>>>>> implicit ~[0,0] range in that basic block or not.
>>>>> So when we then have
>>>>>
>>>>>     _1 = *_2; // after this _2 is non-NULL
>>>>>     _3 = _1 + 1; // _3 is non-NULL
>>>>>     _4 = *_3;
>>>>> ...
>>>>>
>>>>> when an on-demand user asks whether _3 is non-NULL at the
>>>>> point of _4 = *_3 we don't have this information?  Since the
>>>>> per-BB caching will only say _1 is non-NULL after the BB.
>>>>> I'm also not sure whether _3 ever gets non-NULL during
>>>>> non-NULL processing of the block since walking immediate uses
>>>>> doesn't really help here?
>>>> presumably _3 is globally non-null due to the definition being (pointer
>>>> + x)  ... ie, _3 has a global range of ~[0,0] ?
>>> No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
>>> you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.
>> I'm confused.
>>
>> _1 was loaded from _2 (thus asserting _2 is non-NULL).  but we have no
>> idea what the range of _1 is, so  how do you assert _1 is ~[0,0] ?
>> The only way I see to determine _3 is non-NULL  is through the _4 = *_3
>> statement.
> Likewise.  I don't see how we get ~[0,0] for _1, except at the point
> after the dereference of _3.
>
>
>>>>> So this seems to be a fundamental limitation [to the caching scheme],
>>>>> not sure if it is bad in practice.
>>>>>
>>>>> Or am I missing something?
>>>>>
>>>> Not missing anything.  The non-nullness property is maintained globally at
>>>> the basic block level.  both _1 and _3 are flagged as being non-null in
>>>> the block.  Upon exit, its a bit check.   If the global information does
>>>> not have the non-nullness property, then when a request is made for
>>>> non-nullness  and the def and the use are both within the same block,
>>>> and its flagged as being non-null in that block, then the request is
>>>> forced back to a quick walk between the def and the use to see if there
>>>> is any non-nullness introduced in between.   Yes, that makes it a linear
>>>> walk, but its infrequent, and often short.  to the best of our knowledge
>>>> at this point anyway :-)
>>> So with the clarification above do we ever see that _3 is non-NULL?
>>> I suppose the worker processing _3 = _1 + 1 would ask for
>>> _1 non-nullness but we do not record any non-NULL-ness of _1 in
>>> this basic-block (but only at its end).  Consider stmts
>>>
>>>    _4 = (uintptr_t) _2;
>>>    _5 = _6 / _4;
>>>    _1 = *_2;
>>> ...
>>>
>>> here at _1 we know _2 is not NULL.  But if we ask for non-NULLness
>>> of _2 at the definition of _4 we may not compute ~[0, 0] and thus
>>> conclude that _6 / _4 does not trap.
>> EVRP must look backwards to figure this out since the forward walk will
>> process _5 = _6 / _4 before it sees the dereference to _2... so how does
>> it know that _4 is non-zero without looking backwards at things after it
>> sees the dereference??  Does it actually do this?
> During the forward walk we process the assignment to _5, which is _6 /
> _4.  We can infer that _4 is nonzero because division by zero is
> undefined behavior.  But I'm not sure how EVRP would go back and then
> make a determination about _4 unless it's doing so via an equivalence.
>
>
>
>>> stmt-level tracking of ranges are sometimes important.  This is
>>> something the machinery cannot provide - correct?  At least not
>>> optimistically enough with ranges derived about uses.
>> Maybe I'm the one missing something, but in the absence of statement
>> level exception throwing via 'can_throw_non_call_exceptions' being true,
>> any assertion made anywhere in the block to an ssa_name applies to the
>> entire block does it not?   ie it doesn't matter if the dereference
>> happens first thing in the block or last thing, its not going to change
>> its value within the block.. its going to be non-null throughout the
>> entire block.
> No, I don't think it can hold for the entire block.
>
> Consider
>
>    x = p ? 10 : 20;
>    foo (x)
>    *p = whatever
>
> We don't know p is non-null because foo might not return.  If by way of
> the dereference we were to assert p is non-null for the entire block,
> then we'd pass the wrong value to foo().
>
> Jeff
>

Interesting.  I started wondering about calls when I was talking about 
non-call exceptions, but couldn't think of an example off the top of my 
head.

OK, so the statement holds true for any section of code without calls in 
it.  If the block is marked as having a non-null deref in it, I need to 
look at the statements in between the relevant def and use to see if there 
is an impact, rather than just checking the flag.  If the flag is false, we 
need to do nothing else.


I don't think any example above affects this.  If we adjust the first 
example to be

    _1 = *_2; // after this _2 is non-NULL
    _3 = _2 + 1; // _3 is non-NULL
    _4 = *_3;

then when we are stepping back and evaluating
   _3 = _2 + 1
if _2 has a range of varying, and the 
"non-null_in_block" flag is set, it would have to trigger a walk up the 
block looking to see if we find a *_2.

The walk terminates when we see (as sketched below)
   a) a call, resulting in _2 = varying
   b) the start of the block, resulting in _2 = varying, or
   c) *_2, resulting in _2 = ~[0,0]
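
Here is that walk in sketch form (made-up types, not the prototype's 
code):

    #include <cassert>
    #include <vector>

    enum stmt_kind { CALL, DEREF_OF_NAME, OTHER };

    // Walk backward from just before the use toward the block start.
    // Returns true when the name is known non-null at the use, i.e.
    // a dereference of it is found before any call intervenes.
    bool non_null_at_use (const std::vector<stmt_kind> &bb, int use_idx)
    {
      for (int i = use_idx - 1; i >= 0; --i)
        {
          if (bb[i] == DEREF_OF_NAME)
            return true;    // c) *_2 seen: _2 = ~[0,0]
          if (bb[i] == CALL)
            return false;   // a) call: _2 = varying
        }
      return false;         // b) start of block: _2 = varying
    }

    int main ()
    {
      //  _1 = *_2;   _3 = _2 + 1;  <- query _2 here
      std::vector<stmt_kind> bb = { DEREF_OF_NAME, OTHER };
      assert (non_null_at_use (bb, 1));
    }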

Now, this only triggers when there is *both* a dereference and a 
'normal' use in the same block.  If both conditions aren't met, no 
walk is required.

There may also be a better approach, but I'd wait until we've measured 
and seen some problem with this.

If it turns out to be some sort of real issue, then we could always fall 
back to having the engine recognize that "oh, this block has both a 
deref and a use, let's evaluate the entire block before answering the 
question".  The results are all cached, and any additional queries for 
that block are then already calculated.

I wonder how frequently it comes up; maybe I'll try some measurements.   
Best case, we do the minimum like we do today; worst case, we end up 
evaluating everything in every block (if every block has this 
situation).   But we'd be doing that without the on-demand approach 
anyway, so it'd just break even in that case.

Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-31 20:27               ` Andrew MacLeod
@ 2019-05-31 22:00                 ` Jeff Law
  2019-05-31 23:08                   ` Andrew MacLeod
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Law @ 2019-05-31 22:00 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener; +Cc: GCC, Aldy Hernandez

On 5/31/19 2:26 PM, Andrew MacLeod wrote:
> On 5/31/19 2:16 PM, Jeff Law wrote:
>> On 5/31/19 9:40 AM, Andrew MacLeod wrote:
>>>> stmt-level tracking of ranges are sometimes important.  This is
>>>> something the machinery cannot provide - correct?  At least not
>>>> optimistically enough with ranges derived about uses.
>>> Maybe I'm the one missing something, but in the absence of statement
>>> level exception throwing via 'can_throw_non_call_exceptions' being true,
>>> any assertion made anywhere in the block to an ssa_name applies to the
>>> entire block does it not?   ie it doesn't matter if the dereference
>>> happens first thing in the block or last thing, its not going to change
>>> its value within the block.. its going to be non-null throughout the
>>> entire block.
>> No, I don't think it can hold for the entire block.
>>
>> Consider
>>
>>    x = p ? 10 : 20;
>>    foo (x)
>>    *p = whatever
>>
>> We don't know p is non-null because foo might not return.  If by way of
>> the dereference we were to assert p is non-null for the entire block,
>> then we'd pass the wrong value to foo().
>>
>> Jeff
>>
> 
> Interesting.  I started wondering about calls when I was talking about
> non-call exceptions, but couldn't think of an example off the top of my
> head.
> 
> OK, so the statement holds true for any section of code without calls in
> it.  If the block is marked as having a non-null deref in it, I need to
> look at statement in between the relevant def and use to see if there is
> an impact rather than just checking the flag.  If the flag is false, we
> need to do nothing else.
What about in a multi-threaded application?   No calls here....

  x = p ? 10 : 20;
  sharedvar = x;
  [ bunch of straightline code]
  *p = <whatever>

Our thread gets preempted and for some reason never restarts?  Another
thread could read sharedvar and get the wrong value?

But one might argue we have similar problems in other places where we
"back propagate" non-nullness.  I'm thinking about cases where the
pointer is passed to a function in an argument slot marked as non-null.
 We mark the name as currently having a non-null value and it can get
back propagated around in rather surprising ways and we've been called
out on it.

Dunno...  Probably worth mulling over some more...
Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-31 22:00                 ` Jeff Law
@ 2019-05-31 23:08                   ` Andrew MacLeod
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew MacLeod @ 2019-05-31 23:08 UTC (permalink / raw)
  To: Jeff Law, Richard Biener; +Cc: GCC, Aldy Hernandez

On 5/31/19 6:00 PM, Jeff Law wrote:
> On 5/31/19 2:26 PM, Andrew MacLeod wrote:
>> On 5/31/19 2:16 PM, Jeff Law wrote:
>>> On 5/31/19 9:40 AM, Andrew MacLeod wrote:
>>>>> stmt-level tracking of ranges are sometimes important.  This is
>>>>> something the machinery cannot provide - correct?  At least not
>>>>> optimistically enough with ranges derived about uses.
>>>> Maybe I'm the one missing something, but in the absence of statement
>>>> level exception throwing via 'can_throw_non_call_exceptions' being true,
>>>> any assertion made anywhere in the block to an ssa_name applies to the
>>>> entire block does it not?   ie it doesn't matter if the dereference
>>>> happens first thing in the block or last thing, its not going to change
>>>> its value within the block.. its going to be non-null throughout the
>>>> entire block.
>>> No, I don't think it can hold for the entire block.
>>>
>>> Consider
>>>
>>>     x = p ? 10 : 20;
>>>     foo (x)
>>>     *p = whatever
>>>
>>> We don't know p is non-null because foo might not return.  If by way of
>>> the dereference we were to assert p is non-null for the entire block,
>>> then we'd pass the wrong value to foo().
>>>
>>> Jeff
>>>
>> Interesting.  I started wondering about calls when I was talking about
>> non-call exceptions, but couldn't think of an example off the top of my
>> head.
>>
>> OK, so the statement holds true for any section of code without calls in
>> it.  If the block is marked as having a non-null deref in it, I need to
>> look at statement in between the relevant def and use to see if there is
>> an impact rather than just checking the flag.  If the flag is false, we
>> need to do nothing else.
> What about in a multi-threaded application?   No calls here....
>
>    x = p ? 10 : 20;
>    sharedvar = x;
>    [ bunch of straightline code]
>    *p = <whatever>
>
> Our thread gets preempted and for some reason never restarts?  Another
> thread could read sharedvar and get the wrong value?
Oh, we're getting into the weeds :-)
The program dies if *p ever gets executed and p is NULL.    So if we 
write sharedvar = 10 because we know this, does that make the program 
wrong if p is NULL?   I dunno.  Maybe, I suppose, some other thread starts 
doing something when it sees 20 and knows this one is going to die... and 
now it won't know it's going to die.  What a screwed-up program :-).


>
> But one might argue we have similar problems in other places where we
> "back propagate" non-nullness.  I'm thinking about cases where the
> pointer is passed to a function in an argument slot marked as non-null.
>   We mark the name as currently having a non-null value and it can get
> back propagated around in rather surprising ways and we've been called
> out on it.
>
> Dunno...  Probably worth mulling over some more...
> Jeff
Yeah, we may not be able to optimistically back-propagate.  I was 
thinking about what happens if you tried to push that 
non-nullness back into previous blocks; you can get into trouble quickly 
:-)  So it does seem like it might be important to maintain the temporal 
integrity of a range when any kind of inferred knowledge like *p or /x 
is applied to it.

This is not a pressing issue yet, since we aren't trying to immediately 
integrate the on-demand engine.  I'll mull it over some more, but we may 
be able to do some processing for blocks in which there is an issue with 
non-zeroness as needed...  And the fallback to fully enumerate and 
forward-process any affected block should certainly work, and not cost 
more than the current mechanism, which does the same thing.


Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-31 15:40           ` Andrew MacLeod
  2019-05-31 18:16             ` Jeff Law
@ 2019-06-04 15:27             ` Richard Biener
  2019-06-04 15:38               ` Richard Biener
  2019-06-04 16:49               ` Andrew MacLeod
  2019-06-04 20:04             ` Martin Sebor
  2 siblings, 2 replies; 21+ messages in thread
From: Richard Biener @ 2019-06-04 15:27 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

On Fri, May 31, 2019 at 5:40 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> On 5/29/19 7:15 AM, Richard Biener wrote:
> > On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> >> On 5/27/19 9:02 AM, Richard Biener wrote:
> >>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> >>>>> The above suggests that iff this is done at all it is not in GORI because
> >>>>> those are not conditional stmts or ranges from feeding those.  The
> >>>>> machinery doing the use-def walking from stmt context also cannot
> >>>>> come along these so I have the suspicion that Ranger cannot handle
> >>>>> telling us that for the stmt following above, for example
> >>>>>
> >>>>>     if (_5 != 0)
> >>>>>
> >>>>> that _5 is not zero?
> >>>>>
> >>>>> Can you clarify?
> >>>> So there are 2 aspects to this.    the range-ops code for DIV_EXPR, if
> >>>> asked for the range of op2 () would return ~[0,0] for _5.
> >>>> But you are also correct in that the walk backwards would not find this.
> >>>>
> >>>> This is similar functionality to how null_derefs are currently handled,
> >>>> and in fact could probably be done simultaneously using the same code
> >>>> base.   I didn't bring null derefs up, but this is a good time :-)
> >>>>
> >>>> There is a separate class used by the gori-cache which tracks the
> >>>> non-nullness property at the block level.    It has a single API:
> >>>> non_null_deref_p (name, bb)    which determines whether there is a
> >>>> dereference in any BB for NAME, which indicates whether the range has an
> >>>> implicit ~[0,0] range in that basic block or not.
> >>> So when we then have
> >>>
> >>>    _1 = *_2; // after this _2 is non-NULL
> >>>    _3 = _1 + 1; // _3 is non-NULL
> >>>    _4 = *_3;
> >>> ...
> >>>
> >>> when an on-demand user asks whether _3 is non-NULL at the
> >>> point of _4 = *_3 we don't have this information?  Since the
> >>> per-BB caching will only say _1 is non-NULL after the BB.
> >>> I'm also not sure whether _3 ever gets non-NULL during
> >>> non-NULL processing of the block since walking immediate uses
> >>> doesn't really help here?
> >> presumably _3 is globally non-null due to the definition being (pointer
> >> + x)  ... ie, _3 has a global range of ~[0,0] ?
> > No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
> > you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.
>
> I'm confused.
>
> _1 was loaded from _2 (thus asserting _2 is non-NULL).  but we have no
> idea what the range of _1 is, so  how do you assert _1 is ~[0,0] ?

Bug on my part - it should offset _2:

 _1 = *_2; // after this _2 is non-NULL
 _3 = _2 + 1; // _3 is non-NULL
 _4 = *_3;

> The only way I see to determine _3 is non-NULL  is through the _4 = *_3
> statement.

So how do you derive a ~[0,0] _2 and thus know that *_3 does not trap?

>
> >
> >>> So this seems to be a fundamental limitation [to the caching scheme],
> >>> not sure if it is bad in practice.
> >>>
> >>> Or am I missing something?
> >>>
> >> Not missing anything.  The non-nullness property is maintained globally at
> >> the basic block level.  both _1 and _3 are flagged as being non-null in
> >> the block.  Upon exit, its a bit check.   If the global information does
> >> not have the non-nullness property, then when a request is made for
> >> non-nullness  and the def and the use are both within the same block,
> >> and its flagged as being non-null in that block, then the request is
> >> forced back to a quick walk between the def and the use to see if there
> >> is any non-nullness introduced in between.   Yes, that makes it a linear
> >> walk, but its infrequent, and often short.  to the best of our knowledge
> >> at this point anyway :-)
> > So with the clarification above do we ever see that _3 is non-NULL?
> > I suppose the worker processing _3 = _1 + 1 would ask for
> > _1 non-nullness but we do not record any non-NULL-ness of _1 in
> > this basic-block (but only at its end).  Consider stmts
> >
> >   _4 = (uintptr_t) _2;
> >   _5 = _6 / _4;
> >   _1 = *_2;
> > ...
> >
> > here at _1 we know _2 is not NULL.  But if we ask for non-NULLness
> > of _2 at the definition of _4 we may not compute ~[0, 0] and thus
> > conclude that _6 / _4 does not trap.
> EVRP must look backwards to figure this out since the forward walk will
> process _5 = _6 / _4 before it sees the dereference to _2... so how does
> it know that _4 is non-zero without looking backwards at things after it
> sees the dereference??  Does it actually do this?

It actually doesn't do this for divisions (but I have a patch lying somewhere).
During the forward walk, when coming along _5 = _6 / _4, it would,
in evrp_range_analyzer::record_ranges_from_stmt via infer_value_range,
push a new value-range for _4.  To infer the same for _2 it would need to
look backward, which I think it doesn't do right now but easily could
(I'm just trying to make up examples to understand ranger better).
When visiting _1 = *_2 we then know _2 is not NULL.

There may be more "useful" cases, for example we could derive
ranges for shift operands based on undefinedness and their range
may be useful for simplifying adjacent stmts.

I'm trying to discover how ranger's apparent recording of ranges only
at BB boundaries interacts with ranges that become "live" at
some point inside the BB.

> >
> > stmt-level tracking of ranges are sometimes important.  This is
> > something the machinery cannot provide - correct?  At least not
> > optimistically enough with ranges derived about uses.
>
> Maybe I'm the one missing something, but in the absence of statement
> level exception throwing via 'can_throw_non_call_exceptions' being true,
> any assertion made anywhere in the block to an ssa_name applies to the
> > entire block does it not?  ie it doesn't matter if the dereference
> happens first thing in the block or last thing, its not going to change
> its value within the block.. its going to be non-null throughout the
> entire block.

Well, one complication I had with the division by zero patch is that
for _2 = _1 / _3 you can derive that _3 is not zero _after_ the stmt
but you may of course not assume it is not zero during evaluation
of the stmt because that may lead you to assume it may not trap
and thus move it to a place it was not executed previously.

> so if one statement in the block asserts that references to _2 are
> non-null, we can assert that all references to _2 in the block are
> non-null. Meaning we get all these cases by knowing that the specified
> name is non-zero through-out the block.  This also means we could know
> things earlier in the block than a forward walk would provide.
>
> So with the 2 examples:
>
>    _1 = *_2; // after this _2 is non-NULL
>    _3 = _1 + 1;
>    _4 = *_3;
>
>
>
> both _2 and _3 are flagged as non-null in the block due to the
> de-references.  Im not sure what we know about _1 from above, but we do
> know
>    ~[0,0] = _1 + 1  .  so whatever that tells you , if anything, about
> _1 we know.  it seems to me _1 is ~[-1,-1] based on that...
>
> Regardless, I think we know all the same things EVRP does.
>
> likewise
>
>   _4 = (uintptr_t) _2;
>   _5 = _6 / _4;
>   _1 = *_2;
>
> _2 will be non-null in the entire block,  so _4 must also be non-null
> and we can conclude that the divide does not trap.

And move the divide up like so?

 _4 = (uintptr_t) _2;
 _5 = _6 / _4;
  if (flag)
   {
    _1 = *_2;
...
   }

most definitely not.

>
>
> now, when we set the ' can_throw_non_call_exceptions' flag, then we'd
> have to resort to statement walks,  and we cannot determine that _5 does
> not trap anyway.   EVRP is in the same boat.. It doesn't know its not
> going to trap either because we may never get to the *_2..

The point is not that we know *_2 does not trap - we don't!  We just know
that if the program reaches the next stmt it did not and thus _2 was not NULL
_for the stmts following the dereference_.

Note I'm not sure it will make a difference in practice since any transform
on a stmt after *_2 relying on the above creates constraints on stmt ordering
which we cannot represent in the IL and thus the result is likely going to be
very fragile.

As said, I was just wondering if the non-NULL machinery you have
integrates with the range machinery or is really independent
(thus the pointer derived from the non-NULL one due to the dereference
case above).

> >>>> yes, compile-time complexity is from empirical speed timings and
> >>>> theory-crafting from the algorithms,  and that the on-entry cache
> >>>> prevents multiple passes over things.
> >>>>
> >>>> we have not done a memory analysis yet, not done anything to see if we
> >>>> can improve it.
> >>>> It makes very heavy use of bitmaps, which are typically quite sparse.
> >>>> The on-entry cache is a vector of pointers to a range, initially 0, and
> >>>> we allocate ranges as needed.   There will be an on-entry range entry
> >>>> for any ssa-name which has been queried between the query point and the
> >>>> definition.
> >>> So that's similar to LIVE-on-entry thus a SSA name with a range
> >>> will have an on-entry range entry on each BB it dominates?
> >>> That makes storage requirement quadratic in the worst case.
> >> Yes, assuming a request has been made for all ssa-names everywhere they
> >> are live.
> > You did propose to replace [E]VRP with a walk over the whole function
> > querying ranges and simplifying stmts, did you?
> yes, for the case of EVRP.  But not all use cases query every range
> everywhere.
>
> its also a bit lower than that since we cache certain types of ranges.
> we cache a range for varying (or range for type if you prefer)  of
> whatever the ssa-name type is.  so if the range is varying everywhere,
> we actually only have one instance of the range rather than N of them.
> So any name that doesn't have a range reduction anywhere will only
> create a single range instance for the entire CFG.

But you still have a reference to the range in every BB dominated by the
definition?

>  I think thats the
> most common value, so that should reduce a large number of them.   I've
> also considering caching ranges like we cache tree constants... but I
> haven't delved into that.  I figured if memory turns out to be a
> problem, then we'll look at it then.
>
> Andrew
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-06-04 15:27             ` Richard Biener
@ 2019-06-04 15:38               ` Richard Biener
  2019-06-04 16:50                 ` Andrew MacLeod
  2019-06-04 16:49               ` Andrew MacLeod
  1 sibling, 1 reply; 21+ messages in thread
From: Richard Biener @ 2019-06-04 15:38 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

On Tue, Jun 4, 2019 at 5:26 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Fri, May 31, 2019 at 5:40 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> >
> > On 5/29/19 7:15 AM, Richard Biener wrote:
> > > On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> > >> On 5/27/19 9:02 AM, Richard Biener wrote:
> > >>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> > >>>>> The above suggests that iff this is done at all it is not in GORI because
> > >>>>> those are not conditional stmts or ranges from feeding those.  The
> > >>>>> machinery doing the use-def walking from stmt context also cannot
> > >>>>> come along these so I have the suspicion that Ranger cannot handle
> > >>>>> telling us that for the stmt following above, for example
> > >>>>>
> > >>>>>     if (_5 != 0)
> > >>>>>
> > >>>>> that _5 is not zero?
> > >>>>>
> > >>>>> Can you clarify?
> > >>>> So there are 2 aspects to this.    the range-ops code for DIV_EXPR, if
> > >>>> asked for the range of op2 () would return ~[0,0] for _5.
> > >>>> But you are also correct in that the walk backwards would not find this.
> > >>>>
> > >>>> This is similar functionality to how null_derefs are currently handled,
> > >>>> and in fact could probably be done simultaneously using the same code
> > >>>> base.   I didn't bring null derefs up, but this is a good time :-)
> > >>>>
> > >>>> There is a separate class used by the gori-cache which tracks the
> > >>>> non-nullness property at the block level.    It has a single API:
> > >>>> non_null_deref_p (name, bb)    which determines whether there is a
> > >>>> dereference in any BB for NAME, which indicates whether the range has an
> > >>>> implicit ~[0,0] range in that basic block or not.
> > >>> So when we then have
> > >>>
> > >>>    _1 = *_2; // after this _2 is non-NULL
> > >>>    _3 = _1 + 1; // _3 is non-NULL
> > >>>    _4 = *_3;
> > >>> ...
> > >>>
> > >>> when an on-demand user asks whether _3 is non-NULL at the
> > >>> point of _4 = *_3 we don't have this information?  Since the
> > >>> per-BB caching will only say _1 is non-NULL after the BB.
> > >>> I'm also not sure whether _3 ever gets non-NULL during
> > >>> non-NULL processing of the block since walking immediate uses
> > >>> doesn't really help here?
> > >> presumably _3 is globally non-null due to the definition being (pointer
> > >> + x)  ... ie, _3 has a global range of ~[0,0] ?
> > > No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
> > > you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.
> >
> > I'm confused.
> >
> > _1 was loaded from _2 (thus asserting _2 is non-NULL).  but we have no
> > idea what the range of _1 is, so  how do you assert _1 is ~[0,0] ?
>
> Bug on my part - it should offset _2:
>
>  _1 = *_2; // after this _2 is non-NULL
>  _3 = _2 + 1; // _3 is non-NULL
>  _4 = *_3;
>
> > The only way I see to determine _3 is non-NULL  is through the _4 = *_3
> > statement.
>
> So how do you derive a ~[0,0] _2 and thus know that *_3 does not trap?
>
> >
> > >
> > >>> So this seems to be a fundamental limitation [to the caching scheme],
> > >>> not sure if it is bad in practice.
> > >>>
> > >>> Or am I missing something?
> > >>>
> > >> Not missing anything.  The non-nullness property is maintained globally at
> > >> the basic block level.  both _1 and _3 are flagged as being non-null in
> > >> the block.  Upon exit, its a bit check.   If the global information does
> > >> not have the non-nullness property, then when a request is made for
> > >> non-nullness  and the def and the use are both within the same block,
> > >> and its flagged as being non-null in that block, then the request is
> > >> forced back to a quick walk between the def and the use to see if there
> > >> is any non-nullness introduced in between.   Yes, that makes it a linear
> > >> walk, but its infrequent, and often short.  to the best of our knowledge
> > >> at this point anyway :-)
> > > So with the clarification above do we ever see that _3 is non-NULL?
> > > I suppose the worker processing _3 = _1 + 1 would ask for
> > > _1 non-nullness but we do not record any non-NULL-ness of _1 in
> > > this basic-block (but only at its end).  Consider stmts
> > >
> > >   _4 = (uintptr_t) _2;
> > >   _5 = _6 / _4;
> > >   _1 = *_2;
> > > ...
> > >
> > > here at _1 we know _2 is not NULL.  But if we ask for non-NULLness
> > > of _2 at the definition of _4 we may not compute ~[0, 0] and thus
> > > conclude that _6 / _4 does not trap.
> > EVRP must look backwards to figure this out since the forward walk will
> > process _5 = _6 / _4 before it sees the dereference to _2... so how does
> > it know that _4 is non-zero without looking backwards at things after it
> > sees the dereference??  Does it actually do this?
>
> It actually doesn't do this for divisions (but I have a patch lying somewhere).
> During the forward walk when coming along _5 = _6 / _4; it would
> in evrp_range_analyzer::record_ranges_from_stmt via infer_value_range
> push a new value-range for _4.  To infer the same for _2 it would need to
> look backward which I think it doesn't do right now but it easily could
> (I'm just trying to make up examples to understand ranger better).
> When visiting _1 = *_2 we then know _2 is not NULL.
>
> There may be more "useful" cases, for example we could derive
> ranges for shift operands based on undefinedness and their range
> may be useful for simplifying adjacent stmts.
>
> I'm trying to discover how ranger appearantly recording ranges only
> at BB boundaries interacts with ranges that become "live" at
> some point inside the BB.
>
> > >
> > > stmt-level tracking of ranges are sometimes important.  This is
> > > something the machinery cannot provide - correct?  At least not
> > > optimistically enough with ranges derived about uses.
> >
> > Maybe I'm the one missing something, but in the absence of statement
> > level exception throwing via 'can_throw_non_call_exceptions' being true,
> > any assertion made anywhere in the block to an ssa_name applies to the
> > > entire block does it not?  ie it doesn't matter if the dereference
> > happens first thing in the block or last thing, its not going to change
> > its value within the block.. its going to be non-null throughout the
> > entire block.
>
> Well, one complication I had with the division by zero patch is that
> for _2 = _1 / _3 you can derive that _3 is not zero _after_ the stmt
> but you may of course not assume it is not zero during evaluation
> of the stmt because that may lead you to assume it may not trap
> and thus move it to a place it was not executed previously.
>
> > so if one statement in the block asserts that references to _2 are
> > non-null, we can assert that all references to _2 in the block are
> > non-null. Meaning we get all these cases by knowing that the specified
> > name is non-zero through-out the block.  This also means we could know
> > things earlier in the block than a forward walk would provide.
> >
> > So with the 2 examples:
> >
> >    _1 = *_2; // after this _2 is non-NULL
> >    _3 = _1 + 1;
> >    _4 = *_3;
> >
> >
> >
> > both _2 and _3 are flagged as non-null in the block due to the
> > de-references.  Im not sure what we know about _1 from above, but we do
> > know
> >    ~[0,0] = _1 + 1  .  so whatever that tells you , if anything, about
> > _1 we know.  it seems to me _1 is ~[-1,-1] based on that...
> >
> > Regardless, I think we know all the same things EVRP does.
> >
> > likewise
> >
> >   _4 = (uintptr_t) _2;
> >   _5 = _6 / _4;
> >   _1 = *_2;
> >
> > _2 will be non-null in the entire block,  so _4 must also be non-null
> > and we can conclude that the divide does not trap.
>
> And move the divide up like so?
>
>  _4 = (uintptr_t) _2;
>  _5 = _6 / _4;
>   if (flag)
>    {
>     _1 = *_2;
> ...
>    }
>
> most definitely not.
>
> >
> >
> > now, when we set the ' can_throw_non_call_exceptions' flag, then we'd
> > have to resort to statement walks,  and we cannot determine that _5 does
> > not trap anyway.   EVRP is in the same boat.. It doesn't know its not
> > going to trap either because we may never get to the *_2..
>
> The point is not that we know *_2 does not trap - we don't!  We just know
> that if the program reaches the next stmt it did not and thus _2 was not NULL
> _for the stmts following the dereference_.
>
> Note I'm not sure it will make a difference in practice since any transform
> on a stmt after *_2 relying on the above creates constraints on stmt ordering
> which we cannot represent in the IL and thus the result is likely going to be
> very fragile.
>
> As said, I was just wondering if the non-NULL machinery you have
> integrates with the range machinery or is really independent
> (thus the pointer derived from the non-NULL one due to the dereference
> case above).
>
> > >>>> yes, compile-time complexity is from empirical speed timings and
> > >>>> theory-crafting from the algorithms,  and that the on-entry cache
> > >>>> prevents multiple passes over things.
> > >>>>
> > >>>> we have not done a memory analysis yet, not done anything to see if we
> > >>>> can improve it.
> > >>>> It makes very heavy use of bitmaps, which are typically quite sparse.
> > >>>> The on-entry cache is a vector of pointers to a range, initially 0, and
> > >>>> we allocate ranges as needed.   There will be an on-entry range entry
> > >>>> for any ssa-name which has been queried between the query point and the
> > >>>> definition.
> > >>> So that's similar to LIVE-on-entry thus a SSA name with a range
> > >>> will have an on-entry range entry on each BB it dominates?
> > >>> That makes storage requirement quadratic in the worst case.
> > >> Yes, assuming a request has been made for all ssa-names everywhere they
> > >> are live.
> > > You did propose to replace [E]VRP with a walk over the whole function
> > > querying ranges and simplifying stmts, did you?
> > yes, for the case of EVRP.  But not all use cases query every range
> > everywhere.
> >
> > its also a bit lower than that since we cache certain types of ranges.
> > we cache a range for varying (or range for type if you prefer)  of
> > whatever the ssa-name type is.  so if the range is varying everywhere,
> > we actually only have one instance of the range rather than N of them.
> > So any name that doesn't have a range reduction anywhere will only
> > create a single range instance for the entire CFG.
>
> But you still have a reference to the range in every BB dominated by the
> definition?

Btw, I was thinking of

unsigned char foo(long);
void baz(unsigned char c)
{
  try{
      long i = c;
      i += foo(i);
      i += foo(i);
      i += foo(i);
      i += foo(i);
... repeat N times ...
     alloca(i); // query range of i
  }
  catch (...)
    {
    }
}


where the single(!) query of the range of i on the alloca call
will populate the caches of N BBs, with the Nth holding ranges for
all the SSA defs of i, running into quadratic behavior in N.
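
(Back-of-the-envelope for that growth - a sketch of the counting
argument only, nothing from ranger: if the k-th block's cache ends up
holding on-entry ranges for the k defs of i above it, the total number
of cached entries is 1 + 2 + ... + N.)

    #include <cassert>

    // Total on-entry cache entries when block k records k ranges:
    // N * (N + 1) / 2, i.e. quadratic in N.
    long total_entries (long n) { return n * (n + 1) / 2; }

    int main () { assert (total_entries (4) == 10); }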

If you actually try this, you will probably have to find a persisting
stmt that some user of the on-demand query looks for.
You converted some of the warning passes, so choose
a stmt that triggers it from there.

Note we've run into "interesting" testcases, mostly from
machine-generated code, which we still want to be able
to optimize at -O1 at least, with reasonable use of
time and memory (and quadraticnesses are to be avoided
like the plague - not that we don't have some).

> >  I think thats the
> > most common value, so that should reduce a large number of them.   I've
> > also considering caching ranges like we cache tree constants... but I
> > haven't delved into that.  I figured if memory turns out to be a
> > problem, then we'll look at it then.
> >
> > Andrew
> >


* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-06-04 15:27             ` Richard Biener
  2019-06-04 15:38               ` Richard Biener
@ 2019-06-04 16:49               ` Andrew MacLeod
  1 sibling, 0 replies; 21+ messages in thread
From: Andrew MacLeod @ 2019-06-04 16:49 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez

On 6/4/19 11:26 AM, Richard Biener wrote:
> On Fri, May 31, 2019 at 5:40 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>> On 5/29/19 7:15 AM, Richard Biener wrote:
>>> On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>>>> On 5/27/19 9:02 AM, Richard Biener wrote:
>>>>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>>>>>>> The above suggests that iff this is done at all it is not in GORI because
>>>>>>> those are not conditional stmts or ranges from feeding those.  The
>>>>>>> machinery doing the use-def walking from stmt context also cannot
>>>>>>> come along these so I have the suspicion that Ranger cannot handle
>>>>>>> telling us that for the stmt following above, for example
>>>>>>>
>>>>>>>      if (_5 != 0)
>>>>>>>
>>>>>>> that _5 is not zero?
>>>>>>>
>>>>>>> Can you clarify?
>>>>>> So there are 2 aspects to this.    the range-ops code for DIV_EXPR, if
>>>>>> asked for the range of op2 () would return ~[0,0] for _5.
>>>>>> But you are also correct in that the walk backwards would not find this.
>>>>>>
>>>>>> This is similar functionality to how null_derefs are currently handled,
>>>>>> and in fact could probably be done simultaneously using the same code
>>>>>> base.   I didn't bring null derefs up, but this is a good time :-)
>>>>>>
>>>>>> There is a separate class used by the gori-cache which tracks the
>>>>>> non-nullness property at the block level.    It has a single API:
>>>>>> non_null_deref_p (name, bb)    which determines whether there is a
>>>>>> dereference in any BB for NAME, which indicates whether the range has an
>>>>>> implicit ~[0,0] range in that basic block or not.
>>>>> So when we then have
>>>>>
>>>>>     _1 = *_2; // after this _2 is non-NULL
>>>>>     _3 = _1 + 1; // _3 is non-NULL
>>>>>     _4 = *_3;
>>>>> ...
>>>>>
>>>>> when an on-demand user asks whether _3 is non-NULL at the
>>>>> point of _4 = *_3 we don't have this information?  Since the
>>>>> per-BB caching will only say _1 is non-NULL after the BB.
>>>>> I'm also not sure whether _3 ever gets non-NULL during
>>>>> non-NULL processing of the block since walking immediate uses
>>>>> doesn't really help here?
>>>> presumably _3 is globally non-null due to the definition being (pointer
>>>> + x)  ... ie, _3 has a global range of ~[0,0] ?
>>> No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
>>> you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.
>> I'm confused.
>>
>> _1 was loaded from _2 (thus asserting _2 is non-NULL).  but we have no
>> idea what the range of _1 is, so how do you assert _1 is ~[0,0] ?
> Bug on my part - it should offset _2:
>
>   _1 = *_2; // after this _2 is non-NULL
>   _3 = _2 + 1; // _3 is non-NULL
>   _4 = *_3;
>
>> The only way I see to determine _3 is non-NULL  is through the _4 = *_3
>> statement.
> So how do you derive a ~[0,0] _2 and thus know that *_3 does not trap?

Well, we know _2 is non-null on exit from the block (because of the *_2
dereference).  We wind backwards, so we know it's non-null when we get
to _3 = _2 + 1.  But as I say later in this note, and in the earlier
discussion with Jeff in this thread, the non-null property needs some
work, as I can rewrite this example to make it wrong. :-P

There is an interesting difference between what we know about things in
a forward walk, what we know in a backward walk, and what we know in a
def-chain walk :-)


>
>>>>> So this seems to be a fundamental limitation [to the caching scheme],
>>>>> not sure if it is bad in practice.
>>>>>
>>>>> Or am I missing something?
>>>>>
>>>> Not missing anything.  The non-nullness property is maintained globally at
>>>> the basic block level.  both _1 and _3 are flagged as being non-null in
>>>> the block.  Upon exit, its a bit check.   If the global information does
>>>> not have the non-nullness property, then when a request is made for
>>>> non-nullness  and the def and the use are both within the same block,
>>>> and its flagged as being non-null in that block, then the request is
>>>> forced back to a quick walk between the def and the use to see if there
>>>> is any non-nullness introduced in between.   Yes, that makes it a linear
>>>> walk, but its infrequent, and often short.  to the best of our knowledge
>>>> at this point anyway :-)
>>> So with the clarification above do we ever see that _3 is non-NULL?
>>> I suppose the worker processing _3 = _1 + 1 would ask for
>>> _1 non-nullness but we do not record any non-NULL-ness of _1 in
>>> this basic-block (but only at its end).  Consider stmts
>>>
>>>    _4 = (uintptr_t) _2;
>>>    _5 = _6 / _4;
>>>    _1 = *_2;
>>> ...
>>>
>>> here at _1 we know _2 is not NULL.  But if we ask for non-NULLness
>>> of _2 at the definition of _4 we may not compute ~[0, 0] and thus
>>> conclude that _6 / _4 does not trap.
>> EVRP must look backwards to figure this out since the forward walk will
>> process _5 = _6 / _4 before it sees the dereference to _2... so how does
>> it know that _4 is non-zero without looking backwards at things after it
>> sees the dereference??  Does it actually do this?
> It actually doesn't do this for divisions (but I have a patch lying somewhere).
> During the forward walk when coming along _5 = _6 / _4; it would
> in evrp_range_analyzer::record_ranges_from_stmt via infer_value_range
> push a new value-range for _4.  To infer the same for _2 it would need to
> look backward which I think it doesn't do right now but it easily could
> (I'm just trying to make up examples to understand ranger better).
> When visiting _1 = *_2 we then know _2 is not NULL.
>
> There may be more "useful" cases, for example we could derive
> ranges for shift operands based on undefinedness and their range
> may be useful for simplifying adjacent stmts.
>
> I'm trying to discover how ranger apparently recording ranges only
> at BB boundaries interacts with ranges that become "live" at
> some point inside the BB.

The ranger records that a name becomes non-zero somewhere within a
block; this is then propagated to the following blocks.

The case where there is work to do is when a name is tagged as having a
non-zero reference in the block, AND
the use being queried is not the non-zero reference, AND
   a) the range contained zero before the block, OR
   b) the definition is within the block.

Then the only way to determine whether a given use is non-zero is to
walk from the use back towards the def/start of the block, looking to
see if we encounter the non-zero reference along the way.
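Roughly, that walk amounts to the following (a minimal sketch with
made-up types and names; this is not the actual ranger code):

  #include <vector>

  // Toy statement: 'derefs_name' marks a "*name"-style use from which
  // we can infer the name is non-zero.
  struct stmt_toy { bool derefs_name; };

  // The statements of one basic block, in execution order.
  typedef std::vector<stmt_toy> block_stmts;

  // Is NAME known non-zero at the use in statement USE_IDX?  Walk
  // backwards towards the def (or the start of the block) and stop at
  // the first dereference of NAME encountered on the way.
  static bool
  nonzero_at_use (const block_stmts &bb, int use_idx, int def_idx)
  {
    for (int i = use_idx - 1; i >= def_idx; --i)
      if (bb[i].derefs_name)
        return true;   // a dereference precedes the use in the block
    return false;      // nothing between def and use: cannot tell
  }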

This thread also contains a longer discussion with Jeff about the
block-level info in the presence of threading, where we concluded it
needs some extra thought.

At the moment, there is not a perfect solution for this.  Values
inferred from unrelated statements are clearly not the strongest aspect
of the backwards-walking ranger, which is why it needs something like
the non-null infrastructure to get these cases.

It's clearly a far more frequent thing for pointers than for other
integrals, and maybe there is something better that can be done for them.



>>> stmt-level tracking of ranges is sometimes important.  This is
>>> something the machinery cannot provide - correct?  At least not
>>> optimistically enough with ranges derived about uses.
>> Maybe I'm the one missing something, but in the absence of statement
>> level exception throwing via 'can_throw_non_call_exceptions' being true,
>> any assertion made anywhere in the block to an ssa_name applies to the
>> entire block does it not?  ie it doesn't matter if the dereference
>> happens first thing in the block or last thing, its not going to change
>> its value within the block.. its going to be non-null throughout the
>> entire block.
> Well, one complication I had with the division by zero patch is that
> for _2 = _1 / _3 you can derive that _3 is not zero _after_ the stmt
> but you may of course not assume it is not zero during evaluation
> of the stmt because that may lead you to assume it may not trap
> and thus move it to a place it was not executed previously.
Yeah, during the reverse walk we can assume that when we reach this
point it is non-zero, but inferred values like this and *p cannot be
assumed ON the statement itself.
I guess that would have to be a property of inferred values.


>
>> so if one statement in the block asserts that references to _2 are
>> non-null, we can assert that all references to _2 in the block are
>> non-null. Meaning we get all these cases by knowing that the specified
>> name is non-zero through-out the block.  This also means we could know
>> things earlier in the block than a forward walk would provide.
>>
>> So with the 2 examples:
>>
>>     _1 = *_2; // after this _2 is non-NULL
>>     _3 = _1 + 1;
>>     _4 = *_3;
>>
>>
>>
>> both _2 and _3 are flagged as non-null in the block due to the
>> de-references.  Im not sure what we know about _1 from above, but we do
>> know
>>     ~[0,0] = _1 + 1  .  so whatever that tells you , if anything, about
>> _1 we know.  it seems to me _1 is ~[-1,-1] based on that...
>>
>> Regardless, I think we know all the same things EVRP does.
>>
>> likewise
>>
>>    _4 = (uintptr_t) _2;
>>    _5 = _6 / _4;
>>    _1 = *_2;
>>
>> _2 will be non-null in the entire block,  so _4 must also be non-null
>> and we can conclude that the divide does not trap.
> And move the divide up like so?
>
>   _4 = (uintptr_t) _2;
>   _5 = _6 / _4;
>    if (flag)
>     {
>      _1 = *_2;
> ...
>     }
>
> most definitely not.
Well, you are in a different block there, and there is nothing in the
first block that says it's non-null now, just the TRUE side of the if...

But I get the point I think you are trying to make, and going back to
the discussion with Jeff, I think we've concluded that the in-block
non-null-ref needs some work to be fully viable :-)

>>
>> now, when we set the ' can_throw_non_call_exceptions' flag, then we'd
>> have to resort to statement walks,  and we cannot determine that _5 does
>> not trap anyway.   EVRP is in the same boat.. It doesn't know its not
>> going to trap either because we may never get to the *_2..
> The point is not that we know *_2 does not trap - we don't!  We just know
> that if the program reaches the next stmt it did not and thus _2 was not NULL
> _for the stmts following the dereference_.
>
> Note I'm not sure it will make a difference in practice since any transform
> on a stmt after *_2 relying on the above creates constraints on stmt ordering
> which we cannot represent in the IL and thus the result is likely going to be
> very fragile.
>
> As said, I was just wondering if the non-NULL machinery you have
> integrates with the range machinery or is really independent
> (thus the pointer derived from the non-NULL one due to the dereference
> case above).

It is currently an independent class that is invoked by the ranger when
evaluating the range on an outgoing edge, to determine if non-null
should be applied to the range.

It looks like I disabled any checks within the block, so at the moment
it won't apply any non-nullness within the block unless it was live on
entry.  I see some potential issues even with the live-on-exit value
being propagated back up into the block if the value wasn't also
non-zero on entry.  I think my current implementation of non-nullness
is only safe to use when the value is non-zero throughout the block.



>>>>>> yes, compile-time complexity is from empirical speed timings and
>>>>>> theory-crafting from the algorithms,  and that the on-entry cache
>>>>>> prevents multiple passes over things.
>>>>>>
>>>>>> we have not done a memory analysis yet, not done anything to see if we
>>>>>> can improve it.
>>>>>> It makes very heavy use of bitmaps, which are typically quite sparse.
>>>>>> The on-entry cache is a vector of pointers to a range, initially 0, and
>>>>>> we allocate ranges as needed.   There will be an on-entry range entry
>>>>>> for any ssa-name which has been queried between the query point and the
>>>>>> definition.
>>>>> So that's similar to LIVE-on-entry thus a SSA name with a range
>>>>> will have an on-entry range entry on each BB it dominates?
>>>>> That makes storage requirement quadratic in the worst case.
>>>> Yes, assuming a request has been made for all ssa-names everywhere they
>>>> are live.
>>> You did propose to replace [E]VRP with a walk over the whole function
>>> querying ranges and simplifying stmts, did you?
>> yes, for the case of EVRP.  But not all use cases query every range
>> everywhere.
>>
>> its also a bit lower than that since we cache certain types of ranges.
>> we cache a range for varying (or range for type if you prefer)  of
>> whatever the ssa-name type is.  so if the range is varying everywhere,
>> we actually only have one instance of the range rather than N of them.
>> So any name that doesn't have a range reduction anywhere will only
>> create a single range instance for the entire CFG.
> But you still have a reference to the range in every BB dominated by the
> definition?

Every block dominated by the definition that leads to a use, yes.
Since we work bottom-up, the live-on-exit property is implicit, so once
a name goes out of liveness there are no more references.  So anything
that is defined and used in the block, and never used again, will never
be in the cache.

>>   I think thats the
>> most common value, so that should reduce a large number of them.   I've
>> also considering caching ranges like we cache tree constants... but I
>> haven't delved into that.  I figured if memory turns out to be a
>> problem, then we'll look at it then.
>>
>> Andrew
>>


* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-06-04 15:38               ` Richard Biener
@ 2019-06-04 16:50                 ` Andrew MacLeod
  2019-06-04 17:07                   ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew MacLeod @ 2019-06-04 16:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez

On 6/4/19 11:37 AM, Richard Biener wrote:
> On Tue, Jun 4, 2019 at 5:26 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> But you still have a reference to the range in every BB dominated by the
>> definition?
> Btw, I was thinking of
>
> unsigned char foo(long);
> void baz(unsigned char c)
> {
>    try{
>        long i = c;
>        i += foo(i);
>        i += foo(i);
>        i += foo(i);
>        i += foo(i);
> ... repeat N times ...
>       alloca(i); // query range of i
>    }
>    catch (...)
>      {
>      }
> }
>
>
> where the single(!) query of the range of i on the alloca call
> will populate N BBs caches with the Nth having ranges for
> all SSA defs of i running into quadraticness in N.
Well, not really.  It's still linear, since the 'i' used in each foo()
will be dead, so the next block will contain no range for it:


   try{
       long i_2 = _1;
       i_3 += foo(i_2);
       i_4 += foo(i_3);
       i_5 += foo(i_4);
       i_6 += foo(i_5);
... repeat N times ...
      alloca(i_6); // query range of i

Once i_2 is 'dead' after the first call to foo(), it will no longer be
in the on-entry cache of any other block.
The walk back never sees i_2 until the first call to foo(), so none of
the blocks after that have a need for its range, and it will not be in
their caches.

The on-entry cache will only get a single range entry for each of those
blocks: the incoming value of i_x, and that's it.  The implicitly known
live-on-exit attribute is handy here.

Now, that's not to say we can't construct cases where a lot of ranges
are being populated.  If we find the cache is really a problem, we can
start caching the ranges themselves, so that if all these i_x's were
live somehow, there would only be one range object for each value in
this case, and the caches would simply all point to the same range.
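As a sketch of that sharing, this interns ranges so equal ranges are
allocated once (range_pool and irange_toy are made-up names, not the
real classes):

  #include <map>
  #include <memory>

  struct irange_toy { long lo, hi; };   // stand-in for a value range

  static bool
  operator< (const irange_toy &a, const irange_toy &b)
  {
    return a.lo != b.lo ? a.lo < b.lo : a.hi < b.hi;
  }

  // Equal ranges share one object: N cache slots holding the same
  // range cost one allocation plus N pointers.
  class range_pool
  {
    std::map<irange_toy, std::shared_ptr<irange_toy> > pool_;
  public:
    std::shared_ptr<irange_toy> intern (const irange_toy &r)
    {
      auto it = pool_.find (r);
      if (it != pool_.end ())
        return it->second;
      auto p = std::make_shared<irange_toy> (r);
      pool_.emplace (r, p);
      return p;
    }
  };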

So far we haven't really run across anything that appears to trigger
significant concerns.



>
> If you actually try you probably will have to find a persisting
> stmt that some user of on-demand query looks for.
> You converted some of the warning passes so choose
> a stmt that triggers it from there.
>
> Note we've run into "interesting" testcases mostly from
> machine generated code which we still want to be able
> to optimize at -O1 at least with reasonable use of
> time and memory (and quadraticnesses are to be avoided
> like a plague - not that we don't have some).
>
>>>   I think thats the
>>> most common value, so that should reduce a large number of them.   I've
>>> also considering caching ranges like we cache tree constants... but I
>>> haven't delved into that.  I figured if memory turns out to be a
>>> problem, then we'll look at it then.
>>>
>>> Andrew
>>>


* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-06-04 16:50                 ` Andrew MacLeod
@ 2019-06-04 17:07                   ` Richard Biener
  2019-06-04 19:55                     ` Andrew MacLeod
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2019-06-04 17:07 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

On June 4, 2019 6:50:07 PM GMT+02:00, Andrew MacLeod <amacleod@redhat.com> wrote:
>On 6/4/19 11:37 AM, Richard Biener wrote:
>> On Tue, Jun 4, 2019 at 5:26 PM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>>
>>> But you still have a reference to the range in every BB dominated by
>>> the definition?
>> Btw, I was thinking of
>>
>> unsigned char foo(long);
>> void baz(unsigned char c)
>> {
>>    try{
>>        long i = c;
>>        i += foo(i);
>>        i += foo(i);
>>        i += foo(i);
>>        i += foo(i);
>> ... repeat N times ...
>>       alloca(i); // query range of i
>>    }
>>    catch (...)
>>      {
>>      }
>> }
>>
>>
>> where the single(!) query of the range of i on the alloca call
>> will populate N BBs caches with the Nth having ranges for
>> all SSA defs of i running into quadraticness in N.
>well, not really.   its still linear since the 'i' used in each foo() 
>will be dead, so the next block will contain no range for it
>
>
>   try{
>       long i_2 = _1;
>       i_3 += foo(i_2);
>       i_4 += foo(i_3);
>       i_5 += foo(i_4);
>       i_6 += foo(i_5);
>... repeat N times ...
>      alloca(i_6); // query range of i
>
>once i_2 is 'dead' after the first call to foo(), it will no longer be 
>in the on-entry cache to any other block.
>The walk back never sees i_2 until the first call to foo(), so none of

It's actually

   utemp_2 = foo(i_1);

BB:
   i_3 = i_1 + (long) utemp_2;
   utemp_4 = foo(i_3);

eliding the separate stmt for the conversion.  From your description of the cache it sounded like getting incoming values is a take-all operation.  Otherwise you'd need negative entries as well (I thought the cache might not contain varying ranges, for example).

>blocks after that have a need for its range, so it will not be in their
>caches.
>
>The on-entry cache will only get a single range entry for each of those
>
>blocks.. the incoming value of i_x  and thats it. The implicitly known 
>live-on-exit attribute is handy here.
>
>Now, that's not to say we cant construct cases where there is a lot of 
>ranges being populated.. If we find the cache is really a problem, we 
>can start caching ranges.. so that if all these i_x's were live
>somehow, 
>there would only be one range for each in this case, and the cache's 
>would simply all point to the same range.

I suppose using i in the catch block would create something like this, a PHI with a large in-degree, and thus the switch case you already know about?

>so far we havent really run across anything that appears to trigger 
>significant concerns.

Sure.  It's still worth thinking about worst-case behavior and how you'd deal with it, given that once ranger is actually used we will eventually run into these cases.  When
a forward EVRP walk is then faster than a single on-demand query, that would be concerning.

Richard. 

>
>
>>
>> If you actually try you probably will have to find a persisting
>> stmt that some user of on-demand query looks for.
>> You converted some of the warning passes so choose
>> a stmt that triggers it from there.
>>
>> Note we've run into "interesting" testcases mostly from
>> machine generated code which we still want to be able
>> to optimize at -O1 at least with reasonable use of
>> time and memory (and quadraticnesses are to be avoided
>> like a plague - not that we don't have some).
>>
>>>>   I think thats the
>>>> most common value, so that should reduce a large number of them.  
>I've
>>>> also considering caching ranges like we cache tree constants... but
>I
>>>> haven't delved into that.  I figured if memory turns out to be a
>>>> problem, then we'll look at it then.
>>>>
>>>> Andrew
>>>>


* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-06-04 17:07                   ` Richard Biener
@ 2019-06-04 19:55                     ` Andrew MacLeod
  2019-06-05  9:59                       ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew MacLeod @ 2019-06-04 19:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez

On 6/4/19 1:07 PM, Richard Biener wrote:
> On June 4, 2019 6:50:07 PM GMT+02:00, Andrew MacLeod <amacleod@redhat.com> wrote:
>> On 6/4/19 11:37 AM, Richard Biener wrote:
>>
>>> where the single(!) query of the range of i on the alloca call
>>> will populate N BBs caches with the Nth having ranges for
>>> all SSA defs of i running into quadraticness in N.
>> well, not really.   its still linear since the 'i' used in each foo()
>> will be dead, so the next block will contain no range for it
>>
>>
>>    try{
>>        long i_2 = _1;
>>        i_3 += foo(i_2);
>>        i_4 += foo(i_3);
>>        i_5 += foo(i_4);
>>        i_6 += foo(i_5);
>> ... repeat N times ...
>>       alloca(i_6); // query range of i
>>
>> once i_2 is 'dead' after the first call to foo(), it will no longer be
>> in the on-entry cache to any other block.
> >> The walk back never sees i_2 until the first call to foo(), so none of
> It's actually
>
>     Utemp_2 = foo(i_1);
>
> Bb:
>     i_3 = i_1 + (long) utemp_2;
>     Utemp_4 = foo(i_3);
>
> Eliding the separate stmt for the conversion. From your description of the cache it sounded like getting incoming values is a take-all operation. Otherwise you'd need negative entries as well (I thought the cache might not contain varying ranges for example).

OK, utemp_2 and i_1 will be live, so there are 2 ranges on entry.  The
next block will have only utemp_4 and i_3 live on entry.  So it'll be 2
ranges per block.


It currently does have varying in the table, but the varying value is
cached once for the entire ssa-name.  So 500 BBs with a varying range
will only consume a single range.

By negative entries I assume you mean something like the bottom of a 
diamond?

if (x_3 > 20)
   blah1 (x_3);
else
   blah2 (x_3);
blah3 (x_3);

It does nothing special to come up with the range of x_3 at blah3 ().
It simply unions all the incoming ranges, which is [21,MAX] union
[MIN,20], producing the value [MIN,MAX], which is then put into the
cache.  We don't use anything special to represent varying;
varying_p() merely checks if the range is [MIN,MAX].  And when putting
a value into the cache, if it is varying_p() the cache simply uses its
unique copy of [MIN,MAX] rather than creating a new one every time it
is needed.


I had originally considered not caching ranges spanning blocks in which
the range never changes... the trade-off is an increase in calculation
time as you walk the CFG to find the last block in which it did change.
I figured we'd first see whether the less error-prone full cache causes
any real problems or not.  It's always an option to implement later,
since the cache is self-contained and can resolve its queries
internally however it wants.
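As a simplified sketch of that cache behavior (the class and member
names are invented for illustration, and memory management is ignored
for brevity):

  #include <climits>
  #include <vector>

  struct irange_toy { long lo, hi; };
  static const irange_toy varying_range = { LONG_MIN, LONG_MAX };

  static bool
  varying_p (const irange_toy &r)
  {
    return r.lo == LONG_MIN && r.hi == LONG_MAX;
  }

  // Per-ssa-name on-entry cache: one slot per basic block, initially
  // null, filled in only when a query propagates a range to the block.
  class ssa_block_cache
  {
    std::vector<const irange_toy *> on_entry_;  // indexed by block number
  public:
    ssa_block_cache (unsigned nblocks) : on_entry_ (nblocks, nullptr) {}

    void set_bb_range (unsigned bb, const irange_toy &r)
    {
      if (varying_p (r))
        on_entry_[bb] = &varying_range;      // share the unique [MIN, MAX]
      else
        on_entry_[bb] = new irange_toy (r);  // allocate only real reductions
    }
    const irange_toy *get_bb_range (unsigned bb) const
    { return on_entry_[bb]; }
  };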

>
> >> blocks after that have a need for its range, so it will not be in their
>> caches.
>>
>> The on-entry cache will only get a single range entry for each of those
>>
>> blocks.. the incoming value of i_x  and thats it. The implicitly known
>> live-on-exit attribute is handy here.
>>
>> Now, that's not to say we cant construct cases where there is a lot of
>> ranges being populated.. If we find the cache is really a problem, we
>> can start caching ranges.. so that if all these i_x's were live
>> somehow,
>> there would only be one range for each in this case, and the cache's
>> would simply all point to the same range.
> I suppose using i in the catch block would create something like this, a PHI with a large in-degree, and thus the switch case you already know about?

It would be a PHI.  None of the ranges coming into the block are
considered live on entry, so they don't get into the cache.  The PHI
calculation asks for the incoming range on the edge of each argument,
and we get a global range calculated for the PHI definition, and that
is it.
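As a rough sketch of that union, using a single-pair hull rather than
ranger's multi-subrange representation (so these unions are coarser
than the real thing; the names are illustrative):

  #include <algorithm>
  #include <vector>

  struct irange_toy { long lo, hi; };

  static irange_toy
  range_union (const irange_toy &a, const irange_toy &b)
  {
    irange_toy r = { std::min (a.lo, b.lo), std::max (a.hi, b.hi) };
    return r;
  }

  // Global range of a PHI: the union of each argument's range on its
  // incoming edge (the per-edge ranges are assumed already computed).
  static irange_toy
  phi_range (const std::vector<irange_toy> &ranges_on_edges)
  {
    irange_toy r = ranges_on_edges[0];
    for (size_t i = 1; i < ranges_on_edges.size (); ++i)
      r = range_union (r, ranges_on_edges[i]);
    return r;
  }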

However, that's not actually the switch problem :-)  The switch problem
isn't due to a large number of incoming edges, but rather the
difficulty of calculating case ranges on the outgoing edges of the
switch itself.

Stepping back slightly: branches are handled generically, the same way
any expression is; the outgoing edge you want information about
provides a constant range for the implicit LHS of the branch.

So for an if:
   c_2 = x_3 > 20
   if (c_2 != 0)

the TRUE edge has a range of bool [1,1], fed back as the implicit LHS
of the branch expression:
[1,1] = (c_2 != 0), for which range-ops for '!=' says c_2 must be [1,1]
in order to satisfy the equation;
[1,1] = x_3 > 20, for which range-ops for '>' says x_3 must have a
range of [21, MAX].

If you want the range on the FALSE edge, the edge starts with the range
[0,0] into the equation solver, and out pops [MIN, 20].
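A toy version of those two solve steps for the '>' case looks like
this (solve_gt is a made-up name; the actual range-ops interface
differs, and overflow at the type boundary is ignored):

  #include <climits>

  struct irange_toy { long lo, hi; };

  // Solve [lhs,lhs] = (x > k) for x: on the TRUE edge ([1,1]) x must
  // be [k+1, MAX]; on the FALSE edge ([0,0]) x must be [MIN, k].
  static irange_toy
  solve_gt (bool lhs_true, long k)
  {
    irange_toy r;
    if (lhs_true)
      { r.lo = k + 1; r.hi = LONG_MAX; }   // x_3 > 20 true: [21, MAX]
    else
      { r.lo = LONG_MIN; r.hi = k; }       // x_3 > 20 false: [MIN, 20]
    return r;
  }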

switch (i_3)

For a switch, it works the same way, except the edge has an integral
value, not a boolean one.  This range is fed back into the implicit
branch expression:
[edge range] = i_3, so i_3 has whatever value the edge has, since it
amounts to a copy.  Calculating this edge constant value is the problem.

BB2:
switch (i_3) {
BB3:
   case 4:
   case 5:
   case 9:
   case 10:
      foo (i_3)

In order to evaluate the case values on the edge from BB2 to BB3 as
[4,5][9,10], we have to loop over all the cases that have a label at
the beginning of the block, and union them together.  Calculating the
default value is even worse: start with varying and intersect out each
and every case.

The gimple representation does not make evaluating this easy or cheap,
since it merely has a list of case LOW/HIGHs with labels... so you have
to loop through that entire list for each edge being evaluated, to see
if the label is associated with this edge, and then union together all
those that match.  This is linear and can't be shortcut; multiply it by
the number of edges, since you probably want a range on each edge, and
it's suddenly quadratic.  The result is that it can be *very* time
consuming if the switch is very, very large.

And it's a constant value that never changes, so it really could be
cheap.  We've just never needed to ask this exact question before.
We've considered multiple ways to address this, but we won't worry
about any of them just yet.  All in time :-)




>
>> so far we havent really run across anything that appears to trigger
>> significant concerns.
> Sure.  It's still worth thinking about worst-case behavior and how you'd deal with it, given that once ranger is actually used we will eventually run into these cases.  When
> a forward EVRP walk is then faster than a single on-demand query, that would be concerning.
>
> Richard.
Absolutely.  I've been concerned about the worst case since the get-go;
that's how it evolved to this design.  We're trying to make sure the
worst-case scenario is no worse than doing a normal walk.

We can currently run the ranger version of RVRP across blocks in
dominator order, reverse dominator order, linearly BB1 -> BBN, and
finally BBN -> BB1.  There are fluctuations in timings, but nothing of
much significance.  They all run in similar times and pass all the
tests.

I have thoughts on how the various pieces can fit together with EVRP;
I'll stick that in another email.

Andrew


* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-05-31 15:40           ` Andrew MacLeod
  2019-05-31 18:16             ` Jeff Law
  2019-06-04 15:27             ` Richard Biener
@ 2019-06-04 20:04             ` Martin Sebor
  2019-06-04 20:22               ` Marc Glisse
  2 siblings, 1 reply; 21+ messages in thread
From: Martin Sebor @ 2019-06-04 20:04 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener; +Cc: GCC, Jeff Law, Aldy Hernandez

On 5/31/19 9:40 AM, Andrew MacLeod wrote:
> On 5/29/19 7:15 AM, Richard Biener wrote:
>> On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com> 
>> wrote:
>>> On 5/27/19 9:02 AM, Richard Biener wrote:
>>>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> 
>>>> wrote:
>>>>>> The above suggests that iff this is done at all it is not in GORI 
>>>>>> because
>>>>>> those are not conditional stmts or ranges from feeding those.  The
>>>>>> machinery doing the use-def walking from stmt context also cannot
>>>>>> come along these so I have the suspicion that Ranger cannot handle
>>>>>> telling us that for the stmt following above, for example
>>>>>>
>>>>>>     if (_5 != 0)
>>>>>>
>>>>>> that _5 is not zero?
>>>>>>
>>>>>> Can you clarify?
>>>>> So there are 2 aspects to this.    the range-ops code for DIV_EXPR, if
>>>>> asked for the range of op2 () would return ~[0,0] for _5.
>>>>> But you are also correct in that the walk backwards would not find 
>>>>> this.
>>>>>
>>>>> This is similar functionality to how null_derefs are currently 
>>>>> handled,
>>>>> and in fact could probably be done simultaneously using the same code
>>>>> base.   I didn't bring null derefs up, but this is a good time :-)
>>>>>
>>>>> There is a separate class used by the gori-cache which tracks the
>>>>> non-nullness property at the block level.    It has a single API:
>>>>> non_null_deref_p (name, bb)    which determines whether there is a
>>>>> dereference in any BB for NAME, which indicates whether the range 
>>>>> has an
>>>>> implicit ~[0,0] range in that basic block or not.
>>>> So when we then have
>>>>
>>>>    _1 = *_2; // after this _2 is non-NULL
>>>>    _3 = _1 + 1; // _3 is non-NULL
>>>>    _4 = *_3;
>>>> ...
>>>>
>>>> when an on-demand user asks whether _3 is non-NULL at the
>>>> point of _4 = *_3 we don't have this information?  Since the
>>>> per-BB caching will only say _1 is non-NULL after the BB.
>>>> I'm also not sure whether _3 ever gets non-NULL during
>>>> non-NULL processing of the block since walking immediate uses
>>>> doesn't really help here?
>>> presumably _3 is globally non-null due to the definition being (pointer
>>> + x)  ... ie, _3 has a global range of ~[0,0] ?
>> No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
>> you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.
> 
> I'm confused.
> 
> _1 was loaded from _2 (thus asserting _2 is non-NULL).  but we have no 
> idea what the range of _1 is, so how do you assert _1 is ~[0,0] ?
> The only way I see to determine _3 is non-NULL  is through the _4 = *_3 
> statement.

In the first two statements from the above (where _1 is a pointer):

   _1 = *_2;
   _3 = _1 + 1;

_1 must be non-null because C/C++ define pointer addition only for
non-null pointers, and therefore so must _3.

Or does the middle-end allow arithmetic on null pointers?

Martin


* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-06-04 20:04             ` Martin Sebor
@ 2019-06-04 20:22               ` Marc Glisse
  2019-06-04 21:40                 ` Andrew MacLeod
  0 siblings, 1 reply; 21+ messages in thread
From: Marc Glisse @ 2019-06-04 20:22 UTC (permalink / raw)
  To: Martin Sebor
  Cc: Andrew MacLeod, Richard Biener, GCC, Jeff Law, Aldy Hernandez

On Tue, 4 Jun 2019, Martin Sebor wrote:

> On 5/31/19 9:40 AM, Andrew MacLeod wrote:
>> On 5/29/19 7:15 AM, Richard Biener wrote:
>>> On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod <amacleod@redhat.com> 
>>> wrote:
>>>> On 5/27/19 9:02 AM, Richard Biener wrote:
>>>>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod <amacleod@redhat.com> 
>>>>> wrote:
>>>>>>> The above suggests that iff this is done at all it is not in GORI 
>>>>>>> because
>>>>>>> those are not conditional stmts or ranges from feeding those.  The
>>>>>>> machinery doing the use-def walking from stmt context also cannot
>>>>>>> come along these so I have the suspicion that Ranger cannot handle
>>>>>>> telling us that for the stmt following above, for example
>>>>>>> 
>>>>>>>     if (_5 != 0)
>>>>>>> 
>>>>>>> that _5 is not zero?
>>>>>>> 
>>>>>>> Can you clarify?
>>>>>> So there are 2 aspects to this.    the range-ops code for DIV_EXPR, if
>>>>>> asked for the range of op2 () would return ~[0,0] for _5.
>>>>>> But you are also correct in that the walk backwards would not find 
>>>>>> this.
>>>>>> 
>>>>>> This is similar functionality to how null_derefs are currently handled,
>>>>>> and in fact could probably be done simultaneously using the same code
>>>>>> base.   I didn't bring null derefs up, but this is a good time :-)
>>>>>> 
>>>>>> There is a separate class used by the gori-cache which tracks the
>>>>>> non-nullness property at the block level.    It has a single API:
>>>>>> non_null_deref_p (name, bb)    which determines whether there is a
>>>>>> dereference in any BB for NAME, which indicates whether the range has 
>>>>>> an
>>>>>> implicit ~[0,0] range in that basic block or not.
>>>>> So when we then have
>>>>> 
>>>>>    _1 = *_2; // after this _2 is non-NULL
>>>>>    _3 = _1 + 1; // _3 is non-NULL
>>>>>    _4 = *_3;
>>>>> ...
>>>>> 
>>>>> when an on-demand user asks whether _3 is non-NULL at the
>>>>> point of _4 = *_3 we don't have this information?  Since the
>>>>> per-BB caching will only say _1 is non-NULL after the BB.
>>>>> I'm also not sure whether _3 ever gets non-NULL during
>>>>> non-NULL processing of the block since walking immediate uses
>>>>> doesn't really help here?
>>>> presumably _3 is globally non-null due to the definition being (pointer
>>>> + x)  ... ie, _3 has a global range of ~[0,0] ?
>>> No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
>>> you cannot arrive at NULL by pointer arithmetic from a non-NULL pointer.
>> 
>> I'm confused.
>> 
>> _1 was loaded from _2 (thus asserting _2 is non-NULL).  but we have no idea 
>> what the range of _1 is, so how do you assert _1 is ~[0,0] ?
>> The only way I see to determine _3 is non-NULL  is through the _4 = *_3 
>> statement.
>
> In the first two statements from the above (where _1 is a pointer):
>
>  _1 = *_2;
>  _3 = _1 + 1;
>
> _1 must be non-null because C/C++ define pointer addition only for
> non-null pointers, and therefore so must _3.

(int*)0+0 is well-defined, so this uses the fact that 1 is non-null. This 
is all well done in extract_range_from_binary_expr already, although it 
seems to miss the (dangerous) optimization NULL + unknown == NULL.

Just in case, a quote:

"When an expression J that has integral type is added to or subtracted
from an expression P of pointer type, the result has the type of P.
(4.1) — If P evaluates to a null pointer value and J evaluates to 0, the
result is a null pointer value.
(4.2) — Otherwise, if P points to element x[i] of an array object x with
n elements, the expressions P + J and J + P (where J has the value j)
point to the (possibly-hypothetical) element x[i + j] if 0 ≤ i + j ≤ n
and the expression P - J points to the (possibly-hypothetical) element
x[i − j] if 0 ≤ i − j ≤ n.
(4.3) — Otherwise, the behavior is undefined"
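Concretely, the quoted rules amount to (a minimal C++ illustration):

  void
  pointer_arith_rules ()
  {
    int a[4];
    int *p = 0;        // null pointer
    int *q = p + 0;    // well-defined by (4.1): null + 0 is null
    int *r = a + 4;    // well-defined by (4.2): one past the end (i + j == n)
    int *s = p + 1;    // undefined by (4.3): p points to no array object
    (void) q; (void) r; (void) s;
  }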

> Or does the middle-end allow arithmetic on null pointers?

When people use -fno-delete-null-pointer-checks because their (embedded) 
platform has important stuff at address 0, they also want to be able to do 
arithmetic there.

-- 
Marc Glisse


* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-06-04 20:22               ` Marc Glisse
@ 2019-06-04 21:40                 ` Andrew MacLeod
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew MacLeod @ 2019-06-04 21:40 UTC (permalink / raw)
  To: gcc, Martin Sebor; +Cc: Richard Biener, Jeff Law, Aldy Hernandez

On 6/4/19 4:21 PM, Marc Glisse wrote:
> On Tue, 4 Jun 2019, Martin Sebor wrote:
>
>> On 5/31/19 9:40 AM, Andrew MacLeod wrote:
>>> On 5/29/19 7:15 AM, Richard Biener wrote:
>>>> On Tue, May 28, 2019 at 4:17 PM Andrew MacLeod 
>>>> <amacleod@redhat.com> wrote:
>>>>> On 5/27/19 9:02 AM, Richard Biener wrote:
>>>>>> On Fri, May 24, 2019 at 5:50 PM Andrew MacLeod 
>>>>>> <amacleod@redhat.com> wrote:
>>>>>>>> The above suggests that iff this is done at all it is not in 
>>>>>>>> GORI because
>>>>>>>> those are not conditional stmts or ranges from feeding those.  The
>>>>>>>> machinery doing the use-def walking from stmt context also cannot
>>>>>>>> come along these so I have the suspicion that Ranger cannot handle
>>>>>>>> telling us that for the stmt following above, for example
>>>>>>>>
>>>>>>>>     if (_5 != 0)
>>>>>>>>
>>>>>>>> that _5 is not zero?
>>>>>>>>
>>>>>>>> Can you clarify?
>>>>>>> So there are 2 aspects to this.    the range-ops code for 
>>>>>>> DIV_EXPR, if
>>>>>>> asked for the range of op2 () would return ~[0,0] for _5.
>>>>>>> But you are also correct in that the walk backwards would not 
>>>>>>> find this.
>>>>>>>
>>>>>>> This is similar functionality to how null_derefs are currently 
>>>>>>> handled,
>>>>>>> and in fact could probably be done simultaneously using the same 
>>>>>>> code
>>>>>>> base.   I didn't bring null derefs up, but this is a good time :-)
>>>>>>>
>>>>>>> There is a separate class used by the gori-cache which tracks the
>>>>>>> non-nullness property at the block level.    It has a single API:
>>>>>>> non_null_deref_p (name, bb)    which determines whether there is a
>>>>>>> dereference in any BB for NAME, which indicates whether the 
>>>>>>> range has an
>>>>>>> implicit ~[0,0] range in that basic block or not.
>>>>>> So when we then have
>>>>>>
>>>>>>    _1 = *_2; // after this _2 is non-NULL
>>>>>>    _3 = _1 + 1; // _3 is non-NULL
>>>>>>    _4 = *_3;
>>>>>> ...
>>>>>>
>>>>>> when an on-demand user asks whether _3 is non-NULL at the
>>>>>> point of _4 = *_3 we don't have this information?  Since the
>>>>>> per-BB caching will only say _1 is non-NULL after the BB.
>>>>>> I'm also not sure whether _3 ever gets non-NULL during
>>>>>> non-NULL processing of the block since walking immediate uses
>>>>>> doesn't really help here?
>>>>> presumably _3 is globally non-null due to the definition being 
>>>>> (pointer
>>>>> + x)  ... ie, _3 has a global range of ~[0,0] ?
>>>> No, _3 is ~[0, 0] because it is derived from _1 which is ~[0, 0] and
>>>> you cannot arrive at NULL by pointer arithmetic from a non-NULL 
>>>> pointer.
>>>
>>> I'm confused.
>>>
>>> _1 was loaded from _2 (thus asserting _2 is non-NULL).  but we have 
>>> no idea what the range of _1 is, so how do you assert _1 is ~[0,0] ?
>>> The only way I see to determine _3 is non-NULL  is through the _4 = 
>>> *_3 statement.
>>
>> In the first two statements from the above (where _1 is a pointer):
>>
>>  _1 = *_2;
>>  _3 = _1 + 1;
>>
>> _1 must be non-null because C/C++ define pointer addition only for
>> non-null pointers, and therefore so must _3.
>
> (int*)0+0 is well-defined, so this uses the fact that 1 is non-null. 
> This is all well done in extract_range_from_binary_expr already, 
> although it seems to miss the (dangerous) optimization NULL + unknown 
> == NULL.
>
> Just in case, a quote:
>
> "When an expression J that has integral type is added to or subtracted
> from an expression P of pointer type, the result has the type of P.
> (4.1) — If P evaluates to a null pointer value and J evaluates to 0, the
> result is a null pointer value.
> (4.2) — Otherwise, if P points to element x[i] of an array object x with
> n elements, 80 the expressions P + J and J + P (where J has the value j)
> point to the (possibly-hypothetical) element x[i + j] if 0 ≤ i + j ≤ n
> and the expression P - J points to the (possibly-hypothetical) element
> x[i − j] if 0 ≤ i − j ≤ n.
> (4.3) — Otherwise, the behavior is undefined"
>
>> Or does the middle-end allow arithmetic on null pointers?
>
> When people use -fno-delete-null-pointer-checks because their 
> (embedded) platform has important stuff at address 0, they also want 
> to be able to do arithmetic there.
>
Plus, with the new range-ops code we want to be able to solve for any
operand given ranges for the other two, not just for the LHS given op1
and op2.  So exact details are good :-)

Andrew


* Re: On-Demand range technology [2/5] - Major Components : How it works
  2019-06-04 19:55                     ` Andrew MacLeod
@ 2019-06-05  9:59                       ` Richard Biener
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Biener @ 2019-06-05  9:59 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

On Tue, Jun 4, 2019 at 9:55 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> On 6/4/19 1:07 PM, Richard Biener wrote:
> > On June 4, 2019 6:50:07 PM GMT+02:00, Andrew MacLeod <amacleod@redhat.com> wrote:
> >> On 6/4/19 11:37 AM, Richard Biener wrote:
> >>
> >>> where the single(!) query of the range of i on the alloca call
> >>> will populate N BBs caches with the Nth having ranges for
> >>> all SSA defs of i running into quadraticness in N.
> >> well, not really.   its still linear since the 'i' used in each foo()
> >> will be dead, so the next block will contain no range for it
> >>
> >>
> >>    try{
> >>        long i_2 = _1;
> >>        i_3 += foo(i_2);
> >>        i_4 += foo(i_3);
> >>        i_5 += foo(i_4);
> >>        i_6 += foo(i_5);
> >> ... repeat N times ...
> >>       alloca(i_6); // query range of i
> >>
> >> once i_2 is 'dead' after the first call to foo(), it will no longer be
> >> in the on-entry cache to any other block.
> >> The walk back never sees i_2 until the first call to foo(), so none of
> > It's actually
> >
> >     Utemp_2 = foo(i_1);
> >
> > Bb:
> >     i_3 = i_1 + (long) utemp_2;
> >     Utemp_4 = foo(i_3);
> >
> > Eliding the separate stmt for the conversion. From your description of the cache it sounded like getting incoming values is a take-all operation. Otherwise you'd need negative entries as well (I thought the cache might not contain varying ranges for example).
>
> ok,  Utemp2 and i_1 will be live, so there are 2 ranges on entry . the
> next block will have only utemp_4 and i_3 live on entry.  So it'll be 2
> ranges per block.
>
>
> It currently does have varying in the table, but the varying value is
> cached for the entire ssa_name in the cache. so 500 BB's with varying
> range will only consume a single range.
>
> By negative entries I assume you mean something like the bottom of a
> diamond?
>
> if (x_3 > 20)
>    blah1 (x_3)
> else
>    blah2 (x_3);
> blah3 (x_3);
>
> It does nothing special to come up with the range of x_3 at blah3 ().
> It simply unions all the incoming ranges which is [21,MAX] union
> [MIN,20]  to produce the value [MIN, MAX] which is then put into the
> cache.  We don't use anything special to represent varying.
> varying_p()  merely checks if the range is [MIN, MAX]. And when putting
> a value into the cache, if it is varying_p() the cache simply uses it's
> unique copy of [MIN, MAX] rather than create a new one every time it is
> needed.
>
>
> I had originally considered not caching ranges spanning blocks in which
> the range never changes...  the trade off is you have an increase in
> calculation time as you walk the CFG to find the last block in which it
> did change.  First I figured we'd first see if the less error prone full
> cache causes any real problems or not.   Its always an option to
> implement later since the cache is self contained and can resolve its
> queries internally however it wants.
>
> >
> >> blocks after that have a need for its range, so it will not be in their
> >> caches.
> >>
> >> The on-entry cache will only get a single range entry for each of those
> >>
> >> blocks.. the incoming value of i_x  and thats it. The implicitly known
> >> live-on-exit attribute is handy here.
> >>
> >> Now, that's not to say we cant construct cases where there is a lot of
> >> ranges being populated.. If we find the cache is really a problem, we
> >> can start caching ranges.. so that if all these i_x's were live
> >> somehow,
> >> there would only be one range for each in this case, and the cache's
> >> would simply all point to the same range.
> > I suppose using i in the catch block would create sth like this, a phi with a large in degree and thus the switch case you already know about?
>
> It would be a PHI. None of the ranges coming into the block are
> considered live on entry, so they dont get into the cache.  The PHI
> calculation asks for the incoming range on the edge for each parameter,
> and we get a global range for the PHI definition calculated, and that is
> it.

Thanks for all the clarifications.

> However, Thats not actually the switch problem :-)  the switch problem
> isn't due to a large degree of incoming edges, but rather the difficulty
> in calculating case ranges on the outgoing edges from the switch itself.
>
> stepping back slightly,  Branches are handled generically the same way
> any expression is, the outgoing edge you want information about provides
> a constant range for the implicit LHS of the branch.
>
> so for an if
>    c_2 = x_3 > 20
>    if (c_2 != 0)
>
> the TRUE edge  has a range of bool [1,1] , fed back as the implicit LHS
> of the branch expression
> [1,1] = (c_2 != 0)   which range-ops for '!=' says c_2 must be [1,1] in
> order to satisfy the equation.
> [1,1] = x_3 > 20  which range-ops for '>' says x_3 must have a range of
> [21, MAX]
>
> if you want the range on the FALSE edge, the edge starts with the range
> [0,0] into the equation solver, and out pops [MIN, 20].
>
> switch (i_3)
>
> for a switch, it works the same way, except the edge has an integral
> value, not a boolean one.  This range is fed back into the implicit
> branch expression:
> [edge range] = i_3    ,    so i_3  has whatever value the edge has since
> it amounts to a copy.  Calculating this edge constant value is the problem.
>
> BB2:
> switch (i_3) {
> BB3:
>    case 4:
>    case 5:
>    case 9:
>    case 10:
>       foo (i_3)
>
> In order to evaluate the case values on  the edge from BB2 to BB3 as
> [4,5][9,10],  we have to loop over all  the cases that have a label at
> the beginning of the block, and union them together. Calculating the
> default value is even worse, start with varying and intersect out each
> and every case.
>
> The gimple representation does not make evaluating this easy nor cheap
> since it merely has a list of case LOW/HIGHS with labels... so you have
> to loop thru that entire list for each edge being evaluated to see if
> the label is associated with this edge and then union together all those
> that match.   This is linear and cant be shortcut. multiple it by the
> number of edges since you probably want a range on each edge and its
> suddenly exponential.    The result is it can be *very* time consuming
> if the switch is very, very large.
>
> And it's a constant value that never changes, so it really could be
> cheap. We've just never needed to ask this exact question before. We've
> considered multiple ways to address this,  but we wont worry about any
> of them just yet.  All in time :-)

Hmm, shouldn't it be

 range edge_ranges[number_of_succs];
 for (unsigned i = 0; i < gimple_switch_num_labels (s); ++i)
   {
      tree casel = gimple_switch_label (s, i);
      basic_block target = label_to_block (CASE_LABEL (casel));
      edge e = find_edge (gimple_bb (s), target); // OK, this needs to be optimized, linear
      union_range (edge_ranges[e->src_index], casel_range); // similarly, we don't have src_index
   }

where the missing detail is precomputing the find_edge + src index things
in a hash_map <dest-block-index, src-index> map.  That's possible by
a simple linear walk of switch successor edges.
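Sketched with standard containers and made-up toy types (case_toy,
succ_dest_bbs and so on are illustrative, not GCC's internals):

  #include <unordered_map>
  #include <utility>
  #include <vector>

  struct case_toy { long low, high; int dest_bb; };
  typedef std::vector<std::pair<long, long> > range_union_toy;

  // One linear walk over the successor edges builds dest-bb ->
  // edge-index; one linear walk over the case list then accumulates
  // each edge's range: O(#edges + #cases) overall instead of
  // O(#edges * #cases).
  static std::vector<range_union_toy>
  edge_case_ranges (const std::vector<int> &succ_dest_bbs,
                    const std::vector<case_toy> &cases)
  {
    std::unordered_map<int, int> dest_to_edge;
    for (int i = 0; i < (int) succ_dest_bbs.size (); ++i)
      dest_to_edge[succ_dest_bbs[i]] = i;

    std::vector<range_union_toy> out (succ_dest_bbs.size ());
    for (size_t c = 0; c < cases.size (); ++c)
      out[dest_to_edge[cases[c].dest_bb]].push_back
        (std::make_pair (cases[c].low, cases[c].high));
    return out;
  }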

So computing the edge ranges is linear.  The default case needs special-casing
and its edge range computation delayed until we know the union of all
case ranges.

But yes, I agree the GIMPLE switch representation isn't exactly helpful...
But I didn't figure out a good improvement yet.  One possibility is to
do something similar to PHIs and split the case labels into groups
associated with edges, indexed by the index of the edge in the
src->succs vector (there's "conveniently" CASE_CHAIN, which I think is
unused).  Doing this requires CFG hook adjustments similar to how we
handle PHIs, since we have to adjust the edge-index-based case group
vector.

Btw, there's the function-level {start,end}_recording_cases functionality
which gives you get_cases_for_edge which is overall still quadratic
since it uses find_edge as in my pseudo-code above.  Re-implementing
it to compute all in advance would solve this.

Richard.


>
>
>
>
> >
> >> so far we havent really run across anything that appears to trigger
> >> significant concerns.
> > Sure. It's still worth thinking about worst case behavior and how you'd deal with that given once ranger is actually used we will eventually run into these cases. When
> > A forward evrp walk is then faster than a single on demand query that would be concerning.
> >
> > Richard.
> Absolutely.  I've been concerned about worst case since the get go.
> Thats how its evolved to this design.  We're trying to make sure the
> worst case scenario is no worse than doing a normal walk.
>
> We can currently run the ranger version of RVRP across blocks in
> dominator order, reverse dominator order,  linearly BB1 -> BBN, and
> finally  BBN-> BB1   There are fluctuations in timings, but nothing of
> too much significance.  They all run in similar times and pass all the
> tests.
>
> I have thoughts on how the various pieces can fit together with EVRP,
> I'll stick that in another email.
>
> Andrew


end of thread

Thread overview: 21+ messages
2019-05-23  1:28 On-Demand range technology [2/5] - Major Components : How it works Andrew MacLeod
2019-05-23 12:55 ` Richard Biener
2019-05-24 15:50   ` Andrew MacLeod
2019-05-27 13:03     ` Richard Biener
2019-05-28 14:17       ` Andrew MacLeod
2019-05-29 11:16         ` Richard Biener
2019-05-31 15:40           ` Andrew MacLeod
2019-05-31 18:16             ` Jeff Law
2019-05-31 20:27               ` Andrew MacLeod
2019-05-31 22:00                 ` Jeff Law
2019-05-31 23:08                   ` Andrew MacLeod
2019-06-04 15:27             ` Richard Biener
2019-06-04 15:38               ` Richard Biener
2019-06-04 16:50                 ` Andrew MacLeod
2019-06-04 17:07                   ` Richard Biener
2019-06-04 19:55                     ` Andrew MacLeod
2019-06-05  9:59                       ` Richard Biener
2019-06-04 16:49               ` Andrew MacLeod
2019-06-04 20:04             ` Martin Sebor
2019-06-04 20:22               ` Marc Glisse
2019-06-04 21:40                 ` Andrew MacLeod
