public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* On-Demand range technology [1/5] - Executive Summary
@ 2019-05-23  1:28 Andrew MacLeod
  2019-06-20  3:05 ` Kugan Vivekanandarajah
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew MacLeod @ 2019-05-23  1:28 UTC (permalink / raw)
  To: GCC; +Cc: Jeff Law, Aldy Hernandez

Now that stage 1 has reopened, I’d like to reopen a discussion about the 
technology and experiences we have from the Ranger project I brought up 
last year. https://gcc.gnu.org/ml/gcc/2018-05/msg00288.html .  (The 
original wiki pages are now out of date, and I will work on updating 
them soon.)

The Ranger is designed to evaluate ranges on-demand rather than through 
a top-down approach. This means you can ask for a range from anywhere, 
and it walks back thru the IL satisfying any preconditions and doing the 
required calculations.  It utilizes a cache to avoid re-doing work. If 
ranges are processed in a forward dominator order, it’s not much 
different than what we do today. Due to its nature, the order you 
process things in has minimal impact on the overall time… You can do it 
in reverse dominator order and get similar times.

It requires no outside preconditions (such as dominators) to work, and 
has a very simple API… Simply query the range of an ssa_name at any 
point in the IL and all the details are taken care of.

We have spent much of the past 6 months refining the prototype (branch 
“ssa-range”) and adjusting it to share as much code with VRP as 
possible. They are currently using a common code base for extracting 
ranges from statements, as well as simplifying statements.

The Ranger deals with just  ranges. The other aspects of  VRP are 
intended to be follow on work that integrates tightly with it, but are 
also independent and would be available for other passes to use.  These 
include:
- Equivalency tracking
- Relational processing
- Bitmask tracking

We have implemented a VRP pass that duplicates the functionality of EVRP 
(other than the bits mentioned above), as well as converted a few other 
passes to use the interface.. I do not anticipate those missing bits 
having a significant impact on the results.

The prototype branch it quite stable and can successfully build and test 
an entire Fedora distribution (9174 packages). There is an issue with 
switches I will discuss later whereby the constant range of a switch 
edge is not readily available and is exponentially expensive to 
calculate. We have a design to address that problem, and in the common 
case we are about 20% faster than EVRP is.

When utilized in passes which only require ranges for a small number of 
ssa-names we see significant improvements.  The sprintf warning pass for 
instance allows us to remove the calculations of dominators and the 
resulting forced walk order. We see a 95% speedup (yes, 1/20th of the 
overall time!).  This is primarily due to no additional overhead and 
only calculating the few things that are actually needed.  The walloca 
and wrestrict passes are a similar model, but as they have not been 
converted to use EVRP ranges yet, we don’t see similar speedups there.

That is the executive summary.  I will go into more details of each 
major thing mentioned in follow on notes so that comments and 
discussions can focus on one thing at a time.

We think this approach is very solid and has many significant benefits 
to GCC. We’d like to address any concerns you may have, and work towards 
finding a way to integrate this model with the code base during this 
stage 1.

Comments and feedback always welcome!
Thanks
Andrew

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: On-Demand range technology [1/5] - Executive Summary
  2019-05-23  1:28 On-Demand range technology [1/5] - Executive Summary Andrew MacLeod
@ 2019-06-20  3:05 ` Kugan Vivekanandarajah
  2019-06-21 13:28   ` Andrew MacLeod
  0 siblings, 1 reply; 4+ messages in thread
From: Kugan Vivekanandarajah @ 2019-06-20  3:05 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

Hi Andrew,

Thanks for working on this.

Enable elimination of zext/sext with VRP patch had to be reverted in
(https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) due to the
need for value ranges in PROMOTED_MODE precision for at least 1 test
case for alpha.

Playing with ranger suggest that it is not possible to get value
ranges in PROMOTED_MODE precision on demand. Or is there any way we
can use on-demand ranger here?

Thanks,
Kugan

On Thu, 23 May 2019 at 11:28, Andrew MacLeod <amacleod@redhat.com> wrote:
>
> Now that stage 1 has reopened, I’d like to reopen a discussion about the
> technology and experiences we have from the Ranger project I brought up
> last year. https://gcc.gnu.org/ml/gcc/2018-05/msg00288.html .  (The
> original wiki pages are now out of date, and I will work on updating
> them soon.)
>
> The Ranger is designed to evaluate ranges on-demand rather than through
> a top-down approach. This means you can ask for a range from anywhere,
> and it walks back thru the IL satisfying any preconditions and doing the
> required calculations.  It utilizes a cache to avoid re-doing work. If
> ranges are processed in a forward dominator order, it’s not much
> different than what we do today. Due to its nature, the order you
> process things in has minimal impact on the overall time… You can do it
> in reverse dominator order and get similar times.
>
> It requires no outside preconditions (such as dominators) to work, and
> has a very simple API… Simply query the range of an ssa_name at any
> point in the IL and all the details are taken care of.
>
> We have spent much of the past 6 months refining the prototype (branch
> “ssa-range”) and adjusting it to share as much code with VRP as
> possible. They are currently using a common code base for extracting
> ranges from statements, as well as simplifying statements.
>
> The Ranger deals with just  ranges. The other aspects of  VRP are
> intended to be follow on work that integrates tightly with it, but are
> also independent and would be available for other passes to use.  These
> include:
> - Equivalency tracking
> - Relational processing
> - Bitmask tracking
>
> We have implemented a VRP pass that duplicates the functionality of EVRP
> (other than the bits mentioned above), as well as converted a few other
> passes to use the interface.. I do not anticipate those missing bits
> having a significant impact on the results.
>
> The prototype branch it quite stable and can successfully build and test
> an entire Fedora distribution (9174 packages). There is an issue with
> switches I will discuss later whereby the constant range of a switch
> edge is not readily available and is exponentially expensive to
> calculate. We have a design to address that problem, and in the common
> case we are about 20% faster than EVRP is.
>
> When utilized in passes which only require ranges for a small number of
> ssa-names we see significant improvements.  The sprintf warning pass for
> instance allows us to remove the calculations of dominators and the
> resulting forced walk order. We see a 95% speedup (yes, 1/20th of the
> overall time!).  This is primarily due to no additional overhead and
> only calculating the few things that are actually needed.  The walloca
> and wrestrict passes are a similar model, but as they have not been
> converted to use EVRP ranges yet, we don’t see similar speedups there.
>
> That is the executive summary.  I will go into more details of each
> major thing mentioned in follow on notes so that comments and
> discussions can focus on one thing at a time.
>
> We think this approach is very solid and has many significant benefits
> to GCC. We’d like to address any concerns you may have, and work towards
> finding a way to integrate this model with the code base during this
> stage 1.
>
> Comments and feedback always welcome!
> Thanks
> Andrew

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: On-Demand range technology [1/5] - Executive Summary
  2019-06-20  3:05 ` Kugan Vivekanandarajah
@ 2019-06-21 13:28   ` Andrew MacLeod
  2019-06-22  1:48     ` Kugan Vivekanandarajah
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew MacLeod @ 2019-06-21 13:28 UTC (permalink / raw)
  To: Kugan Vivekanandarajah; +Cc: GCC, Jeff Law, Aldy Hernandez

On 6/19/19 11:04 PM, Kugan Vivekanandarajah wrote:
> Hi Andrew,
>
> Thanks for working on this.
>
> Enable elimination of zext/sext with VRP patch had to be reverted in
> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) due to the
> need for value ranges in PROMOTED_MODE precision for at least 1 test
> case for alpha.
>
> Playing with ranger suggest that it is not possible to get value
> ranges in PROMOTED_MODE precision on demand. Or is there any way we
> can use on-demand ranger here?
>
> Thanks,
> Kugan
>
>

I took a look at the thread, but I think I'm still missing some context.

Lets go back to the beginning.  What is an example of the case we arent 
getting that you want to get?

I'm going to guess to start :-)

short foo(unsigned char c)
  {
    c = c & (unsigned char)0x0F;
    if( c > 7 )
      return((short)(c - 5));
    else
      return(( short )c);
  }



A run of this thru the ranger shows me code that looks like (on x86 anyway):

  =========== BB 2 ============
c_4(D)  [0, 255] unsigned char
     <bb 2> :
     c_5 = c_4(D) & 15;
     _9 = c_4(D) & 8;
     if (_9 != 0)
       goto <bb 3>; [INV]
     else
       goto <bb 4>; [INV]

c_5 : [0, 15] unsigned char
_9 : [0, 0][8, 8] unsigned char
2->3  (T)*c_5 :         [0, 15] unsigned char
2->3  (T) _9 :  [8, 8] unsigned char
2->4  (F)*c_5 :         [0, 15] unsigned char
2->4  (F) _9 :  [0, 0] unsigned char

=========== BB 3 ============
c_5     [0, 15] unsigned char
     <bb 3> :
     _1 = (unsigned short) c_5;
     _2 = _1 + 65531;
     _7 = (short int) _2;
     // predicted unlikely by early return (on trees) predictor.
     goto <bb 5>; [INV]

_1 : [0, 15] unsigned short
_2 : [0, 10][65531, 65535] unsigned short
_7 : [-5, 10] short int

=========== BB 4 ============
c_5     [0, 15] unsigned char
     <bb 4> :
     _6 = (short int) c_5;
     // predicted unlikely by early return (on trees) predictor.


I think I see.  we aren't adjusting the range of c_5 going into blocks 3 
and 4.  its obvious from the original source where the code says < 7, 
but once its been "bitmasked" that info becomes obfuscated.
.
If you were to see a range in bb3 of c_5 = [8,15], and a range in bb4 of 
c_4 = [0,7],  would that solve your problem?

so in bb3 , we'd then see ranges that look like:

_1 : [8, 15] unsigned short
_2 : [3, 10] unsigned short
_7 : [3, 10] short int

and then later on we'd see there is no negative/wrap value, and you 
could could just drop the extension then?

SO.

yes. this is fixable,  but is it easy? :-)

We're in the process of replacing the range extraction code with the 
range-ops/gori-computes  components from the ranger.  This is the part 
which figures ranges out from individual statements on exit to a block..

We have implemented mostly the same functionality of VRP to start, and 
one of the things we have not done is enhanced the bitmask tracking side 
of it.
The gori-computes component which performs backwards calculations does 
virtually nothing with masks right now.

lets look at this hunk

     c_5 = c_4(D) & 15;
     _9 = c_4(D) & 8;
     if (_9 != 0)

on the true side of this branch, we start with _9 having a range of 
[0,0][8,8]

[0,0][8,8] = c_4(D) & 8

unfortunately we dont know much about c_4, its effectively VARYING at 
this point.   We do know c_5 has a range of [0,15].  if that statement were
     _9 = c_5 & 8
  this problem would be quite trivial  A quick tweak to the windback 
code for bitwise_and would generate the correct results  because we';d see
[1,1] = [0,15] & 8   and we'd come up with [8, 15]  for c_5 no problem.

but its not.

I do have some  "re-evalaution" code in the ranger which is only in the 
proof of concept stages which might solve this problem down the road.

what the re-evlaution code does is look at the any changes to ranges 
which are in the definition chain of the ssa-name you are looking at.
going back to the example again:
     c_5 = c_4(D) & 15;
     _9 = c_4(D) & 8;
     if (_9 != 0)
         _1 = (unsigned short) c_5;      <<--- here.

When we ask for the range of c_5 here, we calculate the range normally 
and come up with [0,15].
The re-evalution code then takes a look and notices that c_5 is also 
dependent on the value of c_4 , and queries if the range of c_4 changed 
at this point.
Well, we can know the value of c_4 must have the 0x8 bit set in this 
block.  This can be fed into a  "re-evaluation" of c_5, and also then 
know that c_5 must also have the 0x8 bit set.
THis would then reduce the calculated range from the original [0,15] to 
a range of [8, 15]

So the mechanisms within the new code are sort of there, but they have 
not been flushed out.  THis will require
  a)  the re-evaluation code to be completed.  Im not sure if it can be 
applied within EVRP (It seems it should be possible)  or whether it 
requires the ranger.
  b) some level of integration of bitfields with the range info so we 
can also track bits set and cleared along with the range, and integrate 
them with the calculations.

SO assuming I have been working on the correct problem here, No. we dont 
get this now.

Once we get the range-ops and gori-computes code integrated and working 
with EVRP and the bitfield stuff, we can again take a look and see where 
we stand.  I will keep it in mind as a practical example of how we might 
want to combine the information.

So revisit this in September maybe?

Or have i completely missed the mark of the problem you are trying to 
solve? :-)

Andrew

PS

Your original question is


    Playing with ranger suggest that it is not possible to get value
    ranges in PROMOTED_MODE precision on demand. Or is there any way we
    can use on-demand ranger here?


I'm not sure i follow completely why, but in theory it would be possible 
to add code to the ranger to walk a definition list calculating ranges 
in whatever precision you want

=========== BB 2 ============
c_4(D)  [0, 255] unsigned char
     <bb 2> :
     c_5 = c_4(D) & 15;
     _9 = c_4(D) & 8;
     if (_9 != 0)
       goto <bb 3>; [INV]
     else
       goto <bb 4>; [INV]


=========== BB 3 ============
c_5     [0, 15] unsigned char
     <bb 3> :
     _1 = (unsigned short) c_5;
     _2 = _1 + 65531;
     _7 = (short int) _2;

Im not sure how that would help?   the root of the problem would appear 
to be the range of c_5 is not properly calculated to be [8,15].  no 
matter what precision you use, the [0,4] + 65531 part of the range is 
still going to produce an undesireable set upper bit, and still require 
the zext/sext?


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: On-Demand range technology [1/5] - Executive Summary
  2019-06-21 13:28   ` Andrew MacLeod
@ 2019-06-22  1:48     ` Kugan Vivekanandarajah
  0 siblings, 0 replies; 4+ messages in thread
From: Kugan Vivekanandarajah @ 2019-06-22  1:48 UTC (permalink / raw)
  To: Andrew MacLeod; +Cc: GCC, Jeff Law, Aldy Hernandez

Hi Andrew,

Thanks for looking into it and my apologies for not being clear.

My proposal was to use value ranges when expanding gimple to RTL and
eliminate redundant zero/sign extensions. I.e., if we know the value
generated by some gimple operation is already in the (zero/sign)
extended from based on our VR analysis, we could skip the SUBREG and
ZERO/SIGN_EXTEND (or set SRP_SIGNED_AND_UNSIGNED and likes).

However, the problem is, RTL operations are done in PROMOTE_MODE
precision while gimple value range is in natural types. This can be a
problem when type wraps (and shows up mainly for targets where
PROMOTE_MODE is DImode like alpha).

For example, as Uros pointed out with the reverted patch, for alpha-linux we had

FAIL: libgomp.fortran/simd7.f90   -O2  execution test
FAIL: libgomp.fortran/simd7.f90   -Os  execution test

The reason being that values wrap and in VR calculation we only
records the type precision (which is what matters for gimple) but in
order to eliminate the zero/sign extension we need the full precision
in the PROMOTE_MODE.

Extract from the testcase failing:
_343 = ivtmp.179_52 + 2147483645; [0x80000004, 0x800000043]
_344 = _343 * 2; [0x8, 0x86]
_345 = (integer(kind=4)) _344; [0x8, 0x86]

With the above VR of [0x8, 0x86] (in promoted precision which is
[0x100000008, 0x100000086]), my patch was  setting
SRP_SIGNED_AND_UNSIGNED which was wrong and causing the error
(eliminating and extension which was not redundant). If we had the VR
in promoted precision, the patch would be correct and used to
eliminate redundant zero/sign extensions.

Please let me know if my explanation is not clear and I will show it
with more examples.

Thanks,
Kugan


On Fri, 21 Jun 2019 at 23:27, Andrew MacLeod <amacleod@redhat.com> wrote:
>
> On 6/19/19 11:04 PM, Kugan Vivekanandarajah wrote:
>
> Hi Andrew,
>
> Thanks for working on this.
>
> Enable elimination of zext/sext with VRP patch had to be reverted in
> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) due to the
> need for value ranges in PROMOTED_MODE precision for at least 1 test
> case for alpha.
>
> Playing with ranger suggest that it is not possible to get value
> ranges in PROMOTED_MODE precision on demand. Or is there any way we
> can use on-demand ranger here?
>
> Thanks,
> Kugan
>
>
>
> I took a look at the thread, but I think I'm still missing some context.
>
> Lets go back to the beginning.  What is an example of the case we arent getting that you want to get?
>
> I'm going to guess to start :-)
>
> short foo(unsigned char c)
>  {
>    c = c & (unsigned char)0x0F;
>    if( c > 7 )
>      return((short)(c - 5));
>    else
>      return(( short )c);
>  }
>
>
>
> A run of this thru the ranger shows me code that looks like (on x86 anyway):
>
>  =========== BB 2 ============
> c_4(D)  [0, 255] unsigned char
>     <bb 2> :
>     c_5 = c_4(D) & 15;
>     _9 = c_4(D) & 8;
>     if (_9 != 0)
>       goto <bb 3>; [INV]
>     else
>       goto <bb 4>; [INV]
>
> c_5 : [0, 15] unsigned char
> _9 : [0, 0][8, 8] unsigned char
> 2->3  (T)*c_5 :         [0, 15] unsigned char
> 2->3  (T) _9 :  [8, 8] unsigned char
> 2->4  (F)*c_5 :         [0, 15] unsigned char
> 2->4  (F) _9 :  [0, 0] unsigned char
>
> =========== BB 3 ============
> c_5     [0, 15] unsigned char
>     <bb 3> :
>     _1 = (unsigned short) c_5;
>     _2 = _1 + 65531;
>     _7 = (short int) _2;
>     // predicted unlikely by early return (on trees) predictor.
>     goto <bb 5>; [INV]
>
> _1 : [0, 15] unsigned short
> _2 : [0, 10][65531, 65535] unsigned short
> _7 : [-5, 10] short int
>
> =========== BB 4 ============
> c_5     [0, 15] unsigned char
>     <bb 4> :
>     _6 = (short int) c_5;
>     // predicted unlikely by early return (on trees) predictor.
>
>
> I think I see.  we aren't adjusting the range of c_5 going into blocks 3 and 4.  its obvious from the original source where the code says < 7, but once its been "bitmasked" that info becomes obfuscated.
> .
> If you were to see a range in bb3 of c_5 = [8,15], and a range in bb4 of c_4 = [0,7],  would that solve your problem?
>
> so in bb3 , we'd then see ranges that look like:
>
> _1 : [8, 15] unsigned short
> _2 : [3, 10] unsigned short
> _7 : [3, 10] short int
>
> and then later on we'd see there is no negative/wrap value, and you could could just drop the extension then?
>
> SO.
>
> yes. this is fixable,  but is it easy? :-)
>
> We're in the process of replacing the range extraction code with the range-ops/gori-computes  components from the ranger.  This is the part which figures ranges out from individual statements on exit to a block..
>
> We have implemented mostly the same functionality of VRP to start, and one of the things we have not done is enhanced the bitmask tracking side of it.
> The gori-computes component which performs backwards calculations does virtually nothing with masks right now.
>
> lets look at this hunk
>
>     c_5 = c_4(D) & 15;
>     _9 = c_4(D) & 8;
>     if (_9 != 0)
>
> on the true side of this branch, we start with _9 having a range of [0,0][8,8]
>
> [0,0][8,8] = c_4(D) & 8
>
> unfortunately we dont know much about c_4, its effectively VARYING at this point.   We do know c_5 has a range of [0,15].  if that statement were
>     _9 = c_5 & 8
>  this problem would be quite trivial  A quick tweak to the windback code for bitwise_and would generate the correct results  because we';d see
> [1,1] = [0,15] & 8   and we'd come up with [8, 15]  for c_5 no problem.
>
> but its not.
>
> I do have some  "re-evalaution" code in the ranger which is only in the proof of concept stages which might solve this problem down the road.
>
> what the re-evlaution code does is look at the any changes to ranges which are in the definition chain of the ssa-name you are looking at.
> going back to the example again:
>     c_5 = c_4(D) & 15;
>     _9 = c_4(D) & 8;
>     if (_9 != 0)
>         _1 = (unsigned short) c_5;      <<--- here.
>
> When we ask for the range of c_5 here, we calculate the range normally and come up with [0,15].
> The re-evalution code then takes a look and notices that c_5 is also dependent on the value of c_4 , and queries if the range of c_4 changed at this point.
> Well, we can know the value of c_4 must have the 0x8 bit set in this block.  This can be fed into a  "re-evaluation" of c_5, and also then know that c_5 must also have the 0x8 bit set.
> THis would then reduce the calculated range from the original [0,15] to a range of [8, 15]
>
> So the mechanisms within the new code are sort of there, but they have not been flushed out.  THis will require
>  a)  the re-evaluation code to be completed.  Im not sure if it can be applied within EVRP (It seems it should be possible)  or whether it requires the ranger.
>  b) some level of integration of bitfields with the range info so we can also track bits set and cleared along with the range, and integrate them with the calculations.
>
> SO assuming I have been working on the correct problem here, No. we dont get this now.
>
> Once we get the range-ops and gori-computes code integrated and working with EVRP and the bitfield stuff, we can again take a look and see where we stand.  I will keep it in mind as a practical example of how we might want to combine the information.
>
> So revisit this in September maybe?
>
> Or have i completely missed the mark of the problem you are trying to solve? :-)
>
> Andrew
>
> PS
>
> Your original question is
>
>
> Playing with ranger suggest that it is not possible to get value
> ranges in PROMOTED_MODE precision on demand. Or is there any way we
> can use on-demand ranger here?
>
>
> I'm not sure i follow completely why, but in theory it would be possible to add code to the ranger to walk a definition list calculating ranges in whatever precision you want
>
> =========== BB 2 ============
> c_4(D)  [0, 255] unsigned char
>     <bb 2> :
>     c_5 = c_4(D) & 15;
>     _9 = c_4(D) & 8;
>     if (_9 != 0)
>       goto <bb 3>; [INV]
>     else
>       goto <bb 4>; [INV]
>
>
> =========== BB 3 ============
> c_5     [0, 15] unsigned char
>     <bb 3> :
>     _1 = (unsigned short) c_5;
>     _2 = _1 + 65531;
>     _7 = (short int) _2;
>
> Im not sure how that would help?   the root of the problem would appear to be the range of c_5 is not properly calculated to be [8,15].  no matter what precision you use, the [0,4] + 65531 part of the range is still going to produce an undesireable set upper bit, and still require the zext/sext?
>
>

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-06-22  1:48 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-23  1:28 On-Demand range technology [1/5] - Executive Summary Andrew MacLeod
2019-06-20  3:05 ` Kugan Vivekanandarajah
2019-06-21 13:28   ` Andrew MacLeod
2019-06-22  1:48     ` Kugan Vivekanandarajah

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).