public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Rematerialization and Live Range Splitting on Region Frequency
@ 2015-01-25  9:56 Ajit Kumar Agarwal
  2015-01-26 19:40 ` Vladimir Makarov
  0 siblings, 1 reply; 3+ messages in thread
From: Ajit Kumar Agarwal @ 2015-01-25  9:56 UTC (permalink / raw)
  To: vmakarov, law, gcc
  Cc: Vinod Kathail, Shail Aditya Gupta, Vidhumouli Hunsigida, Nagaraju Mekala

Hello All:

Looks like Live range splitting and rematerialization are connected to each other. If  the boundary of Live range
Splitting is in the high frequency of the region then the move connected to splitted live ranges are inside the
High frequency region which is the performance bottleneck for many benchmarks.

Live range splitting based on the frequency of the region should be considered. The Live range splitting in the 
High frequency region is beneficial if the splitted live range is assigned the color(hard registers) which is better
Spilling inside the high frequency region, although there will be move instruction or shuffle code which is still 
Better.  If one of the splitted live range does not have any use points and all the partner live ranges gets the
Hard register, then the move instruction due to splitting will be costly for the high frequency region. In such 
Case the  split point should be move up at the boundary of the transition from low frequency region to high
Frequency region, and the splitted live ranges still get hard registers.  This require increment check of 
colorabity which increases the compile time but beneficial with respect to run time. The above heuristic should 
be incorporated on top of the below  Live range splitting Algorithm. Live range splitting algorithm should  consider
the blocks in the decreasing order of frequency with the first block should be taken from the high frequency
region and incrementally updates till it become colorable. Thus split points should be at the edge of the transition 
from high frequency to low frequency region or from low frequency region to high frequency region. 

The above Live range splitting should be incorporated  for all the flavors of Graph Coloring.

Regarding the rematerialization the Chaitin's Rematerialization try to recalculate the expression at all the 
Use points of the Live ranges and Simpson based approach for Rematerialized try to move the arithmetic
Instruction lower to use points or the basic blocks considering the operands of Arithmetic instructions is 
Not touched along the blocks of the Live range.

Considering all the effects of rematerialization, The remat point or the recalculation should be done at the
split points instead of Chaitin's approach of remat at every use points and the Simpson approach of operands
not being touched upon and the movement of arithmetic instruction later at the use points.

The above approaches looks feasible to implement consider the high frequency region into consideration.

Thoughts Please ?

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Rematerialization and Live Range Splitting on Region Frequency
  2015-01-25  9:56 Rematerialization and Live Range Splitting on Region Frequency Ajit Kumar Agarwal
@ 2015-01-26 19:40 ` Vladimir Makarov
  2015-01-29  5:06   ` Ajit Kumar Agarwal
  0 siblings, 1 reply; 3+ messages in thread
From: Vladimir Makarov @ 2015-01-26 19:40 UTC (permalink / raw)
  To: Ajit Kumar Agarwal, law, gcc
  Cc: Vinod Kathail, Shail Aditya Gupta, Vidhumouli Hunsigida, Nagaraju Mekala


On 2015-01-25 4:55 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> Looks like Live range splitting and rematerialization are connected to each other. If  the boundary of Live range
> Splitting is in the high frequency of the region then the move connected to splitted live ranges are inside the
> High frequency region which is the performance bottleneck for many benchmarks.
>
> Live range splitting based on the frequency of the region should be considered. The Live range splitting in the
> High frequency region is beneficial if the splitted live range is assigned the color(hard registers) which is better
> Spilling inside the high frequency region, although there will be move instruction or shuffle code which is still
> Better.  If one of the splitted live range does not have any use points and all the partner live ranges gets the
> Hard register, then the move instruction due to splitting will be costly for the high frequency region. In such
> Case the  split point should be move up at the boundary of the transition from low frequency region to high
> Frequency region, and the splitted live ranges still get hard registers.  This require increment check of
> colorabity which increases the compile time but beneficial with respect to run time. The above heuristic should
> be incorporated on top of the below  Live range splitting Algorithm. Live range splitting algorithm should  consider
> the blocks in the decreasing order of frequency with the first block should be taken from the high frequency
> region and incrementally updates till it become colorable. Thus split points should be at the edge of the transition
> from high frequency to low frequency region or from low frequency region to high frequency region.
>
> The above Live range splitting should be incorporated  for all the flavors of Graph Coloring.
>
> Regarding the rematerialization the Chaitin's Rematerialization try to recalculate the expression at all the
> Use points of the Live ranges and Simpson based approach for Rematerialized try to move the arithmetic
> Instruction lower to use points or the basic blocks considering the operands of Arithmetic instructions is
> Not touched along the blocks of the Live range.
>
> Considering all the effects of rematerialization, The remat point or the recalculation should be done at the
> split points instead of Chaitin's approach of remat at every use points and the Simpson approach of operands
> not being touched upon and the movement of arithmetic instruction later at the use points.
>
> The above approaches looks feasible to implement consider the high frequency region into consideration.
>
> Thoughts Please ?
>
>
Ajit, nobody can tell you for sure what the final results of your 
proposal can be.  I usually try a lot of things and most of them are 
rejected because the results are not what I expected.

I personally implemented Simpson's register pressure decrease through 
rematerialization twice.  The first time was long ago (as I remember > 
10 years ago) and that time it worked not bad (as I remember it gave 1% 
for x86 - you can find the exact data in my GCC summit paper "Fighting 
register pressure in GCC").  It worked well because the old RA was quite 
bad (imho the problem of most research articles in compiler optimization 
field was/is in usage of some sub-par compiler where a new good 
optimization shines in environment of existing simple ones or because of 
absence of many other optimizations).

   Second time I reimplemented CFG-sensitive rematerialization (as a 
separate pass) last year and the results were worse than without it.  
Therefore I had to implement a rematerialization subpass in LRA which 
really improves the code.  That is because the current RA is pretty 
good.  Even if we have register pressure evaluation (which was absent in 
the old RA), I believe it is very inaccurate as IR uses a sophisticated 
coloring criteria which are actually based on dynamic intersected 
register classes (more accurately approximated by dynamic register class 
inclusion tree).  Also it is very hard to predict a lot of decisions in 
LRA too.

   All the above is also true for the live range splitting implemented 
as a separate pass.

   There is a good point in your email that rematerialization should 
work better when it is done not for each use but for some region of uses 
(and actually Simpon's approach implements it).  I guess if you can 
implement this idea in IRA framework and not as a separate pass, it 
might give some improvements.  The same probably would be true for 
splitting in IRA environment.  Actually IRA was designed to work for 
tree of any regions including BB.  Currently it is only loops and a lot 
of was done to minimize # of considered loops as a lot of people were 
not happy with RA speed on some tests.  The minimization is based on 
register pressure evaluation and as I wrote it is not accurate.  
Including all loops and BB (or may be SESE or other) could improve the 
code but make the compiler slower.

   On one hand, an engineering approach is to implement all this things 
as separate passes.  On the other hand, the best result you can achieve 
when you take into account more RA tasks into consideration.  It is hard 
to find a balance between the two approaches.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: Rematerialization and Live Range Splitting on Region Frequency
  2015-01-26 19:40 ` Vladimir Makarov
@ 2015-01-29  5:06   ` Ajit Kumar Agarwal
  0 siblings, 0 replies; 3+ messages in thread
From: Ajit Kumar Agarwal @ 2015-01-29  5:06 UTC (permalink / raw)
  To: Vladimir Makarov, law, gcc
  Cc: Vinod Kathail, Shail Aditya Gupta, Vidhumouli Hunsigida, Nagaraju Mekala

Thanks Vladimir for the inputs. It is quite helpful.

Thanks & Regards
Ajit

-----Original Message-----
From: Vladimir Makarov [mailto:vmakarov@redhat.com] 
Sent: Tuesday, January 27, 2015 1:10 AM
To: Ajit Kumar Agarwal; law@redhat.com; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Rematerialization and Live Range Splitting on Region Frequency


On 2015-01-25 4:55 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> Looks like Live range splitting and rematerialization are connected to 
> each other. If  the boundary of Live range Splitting is in the high 
> frequency of the region then the move connected to splitted live ranges are inside the High frequency region which is the performance bottleneck for many benchmarks.
>
> Live range splitting based on the frequency of the region should be 
> considered. The Live range splitting in the High frequency region is 
> beneficial if the splitted live range is assigned the color(hard 
> registers) which is better Spilling inside the high frequency region, 
> although there will be move instruction or shuffle code which is still 
> Better.  If one of the splitted live range does not have any use 
> points and all the partner live ranges gets the Hard register, then 
> the move instruction due to splitting will be costly for the high 
> frequency region. In such Case the  split point should be move up at 
> the boundary of the transition from low frequency region to high 
> Frequency region, and the splitted live ranges still get hard 
> registers.  This require increment check of colorabity which increases the compile time but beneficial with respect to run time. The above heuristic should be incorporated on top of the below  Live range splitting Algorithm. Live range splitting algorithm should  consider the blocks in the decreasing order of frequency with the first block should be taken from the high frequency region and incrementally updates till it become colorable. Thus split points should be at the edge of the transition from high frequency to low frequency region or from low frequency region to high frequency region.
>
> The above Live range splitting should be incorporated  for all the flavors of Graph Coloring.
>
> Regarding the rematerialization the Chaitin's Rematerialization try to 
> recalculate the expression at all the Use points of the Live ranges 
> and Simpson based approach for Rematerialized try to move the 
> arithmetic Instruction lower to use points or the basic blocks considering the operands of Arithmetic instructions is Not touched along the blocks of the Live range.
>
> Considering all the effects of rematerialization, The remat point or 
> the recalculation should be done at the split points instead of 
> Chaitin's approach of remat at every use points and the Simpson approach of operands not being touched upon and the movement of arithmetic instruction later at the use points.
>
> The above approaches looks feasible to implement consider the high frequency region into consideration.
>
> Thoughts Please ?
>
>
Ajit, nobody can tell you for sure what the final results of your proposal can be.  I usually try a lot of things and most of them are rejected because the results are not what I expected.

I personally implemented Simpson's register pressure decrease through rematerialization twice.  The first time was long ago (as I remember >
10 years ago) and that time it worked not bad (as I remember it gave 1% for x86 - you can find the exact data in my GCC summit paper "Fighting register pressure in GCC").  It worked well because the old RA was quite bad (imho the problem of most research articles in compiler optimization field was/is in usage of some sub-par compiler where a new good optimization shines in environment of existing simple ones or because of absence of many other optimizations).

   Second time I reimplemented CFG-sensitive rematerialization (as a separate pass) last year and the results were worse than without it.  
Therefore I had to implement a rematerialization subpass in LRA which really improves the code.  That is because the current RA is pretty good.  Even if we have register pressure evaluation (which was absent in the old RA), I believe it is very inaccurate as IR uses a sophisticated coloring criteria which are actually based on dynamic intersected register classes (more accurately approximated by dynamic register class inclusion tree).  Also it is very hard to predict a lot of decisions in LRA too.

   All the above is also true for the live range splitting implemented as a separate pass.

   There is a good point in your email that rematerialization should work better when it is done not for each use but for some region of uses (and actually Simpon's approach implements it).  I guess if you can implement this idea in IRA framework and not as a separate pass, it might give some improvements.  The same probably would be true for splitting in IRA environment.  Actually IRA was designed to work for tree of any regions including BB.  Currently it is only loops and a lot of was done to minimize # of considered loops as a lot of people were not happy with RA speed on some tests.  The minimization is based on register pressure evaluation and as I wrote it is not accurate.  
Including all loops and BB (or may be SESE or other) could improve the code but make the compiler slower.

   On one hand, an engineering approach is to implement all this things as separate passes.  On the other hand, the best result you can achieve when you take into account more RA tasks into consideration.  It is hard to find a balance between the two approaches.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-01-29  5:06 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-25  9:56 Rematerialization and Live Range Splitting on Region Frequency Ajit Kumar Agarwal
2015-01-26 19:40 ` Vladimir Makarov
2015-01-29  5:06   ` Ajit Kumar Agarwal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).