public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Register Allocation Graph Coloring algorithm and Others
@ 2017-12-15  3:19 Leslie Zhai
  2017-12-15  4:40 ` Vladimir Makarov
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Leslie Zhai @ 2017-12-15  3:19 UTC (permalink / raw)
  To: vmakarov, LewisR9, dag, stoklund
  Cc: GCC Development, LLVM Developers Mailing List

Hi GCC and LLVM developers,

I am learning Register Allocation algorithms, and my understanding so far is:

* Unlimited VirtReg (pseudo) -> limited or fixed or alias[1] PhysReg (hard)

* Memory (20 - 100 cycles) is more expensive than a Register (1 cycle), but 
the allocator has to emit spill code when no PhysReg is available

* Folding spill code into instructions, handling register coalescing, 
splitting live ranges, doing rematerialization, and doing shrink wrapping 
are harder than RegAlloc itself

* LRA and IRA are the default RA passes in GCC:

$ /opt/gcc-git/bin/gcc hello.c
DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
DEBUG: ../../gcc/ira-build.c, ira_build, line 3409

* Greedy is the default pass for LLVM

But I have some questions, please give me some hint, thanks a lot!

* IRA is a regional register allocator performing graph coloring on a 
top-down traversal of nested regions. Is it global, compared with the local LRA?

* Do the papers by Briggs and Chaitin contradict[2] themselves when one 
examines the text of the paper vs. the pseudocode provided?

* Why is the interference graph expensive to build[3]?

And I am practicing[4] using HEA, developed by Dr. Rhydian Lewis, in 
LLVM first.


[1] https://reviews.llvm.org/D39712

[2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html

[3] https://github.com/joaotavio/llvm-register-allocator

[4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol

-- 
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-15  3:19 Register Allocation Graph Coloring algorithm and Others Leslie Zhai
@ 2017-12-15  4:40 ` Vladimir Makarov
  2017-12-15  4:59   ` Leslie Zhai
  2017-12-15 14:48 ` Peter Bergner
  2017-12-18 16:42 ` dag
  2 siblings, 1 reply; 15+ messages in thread
From: Vladimir Makarov @ 2017-12-15  4:40 UTC (permalink / raw)
  To: Leslie Zhai, LewisR9, dag, stoklund
  Cc: GCC Development, LLVM Developers Mailing List



On 12/14/2017 10:18 PM, Leslie Zhai wrote:
> Hi GCC and LLVM developers,
>
> I am learning Register Allocation algorithms, and my understanding so far is:
>
> * Unlimited VirtReg (pseudo) -> limited or fixed or alias[1] PhysReg 
> (hard)
>
> * Memory (20 - 100 cycles) is more expensive than a Register (1 cycle), but 
> it has to spill code when PhysReg is unavailable
>
It might be much less if the memory value is in the L1 cache.
> * Folding spill code into instructions, handling register coalescing, 
> splitting live ranges, doing rematerialization, and doing shrink wrapping 
> are harder than RegAlloc itself
>
RegAlloc in a wide sense includes all these tasks and more.  For some 
architectures, other tasks, like the right live range splitting, might be 
even more important for generated code quality than just better graph 
coloring.
> * LRA and IRA are the default RA passes in GCC:
>
> $ /opt/gcc-git/bin/gcc hello.c
> DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
> DEBUG: ../../gcc/ira-build.c, ira_build, line 3409
>
> * Greedy is the default pass for LLVM
>
> But I have some questions, please give me some hint, thanks a lot!
>
> * IRA is a regional register allocator performing graph coloring on a 
> top-down traversal of nested regions. Is it global, compared with the 
> local LRA?
IRA is a global RA.  The description of its initial version can be found at

https://vmakarov.fedorapeople.org/vmakarov-submission-cgo2008.pdf

LRA in some ways is also a global RA, but it is a very simplified version of 
a global RA (e.g. LRA does not use a conflict graph, and its coloring 
algorithm is closer to priority coloring).  LRA does a lot of other very 
complicated things besides RA, for example instruction selection, which 
is quite specific to GCC's machine description.  Usually the code selection 
task is a separate pass in other compilers.  Generally speaking, LRA is 
more complicated, more machine dependent, and more buggy than IRA.  But 
fortunately LRA is less complicated than its predecessor, the so-called 
reload pass.
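For readers unfamiliar with priority coloring, here is a toy sketch of the idea (the live ranges, priorities, and register count are all hypothetical, and this is not GCC's actual code): live ranges are visited in priority order and each gets the first register not used by an overlapping range, with no explicit conflict graph ever built.

```python
# Toy priority-based allocation: visit live ranges in priority order and
# give each one the first register not already used by an overlapping range.
# Ranges, priorities, and register count below are made up for illustration.

def overlaps(a, b):
    """Two half-open intervals [start, end) interfere if they intersect."""
    return a[0] < b[1] and b[0] < a[1]

def priority_allocate(ranges, priorities, num_regs):
    """ranges: {name: (start, end)}; returns {name: reg index or 'SPILL'}."""
    assignment = {}
    # Higher priority (e.g. estimated spill cost) is handled first.
    for name in sorted(ranges, key=lambda n: -priorities[n]):
        taken = {assignment[other] for other in assignment
                 if overlaps(ranges[name], ranges[other])}
        free = [r for r in range(num_regs) if r not in taken]
        assignment[name] = free[0] if free else "SPILL"
    return assignment

ranges = {"a": (0, 10), "b": (2, 6), "c": (4, 12), "d": (11, 15)}
priorities = {"a": 4, "b": 3, "c": 2, "d": 1}
print(priority_allocate(ranges, priorities, num_regs=2))
```

With two registers, the low-priority range that overlaps both higher-priority ones ("c" here) is the one that ends up spilled.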

The IRA and LRA names have a long history and no longer correctly reflect 
the current situation.

It would be possible to incorporate the LRA tasks into IRA, but the final RA 
would be much slower, even more complicated, and harder to maintain, and the 
generated code would not be much better.  So to improve RA 
maintainability, RA is divided into two parts solving slightly different 
tasks.  This is a typical engineering approach.
>
> * Do the papers by Briggs and Chaitin contradict[2] themselves when 
> one examines the text of the paper vs. the pseudocode provided?
I have not examined Preston Briggs's work so thoroughly, so I cannot say 
whether that is true.  Even so, it is natural for there to be discrepancies 
between pseudocode and its description, especially in a description of that 
size.

To me, Preston Briggs is famous for his introduction of optimistic coloring.
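Since optimistic coloring comes up a lot in this thread, a toy sketch may help (the interference graph and k are hypothetical, and real allocators add spill costs, coalescing, etc.): Chaitin's original scheme spilled any node whose degree reached k during simplification, while Briggs pushes such nodes onto the stack anyway and only spills if no color is left at select time.

```python
# Toy Briggs-style optimistic coloring on an interference graph.

def optimistic_color(graph, k):
    """graph: {node: set(neighbors)}; returns ({node: color}, [spilled])."""
    degrees = {n: len(nbrs) for n, nbrs in graph.items()}
    stack, removed = [], set()
    # Simplify: prefer trivially colorable nodes (degree < k); when none
    # remain, push a high-degree node anyway instead of spilling immediately.
    while len(removed) < len(graph):
        candidates = [n for n in graph if n not in removed]
        low = [n for n in candidates if degrees[n] < k]
        pick = low[0] if low else candidates[0]  # the optimistic push
        stack.append(pick)
        removed.add(pick)
        for nbr in graph[pick]:
            degrees[nbr] -= 1
    # Select: pop and color; only now do actual spills get discovered.
    colors, spilled = {}, []
    while stack:
        n = stack.pop()
        used = {colors[nbr] for nbr in graph[n] if nbr in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            spilled.append(n)
    return colors, spilled

# A 4-cycle is 2-colorable even though every node has degree 2 (= k):
# Chaitin's degree test would spill here; optimistic coloring succeeds.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
colors, spilled = optimistic_color(cycle, k=2)
print(colors, spilled)
```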
>
> * Why is the interference graph expensive to build[3]?
>
That is because it might be an O(N^2) algorithm.  There are a lot of 
publications investigating the building of conflict graphs and their cost in RAs.
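To make the cost concrete, a toy sketch (the live intervals are hypothetical, nothing like GCC's real data structures): the naive construction tests every pair of live ranges for overlap, so N pseudo-registers cost N*(N-1)/2 tests.

```python
from itertools import combinations

# Naive interference-graph construction: one overlap test per pair of live
# intervals, i.e. N*(N-1)/2 tests for N pseudo-registers.  Intervals here
# are hypothetical half-open [start, end) instruction ranges.

def build_interference_graph(intervals):
    graph = {name: set() for name in intervals}
    tests = 0
    for (n1, i1), (n2, i2) in combinations(intervals.items(), 2):
        tests += 1
        if i1[0] < i2[1] and i2[0] < i1[1]:  # the intervals overlap
            graph[n1].add(n2)
            graph[n2].add(n1)
    return graph, tests

intervals = {"v%d" % i: (i, i + 3) for i in range(6)}
graph, tests = build_interference_graph(intervals)
print(tests)  # 6*5/2 = 15 pairwise tests
```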
> And I am practicing[4] using HEA, developed by Dr. Rhydian Lewis, in 
> LLVM first.
>
When I first started to work on RAs, very long ago, I used about the same 
approach: a lot of tiny transformations directed by a cost function and 
using metaheuristics (I also used tabu search, as HEA does).  Nothing good 
came out of it.

If you are interested in RA algorithms and architectures, I'd recommend 
Michael Matz's article

ftp://gcc.gnu.org/pub/gcc/summit/2003/Graph%20Coloring%20Register%20Allocation.pdf

as a starting point.
>
> [1] https://reviews.llvm.org/D39712
>
> [2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html
>
> [3] https://github.com/joaotavio/llvm-register-allocator
>
> [4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-15  4:40 ` Vladimir Makarov
@ 2017-12-15  4:59   ` Leslie Zhai
  0 siblings, 0 replies; 15+ messages in thread
From: Leslie Zhai @ 2017-12-15  4:59 UTC (permalink / raw)
  To: Vladimir Makarov, LewisR9, dag, stoklund
  Cc: GCC Development, LLVM Developers Mailing List

Hi Vladimir,

Thanks for your kind and very detailed response!


On 2017-12-15 12:40, Vladimir Makarov wrote:
>
>
> On 12/14/2017 10:18 PM, Leslie Zhai wrote:
>> Hi GCC and LLVM developers,
>>
>> I am learning Register Allocation algorithms, and my understanding so far is:
>>
>> * Unlimited VirtReg (pseudo) -> limited or fixed or alias[1] PhysReg 
>> (hard)
>>
>> * Memory (20 - 100 cycles) is more expensive than a Register (1 cycle), but 
>> it has to spill code when PhysReg is unavailable
>>
> It might be much less if the memory value is in the L1 cache.
>> * Folding spill code into instructions, handling register 
>> coalescing, splitting live ranges, doing rematerialization, and doing 
>> shrink wrapping are harder than RegAlloc itself
>>
> RegAlloc in a wide sense includes all these tasks and more.  For 
> some architectures, other tasks, like the right live range splitting, 
> might be even more important for generated code quality than just 
> better graph coloring.
>> * LRA and IRA are the default RA passes in GCC:
>>
>> $ /opt/gcc-git/bin/gcc hello.c
>> DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
>> DEBUG: ../../gcc/ira-build.c, ira_build, line 3409
>>
>> * Greedy is the default pass for LLVM
>>
>> But I have some questions, please give me some hint, thanks a lot!
>>
>> * IRA is a regional register allocator performing graph coloring on a 
>> top-down traversal of nested regions. Is it global, compared with the 
>> local LRA?
> IRA is a global RA.  The description of its initial version can be found
>
> https://vmakarov.fedorapeople.org/vmakarov-submission-cgo2008.pdf
I am reading this paper at present :)


>
>
> LRA in some ways is also a global RA, but it is a very simplified version 
> of a global RA (e.g. LRA does not use a conflict graph, and its coloring 
> algorithm is closer to priority coloring).  LRA does a lot of other 
> very complicated things besides RA, for example instruction selection, 
> which is quite specific to GCC's machine description.  Usually the code 
> selection task is a separate pass in other compilers.  Generally 
> speaking, LRA is more complicated, more machine dependent, and more buggy 
> than IRA.  But fortunately LRA is less complicated than its 
> predecessor, the so-called reload pass.
>
> The IRA and LRA names have a long history and no longer correctly 
> reflect the current situation.
>
> It would be possible to incorporate the LRA tasks into IRA, but the final 
> RA would be much slower, even more complicated, and harder to maintain, 
> and the generated code would not be much better.  So to improve RA 
> maintainability, RA is divided into two parts solving slightly different 
> tasks.  This is a typical engineering approach.
I am debugging with printf to become familiar with LRA and IRA.


>
>>
>> * Do the papers by Briggs and Chaitin contradict[2] themselves when 
>> one examines the text of the paper vs. the pseudocode provided?
> I have not examined Preston Briggs's work so thoroughly, so I cannot say 
> whether that is true.  Even so, it is natural for there to be discrepancies 
> between pseudocode and its description, especially in a description of that size.
>
> For me Preston Briggs is famous for his introduction of optimistic 
> coloring.
>>
>> * Why is the interference graph expensive to build[3]?
>>
> That is because it might be an O(N^2) algorithm.  There are a lot of 
> publications investigating the building of conflict graphs and their cost in RAs.
>> And I am practicing[4] using HEA, developed by Dr. Rhydian Lewis, 
>> in LLVM first.
>>
> When I first started to work on RAs, very long ago, I used about the same 
> approach: a lot of tiny transformations directed by a cost function 
> and using metaheuristics (I also used tabu search, as HEA does).  Nothing 
> good came out of it.
Thanks for your lesson! But are there benchmarks from when you used tabu 
search, HEA, AntCol, etc., such as 
https://pbs.twimg.com/media/DRD-kxcUMAAxZec.jpg ?


>
>
> If you are interested in RA algorithms and architectures, I'd 
> recommend Michael Matz's article
>
> ftp://gcc.gnu.org/pub/gcc/summit/2003/Graph%20Coloring%20Register%20Allocation.pdf 
>
>
> as a starting point.
Thanks! I am reading it.


>>
>> [1] https://reviews.llvm.org/D39712
>>
>> [2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html
>>
>> [3] https://github.com/joaotavio/llvm-register-allocator
>>
>> [4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol
>>
>

-- 
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-15  3:19 Register Allocation Graph Coloring algorithm and Others Leslie Zhai
  2017-12-15  4:40 ` Vladimir Makarov
@ 2017-12-15 14:48 ` Peter Bergner
  2017-12-18 16:42 ` dag
  2 siblings, 0 replies; 15+ messages in thread
From: Peter Bergner @ 2017-12-15 14:48 UTC (permalink / raw)
  To: Leslie Zhai, vmakarov, LewisR9
  Cc: dag, stoklund, GCC Development, LLVM Developers Mailing List

On 12/14/17 9:18 PM, Leslie Zhai wrote:
> * Do the papers by Briggs and Chaitin contradict[2] themselves when one
> examines the text of the paper vs. the pseudocode provided?

I've read both of these papers many times (in the past) and I don't recall
any contradictions in them.  Can you (Dave?) be more specific about what you
think are contradictions?

I do admit that pseudocode in papers can be very terse, to the point that
they don't show all the little details that are needed to actually implement
them, but they definitely shouldn't contradict their written description.
I was very grateful that Preston was more than willing to answer all my many
questions regarding his allocator and the many many details he couldn't
mention in his Ph.D. thesis, let alone a short paper.

Peter

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-15  3:19 Register Allocation Graph Coloring algorithm and Others Leslie Zhai
  2017-12-15  4:40 ` Vladimir Makarov
  2017-12-15 14:48 ` Peter Bergner
@ 2017-12-18 16:42 ` dag
  2017-12-18 17:52   ` Leslie Zhai
       [not found]   ` <9CBFCADA170A6854@apple.com>
  2 siblings, 2 replies; 15+ messages in thread
From: dag @ 2017-12-18 16:42 UTC (permalink / raw)
  To: Leslie Zhai, vmakarov, LewisR9, stoklund
  Cc: GCC Development, LLVM Developers Mailing List

Leslie Zhai <lesliezhai@llvm.org.cn> writes:

> * Memory (20 - 100 cycles) is more expensive than a Register (1 cycle), but
> it has to spill code when PhysReg is unavailable

As Vladimir said, the cache makes this kind of analysis much more
tricky.  It's not necessarily the case that memory=bad and
register=good.  Since there are tradeoffs involved, one needs to
evaluate different strategies to determine *when* memory is worse than a
register.  It may very well be the case that leaving something in memory
frees up a register for something much more important to use.  All of
the register allocation algorithms try to determine this kind of thing
through various heuristics.  Which heuristic is most effective is highly
target-dependent.

In my experience, changing heuristics can swing performance 20% or more
in some cases.  Today's processors and memory systems are so complex
that 2nd- and even 3rd-order effects become important.

It is very, very wrong on today's machines to use # of spills as a
metric to determine the "goodness" of an allocation.  Determining *what*
to spill is much more important than the raw number of spills.  Many
times I have seen code generated with more spills perform much better
than code generated with fewer spills.  Almost all of the papers
around the time of Chaitin-Briggs used # of spills as the metric.  That
may have been appropriate at that time but take those results with giant
grains of salt today.  Of course they are still very good papers for
understanding algorithms and heuristics.
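To make "what to spill" concrete, here is a toy version of the classic cost-over-degree ranking from the Chaitin line of work (all names, weights, and candidates below are hypothetical): a rarely used pseudo whose spill removes many interferences ranks as the best candidate, independent of how many total spills result.

```python
# Toy spill-candidate ranking: cost / degree, where cost weights each use
# by 10**loop_depth (a common textbook estimate) and degree is the number
# of interferences removing the candidate would eliminate.

def spill_metric(uses_by_depth, degree):
    """uses_by_depth: {loop_depth: use_count}; lower score = spill first."""
    cost = sum(count * 10 ** depth for depth, count in uses_by_depth.items())
    return cost / degree

candidates = {
    # name: ({loop_depth: use_count}, interference_degree)
    "hot_loop_counter": ({2: 4}, 3),   # used 4x inside a depth-2 loop
    "cold_temp":        ({0: 2}, 5),   # used 2x outside any loop
}
best = min(candidates, key=lambda n: spill_metric(*candidates[n]))
print(best)  # the cheap, high-degree pseudo is the better spill candidate
```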

The best way I know to determine what's best for your target is to run a
whole bunch of *real* codes on them, trying different allocation
algorithms and heuristics.  It is a lot of work, still worthy of a
Ph.D. even today.  Register allocation is not a "solved" problem and
probably never will be as architectures continue to change and become
ever more diverse.

> * Folding spill code into instructions, handling register coalescing,
> splitting live ranges, doing rematerialization, and doing shrink wrapping
> are harder than RegAlloc itself

Again, as Vladimir said, they are all part of register allocation.
Sometimes they are implemented as separate passes but logically they all
contribute work to the task of assigning registers as efficiently as
possible.  And all of them use heuristics.  Choosing when and when not
to, for example, coalesce can be important.  Splitting live ranges is
logically the opposite of coalescing.  Theoretically one can "undo" a
bad coalescing decision by re-splitting the live range but it's usually
not that simple as other decisions in the interim can make that tricky
if not impossible.  It is a delicate balancing act.
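As a concrete example of choosing when *not* to coalesce, here is a toy version of the Briggs conservative test (the interference graph is hypothetical): merge two copy-related nodes only if the merged node would have fewer than k neighbors of significant degree, which guarantees the merge cannot newly force a spill.

```python
# Toy Briggs conservative-coalescing test: combining copy-related nodes a
# and b is safe if their merged neighbor set has fewer than k neighbors of
# degree >= k, so the merged node can still be simplified.

def briggs_safe_to_coalesce(graph, a, b, k):
    merged_neighbors = (graph[a] | graph[b]) - {a, b}
    significant = sum(1 for n in merged_neighbors if len(graph[n]) >= k)
    return significant < k

graph = {
    "a": {"x", "y"},
    "b": {"y", "z"},
    "x": {"a"},
    "y": {"a", "b"},
    "z": {"b"},
}
print(briggs_safe_to_coalesce(graph, "a", "b", k=2))
```

With a denser graph, the same test refuses the merge, which is exactly the "when not to coalesce" decision.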

> * Do the papers by Briggs and Chaitin contradict[2] themselves when
> one examines the text of the paper vs. the pseudocode provided?

As others have said, I don't recall any inconsistencies but that doesn't
mean there aren't bugs and/or incomplete descriptions open to
interpretation.  One of my biggest issues with our computer academic
culture is that we do not value repeatability.  It is virtually
impossible to take a paper, implement the algorithm as described (or as
best as possible given ambiguity) and obtain the same results.  Heck,
half my Ph.D. dissertation was dissecting a published paper, examining
the sensitivity of the described algorithm to various parts of the
described heuristic that were ambiguous.  By interpreting the heuristic
description in different ways I observed very different results.  I read
papers for the algorithms, heuristics and ideas in them.  I pay no
attention to results because in the real world we have to implement the
ideas and test them ourselves to see if they will help us.

Peter is right to point you to Preston.  He is very accessible, friendly
and helpful.  I had the good fortune to work with him for a few years
and he taught me a lot.  He has much real-world experience on codes
actually used in production.  That experience is gold.

Good luck to you!  You're entering a world that some computing
professionals think is a waste of time because "we already know how to
do that."  Those of us in the trenches know better.  :)

                               -David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-18 16:42 ` dag
@ 2017-12-18 17:52   ` Leslie Zhai
  2017-12-18 22:35     ` dag
       [not found]   ` <9CBFCADA170A6854@apple.com>
  1 sibling, 1 reply; 15+ messages in thread
From: Leslie Zhai @ 2017-12-18 17:52 UTC (permalink / raw)
  To: dag, vmakarov, LewisR9, stoklund
  Cc: GCC Development, LLVM Developers Mailing List

Hi David,

Thanks for your teaching!

I am a newbie in the compiler area; I only learned compiler principles in 
2002: https://www.leetcode.cn/2017/12/ilove-compiler-principle.html

But I like to practice and learn :) 
https://github.com/xiangzhai/llvm/blob/avr/lib/CodeGen/RegAllocGraphColoring.cpp#L327 
because theory is not always correct, or misunderstood by people, so I 
want to compare solutionByHEA, IRA, Greedy, PBQP and other algorithms.

Thanks for your lessons correcting my mistakes, such as memory=bad 
register=good; I need to find the answer to *when* memory is worse than 
a register. I am maintaining the AVR target; there are 32 general registers, 
32K flash, and 2K SRAM 
http://www.atmel.com/Images/Atmel-42735-8-bit-AVR-Microcontroller-ATmega328-328P_Datasheet.pdf 
so perhaps on an MCU, memory might be more expensive than a register? But 
what about AMDGPU or VLIW processors? I don't have experience with them; 
please teach me.

I am reading LLVM's code SpillXXX, LiveRangeXXX, RegisterCoalescer, etc. 
to get the whole view of CodeGen.

I am reading Dr. Rhydian Lewis's book, A Guide to Graph Colouring: 
Algorithms and Applications 
http://www.springer.com/us/book/9783319257280   and other papers. Even 
if HEA is not the best solution, I still want to practice and see the 
benchmark. I am not a computing professional, and I am only 34 years old, 
so perhaps I have enough time to waste :)


On 2017-12-19 00:41, dag@cray.com wrote:
> [snip]

-- 
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-18 17:52   ` Leslie Zhai
@ 2017-12-18 22:35     ` dag
  0 siblings, 0 replies; 15+ messages in thread
From: dag @ 2017-12-18 22:35 UTC (permalink / raw)
  To: Leslie Zhai
  Cc: vmakarov, LewisR9, stoklund, GCC Development,
	LLVM Developers Mailing List

Leslie Zhai <lesliezhai@llvm.org.cn> writes:

> But I like to practice and learn :)
> https://github.com/xiangzhai/llvm/blob/avr/lib/CodeGen/RegAllocGraphColoring.cpp#L327
> because theory is not always correct, or misunderstood by people, so I
> want to compare solutionByHEA, IRA, Greedy, PBQP and other algorithms.

That is a very good way to learn.  Learn by doing and observing how
results change as parameters vary.  You will never stop learning.  :)

> Thanks for your lessons correcting my mistakes, such as memory=bad
> register=good, and I need to find the answer *when* memory is worse
> than a register; I am maintaining the AVR target, there are 32 general
> registers, 32K flash, 2K SRAM
> http://www.atmel.com/Images/Atmel-42735-8-bit-AVR-Microcontroller-ATmega328-328P_Datasheet.pdf
> so perhaps on an MCU, memory might be more expensive than a register? But what
> about AMDGPU or VLIW processors? I don't have experience with them;
> please teach me.

I do not have much experience with those architectures either.  As I
said, the "best" algorithm for register allocation is very
target-dependent.  What works well on AVR might work very poorly on a
GPU.  The only way to know is to test, test, test.  Of course one can
make some educated guesses to narrow the amount of testing.  Many times
a "good" allocator is "good enough" on many targets.  I work for a
company that tries to squeeze every last bit of performance out of
codes.  We're a bit fanatical that way so we try lots of things.  Most
places aren't that obsessive.  :)

> I am reading LLVM's code SpillXXX, LiveRangeXXX, RegisterCoalescer,
> etc. to get the whole view of CodeGen.

Those are great places to learn about register allocation!  They can
also be complicated and a bit daunting.  The folks on the LLVM list can
help guide you but you will also do well just making observations,
stepping through with a debugger, etc.  I certainly don't claim to
understand all of the nuances in this code.  Lots of people have
contributed to it over the years.

> I am reading Dr. Rhydian Lewis's book, A Guide to Graph Colouring:
> Algorithms and Applications
> http://www.springer.com/us/book/9783319257280   and other papers. Even
> if HEA is not the best solution, I still want to practice and see the
> benchmark. I am not a computing professional, and I am only 34 years old,
> so perhaps I have enough time to waste :)

I am not familiar with that book but lots of reading will do you well.
There's an endless supply of papers to look at.  And practice, practice,
practice.  You seem to be on the right track!

                             -David


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [llvm-dev] Register Allocation Graph Coloring algorithm and Others
       [not found]     ` <6708A731-203E-4096-8137-85D4104DA035@apple.com>
@ 2017-12-19  3:03       ` Leslie Zhai
  2017-12-20 17:25         ` Jakob Stoklund Olesen
  0 siblings, 1 reply; 15+ messages in thread
From: Leslie Zhai @ 2017-12-19  3:03 UTC (permalink / raw)
  To: Matthias Braun
  Cc: dag, vmakarov, LewisR9, stoklund, LLVM Developers Mailing List,
	GCC Development

Hi Matthias,

Thanks for your hint!

It is just for learning and practicing for me, just like migrating 
DragonEgg 
http://lists.llvm.org/pipermail/llvm-dev/2017-September/117201.html; the 
motivation is learning from GCC and LLVM developers.


On 2017-12-19 10:07, Matthias Braun wrote:
>
>
>> On Dec 18, 2017, at 9:52 AM, Leslie Zhai via llvm-dev 
>> <llvm-dev@lists.llvm.org <mailto:llvm-dev@lists.llvm.org>> wrote:
>>
>> Hi David,
>>
>> Thanks for your teaching!
>>
>> I am a newbie in compiler area, I only learned Compiler Principle in 
>> 2002: https://www.leetcode.cn/2017/12/ilove-compiler-principle.html
>>
>> But I like to practice and learn :) 
>> https://github.com/xiangzhai/llvm/blob/avr/lib/CodeGen/RegAllocGraphColoring.cpp#L327 
>> because theory is not always correct, or misunderstood by people, so I want 
>> to compare solutionByHEA, IRA, Greedy, PBQP and other algorithms.
>
> Just as another tip:
> - Indeed in my experience: Just implementing some algorithms yourself 
> and comparing them against what existing compilers produce and then 
> figuring out why is the best way to learn about allocators.
> - Don't just put emphasis on all the different coloring 
> techniques. In practice it is usually the way you deal with register 
> constraints and other target-specific coloring constraints, 
> rematerialization and how you get coalescing into the picture. Most 
> regalloc papers don't talk too much about that as it's highly finicky 
> and often target-specific. But they do have a huge impact on 
> allocation quality and can make or break any of those algorithms...
>
> - Matthias
>
>>
>> Thanks for your lessons correcting my mistakes, such as memory=bad 
>> register=good, and I need to find the answer for *when* memory is worse 
>> than a register. I am maintaining the AVR target, which has 32 general 
>> registers, 32K flash, and 2K sram 
>> http://www.atmel.com/Images/Atmel-42735-8-bit-AVR-Microcontroller-ATmega328-328P_Datasheet.pdf so 
>> perhaps for an MCU, memory might be more expensive than registers? But what 
>> about AMDGPU or VLIW processors? I have no experience with them, 
>> please teach me.
>>
>> I am reading LLVM's code SpillXXX, LiveRangeXXX, RegisterCoalescer, 
>> etc. to get the whole view of CodeGen.
>>
>> I am reading Dr. Rhydian Lewis's book: A Guide to Graph Colouring: 
>> Algorithms and Applications 
>> http://www.springer.com/us/book/9783319257280 and other 
>> papers. Even if HEA is not the best solution, I still want to 
>> practice and see the benchmark. I am not a computing professional, I 
>> am only 34 years old, perhaps I have enough time to waste :)
>>
>>
>> On 2017-12-19 00:41, dag@cray.com <mailto:dag@cray.com> wrote:
>>> Leslie Zhai <lesliezhai@llvm.org.cn <mailto:lesliezhai@llvm.org.cn>> 
>>> writes:
>>>
>>>> * Memory (20 - 100 cycles) is more expensive than Register (1 cycle), but
>>>> it has to spill code when PhysReg is unavailable
>>> As Vladimir said, the cache makes this kind of analysis much more
>>> tricky.  It's not necessarily the case that memory=bad and
>>> register=good.  Since there are tradeoffs involved, one needs to
>>> evaluate different strategies to determine *when* memory is worse than a
>>> register.  It may very well be the case that leaving something in memory
>>> frees up a register for something much more important to use it.  All of
>>> the register allocation algorithms try to determine this kind of thing
>>> through various heuristics.  Which heuristic is most effective is highly
>>> target-dependent.
>>>
>>> In my experience, changing heuristics can swing performance 20% or more
>>> in some cases.  Today's processors and memory systems are so complex
>>> that 2nd- and even 3rd-order effects become important.
>>>
>>> It is very, very wrong on today's machines to use # of spills as a
>>> metric to determine the "goodness" of an allocation.  Determining *what*
>>> to spill is much more important than the raw number of spills.  Many
>>> times I have seen codes generated with more spills perform much better
>>> than the code generated with fewer spills.  Almost all of the papers
>>> around the time of Chaitin-Briggs used # of spills as the metric.  That
>>> may have been appropriate at that time but take those results with giant
>>> grains of salt today.  Of course they are still very good papers for
>>> understanding algorithms and heuristics.
>>>
>>> The best way I know to determine what's best for your target is to run a
>>> whole bunch of *real* codes on them, trying different allocation
>>> algorithms and heuristics.  It is a lot of work, still worthy of a
>>> Ph.D. even today.  Register allocation is not a "solved" problem and
>>> probably never will be as architectures continue to change and become
>>> ever more diverse.
>>>
>>>> * Folding spill code into instructions, handling register coalescing,
>>>> splitting live ranges, doing rematerialization, doing shrink wrapping
>>>> are harder than RegAlloc
>>> Again, as Vladimir said, they are all part of register allocation.
>>> Sometimes they are implemented as separate passes but logically they all
>>> contribute work to the task of assigning registers as efficiently as
>>> possible.  And all of them use heuristics.  Choosing when and when not
>>> to, for example, coalesce can be important.  Splitting live ranges is
>>> logically the opposite of coalescing.  Theoretically one can "undo" a
>>> bad coalescing decision by re-splitting the live range but it's usually
>>> not that simple as other decisions in the interim can make that tricky
>>> if not impossible.  It is a delicate balancing act.
>>>
>>>> * The papers by Briggs and Chaitin contradict[2] themselves when
>>>> examine the text of the paper vs. the pseudocode provided?
>>> As others have said, I don't recall any inconsistencies but that doesn't
>>> mean there aren't bugs and/or incomplete descriptions open to
>>> interpretation.  One of my biggest issues with our computer academic
>>> culture is that we do not value repeatability.  It is virtually
>>> impossible to take a paper, implement the algorithm as described (or as
>>> best as possible given ambiguity) and obtain the same results.  Heck,
>>> half my Ph.D. dissertation was dissecting a published paper, examining
>>> the sensitivity of the described algorithm to various parts of the
>>> described heuristic that were ambiguous.  By interpreting the heuristic
>>> description in different ways I observed very different results.  I read
>>> papers for the algorithms, heuristics and ideas in them.  I pay no
>>> attention to results because in the real world we have to implement the
>>> ideas and test them ourselves to see if they will help us.
>>>
>>> Peter is right to point you to Preston.  He is very accessible, friendly
>>> and helpful.  I had the good fortune to work with him for a few years
>>> and he taught me a lot.  He has much real-world experience on codes
>>> actually used in production.  That experience is gold.
>>>
>>> Good luck to you!  You're entering a world that some computing
>>> professionals think is a waste of time because "we already know how to
>>> do that."  Those of us in the trenches know better.  :)
>>>
>>>                                -David
>>
>> --
>> Regards,
>> Leslie Zhai -https://reviews.llvm.org/p/xiangzhai/
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev@lists.llvm.org <mailto:llvm-dev@lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-19  3:03       ` [llvm-dev] " Leslie Zhai
@ 2017-12-20 17:25         ` Jakob Stoklund Olesen
  2017-12-21  0:44           ` Leslie Zhai
       [not found]           ` <5a3b03ec.01da9f0a.1934a.4b3bSMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 15+ messages in thread
From: Jakob Stoklund Olesen @ 2017-12-20 17:25 UTC (permalink / raw)
  To: Leslie Zhai
  Cc: Matthias Braun, dag, vmakarov, LewisR9,
	LLVM Developers Mailing List, GCC Development


> On Dec 18, 2017, at 19:03, Leslie Zhai <lesliezhai@llvm.org.cn> wrote:

Hi Leslie,

As others have pointed out, the notion that register allocation is isomorphic to graph coloring is poppycock. There are other important aspects, in particular the placement of spill/fill/copy instructions. The importance of graph coloring relative to spill code placement depends on how many registers you have available. If you are generating code for 32-bit x86 which has only 6-7 general purpose registers, you will have so much spill code and short live ranges that graph coloring doesn’t matter much at all. On the other hand, if you have 32 registers like Chaitin did, you have much less spilling in typical code, and the graph coloring aspect becomes important.

Early compilers would keep each local variable in a stack slot, and the register allocation optimization would literally allocate a whole local variable to a register. The C “register” keyword makes sense in that context. Later improvements like copy coalescing and live range splitting meant that multiple local variables could use the same register and a variable could live in different places at different times. It is sometimes useful to take this development to its logical extreme and look at register allocation as a caching problem: The register allocator’s job is to make sure that values are available to the instructions that need them, using the registers as a cache to get the values there in the most efficient way possible.

Guo, J., Garzarán, M. J., & Padua, D. (2004). The Power of Belady’s Algorithm in Register Allocation for Long Basic Blocks. In Languages and Compilers for Parallel Computing (Vol. 2958, pp. 374–389). Berlin, Heidelberg: Springer Berlin Heidelberg. http://doi.org/10.1007/978-3-540-24644-2_24

Braun, M., & Hack, S. (2009). Register spilling and live-range splitting for SSA-form programs. International Conference on Compiler Construction.

When you look at register allocation that way, the graph coloring aspect almost disappears. The optimum approach is probably somewhere in the middle.
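The furthest-next-use policy behind Belady's algorithm, applied to a single long basic block, can be sketched in a few lines. This is an editor's toy illustration (the function name and the example value sequence are invented, not from the thread or any compiler): on a miss with all k registers occupied, evict the resident value whose next use is furthest away.

```python
# A minimal sketch of Belady's furthest-next-use heuristic for one
# basic block: on a miss with all k registers occupied, evict the
# resident value whose next use is furthest away (or never comes).
def belady_reloads(uses, k):
    """uses: value names in the order the block reads them.
    k: number of physical registers.  Returns the reload count."""
    resident = set()
    reloads = 0
    for i, v in enumerate(uses):
        if v in resident:
            continue                  # value already in a register
        reloads += 1                  # must (re)load v into a register
        if len(resident) == k:
            def next_use(r):          # index of r's next read
                for j in range(i + 1, len(uses)):
                    if uses[j] == r:
                        return j
                return float("inf")   # never used again: best victim
            resident.remove(max(resident, key=next_use))
        resident.add(v)
    return reloads

print(belady_reloads(["a", "b", "c", "a", "b", "d", "a"], 2))  # -> 5
```

On this sequence with two registers, furthest-next-use eviction needs 5 reloads where least-recently-used eviction needs 7; that gap on long basic blocks is what the Guo/Garzarán/Padua paper above measures.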

A third important aspect is register constraints on individual instructions. Sometimes you almost need a little constraint solver just to figure out a valid register assignment for a single instruction. Preston Briggs dealt with this in his thesis, but it hasn’t gotten as much attention as graph coloring since.

Pereira, F. Q., & Palsberg, J. (2008). Register allocation by puzzle solving.

Regards,
/jakob

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-20 17:25         ` Jakob Stoklund Olesen
@ 2017-12-21  0:44           ` Leslie Zhai
       [not found]           ` <5a3b03ec.01da9f0a.1934a.4b3bSMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 15+ messages in thread
From: Leslie Zhai @ 2017-12-21  0:44 UTC (permalink / raw)
  To: Jakob Stoklund Olesen
  Cc: Matthias Braun, dag, vmakarov, LewisR9,
	LLVM Developers Mailing List, GCC Development

Hi Jakob,

Thanks for your kind response!

My use case is the AVR and RISC-V targets, and I just want to learn and 
practice HEA in RA, thanks for your sharing.


On 2017-12-21 01:25, Jakob Stoklund Olesen wrote:
>
>> On Dec 18, 2017, at 19:03, Leslie Zhai <lesliezhai@llvm.org.cn 
>> <mailto:lesliezhai@llvm.org.cn>> wrote:
>
> Hi Leslie,
>
> As others have pointed out, the notion that register allocation is 
> isomorphic to graph coloring is poppycock. There are other important 
> aspects, in particular the placement of spill/fill/copy instructions. 
> The importance of graph coloring relative to spill code placement 
> depends on how many registers you have available. If you are 
> generating code for 32-bit x86 which has only 6-7 general purpose 
> registers, you will have so much spill code and short live ranges that 
> graph coloring doesn’t matter much at all. On the other hand, if you 
> have 32 registers like Chaitin did, you have much less spilling in 
> typical code, and the graph coloring aspect becomes important.
>
> Early compilers would keep each local variable in a stack slot, and 
> the register allocation optimization would literally allocate a whole 
> local variable to a register. The C “register” keyword makes sense in 
> that context. Later improvements like copy coalescing and live range 
> splitting meant that multiple local variables could use the same 
> register and a variable could live in different places at different 
> times. It is sometimes useful to take this development to its logical 
> extreme and look at register allocation as a caching problem: The 
> register allocator’s job is to make sure that values are available to 
> the instructions that need them, using the registers as a cache to get 
> the values there in the most efficient way possible.
>
>     Guo, J., Garzarán, M. J., & Padua, D. (2004). The Power of
>     Belady’s Algorithm in Register Allocation for Long Basic Blocks.
>     In Languages and Compilers for Parallel Computing (Vol. 2958, pp.
>     374–389). Berlin, Heidelberg: Springer Berlin Heidelberg.
>     http://doi.org/10.1007/978-3-540-24644-2_24
>
>     Braun, M., & Hack, S. (2009). Register spilling and live-range
>     splitting for SSA-form programs. International Conference on
>     Compiler Construction.
>
>
> When you look at register allocation that way, the graph coloring 
> aspect almost disappears. The optimum approach is probably somewhere 
> in the middle.
>
> A third important aspect is register constraints on individual 
> instructions. Sometimes you almost need a little constraint solver 
> just to figure out a valid register assignment for a single 
> instruction. Preston Briggs dealt with this in his thesis, but it 
> hasn’t gotten as much attention as graph coloring since.
>
>     Pereira, F. Q., & Palsberg, J. (2008). Register allocation by
>     puzzle solving.
>
>
> Regards,
> /jakob
>

-- 
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [llvm-dev] Register Allocation Graph Coloring algorithm and Others
       [not found]           ` <5a3b03ec.01da9f0a.1934a.4b3bSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2017-12-21 11:09             ` Bruce Hoult
  2017-12-21 16:26               ` Leslie Zhai
  0 siblings, 1 reply; 15+ messages in thread
From: Bruce Hoult @ 2017-12-21 11:09 UTC (permalink / raw)
  To: Leslie Zhai
  Cc: Jakob Stoklund Olesen, GCC Development,
	LLVM Developers Mailing List, dag, vmakarov, LewisR9

So, both AVR and RISC-V are fairly register-rich with usually 32. RV32E
only has 16, but that's still a lot better than i386. If you use a lot of
16 bit integers then AVR also only has effectively 16 registers (or a few
more with a mix of 8 and 16 bit variables). 32 bit integers should be rare in
AVR code, but soft float/double variables are common in Arduino code (both
are implemented as 32 bits), so you only have room for 8 of those.

RISC-V doesn't have any hard constraints on something that *must* go in a
certain register, except the usual argument passing/return convention.
There is an advantage to allocating both the data src/dst register and
the pointer base register for a load or store from x8-x15 (s0-1,a0-5) as
much as possible as this allows the assembler to use a two byte instruction
instead of a four byte instruction.

I haven't looked at AVR in detail for a while, but I seem to recall the top
six registers make up three 16 bit pointer registers X,Y,Z. Any of them can
be used for (C language) *p, *--p, *p++, only Y&Z can be used for p->foo,
and only Z can be used for computed jumps (including function link
register) or loading constants from program memory. Also the various
multiply instructions take their 8 bit operands from any registers but
always produce the 16 bit result in r1:r0. Annoying but nowhere near as bad
as i386 as r0 and r1 are not used for anything else. The usual ABI makes r0
a clobber-at-will temp. r1 is supposed to be "always zero", so you need to
CLR it after retrieving (or ignoring) the high bits of a multiply result.
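Per-instruction constraints like these are why Jakob wrote that one almost needs a little constraint solver for a single instruction. A tiny backtracking sketch (editor's illustration; the operand names are invented and the constraints are only loosely modeled on the AVR pointer-register rules above) makes the shape of the problem concrete:

```python
# A toy model of per-instruction register constraints: each operand may
# only live in certain registers, and a valid assignment must not reuse
# a register.  Constraints loosely follow the AVR rules above: any of
# X/Y/Z for *p, only Y/Z for p->foo, only Z for computed jumps.
def assign(operands, assigned=None):
    """operands: list of (name, tuple-of-allowed-registers).
    Returns dict name -> register, or None if unsatisfiable."""
    assigned = assigned or {}
    if len(assigned) == len(operands):
        return assigned               # every operand has a register
    name, allowed = operands[len(assigned)]
    for reg in allowed:
        if reg not in assigned.values():
            result = assign(operands, {**assigned, name: reg})
            if result is not None:
                return result
    return None                       # backtrack: no register fits

ops = [("jump_target", ("Z",)),       # computed jump: Z only
       ("struct_base", ("Y", "Z")),   # p->foo: Y or Z
       ("scratch_ptr", ("X", "Y", "Z"))]
print(assign(ops))  # -> {'jump_target': 'Z', 'struct_base': 'Y', 'scratch_ptr': 'X'}
```

Ordering the most constrained operand first keeps the backtracking shallow here; real allocators fold such constraints into register classes and coloring criteria instead of solving each instruction independently.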

On Thu, Dec 21, 2017 at 3:44 AM, Leslie Zhai via llvm-dev <
llvm-dev@lists.llvm.org> wrote:

> Hi Jakob,
>
> Thanks for your kind response!
>
> My usecase is for AVR and RISCV targets, and I just want to learn and
> practice HEA in RA, thanks for your sharing.
>
>
> On 2017-12-21 01:25, Jakob Stoklund Olesen wrote:
>
>>
>> On Dec 18, 2017, at 19:03, Leslie Zhai <lesliezhai@llvm.org.cn <mailto:
>>> lesliezhai@llvm.org.cn>> wrote:
>>>
>>
>> Hi Leslie,
>>
>> As others have pointed out, the notion that register allocation is
>> isomorphic to graph coloring is poppycock. There are other important
>> aspects, in particular the placement of spill/fill/copy instructions. The
>> importance of graph coloring relative to spill code placement depends on
>> how many registers you have available. If you are generating code for
>> 32-bit x86 which has only 6-7 general purpose registers, you will have so
>> much spill code and short live ranges that graph coloring doesn’t matter
>> much at all. On the other hand, if you have 32 registers like Chaitin did,
>> you have much less spilling in typical code, and the graph coloring aspect
>> becomes important.
>>
>> Early compilers would keep each local variable in a stack slot, and the
>> register allocation optimization would literally allocate a whole local
>> variable to a register. The C “register” keyword makes sense in that
>> context. Later improvements like copy coalescing and live range splitting
>> meant that multiple local variables could use the same register and a
>> variable could live in different places at different times. It is sometimes
>> useful to take this development to its logical extreme and look at register
>> allocation as a caching problem: The register allocator’s job is to make
>> sure that values are available to the instructions that need them, using the
>> registers as a cache to get the values there in the most efficient way
>> possible.
>>
>>     Guo, J., Garzarán, M. J., & Padua, D. (2004). The Power of
>>     Belady’s Algorithm in Register Allocation for Long Basic Blocks.
>>     In Languages and Compilers for Parallel Computing (Vol. 2958, pp.
>>     374–389). Berlin, Heidelberg: Springer Berlin Heidelberg.
>>     http://doi.org/10.1007/978-3-540-24644-2_24
>>
>>     Braun, M., & Hack, S. (2009). Register spilling and live-range
>>     splitting for SSA-form programs. International Conference on
>>     Compiler Construction.
>>
>>
>> When you look at register allocation that way, the graph coloring aspect
>> almost disappears. The optimum approach is probably somewhere in the middle.
>>
>> A third important aspect is register constraints on individual
>> instructions. Sometimes you almost need a little constraint solver just to
>> figure out a valid register assignment for a single instruction. Preston
>> Briggs dealt with this in his thesis, but it hasn’t gotten as much
>> attention as graph coloring since.
>>
>>     Pereira, F. Q., & Palsberg, J. (2008). Register allocation by
>>     puzzle solving.
>>
>>
>> Regards,
>> /jakob
>>
>>
> --
> Regards,
> Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev@lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [llvm-dev] Register Allocation Graph Coloring algorithm and Others
  2017-12-21 11:09             ` [llvm-dev] " Bruce Hoult
@ 2017-12-21 16:26               ` Leslie Zhai
  0 siblings, 0 replies; 15+ messages in thread
From: Leslie Zhai @ 2017-12-21 16:26 UTC (permalink / raw)
  To: Bruce Hoult
  Cc: Jakob Stoklund Olesen, GCC Development,
	LLVM Developers Mailing List, dag, vmakarov, LewisR9,
	Alex Bradbury

Hi Bruce,

Thanks for your sharing!

I am porting GlobalISel to the RISC-V target[1], the highest priority in the 
TODO list[2]; you are welcome to contribute to lowRISC. Once all the issues 
are fixed, I could try to implement RegAllocGraphColoring with HEA and write 
great Machine Schedulers.

[1] https://github.com/lowRISC/riscv-llvm/issues/19
[2] https://github.com/lowRISC/riscv-llvm/issues

On 2017-12-21 19:09, Bruce Hoult wrote:
> So, both AVR and RISC-V are fairly register-rich with usually 32. 
> RV32E only has 16, but that's still a lot better than i386. If you use 
> a lot of 16 bit integers then AVR also only has effectively 16 
> registers (or a few more with a mix of 8 and 16 bit variables). 32 bit 
> integers should be rare in AVR code, but soft float/double variables 
> are common in Arduino code (both are implemented as 32 bits), so you 
> only have room for 8 of those.
>
> RISC-V doesn't have any hard constraints on something that *must* go 
> in a certain register, except the usual argument passing/return 
> convention. There is an advantage to allocating both the data 
> src/dst register and the pointer base register for a load or store 
> from x8-x15 (s0-1,a0-5) as much as possible as this allows the 
> assembler to use a two byte instruction instead of a four byte 
> instruction.
>
> I haven't looked at AVR in detail for a while, but I seem to recall the 
> top six registers make up three 16 bit pointer registers X,Y,Z. Any of 
> them can be used for (C language) *p, *--p, *p++, only Y&Z can be used 
> for p->foo, and only Z can be used for computed jumps (including 
> function link register) or loading constants from program memory. Also 
> the various multiply instructions take their 8 bit operands from any 
> registers but always produce the 16 bit result in r1:r0. Annoying but 
> nowhere near as bad as i386 as r0 and r1 are not used for anything 
> else. The usual ABI makes r0 a clobber-at-will temp. r1 is supposed to 
> be "always zero", so you need to CLR it after retrieving (or ignoring) 
> the high bits of a multiply result.
>
> On Thu, Dec 21, 2017 at 3:44 AM, Leslie Zhai via llvm-dev 
> <llvm-dev@lists.llvm.org <mailto:llvm-dev@lists.llvm.org>> wrote:
>
>     Hi Jakob,
>
>     Thanks for your kind response!
>
>     My usecase is for AVR and RISCV targets, and I just want to learn
>     and practice HEA in RA, thanks for your sharing.
>
>
>     On 2017-12-21 01:25, Jakob Stoklund Olesen wrote:
>
>
>             On Dec 18, 2017, at 19:03, Leslie Zhai
>             <lesliezhai@llvm.org.cn <mailto:lesliezhai@llvm.org.cn>
>             <mailto:lesliezhai@llvm.org.cn
>             <mailto:lesliezhai@llvm.org.cn>>> wrote:
>
>
>         Hi Leslie,
>
>         As others have pointed out, the notion that register
>         allocation is isomorphic to graph coloring is poppycock. There
>         are other important aspects, in particular the placement of
>         spill/fill/copy instructions. The importance of graph coloring
>         relative to spill code placement depends on how many registers
>         you have available. If you are generating code for 32-bit x86
>         which has only 6-7 general purpose registers, you will have so
>         much spill code and short live ranges that graph coloring
>         doesn’t matter much at all. On the other hand, if you have 32
>         registers like Chaitin did, you have much less spilling in
>         typical code, and the graph coloring aspect becomes important.
>
>         Early compilers would keep each local variable in a stack
>         slot, and the register allocation optimization would literally
>         allocate a whole local variable to a register. The C
>         “register” keyword makes sense in that context. Later
>         improvements like copy coalescing and live range splitting
>         meant that multiple local variables could use the same
>         register and a variable could live in different places at
>         different times. It is sometimes useful to take this
>         development to its logical extreme and look at register
>         allocation as a caching problem: The register allocator’s job
>         is to make sure that values are available to the instructions
>         that need them, using the registers as a cache to get the
>         values there in the most efficient way possible.
>
>             Guo, J., Garzarán, M. J., & Padua, D. (2004). The Power of
>             Belady’s Algorithm in Register Allocation for Long Basic
>         Blocks.
>             In Languages and Compilers for Parallel Computing (Vol.
>         2958, pp.
>             374–389). Berlin, Heidelberg: Springer Berlin Heidelberg.
>         http://doi.org/10.1007/978-3-540-24644-2_24
>         <http://doi.org/10.1007/978-3-540-24644-2_24>
>
>             Braun, M., & Hack, S. (2009). Register spilling and live-range
>             splitting for SSA-form programs. International Conference on
>             Compiler Construction.
>
>
>         When you look at register allocation that way, the graph
>         coloring aspect almost disappears. The optimum approach is
>         probably somewhere in the middle.
>
>         A third important aspect is register constraints on individual
>         instructions. Sometimes you almost need a little constraint
>         solver just to figure out a valid register assignment for a
>         single instruction. Preston Briggs dealt with this in his
>         thesis, but it hasn’t gotten as much attention as graph
>         coloring since.
>
>             Pereira, F. Q., & Palsberg, J. (2008). Register allocation by
>             puzzle solving.
>
>
>         Regards,
>         /jakob
>
>
>     -- 
>     Regards,
>     Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/
>     <https://reviews.llvm.org/p/xiangzhai/>
>
>
>
>
>     _______________________________________________
>     LLVM Developers mailing list
>     llvm-dev@lists.llvm.org <mailto:llvm-dev@lists.llvm.org>
>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>     <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
>

-- 
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-19  0:07 ` Michael Clark
  2017-12-19  2:10   ` Leslie Zhai
@ 2017-12-19  4:16   ` Vladimir Makarov
  1 sibling, 0 replies; 15+ messages in thread
From: Vladimir Makarov @ 2017-12-19  4:16 UTC (permalink / raw)
  To: Michael Clark, Leslie Zhai
  Cc: LewisR9, dag, stoklund, GCC Development, LLVM Developers Mailing List



On 12/18/2017 07:07 PM, Michael Clark wrote:
> Hi Leslie,
>
> I suggest adding these 3 papers to your reading list.
>
> 	Register allocation for programs in SSA-form
> 	Sebastian Hack, Daniel Grund, and Gerhard Goos
> 	http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf
>
> 	Simple and Efficient Construction of Static Single Assignment Form
> 	Matthias Braun , Sebastian Buchwald , Sebastian Hack , Roland Leißa , Christoph Mallon , and Andreas Zwinkau
> 	https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf
>
> 	Optimal register allocation for SSA-form programs in polynomial time
> 	Sebastian Hack, Gerhard Goos
> 	http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf
>
> A lot of the earlier literature regarding the register allocation problem describes the general graph colouring problem as NP-complete; however, previous research in Register Allocation has been in the context of programs that were not in SSA form, i.e. the Chaitin-Briggs paper states that register allocation is NP-complete and proposes an iterative algorithm.
>
>
I am sceptical about SSA register allocators for high-quality code 
generation.  But I should acknowledge that although I have tried a lot of 
RA algorithms, I have never tried an SSA one.

One task of RA is to deal with moves, but when you are destroying SSA you 
are creating moves or swaps.  Of course there are optimizations to 
decrease their number when you are going out of SSA.  But IMHO a good RA 
should see all moves created before or during its own work.

As I already wrote coloring is just one out of many task of RA.  All 
these tasks in a good RA should be solved in cooperation. Optimistic 
coloring is good and fast enough (although building a conflict graph for 
it takes a lot of time).  I think better results can be obtained by 
improving other tasks than by improving optimistic coloring.
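For readers new to the terminology, the optimistic coloring Vladimir mentions, in the Chaitin/Briggs style, can be sketched roughly as follows. This is an editor's toy illustration, not GCC or LLVM code: simplify nodes of degree < k onto a stack, push a spill candidate optimistically when no such node exists, then pop and assign colors, turning genuinely uncolorable nodes into spills.

```python
# A toy sketch of Chaitin/Briggs-style optimistic coloring.
# Simplify: repeatedly remove a node with degree < k and push it; if
# none exists, push a high-degree spill candidate anyway (the optimism).
# Select: pop nodes and assign colors; a node with no free color left
# becomes an actual spill.
def optimistic_color(graph, k):
    """graph: dict node -> set of interfering nodes (symmetric).
    Returns (coloring, spills)."""
    g = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while g:
        n = next((x for x in g if len(g[x]) < k), None)
        if n is None:                        # no trivially colorable node
            n = max(g, key=lambda x: len(g[x]))  # optimistic spill candidate
        stack.append(n)
        for m in g.pop(n):                   # remove n from the graph
            g[m].discard(n)
    coloring, spills = {}, []
    while stack:
        n = stack.pop()
        used = {coloring[m] for m in graph[n] if m in coloring}
        free = [c for c in range(k) if c not in used]
        if free:
            coloring[n] = free[0]
        else:
            spills.append(n)                 # optimism did not pay off
    return coloring, spills
```

On a triangle with k = 2 the simplify phase stalls, one node is pushed optimistically, and select turns it into a real spill; Briggs's observation was that such a node often still gets a color in practice, e.g. when two of its neighbors happen to share one. The cost of building `graph` itself is the expensive part Vladimir notes.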

For example, nobody mentioned irregular register file architectures (with 
subregisters and register subclasses) like x86/x86-64.  Even regular 
register file architectures with caller/callee-saved hard registers have 
irregularity, because this creates register subclasses, e.g. the class of 
saved general registers is a subclass of the overall general register 
class.  Usually hard registers are present in the IR, and if some pseudos 
conflict with the hard registers, that also creates a lot of (intersecting) 
subclasses.

Taking such irregularity into account, e.g. in coloring criteria based on 
Kempe's heuristic, can improve the final RA significantly (as I remember, 
GCC-generated code for SPECInt2000, even on ppc64 with a mostly regular 
register file, was improved by 1% by taking such subclasses into account 
during coloring).  A classical article about this, 
https://www.researchgate.net/publication/2440424_Graph-Coloring_Register_Allocation_for_Irregular_Architectures, 
could be a starting point for such an implementation.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
  2017-12-19  0:07 ` Michael Clark
@ 2017-12-19  2:10   ` Leslie Zhai
  2017-12-19  4:16   ` Vladimir Makarov
  1 sibling, 0 replies; 15+ messages in thread
From: Leslie Zhai @ 2017-12-19  2:10 UTC (permalink / raw)
  To: Michael Clark
  Cc: vmakarov, LewisR9, dag, stoklund, GCC Development,
	LLVM Developers Mailing List

Hi Michael,

Thanks for your sharing!

I will read the papers as you suggested, and I asked Quantum Computing 
professionals about solving the NP-complete problem 
https://github.com/epiqc/ScaffCC/issues/14 but GCC's IRA and LRA proved 
that there is a solution in SSA form for classical computers.

Sorting MachineBasicBlocks by frequency first is a good idea, and I have 
only worked on a few passes, such as LoopUnroll 
http://lists.llvm.org/pipermail/llvm-dev/2017-October/118419.html but I 
will practice your idea in my solutionByHEA 
https://github.com/xiangzhai/llvm/blob/avr/lib/CodeGen/RegAllocGraphColoring.cpp#L327

I have experience with binary translation from x86 to MIPS; QEMU is expensive 
https://www.leetcode.cn/2017/09/tcg-ir-to-llvm-ir.html I will bookmark 
your JIT engine on my blog :)


On 2017-12-19 08:07, Michael Clark wrote:
> Hi Leslie,
>
> I suggest adding these 3 papers to your reading list.
>
> 	Register allocation for programs in SSA-form
> 	Sebastian Hack, Daniel Grund, and Gerhard Goos
> 	http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf
>
> 	Simple and Efficient Construction of Static Single Assignment Form
> 	Matthias Braun , Sebastian Buchwald , Sebastian Hack , Roland Leißa , Christoph Mallon , and Andreas Zwinkau
> 	https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf
>
> 	Optimal register allocation for SSA-form programs in polynomial time
> 	Sebastian Hack, Gerhard Goos
> 	http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf
>
> A lot of the earlier literature regarding the register allocation problem describes the general graph colouring problem as NP-complete; however, previous research in Register Allocation has been in the context of programs that were not in SSA form, i.e. the Chaitin-Briggs paper states that register allocation is NP-complete and proposes an iterative algorithm.
>
> If one reads Hack and Goos, there is the discovery that programs that are in SSA form have chordal graphs, and the colouring algorithm for chordal graphs can be completed in polynomial time. After the cost of building the interference graph, it is possible to perform register allocation in a single pass. The key is in not modifying the graph.
>
> If one has a frequency for each basic block, then one can sort basic blocks by frequency, allocating the highest frequency blocks first, and where possible assigning the same physical register to all virtual registers in each phi node (to avoid copies). At program points where the live set is larger than k, the number of physical registers, one spills the register that has the largest distance to its next use. At least that is how I am thinking about this problem. I am also a novice with regards to register allocation.
>
> I intend to develop a register allocator for use in this JIT engine:
> 	
> 	rv8: a high performance RISC-V to x86 binary translator
> 	Michael Clark, Bruce Hoult
> 	https://carrv.github.io/papers/clark-rv8-carrv2017.pdf
>
> Our JIT already performs almost twice as fast as QEMU, and we are still using static register allocation (QEMU, I believe, has a register allocator). We are mapping a 31-register RISC architecture onto a 16-register CISC architecture. I have statistics on the current slow-down, and it appears to be mainly L1 accesses to spilled registers. I believe that after we have implemented a register allocator in our JIT we will be able to run RISC-V binaries at near-native performance on x86-64. In fact, given that we are performing runtime profile-guided optimisation, it may even be possible for some benchmarks to run faster than with register allocations based on static estimates of loop frequencies.
>
> 	https://anarch128.org/~mclark/rv8-slides.pdf
> 	https://rv8.io/bench
>
> We’ve still got a long way to go to get to ~1:1 performance for RISC-V on x86-64, but I think it is possible, at least for some codes…
>
> Regards,
> Michael.
>
>> On 15/12/2017, at 4:18 PM, Leslie Zhai <lesliezhai@llvm.org.cn> wrote:
>>
>> Hi GCC and LLVM developers,
>>
>> I am learning register allocation algorithms, and I understand that:
>>
>> * Unlimited VirtRegs (pseudos) -> limited or fixed or aliased[1] PhysRegs (hard)
>>
>> * Memory (20 - 100 cycles) is more expensive than a register (1 cycle), but the allocator has to emit spill code when no PhysReg is available
>>
>> * Folding spill code into instructions, handling register coalescing, splitting live ranges, doing rematerialization, and doing shrink wrapping are harder than basic RegAlloc
>>
>> * LRA and IRA are the default RA passes in GCC:
>>
>> $ /opt/gcc-git/bin/gcc hello.c
>> DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
>> DEBUG: ../../gcc/ira-build.c, ira_build, line 3409
>>
>> * Greedy is the default pass for LLVM
>>
>> But I have some questions; please give me some hints, thanks a lot!
>>
>> * IRA is a regional register allocator performing graph coloring on a top-down traversal of nested regions; is it global, compared with the local LRA?
>>
>> * Do the papers by Briggs and Chaitin contradict[2] themselves when one examines the text of the paper vs. the pseudocode provided?
>>
>> * Why is the interference graph expensive to build[3]?
>>
>> And I am first practicing[4] using HEA, developed by Dr. Rhydian Lewis, in LLVM.
>>
>>
>> [1] https://reviews.llvm.org/D39712
>>
>> [2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html
>>
>> [3] https://github.com/joaotavio/llvm-register-allocator
>>
>> [4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol
>>
>> -- 
>> Regards,
>> Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/
>>
>>
>>

-- 
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Register Allocation Graph Coloring algorithm and Others
       [not found] <615F0DCE4D5873A9@mac.com>
@ 2017-12-19  0:07 ` Michael Clark
  2017-12-19  2:10   ` Leslie Zhai
  2017-12-19  4:16   ` Vladimir Makarov
  0 siblings, 2 replies; 15+ messages in thread
From: Michael Clark @ 2017-12-19  0:07 UTC (permalink / raw)
  To: Leslie Zhai
  Cc: vmakarov, LewisR9, dag, stoklund, GCC Development,
	LLVM Developers Mailing List

Hi Leslie,

I suggest adding these 3 papers to your reading list.

	Register allocation for programs in SSA-form
	Sebastian Hack, Daniel Grund, and Gerhard Goos
	http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf

	Simple and Efficient Construction of Static Single Assignment Form
	Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mallon, and Andreas Zwinkau
	https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf

	Optimal register allocation for SSA-form programs in polynomial time
	Sebastian Hack, Gerhard Goos
	http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf

Much of the earlier literature on the register allocation problem describes the general graph colouring problem as NP-complete; however, previous research in register allocation was done in the context of programs that were not in SSA form. For example, the Chaitin-Briggs paper states that register allocation is NP-complete and proposes an iterative algorithm.
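
The iterative simplify/spill core of Chaitin-Briggs can be sketched roughly as follows. This is a hypothetical illustration, not any compiler's actual code; the name `chaitin_briggs_order` is mine, and a real allocator adds coalescing, spill-cost metrics, and graph rebuilding after spilling.

```python
# Hypothetical sketch of the Chaitin-Briggs simplify phase: repeatedly
# remove a node of degree < k from the interference graph (it is
# trivially colourable once its neighbours are coloured); when no such
# node exists, optimistically push a high-degree node anyway and record
# it as a potential spill (Briggs' optimistic colouring).

def chaitin_briggs_order(adj, k):
    """adj: {vreg: set(interfering vregs)}; k: number of registers.
    Returns (colouring stack, potential spill candidates)."""
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    stack, maybe_spill = [], []
    while remaining:
        low = [v for v in remaining if degree[v] < k]
        if low:
            v = low[0]
        else:
            # no trivially colourable node: push one optimistically
            v = max(remaining, key=lambda u: degree[u])
            maybe_spill.append(v)
        remaining.remove(v)
        stack.append(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return stack, maybe_spill
```

Popping the stack and assigning colours then either succeeds even for the optimistic nodes or turns them into actual spills, after which the graph is rebuilt and the process iterates.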

If one reads Hack and Goos, there is the discovery that programs in SSA form have chordal interference graphs, and the colouring algorithm for chordal graphs can be completed in polynomial time. After the cost of building the interference graph, it is possible to perform register allocation in a single pass. The key is in not modifying the graph.
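
The chordal-graph observation can be sketched as follows; this is a hypothetical illustration (the name `mcs_greedy_colour` is mine, not from any compiler). For a chordal graph, greedy colouring in maximum cardinality search (MCS) visit order is optimal, because each vertex's already-coloured neighbours form a clique.

```python
# Hypothetical sketch: optimal colouring of a chordal interference
# graph in polynomial time. MCS visits the uncoloured vertex with the
# most coloured neighbours; greedy colouring along that order uses
# exactly as many colours as the largest clique when the graph is chordal.

def mcs_greedy_colour(adj):
    """adj: {vertex: set(neighbours)}; returns {vertex: colour index}."""
    weight = {v: 0 for v in adj}  # count of already-coloured neighbours
    colour = {}
    for _ in range(len(adj)):
        v = max((u for u in adj if u not in colour), key=lambda u: weight[u])
        used = {colour[w] for w in adj[v] if w in colour}
        c = 0
        while c in used:  # smallest colour not used by a coloured neighbour
            c += 1
        colour[v] = c
        for u in adj[v]:
            if u not in colour:
                weight[u] += 1
    return colour
```

On a triangle {a, b, c} with a pendant vertex d attached to a (a chordal graph whose largest clique has size 3), this uses exactly three colours, matching the clique number.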

If one has a frequency for each basic block, then one can sort the basic blocks by frequency, allocating the highest-frequency blocks first, and where possible assigning the same physical register to all virtual registers in each phi node (to avoid copies). At program points where the live set is larger than k, the number of physical registers, one spills the register whose next use is farthest away. At least that is how I am thinking about this problem. I am also a novice with regards to register allocation.
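
The spill heuristic in the paragraph above (evict the live value whose next use is farthest away, i.e. Belady's MIN rule) can be sketched like this; `choose_spill` and its arguments are hypothetical names for illustration, not an existing compiler API.

```python
# Hypothetical sketch of the "furthest next use" spill heuristic: at a
# program point where the live set exceeds k registers, evict the value
# whose next use is farthest in the future; values with no further use
# are the best candidates of all.

def choose_spill(live, next_use, point):
    """live: iterable of virtual registers live at `point`;
    next_use: {vreg: sorted list of instruction indices where it is used}."""
    def distance(v):
        later = [i for i in next_use.get(v, []) if i > point]
        return later[0] if later else float("inf")  # never used again
    return max(live, key=distance)
```

For example, with next uses a -> 5, b -> 3, c -> 10 at point 0, the heuristic evicts c, since a and b are both needed sooner.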

I intend to develop a register allocator for use in this JIT engine:
	
	rv8: a high performance RISC-V to x86 binary translator
	Michael Clark, Bruce Hoult
	https://carrv.github.io/papers/clark-rv8-carrv2017.pdf

Our JIT already performs almost twice as fast as QEMU, and we are still using static register allocation (QEMU, I believe, has a register allocator). We are mapping a 31-register RISC architecture onto a 16-register CISC architecture. I have statistics on the current slow-down, and it appears to be mainly L1 accesses to spilled registers. I believe that after we have implemented a register allocator in our JIT we will be able to run RISC-V binaries at near-native performance on x86-64. In fact, given that we are performing runtime profile-guided optimisation, it may even be possible for some benchmarks to run faster than with register allocations based on static estimates of loop frequencies.

	https://anarch128.org/~mclark/rv8-slides.pdf
	https://rv8.io/bench

We’ve still got a long way to go to get to ~1:1 performance for RISC-V on x86-64, but I think it is possible, at least for some codes…

Regards,
Michael.

> On 15/12/2017, at 4:18 PM, Leslie Zhai <lesliezhai@llvm.org.cn> wrote:
> 
> Hi GCC and LLVM developers,
> 
> I am learning register allocation algorithms, and I understand that:
> 
> * Unlimited VirtRegs (pseudos) -> limited or fixed or aliased[1] PhysRegs (hard)
> 
> * Memory (20 - 100 cycles) is more expensive than a register (1 cycle), but the allocator has to emit spill code when no PhysReg is available
> 
> * Folding spill code into instructions, handling register coalescing, splitting live ranges, doing rematerialization, and doing shrink wrapping are harder than basic RegAlloc
> 
> * LRA and IRA are the default RA passes in GCC:
> 
> $ /opt/gcc-git/bin/gcc hello.c
> DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
> DEBUG: ../../gcc/ira-build.c, ira_build, line 3409
> 
> * Greedy is the default pass for LLVM
> 
> But I have some questions; please give me some hints, thanks a lot!
> 
> * IRA is a regional register allocator performing graph coloring on a top-down traversal of nested regions; is it global, compared with the local LRA?
> 
> * Do the papers by Briggs and Chaitin contradict[2] themselves when one examines the text of the paper vs. the pseudocode provided?
> 
> * Why is the interference graph expensive to build[3]?
> 
> And I am first practicing[4] using HEA, developed by Dr. Rhydian Lewis, in LLVM.
> 
> 
> [1] https://reviews.llvm.org/D39712
> 
> [2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html
> 
> [3] https://github.com/joaotavio/llvm-register-allocator
> 
> [4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol
> 
> -- 
> Regards,
> Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-12-21 16:26 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-15  3:19 Register Allocation Graph Coloring algorithm and Others Leslie Zhai
2017-12-15  4:40 ` Vladimir Makarov
2017-12-15  4:59   ` Leslie Zhai
2017-12-15 14:48 ` Peter Bergner
2017-12-18 16:42 ` dag
2017-12-18 17:52   ` Leslie Zhai
2017-12-18 22:35     ` dag
     [not found]   ` <9CBFCADA170A6854@apple.com>
     [not found]     ` <6708A731-203E-4096-8137-85D4104DA035@apple.com>
2017-12-19  3:03       ` [llvm-dev] " Leslie Zhai
2017-12-20 17:25         ` Jakob Stoklund Olesen
2017-12-21  0:44           ` Leslie Zhai
     [not found]           ` <5a3b03ec.01da9f0a.1934a.4b3bSMTPIN_ADDED_BROKEN@mx.google.com>
2017-12-21 11:09             ` [llvm-dev] " Bruce Hoult
2017-12-21 16:26               ` Leslie Zhai
     [not found] <615F0DCE4D5873A9@mac.com>
2017-12-19  0:07 ` Michael Clark
2017-12-19  2:10   ` Leslie Zhai
2017-12-19  4:16   ` Vladimir Makarov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).