public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Register Pressure in Instruction Level Parallelism
@ 2009-06-28 12:37 Michael Kruse
  2009-06-28 19:06 ` Dave Korn
  2009-06-29 16:27 ` Vladimir Makarov
  0 siblings, 2 replies; 20+ messages in thread
From: Michael Kruse @ 2009-06-28 12:37 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: text/plain, Size: 2463 bytes --]

Hello GCC developers,

I am going to write a Master's Thesis about instruction scheduling based 
on the technique presented in [1]. This will also include an implementation.

The basic idea is to have an additional pass on the data dependency 
graph before instruction scheduling and register allocation take place. 
It estimates the minimal (register sufficiency) and maximal (register 
saturation) number of registers required by schedules based on that graph.

If the register sufficiency is higher than the physical number of 
registers, spill code is added to the graph.

If the register saturation is higher than the physical number of 
registers, artificial dependencies are added to the graph, such that the 
instruction scheduler is forced to generate a schedule with less registers.
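To make the two bounds concrete: the following is a small sketch of my own (exhaustive enumeration over a toy DDG, not the heuristics from the thesis, which exist precisely because enumeration does not scale). Each node defines one value; a value dies once its last consumer has been scheduled.

```python
from itertools import permutations

def is_topological(order, deps):
    # deps[n] = set of nodes that must be scheduled before n
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[d] < pos[n] for n in deps for d in deps[n])

def max_live(order, deps, uses):
    # uses[n] = consumers of the value defined by n; a value with no
    # consumers dies immediately, otherwise at its last use
    pos = {n: i for i, n in enumerate(order)}
    live, peak = set(), 0
    for n in order:
        live.add(n)                      # n defines a value
        peak = max(peak, len(live))
        for d in list(live):
            if all(pos[u] <= pos[n] for u in uses.get(d, set())):
                live.discard(d)          # last consumer reached
    return peak

def pressure_bounds(nodes, deps, uses):
    # minimum over all valid schedules ~ register sufficiency,
    # maximum ~ register saturation (exact, but only for tiny graphs)
    vals = [max_live(o, deps, uses)
            for o in permutations(nodes) if is_topological(o, deps)]
    return min(vals), max(vals)
```

For two independent chains a→b and c→d, `pressure_bounds` gives (2, 3): scheduling the chains back to back needs 2 registers (sufficiency), interleaving them needs 3 (saturation). Adding an artificial dependency b→c, as described above, forces every schedule down to 2.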

The aim is that the instruction scheduler can focus on the optimal 
arrangement of instructions to exploit a maximal amount of instruction 
level parallelism. [1] also suggests heuristics for all these problems. 
According to the author, these are "nearly optimal" in practice. The 
heuristics for estimating register sufficiency and saturation are both 
optimistic, meaning that it might still be necessary to add more spill 
code to the final code. Hence, this technique is just an 
optimization pass and doesn't replace existing register allocation or 
instruction scheduling passes.

[1] also has a part about loop scheduling, but my thesis will be about 
basic blocks only. See [2] for a presentation of this technique.

So, now my questions: How much do you think this could improve 
compiled code speed? Would the current GCC/YARA benefit from such an 
optimization pass at all? What are the chances that this could get into 
the main GCC tree if it turns out to be an improvement?

There has been a short discussion on this mailing list already [3] some 
years ago (note, if you read it: intLP has been used only to compare the 
heuristic to the optimal result). To my knowledge, there hasn't 
been anything more on this topic since (anywhere).

I'd prefer to implement this for gcc, but my advisor wants me to do 
it for the university's own compiler. Therefore I could also use some 
arguments for doing it for GCC.

Regards,
Michael Kruse

[1] http://www.prism.uvsq.fr/~touati/thesis.html
[2] http://tel.archives-ouvertes.fr/docs/00/04/72/88/ANNEX/tel-00007405.ppt
[3] http://gcc.gnu.org/ml/gcc/2002-07/msg00565.html

-- 
Tardyzentrismus verboten!


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/x-pkcs7-signature, Size: 4401 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 12:37 Register Pressure in Instruction Level Parallelism Michael Kruse
@ 2009-06-28 19:06 ` Dave Korn
  2009-06-28 22:53   ` Albert Cohen
                     ` (2 more replies)
  2009-06-29 16:27 ` Vladimir Makarov
  1 sibling, 3 replies; 20+ messages in thread
From: Dave Korn @ 2009-06-28 19:06 UTC (permalink / raw)
  To: reply; +Cc: gcc

Michael Kruse wrote:

> So, now my questions: How much do you think this could improve
> compiled code speed? Would the current GCC/YARA benefit from such an
> optimization pass at all? What are the chances that this could get into
> the main GCC tree if it turns out to be an improvement?

  One of the major problems in gcc is the intertangling of instruction
selection with register allocation and spill generation.  If these could be
separated it would almost certainly generate better code and be welcomed with
open arms!

> I'd prefer to implement this for gcc, but my advisor wants me to do
> it for the university's own compiler. Therefore I could also use some
> arguments for doing it for GCC.

  Because destroying reload(*) would be an incalculable public service and
your name will be remembered in history as the one who slew the dragon? ;-)

    cheers,
      DaveK
-- 
(*) - http://gcc.gnu.org/wiki/reload


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 19:06 ` Dave Korn
@ 2009-06-28 22:53   ` Albert Cohen
  2009-06-28 23:56     ` Dave Korn
                       ` (2 more replies)
  2009-06-28 23:01   ` Michael Kruse
  2009-06-29 17:05   ` Vladimir Makarov
  2 siblings, 3 replies; 20+ messages in thread
From: Albert Cohen @ 2009-06-28 22:53 UTC (permalink / raw)
  To: Dave Korn; +Cc: reply, gcc, Sid Touati, Frederic Brault

Hi all,

I'm convinced that saturation and sufficiency based approaches are the 
future of register pressure management.

[Please keep my colleague Sid Touati in CC of this thread, until he 
registers to the list.]

Unfortunately, the state of the art (more recent than the thesis 
referenced in the original email, see Touati's web page) is limited to 
basic block and software-pipelining scopes, and limited to scheduling.

Compared to the tasks currently managed by reload, it certainly misses a 
whole bunch of instruction selection and immediate operand/address mode 
corner case problems (depending on the target). It also misses global 
scheduling, but extended BB scheduling is not very hard to approximate, 
as well as superblock scheduling.

I'm not at all a knowledgeable person to tell you what to do in the case 
of GCC, but for sure saturation/sufficiency-based approaches are not 
sufficient to kill the dragon.

However, I will strongly advise anybody (= Kenny Zadeck) looking at a 
fresh SSA-based backend design to consider such an approach to avoid 
messing up with pressure-aware-* where * is any backend pass that bears 
a risk of blowing up register pressure.

If you are interested in collaboration on the topic, we are about to start 
extending those approaches to general control-flow, and implementing it 
in a production compiler (not GCC, but a free one, and not LLVM either).

Albert


Dave Korn wrote:
> Michael Kruse wrote:
> 
>> So, now my questions: How much do you think this could improve
>> compiled code speed? Would the current GCC/YARA benefit from such an
>> optimization pass at all? What are the chances that this could get into
>> the main GCC tree if it turns out to be an improvement?
> 
>   One of the major problems in gcc is the intertangling of instruction
> selection with register allocation and spill generation.  If these could be
> separated it would almost certainly generate better code and be welcomed with
> open arms!
> 
>> I'd prefer to implement this for gcc, but my advisor wants me to do
>> it for the university's own compiler. Therefore I could also use some
>> arguments for doing it for GCC.
> 
>   Because destroying reload(*) would be an incalculable public service and
> your name will be remembered in history as the one who slew the dragon? ;-)
> 
>     cheers,
>       DaveK

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 19:06 ` Dave Korn
  2009-06-28 22:53   ` Albert Cohen
@ 2009-06-28 23:01   ` Michael Kruse
  2009-06-29  0:03     ` Dave Korn
  2009-06-29 17:05   ` Vladimir Makarov
  2 siblings, 1 reply; 20+ messages in thread
From: Michael Kruse @ 2009-06-28 23:01 UTC (permalink / raw)
  Cc: gcc, Sid.Touati, frederic.brault, Albert.Cohen

[-- Attachment #1: Type: text/plain, Size: 1771 bytes --]

[Posting this again because I noticed that I sent this to Dave Korn only]

Hi Dave,

Dave Korn wrote:
 >   One of the major problems in gcc is the intertangling of instruction
 > selection with register allocation and spill generation.  If these 
could be
 > separated it would almost certainly generate better code and be 
welcomed with
 > open arms!
 >  
The separation of these is one concern of the thesis, although it does
not separate them completely.

 >> I'd prefer to implement this for gcc, but my advisor wants me to do
 >> it for the university's own compiler. Therefore I could also use some
 >> arguments for doing it for GCC.
 >>    
 >
 >   Because destroying reload(*) would be an incalculable public 
service and
 > your name will be remembered in history as the one who slew the 
dragon? ;-)
 >  
Yeah, I already read the reload topic in the wiki ("...equivalent of
Satan..." *g*). And it made me think about whether I really want to do
that. But the good thing (for me) is that I don't have to change the
reload pass for this, as it is an additional pass, not a replacement. So
I have to disappoint you here.

That does not mean that this couldn't help in getting rid of the reload
pass. After the modification of the ddg, the reload pass doesn't have to
take care of optimization (very much) as this has already been done.
Hence, this could greatly simplify the process.

And I love your optimism :-)

Btw, I guess my advisor doesn't accept your argument. The dragon on my
dragon book is a very tough one *g*. And one of my advisor's arguments for
not implementing it for the GCC is that their compiler would be less
complicated. I can't confirm that since I don't have access to it yet.

Regards,
Michael Kruse

-- 
Tardyzentrismus verboten!


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/x-pkcs7-signature, Size: 4401 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 22:53   ` Albert Cohen
@ 2009-06-28 23:56     ` Dave Korn
  2009-06-29  6:13       ` Albert Cohen
                         ` (2 more replies)
  2009-06-29  0:30     ` Michael Kruse
       [not found]     ` <303e1d290906281549t4ebfce81m5152069742fa9a1@mail.gmail.com>
  2 siblings, 3 replies; 20+ messages in thread
From: Dave Korn @ 2009-06-28 23:56 UTC (permalink / raw)
  To: Albert Cohen; +Cc: Dave Korn, reply, gcc, Sid Touati, Frederic Brault

Albert Cohen wrote:

> Unfortunately, the state of the art (more recent than the thesis
> referenced in the original email, see Touati's web page) is limited to
> basic block and software-pipelining scopes, and limited to scheduling.
> 
> Compared to the tasks currently managed by reload, it certainly misses a
> whole bunch of instruction selection and immediate operand/address mode
> corner case problems (depending on the target). It also misses global
> scheduling, but extended BB scheduling is not very hard to approximate,
> as well as superblock scheduling.
> 
> I'm not at all a knowledgeable person to tell you what to do in the case
> of GCC, but for sure saturation/sufficiency-based approaches are not
> sufficient to kill the dragon.

  In a brief exchange I had with Michael off-list, we discussed that.  I
observed that of the things that reload does,
constraint-satisfaction/insn-variant-selection is its primary purpose, and
spill/reload code generation is something it often does suboptimally (and
secondary reloads even worse).  If a clever pass running before reload could
insert explicit spill/reload code at well-chosen places (bearing in mind
class-based register pressure), it could relieve reload of the necessity to
generate its own spill code most of the time, and let it just do what it does
best.  So, yes, it doesn't kill the dragon, reload would need to be retained
and would still need the last-resort ability to do all the stuff it does now -
but mostly it wouldn't have to in practice, and a sleeping dragon is still an
improvement on a wide-awake one that's stomping round in a furious bad temper
burning everything in sight....  Reload would definitely generate better code
if it was fed stuff that avoided exercising its weakest points; sounds like a
pre-conditioning pass based on your techniques might work really well.
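A minimal sketch of such a pre-reload spill inserter (entirely hypothetical: a toy linear IR of `(defined, used)` pairs rather than RTL, and a single register class rather than class-based pressure): track live values through a block and, when pressure exceeds the budget, spill the value whose next use is furthest away (Belady's heuristic), emitting explicit spill/reload pseudo-ops.

```python
def insert_spills(block, num_regs):
    # block: list of (defined, used) pairs -- a toy linear IR, not RTL.
    # When more than num_regs values are live, spill the one whose next
    # use is furthest away (Belady) and emit explicit spill/reload ops.
    out, live, spilled = [], set(), set()

    def next_use(name, i):
        for j in range(i, len(block)):
            if name in block[j][1]:
                return j
        return len(block)                # no remaining use

    for i, (defined, used) in enumerate(block):
        for u in used:                   # reload spilled operands
            if u in spilled:
                out.append(('reload', u))
                spilled.discard(u)
                live.add(u)
        live.add(defined)
        # values without remaining uses are dead after this insn
        live = {v for v in live
                if v == defined or next_use(v, i + 1) < len(block)}
        while len(live) > num_regs:
            victim = max(sorted(live - {defined}),
                         key=lambda v: next_use(v, i + 1))
            out.append(('spill', victim))
            live.discard(victim)
            spilled.add(victim)
        out.append(('insn', defined))
    return out
```

Reload would then mostly see code whose pressure already fits the register file, and could stick to the constraint satisfaction it does best.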

> If you are interested in collaboration on the topic, we are about to start
> extending those approaches to general control-flow, and implementing it
> in a production compiler (not GCC, but a free one, and not LLVM either).

  :) I'm not up on advanced compiler theory, I'll have to sit back at this
point and leave it to the collective genii to get on with!  (I'll just suggest
that one advantage of implementing things in GCC is that it runs in such a
huge range of environments and targets, it's a good way to expose any new
algorithm to "interesting" problems caused by quirky architectures.)

    cheers,
      DaveK

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 23:01   ` Michael Kruse
@ 2009-06-29  0:03     ` Dave Korn
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Korn @ 2009-06-29  0:03 UTC (permalink / raw)
  To: reply; +Cc: gcc, Sid.Touati, frederic.brault, Albert.Cohen

Michael Kruse wrote:
> [Posting this again because I noticed that I sent this to Dave Korn only]

  Wasn't sure if you just meant to keep our idle chatter off the public list
or not.  I think I summed up the salient points of our discussion in my other
post just now, no?

    cheers,
      DaveK

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 22:53   ` Albert Cohen
  2009-06-28 23:56     ` Dave Korn
@ 2009-06-29  0:30     ` Michael Kruse
       [not found]     ` <303e1d290906281549t4ebfce81m5152069742fa9a1@mail.gmail.com>
  2 siblings, 0 replies; 20+ messages in thread
From: Michael Kruse @ 2009-06-29  0:30 UTC (permalink / raw)
  To: Albert Cohen; +Cc: Dave Korn, reply, gcc, Sid Touati, Frederic Brault

[-- Attachment #1: Type: text/plain, Size: 1736 bytes --]

Hi,

Albert Cohen wrote:
>
> Unfortunately, the state of the art (more recent than the thesis 
> referenced in the original email, see Touati's web page) is limited to 
> basic block and software-pipelining scopes, and limited to scheduling.
Do you have any specific publication in mind? At the moment, my concerns are 
basic blocks only. As far as I can see from the web page, you are 
working on an implementation called DDG. I could not find a download 
link (except for some headers and examples).

> However, I will strongly advise anybody (= Kenny Zadeck) looking at a 
> fresh SSA-based backend design to consider such an approach to avoid 
> messing up with pressure-aware-* where * is any backend pass that 
> bears a risk of blowing up register pressure.
I don't know the deep internals of the gcc yet, but I think it is 
doable. My main work is to evaluate the effectiveness of this technique. 
I don't know of any later passes which introduce new register 
requirements. And even if... the mighty reload pass must be able to 
handle this. I am interested in a comparison with what there was before.

> If you are interested in collaboration on the topic, we are about to start 
> extending those approaches to general control-flow, and implementing 
> it in a production compiler (not GCC, but a free one, and not LLVM 
> either).
At the moment, I am most interested in writing a thesis. And since this 
is the topic, this means that I am interested, yes. Especially how you 
generalized it.

It might also be interesting for my advisor and the professor. 
Otherwise, we might end up doing redundant work.

Btw, if it is a free compiler, why not tell us which one?

Regards,
Michael Kruse

-- 
Tardyzentrismus verboten!


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/x-pkcs7-signature, Size: 4401 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 23:56     ` Dave Korn
@ 2009-06-29  6:13       ` Albert Cohen
  2009-06-29 18:04       ` Vladimir Makarov
  2009-07-01  5:58       ` Jeff Law
  2 siblings, 0 replies; 20+ messages in thread
From: Albert Cohen @ 2009-06-29  6:13 UTC (permalink / raw)
  To: Dave Korn; +Cc: reply, gcc, Sid Touati, Frederic Brault

Dave Korn wrote:
> (...)

I fully agree with your arguments. Managing register pressure early, and 
simplifying downstream passes (to avoid polluting them with pressure 
concerns), is a very tempting design. Yet from dream to reality there is 
a long way.

>   :) I'm not up on advanced compiler theory, I'll have to sit back at this
> point and leave it to the collective genii to get on with!  (I'll just suggest
> that one advantage of implementing things in GCC is that it runs in such a
> huge range of environments and targets, it's a good way to expose any new
> algorithm to "interesting" problems caused by quirky architectures.)

Exactly. Much register allocation and scheduling work is completely 
useless for retargettable compilers for real architectures due to the 
lack of realistic constraint modeling. Of course, there are many other 
reasons that make academic research results useless; this is just one 
possible way to miss the point...

Register saturation is a very powerful concept that remains to be 
extended to real conditions and beyond the interplay of allocation and 
scheduling.

Albert

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Register Pressure in Instruction Level Parallelism
       [not found]     ` <303e1d290906281549t4ebfce81m5152069742fa9a1@mail.gmail.com>
@ 2009-06-29  6:28       ` Kenneth Zadeck
  0 siblings, 0 replies; 20+ messages in thread
From: Kenneth Zadeck @ 2009-06-29  6:28 UTC (permalink / raw)
  To: David Edelsohn, Albert Cohen, Dave Korn, Frederic Brault, gcc

David Edelsohn wrote:
> ---------- Forwarded message ----------
> From: Albert Cohen <Albert.Cohen@inria.fr>
> Date: Sun, Jun 28, 2009 at 6:25 PM
> Subject: Re: Register Pressure in Instruction Level Parallelism
> To: Dave Korn <dave.korn.cygwin@googlemail.com>
> Cc: reply@meinersbur.de, gcc@gcc.gnu.org, Sid Touati
> <Sid.Touati@inria.fr>, Frederic Brault <frederic.brault@inria.fr>
>
>
> Hi all,
>
> I'm convinced that saturation and sufficiency based approaches are the
> future of register pressure management.
>
> [Please keep my colleague Sid Touati in CC of this thread, until he
> registers to the list.]
>
> Unfortunately, the state of the art (more recent than the thesis
> referenced in the original email, see Touati's web page) is limited to
> basic block and software-pipelining scopes, and limited to scheduling.
>
> Compared to the tasks currently managed by reload, it certainly misses
> a whole bunch of instruction selection and immediate operand/address
> mode corner case problems (depending on the target). It also misses
> global scheduling, but extended BB scheduling is not very hard to
> approximate, as well as superblock scheduling.
>
>   
> I'm not at all a knowledgeable person to tell you what to do in the
> case of GCC, but for sure saturation/sufficiency-based approaches are
> not sufficient to kill the dragon.
>
> However, I will strongly advise anybody (= Kenny Zadeck) looking at a
> fresh SSA-based backend design to consider such an approach to avoid
> messing up with pressure-aware-* where * is any backend pass that
> bears a risk of blowing up register pressure.
>
>   
I am considering such a design, and I decided late last week to go public
with my project.  I am now working to put up a project in the public
spaces.  I would very much like the help of anyone who is interested in
working in this area, and would ask them to consider working on it in gcc.


kenny
> If you are interested in collaboration on the topic, we are about to start
> extending those approaches to general control-flow, and implementing
> it in a production compiler (not GCC, but a free one, and not LLVM
> either).
>
> Albert
>
>
> Dave Korn wrote:
>   
>> Michael Kruse wrote:
>>
>>     
>>> So, now my questions: How much do you think this could improve
>>> compiled code speed? Would the current GCC/YARA benefit from such an
>>> optimization pass at all? What are the chances that this could get into
>>> the main GCC tree if it turns out to be an improvement?
>>>       
>>  One of the major problems in gcc is the intertangling of instruction
>> selection with register allocation and spill generation.  If these could be
>> separated it would almost certainly generate better code and be welcomed with
>> open arms!
>>
>>     
>>> I'd prefer to implement this for gcc, but my advisor wants me to do
>>> it for the university's own compiler. Therefore I could also use some
>>> arguments for doing it for GCC.
>>>       
>>  Because destroying reload(*) would be an incalculable public service and
>> your name will be remembered in history as the one who slew the dragon? ;-)
>>
>>    cheers,
>>      DaveK
>>     

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 12:37 Register Pressure in Instruction Level Parallelism Michael Kruse
  2009-06-28 19:06 ` Dave Korn
@ 2009-06-29 16:27 ` Vladimir Makarov
  2009-06-29 20:03   ` Michael Kruse
  2009-07-01  6:06   ` Jeff Law
  1 sibling, 2 replies; 20+ messages in thread
From: Vladimir Makarov @ 2009-06-29 16:27 UTC (permalink / raw)
  To: reply; +Cc: gcc

Michael Kruse wrote:
> Hello GCC developers,
>
> I am going to write a Master's Thesis about instruction scheduling 
> based on the technique presented in [1]. This will also include an 
> implementation.
>
> The basic idea is to have an additional pass on the data dependency 
> graph before instruction scheduling and register allocation take 
> place. It estimates the minimal (register sufficiency) and maximal 
> (register saturation) number of registers required by schedules based 
> on that graph.
>
> If the register sufficiency is higher than the physical number of 
> registers, spill code is added to the graph.
>
For me, that is the most interesting part; unfortunately Touati (as I 
remember) says practically nothing about this in his thesis.
> If the register saturation is higher than the physical number of 
> registers, artificial dependencies are added to the graph, such that 
> the instruction scheduler is forced to generate a schedule with less 
> registers.
>
> The aim is that the instruction scheduler can focus on the optimal 
> arrangement of instructions to exploit a maximal amount of instruction 
> level parallelism. [1] also suggests heuristics for all these 
> problems. According to the author, these are "nearly optimal" in 
> practice. The heuristics for estimating register sufficiency and 
> saturation are both optimistic, meaning that it might still be 
> necessary to add more spill code to the final code. Hence, this 
> technique is just an optimization pass and doesn't replace existing 
> register allocation or instruction scheduling passes.
>
> [1] also has a part about loop scheduling, but my thesis will be about 
> basic blocks only. See [2] for a presentation of this technique.
>
> So, now my questions: How much do you think this could improve 
> compiled code speed? Would the current GCC/YARA benefit from such an 
> optimization pass at all?
I think nobody can answer the questions until it is implemented.

I am working on register-pressure sensitive insn scheduling too, and have 
been for the last few months.  I started with the simple heuristic described 
in Muchnick's book: when the current register pressure is high, choose an 
insn that decreases the pressure, and when the pressure is low, use the 
usual scheduling heuristic.  This did not work at all (increasing code size 
by about 3-5% for x86).  The problem was that even if the current register 
pressure is low at a given point, issuing an insn at this point could still 
increase register pressure at later points.  So I am now working on 
evaluating register pressure at later points too (some form of look-ahead).  
This approach is similar to what Touati proposed (but without actually 
adding additional DD arcs).
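The failure mode described here is easy to reproduce. Below is a sketch of my own (a toy DDG, hypothetical names, not GCC's scheduler) of a list scheduler using that simple rule: switch to a pressure-reducing choice only once the current pressure reaches the limit.

```python
def schedule(nodes, deps, uses, pressure_limit, prio):
    # deps[n] = predecessors of n; uses[n] = consumers of n's value;
    # prio = normal scheduling priority (e.g. critical-path length).
    # Returns (issue order, peak number of simultaneously live values).
    scheduled, live, order, peak = set(), set(), [], 0
    while len(order) < len(nodes):
        ready = [n for n in nodes
                 if n not in scheduled and deps.get(n, set()) <= scheduled]

        def delta(n):
            # net change in live values if n is issued now: +1 for its
            # result, -1 for each value whose only remaining use is n
            freed = sum(1 for d in live
                        if uses.get(d, set()) - scheduled == {n})
            return 1 - freed

        if len(live) >= pressure_limit:
            n = min(ready, key=lambda x: (delta(x), -prio[x]))  # relieve pressure
        else:
            n = max(ready, key=lambda x: prio[x])               # usual heuristic
        order.append(n)
        scheduled.add(n)
        live.add(n)
        peak = max(peak, len(live))
        # a value dies once all of its consumers have been scheduled
        live -= {d for d in live if uses.get(d, set()) <= scheduled}
    return order, peak
```

On two independent chains a→b and c→d with a 2-register budget, a priority that favors starting the second chain early still peaks at 3 live values, while a priority that finishes the first chain stays at 2: the harmful decision is taken while pressure is still low, which is exactly why some form of look-ahead is needed.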

If you are going to work on this project, some small advice about 
evaluating register sufficiency.  I found that register pressure is 
already practically minimized before insn scheduling (I suspect that it 
is mostly done in TER).  By the way, it also means that tree balancing 
(which is actually a much more complicated issue because it is not a tree 
but a DAG) is not necessary for the backend, as it was done in Preston 
Briggs' project (and mentioned in Ken Zadeck's proposed pass stack).

> What are the chances that this could get into the main GCC tree if it 
> turns out to be an improvement?
>
I don't see any problem with getting the code into the main GCC tree if 
you get even a 1-2% improvement.  There are some technical questions, 
though (like the code fitting gcc practice standards) and the commitment 
to maintain the code.  But these problems could definitely be overcome.
> There has been a short discussion on this mailing list already [3] 
> some years ago (Note if you read this: intLP has been used to compare 
> the heuristic to the optimal result only). To my knowledge, there 
> hasn't been any more on this topic yet (anywhere).
>
> I'd prefer to implement this for gcc, but my advisor wants me to 
> do it for the university's own compiler. Therefore I could also use 
> some arguments for doing it for GCC.
>
Personally, I'd love to see this work done in GCC.  I believe that 
research work in the compiler field should be done in real industrial 
compilers because that is the final target of the research.  I saw many 
times that researchers report big improvements of their algorithms 
because they are using toy compilers, but when the same algorithms are 
implemented in GCC they give practically nothing.  For me, a toy compiler 
criterion is defined by how good the compiler is on some generally used 
compiler benchmarks such as SPEC (the better the existing compiler's 
performance, the harder it is to get the same percentage improvement).  
So if your university compiler's performance is close (for example, on 
SPECFP2000), I think you could use it to get a real optimization impact.

On the other hand, you will definitely get better numbers (and spend 
less time) using a toy compiler than GCC, and your research could 
probably look more successful from an academic point of view.
>
> [1] http://www.prism.uvsq.fr/~touati/thesis.html
> [2] 
> http://tel.archives-ouvertes.fr/docs/00/04/72/88/ANNEX/tel-00007405.ppt
> [3] http://gcc.gnu.org/ml/gcc/2002-07/msg00565.html
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 19:06 ` Dave Korn
  2009-06-28 22:53   ` Albert Cohen
  2009-06-28 23:01   ` Michael Kruse
@ 2009-06-29 17:05   ` Vladimir Makarov
  2 siblings, 0 replies; 20+ messages in thread
From: Vladimir Makarov @ 2009-06-29 17:05 UTC (permalink / raw)
  To: Dave Korn; +Cc: reply, gcc

Dave Korn wrote:
> Michael Kruse wrote:
>
>   
>> So, now my questions: How much do you think this could improve
>> compiled code speed? Would the current GCC/YARA benefit from such an
>> optimization pass at all? What are the chances that this could get into
>> the main GCC tree if it turns out to be an improvement?
>>     
>
>   One of the major problems in gcc is the intertangling of instruction
> selection with register allocation and spill generation.  If these could be
> separated it would almost certainly generate better code and be welcomed with
> open arms!
>
>   
Although I mostly agree with this, I am not sure about the universality 
of this claim.  As an example, choosing the x86 add insn instead of lea 
earlier could result in an additional spill, and vice versa, choosing lea 
could result in the use of 3 different registers and bigger code.
>> I'd prefer to implement this for gcc, but my advisor wants me to do
>> it for the university's own compiler. Therefore I could also use some
>> arguments for doing it for GCC.
>>     
>
>   Because destroying reload(*) would be an incalculable public service and
> your name will be remembered in history as the one who slew the dragon? ;-)
>
>    

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 23:56     ` Dave Korn
  2009-06-29  6:13       ` Albert Cohen
@ 2009-06-29 18:04       ` Vladimir Makarov
  2009-07-01  6:09         ` Jeff Law
  2009-07-01  5:58       ` Jeff Law
  2 siblings, 1 reply; 20+ messages in thread
From: Vladimir Makarov @ 2009-06-29 18:04 UTC (permalink / raw)
  To: Dave Korn; +Cc: Albert Cohen, reply, gcc, Sid Touati, Frederic Brault

Dave Korn wrote:
> Albert Cohen wrote:
>
>   
>> Unfortunately, the state of the art (more recent than the thesis
>> referenced in the original email, see Touati's web page) is limited to
>> basic block and software-pipelining scopes, and limited to scheduling.
>>
>> Compared to the tasks currently managed by reload, it certainly misses a
>> whole bunch of instruction selection and immediate operand/address mode
>> corner case problems (depending on the target). It also misses global
>> scheduling, but extended BB scheduling is not very hard to approximate,
>> as well as superblock scheduling.
>>
>> I'm not at all a knowledgeable person to tell you what to do in the case
>> of GCC, but for sure saturation/sufficiency-based approaches are not
>> sufficient to kill the dragon.
>>     
>
>   In a brief exchange I had with Michael off-list, we discussed that.  I
> observed that of the things that reload does,
> constraint-satisfaction/insn-variant-selection is its primary purpose, and
> spill/reload code generation is something it often does suboptimally (and
> secondary reloads even worse).  If a clever pass running before reload could
> insert explicit spill/reload code at well-chosen places (bearing in mind
> class-based register pressure), it could relieve reload of the necessity to
> generate its own spill code most of the time, and let it just do what it does
> best.
IRA actually already inserts spill code in the most important places (on 
loop borders).  Besides loop regions, IRA could be extended to other 
regions (and even bb parts, to relieve pressure inside the blocks).  I am 
going to work on it to evaluate how much it could give.

Most spill/restore code, at least for x86, is generated by reload because 
one insn operand should be a register.  Some cooperation between IRA and 
reload on this issue, I hope, has the potential to improve code 
performance further.
>   So, yes, it doesn't kill the dragon, reload would need to be retained
> and would still need the last-resort ability to do all the stuff it does now -
> but mostly it wouldn't have to in practice, and a sleeping dragon is still an
> improvement on a wide-awake one that's stomping round in a furious bad temper
> burning everything in sight....  Reload would definitely generate better code
> if it was fed stuff that avoided exercising its weakest points; sounds like a
> pre-conditioning pass based on your techniques might work really well.
>
>   

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register Pressure in Instruction Level Parallelism
  2009-06-29 16:27 ` Vladimir Makarov
@ 2009-06-29 20:03   ` Michael Kruse
  2009-06-29 20:40     ` Vladimir Makarov
  2009-07-01  6:06   ` Jeff Law
  1 sibling, 1 reply; 20+ messages in thread
From: Michael Kruse @ 2009-06-29 20:03 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: gcc

[-- Attachment #1: Type: text/plain, Size: 3221 bytes --]



Vladimir Makarov wrote:
> Michael Kruse wrote:
>> If the register sufficiency is higher than the physical number of 
>> registers, spill code is added to the graph.
>>
> For me, that is the most interesting part, unfortunately Touti (as I 
> remember) in his thesis say practically nothing about this.
In the thesis, a modified Poletto algorithm is presented to add spill code.
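For readers unfamiliar with the reference: Poletto here refers to linear-scan register allocation. Below is a sketch of the classic Poletto & Sarkar algorithm (not the thesis' modified variant, which is not described in this thread; the interval representation and names are my own). When no register is free, it spills the live interval with the furthest end point.

```python
import bisect

def linear_scan(intervals, num_regs):
    # intervals: (name, start, end) live intervals -- classic
    # Poletto & Sarkar linear scan; returns ({name: reg}, spilled names)
    alloc, spilled = {}, set()
    free = list(range(num_regs))
    active = []  # (end, name), kept sorted by end point
    for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
        while active and active[0][0] < start:   # expire finished intervals
            _, old = active.pop(0)
            free.append(alloc[old])
        if free:
            alloc[name] = free.pop()
            bisect.insort(active, (end, name))
        else:
            furthest_end, furthest = active[-1]
            if furthest_end > end:               # spill the longer-lived value
                alloc[name] = alloc.pop(furthest)
                spilled.add(furthest)
                active.pop()
                bisect.insort(active, (end, name))
            else:
                spilled.add(name)
    return alloc, spilled
```

The original operates on live intervals of a linear instruction sequence; adapting the spill decision to a data dependency graph, as the thesis apparently does, is the non-trivial part.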

>>
>> So, now my questions: How much do you think this could improve 
>> compiled code speed? Would the current GCC/YARA benefit from such an 
>> optimization pass at all?
> I think nobody can answer the questions until it is implemented.
My main reason for asking is that somebody might have said that it was 
not worth the effort, which would have saved me a lot of work.

> If you are going to work on this project, some small advice about 
> evaluating register sufficiency.  I found that register pressure is 
> already practically minimized before insn-scheduling (I suspect that 
> it is mostly done in TER).  By the way, it also means that tree 
> balancing (which is actually a much more complicated issue because it 
> is not a tree but a DAG) is not necessary for the backend, as was done 
> in Preston Briggs' project (and mentioned in Ken Zadeck's proposed 
> pass stack).
Thank you. I am grateful for any advice.
>> What are the chances that this could get into the main GCC tree if it 
>> turns out to be an improvement?
> I don't see any problem with getting the code into the main GCC tree if 
> you get even a 1-2% improvement, although there are some technical 
> questions (like the code meeting GCC coding standards) and the 
> commitment to maintain the code.  But these problems can definitely be 
> overcome.
I'd be willing to do this.
>> I'd prefer to implement this for GCC, but my advisor wants me to do 
>> it for the university's own compiler.  Therefore I could also use 
>> some arguments for why to do it in GCC.
>>
> Personally, I'd love to see this work done in GCC.  I believe research 
> work in the compiler field should be done in real industrial compilers, 
> because that is the final target of the research.  I have seen many 
> times that researchers report big improvements for their algorithms 
> because they are using toy compilers, but when the same algorithms are 
> implemented in GCC they give practically nothing.  For me, the 
> criterion for a toy compiler is how well it does on commonly used 
> compiler benchmarks such as SPEC (the better the existing compiler's 
> performance, the harder it is to get the same percentage improvement).  
> So if your university compiler's performance is close, for example on 
> SPECFP2000, I think you could use it to get a real optimization impact.
>
> On the other hand, you will definitely get better numbers (and spend 
> less time) using a toy compiler rather than GCC, and your research 
> would probably look more successful from the academic point of view.
I don't think it is a toy compiler.  Industry is involved (embedded 
systems) and it also has multiple back-ends.  The problem with it is 
that it is (at least partially) proprietary, and I don't know about the 
other part.  In any case, you can't download it on the web.

Regards,
Michael Kruse

-- 
Tardyzentrismus verboten!




* Re: Register Pressure in Instruction Level Parallelism
  2009-06-29 20:03   ` Michael Kruse
@ 2009-06-29 20:40     ` Vladimir Makarov
  2009-06-29 21:04       ` Michael Kruse
  2009-06-30 15:01       ` Albert Cohen
  0 siblings, 2 replies; 20+ messages in thread
From: Vladimir Makarov @ 2009-06-29 20:40 UTC (permalink / raw)
  To: reply; +Cc: gcc

Michael Kruse wrote:
> Vladimir Makarov wrote:
>> Michael Kruse wrote:
>>> If the register sufficiency is higher than the physical number of 
>>> registers, spill code is added to the graph.
>>>
>> For me, that is the most interesting part; unfortunately, Touati (as 
>> I remember) says practically nothing about this in his thesis.
> In the thesis, a modified Poletto algorithm is presented to add spill 
> code.
>
>
I've just checked the thesis again.  I don't think decreasing register 
pressure through spilling will work well.  First of all, Poletto's 
linear-scan RA is worse than the Chaitin-Briggs approach.  Even its 
major improvement, extended linear scan (ELS), is worse than 
Chaitin-Briggs.  My experience with an ELS implementation in GCC has 
shown me this, although the original article about ELS states the 
opposite (the comparison in the article was done in GCC, but against the 
new-ra project, which was an unsuccessful implementation of 
Chaitin-Briggs RA, and it was done only on ppc64; I am sure that on 
x86/x86_64 ELS would behave even worse).  That is about the basic RA 
spill in Touati's thesis.

The bigger problem is that decreasing register pressure does not take 
live-range splitting into account, which good modern RAs do.  From this 
point of view, the approach to decreasing register pressure in Bob 
Morgan's book is more promising because it also does live-range 
splitting (by the way, I tried Morgan's approach with the old RA more 
than 5 years ago, and it did not look compelling to me at that time).
>>> I'd prefer to implement this for GCC, but my advisor wants me to do 
>>> it for the university's own compiler.  Therefore I could also use 
>>> some arguments for why to do it in GCC.
>>>
>> Personally, I'd love to see this work done in GCC.  I believe 
>> research work in the compiler field should be done in real industrial 
>> compilers, because that is the final target of the research.  I have 
>> seen many times that researchers report big improvements for their 
>> algorithms because they are using toy compilers, but when the same 
>> algorithms are implemented in GCC they give practically nothing.  For 
>> me, the criterion for a toy compiler is how well it does on commonly 
>> used compiler benchmarks such as SPEC (the better the existing 
>> compiler's performance, the harder it is to get the same percentage 
>> improvement).  So if your university compiler's performance is close, 
>> for example on SPECFP2000, I think you could use it to get a real 
>> optimization impact.
>>
>> On the other hand, you will definitely get better numbers (and spend 
>> less time) using a toy compiler rather than GCC, and your research 
>> would probably look more successful from the academic point of view.
> I don't think it is a toy compiler.  Industry is involved (embedded 
> systems) and it also has multiple back-ends.  The problem with it is 
> that it is (at least partially) proprietary, and I don't know about 
> the other part.  In any case, you can't download it on the web.
Could you tell us which compiler this is?


* Re: Register Pressure in Instruction Level Parallelism
  2009-06-29 20:40     ` Vladimir Makarov
@ 2009-06-29 21:04       ` Michael Kruse
  2009-06-30 15:01       ` Albert Cohen
  1 sibling, 0 replies; 20+ messages in thread
From: Michael Kruse @ 2009-06-29 21:04 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: reply, gcc

Vladimir Makarov wrote:
> Michael Kruse wrote:
>> In the thesis, a modified Poletto algorithm is presented to add spill 
>> code.
> I've just checked the thesis again.  I don't think decreasing register 
> pressure through spilling will work well.  First of all, Poletto's 
> linear-scan RA is worse than the Chaitin-Briggs approach.  Even its 
> major improvement, extended linear scan (ELS), is worse than 
> Chaitin-Briggs.  My experience with an ELS implementation in GCC has 
> shown me this, although the original article about ELS states the 
> opposite (the comparison in the article was done in GCC, but against 
> the new-ra project, which was an unsuccessful implementation of 
> Chaitin-Briggs RA, and it was done only on ppc64; I am sure that on 
> x86/x86_64 ELS would behave even worse).  That is about the basic RA 
> spill in Touati's thesis.
>
> The bigger problem is that decreasing register pressure does not take 
> live-range splitting into account, which good modern RAs do.  From 
> this point of view, the approach to decreasing register pressure in 
> Bob Morgan's book is more promising because it also does live-range 
> splitting (by the way, I tried Morgan's approach with the old RA more 
> than 5 years ago, and it did not look compelling to me at that time).
That this algorithm is used in the thesis does not mean that I have to 
use that approach.  Part of my thesis is also to evaluate different 
heuristics and compare them to each other.  This one would be something 
I'd try.
> Could you tell us which compiler this is?
Unfortunately not (yet).  I have very little information myself; they 
just keep telling me how great it is.  But I will get the source this 
week.

Regards,
Michael Kruse

-- 
Tardyzentrismus verboten!




* Re: Register Pressure in Instruction Level Parallelism
  2009-06-29 20:40     ` Vladimir Makarov
  2009-06-29 21:04       ` Michael Kruse
@ 2009-06-30 15:01       ` Albert Cohen
  1 sibling, 0 replies; 20+ messages in thread
From: Albert Cohen @ 2009-06-30 15:01 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: reply, gcc, Sid Touati

Vladimir Makarov wrote:
> I've just checked the thesis again.  I don't think decreasing register 
> pressure through spilling will work well.  First of all, Poletto's 
> linear-scan RA is worse than the Chaitin-Briggs approach.  Even its 
> major improvement, extended linear scan (ELS), is worse than 
> Chaitin-Briggs.  My experience with an ELS implementation in GCC has 
> shown me this, although the original article about ELS states the 
> opposite (the comparison in the article was done in GCC, but against 
> the new-ra project, which was an unsuccessful implementation of 
> Chaitin-Briggs RA, and it was done only on ppc64; I am sure that on 
> x86/x86_64 ELS would behave even worse).  That is about the basic RA 
> spill in Touati's thesis.

Touati's thesis has a section on spill insertion in the DDG to deal with 
excessive register sufficiency, but he did not have time to implement it.

However, there is a new implementation, in C, that computes saturation 
and sufficiency and inserts the extra dependences and spills.  It is not 
yet as complete as the previous version (which was in C++ and based on 
the proprietary LEDA graph-algorithm toolkit), but it is available to 
those of you who are interested.  The license will a priori be LGPL for 
the time being, but I guess Sid Touati is willing to assign his 
copyright for specific versions and transfer the code directly to GCC if 
needed.  In any case, please talk to Sid directly (in CC).
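For orientation, the quantities that implementation computes can be modeled in a few lines of Python (an illustrative sketch only, not the code Albert mentions; the names `pressure`, `schedule`, and `uses` are invented here). The register pressure of one fixed schedule is the peak number of simultaneously live values; Touati's register saturation and sufficiency are then the maximum and minimum of this number over all schedules the DDG permits.

```python
def pressure(schedule, uses):
    """schedule: value names in issue order (one value defined per insn).
    uses: dict mapping a value to the list of values it reads.
    Returns the peak number of simultaneously live values."""
    # A value dies at the insn that reads it last.
    last_use = {}
    for t, v in enumerate(schedule):
        for s in uses.get(v, []):
            last_use[s] = t
    live, peak = set(), 0
    for t, v in enumerate(schedule):
        live.add(v)                  # v becomes live at its definition
        peak = max(peak, len(live))
        # Drop values whose last use is at or before this point
        # (a value that is never read dies immediately after counting).
        live = {s for s in live if last_use.get(s, t) > t}
    return peak

# c = op(a, b); d = op(c): pressure 3, since a, b, c overlap at c's def.
print(pressure(['a', 'b', 'c', 'd'], {'c': ['a', 'b'], 'd': ['c']}))
```

Saturation then measures how bad a scheduler allowed maximal freedom could make this number, and sufficiency how good it could possibly get; the DDG pre-conditioning discussed in this thread constrains schedules so the realized pressure stays within the register file.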

Albert


* Re: Register Pressure in Instruction Level Parallelism
  2009-06-28 23:56     ` Dave Korn
  2009-06-29  6:13       ` Albert Cohen
  2009-06-29 18:04       ` Vladimir Makarov
@ 2009-07-01  5:58       ` Jeff Law
  2 siblings, 0 replies; 20+ messages in thread
From: Jeff Law @ 2009-07-01  5:58 UTC (permalink / raw)
  To: Dave Korn; +Cc: Albert Cohen, reply, gcc, Sid Touati, Frederic Brault

Dave Korn wrote:
> Albert Cohen wrote:
>
>   
>> Unfortunately, the state of the art (more recent than the thesis
>> referenced in the original email, see Touati's web page) is limited to
>> basic block and software-pipelining scopes, and limited to scheduling.
>>
>> Compared to the tasks currently managed by reload, it certainly misses a
>> whole bunch of instruction selection and immediate operand/address mode
>> corner case problems (depending on the target). It also misses global
>> scheduling, but extended BB scheduling is not very hard to approximate,
>> as well as superblock scheduling.
>>
>> I'm not at all a knowledgeable person to tell you what to do in the case
>> of GCC, but for sure saturation/sufficiency-based approaches are not
>> sufficient to kill the dragon.
>>     
>   If a clever pass running before reload could
> insert explicit spill/reload code at well-chosen places (bearing in mind
> class-based register pressure), it could relieve reload of the necessity to
> generate its own spill code most of the time, and let it just do what it does
> best.  
This is basically what I've been working on.  Effectively there are two 
stages.

The first is a localization phase, which localizes pseudos that didn't 
get hard registers so they live within basic blocks.  This obviously 
creates new pseudos.

I then run a block-local register allocation phase to attempt to 
allocate those newly created pseudos (or any pre-existing block locals 
that didn't get hard registers) to hard registers.  The block local 
allocator also has some spilling capabilities (which can create, yes, 
even more new pseudos which it'll then proceed to assign to hard registers).

Reload is then allowed to run in the normal fashion for the time being, 
though obviously the long term goal is a drastically simpler reload.

Right now the block-local allocator doesn't deal with constraint 
mismatches, but there's no reason it can't in the future.  There are 
some clear problems with double-word values, but I'm happy with the code 
so far.

Note that I don't iterate IRA -- I went down that path for a while, but 
couldn't ever get comfortable with the hacks necessary to ensure 
iteration would ultimately finish.
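The localization stage described above can be caricatured in a few lines of Python (a hypothetical model with invented names and data structures, nothing like the real GCC implementation): a pseudo that failed to get a hard register is rewritten into one fresh pseudo per basic block, linked through its stack slot, so the block-local allocator only ever sees short live ranges.

```python
from itertools import count

fresh = count(100)   # numbers for newly created pseudos (arbitrary start)

def localize(blocks, pseudo):
    """blocks: list of basic blocks; each block is a list of
    (opcode, dest, sources) insns over pseudo numbers.
    Rewrites `pseudo` into block-local pseudos linked through memory."""
    out = []
    for insns in blocks:
        local = None                 # this block's replacement pseudo
        new_insns = []
        for op, dest, srcs in insns:
            if pseudo in srcs and local is None:
                local = next(fresh)
                new_insns.append(('load', local, ['slot']))   # reload at first use
            srcs = [local if s == pseudo else s for s in srcs]
            if dest == pseudo:
                if local is None:
                    local = next(fresh)
                new_insns.append((op, local, srcs))
                new_insns.append(('store', 'slot', [local]))  # spill after def
            else:
                new_insns.append((op, dest, srcs))
        out.append(new_insns)
    return out

# Pseudo 7 is defined in block 0 and used in block 1; after localization
# each block has its own short-lived pseudo.
before = [[('add', 7, [1, 2])], [('mul', 3, [7, 7])]]
after = localize(before, 7)
print(after)
```

The point of the transformation is exactly what the mail says: the new pseudos (100 and 101 in this toy run) never cross a block boundary, so a purely block-local allocator can try to assign them to hard registers.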



Jeff


* Re: Register Pressure in Instruction Level Parallelism
  2009-06-29 16:27 ` Vladimir Makarov
  2009-06-29 20:03   ` Michael Kruse
@ 2009-07-01  6:06   ` Jeff Law
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff Law @ 2009-07-01  6:06 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: reply, gcc

Vladimir Makarov wrote:
>>
>> So, now my questions: How much do you think this could improve 
>> compiled code speed? Would the current GCC/YARA benefit from such an 
>> optimization pass at all?
> I think nobody can answer the questions until it is implemented.
Agreed.
>
> If you are going to work on this project, some small advice about 
> evaluating register sufficiency.  I found that register pressure is 
> already practically minimized before insn-scheduling (I suspect that 
> it is mostly done in TER).
I was rather surprised to stumble upon this myself -- I find myself 
regularly looking at functions where we spilled and couldn't see any way 
to avoid spilling without some kind of range splitting of objects in 
non-obvious ways.

The one case I did see where we regularly spilled, and which could be 
avoided, was excessively long ranges created by loading incoming 
arguments into pseudos.  I haven't bothered to look into ways to improve 
that situation.


Jeff


* Re: Register Pressure in Instruction Level Parallelism
  2009-06-29 18:04       ` Vladimir Makarov
@ 2009-07-01  6:09         ` Jeff Law
  2009-07-02 21:53           ` Vladimir Makarov
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Law @ 2009-07-01  6:09 UTC (permalink / raw)
  To: Vladimir Makarov
  Cc: Dave Korn, Albert Cohen, reply, gcc, Sid Touati, Frederic Brault

Vladimir Makarov wrote:
> Dave Korn wrote:
>> Albert Cohen wrote:
>>
>>  
>>> Unfortunately, the state of the art (more recent than the thesis
>>> referenced in the original email, see Touati's web page) is limited to
>>> basic block and software-pipelining scopes, and limited to scheduling.
>>>
>>> Compared to the tasks currently managed by reload, it certainly 
>>> misses a
>>> whole bunch of instruction selection and immediate operand/address mode
>>> corner case problems (depending on the target). It also misses global
>>> scheduling, but extended BB scheduling is not very hard to approximate,
>>> as well as superblock scheduling.
>>>
>>> I'm not at all a knowledgeable person to tell you what to do in the 
>>> case
>>> of GCC, but for sure saturation/sufficiency-based approaches are not
>>> sufficient to kill the dragon.
>>>     
>>
>>   In a brief exchange I had with Michael off-list, we discussed that.  I
>> observed that of the things that reload does,
>> constraint-satisfaction/insn-variant-selection is its primary 
>> purpose, and
>> spill/reload code generation is something it often does suboptimally 
>> (and
>> secondary reloads even worse).  If a clever pass running before 
>> reload could
>> insert explicit spill/reload code at well-chosen places (bearing in mind
>> class-based register pressure), it could relieve reload of the 
>> necessity to
>> generate its own spill code most of the time, and let it just do what 
>> it does
>> best.
> IRA actually already inserts spill code in the most important places 
> (on loop borders).  Besides loop regions, IRA could be extended to 
> other regions (and even to parts of blocks, to relieve pressure inside 
> the blocks).  I am going to work on that to evaluate how much it could 
> give.
I've already got some code to do this -- I've pondered more than once 
pushing it through independently of the other allocation/reload work I'm 
doing.  I could probably be convinced to put the block local 
allocation/spilling on hold to focus on benchmarking and tuning my bits 
to generate spill code.

Jeff


* Re: Register Pressure in Instruction Level Parallelism
  2009-07-01  6:09         ` Jeff Law
@ 2009-07-02 21:53           ` Vladimir Makarov
  0 siblings, 0 replies; 20+ messages in thread
From: Vladimir Makarov @ 2009-07-02 21:53 UTC (permalink / raw)
  To: Jeff Law; +Cc: Dave Korn, Albert Cohen, reply, gcc, Sid Touati, Frederic Brault

Jeff Law wrote:
> Vladimir Makarov wrote:
>> Dave Korn wrote:
>>>   In a brief exchange I had with Michael off-list, we discussed 
>>> that.  I
>>> observed that of the things that reload does,
>>> constraint-satisfaction/insn-variant-selection is its primary 
>>> purpose, and
>>> spill/reload code generation is something it often does suboptimally 
>>> (and
>>> secondary reloads even worse).  If a clever pass running before 
>>> reload could
>>> insert explicit spill/reload code at well-chosen places (bearing in 
>>> mind
>>> class-based register pressure), it could relieve reload of the 
>>> necessity to
>>> generate its own spill code most of the time, and let it just do 
>>> what it does
>>> best.
>> IRA actually already inserts spill code in the most important places 
>> (on loop borders).  Besides loop regions, IRA could be extended to 
>> other regions (and even to parts of blocks, to relieve pressure 
>> inside the blocks).  I am going to work on that to evaluate how much 
>> it could give.
> I've already got some code to do this -- I've pondered more than once 
> pushing it through independently of the other allocation/reload work 
> I'm doing.  I could probably be convinced to put the block local 
> allocation/spilling on hold to focus on benchmarking and tuning my 
> bits to generate spill code.

That is great.  I look forward to seeing the code.


end of thread, other threads:[~2009-07-02 21:53 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-28 12:37 Register Pressure in Instruction Level Parallelism Michael Kruse
2009-06-28 19:06 ` Dave Korn
2009-06-28 22:53   ` Albert Cohen
2009-06-28 23:56     ` Dave Korn
2009-06-29  6:13       ` Albert Cohen
2009-06-29 18:04       ` Vladimir Makarov
2009-07-01  6:09         ` Jeff Law
2009-07-02 21:53           ` Vladimir Makarov
2009-07-01  5:58       ` Jeff Law
2009-06-29  0:30     ` Michael Kruse
     [not found]     ` <303e1d290906281549t4ebfce81m5152069742fa9a1@mail.gmail.com>
2009-06-29  6:28       ` Fwd: " Kenneth Zadeck
2009-06-28 23:01   ` Michael Kruse
2009-06-29  0:03     ` Dave Korn
2009-06-29 17:05   ` Vladimir Makarov
2009-06-29 16:27 ` Vladimir Makarov
2009-06-29 20:03   ` Michael Kruse
2009-06-29 20:40     ` Vladimir Makarov
2009-06-29 21:04       ` Michael Kruse
2009-06-30 15:01       ` Albert Cohen
2009-07-01  6:06   ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).