Understanding Scheduling

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Understanding Scheduling
@ 2010-03-19 16:12 Ian Bolton
  2010-03-19 16:34 ` Alexander Monakov
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Ian Bolton @ 2010-03-19 16:12 UTC (permalink / raw)
  To: gcc

Hi folks!

I've moved on from register allocation (see Understanding IRA thread)
and onto scheduling.

In particular, I am investigating the effectiveness of the sched1
pass on our architecture and the associated interblock-scheduling
optimisation.

Let's start with sched1 ...

For our architecture at least, it seems like Richard Earnshaw is
right that sched1 is generally bad when you are using -Os, because
it can increase register pressure and cause extra spill/fill code when
you move independent instructions in between dependent instructions.

For example:

LOAD c2,c1[0]
LOAD c3,c1[1]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[2]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[3]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[4]
ADD c2,c2,c3  # depends on LOAD above it (might stall)

might become:

LOAD c2,c1[0]
LOAD c3,c1[1]
LOAD c4,c1[2] # independent of first two LOADS
LOAD c5,c1[3] # independent of first two LOADS
ADD c2,c2,c3  # not dependent on preceding two insns (avoids stall)
LOAD c3,c1[4]
ADD c2,c2,c4  # not dependent on preceding three insns (avoids stall)
...

This is a nice effect if your LOAD instructions have a latency of 3,
so this should lead to performance increases, and indeed this is
what I see for some low-reg-pressure Nullstone cases.  Turning
sched1 off therefore causes a regression on these cases.

However, this pipeline-type effect may increase your register
pressure such that caller-save regs are required and extra spill/fill
code needs to be generated.  This happens for some other Nullstone
cases, and so it is good to have sched1 turned off for them!

It's therefore looking like some kind of clever hybrid is required.

I mention all this because I was wondering which other architectures
have turned off sched1 for -Os?  More importantly, I was wondering
if anyone else had considered creating some kind of clever hybrid
that only uses sched1 when it will increase performance without
increasing register pressure?

Or perhaps I could make a heuristic based on the balanced-ness of the
tree?  (I see sched1 does a lot better if the tree is balanced, since
it has more options to play with.)

Now onto interblock-scheduling ...

As we all know, you can't have interblock-scheduling enabled unless
you use the sched1 pass, so if sched1 is off then interblock is
irrelevant.  For now, let's assume we are going to make some clever
hybrid that allows sched1 when we think it will increase performance
for Os and we are going to keep sched1 on for O2 and O3.

As I understand it, interblock-scheduling enlarges the scope of
sched1, such that you can insert independent insns from a
completely different block in between dependent insns in this
block.  As well as potentially amortizing stalls on high latency
insns, we also get the chance to do "meatier" work in the destination
block and leave less to do in the source block.  I don't know if this
is a deliberate effect of interblock-scheduling or if it is just
a happy side-effect.

Anyway, the reason I mention interblock-scheduling is that I see it
doing seemingly intelligent moves, but then the later BB-reorder pass
is juggling blocks around such that we end up with extra code inside
hot loops!  I assume this is because the scheduler and BB-reorderer
are largely ignorant of each other, and so good intentions on the
part of the former can be scuppered by the latter.

I was wondering if anyone else has witnessed this madness on their
architecture?  Maybe it is a bug with BB-reorder?  Or maybe it should
only be enabled when function profiling information (e.g. gcov) is
available?  Or maybe it is not a high-priority thing for anyone to
think about because no one uses interblock-scheduling?

If anyone can shed some light on the above, I'd greatly appreciate
it.  For now, I will continue my experiments with selective enabling
of sched1 for -Os.

Best regards,
Ian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Understanding Scheduling
  2010-03-19 16:12 Understanding Scheduling Ian Bolton
@ 2010-03-19 16:34 ` Alexander Monakov
  2010-03-19 16:43   ` Ian Bolton
  2010-03-19 16:41 ` Vladimir N. Makarov
  2010-03-19 16:48 ` Steven Bosscher
  2 siblings, 1 reply; 17+ messages in thread
From: Alexander Monakov @ 2010-03-19 16:34 UTC (permalink / raw)
  To: Ian Bolton; +Cc: gcc


On Fri, 19 Mar 2010, Ian Bolton wrote:

> Let's start with sched1 ...
> 
> For our architecture at least, it seems like Richard Earnshaw is
> right that sched1 is generally bad when you are using -Os, because
> it can increase register pressure and cause extra spill/fill code when
> you move independent instructions in between dependent instructions.

Please note that Vladimir Makarov implemented register pressure-aware sched1
for GCC 4.5, activated with  -fsched-pressure.  I thought I should mention
this because your e-mail omits it completely, so it's hard to tell whether you
tested it.

Best regards,
Alexander Monakov

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Understanding Scheduling
  2010-03-19 16:34 ` Alexander Monakov
@ 2010-03-19 16:43   ` Ian Bolton
  0 siblings, 0 replies; 17+ messages in thread
From: Ian Bolton @ 2010-03-19 16:43 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc

> > Let's start with sched1 ...
> >
> > For our architecture at least, it seems like Richard Earnshaw is
> > right that sched1 is generally bad when you are using -Os, because
> > it can increase register pressure and cause extra spill/fill code
> when
> > you move independent instructions in between dependent instructions.
> 
> Please note that Vladimir Makarov implemented register pressure-aware
> sched1
> for GCC 4.5, activated with  -fsched-pressure.  I thought I should
> mention
> this because your e-mail omits it completely, so it's hard to tell
> whether you
> tested it.
> 

Hi Alexander,

We are on GCC 4.4 at the moment.  I don't see us moving up in the
near future, but I could apply the patch and see what it does for us.

Cheers,
Ian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Understanding Scheduling
  2010-03-19 16:12 Understanding Scheduling Ian Bolton
  2010-03-19 16:34 ` Alexander Monakov
@ 2010-03-19 16:41 ` Vladimir N. Makarov
  2010-03-19 17:02   ` Ian Bolton
  2010-03-19 16:48 ` Steven Bosscher
  2 siblings, 1 reply; 17+ messages in thread
From: Vladimir N. Makarov @ 2010-03-19 16:41 UTC (permalink / raw)
  To: Ian Bolton; +Cc: gcc

On 03/19/2010 12:09 PM, Ian Bolton wrote:
> Hi folks!
>
> I've moved on from register allocation (see Understanding IRA thread)
> and onto scheduling.
>
> In particular, I am investigating the effectiveness of the sched1
> pass on our architecture and the associated interblock-scheduling
> optimisation.
>
>
> Let's start with sched1 ...
>
> For our architecture at least, it seems like Richard Earnshaw is
> right that sched1 is generally bad when you are using -Os, because
> it can increase register pressure and cause extra spill/fill code when
> you move independent instructions in between dependent instructions.
>
> For example:
>
> LOAD c2,c1[0]
> LOAD c3,c1[1]
> ADD c2,c2,c3  # depends on LOAD above it (might stall)
> LOAD c3,c1[2]
> ADD c2,c2,c3  # depends on LOAD above it (might stall)
> LOAD c3,c1[3]
> ADD c2,c2,c3  # depends on LOAD above it (might stall)
> LOAD c3,c1[4]
> ADD c2,c2,c3  # depends on LOAD above it (might stall)
>
> might become:
>
> LOAD c2,c1[0]
> LOAD c3,c1[1]
> LOAD c4,c1[2] # independent of first two LOADS
> LOAD c5,c1[3] # independent of first two LOADS
> ADD c2,c2,c3  # not dependent on preceding two insns (avoids stall)
> LOAD c3,c1[4]
> ADD c2,c2,c4  # not dependent on preceding three insns (avoids stall)
> ...
>
> This is a nice effect if your LOAD instructions have a latency of 3,
> so this should lead to performance increases, and indeed this is
> what I see for some low-reg-pressure Nullstone cases.  Turning
> sched1 off therefore causes a regression on these cases.
>
> However, this pipeline-type effect may increase your register
> pressure such that caller-save regs are required and extra spill/fill
> code needs to be generated.  This happens for some other Nullstone
> cases, and so it is good to have sched1 turned off for them!
>
> It's therefore looking like some kind of clever hybrid is required.
>
> I mention all this because I was wondering which other architectures
> have turned off sched1 for -Os?  More importantly, I was wondering
> if anyone else had considered creating some kind of clever hybrid
> that only uses sched1 when it will increase performance without
> increasing register pressure?
>
>    
http://gcc.gnu.org/ml/gcc-patches/2009-09/msg00003.html

Another problem is that sched1 for architectures with few registers can 
result in reload failure.  I tried to fix this in the patch mentioned 
above but I am not sure it is done for all targets and all possible 
programs.  The right solution for this would be implementing hard 
register spills in the reload.

The mentioned above code does not work for RA based on priority coloring 
because register pressure calculation for intersected or nested classes 
has a little sense.

If scheduling for the target is very important (as for itanium or 
in-order execution power6), I'd recommend to look at the selective 
scheduler.

> Or perhaps I could make a heuristic based on the balanced-ness of the
> tree?  (I see sched1 does a lot better if the tree is balanced, since
> it has more options to play with.)
>
>
>    
The register pressure is already mostly minimized when shed1 starts to work.
> Now onto interblock-scheduling ...
>
> As we all know, you can't have interblock-scheduling enabled unless
> you use the sched1 pass, so if sched1 is off then interblock is
> irrelevant.  For now, let's assume we are going to make some clever
> hybrid that allows sched1 when we think it will increase performance
> for Os and we are going to keep sched1 on for O2 and O3.
>
> As I understand it, interblock-scheduling enlarges the scope of
> sched1, such that you can insert independent insns from a
> completely different block in between dependent insns in this
> block.  As well as potentially amortizing stalls on high latency
> insns, we also get the chance to do "meatier" work in the destination
> block and leave less to do in the source block.  I don't know if this
> is a deliberate effect of interblock-scheduling or if it is just
> a happy side-effect.
>
> Anyway, the reason I mention interblock-scheduling is that I see it
> doing seemingly intelligent moves, but then the later BB-reorder pass
> is juggling blocks around such that we end up with extra code inside
> hot loops!  I assume this is because the scheduler and BB-reorderer
> are largely ignorant of each other, and so good intentions on the
> part of the former can be scuppered by the latter.
>
>    
That is right.  It would be nice if somebody solves the problem.
> I was wondering if anyone else has witnessed this madness on their
> architecture?  Maybe it is a bug with BB-reorder?  Or maybe it should
> only be enabled when function profiling information (e.g. gcov) is
> available?  Or maybe it is not a high-priority thing for anyone to
> think about because no one uses interblock-scheduling?
>
> If anyone can shed some light on the above, I'd greatly appreciate
> it.  For now, I will continue my experiments with selective enabling
> of sched1 for -Os.
>
>    

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Understanding Scheduling
  2010-03-19 16:41 ` Vladimir N. Makarov
@ 2010-03-19 17:02   ` Ian Bolton
  2010-03-19 17:13     ` Vladimir N. Makarov
  2010-03-19 20:58     ` Jeff Law
  0 siblings, 2 replies; 17+ messages in thread
From: Ian Bolton @ 2010-03-19 17:02 UTC (permalink / raw)
  To: Vladimir N. Makarov; +Cc: gcc

> > I mention all this because I was wondering which other architectures
> > have turned off sched1 for -Os?  More importantly, I was wondering
> > if anyone else had considered creating some kind of clever hybrid
> > that only uses sched1 when it will increase performance without
> > increasing register pressure?
> >
> >
> http://gcc.gnu.org/ml/gcc-patches/2009-09/msg00003.html
> 
> Another problem is that sched1 for architectures with few registers
can
> result in reload failure.  I tried to fix this in the patch mentioned
> above but I am not sure it is done for all targets and all possible
> programs.  The right solution for this would be implementing hard
> register spills in the reload.

I don't think we have so few registers that reload failure will occur,
so it might be worth me trying this.

> 
> The mentioned above code does not work for RA based on priority
> coloring
> because register pressure calculation for intersected or nested
classes
> has a little sense.

Hmm.  Thanks for mentioning that.  As you might recall, we are using
priority coloring at the moment because it yielded better performance
than Chaitin-Briggs.  Well, the real reason CB was rejected was that
we were already using Priority and so moving to CB would cause
sufficient
disruption such that performance increases and decreases would be
inevitable.  I did get some gains, but I also got regressions that we
couldn't absorb at the time I did the work.

I might be revisiting CB soon though, as it did tend to yield smaller
code, which is becoming more important to us. 

> 
> If scheduling for the target is very important (as for itanium or
> in-order execution power6), I'd recommend to look at the selective
> scheduler.

I don't think scheduling is highly important for us, but I will take
a look at the selective scheduler.

> 
> > Or perhaps I could make a heuristic based on the balanced-ness of
the
> > tree?  (I see sched1 does a lot better if the tree is balanced,
since
> > it has more options to play with.)
> >
> >
> >
> The register pressure is already mostly minimized when shed1 starts to
> work.

I guess this is a factor in the unbalancedness of the tree.  The more
you balance it, the more likely it will get wider and require more
registers.  But the wider and more balanced, the more options for sched1
and the more chance of a performance win (assuming the increase in reg
pressure does not outweigh the scheduling performance win.)

> > Now onto interblock-scheduling ...
> >
> > As we all know, you can't have interblock-scheduling enabled unless
> > you use the sched1 pass, so if sched1 is off then interblock is
> > irrelevant.  For now, let's assume we are going to make some clever
> > hybrid that allows sched1 when we think it will increase performance
> > for Os and we are going to keep sched1 on for O2 and O3.
> >
> > As I understand it, interblock-scheduling enlarges the scope of
> > sched1, such that you can insert independent insns from a
> > completely different block in between dependent insns in this
> > block.  As well as potentially amortizing stalls on high latency
> > insns, we also get the chance to do "meatier" work in the
destination
> > block and leave less to do in the source block.  I don't know if
this
> > is a deliberate effect of interblock-scheduling or if it is just
> > a happy side-effect.
> >
> > Anyway, the reason I mention interblock-scheduling is that I see it
> > doing seemingly intelligent moves, but then the later BB-reorder
pass
> > is juggling blocks around such that we end up with extra code inside
> > hot loops!  I assume this is because the scheduler and BB-reorderer
> > are largely ignorant of each other, and so good intentions on the
> > part of the former can be scuppered by the latter.
> >
> >
> That is right.  It would be nice if somebody solves the problem.

Hmm.  If we keep sched1 on, then maybe I will be the man to do it!

Best regards,
Ian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Understanding Scheduling
  2010-03-19 17:02   ` Ian Bolton
@ 2010-03-19 17:13     ` Vladimir N. Makarov
  2010-03-19 20:58     ` Jeff Law
  1 sibling, 0 replies; 17+ messages in thread
From: Vladimir N. Makarov @ 2010-03-19 17:13 UTC (permalink / raw)
  To: Ian Bolton; +Cc: gcc

On 03/19/2010 12:47 PM, Ian Bolton wrote:
>>> I mention all this because I was wondering which other architectures
>>> have turned off sched1 for -Os?  More importantly, I was wondering
>>> if anyone else had considered creating some kind of clever hybrid
>>> that only uses sched1 when it will increase performance without
>>> increasing register pressure?
>>>
>>>
>>>        
>> http://gcc.gnu.org/ml/gcc-patches/2009-09/msg00003.html
>>
>> Another problem is that sched1 for architectures with few registers
>>      
> can
>    
>> result in reload failure.  I tried to fix this in the patch mentioned
>> above but I am not sure it is done for all targets and all possible
>> programs.  The right solution for this would be implementing hard
>> register spills in the reload.
>>      
> I don't think we have so few registers that reload failure will occur,
> so it might be worth me trying this.
>
>    
>> The mentioned above code does not work for RA based on priority
>> coloring
>> because register pressure calculation for intersected or nested
>>      
> classes
>    
>> has a little sense.
>>      
> Hmm.  Thanks for mentioning that.  As you might recall, we are using
> priority coloring at the moment because it yielded better performance
> than Chaitin-Briggs.  Well, the real reason CB was rejected was that
> we were already using Priority and so moving to CB would cause
> sufficient
> disruption such that performance increases and decreases would be
> inevitable.  I did get some gains, but I also got regressions that we
> couldn't absorb at the time I did the work.
>
> I might be revisiting CB soon though, as it did tend to yield smaller
> code, which is becoming more important to us.
>
>    
Actually, I'll be working on the problem.  I am still working on I would 
like to change CB coloring to use any classes (not only cover classes).  
It creates the same problem for register-pressure sensitive insn 
scheduling as priority coloring.  The solution will be most probably 
useful for priority coloring too.  In any case, if I succeed, it is a 
win situation for your target because you could use register-pressure 
sensitive insn scheduling for priority coloring or CB coloring with any 
classes or both.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Understanding Scheduling
  2010-03-19 17:02   ` Ian Bolton
  2010-03-19 17:13     ` Vladimir N. Makarov
@ 2010-03-19 20:58     ` Jeff Law
  1 sibling, 0 replies; 17+ messages in thread
From: Jeff Law @ 2010-03-19 20:58 UTC (permalink / raw)
  To: Ian Bolton; +Cc: Vladimir N. Makarov, gcc

On 03/19/10 10:47, Ian Bolton wrote:
>>> I mention all this because I was wondering which other architectures
>>> have turned off sched1 for -Os?  More importantly, I was wondering
>>> if anyone else had considered creating some kind of clever hybrid
>>> that only uses sched1 when it will increase performance without
>>> increasing register pressure?
>>>
>>>
>>>        
>> http://gcc.gnu.org/ml/gcc-patches/2009-09/msg00003.html
>>
>> Another problem is that sched1 for architectures with few registers
>>      
> can
>    
>> result in reload failure.  I tried to fix this in the patch mentioned
>> above but I am not sure it is done for all targets and all possible
>> programs.  The right solution for this would be implementing hard
>> register spills in the reload.
>>      
> I don't think we have so few registers that reload failure will occur,
> so it might be worth me trying this.
>    
 From what I know of your architecture, it's unlikely.  The canonical 
case occurs when you have certain instructions that absolutely must use 
specific registers and you expose that register's class to asm 
writers.    There's other ways to trigger, but that's the easiest one.

WRT hard register spills, that's something I worked on last year and a 
major component of the second phase of the ira-reload code I'm working 
on.  While I wasn't setting out to solve the sched1 vs small register 
class problems, it gets solved as a byproduct of intelligent spilling.  
The implementations I did were quite ugly, but enough to tell me ideas 
were sound and give me some insight into how to integrate multiple 
spilling techniques.

And just a general note -- the tension between scheduling & register 
allocation is a well known problem, but not one that has been solved 
within GCC.    The sched1, allocate, sched2 is a classic approach to 
lessen the impact, but isn't a real solution.  I'm sure Vlad can point 
you at academic research in this space :-)

jeff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Understanding Scheduling
  2010-03-19 16:12 Understanding Scheduling Ian Bolton
  2010-03-19 16:34 ` Alexander Monakov
  2010-03-19 16:41 ` Vladimir N. Makarov
@ 2010-03-19 16:48 ` Steven Bosscher
  2010-03-22 17:49   ` Ian Bolton
  2 siblings, 1 reply; 17+ messages in thread
From: Steven Bosscher @ 2010-03-19 16:48 UTC (permalink / raw)
  To: Ian Bolton; +Cc: gcc

On Fri, Mar 19, 2010 at 5:09 PM, Ian Bolton <bolton@icerasemi.com> wrote:
> Anyway, the reason I mention interblock-scheduling is that I see it
> doing seemingly intelligent moves, but then the later BB-reorder pass
> is juggling blocks around such that we end up with extra code inside
> hot loops!  I assume this is because the scheduler and BB-reorderer
> are largely ignorant of each other, and so good intentions on the
> part of the former can be scuppered by the latter.
>
> I was wondering if anyone else has witnessed this madness on their
> architecture?  Maybe it is a bug with BB-reorder?  Or maybe it should
> only be enabled when function profiling information (e.g. gcov) is
> available?  Or maybe it is not a high-priority thing for anyone to
> think about because no one uses interblock-scheduling?

It's probably a bug in BB-reorder if you end up with extra code in hot
loops. BB-reorder can do funny things sometimes, especially if the
branch probabilities are estimated. But that seems unlikely in this
case because loops are of course taken into account in the estimation.
You could check that in the RTL dumps.

Enabling BB-reorder only if profile info is available, is not the
right way to go. The compiler really doesn't place blocks in sane
places without it -- and it shouldn't have to, either. For example if
you split an edge at some point, the last thing you want to worry
about, is where the new basic block is going to end up.

There are actually a few bugs in Bugzilla about BB-reorder, FWIW.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Understanding Scheduling
  2010-03-19 16:48 ` Steven Bosscher
@ 2010-03-22 17:49   ` Ian Bolton
  2010-03-23 12:55     ` Dave Hudson
  0 siblings, 1 reply; 17+ messages in thread
From: Ian Bolton @ 2010-03-22 17:49 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: gcc

> Enabling BB-reorder only if profile info is available, is not the
> right way to go. The compiler really doesn't place blocks in sane
> places without it -- and it shouldn't have to, either. For example if
> you split an edge at some point, the last thing you want to worry
> about, is where the new basic block is going to end up.
> 
> There are actually a few bugs in Bugzilla about BB-reorder, FWIW.

I've done a few searches in Bugzilla and am not sure if I have found
the BB reorder bugs you are referring to.

The ones I have found are:

16797: Opportunity to remove unnecessary load instructions
41396: missed space optimization related to basic block reorder
21002: RTL prologue and basic-block reordering pessimizes delay-slot
       filling.

(If you can recall any others, I'd appreciate hearing of them.)

Based on 41396, it looks like BB reorder is disabled for -Os.  But
you said in your post above that "the compiler really doesn't place
blocks in sane places without it", so does that mean that we could
probably increase performance for -Os if BB reorder was (improved)
and enabled for -Os?

Cheers,
Ian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Understanding Scheduling
  2010-03-22 17:49   ` Ian Bolton
@ 2010-03-23 12:55     ` Dave Hudson
  2010-03-23 18:24       ` BB reorder forced off for -Os Ian Bolton
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Hudson @ 2010-03-23 12:55 UTC (permalink / raw)
  To: Ian Bolton; +Cc: gcc

On 22 Mar 2010, at 17:44, Ian Bolton wrote:

>> Enabling BB-reorder only if profile info is available, is not the
>> right way to go. The compiler really doesn't place blocks in sane
>> places without it -- and it shouldn't have to, either. For example if
>> you split an edge at some point, the last thing you want to worry
>> about, is where the new basic block is going to end up.
>> 
>> There are actually a few bugs in Bugzilla about BB-reorder, FWIW.
> 
> I've done a few searches in Bugzilla and am not sure if I have found
> the BB reorder bugs you are referring to.
> 
> The ones I have found are:
> 
> 16797: Opportunity to remove unnecessary load instructions
> 41396: missed space optimization related to basic block reorder
> 21002: RTL prologue and basic-block reordering pessimizes delay-slot
>       filling.
> 
> (If you can recall any others, I'd appreciate hearing of them.)
> 
> Based on 41396, it looks like BB reorder is disabled for -Os.  But
> you said in your post above that "the compiler really doesn't place
> blocks in sane places without it", so does that mean that we could
> probably increase performance for -Os if BB reorder was (improved)
> and enabled for -Os?

Back with our old gcc 3.4 compiler we used to routinely compile our code -Os but with BB reordering enabled as it gave us a significant performance gain for a very small increase in code size (less than 2% code size impact from what I remember versus about a 5% performance win).

With gcc 4.4 (where we are until 4.5 is out) I've been constantly frustrated by not being able to do BB reordering at -Os but equally our code sizes at -O2 have steadily shrunk so that it's only about 10% larger than -Os if we disable cache-line-aligning functions (but where -O2 performance is often in the range of 15% to 30% faster).

I seem to remember some suggestions in the past that we might want something like a -Os2 that would generally optimize for size but would still enable some number of small code size expansions where the performance benefit was large (and BB reordering would be my favourite such case) - that's the optimization setting I'd like to see us use for almost everything at Ubicom.

Cheers,
Dave

^ permalink raw reply	[flat|nested] 17+ messages in thread

* BB reorder forced off for -Os
  2010-03-23 12:55     ` Dave Hudson
@ 2010-03-23 18:24       ` Ian Bolton
  2010-03-23 20:01         ` Paul Koning
  2010-03-23 22:55         ` Steven Bosscher
  0 siblings, 2 replies; 17+ messages in thread
From: Ian Bolton @ 2010-03-23 18:24 UTC (permalink / raw)
  To: gcc

Is there any reason why BB reorder has been disabled
in bb-reorder.c for -Os, such that you can't even
turn it on with -freorder-blocks?

From what I've heard on this list in recent days,
BB reorder gives good performance wins such that
most people would still want it on even if it did
increase code size a little.

Cheers,
Ian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BB reorder forced off for -Os
  2010-03-23 18:24       ` BB reorder forced off for -Os Ian Bolton
@ 2010-03-23 20:01         ` Paul Koning
  2010-03-23 21:10           ` Joe Buck
  2010-03-23 22:55         ` Steven Bosscher
  1 sibling, 1 reply; 17+ messages in thread
From: Paul Koning @ 2010-03-23 20:01 UTC (permalink / raw)
  To: Ian Bolton, gcc

Does -Os mean "optimize even if it makes things a bit bigger" or does it
mean "optimize only to make it smaller"?  If the latter then the current
behavior would appear to be the correct one.

	paul

> -----Original Message-----
> From: Ian Bolton [mailto:bolton@IceraSemi.com]
> Sent: Tuesday, March 23, 2010 2:06 PM
> To: gcc@gcc.gnu.org
> Subject: BB reorder forced off for -Os
> 
> Is there any reason why BB reorder has been disabled
> in bb-reorder.c for -Os, such that you can't even
> turn it on with -freorder-blocks?
> 
> From what I've heard on this list in recent days,
> BB reorder gives good performance wins such that
> most people would still want it on even if it did
> increase code size a little.
> 
> Cheers,
> Ian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BB reorder forced off for -Os
  2010-03-23 20:01         ` Paul Koning
@ 2010-03-23 21:10           ` Joe Buck
  0 siblings, 0 replies; 17+ messages in thread
From: Joe Buck @ 2010-03-23 21:10 UTC (permalink / raw)
  To: Paul Koning; +Cc: Ian Bolton, gcc

> From: Ian Bolton [mailto:bolton@IceraSemi.com]
> > Is there any reason why BB reorder has been disabled
> > in bb-reorder.c for -Os, such that you can't even
> > turn it on with -freorder-blocks?

On Tue, Mar 23, 2010 at 12:21:05PM -0700, Paul Koning wrote:
> Does -Os mean "optimize even if it makes things a bit bigger" or does it
> mean "optimize only to make it smaller"?  If the latter then the current
> behavior would appear to be the correct one.

The intent of -Os is to say that speed matters less than size.  This
would argue against using any optimization that can increase code size
*by default*.

However, if the user explicitly says -freorder-blocks on the command line,
then he/she is overriding part of -Os, saying that desired behavior is
to do the specified optimization, but otherwise optimize for space.

Also, while some combinations of options might not be possible, at the least,
if a user asks for some pass to run with an -f switch and the pass isn't
run, there should at least be a warning to that effect (IMHO).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BB reorder forced off for -Os
  2010-03-23 18:24       ` BB reorder forced off for -Os Ian Bolton
  2010-03-23 20:01         ` Paul Koning
@ 2010-03-23 22:55         ` Steven Bosscher
  2010-03-24 18:37           ` Dave Hudson
  1 sibling, 1 reply; 17+ messages in thread
From: Steven Bosscher @ 2010-03-23 22:55 UTC (permalink / raw)
  To: Ian Bolton; +Cc: gcc

On Tue, Mar 23, 2010 at 7:05 PM, Ian Bolton <bolton@icerasemi.com> wrote:
> Is there any reason why BB reorder has been disabled
> in bb-reorder.c for -Os, such that you can't even
> turn it on with -freorder-blocks?

No, you should have the option to turn it on if you wish to do so. If
that is not possible, I consider this a bug. If you open a PR and
assign it to me, I'll look into it.


> From what I've heard on this list in recent days,
> BB reorder gives good performance wins such that
> most people would still want it on even if it did
> increase code size a little.

FWIW there is already a PR with a request to add heuristics for
BB-reorder to optimize for size.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BB reorder forced off for -Os
  2010-03-23 22:55         ` Steven Bosscher
@ 2010-03-24 18:37           ` Dave Hudson
  2010-03-30 13:06             ` Ian Bolton
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Hudson @ 2010-03-24 18:37 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Ian Bolton, gcc

On 23 Mar 2010, at 22:30, Steven Bosscher wrote:

> On Tue, Mar 23, 2010 at 7:05 PM, Ian Bolton <bolton@icerasemi.com> wrote:
>> Is there any reason why BB reorder has been disabled
>> in bb-reorder.c for -Os, such that you can't even
>> turn it on with -freorder-blocks?
> 
> No, you should have the option to turn it on if you wish to do so. If
> that is not possible, I consider this a bug. If you open a PR and
> assign it to me, I'll look into it.

We're not able to enable BB reordering with -Os.  The behaviour is hard-coded via this if statement in rest_of_handle_reorder_blocks():

  if ((flag_reorder_blocks || flag_reorder_blocks_and_partition)
      /* Don't reorder blocks when optimizing for size because extra jump insns may
	 be created; also barrier may create extra padding.

	 More correctly we should have a block reordering mode that tried to
	 minimize the combined size of all the jumps.  This would more or less
	 automatically remove extra jumps, but would also try to use more short
	 jumps instead of long jumps.  */
      && optimize_function_for_speed_p (cfun))
    {
      reorder_basic_blocks ();

If you comment out the "&& optimize_function_for_speed_p (cfun)" then BB reordering takes places as desired (although this isn't a solution obviously).

In a private message Ian indicated that this had a small impact for the ISA he's working with but a significant performance gain.  I tried the same thing with the ISA I work on (Ubicom32) and this change typically increased code sizes by between 0.1% and 0.3% but improved performance by anything from 0.8% to 3% so on balance this is definitely winning for most of our users (this for a couple of benchmarks, the Linux kernel, busybox and smbd).

Regards,
Dave

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BB reorder forced off for -Os
  2010-03-24 18:37           ` Dave Hudson
@ 2010-03-30 13:06             ` Ian Bolton
  2010-04-08 17:55               ` Making BB Reorder Smarter " Ian Bolton
  0 siblings, 1 reply; 17+ messages in thread
From: Ian Bolton @ 2010-03-30 13:06 UTC (permalink / raw)
  To: Dave Hudson, Steven Bosscher; +Cc: gcc

> We're not able to enable BB reordering with -Os.  The behaviour is
> hard-coded via this if statement in rest_of_handle_reorder_blocks():
> 
>   if ((flag_reorder_blocks || flag_reorder_blocks_and_partition)
>       /* Don't reorder blocks when optimizing for size because extra
> jump insns may
> 	 be created; also barrier may create extra padding.
> 
> 	 More correctly we should have a block reordering mode that
tried
> to
> 	 minimize the combined size of all the jumps.  This would more
or
> less
> 	 automatically remove extra jumps, but would also try to use
more
> short
> 	 jumps instead of long jumps.  */
>       && optimize_function_for_speed_p (cfun))
>     {
>       reorder_basic_blocks ();
> 
> If you comment out the "&& optimize_function_for_speed_p (cfun)" then
> BB reordering takes places as desired (although this isn't a solution
> obviously).
> 
> In a private message Ian indicated that this had a small impact for
the
> ISA he's working with but a significant performance gain.  I tried the
> same thing with the ISA I work on (Ubicom32) and this change typically
> increased code sizes by between 0.1% and 0.3% but improved performance
> by anything from 0.8% to 3% so on balance this is definitely winning
> for most of our users (this for a couple of benchmarks, the Linux
> kernel, busybox and smbd).
> 

It should be noted that commenting out the conditional to do with
optimising for speed will make BB reordering come on for all functions,
even cold ones, so I think whatever gains have come from making this
hacky change could increase further if BB reordering is set to
only come on for hot functions when compiling with -Os.  (Certainly
the code size increases could be minimised, whilst hopefully retaining
the performance gains.)

Note that I am in no way suggesting this should be the default
behaviour for -Os, but that it should be switchable via the
flags just like other optimisations are.  But, once it is switchable,
I expect choosing to turn it on for -Os should not cause universal
enabling of BB reordering for every function (as opposed to the current
universal disabling of BB reordering for every function), but a sensible
half-way point, based on heat, so that you get the performance wins with
minimal code size increases on selected functions.

Cheers,
Ian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Making BB Reorder Smarter for -Os
  2010-03-30 13:06             ` Ian Bolton
@ 2010-04-08 17:55               ` Ian Bolton
  0 siblings, 0 replies; 17+ messages in thread
From: Ian Bolton @ 2010-04-08 17:55 UTC (permalink / raw)
  To: gcc

Bugzilla 41004 calls for a more -Os-friendly algorithm for BB Reorder,
and I'm hoping I can meet this challenge.  But I need your help!

If you have any existing ideas or thoughts that could help me get
closer to a sensible heuristic sooner, then please post them to this
list.

In the mean time, here's my progress so far:

My first thought was to limit BB Reorder to hot functions, as identified
by gcov, so we could get maximum execution time benefit for minimised
code
size impact.  Based on benchmarking I've done so far, this looks to be a
winner, but it only works when profile data is available.

Without profile data, we face two main problems:

1) We cannot easily estimate how many times we will execute our more
efficient code-layout, so we can't do an accurate trade-off versus the
code size increase.

2) The traces that BB Reorder constructs will be based on static branch
prediction, rather than true dynamic flows of execution, so the new
layout may not be the best one in practice.

We can address #1 by tagging functions as hot (using attributes), but
that may not always be possible and it does not guarantee that we will
get minimal codesize increases, which is the main aim of this work.

I'm not sure how #2 can be addressed, so I'm planning to sidestep it
completely, since the problem isn't really the performance pay-off but
the codesize increase that usually comes with each new layout of a
function that BB Reorder makes.

My current plan is to characterise a function within find_traces()
(looking at things like the number of traces, edge probabilities and
frequencies, etc) and only call connect_traces() to effect the
reordering
change if these characteristics suggest that minimal code disruption
will
occur and/or maximum performance pay-off.

Thanks for reading and I look forward to your input!

Cheers,
Ian

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2010-04-08 17:39 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-19 16:12 Understanding Scheduling Ian Bolton
2010-03-19 16:34 ` Alexander Monakov
2010-03-19 16:43   ` Ian Bolton
2010-03-19 16:41 ` Vladimir N. Makarov
2010-03-19 17:02   ` Ian Bolton
2010-03-19 17:13     ` Vladimir N. Makarov
2010-03-19 20:58     ` Jeff Law
2010-03-19 16:48 ` Steven Bosscher
2010-03-22 17:49   ` Ian Bolton
2010-03-23 12:55     ` Dave Hudson
2010-03-23 18:24       ` BB reorder forced off for -Os Ian Bolton
2010-03-23 20:01         ` Paul Koning
2010-03-23 21:10           ` Joe Buck
2010-03-23 22:55         ` Steven Bosscher
2010-03-24 18:37           ` Dave Hudson
2010-03-30 13:06             ` Ian Bolton
2010-04-08 17:55               ` Making BB Reorder Smarter " Ian Bolton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).