public inbox for gcc@gcc.gnu.org
* RE: An unusual Performance approach using Synthetic registers
@ 2002-12-27  5:47 Chris Lattner
  2002-12-29  0:35 ` Andy Walker
  0 siblings, 1 reply; 85+ messages in thread
From: Chris Lattner @ 2002-12-27  5:47 UTC (permalink / raw)
  To: Andy Walker; +Cc: gcc


> I am modifying gcc to add Synthetic registers.  If I am mistaken, this
> is a big waste of time.  If I am correct, (meaning "really really
> Lucky") then this may provide a significant increase in speed for
> executables.

IMHO, this could be a much bigger win on modern processors than you might
think. You may win a surprising amount strictly because of the register
renaming hardware found on modern x86 processors, which would eliminate
the L1 reference.  Take a look at this paper, for example, which describes
a very similar approach:

http://cm.bell-labs.com/cm/cs/what/smlnj/compiler-notes/k32.ps

On the other hand, I think this approach is not the right one to take if
improving optimizer effectiveness is the goal.  IMHO, more stuff should
be done on a mid-level representation, such as tree-ssa, rather than on the
low-level RTL... but the infrastructure appears not to be ready yet.

You will probably find that the number of "virtual" registers varies
widely across different incarnations of the architecture, implying that
the -march options should change the number of virtual registers.  Maybe
keeping this in mind when you design the code will help down the line.  :)

This will certainly be an interesting project; please keep the list
informed of what you find.  :)

-Chris

-- 
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/


* Re: An unusual Performance approach using Synthetic registers
  2002-12-27  5:47 An unusual Performance approach using Synthetic registers Chris Lattner
@ 2002-12-29  0:35 ` Andy Walker
  2002-12-29  5:58   ` Chris Lattner
  2002-12-29  7:44   ` Daniel Egger
  0 siblings, 2 replies; 85+ messages in thread
From: Andy Walker @ 2002-12-29  0:35 UTC (permalink / raw)
  To: Chris Lattner; +Cc: gcc

Thank you for your kind comments.  Also, thank you very much for the 
reference to the Lal George paper.

On Friday 27 December 2002 03:03 am, Chris Lattner wrote:

<snip>

> You may win a surprising amount strictly because of the register
> renaming hardware found on modern x86 processors, which would eliminate
> the L1 reference. 

I had not considered this.  

<snip>

> On the other hand, I think this approach is not the right one to take if
> improving optimizer effectiveness is the goal.  

My intuition says this can be a fertile area for investigation, for these 
reasons:  Algorithms for register use optimization are very good.  My 
experience with numerical analysis and evaluation of algorithms, in graduate 
school, leads me to believe that graph coloring leads to a result that is 
either optimal or near-optimal.  

Algorithms for memory use optimization are not nearly so well developed.  

In one sense, this approach co-opts the optimization algorithms for registers 
and uses them on small amounts of memory.  

Graph coloring works better, the more registers there are available for use.  
Six registers is sub-optimal.  Sixteen registers is a good start.  Thirty-two 
plus five gives thirty-seven, enough for the algorithm to excel.

<snip>

> You will probably find that the number of "virtual" registers varies
> widely across different incarnations of the architecture, implying that
> the -march options should change the number of virtual registers.  Maybe
> keeping this in mind when you design the code will help down the line.  :)

I expect the benefit of Synthetic registers to vary with the complexity of a 
function.  For itty-bitty subroutines, they will be a detriment, not a 
benefit.  For longer, more complicated routines, there will be more benefit.  I 
do not hold out my coding skills as an example of the best, but my keyboard 
does not really get warmed up until after I have added a couple of hundred 
lines to a routine.  

On machines like the Intel 80386 that have no cache, I expect Synthetic 
registers to have very limited benefits.

> This will certainly be an interesting project; please keep the list
> informed of what you find.  :)

Certainly.

> -Chris


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29  0:35 ` Andy Walker
@ 2002-12-29  5:58   ` Chris Lattner
  2002-12-29  6:26     ` Alexandre Oliva
  2002-12-29 12:04     ` Andy Walker
  2002-12-29  7:44   ` Daniel Egger
  1 sibling, 2 replies; 85+ messages in thread
From: Chris Lattner @ 2002-12-29  5:58 UTC (permalink / raw)
  To: Andy Walker; +Cc: gcc

On Sat, 28 Dec 2002, Andy Walker wrote:
> Thank you for your kind comments.  Also, thank you very much for the
> reference to the Lal George paper.

No problem, glad to help :)

> > On the other hand, I think this approach is not the right one to take if
> > improving optimizer effectiveness is the goal.
>
> My intuition says this can be a fertile area for investigation, for these
> reasons:  Algorithms for register use optimization are very good.  My
> experience with numerical analysis and evaluation of algorithms, in graduate
> school, leads me to believe that graph coloring leads to a result that is
> either optimal or near-optimal.

I understand exactly what you're saying and why you think it will work.
However, the true problem is that GCC performs most of its scalar
optimizations on the RTL representation, which operates on values in
physical registers.  For a machine with few physical registers, such as
x86, many scalar values are spilled to the stack, impeding optimization.

The fix for this particular problem, IMHO, is not to increase the number
of registers, for this is only a bandaid.  Instead, optimizations should
be performed on a representation which is not limited by the number of
registers in the target: in GCC this would be the tree representation, or
hopefully soon the GENERIC and GIMPLE representations.  The idea here is
that you perform optimizations on a representation exposing an infinite
virtual register set, so scalars stay scalars, and optimizations are
effective regardless of the number of physical registers in the target.
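
For concreteness, a throwaway example (the function and its temporaries are
made up): nine scalars are live at the final sum, so a six-register x86 must
spill several of them, while an optimizer working over an infinite virtual
register set keeps every one of them a scalar.

/* Hypothetical illustration of register pressure: t1..t9 are all live
 * at the return, exceeding x86's handful of integer registers, so some
 * are spilled; on an infinite-register IR they all remain scalars. */
int pressure(int a, int b, int c)
{
    int t1 = a + b, t2 = b + c, t3 = a + c;
    int t4 = a * b, t5 = b * c, t6 = a * c;
    int t7 = t1 * t4, t8 = t2 * t5, t9 = t3 * t6;
    return t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9;
}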

-Chris

-- 
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29  5:58   ` Chris Lattner
@ 2002-12-29  6:26     ` Alexandre Oliva
  2002-12-29 12:04     ` Andy Walker
  1 sibling, 0 replies; 85+ messages in thread
From: Alexandre Oliva @ 2002-12-29  6:26 UTC (permalink / raw)
  To: Chris Lattner; +Cc: Andy Walker, gcc

On Dec 29, 2002, Chris Lattner <sabre@nondot.org> wrote:

> However, the true problem is that GCC performs most of its scalar
> optimizations on the RTL representation, which operates on values in
> physical registers.  For a machine with few physical registers, such as
> x86, many scalar values are spilled to the stack, impeding optimization.

Except that they're only spilled to stack after register allocation is
completed, and most of the optimizations take place before that, when
we still hope that all pseudo registers will get assigned to hardware
registers.

> The idea here is that you perform optimizations on a representation
> exposing an infinite virtual register set, so scalar stay scalars,
> and optimizations are effective despite the number of physical
> registers in the target.

This is already the case with GCC RTL.

-- 
Alexandre Oliva   Enjoy Guarana', see http://www.ic.unicamp.br/~oliva/
Red Hat GCC Developer                 aoliva@{redhat.com, gcc.gnu.org}
CS PhD student at IC-Unicamp        oliva@{lsd.ic.unicamp.br, gnu.org}
Free Software Evangelist                Professional serial bug killer


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29  0:35 ` Andy Walker
  2002-12-29  5:58   ` Chris Lattner
@ 2002-12-29  7:44   ` Daniel Egger
  2002-12-29 12:10     ` Andy Walker
  2002-12-30  1:09     ` Michael S. Zick
  1 sibling, 2 replies; 85+ messages in thread
From: Daniel Egger @ 2002-12-29  7:44 UTC (permalink / raw)
  To: Andy Walker; +Cc: gcc


On Sun, 2002-12-29 at 05:31, Andy Walker wrote:

> Graph coloring works better, the more registers there are available for use.  
> Six registers is sub-optimal.  Sixteen registers is a good start.  Thirty-two 
> plus five gives thirty-seven, enough for the algorithm to excel.

However, if you have a look at register-rich architectures like PPC
you'll see that the generated code rarely uses more than 16
registers, most of them being used for parameters and return values.
I believe this is because the compiler is quite good at figuring
out register/time dependencies and reusing the available registers as
soon as their content is dead.  Further, much code doesn't keep values in 
variables but writes them to allocated storage, preventing any sensible
use of registers. 

There's really only very little code specialized enough to profit from the
many registers you're opting to simulate.  16 virtual registers are
probably more than enough and also fit nicely in a modern processor's
cache line (16 x 32 bits = 64 bytes).
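
To make that arithmetic concrete, a throwaway sketch (the struct name and
the GCC alignment attribute are my own, purely illustrative):

/* A hypothetical 16-entry synthetic register block: 16 x 32 bits is
 * exactly one 64-byte cache line, so the whole block can stay resident
 * in a single line. */
struct synth_regs {
    int r[16];                  /* 16 x 4 bytes = 64 bytes */
} __attribute__((aligned(64))); /* align to a cache-line boundary */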

-- 
Servus,
       Daniel



* Re: An unusual Performance approach using Synthetic registers
  2002-12-29  5:58   ` Chris Lattner
  2002-12-29  6:26     ` Alexandre Oliva
@ 2002-12-29 12:04     ` Andy Walker
  2002-12-29 13:58       ` Daniel Berlin
  2002-12-29 15:50       ` Diego Novillo
  1 sibling, 2 replies; 85+ messages in thread
From: Andy Walker @ 2002-12-29 12:04 UTC (permalink / raw)
  To: Chris Lattner; +Cc: gcc

On Sunday 29 December 2002 02:34 am, Chris Lattner wrote:

<snip>

> However, the true problem is that GCC performs most of its scalar
> optimizations on the RTL representation, which operates on values in
> physical registers.  For a machine with few physical registers, such as
> x86, many scalar values are spilled to the stack, impeding optimization.
>
> The fix for this particular problem, IMHO, is not to increase the number
> of registers, for this is only a bandaid.  Instead, optimizations should
> be performed on a representation which is not limited by the number of
> registers in the target: in GCC this would be the tree representation, or
> hopefully soon the GENERIC and GIMPLE representations.  The idea here is
> that you perform optimizations on a representation exposing an infinite
> virtual register set, so scalars stay scalars, and optimizations are
> effective regardless of the number of physical registers in the target.

I keep in mind that at some point the theory hits the silicon.  As an old 
assembler programmer, I always have in the back of my head the question: 
"What will this look like in instructions or in memory?" 

My approach may allow for an effectively infinite register set.  For each 
subroutine, the compiler figures out how many registers are needed.  Right 
now, in GCC, the next step after adding up the number of registers is to call 
Daniel Berlin's implementation of graph coloring.  

Instead of graph coloring, just allocate that number of Synthetic registers 
as part of the frame for that subroutine.  If you need one hundred, use one 
hundred, and get on with it.  

I have some questions about whether an effectively infinite number of 
Synthetic registers is a good idea.  

I will not look into that until I have some timing and memory results using a 
statically fixed number of Synthetic registers.

Andy


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29  7:44   ` Daniel Egger
@ 2002-12-29 12:10     ` Andy Walker
  2002-12-30 20:58       ` James Mansion
  2002-12-30  1:09     ` Michael S. Zick
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Walker @ 2002-12-29 12:10 UTC (permalink / raw)
  To: Daniel Egger; +Cc: gcc

On Sunday 29 December 2002 07:32 am, Daniel Egger wrote:
> On Sun, 2002-12-29 at 05:31, Andy Walker wrote:
> > Graph coloring works better, the more registers there are available for
> > use.  Six registers is sub-optimal.  Sixteen registers is a good start. 
> > Thirty-two plus five gives thirty-seven, enough for the algorithm to
> > excel.
>
> However, if you have a look at register-rich architectures like PPC
> you'll see that the generated code rarely uses more than 16
> registers, most of them being used for parameters and return values.
> I believe this is because the compiler is quite good at figuring
> out register/time dependencies and reusing the available registers as
> soon as their content is dead.  Further, much code doesn't keep values in
> variables but writes them to allocated storage, preventing any sensible
> use of registers.
>
> There's really only very little code specialized enough to profit from the
> many registers you're opting to simulate.  16 virtual registers are
> probably more than enough and also fit nicely in a modern processor's
> cache line (16 x 32 bits = 64 bytes).

Interesting.  I will keep this in mind.  No need to waste a lot of memory on 
unused Synthetic registers.  

I selected 32 because it is 2**5, and greater than 16.  When doing 
application programming in assembler, I found 16 general purpose registers 
not to be enough.  

Thank you for your comments.

Andy


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29 12:04     ` Andy Walker
@ 2002-12-29 13:58       ` Daniel Berlin
  2002-12-29 22:41         ` Andy Walker
  2002-12-29 15:50       ` Diego Novillo
  1 sibling, 1 reply; 85+ messages in thread
From: Daniel Berlin @ 2002-12-29 13:58 UTC (permalink / raw)
  To: Andy Walker; +Cc: Chris Lattner, gcc


On Sunday, December 29, 2002, at 12:29  PM, Andy Walker wrote:

> On Sunday 29 December 2002 02:34 am, Chris Lattner wrote:
>
> <snip>
>
>> However, the true problem is that GCC performs most of its scalar
>> optimizations on the RTL representation, which operates on values in
>> physical registers.  For a machine with few physical registers, such as
>> x86, many scalar values are spilled to the stack, impeding optimization.
>>
>> The fix for this particular problem, IMHO, is not to increase the number
>> of registers, for this is only a bandaid.  Instead, optimizations should
>> be performed on a representation which is not limited by the number of
>> registers in the target: in GCC this would be the tree representation, or
>> hopefully soon the GENERIC and GIMPLE representations.  The idea here is
>> that you perform optimizations on a representation exposing an infinite
>> virtual register set, so scalars stay scalars, and optimizations are
>> effective regardless of the number of physical registers in the target.
>
> I keep in mind that at some point the theory hits the silicon.  As an old
> assembler programmer, I always have in the back of my head the question:
> "What will this look like in instructions or in memory?"
>
> My approach may allow for an effectively infinite register set.

Which we have (Chris is wrong, we don't perform optimizations on 
physical registers, we perform them on an infinite register set), and the 
register allocator simply thunks the infinite number down to the real 
number we have on the architecture.

>   For each subroutine, the compiler figures out how many registers are
> needed.  Right now, in GCC, the next step after adding up the number of
> registers is to call Daniel Berlin's implementation of graph coloring.
>
First, this is incorrect.  Right now we use an ad-hoc register allocator 
that isn't all that good unless you specifically use -fnew-ra.

Second, Michael Matz is responsible for more of the new allocator than 
I am (I'm just trying to give credit where it's due).

> Instead of graph coloring, just allocate that number of Synthetic registers
> as part of the frame for that subroutine.  If you need one hundred, use one
> hundred, and get on with it.
>
> I have some questions about whether an effectively infinite number of
> Synthetic registers is a good idea.
>
> I will not look into that until I have some timing and memory results using
> a statically fixed number of Synthetic registers.
>
> Andy


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29 12:04     ` Andy Walker
  2002-12-29 13:58       ` Daniel Berlin
@ 2002-12-29 15:50       ` Diego Novillo
  2002-12-29 22:44         ` Andy Walker
  1 sibling, 1 reply; 85+ messages in thread
From: Diego Novillo @ 2002-12-29 15:50 UTC (permalink / raw)
  To: Andy Walker; +Cc: Chris Lattner, gcc

On Sun, 29 Dec 2002, Andy Walker wrote:

> My approach may allow for an effectively infinite register set.  For each 
>
Eh?  GCC already uses that approach.  RTL is a register language
for an abstract machine with an infinite number of registers.


Diego.


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29 13:58       ` Daniel Berlin
@ 2002-12-29 22:41         ` Andy Walker
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Walker @ 2002-12-29 22:41 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: gcc

On Sunday 29 December 2002 01:47 pm, Daniel Berlin wrote:

<snip>

> > My approach may allow for an effectively infinite register set.
>
> Which we have (Chris is wrong, we don't perform optimizations on
> physical registers, we perform them on an infinite register set), and the
> register allocator simply thunks the infinite number down to the real
> number we have on the architecture.

umm.  This is sort of the point of my whole approach.  I didn't call them 
"virtual" because they are not a simulation of a register that uses a lot of 
instructions to perform a simple operation.  I didn't call them "pseudo" 
because they are not slow memory imitations of a register.  Because of the 
x86 architecture, they may be used in many instructions just like "real" 
registers.  Because of the narrow way they are defined, they will perform 
with speed similar to that of "real" registers.  

I am adding to the "real" number we have on the architecture.  These are 
Synthetic: artificial duplicates of the originals, with many of the same 
characteristics.  I am specifically not including basing or indexing.
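
For concreteness, an illustrative fragment (my own, hypothetical, using GCC
inline assembly): x86 ALU instructions accept one memory operand directly,
which is the property that lets a fixed frame slot take part in arithmetic
without a separate load/store pair, much like a register:

/* Illustrative only: x86 lets an ALU op take a memory operand, so a
 * fixed stack slot (a "synthetic register") is used directly, with no
 * separate load and store around the add. */
static int add_via_slot(int x)
{
    int slot = 42;   /* stands in for a Synthetic register */
    __asm__("addl %1, %0" : "+r"(x) : "m"(slot));
    return x;        /* x + 42 */
}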

> >   For each subroutine, the compiler figures out how many registers are
> > needed.  Right now, in GCC, the next step after adding up the number of
> > registers is to call Daniel Berlin's implementation of graph coloring.
>
> First, this is incorrect.  Right now we use an ad-hoc register allocator
> that isn't all that good unless you specifically use -fnew-ra.

Ok.

> Second, Michael Matz is responsible for more of the new allocator than
> me (i'm just trying to give credit where it's due).

Second: I stand corrected.  Michael Matz's and Daniel Berlin's graph-coloring 
implementation.  

Credit where it is due, indeed!  Many thanks to Michael Matz and Daniel 
Berlin.

Andy


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29 15:50       ` Diego Novillo
@ 2002-12-29 22:44         ` Andy Walker
  2002-12-30  1:30           ` Zack Weinberg
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Walker @ 2002-12-29 22:44 UTC (permalink / raw)
  To: Diego Novillo; +Cc: Chris Lattner, gcc

On Sunday 29 December 2002 02:04 pm, Diego Novillo wrote:
> On Sun, 29 Dec 2002, Andy Walker wrote:
> > My approach may allow for an effectively infinite register set.  For each
>
> Eh?  GCC already uses that approach.  RTL is a register language
> for an abstract machine with an infinite number of registers.
>
>
> Diego.

I agree.  RTL is for an abstract machine.  Infinite register set.  Excellent 
approach.  

Synthetic registers are artificial registers for a real machine: an Intel 
Pentium Pro.  I am creating more real (but artificial) registers for a real 
machine.  For now, this is my own clunky second-hand Dell.  All of my changes 
have been to ix86.c, ix86.h, ix86.md, ix86-protos.h, and config.gcc.  No 
changes to any register allocating or scheduling routines or files.

In ix86.h, I am declaring the first pseudo-register to be number 84.

Synthetic registers are supposed to walk like ducks.  If they don't, I have 
failed.

Andy


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29  7:44   ` Daniel Egger
  2002-12-29 12:10     ` Andy Walker
@ 2002-12-30  1:09     ` Michael S. Zick
  2002-12-30  7:27       ` Daniel Egger
  1 sibling, 1 reply; 85+ messages in thread
From: Michael S. Zick @ 2002-12-30  1:09 UTC (permalink / raw)
  To: Daniel Egger, Andy Walker; +Cc: gcc

On Sunday 29 December 2002 07:32 am, Daniel Egger wrote:
> On Sun, 2002-12-29 at 05:31, Andy Walker wrote:
> > Graph coloring works better, the more registers there are available for
> > use.  Six registers is sub-optimal.  Sixteen registers is a good start. 
> > Thirty-two plus five gives thirty-seven, enough for the algorithm to
> > excel.
>
> However, if you have a look at register-rich architectures like PPC
> you'll see that the generated code rarely uses more than 16
> registers, most of them being used for parameters and return values.
> <snip>
>
Daniel, have you considered that the "16 used registers" observation
could be an artifact?

Presume an arbitrarily sophisticated optimization algorithm working
on a symbolic machine with an infinite number of registers...

Wouldn't the "16 used registers" observation fail as the size of the
source file approached infinite size?

Consider now some of the current practices that limit the size of the
source that the compiler "sees" prior to final optimization and allocation
to actual registers...

Function in-lining is done within a source file.

The size of in-lined functions is limited.

Sources are broken into multiple files, the size of which is kept (hopefully)
down to what can be comprehended by the human mind, which can be
expected to be considerably less than the file size that a compiler can
"comprehend".

Similarly with the size and complexity of a single expression statement.
Those things are usually limited in size and complexity to what the
human mind can comprehend.

For instance...
Pick a portion of the compiler (represented as a *.o file) that is actually
the combination of multiple *.o files separately compiled.
Rather than separately compile the sources of the component *.o files,
text edit them into one huge source file.
Remove limits on function in-lining.
Compile.
Examine.
Did the "16 used registers" observation hold?

Mike


* Re: An unusual Performance approach using Synthetic registers
  2002-12-29 22:44         ` Andy Walker
@ 2002-12-30  1:30           ` Zack Weinberg
  2002-12-30  2:57             ` Andy Walker
  2002-12-30  7:52             ` Michael S. Zick
  0 siblings, 2 replies; 85+ messages in thread
From: Zack Weinberg @ 2002-12-30  1:30 UTC (permalink / raw)
  To: Andy Walker; +Cc: Diego Novillo, Chris Lattner, gcc

Andy Walker <ja_walker@earthlink.net> writes:

> On Sunday 29 December 2002 02:04 pm, Diego Novillo wrote:
>> On Sun, 29 Dec 2002, Andy Walker wrote:
>> > My approach may allow for an effectively infinite register set.  For each
>>
>> Eh?  GCC already uses that approach.  RTL is a register language
>> for an abstract machine with an infinite number of registers.
>>
>>
>> Diego.
>
> I agree.  RTL is for an abstract machine.  Infinite register set.  Excellent 
> approach.  
>
> Synthetic registers are artificial registers for a real machine.
[...]

What I think Diego is trying to say is, creating synthetic registers
for the x86 isn't going to help much, possibly not at all, because the
optimizer passes that could benefit already have unlimited registers
to work with.  I agree with this assessment.  The problems which cause
GCC to generate inferior code on register-starved architectures are
intrinsic to the register allocator: lack of live-range splitting, the
ad-hoc two-phase allocation algorithm (which makes awful choices), and
the reload nightmare.  I suspect you will get more for your time by
working with Michael and Daniel to complete the new register
allocator, targeted to the real architectural register set.

There's another angle of potential improvement, which is, GCC is
totally incapable of optimizing stack frame layout.  Stack slots are
assigned in linear order and never recycled; the stack pointer is kept
aligned whether or not this is necessary; and so on.  In this domain I
think the sane thing to do is, whenever we have to put something on
the stack, create a new pseudo register pointing directly to it.
We're not bad at optimizing with operands of the form
(mem:MODE (reg:P pseudo)) particularly when they have nonoverlapping
alias sets.  These pseudos are specially marked and do not get
allocated to real registers; instead, after we know everything that's
going onto the stack, we replace them all with
(mem:MODE (plus:P (reg:P base) (const_int offset))).  Graph-coloring
allocation should work pretty well to lay out the stack frame.

There's trouble if we need scratch registers to calculate some of
these base+offset values, but I think the infrastructure already
exists to deal with that.
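
A minimal sketch of that layout idea (entirely my own, not GCC code; every
name is assumed): once all stack residents and their live ranges are known,
hand out frame offsets the way a coloring allocator hands out registers, so
slots whose live ranges never overlap share storage:

/* Hypothetical first-fit frame layout: each value gets the lowest
 * frame offset that does not collide with an already-placed value
 * whose live range overlaps its own. */
#include <stddef.h>

struct slot {
    int start, end;   /* live range, in instruction numbers */
    long offset;      /* assigned frame offset */
    size_t size;      /* size in bytes */
};

static int live_overlap(const struct slot *a, const struct slot *b)
{
    return a->start < b->end && b->start < a->end;
}

static void layout_frame(struct slot *s, int n)
{
    for (int i = 0; i < n; i++) {
        long off = 0;
        int moved = 1;
        while (moved) {         /* bump past every conflicting slot */
            moved = 0;
            for (int j = 0; j < i; j++) {
                if (live_overlap(&s[i], &s[j]) &&
                    off < s[j].offset + (long)s[j].size &&
                    s[j].offset < off + (long)s[i].size) {
                    off = s[j].offset + (long)s[j].size;
                    moved = 1;
                }
            }
        }
        s[i].offset = off;
    }
}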

zw


* Re: An unusual Performance approach using Synthetic registers
  2002-12-30  1:30           ` Zack Weinberg
@ 2002-12-30  2:57             ` Andy Walker
  2002-12-30  7:52             ` Michael S. Zick
  1 sibling, 0 replies; 85+ messages in thread
From: Andy Walker @ 2002-12-30  2:57 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Diego Novillo, gcc

Thank you for the explanation.  

On Sunday 29 December 2002 06:26 pm, Zack Weinberg wrote:
<snip>
> What I think Diego is trying to say is, creating synthetic registers
> for the x86 isn't going to help much, possibly not at all, because the
> optimizer passes that could benefit already have unlimited registers
> to work with.  I agree with this assessment.  

I completely agree that Synthetic registers will have no impact on optimizer 
passes.  If I am reading the GCC code correctly, the optimizer passes (except 
for peephole, also no impact) are all completed a long time before any 
of my changes have any significant effect.

> The problems which cause
> GCC to generate inferior code on register-starved architectures are
> intrinsic to the register allocator: lack of live-range splitting, the
> ad-hoc two-phase allocation algorithm (which makes awful choices), and
> the reload nightmare.  

My hope is that these issues will be significantly diminished if the 
architecture is no longer register-starved, and that graph coloring will be 
able to do a much better job with more registers.  

Graph coloring is still critical and necessary.  

> I suspect you will get more for your time by
> working with Michael and Daniel to complete the new register
> allocator, targeted to the real architectural register set.

Probably.  Unfortunately, implementing the Synthetic register approach has 
seized me for the moment. 

> There's another angle of potential improvement, which is, GCC is
> totally incapable of optimizing stack frame layout.  Stack slots are
> assigned in linear order and never recycled; the stack pointer is kept
> aligned whether or not this is necessary; and so on.  In this domain I
> think the sane thing to do is, whenever we have to put something on
> the stack, create a new pseudo register pointing directly to it.
> We're not bad at optimizing with operands of the form
> (mem:MODE (reg:P pseudo)) particularly when they have nonoverlapping
> alias sets.  These pseudos are specially marked and do not get
> allocated to real registers; instead, after we know everything that's
> going onto the stack, we replace them all with
> (mem:MODE (plus:P (reg:P base) (const_int offset))).  Graph-coloring
> allocation should work pretty well to lay out the stack frame.
>
> There's trouble if we need scratch registers to calculate some of
> these base+offset values, but I think the infrastructure already
> exists to deal with that.
>
> zw

I have thought about this issue as well.  

I have wondered about using an address list for current objects.  
I pay good money for a Program when I attend a sporting event or a show.  
That way, I know who are the participants.  Perhaps a witless compiler would 
benefit from the same thing: a block of addresses on the stack, each for a 
current object.  

It would be useless for an itty-bitty program.  A large program might 
benefit.  

Or it might not.

Where I work, we auto-generate huge programs, with tens of thousands of lines 
of code.  Maybe.  Maybe.

Andy 


* Re: An unusual Performance approach using Synthetic registers
  2002-12-30  1:09     ` Michael S. Zick
@ 2002-12-30  7:27       ` Daniel Egger
  2002-12-30 10:25         ` Michael S. Zick
  2002-12-30 20:50         ` Daniel Berlin
  0 siblings, 2 replies; 85+ messages in thread
From: Daniel Egger @ 2002-12-30  7:27 UTC (permalink / raw)
  To: Michael S.Zick; +Cc: Andy Walker, gcc


On Mon, 2002-12-30 at 00:45, Michael S.Zick wrote:

> Daniel, have you considered that the "16 used registers" observation
> could be an artifact?

Yes.

> Presume an arbitrarily sophisticated optimization algorithm working
> on a symbolic machine with an infinite number of registers...

> Wouldn't the "16 used registers" observation fail as the size of the
> source file approached infinite size?

Depends on the code.  Numerical applications tend to need far more
temporary values than, say, a notepad.

> Similarly with the size and complexity of a single expression statement.
> Those things are usually limited in size and complexity to what the
> human mind can comprehend.

The problem here is that you can surely recursively inline the full
application into the main function, compile it as one chunk and then
be happy about the maximum use of registers; however, at the same time
you absolutely blow code reuse and performance because of cache abuse.
The effects you'll see describe exactly the troubles people have
determining ideal inlining limits.  And without lots of inlining, having
really a lot of registers will only benefit very specialised
applications which are unfortunately often "hand-optimized" for a
particular processor and as such are poor subjects for compiler
optimisation.

And yes: I've actually tried what you mentioned with several larger
applications like The GIMP (with a modified paint core to force more
inlining) and there was no big increase in register usage; however, I
have no figures about it since I really didn't care about this
particular number at that point.

I really think the whole idea is sound for register-starved
architectures like i386 but don't expect too much from "unlimited"
registers because this is what we virtually have now with the current
optimisers.

BTW: Is there a good way to figure out how many registers the compiler
could have used sensibly if they were available?  It's easy to count the
number of virtual registers allocated after the optimisation passes
but this number does not account for intersecting behaviour when
registers have to be saved and restored to make room for a different
content.

-- 
Servus,
       Daniel



* Re: An unusual Performance approach using Synthetic registers
  2002-12-30  1:30           ` Zack Weinberg
  2002-12-30  2:57             ` Andy Walker
@ 2002-12-30  7:52             ` Michael S. Zick
  1 sibling, 0 replies; 85+ messages in thread
From: Michael S. Zick @ 2002-12-30  7:52 UTC (permalink / raw)
  To: Zack Weinberg, Andy Walker; +Cc: Diego Novillo, Chris Lattner, gcc

On Sunday 29 December 2002 06:26 pm, Zack Weinberg wrote:
> Andy Walker <ja_walker@earthlink.net> writes:
> > On Sunday 29 December 2002 02:04 pm, Diego Novillo wrote:
> >> On Sun, 29 Dec 2002, Andy Walker wrote:
> >> > My approach may allow for an effectively infinite register set.  For
> >> > each
> >>
> >> Eh?  GCC already uses that approach.  RTL is a register language
> >> for an abstract machine with an infinite number of register.
> >>
> >>
> >> Diego.
> >
> > I agree.  RTL is for an abstract machine.  Infinite register set. 
> > Excellent approach.
> >
> > Synthetic registers are artificial registers for a real machine.
>
> [...]
>
> What I think Diego is trying to say is, creating synthetic registers
> for the x86 isn't going to help much, possibly not at all, because the
> optimizer passes that could benefit already have unlimited registers
> to work with.  I agree with this assessment.  The problems which cause
> GCC to generate inferior code on register-starved architectures are
> intrinsic to the register allocator: lack of live-range splitting, the
> ad-hoc two-phase allocation algorithm (which makes awful choices), and
> the reload nightmare.  I suspect you will get more for your time by
> working with Michael and Daniel to complete the new register
> allocator, targeted to the real architectural register set.
>
> There's another angle of potential improvement, which is, GCC is
> totally incapable of optimizing stack frame layout.  Stack slots are
> assigned in linear order and never recycled; the stack pointer is kept
> aligned whether or not this is necessary; and so on.  In this domain I
> think the sane thing to do is, whenever we have to put something on
> the stack, create a new pseudo register pointing directly to it.
>
Perhaps...
In addition to the current distinction of an unlimited number of pseudo
registers, and;
The actual, physical registers PLUS stack slots that pseudo registers
could be (and eventually are) assigned/resolved to...
Define a new distinction composed of a large, but limited, number of
registers, on which;
The current pseudo register optimizations can be run, and;
They are usable (if they still exist at the register/stack slot assignment
stage) by the machine's existing instruction (sub)set.

Of course, to talk about this specialized class of pseudo registers
(or specialized class of physical machine registers - your choice);
one should attach an unused name to them.

An array of Synthetic Register Frames sounds like a workable name.

Hmmm... I think Andy just did that.

If, once Andy reaches the point that he feels his code is working as
he intended...
Someone wanted to throw away the current stack slot assignment/
register reload mess, and instead;
Within the code that assigns his Synthetic Registers, instead;
assign some or all of them to the hardware's stack frame (stack
slots)...

I think you would both reach the same goals.  Incrementally
of course.

In fewer words; 
I think Andy's approach has merit and could well
lead to general improvements in the compiler.  
Even if his Synthetic Register Frames eventually disappear completely, 
along with the stack slot/register reload mess.

Now if some brave soul would step forward to either extend RTL or
define an RTL-like symbolic representation of structured data (a
symbolic, position-independent data description) to replace the
"assign it as we find it" system now used...

Somebody take my keyboard away, I am beginning to ramble on...

Mike


* Re: An unusual Performance approach using Synthetic registers
  2002-12-30  7:27       ` Daniel Egger
@ 2002-12-30 10:25         ` Michael S. Zick
  2002-12-30 20:50         ` Daniel Berlin
  1 sibling, 0 replies; 85+ messages in thread
From: Michael S. Zick @ 2002-12-30 10:25 UTC (permalink / raw)
  To: Daniel Egger; +Cc: Andy Walker, gcc

On Sunday 29 December 2002 08:18 pm, Daniel Egger wrote:
> On Mon, 2002-12-30 at 00:45, Michael S.Zick wrote:
> > Daniel, have you considered that the "16 used registers" observation
> > could be an artifact?
>
> Yes.
No offense intended, I really did presume that you had.

>
> > Presume an arbitrarily sophisticated optimization algorithm working
> > on a symbolic machine with an infinite number of registers...
> >
> > Wouldn't the "16 used registers" observation fail as the size of the
> > source file approached infinite size?
>
> Depends on the code.  Numerical applications tend to need far more
> temporary values than, say, a notepad.
>
> > Similarly with the size and complexity of a single expression statement.
> > Those things are usually limited in size and complexity to what the
> > human mind can comprehend.
>
> The problem here is that you can surely recursively inline the full
> application into the main function, compile it as one chunk and then
> be happy about the maximum use of registers; however at the same time
> you absolutely blew code reuse and performance because of cache abuse.
>
Indeed.
Every one of those things.

In fact, I would like to quote your entire, clear, description of the result 
of recursively in-lining the entire application (including libc, libgcc, 
libstdc++, etc.) into the main function, BUT;

NOT in support of failing to do it where the optimizations on the symbolic
machine with an infinite-sized set of registers can get hold of it, INSTEAD;

In support that there is an entire pass (just prior to hard register / 
synthetic register / stack slot assignments with spill & reload of whatever
is left over) missing from the compiler design.

It is at this point that the compiler should address (no pun intended) 
things like cache line utilization and locality, memory footprint (and
implied code reuse), etc.

For the purpose of this thread, lets skip over the point that there are
much better ways of achieving the goal of exposing the entire
application to the optimization pass(es) than simply, recursively,
in-lining the entire application into the main function.
Just pretend we did something realistic with a similar result.

I'll refer to that as "effectively exposing the entire application" to
the optimization pass(es).

Presume further that the optimization pass(es) on the symbolic machine
with an infinite number of registers have done their thing.

At which point we have arrived at the front door of the port-specific
"back end", where (currently) the task is to translate the symbolic machine
with a fully usage-optimized, infinite register set into whatever our
silicon really has, plus any artificial restrictions, such as ABI definitions.

At this point we insert the compiler pass that I feel is missing.

For ease of discussion and visualization, presume that this symbolic machine
arrives in the form of a tree.

This pass traverses the tree, looking for leaves, twigs, small branches, major 
limbs having the same (or similar after transformation) patterns.
Leaves, twigs, etc. occurring within a loop count as that many occurrences of
the leaf, twig, etc.

At some point the pass "decides" that a certain pattern in the tree occurs
often enough that it should be UN-INLINED and turned into a function call.

The cost of all the prologue, epilogue, register bashing about to meet an ABI 
specification, and other havoc that has to be done to the beautiful symbolic
machine code is considered in that "decision" process.

All of that takes care of code reuse and cache line assignment.

With this design, any "inline function (.....)" and common functions that the
programmer factored out of the body of his code are only a HINT to this
new pass of what common instruction groupings should be.*

While still, in source form, meeting their PRIMARY, intended, goal of making
the source human-comprehensible.

Then you assign hard registers, synthetic registers, and stack slots with
spill & reload of whatever is left over.

*Other than the use of the new "never-inline" attribute.

Mike


* Re: An unusual Performance approach using Synthetic registers
  2002-12-30  7:27       ` Daniel Egger
  2002-12-30 10:25         ` Michael S. Zick
@ 2002-12-30 20:50         ` Daniel Berlin
  1 sibling, 0 replies; 85+ messages in thread
From: Daniel Berlin @ 2002-12-30 20:50 UTC (permalink / raw)
  To: Daniel Egger; +Cc: Michael S.Zick, Andy Walker, gcc

>
>
> BTW: Is there a good way to figure out how many registers the compiler
> could have used sensibly if they were available?

Good way?
no.
You are effectively trying to figure out the chromatic number of the 
graph (the smallest number of colors necessary to color the graph).
Determining this number is NP-hard.
Approximating it is in NP.
It is known that, unless coRP = NP, there is no polynomial-time 
algorithm which approximates any of these quantities within a factor of 
n^(1-epsilon) for graphs on n vertices, for any fixed epsilon > 0.

In other words, worst case, you can't even *approximate* this number 
with reasonable accuracy in polynomial time.
That's bad.
:)

> It's easy to count the
> number of virtual registers allocated after the optimisation passes
> but this number does not account for intersecting behaviour when
> registers have to be saved and restored to make room for a different
> content.
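
That said, a cheap upper bound is available: run a greedy coloring over the
interference graph and count the colors it uses.  A minimal sketch (my own,
not GCC's allocator; all names assumed):

/* Hypothetical greedy coloring over an interference matrix: the count
 * of colors used is an upper bound on how many registers the function
 * could have used sensibly (the exact chromatic number being NP-hard). */
#include <stdbool.h>

enum { MAX_PSEUDOS = 1024 };

static int greedy_color(int n,
                        bool interfere[MAX_PSEUDOS][MAX_PSEUDOS],
                        int color[MAX_PSEUDOS])
{
    int used = 0;
    for (int i = 0; i < n; i++) {
        bool taken[MAX_PSEUDOS] = { false };
        for (int j = 0; j < i; j++)   /* colors of placed neighbors */
            if (interfere[i][j])
                taken[color[j]] = true;
        int c = 0;
        while (taken[c])              /* lowest free color */
            c++;
        color[i] = c;
        if (c + 1 > used)
            used = c + 1;
    }
    return used;   /* >= chromatic number of the interference graph */
}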


* RE: An unusual Performance approach using Synthetic registers
  2002-12-29 12:10     ` Andy Walker
@ 2002-12-30 20:58       ` James Mansion
  2002-12-31  3:56         ` Michael S. Zick
  0 siblings, 1 reply; 85+ messages in thread
From: James Mansion @ 2002-12-30 20:58 UTC (permalink / raw)
  To: gcc

> application programming in assembler, I found 16 general purpose registers
> not to be enough.

I suspect, however, that this was because you wanted to avoid *coding*
spill logic for registers that were reused infrequently.  That's
different from avoiding spills for *material* performance reasons, and
a compiler isn't as lazy as you (or I).

James




* Re: An unusual Performance approach using Synthetic registers
  2002-12-30 20:58       ` James Mansion
@ 2002-12-31  3:56         ` Michael S. Zick
  0 siblings, 0 replies; 85+ messages in thread
From: Michael S. Zick @ 2002-12-31  3:56 UTC (permalink / raw)
  To: gcc

On Monday 30 December 2002 06:32 pm, James Mansion wrote:
> > application programming in assembler, I found 16 general purpose
> > registers not to be enough.
>
> I suspect, however, that this was because you wanted to avoid *coding*
> spill logic for registers that were reused infrequently. That's
> different for avoiding spills for *material* performance reasons, and
> a compiler isn't as lazy as you (or I).
>
> James
>
One grumpy old man says (once again):
That is an artifact of current compiler implementation.

To categorize register usage into two groups,
"reused infrequently" and "not reused infrequently",
is a decision.

For the human, it is a subjective one,
For a compiler, it is the result of numerical analysis.

For a human, experienced, assembly language programmer,
having knowledge of register usage that spans more than what is represented
in a single (or limited number of) source file(s)...
The registers that fall into the "reused infrequently" group are one 
thing.

For a compiler, where its use of numerical analysis has been
artificially limited to less than the entire application (say limited
to a single source file)...
The registers that fall into the "reused infrequently" group are
a different thing.

The compiler's numerical analysis equivalent of human
"cognitive span" has the potential for a span far greater than
any human's.

My position:
If the compiler implementation is done such that its "cognitive
span" (numerical analysis) matches or exceeds that of the human,
then you should see the number of registers that fall outside of its
"reused infrequently" group match or exceed that of the human's.

So the people who are looking at compiler output,
and reaching the conclusion that "16 registers are more than
enough", are probably right - for a compiler implementation having
a crippled cognitive span.

And the people who are drawing on their experience;
and reaching the conclusion that "16 registers don't even begin
to approach what is necessary" are also probably right - from
the viewpoint of considering a greater span of source than the
compiler does.

Q.E.D.: It's an artifact of compiler implementation.

Mike


* Re: An unusual Performance approach using Synthetic registers
  2003-01-08 21:34           ` Michael S. Zick
@ 2003-01-08 22:05             ` tm_gccmail
  0 siblings, 0 replies; 85+ messages in thread
From: tm_gccmail @ 2003-01-08 22:05 UTC (permalink / raw)
  To: Michael S.Zick; +Cc: Andy Walker, gcc

On Wed, 8 Jan 2003, Michael S.Zick wrote:

> Toshi,
> 
> That is really great detail work!
> Could you please do the same for case 2 and 3?

You need to learn to analyze things yourself.

Toshi



* Re: An unusual Performance approach using Synthetic registers
  2003-01-08 20:44         ` tm_gccmail
@ 2003-01-08 21:34           ` Michael S. Zick
  2003-01-08 22:05             ` tm_gccmail
  0 siblings, 1 reply; 85+ messages in thread
From: Michael S. Zick @ 2003-01-08 21:34 UTC (permalink / raw)
  To: tm_gccmail; +Cc: Andy Walker, gcc

Toshi,

That is really great detail work!
Could you please do the same for case 2 and 3?

Mike

On Wednesday 08 January 2003 01:39 pm, tm_gccmail@mail.kloo.net wrote:
> We've tried to tell you several times.
>
So I'm listening, go for case 2 and case 3 also, please.
>
> XCHG issues a BUS LOCK on the external bus for the duration of the
> instruction if a memory operand is used.
>
> Assuming an 800 MHz P3, with a 133 MHz external bus, you first need to:
>
> 1. Synchronize with the external bus
>
>    There's a 6:1 differential between the internal clock and the external
>    clock, so it can take up to 5 clock cycles to sync the buses.
>
> 2. Get a bus lock
>
>    Takes at least one bus clock, or 6 internal clocks
>
> 3. Execute the instruction
>
>    Possibly one or two clocks.
>
> 4. Release the bus lock.
>
>    Requires resyncing with the external bus again, so up to 5 clocks.
>    Takes a bus clock, or 6 internal clocks
>
> So basically, it's at least 18 clocks, and might be as bad as 23 clocks
> to execute XCHG on an 800 MHz P3.
>


* Re: An unusual Performance approach using Synthetic registers
  2003-01-08 19:29       ` Michael S. Zick
  2003-01-08 20:10         ` Michael S. Zick
@ 2003-01-08 20:44         ` tm_gccmail
  2003-01-08 21:34           ` Michael S. Zick
  1 sibling, 1 reply; 85+ messages in thread
From: tm_gccmail @ 2003-01-08 20:44 UTC (permalink / raw)
  To: Michael S.Zick; +Cc: Andy Walker, gcc

On Wed, 8 Jan 2003, Michael S.Zick wrote:

> I do not make any claims of this being anything other than a WAFG...
> 
> It wasn't used as a numerical measure, just "==", "<", ">" to
> determine an order among alternative code sequences.
> 
> But I used it as my guide in the past, and it is why I suggested XCHG.
> 
> Why:
> If the user wanted "Best Size" I dropped the "C" term;
> if the user wanted "Best Speed" I dropped the "D" term.
> Otherwise, just use the diagonal of a cube.
> 
> How:
> Scaled everything so it could be done with integer math.
> 
> Legend:
> B == Bus Cycles
> C == Clock Cycles
> S == Instruction Size
> D == (Instruction Size DIV D-Cache Size)
> Cost == SQRT(256*( B*B + C*C + D*D))

This is an arbitrary and nonsensical cost metric.

GCC either optimizes for size or speed, not both.
 
> Presumes:
> 1) Write to Stack meets the "Write Before Read" requirement
> So the first stack read does not generate a bus cycle.
> 2) If temporary is required, use EAX 
> 3) If EAX not available, spill/restore with push/pop
> 4) Newer processors will never be worse than 80386
> 5) D-Cache line size 64 bytes
> 
> Notes:
> Case 1 leaves a bus write pending
> Follow with a Reg <-> Reg to hide write cycle
> 
> Case 2 the "load/store" version, needs register
> Follow with another Reg <-> Reg if available
> 
> Case 3 leaves a bus write pending
> Case 4 puts other Reg <-> Reg ops to hide bus write
> 
> PATH                 B | C | S |  D   | Cost
> 
> Case 1 == Cost 80
> xchg ebx, [esp+16]   0 | 5 | 3 | 0.05 |   80
> With a pending bus cycle, so
> Reg <-> Reg pad here

We've tried to tell you several times.

XCHG issues a BUS LOCK on the external bus for the duration of the
instruction if a memory operand is used.

Assuming an 800 MHz P3, with a 133 MHz external bus, you first need to:

1. Synchronize with the external bus

   There's a 6:1 differential between the internal clock and the external
   clock, so it can take up to 5 clock cycles to sync the buses.

2. Get a bus lock

   Takes at least one bus clock, or 6 internal clocks

3. Execute the instruction

   Possibly one or two clocks.

4. Release the bus lock.

   Requires resyncing with the external bus again, so up to 5 clocks.
   Takes a bus clock, or 6 internal clocks

So basically, it's at least 18 clocks, and might be as bad as 23 clocks
to execute XCHG on an 800 MHz P3. 

It's worse on faster processors, because the ratio between CPU clock and
bus clock is even worse.

It's even worse on SMP systems, because another processor might be doing
transactions on the external bus. In that case, you need to wait for the
other processor to finish its transactions before you can even acquire a
bus lock.

XCHG is NOT designed for simply swapping data between a register and a
memory location. It is an instruction designed to guarantee atomic updates
on a multiprocessor system.
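
For illustration, a sketch of that intended use (my own, hedged; GCC inline
assembly): xchg with a memory operand is an atomic swap, which is what a
test-and-set lock needs, and the implicit bus lock described above is
exactly the price being paid:

/* Sketch of xchg's intended role: an atomic test-and-set lock.  The
 * memory-operand form is implicitly locked, which is what makes it
 * safe here - and slow as a plain register/memory swap. */
static inline int try_lock(volatile int *lock)
{
    int old = 1;
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(old), "+m"(*lock)
                         :
                         : "memory");
    return old == 0;   /* acquired if the previous value was 0 */
}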

Toshi


* Re: An unusual Performance approach using Synthetic registers
  2003-01-08 19:29       ` Michael S. Zick
@ 2003-01-08 20:10         ` Michael S. Zick
  2003-01-08 20:44         ` tm_gccmail
  1 sibling, 0 replies; 85+ messages in thread
From: Michael S. Zick @ 2003-01-08 20:10 UTC (permalink / raw)
  To: Andy Walker, tm_gccmail; +Cc: gcc

On Wednesday 08 January 2003 12:10 pm, Michael S. Zick wrote:
>
> Presumes:
> 1) Write to Stack meets the "Write Before Read" requirement
^^^^^^^^^^^^^^^^^First read from Stack

> So the first stack read does not generate a bus cycle.

I simply must get a new proofreader around here.
Mike


* Re: An unusual Performance approach using Synthetic registers
  2003-01-08  7:52     ` Andy Walker
@ 2003-01-08 19:29       ` Michael S. Zick
  2003-01-08 20:10         ` Michael S. Zick
  2003-01-08 20:44         ` tm_gccmail
  0 siblings, 2 replies; 85+ messages in thread
From: Michael S. Zick @ 2003-01-08 19:29 UTC (permalink / raw)
  To: Andy Walker, tm_gccmail; +Cc: gcc

On Tuesday 07 January 2003 11:35 pm, Andy Walker wrote:
> On Tuesday 07 January 2003 12:16 pm, tm_gccmail@mail.kloo.net wrote:
> <snip>
> <re: XCHG, exchange, instruction>
>
> > Now that I think about it, it's even worse on the Pentium/Pentium MMX
> > than I initially thought.
> >
> > There's two instruction pipelines on the Pentium: the U pipe and the V
> > pipe. The U pipe can execute all the instructions, but the V pipe can
> > only execute simple instructions.
>
> <snip>
>
> > Toshi
>
> I will take your good advice and not use XCHG as a performance enhancing
> option.
>
> Andy
Andy,

I do not make any claims of this being anything other than a WAFG...

It wasn't used as a numerical measure, just "==", "<", ">" to
determine an order among alternative code sequences.

But I used it as my guide in the past, and it is why I suggested XCHG.

Why:
If the user wanted "Best Size" I dropped the "C" term;
if the user wanted "Best Speed" I dropped the "D" term.
Otherwise, just use the diagonal of a cube.

How:
Scaled everything so it could be done with integer math.

Legend:
B == Bus Cycles
C == Clock Cycles
S == Instruction Size
D == (Instruction Size DIV D-Cache Size)
Cost == SQRT(256*( B*B + C*C + D*D))
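
A quick numeric check of the metric (a throwaway helper of my own; the
integer-scaled version I actually used would differ in detail):

/* Recomputes the case costs below in floating point for clarity. */
#include <math.h>
#include <stdio.h>

static double cost(double bus, double clocks, double dfrac)
{
    return sqrt(256.0 * (bus * bus + clocks * clocks + dfrac * dfrac));
}

int main(void)
{
    printf("case 1: %.0f\n", cost(0, 5, 0.05));   /* ~80  */
    printf("case 2: %.0f\n", cost(1, 8, 0.13));   /* ~129 */
    printf("case 3: %.0f\n", cost(3, 14, 0.16));  /* ~229 */
    printf("case 4: %.0f\n", cost(2, 14, 0.16));  /* ~226 */
    return 0;
}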

Presumes:
1) Write to Stack meets the "Write Before Read" requirement
So the first stack read does not generate a bus cycle.
2) If temporary is required, use EAX 
3) If EAX not available, spill/restore with push/pop
4) Newer processors will never be worse than 80386
5) D-Cache line size 64 bytes

Notes:
Case 1 leaves a bus write pending
Follow with a Reg <-> Reg to hide write cycle

Case 2 the "load/store" version, needs register
Follow with another Reg <-> Reg if available

Case 3 leaves a bus write pending
Case 4 puts other Reg <-> Reg ops to hide bus write

PATH                 B |  C |  S |  D   | Cost

Case 1 == Cost 80
xchg ebx, [esp+16]   0 |  5 |  3 | 0.05 |   80
With a pending bus cycle, so
Reg <-> Reg pad here

Case 2 == Cost 129
mov  eax, [esp+16]   0 |  4 |  3 | 0.05
mov  [esp+16], ebx   1 |  2 |  3 | 0.05
mov  ebx, eax        0 |  2 |  2 | 0.03
- - - - - - -
                     1 |  8 |  8 | 0.13 |  129

Case 3 == Cost 229
push eax             1 |  2 |  1 | 0.02
mov  eax, [esp+20]   0 |  4 |  3 | 0.05
mov  [esp+20], ebx   1 |  2 |  3 | 0.05
mov  ebx, eax        0 |  2 |  2 | 0.03
pop  eax             1 |  4 |  1 | 0.02
- - - - - - -
                     3 | 14 | 10 | 0.16 |  229

Case 4 == Cost 226
push eax             1 |  2 |  1 | 0.02
mov  eax, [esp+20]   0 |  4 |  3 | 0.05
mov  [esp+20], ebx   0 |  2 |  3 | 0.05
mov  ebx, eax        0 |  2 |  2 | 0.03
Reg <-> Reg pad here
pop  eax             1 |  4 |  1 | 0.02
- - - - - - -
                     2 | 14 | 10 | 0.16 |  226
					


* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 20:59 Robert Dewar
  2003-01-07  5:29 ` Andy Walker
@ 2003-01-08 17:32 ` Tom Lord
  1 sibling, 0 replies; 85+ messages in thread
From: Tom Lord @ 2003-01-08 17:32 UTC (permalink / raw)
  To: dewar; +Cc: gcc, ja_walker





   > In other words, with synthregs, the CPU can ship some value off to
   > memory and not care how long it takes to get there or to get back from
   > there -- because it also ships it off to the synthreg, which it
   > hypothetically has faster access to.

   But this "hypothesis" is wrong. memory used for spills or locals is
   exactly the same as memory used for "synthetic registers" [this
   fancy term is nothing more than a fancy name for a local
   temporary].

I _know_ that synthregs are a kind of local temporary -- if I'm
confused, that isn't the bug.

I wonder if I made a false assumption:

Will (optimizing) GCC _ever_ discard, rather than spill, a
reconstructable register _other than_ in the cases where the value is
any of:

	1) the address of a local or argument
        2) the value of a local or argument
        3) a constant that can be quickly loaded (e.g. a small integer)

I've been assuming that sometimes reconstructable values not in those
classes _are_ discarded, thus that synthregs can improve locality and
the effectiveness of CSE.  I would expect such values _sometimes_ to
be discarded because, otherwise, there is no upper bound on the amount
of spill space needed to execute a long sequence of short expressions.
On the other hand, I'm not having any luck finding a natural case
where values are discarded.

(A separate question is whether or not synthregs allow better
instruction selection in some cases than spills -- I've mostly been
thinking about locality.)

-t


* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 22:53 ` tm_gccmail
                     ` (2 preceding siblings ...)
  2003-01-08 11:45   ` Marcel Cox
@ 2003-01-08 17:29   ` Marcel Cox
  3 siblings, 0 replies; 85+ messages in thread
From: Marcel Cox @ 2003-01-08 17:29 UTC (permalink / raw)
  To: gcc

> > 2) Running the RA over the stack slots will cause the slots to be reused
> > when the life range of variables does not overlap.  This even increases
> > the compactness that already gives the benefit of point 1.  Also, overall
> > reducing stack usage will always be a small gain.
>
> The stack slots are already reused.

Stack slots are not reused. Neither stack slots allocated for local
variables, nor stack slots created for temporaries. See the following
example:

void dummy(int, int);

void test(void)
{
  int a1,b1,c1,d1;
  int a2,b2,c2,d2;

  c1=d1=0;
  for (a1=0; a1<10; a1++)
    for (b1=0; b1<=a1; b1++) {
      c1+=a1*a1;
      d1+=b1;
      dummy(c1,d1);
     }
  c2=d2=0;
  for (a2=0; a2<10; a2++)
    for (b2=0; b2<=a2; b2++) {
      c2+=a2*a2;
      d2+=b2;
      dummy(c2,d2);
     }
}


Compiling with -O3 -march=pentium4, this gives:

_test:
 pushl %ebp
 movl %esp, %ebp
 pushl %edi
 xorl %edi, %edi
 pushl %esi
 xorl %esi, %esi
 pushl %ebx
 subl $28, %esp
 movl $0, -16(%ebp)
L11:
 xorl %ebx, %ebx
 cmpl %esi, %ebx
 jg L25
 movl %esi, %edx
 imull %esi, %edx
 movl %edx, -24(%ebp)
L10:
 addl %ebx, -16(%ebp)
 addl -24(%ebp), %edi
 addl $1, %ebx
 movl -16(%ebp), %eax
 movl %edi, (%esp)
 movl %eax, 4(%esp)
 call _dummy
 cmpl %esi, %ebx
 jle L10
L25:
 addl $1, %esi
 cmpl $9, %esi
 jle L11
 movl $0, -20(%ebp)
 xorl %edi, %edi
 xorl %esi, %esi
L21:
 xorl %ebx, %ebx
 cmpl %esi, %ebx
 jg L29
 movl %esi, %edx
 imull %esi, %edx
 movl %edx, -28(%ebp)
L20:
 addl %ebx, -20(%ebp)
 addl -28(%ebp), %edi
 addl $1, %ebx
 movl -20(%ebp), %eax
 movl %edi, (%esp)
 movl %eax, 4(%esp)
 call _dummy
 cmpl %esi, %ebx
 jle L20
L29:
 addl $1, %esi
 cmpl $9, %esi
 jle L21
 addl $28, %esp
 popl %ebx
 popl %esi
 popl %edi
 popl %ebp
 ret

Stack slots are created for d1 (ebp-16) and d2 (ebp-20), even though those
two variables do not overlap and could share a single stack slot. The same
holds for the expression a1*a1, which is stored at ebp-24, and a2*a2, which
is stored at ebp-28. A single stack slot could be used for both expressions
as there is no overlap in the life ranges.
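
Written out by hand, the sharing that the RA (or a stack optimisation
pass) could discover looks like this (a sketch only, reusing the same
dummy() as above):

void dummy(int, int);

void test_shared(void)
{
  int a,b,c,d;  /* one set of variables, and thus slots, for both loops */

  c=d=0;
  for (a=0; a<10; a++)
    for (b=0; b<=a; b++) {
      c+=a*a;
      d+=b;
      dummy(c,d);
     }
  c=d=0;        /* live ranges end and restart here, so the slots
                   (and the a*a temporary) can be reused */
  for (a=0; a<10; a++)
    for (b=0; b<=a; b++) {
      c+=a*a;
      d+=b;
      dummy(c,d);
     }
}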

Marcel



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-08 12:27 Robert Dewar
  0 siblings, 0 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-08 12:27 UTC (permalink / raw)
  To: gcc, marcel_cox

> What you're describing is actually bad on the Pentium, and probably
> subsequent implementations as well.
> The Pentium can dual-issue loads as long as they reference separate cache
> ways. So, manually sorting the stack so contiguous accesses are localized
> increases the probability of the loads accessing the same cache way, thus
> decreasing the probability of single-issuing.

I would guess this would be dominated by the improvement in icache behavior
from the use of shorter offsets.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-08 12:13 Robert Dewar
@ 2003-01-08 12:21 ` Lars Segerlund
  0 siblings, 0 replies; 85+ messages in thread
From: Lars Segerlund @ 2003-01-08 12:21 UTC (permalink / raw)
  To: Robert Dewar; +Cc: gcc


  Now this stupid $500 motherboard/computer managed in excess of 90% of
the performance of a dual Xeon system at the time, if equipped with the
same RAM and SCSI drives :-) .. Since a Xeon was about $2000 at the time,
this 'stupid' system was serious bang for the buck as long as you didn't
need floating point.  (Compile farms and chip synthesis were what we used
it for, and simple math gives you 4 machines for the price of a CPU,
which is real value for money since it also gives you 4 times the memory
bandwidth, which was the limiting factor at the time.)
  They were very cheap and fast machines.

  i.m.h.o. / Lars Segerlund.

Robert Dewar wrote:
>>dumb statistic, fwiw: dual processor 500Mhz Celerons.  
>>My Mandrake 8.2 distro calls it a pentiumpro.  
> 
> 
> I am surprised anyone would ever have built a dual processor with such
> a strange processor choice. By the way the reason 8.2 calls it a 
> pentiumpro is probably just because it is so slow that it looks like
> it is a pentiumpro :)
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-08  6:56   ` Andy Walker
@ 2003-01-08 12:14     ` Michael S. Zick
  0 siblings, 0 replies; 85+ messages in thread
From: Michael S. Zick @ 2003-01-08 12:14 UTC (permalink / raw)
  To: Andy Walker, Robert Dewar; +Cc: gcc

On Tuesday 07 January 2003 11:26 pm, Andy Walker wrote:
> On Tuesday 07 January 2003 10:18 am, Michael S. Zick wrote:
> > It has always done its register <-> memory thing correctly.
>
> Thank you for the info.  This is _not_ in the XCHG description in the Intel
> Developer's manuals.
>
Good catch - It is always good to check anything I say for errors.
>
> ok.  for now.  :o)
> dumb statistic, fwiw: dual processor 500Mhz Celerons.
> My Mandrake 8.2 distro calls it a pentiumpro.
>
So let's try to focus on the model you have; we can generalize later.
Go to your /var/log/dmesg file and post the part that corresponds to this:
- - - snip - - -
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#0.
CPU:     After generic, caps: 0383fbff 00000000 00000000 00000000
CPU:             Common caps: 0383fbff 00000000 00000000 00000000
CPU0: Intel(R) Pentium(R) III CPU family      1266MHz stepping 01
- - - snip - - -
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#1.
CPU:     After generic, caps: 0383fbff 00000000 00000000 00000000
CPU:             Common caps: 0383fbff 00000000 00000000 00000000
CPU1: Intel(R) Pentium(R) III CPU family      1266MHz stepping 01
Total of 2 processors activated (5052.82 BogoMIPS).
- - - snip - - -
The "caps:" lines encode the chip features; they give us a little focus.

Mike
> Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-08 12:13 Robert Dewar
  2003-01-08 12:21 ` Lars Segerlund
  0 siblings, 1 reply; 85+ messages in thread
From: Robert Dewar @ 2003-01-08 12:13 UTC (permalink / raw)
  To: dewar, ja_walker, mszick, velco; +Cc: gcc, lord

> dumb statistic, fwiw: dual processor 500Mhz Celerons.  
> My Mandrake 8.2 distro calls it a pentiumpro.  

I am surprised anyone would ever have built a dual processor with such
a strange processor choice. By the way the reason 8.2 calls it a 
pentiumpro is probably just because it is so slow that it looks like
it is a pentiumpro :)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 22:53 ` tm_gccmail
  2003-01-08  1:05   ` tm_gccmail
  2003-01-08  1:22   ` tm_gccmail
@ 2003-01-08 11:45   ` Marcel Cox
  2003-01-08 17:29   ` Marcel Cox
  3 siblings, 0 replies; 85+ messages in thread
From: Marcel Cox @ 2003-01-08 11:45 UTC (permalink / raw)
  To: gcc

> The stack reordering pass posted by Sanjiv Gupta can do this.
> This was posted about a few days ago in gcc-patches.

Thanks for the pointer. I do not actively monitor the GCC-PATCHES list, so I
didn't know about it. Anyway, this work shows that at least there seems to
be some benefit in rearranging the stack layout.

> What you're describing is actually bad on the Pentium, and probably
> subsequent implementations as well.
> The Pentium can dual-issue loads as long as they reference separate cache
> ways. So, manually sorting the stack so contiguous accesses are localized
> increases the probability of the loads accessing the same cache way, thus
> decreasing the probability of single-issuing.
>
> You guys really should read the processor manual instead of making
> incorrect assumptions about what features would improve the code quality.
>

I don't know if it is bad for Pentium II or higher processors. The
Optimization guides from Intel for Pentium II/III and for Pentium 4
processors don't seem to mention that multiple accesses to the same cache
line are bad from a performance point of view.
http://developer.intel.com/design/pentiumii/manuals/245127.htm
http://developer.intel.com/design/pentium4/manuals/248966.htm


> > 2) Running the RA over the stack slots will cause the slots to be reused
> > when the life range of variables does not overlap. This even increases the
> > compactness that already gives the benefit of point 1. Also, overall
> > reducing stack usage will always be a small gain.
>
> The stack slots are already reused.

Are you sure about that? I thought that some time ago it was mentioned on
this list that this was not the case. I guess I will try to write a small
test program to check this.

> > 3) The "compact" memory access pattern and the reuse of stack slots might
> > increase the opportunity for the processor to use "shortcut" features in
> > memory access. For example successive writes to the same memory location
> > might be optimised to a single write, or read access to a memory location
> > may be fast if there is still a pending write on the same location
>
> The first feature you're describing is called "dead write elimination" and
> is already done in gcc.

There are a number of cases where the compiler can't eliminate the store:
for example a store from within a loop and another store after the loop, or
a store / load / store cycle where the second store occurs before the first
one has retired.
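
A small sketch of the loop case (illustration only; 'opaque' is a
hypothetical external function, and the call is what keeps the compiler
from removing the store inside the loop):

void opaque(void);  /* hypothetical */

void g(int *p, int n)
{
  int i;

  for (i = 0; i < n; i++) {
    *p = i;         /* overwritten after the loop, but must stay: */
    opaque();       /* the opaque call may read *p               */
  }
  *p = -1;          /* hardware write combining can still merge
                       these stores in the write buffer */
}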

> The second feature you're describing is called "write FIFO snooping" and
> is a hardware feature.

Intel calls it "store-forwarding".
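
A minimal illustration of the forwarding pattern (a sketch only; the
volatile is just there to force both the store and the reload):

int roundtrip(int v)
{
  volatile int x;   /* illustration only */

  x = v;            /* store enters the store buffer */
  return x;         /* load can be forwarded from the buffer
                       before the store reaches the cache */
}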


Marcel



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 19:20   ` tm_gccmail
@ 2003-01-08  7:52     ` Andy Walker
  2003-01-08 19:29       ` Michael S. Zick
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Walker @ 2003-01-08  7:52 UTC (permalink / raw)
  To: tm_gccmail, Robert Dewar; +Cc: lord, mszick, gcc

On Tuesday 07 January 2003 12:16 pm, tm_gccmail@mail.kloo.net wrote:
<snip>
<re: XCHG, exchange, instruction>
>
> Now that I think about it, it's even worse on the Pentium/Pentium MMX than
> I initially thought.
>
> There's two instruction pipelines on the Pentium: the U pipe and the V
> pipe. The U pipe can execute all the instructions, but the V pipe can only
> execute simple instructions.
<snip>
> Toshi

I will take your good advice and not use XCHG as a performance enhancing 
option.

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 17:02 ` Michael S. Zick
@ 2003-01-08  6:56   ` Andy Walker
  2003-01-08 12:14     ` Michael S. Zick
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Walker @ 2003-01-08  6:56 UTC (permalink / raw)
  To: Michael S. Zick, Robert Dewar, velco; +Cc: gcc, lord

On Tuesday 07 January 2003 10:18 am, Michael S. Zick wrote:
> On Tuesday 07 January 2003 06:47 am, Robert Dewar wrote:
> > I had thought that these instructions were only available on recent
> > chips, so if they are used you have to be careful about back
> > compatibility.
>
> Good advice;
> Andy, you mentioned your machine was a "Pentium" (without any
> qualifications) so your machine doesn't have prefetch instructions.
>
> Momchil Velikov <velco at fadata dot bg> gave a nice list in his post.
>
> The exchange (xchg) instruction was included in the original 80386.
> It also issued a "bus lock" signal from the chip, but as time passed,
> problems were found with its timing, and its use for atomically handling
> semaphores was replaced with new instructions for that purpose.
>
> It has always done its register <-> memory thing correctly.

Thank you for the info.  This is _not_ in the XCHG description in the Intel 
Developer's manuals.  

My point, however, was that XCHG causes unavoidable processor locks in 
register-memory form.  This is unacceptable for any instruction used to speed 
up straight computation.  I passed on using it.

> Also, your "Pentium" (without any qualifications) does not have the
> "out of order execution, with register renaming, in multiple execution
> units" grown into its silicon.  Just ignore us on that subject.

ok.  for now.  :o)  
dumb statistic, fwiw: dual processor 500Mhz Celerons.  
My Mandrake 8.2 distro calls it a pentiumpro.  

> The easy way is to just keep on asking the list, some of us
> have been writing Intel-ASM since the I-4004 hit the
> street. (Myself included.)
>
> Mike

I will.  Thank you for the invitation.

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 12:32 Robert Dewar
  2003-01-07 19:03 ` tm_gccmail
@ 2003-01-08  6:08 ` Andy Walker
  1 sibling, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-08  6:08 UTC (permalink / raw)
  To: Robert Dewar, lord, mszick; +Cc: gcc

On Tuesday 07 January 2003 06:08 am, Robert Dewar wrote:
> > First, XCHG is what I think of as an Operating System instruction.   
<snip>
> > It is one of the very reliable ways to implement
> > semaphores.
>
> Please look through the instruction set more carefully; this is NOT the way
> you would implement any synchronization instructions on the x86.

The Intel Developer's manual describes XCHG as an instruction to be used to 
implement semaphores.  The locking process is integral to the register-memory 
form of the instruction and cannot be disabled.  The phrasing implied that it 
locked the whole processor unit.  

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-08  5:36 Robert Dewar
  0 siblings, 0 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-08  5:36 UTC (permalink / raw)
  To: brane, marcel_cox; +Cc: gcc

I find the misplacement of local stack variables really amazing here. This
sounds like an easy place to make a significant win.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 22:53 ` tm_gccmail
  2003-01-08  1:05   ` tm_gccmail
@ 2003-01-08  1:22   ` tm_gccmail
  2003-01-08 11:45   ` Marcel Cox
  2003-01-08 17:29   ` Marcel Cox
  3 siblings, 0 replies; 85+ messages in thread
From: tm_gccmail @ 2003-01-08  1:22 UTC (permalink / raw)
  To: Marcel Cox; +Cc: gcc

On Tue, 7 Jan 2003 tm_gccmail@mail.kloo.net wrote:
> The stack reordering pass posted by Sanjiv Gupta can do this.
> This was posted about a few days ago in gcc-patches.

Oops, wrong person. The poster was Naveen Sharma.

http://gcc.gnu.org/ml/gcc-patches/2003-01/msg00019.html

Toshi


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 22:53 ` tm_gccmail
@ 2003-01-08  1:05   ` tm_gccmail
  2003-01-08  1:22   ` tm_gccmail
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 85+ messages in thread
From: tm_gccmail @ 2003-01-08  1:05 UTC (permalink / raw)
  To: Marcel Cox; +Cc: gcc

On Tue, 7 Jan 2003 tm_gccmail@mail.kloo.net wrote:

> On Tue, 7 Jan 2003, Marcel Cox wrote:
> (stuff deleted for brevity)
> ...
> > I think the speed gain is achieved for the following reasons:
> > 
> > 1) The most active variables are kept close together in memory. They only
> > occupy one or a few cache lines. Normally,  one would expect the stack frame
> > to always be in L1 cache. However especially when traversing data structures
> > that are bigger than L1 cache, you can expect the cache lines holding the
> > stack frame to be regularly be replaced by data, and having fewer cache
> > lines to reload for local variables will certainly give a speed advantage.
> 
> The stack reordering pass posted by Sanjiv Gupta can do this.
> This was posted about a few days ago in gcc-patches.
> 
> What you're describing is actually bad on the Pentium, and probably
> subsequent implementations as well.
> 
> The Pentium can dual-issue loads as long as they reference separate cache
> ways. So, manually sorting the stack so contiguous accesses are localized
> increases the probability of the loads accessing the same cache way, thus
> decreasing the probability of single-issuing.

That should read "decreasing the probability of dual-issuing" - apologies.

Toshi


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 21:01 Marcel Cox
@ 2003-01-07 22:53 ` tm_gccmail
  2003-01-08  1:05   ` tm_gccmail
                     ` (3 more replies)
  0 siblings, 4 replies; 85+ messages in thread
From: tm_gccmail @ 2003-01-07 22:53 UTC (permalink / raw)
  To: Marcel Cox; +Cc: gcc

On Tue, 7 Jan 2003, Marcel Cox wrote:
(stuff deleted for brevity)
...
> I think the speed gain is achieved for the following reasons:
> 
> 1) The most active variables are kept close together in memory. They only
> occupy one or a few cache lines. Normally,  one would expect the stack frame
> to always be in L1 cache. However especially when traversing data structures
> that are bigger than L1 cache, you can expect the cache lines holding the
> stack frame to be regularly be replaced by data, and having fewer cache
> lines to reload for local variables will certainly give a speed advantage.

The stack reordering pass posted by Sanjiv Gupta can do this.
This was posted about a few days ago in gcc-patches.

What you're describing is actually bad on the Pentium, and probably
subsequent implementations as well.

The Pentium can dual-issue loads as long as they reference separate cache
ways. So, manually sorting the stack so contiguous accesses are localized
increases the probability of the loads accessing the same cache way, thus
decreasing the probability of single-issuing.

You guys really should read the processor manual instead of making
incorrect assumptions about what features would improve the code quality.

> 2) Running the RA over the stack slots will cause the slots to be reused
> when the life range of variables does not overlap. This even increases the
> compactness that already gives the benefit of point 1. Also, overall
> reducing stack usage will always be a small gain.

The stack slots are already reused.

> 3) The "compact" memory access pattern and the reuse of stack slots might
> increase the opportunity for the processor to use "shortcut" features in
> memory access. For example successive writes to the same memory location
> might be optimised to a single write, or read access to a memory location
> may be fast if there is still a pending write on the same location

The first feature you're describing is called "dead write elimination" and
is already done in gcc.

The second feature you're describing is called "write FIFO snooping" and
is a hardware feature.

> 4) Finally,  the frequently used stack slots are probably placed next to the
> frame pointer so that 1 byte offsets can always be used for the most
> frequent stack accesses, thus reducing code size and reducing pressure on
> the
> instruction cache.

Yes, many instruction sets have this limitation, including:

o Hitachi SH
o MIPS16
o Thumb
o HP PA-RISC
etc.

> However I think that "synthetic registers" or not needed to get this
> gain. They are just a "trick" used to make the RA optimise the stack.
> However, it is probably possible to have a separate stack optimisation
> pass do the same thing, and this in a more flexible way as the number
> of important variables can dynamically adjust and does not have to be
> a fixed value like 32.

Yes, the stack reordering pass does this. Only the algorithm for deciding
the stack slot placement needs tweaking.

> Such a stack optimisation pass should do the
> following: - analyse the life range of variables and temporaries and
> use this information to reuse stack slots, rather than always
> assigning individual stack slots for each individual variable or
> temporary - analyse the usage of local variables and use this
> information to sort the local variables such that the most frequently
> used variables are nearest to the frame pointer (or to the stack point
> if you work without frame pointer). This will guarantee that 1 byte
> addressing can be used for the most frequent variables, and it will
> also put the most frequently variables together thus optimising the
> cache access pattern.

You guys need to look at Sanjiv's stack reordering patch.

You're introducing an unnecessary artificial construct to indirectly
manipulate the placement of items on the stack. The capability to
directly manipulate the placement of items on the stack already exists.

Toshi


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 21:49   ` Marcel Cox
@ 2003-01-07 21:55     ` Branko Čibej
  2003-01-07 21:55       ` Marcel Cox
  0 siblings, 1 reply; 85+ messages in thread
From: Branko Čibej @ 2003-01-07 21:55 UTC (permalink / raw)
  To: Marcel Cox; +Cc: gcc

Marcel Cox wrote:

>Finally, here is a short sample program which shows that GCC (3.2.1 tested
>here) is both better and worse than some people think or have claimed in
>this thread:
>
>void touch(void *, void *, void *);
>
>void testinit(void)
>{
>  int i=0,j=0,k=0;
>  char buffer[256];
>
>  touch(&i,&j,&k);
>  i+=k;
>  touch(&i,&j,&k);
>}
>  
>

One question: couldn't the compiler simply elide "buffer", since there
are no references to it?


-- 
Brane Čibej   <brane@xbc.nu>   http://www.xbc.nu/brane/

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 21:55     ` Branko Čibej
@ 2003-01-07 21:55       ` Marcel Cox
  0 siblings, 0 replies; 85+ messages in thread
From: Marcel Cox @ 2003-01-07 21:55 UTC (permalink / raw)
  To: Branko Čibej; +Cc: gcc

> One question: couldn't the compiler simply elide "buffer", since there
> are no references to it?
>

Yes, that's of course another optimisation opportunity that GCC is missing,
but that is a different story, not related to the "Synthetic registers"
discussion, even though a better life-range analysis of local variables, or a
better usage of such information, could do this kind of optimisation as well.

Marcel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07  5:29 ` Andy Walker
@ 2003-01-07 21:49   ` Marcel Cox
  2003-01-07 21:55     ` Branko Čibej
  0 siblings, 1 reply; 85+ messages in thread
From: Marcel Cox @ 2003-01-07 21:49 UTC (permalink / raw)
  To: gcc

I have been following the discussion on synthetic registers since the
beginning, and so far I have kept myself out of it because I'm not really
an expert on the inner workings of GCC, nor do I know all the optimisation
tricks for Intel processors. However, I think that over time I have gained
a little understanding of how GCC works, and I consider that I also know a
little about how the Intel processors work. So please don't take what I say
as an expert opinion, but just as the opinion of an observer who knows a
bit about things. Maybe I'm talking complete nonsense, but maybe some of my
ideas have some truth anyway.

This message is a reply to various things said in this discussion. It is not
directed at any single person, but comments on ideas put forward by various
people. The reason I make these comments is that I think there are two
groups of people in this discussion: one group who claims that synthetic
registers are a big win, and another group who more or less claims it's
just nonsense. I think what is lacking in the whole discussion is trying to
see what the experience at Bell Labs with the ML compiler actually was, and
why it gave some improvement to their compiler.

I think the expression "synthetic registers" suggests a wrong idea. It kind
of suggests that with too few registers the compiler is not able to optimise
well enough, but that if you fake the existence of more registers, it does a
better job. I think what this is all about is optimising the usage of the
stack slots and getting a speed gain from various positive effects of
optimising those stack slots. With the modification of the ML compiler, the
approach was not to create a new register class to simulate additional
registers, but rather to run the register allocator twice. The first run is
done with 32 artificial registers which are simply stack slots. This first
run in essence has the effect of optimising the stack by creating a set of
privileged stack slots which are accessed more often than others because
they are considered registers. Then there is a second run which does the
real register allocation.

I think the speed gain is achieved for the following reasons:

1) The most active variables are kept close together in memory. They only
occupy one or a few cache lines. Normally, one would expect the stack frame
to always be in L1 cache. However, especially when traversing data structures
that are bigger than the L1 cache, you can expect the cache lines holding the
stack frame to regularly be replaced by data, and having fewer cache
lines to reload for local variables will certainly give a speed advantage.

2) Running the RA over the stack slots will cause the slots to be reused
when the life ranges of variables do not overlap. This further increases the
compactness that already gives the benefit of point 1. Also, overall
reducing stack usage will always be a small gain.

3) The "compact" memory access pattern and the reuse of stack slots might
increase the opportunity for the processor to use "shortcut" features in
memory access. For example, successive writes to the same memory location
might be optimised to a single write, or read access to a memory location
may be fast if there is still a pending write on the same location.

4) Finally, the frequently used stack slots are probably placed next to the
frame pointer so that 1-byte offsets can always be used for the most
frequent stack accesses, thus reducing code size and reducing pressure on
the instruction cache.

However, I think that "synthetic registers" are not needed to get this gain.
They are just a "trick" used to make the RA optimise the stack. However, it
is probably possible to have a separate stack optimisation pass do the same
thing, and in a more flexible way, as the number of important variables
can adjust dynamically and does not have to be a fixed value like 32. Such a
stack optimisation pass should do the following:
- analyse the life ranges of variables and temporaries and use this
information to reuse stack slots, rather than always assigning individual
stack slots to each individual variable or temporary
- analyse the usage of local variables and use this information to sort the
local variables such that the most frequently used variables are nearest to
the frame pointer (or to the stack pointer if you work without a frame
pointer). This will guarantee that 1-byte addressing can be used for the
most frequently used variables, and it will also put the most frequently
used variables together, thus optimising the cache access pattern.
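
In C, the sorting step of such a pass might look roughly like this (a
rough sketch only; the struct and helper names are invented and are not
real GCC internals):

#include <stdlib.h>

struct slot { int size; long uses; long offset; };

/* Sort hottest-first by estimated access count. */
static int hotter_first(const void *a, const void *b)
{
  const struct slot *x = a, *y = b;
  return (y->uses > x->uses) - (y->uses < x->uses);
}

/* Give the most frequently used slots the smallest frame-pointer
   offsets, so they get the short (1-byte displacement) encodings. */
static void assign_offsets(struct slot *slots, int n)
{
  long off = 0;
  int i;

  qsort(slots, n, sizeof *slots, hotter_first);
  for (i = 0; i < n; i++) {
    off += slots[i].size;
    slots[i].offset = -off;  /* hottest slots end up nearest the frame pointer */
  }
}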

Finally, here is a short sample program which shows that GCC (3.2.1 tested
here) is both better and worse than some people think or have claimed in
this thread:

void touch(void *, void *, void *);

void testinit(void)
{
  int i=0,j=0,k=0;
  char buffer[256];

  touch(&i,&j,&k);
  i+=k;
  touch(&i,&j,&k);
}

Compiling for Pentium4 with -O3, this gives:

_testinit:
 pushl %ebp
 movl %esp, %ebp
 subl $312, %esp
 movl %ebx, -12(%ebp)
 movl %esi, -8(%ebp)
 movl %edi, -4(%ebp)
 leal -288(%ebp), %esi
 leal -292(%ebp), %ebx
 leal -284(%ebp), %edi
 movl %edi, 8(%esp)
 movl %esi, 4(%esp)
 movl %ebx, (%esp)
 movl $0, -292(%ebp)
 movl $0, -288(%ebp)
 movl $0, -284(%ebp)
 call _touch
 movl %ebx, (%esp)
 movl %esi, 4(%esp)
 movl -284(%ebp), %eax
 movl %edi, 8(%esp)
 addl %eax, -292(%ebp)
 call _touch
 movl -4(%ebp), %edi
 movl -12(%ebp), %ebx
 movl -8(%ebp), %esi
 movl %ebp, %esp
 popl %ebp
 ret

What's bad about GCC in this example:
The variables are allocated in the worst possible way. The dummy array is
allocated near the frame pointer, while the simple variables are far away.
Because of this, a long offset has to be used to access all variables,
resulting in unnecessary code bloat. The worst thing about this example is
that GCC insists on the bad stack usage. Even if you declare the char array
before the integers, the array is still allocated near the frame pointer and
the simple variables far from it. So, GCC deliberately sorts the variables
in a bad order here. I think even a very simplistic rule like "sort all
variables by size and put the smallest variables nearest to the frame
pointer" would give an improvement. Note that if you compile
with -fomit-frame-pointer, the result is much better.

What's good about GCC in this example:
Someone claimed that GCC would always use a load/modify/store approach when
dealing with variables not in registers. This example shows that this is not
the case. GCC can generate instructions that directly operate on memory when
this is of advantage. In this example, it is the 'addl %eax, -292(%ebp)'
instruction. The variable i is never loaded into a register, but the
addition to i occurs directly in memory.

Some general comments I have on ideas brought forward:
- I don't think it is a good idea to specially align the stack for best
possible L1 cache alignment. I think the stack bloat would negate the
benefit, especially as without L1 alignment and with a frame pointer, the
arguments of a function might be in the same cache line as the first
frequently used variables.
- locating synthetic registers outside the stack would be complete suicide.
You would need expensive memory-to-memory operations to save and restore
synthetic registers on function calls, and it would be impossible to create
multithreaded applications (besides possibly many other API-related
problems). Also, access to those memory locations would require long
addresses, thus leading to severe code bloat.
- any change to the stack layout that would lead to ABI incompatibilities
would certainly not have a chance of ever being accepted.

Marcel




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 19:03 ` tm_gccmail
@ 2003-01-07 19:20   ` tm_gccmail
  2003-01-08  7:52     ` Andy Walker
  0 siblings, 1 reply; 85+ messages in thread
From: tm_gccmail @ 2003-01-07 19:20 UTC (permalink / raw)
  To: Robert Dewar; +Cc: ja_walker, lord, mszick, gcc

On Tue, 7 Jan 2003 tm_gccmail@mail.kloo.net wrote:

> On Tue, 7 Jan 2003, Robert Dewar wrote:
> 
> > > First, XCHG is what I think of as an Operating System instruction.  It is 
> > > quite valuable because the exchange can be limited to a single process on a 
> > > single processor in a multiprocessor system, in conjunction with the locking 
> > > process.  It is one of the very reliable ways to implement semaphores.  
> > 
> > Please look through the instruction set more carefully; this is NOT the way
> > you would implement any synchronization instructions on the x86.
> > 
> > Also, be very careful about timing of instructions when you start to look
> > at the complex instructions of the x86. No one should even think of generating
> > code for the x86 without reading the Intel guide for compiler writers. 
> > Basically the rule on most variants of the x86 is that you should treat
> > it as a conventional load/store RISC machine when it comes to generating
> > code.
> 
> Yes, read-modify-write operations on memory locations are very bad,
> especially on in-order execution members of the x86 family (Pentium and
> below). The problem is that RMW operations stall the pipeline because the
> modify/write cannot be performed until the load completes.
> 
> So basically:
> 
> 1) Load occurs
> 2) Processor stalls for two clocks waiting for the load to complete
> 3) Modify is done
> 4) Write is done
> 
> If the code is generated as for a strict load/store architecture, then
> other instructions can be executed during the latency of the load
> instruction to perform useful work.
> 
> Toshi

Now that I think about it, it's even worse on the Pentium/Pentium MMX than
I initially thought.

There's two instruction pipelines on the Pentium: the U pipe and the V
pipe. The U pipe can execute all the instructions, but the V pipe can only
execute simple instructions.

IIRC, the RMW instruction would only execute in the U pipe. So not only
would the processor stall for two clocks, there is also a chance of its being
delayed another clock because it needs to be issued in the U pipe.

On the out-of-order execution members of the x86 family (Pentium II and
above) I believe CISC-style instructions are deprecated anyway because
they bottleneck the instruction decoder.

The x86 port originally did generate RMW instructions on memory up until
about four or five years ago, when I noticed that it's deprecated in the
Intel Compiler Writer's guide. I mentioned this on gcc or gcc-bugs, and
rth installed some patches to prevent RMW instructions from being
generated - if my recollection is correct.

I think we may possibly want to generate them if -Os is specified,
since they probably reduce code size. However, they definitely shouldn't
be generated under normal circumstances.

Toshi


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 12:32 Robert Dewar
@ 2003-01-07 19:03 ` tm_gccmail
  2003-01-07 19:20   ` tm_gccmail
  2003-01-08  6:08 ` Andy Walker
  1 sibling, 1 reply; 85+ messages in thread
From: tm_gccmail @ 2003-01-07 19:03 UTC (permalink / raw)
  To: Robert Dewar; +Cc: ja_walker, lord, mszick, gcc

On Tue, 7 Jan 2003, Robert Dewar wrote:

> > First, XCHG is what I think of as an Operating System instruction.  It is 
> > quite valuable because the exchange can be limited to a single process on a 
> > single processor in a multiprocessor system, in conjunction with the locking 
> > process.  It is one of the very reliable ways to implement semaphores.  
> 
> Please look through the instruction set more carefully; this is NOT the way
> you would implement any synchronization instructions on the x86.
> 
> Also, be very careful about timing of instructions when you start to look
> at the complex instructions of the x86. No one should even think of generating
> code for the x86 without reading the Intel guide for compiler writers. 
> Basically the rule on most variants of the x86 is that you should treat
> it as a conventional load/store RISC machine when it comes to generating
> code.

Yes, read-modify-write operations on memory locations are very bad,
especially on in-order execution members of the x86 family (Pentium and
below). The problem is that RMW operations stall the pipeline because the
modify/write cannot be performed until the load completes.

So basically:

1) Load occurs
2) Processor stalls for two clocks waiting for the load to complete
3) Modify is done
4) Write is done

If the code is generated as for a strict load/store architecture, then
other instructions can be executed during the latency of the load
instruction to perform useful work.
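
As a source-level caricature of the same point (illustration only, not
actual compiler output):

/* Independent work placed between the load of *p and its use gives
   the pipeline something to do while the load is in flight. */
int h(int *p, int a, int b)
{
  int t = *p;     /* load issues                            */
  int u = a * b;  /* independent work hides load latency    */
  return t + u;   /* first use of t, after the stall window */
}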

Toshi


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07  6:02     ` Andy Walker
@ 2003-01-07 17:41       ` Janis Johnson
  0 siblings, 0 replies; 85+ messages in thread
From: Janis Johnson @ 2003-01-07 17:41 UTC (permalink / raw)
  To: Andy Walker; +Cc: Michael S. Zick, Tom Lord, dewar, gcc

On Mon, Jan 06, 2003 at 11:28:25PM -0600, Andy Walker wrote:
> On Monday 06 January 2003 03:43 pm, Michael S. Zick wrote:
> <snip>
> > Because of the way a procedure call entry is made and its use
> > of the stack frame - those lines will be in the L-1 cache by the
> > time Andy wants to use them. - Specially if he has the function
> > call code issue a "prefetch" command for the range of stack
> > memory that he knows the function will use for its prologue.
> >
> > Mike
> 
> I had not considered a "prefetch" command.  I was not aware that such 
> existed.  
> 
> I am pretty familiar with the x86 instruction set, but I clearly recall that 
> I have never seen anything like this.  Is there such a thing in the x86 
> instruction set, and if so, what is it called?  Is it perhaps one of the 
> testing instructions?  

Several architectures supported by GCC provide data prefetch support.
The GCC internals manual documents prefetch support in RTL.

For information about x86 prefetch commands supported by various GCC
options, see gcc/testsuite/gcc.misc-tests/i386-prefetch.exp in the
GCC sources.
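
From C, the easiest way to experiment is the __builtin_prefetch
built-in (a sketch; the distance of 16 elements is just a guess to
tune, and the built-in degrades to a no-op on targets without
prefetch support):

/* Prefetch a later chunk of the array while summing the current one. */
void sum_with_prefetch(const int *a, int n, long *out)
{
  long s = 0;
  int i;

  for (i = 0; i < n; i++) {
    if (i + 16 < n)
      __builtin_prefetch(&a[i + 16]);
    s += a[i];
  }
  *out = s;
}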

Janis

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-07 17:19 Robert Dewar
  0 siblings, 0 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-07 17:19 UTC (permalink / raw)
  To: dewar, mszick, velco; +Cc: gcc, ja_walker, lord

> The exchange (xchg) instruction was included in the original 80386.

Actually this was on the 8086; it is one of the original instructions, though
of course that was not the ia32 architecture :-)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 15:17 Robert Dewar
@ 2003-01-07 17:02 ` Michael S. Zick
  2003-01-08  6:56   ` Andy Walker
  0 siblings, 1 reply; 85+ messages in thread
From: Michael S. Zick @ 2003-01-07 17:02 UTC (permalink / raw)
  To: Robert Dewar, velco; +Cc: gcc, ja_walker, lord

On Tuesday 07 January 2003 06:47 am, Robert Dewar wrote:
> I had thought that these instructions were only available on recent chips,
> so if they are used you have to be careful about back compatibility.
>
Good advice;
Andy, you mentioned your machine was a "Pentium" (without any
qualifications) so your machine doesn't have prefetch instructions.

Momchil Velikov <velco at fadata dot bg> gave a nice list in his post.

The exchange (xchg) instruction was included in the original 80386.
It also issued a "bus lock" signal from the chip, but as time passed,
problems were found with its timing, and its use for atomically handling
semaphores was replaced with new instructions for that purpose.

It has always done its register <-> memory thing correctly.

Also, your "Pentium" (without any qualifications) does not have the
"out of order execution, with register renaming, in multiple execution
units" grown into its silicon.  Just ignore us on that subject.

The hard way to learn about all of this is to start at this link:
<http://www.intel.com/support/processors/index.htm>

Pick your processor; on that processor's page will be a link to
manuals, including more than you ever wanted to know.
Most of the manuals discuss earlier versions of the chip in
contrast to the subject chip.

A good mid-point in the product line would be:
<http://www.intel.com/design/pentiumii/manuals/245127.htm>

which I believe Robert was referring to.

The easy way is to just keep on asking the list, some of us
have been writing Intel-ASM since the I-4004 hit the 
street. (Myself included.)

Mike

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-07 15:17 Robert Dewar
  2003-01-07 17:02 ` Michael S. Zick
  0 siblings, 1 reply; 85+ messages in thread
From: Robert Dewar @ 2003-01-07 15:17 UTC (permalink / raw)
  To: dewar, velco; +Cc: gcc, ja_walker, lord, mszick

I had thought that these instructions were only available on recent chips,
so if they are used you have to be careful about back compatibility.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-07 12:32 Robert Dewar
  2003-01-07 19:03 ` tm_gccmail
  2003-01-08  6:08 ` Andy Walker
  0 siblings, 2 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-07 12:32 UTC (permalink / raw)
  To: dewar, ja_walker, lord, mszick; +Cc: gcc

> First, XCHG is what I think of as an Operating System instruction.  It is 
> quite valuable because the exchange can be limited to a single process on a 
> single processor in a multiprocessor system, in conjunction with the locking 
> process.  It is one of the very reliable ways to implement semaphores.  

Please look through the instruction set more carefully; this is NOT the way
you would implement any synchronization instructions on the x86.

Also, be very careful about timing of instructions when you start to look
at the complex instructions of the x86. No one should even think of generating
code for the x86 without reading the Intel guide for compiler writers. 
Basically the rule on most variants of the x86 is that you should treat
it as a conventional load/store RISC machine when it comes to generating
code.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-07 12:08 Robert Dewar
@ 2003-01-07 12:10 ` Momchil Velikov
  0 siblings, 0 replies; 85+ messages in thread
From: Momchil Velikov @ 2003-01-07 12:10 UTC (permalink / raw)
  To: Robert Dewar; +Cc: ja_walker, lord, mszick, gcc

>>>>> "Robert" == Robert Dewar <dewar@gnat.com> writes:

    >> I am pretty familiar with the x86 instruction set, but I clearly recall that 
    >> I have never seen anything like this.  Is there such a thing in the x86 
    >> instruction set, and if so, what is it called?  Is it perhaps one of the 
    >> testing instructions?  

    Robert> There is no prefetch instruction as such on the x86, but of course any
    Robert> access acts as a prefetch in practice.

There is, actually:

Opcode          Instruction     Description
0F 18 /1        PREFETCHT0 m8   Move data from m8 closer to the processor using T0 hint.
0F 18 /2        PREFETCHT1 m8   Move data from m8 closer to the processor using T1 hint.
0F 18 /3        PREFETCHT2 m8   Move data from m8 closer to the processor using T2 hint.
0F 18 /0        PREFETCHNTA m8  Move data from m8 closer to the processor using NTA hint.

Description

 Fetches the line of data from memory that contains the byte specified
 with the source operand to a location in the cache hierarchy
 specified by a locality hint:

 * T0 (temporal data) prefetch data into all levels of the cache
   hierarchy.  Pentium III processor 1st- or 2nd-level cache.  Pentium
   4 and Intel Xeon processor 2nd-level cache.

 * T1 (temporal data with respect to first level cache) prefetch data
   into level 2 cache and higher.  Pentium III processor 2nd-level
   cache.  Pentium 4 and Intel Xeon processor 2nd-level cache.

 * T2 (temporal data with respect to second level cache) prefetch data
   into level 2 cache and higher.  Pentium III processor 2nd-level
   cache.  Pentium 4 and Intel Xeon processor 2nd-level cache.

 * NTA (non-temporal data with respect to all cache levels) prefetch
   data into non-temporal cache structure and into a location close to
   the processor, minimizing cache pollution.  Pentium III processor
   1st-level cache Pentium 4 and Intel Xeon processor 2nd-level cache

~velco

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-07 12:08 Robert Dewar
  2003-01-07 12:10 ` Momchil Velikov
  0 siblings, 1 reply; 85+ messages in thread
From: Robert Dewar @ 2003-01-07 12:08 UTC (permalink / raw)
  To: dewar, ja_walker, lord, mszick; +Cc: gcc

> I am pretty familiar with the x86 instruction set, but I clearly recall that 
> I have never seen anything like this.  Is there such a thing in the x86 
> instruction set, and if so, what is it called?  Is it perhaps one of the 
> testing instructions?  

There is no prefetch instruction as such on the x86, but of course any
access acts as a prefetch in practice.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 22:45     ` Michael S. Zick
@ 2003-01-07  6:04       ` Andy Walker
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-07  6:04 UTC (permalink / raw)
  To: Michael S. Zick, Tom Lord, dewar; +Cc: gcc

On Monday 06 January 2003 03:59 pm, Michael S. Zick wrote:
> On Monday 06 January 2003 01:47 am, Andy Walker wrote:
> > The list of instructions that might operate on a Synthetic register is
> > short: mov, add, sub, neg, and, or, xor, ror, rol, ashl, ashr, and lshr
> > (sp?).  In addition, a synthetic register may be a source for a mul or
> > divmod.  For this short list of instructions, a Synthetic register can be
> > a source or destination, in conjunction with a register, or it may be a
> > destination in conjunction with an immediate value.
>
> Andy,
>
> Is it possible to define a two-ouput instruction?
> If so, add XCHG to your list, short, fast, works in any pipeline, and
> can substitute for a pair of moves. (Three, if you have to supply
> your own temporary.)
>
> Mike

Yes, it is possible to define a two-output instruction in the machine 
description.

XCHG tickled my fancy, no question.  It could be used as a SWAP, which is an 
instruction type in the i386 machine description.  It could also, and at the 
same time, be used as a preferred alternative when implementing a MOV 
instruction.  The very strong advantage from my point of view is severely
reducing the need to clobber a natural register and then use the register as
an intermediary, because there are no "memory" to "memory" instructions.

I question whether or not your statement is true, that XCHG works in any 
pipeline.  Three things generate my pessimism. 

First, XCHG is what I think of as an Operating System instruction.  It is 
quite valuable because the exchange can be limited to a single process on a 
single processor in a multiprocessor system, in conjunction with the locking 
process.  It is one of the very reliable ways to implement semaphores.  

The other side of this coin is that the real memory may require an access, to 
check for someone else's lock.  If so, I cannot imagine anything other than a 
massive pipeline stall happening.  
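
For illustration, the classic test-and-set loop built on XCHG looks
roughly like this (a sketch in GCC inline asm, not production code):

/* Spin until we atomically swap a 1 into *lock and get back 0. */
static inline void spin_lock(volatile int *lock)
{
  int old;

  do {
    old = 1;
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r" (old), "+m" (*lock)
                         :
                         : "memory");
  } while (old != 0);
}

static inline void spin_unlock(volatile int *lock)
{
  *lock = 0;  /* a plain store is enough to release on x86 */
}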

Second, even though XCHG may be used between a register and memory, it is not 
implemented in the SWAP machine description.  Only register to register SWAPs 
are implemented.  That caught my attention when I saw it.  

Because it is not in SWAP, or anywhere else, it cannot be used by Reload.  
That instruction alone would have reduced much of the burden caused by 
Reload.  A lot of very bright people have walked away from this handy tool, 
as if it had the plague.  The developers here may not be able to use a 
crystal ball, but in every other respect, they are really wizards.

Third, I recall reading a posting on this list in the last year or so where 
one of the contributors was explaining why a particular instruction was never 
used, observing that using the memory version of the instruction caused 
horrible pipeline stalls.  

If I am wrong about this, PLEASE, enlighten me.  I would love to be able to 
use the instruction.

Andy 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 21:53   ` Michael S. Zick
@ 2003-01-07  6:02     ` Andy Walker
  2003-01-07 17:41       ` Janis Johnson
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Walker @ 2003-01-07  6:02 UTC (permalink / raw)
  To: Michael S. Zick, Tom Lord, dewar; +Cc: gcc

On Monday 06 January 2003 03:43 pm, Michael S. Zick wrote:
<snip>
> Because of the way a procedure call entry is made and its use
> of the stack frame - those lines will be in the L-1 cache by the
> time Andy wants to use them. - Specially if he has the function
> call code issue a "prefetch" command for the range of stack
> memory that he knows the function will use for its prologue.
>
> Mike

I had not considered a "prefetch" command.  I was not aware that such 
existed.  

I am pretty familiar with the x86 instruction set, but I clearly recall that 
I have never seen anything like this.  Is there such a thing in the x86 
instruction set, and if so, what is it called?  Is it perhaps one of the 
testing instructions?  

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 20:59 Robert Dewar
@ 2003-01-07  5:29 ` Andy Walker
  2003-01-07 21:49   ` Marcel Cox
  2003-01-08 17:32 ` Tom Lord
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Walker @ 2003-01-07  5:29 UTC (permalink / raw)
  To: Robert Dewar, lord; +Cc: gcc

On Monday 06 January 2003 02:24 pm, Robert Dewar wrote:
> > In other words, with synthregs, the CPU can ship some value off to
> > memory and not care how long it takes to get there or to get back from
> > there -- because it also ships it off to the synthreg, which it
> > hypothetically has faster access to.
>
> But this "hypothesis" is wrong. Memory used for spills or locals is exactly
> the same as memory used for "synthetic registers" [this fancy term is
> nothing more than a fancy name for a local temporary]. So there is no issue
> of having faster access to one or the other. It may of course be the case
> that in one case you get more competent code than the other, but if so, the
> fix is to fix incompetent code :-)

An interesting proposition that I had not considered.  After Synthetic 
registers get running well enough to be rationally evaluated, I may be 
interested in revisiting this idea.  

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 16:46   ` Michael S. Zick
@ 2003-01-07  5:20     ` Andy Walker
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-07  5:20 UTC (permalink / raw)
  To: Michael S. Zick; +Cc: gcc

On Monday 06 January 2003 09:13 am, Michael S. Zick wrote:
> On Sunday 05 January 2003 10:39 pm, Andy Walker wrote:
> > Simulated comparison of a loop end:
> >
> > w/o Synth
> > ...
> > mov eax,[StackSlot27] ; Load the increment from spill
> > mov edx,[StackSlot23] ; Load the index -- spilled for lack of registers
> > lea ecx,[eax + edx] ;      Nicely optimized "add"
> > mov edx,[StackSlot28];  Load the loop limit from spill
> > cmp ecx,edx  ;             Compare the index to the loop limit.
> > ...
> >
> > w/ Synth
> > ...
> > add ecx,[ebp -20] ;  Add synthreg 27, the increment, to the index.
> > cmp ecx,[ebp -16];  Compare the index to the loop limit in synthreg 28.
> > ...
> >
> > This is my concept.  Is it reality?  I will not know until I have tried
> > it.
>
> Right!
> Like I said elsewhere, the set of instructions GCC generates for stack
> slots SHOULD be larger than just load/store or push/pop.
>
> Andy, your approach of putting Synthetic Registers on the stack
> frame has nothing to do with effects on the D-cache...
>
> It should allow a quick and easy verification of what code GCC
> would generate IF it were modified to use what are now
> spill slots as local temporaries.
>
> I also suspect that a larger example than the one you chose to show
> will also have significantly fewer register stalls.
>
> On the ia32 machines that do out-of-order execution with
> register renaming in four or more execution units - the cycle
> count of a major sized routine should be significantly different.
>
> Keep at it.  I think we will all learn something.

Thank you.   I will.

Andy

>
> Mike

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06  8:06   ` Andy Walker
@ 2003-01-06 22:45     ` Michael S. Zick
  2003-01-07  6:04       ` Andy Walker
  0 siblings, 1 reply; 85+ messages in thread
From: Michael S. Zick @ 2003-01-06 22:45 UTC (permalink / raw)
  To: Andy Walker, Tom Lord, dewar; +Cc: gcc

On Monday 06 January 2003 01:47 am, Andy Walker wrote:
>
> The list of instructions that might operate on a Synthetic register is
> short: mov, add, sub, neg, and, or, xor, ror, rol, ashl, ashr, and lshr
> (sp?).  In addition, a synthetic register may be a source for a mul or
> divmod.  For this short list of instructions, a Synthetic register can be a
> source or destination, in conjunction with a register, or it may be a
> destination in conjunction with an immediate value.
>
Andy,

Is it possible to define a two-output instruction?
If so, add XCHG to your list: short, fast, works in any pipeline, and
can substitute for a pair of moves. (Three, if you have to supply
your own temporary.)

Mike

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 19:50 ` Tom Lord
  2003-01-06  6:29   ` Andy Walker
@ 2003-01-06 21:53   ` Michael S. Zick
  2003-01-07  6:02     ` Andy Walker
  1 sibling, 1 reply; 85+ messages in thread
From: Michael S. Zick @ 2003-01-06 21:53 UTC (permalink / raw)
  To: Tom Lord, dewar; +Cc: dewar, ja_walker, gcc

On Sunday 05 January 2003 06:24 am, Tom Lord wrote:
>>
> Presumably, the number of locations dedicated to register spills never
> exceeds (approximately) the maximum number of simultaneously live
> _intermediate_ values minus the number of general purpose registers.
> Any non-intermediate value (i.e., one that has a main memory
> location), rather than being spilled, will be written to its location.
> If that value is later re-used, it will be retrieved from memory.
>
Oops -
Sorry Tom;
Consider that GCC doesn't pack the data area in accordance with
reference patterns - so either or both of those data area
accesses may land on a cache line that got replaced, forcing
a memory bus cycle.

Because of the way a procedure call entry is made and its use
of the stack frame, those lines will be in the L1 cache by the
time Andy wants to use them - especially if he has the function
call code issue a "prefetch" command for the range of stack
memory that he knows the function will use for its prologue.
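
(On ia32 I am thinking of the SSE prefetch hints -- prefetcht0/t1/t2 and 
prefetchnta -- which GCC 3.1 and later also expose as __builtin_prefetch.  
A sketch, assuming 32-byte lines; take the details as illustrative:

/* Advisory only; compiles to nothing on targets without prefetch.  */
static inline void
prefetch_range (const void *addr, unsigned int bytes)
{
  const char *p = (const char *) addr;
  const char *end = p + bytes;
  for (; p < end; p += 32)              /* 32-byte P6 line size */
    __builtin_prefetch (p, 1, 3);       /* write intent, high locality */
}

The function call code would issue, in effect, 
prefetch_range (frame_base, frame_size) before the prologue touches the 
slots.)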

Mike

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-06 20:59 Robert Dewar
  2003-01-07  5:29 ` Andy Walker
  2003-01-08 17:32 ` Tom Lord
  0 siblings, 2 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-06 20:59 UTC (permalink / raw)
  To: dewar, lord; +Cc: denisc, gcc, ja_walker

> In other words, with synthregs, the CPU can ship some value off to
> memory and not care how long it takes to get there or to get back from
> there -- because it also ships it off to the synthreg, which it
> hypothetically has faster access to.

But this "hypothesis" is wrong. Memory used for spills or locals is exactly
the same as memory used for "synthetic registers" [this fancy term is nothing
more than a fancy name for a local temporary]. So there is no issue of having
faster access to one or the other. It may of course be the case that in one
case you get more competent code than the other, but if so, the fix is to
fix incompetent code :-)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 11:41 Robert Dewar
  2003-01-05 16:30 ` Michael S. Zick
  2003-01-06  4:53 ` Andy Walker
@ 2003-01-06 19:50 ` Tom Lord
  2003-01-06  6:29   ` Andy Walker
  2003-01-06 21:53   ` Michael S. Zick
  2 siblings, 2 replies; 85+ messages in thread
From: Tom Lord @ 2003-01-06 19:50 UTC (permalink / raw)
  To: dewar; +Cc: denisc, dewar, ja_walker, gcc




       dewar:

	This is a bit of an odd statement. In practice on a machine
	like the x86, the current stack frame will typically be
	resident in L1 cache, and that's where the register allocator
	spills to. What some of us still don't see is the difference
	in final resulting code between your "synthetic registers" and
	normal spill locations from the register allocator.


Register spills clearly don't equal synthetic registers.

Presumably, the number of locations dedicated to register spills never
exceeds (approximately) the maximum number of simultaneously live
_intermediate_ values minus the number of general purpose registers.
Any non-intermediate value (i.e., one that has a main memory
location), rather than being spilled, will be written to its location.
If that value is later re-used, it will be retrieved from memory.

The number of synthetic registers can be much larger than the number
of simultaneously live intermediate values.

So, with synthetic registers, some values that are not intermediates 
can be retained (in synthetic registers).  Without synthetic
registers, the next time those values are used, they have to be 
fetched from (non-special) memory.

In other words, with synthregs, the CPU can ship some value off to
memory and not care how long it takes to get there or to get back from
there -- because it also ships it off to the synthreg, which it
hypothetically has faster access to.

In practice, that means that synthregs will store some values in
memory twice: once in the location the program text says they go in;
again in the synthetic register.  If the synthetic register is indeed
cache-favored, maybe there's a performance win there -- and if so, a
register allocator is the right algorithm to decide which values to
keep duplicated in synthetic registers (so the proposed implementation
strategy is sensible).

(Another weird interaction is intermediate values that can be
recalculated -- I don't know if GCC ever makes that trade-off --
if it does, it needs to be tuned for synthregs.)

So, does that hypothesis (that synthreg access is faster than general
memory access) hold?  Quite possibly.  For example, a re-used synthreg
inherits cache-presence (at all levels, not just L1) from the previous
uses.  synthregs may win for some apps for more than just L1 reasons.

This brings in new alignment issues, too.  If you can, you might want
to make sure that your allocator locates its metadata where it will
cache-collide with the synthregs, to help push allocated memory out of
those locations (presuming here that allocator meta-data is relatively
infrequently accessed).  It's probably not all that hard to do this
"by accident".  Just in general: do things to protect the
cache-presence of the synthregs.

It might eventually lead to some hw advances: give synthregs with
absolute locations cache preference.  Or, if synthregs are on the
stack, give locations near the frame pointer cache preference (or is
that done already?).

I'd therefore guess it will be a very system-specific optimization --
but that it will win often enough to be useful.  And given what I
understand about trends in architecture, the cases in which it will
win will sharply increase over time.

No?

-t

p.s.: arch foo thinking about non-disruptive ways to improve gcc's
      rev ctl practices:

      http://lists.fifthvision.net/pipermail/arch-users/2003-January/001856.html

      and some of the follow-ups.   It's a pretty "noisy" list,
      though.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 14:08 Robert Dewar
  2003-01-05 16:50 ` Michael S. Zick
@ 2003-01-06 19:42 ` Tom Lord
  2003-01-06  8:06   ` Andy Walker
  1 sibling, 1 reply; 85+ messages in thread
From: Tom Lord @ 2003-01-06 19:42 UTC (permalink / raw)
  To: dewar; +Cc: gcc, ja_walker



       That's really not an issue for scalars. L1 caches are small but
       not that small. Once again, empirically most references to
       local stack frames are in cache anyway, so there's really not
       much to improve here.


For ja_walker (and, I'd guess this is important): that means that
just blindly "tricking" the register allocator isn't the best way to
do it -- because it makes more sense to store _some_ values in
synthregs than it does to store locals and arguments in them.  So, add
some weights somewhere or else you'll waste synthregs left and right.
Otherwise, what you lose for locals/args (the dominant case) will
probably exceed what you gain for other values.

-t

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 13:13 Robert Dewar
  2003-01-06  4:40 ` Andy Walker
@ 2003-01-06 19:42 ` Tom Lord
  2003-01-06  6:39   ` Andy Walker
  1 sibling, 1 reply; 85+ messages in thread
From: Tom Lord @ 2003-01-06 19:42 UTC (permalink / raw)
  To: dewar; +Cc: dewar, denisc, gcc, ja_walker



       Well most certainly you should not get trapped into a situation
       where CSE values *must* live in registers, but that's not a
       problem. Remember that "retrieving from memory" is *EXACTLY*
       the same code sequence as reading a synthetic register,
       assuming both are on the current stack frame.

Two replies:

1) I don't fully understand why synthregs aren't a common area rather
   than part of stack frames.  A common area _adds_ code to
   save/restore synthregs -- but it also increases the number and
   frequency of references to synthregs.  I don't think L1 is the only
   cache that can be used better by synthregs.


2) Same code sequence (or "worse"), yes.  Same cache interaction, no.




   > It might eventually lead to some hw advances: give synthregs with
   > absolute locations cache preference.  Or, if synthregs are on the
   > stack, give locations near the frame pointer cache preference (or is
   > that done already?).

   I don't see that as a good idea at all. The stack frame indeed will
   almost always be in cache with current designs, and locking cache
   seems a bad idea.

If you're right -- then storing additional non-intermediate values
on the stack (as stack-based synthregs) may very well be a win.

If I'm right, then the net effect of the proposed HW changes is to
bump the number of registers, but to have some registers accessed by 
shorter code sequences than others.

Either way, synthregs (plausibly at least) wins.

       Once again, I would just love to see one (1) example of what is
       being talked about here. Let's see a small kernel in source,
       the current GCC code being generated, and the amazing improved
       code that can be generated with synthetic registers (which are
       nothing more than local memory locations). At this stage I
       really can't imagine such an example, so, assuming this is a
       failure of my imagination (I am not the only one with this
       handicap), please enlighten with one convincing example :-)
       

I'll mostly wimp out for now and hope ja_walker@earthlink.net does
this in detail.

But, sketchingly, let's think of a function that manipulates a dozen
C++ objects, each with a vtable.  It also manipulates some fields in
each object.  The vtable pointers are going to be used lots of times --
each field, just once.  (Maybe the fields are array elements and 
we're talking about a loop here.)
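
In C terms, the shape I have in mind is something like this (hand-rolled 
vtables just to fix ideas; every name here is made up):

struct vtbl { int (*get) (void *self); };
struct obj  { struct vtbl *vt; int field; };

int
sum_fields (struct obj *o[12], int n)
{
  int total = 0, i, j;
  for (i = 0; i < n; i++)
    for (j = 0; j < 12; j++)
      /* twelve vtable pointers stay live across the whole loop */
      total += o[j]->vt->get (o[j]);
  return total;
}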

I can't fit all those vtable pointers in regs, but I can fit them in
synthregs.  Do you agree that the reg allocator, applied to synthregs,
will keep those vtable pointers in synthregs?

So now the generated code (looking just at the instruction count) with
synthregs will be slightly _worse_ than the code without synthregs --
but if the synthregs really do wind up with noticeably better cache
performance, it'll run faster.

Another way to look at this with a slightly longer term perspective is
that synthregs improve locality at a slight cost in instruction count.


-t

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 14:05 Robert Dewar
@ 2003-01-06 19:42 ` Tom Lord
  2003-01-06  6:49   ` Andy Walker
  0 siblings, 1 reply; 85+ messages in thread
From: Tom Lord @ 2003-01-06 19:42 UTC (permalink / raw)
  To: dewar; +Cc: denisc, gcc, ja_walker



       If you decide to allocate a base register, as was proposed at
       one point,

Bah.  I missed that (misread while skimming, actually).  Maybe his
implementation approach is bogus after all.  Why not use the FP, if
it's there, or SP when FP is omitted.


      Remember again that the code to access SR's can be no better
      than the code we generate right now for all accesses to local
      variables, function arguments in memory, and spilled registers.


But it can also improve both locality and the temporally proximate
re-use of memory locations.

ja_walker is right: it's a worthwhile empirical question.  And I'd
still bet you a beer or something that in 8 years, the answer to the
question is "yes" for the commodity systems.  Have you ever had an
8-year old beer?  Have you?!?!?

-t

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06  4:40 ` Andy Walker
@ 2003-01-06 16:46   ` Michael S. Zick
  2003-01-07  5:20     ` Andy Walker
  0 siblings, 1 reply; 85+ messages in thread
From: Michael S. Zick @ 2003-01-06 16:46 UTC (permalink / raw)
  To: Andy Walker, Robert Dewar, lord; +Cc: gcc

On Sunday 05 January 2003 10:39 pm, Andy Walker wrote:
>
> Simulated comparison of a loop end:
>
> w/o Synth
> ...
> mov eax,[StackSlot27] ; Load the increment from spill
> mov edx,[StackSlot23] ; Load the index -- spilled for lack of registers
> lea ecx,[eax + edx] ;      Nicely optimized "add"
> mov edx,[StackSlot28];  Load the loop limit from spill
> cmp ecx,edx  ;             Compare the index to the loop limit.
> ...
>
> w/ Synth
> ...
> add ecx,[ebp -20] ;  Add synthreg 27, the increment, to the index.
> cmp ecx,[ebp -16];  Compare the index to the loop limit in synthreg 28.
> ...
>
> This is my concept.  Is it reality?  I will not know until I have tried it.
>
Right! 
Like I said elsewhere, the set of instructions GCC generates for stack
slots SHOULD be larger than just load/store or push/pop.

Andy, your approach of putting Synthetic Registers on the stack
frame has nothing to do with effects on the D-cache...

It should allow a quick and easy verification of what code GCC
would generate IF it were modified to use what are now
spill slots as local temporaries.

I also suspect that a larger example than the one you chose to show
will also have significantly fewer register stalls.

On the ia32 machines that do out-of-order execution with
register renaming in four or more execution units - the cycle
count of a major sized routine should be significantly different.

Keep at it.  I think we will all learn something.

Mike

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06  6:50     ` Daniel Berlin
@ 2003-01-06  9:00       ` Andy Walker
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-06  9:00 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Tom Lord, gcc

Do you have any idea about how well the new allocator handles two-tiered 
register classes?

Obviously, my specific interest is a small class of general registers coupled 
with a large class of quite restricted registers.

Andy

On Monday 06 January 2003 12:49 am, Daniel Berlin wrote:
> > My goal is that the generated code with Synthetic registers will be
> > significantly better (looking just at the instruction count).  If the
> > fields
> > have individual values, in your example, then Reload imposes an
> > overhead.  It
> > is a two step process: move the spilled data back into a register,
> > then use
> > it.  With a Synthetic, you get to just use it.
>
> If the new allocator determines the data in question is cheaper to
> recalculate than it would be to reload it from memory, it'll do so
> (though it might be a bit limited in what it *can* recalculate right
> now; this is fixable. I don't have numbers on how often it would have
> wanted to remat but couldn't due to these problems). It's known as
> "rematerialization".
> Thus, in the example of vtable pointers, it'll simply recalculate the
> value into the new register, rather than reload it from memory.
>
> > Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 19:42 ` Tom Lord
@ 2003-01-06  8:06   ` Andy Walker
  2003-01-06 22:45     ` Michael S. Zick
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Walker @ 2003-01-06  8:06 UTC (permalink / raw)
  To: Tom Lord, dewar; +Cc: gcc

On Sunday 05 January 2003 08:20 am, Tom Lord wrote:
> For ja_walker (and, I'd guess this is important): that means that
> just blindly "tricking" the register allocator isn't the best way to
> do it -- because it makes more sense to store _some_ values in
> synthregs than it does to store locals and arguments in them.  So, add
> some weights somewhere or else you'll waste synthregs left and right.
> Otherwise, what you lose for locals/args (the dominant case) will
> probably exceed what you gain for other values.

We shall have to see.  I fully expect to have to rewrite the whole thing as 
soon as we get a few runs to see what it is really doing.  This is my third 
rewrite already.

The implementation is a little more complicated than "blind trickery".  But 
not much.  My fond hope is that the additional restrictions will be 
sufficient, and that the optimizer will quite happily use the Synthetic 
registers when it is prudent to do so.  If not, I may have to do some tuning 
by disparaging other alternatives, either a little, or a lot.

I have created a new register class, "k" for Synthetic.  (Obvious, no?  Not 
to me either!)  "k" class registers are not base class registers, index class 
registers, or general registers.  They are also not general operands.  
Therefore, they do not get sucked into being used for everything, and I 
minimize the code changes and checking I have to do.  Also, the optimizer and 
allocator passes get every opportunity to generate the cleanest code with the 
real restrictions in mind.  Big plus here:  we may be able to keep Reload's 
sticky fingers off the natural registers if there are enough Synthetic 
registers.  If there are enough natural registers, then register allocation 
is trivial, register coalescing is unnecessary, and spills never happen.

If Synthetic registers are used, then the hard frame pointer is required.  
The hard frame pointer is realigned onto a 32-byte boundary, and the old 
frame pointer and pc are moved accordingly.  Synthetic registers are the 32 
full-word locations immediately below the hard frame pointer.  In light of 
Robert Dewar's comments about stack slot accesses through the hard frame 
pointer, I am reconsidering this.
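
For what it is worth, the loop example elsewhere in this thread is 
consistent with synthreg N living at [ebp - 4*(32 - N)]: synthreg 28 at 
[ebp-16], synthreg 27 at [ebp-20].  As a sketch of the intended mapping 
(hypothetical code, not what is in my tree):

/* Frame offset of SImode synthreg n, 0 <= n <= 31, assuming the block
   of 32 sits immediately below the realigned hard frame pointer.  */
static int
synthreg_offset (int n)
{
  return -4 * (32 - n);   /* synthreg 28 -> -16, synthreg 27 -> -20 */
}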

I have created some new predicate routines to check whether an operand is, 
for example, a general_or_synth_operand.  These are used in instruction 
predicates for instructions that might use a Synthetic register.  Appropriate 
predicates prevent Synthetic-register-to-memory instructions, in exactly the 
same way that memory-to-memory instructions are prevented.
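
In the style of the existing i386 predicates, the new routine amounts to 
little more than this (a sketch; synth_register_operand is my own new 
predicate, not an existing one):

int
general_or_synth_operand (rtx op, enum machine_mode mode)
{
  return synth_register_operand (op, mode)
         || general_operand (op, mode);
}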

All Synthetic registers are mode SI.  Period.  If this does not prove out my 
concept in at least a few instances, then the concept will never work, no 
matter how many other modes I add.  If it does prove out my concept, I can 
make a rational decision about how much extra work to put into implementing 
other modes.  

The list of instructions that might operate on a Synthetic register is short: 
mov, add, sub, neg, and, or, xor, ror, rol, ashl, ashr, and lshr (sp?).  In 
addition, a synthetic register may be a source for a mul or divmod.  For this 
short list of instructions, a Synthetic register can be a source or 
destination, in conjunction with a register, or it may be a destination in 
conjunction with an immediate value.

The instructions that might use a Synthetic register are almost unchanged.  
"k" registers are added as the first alternative, with the same restrictive 
syntax used for "m", memory, references.  

For register allocation, the allocation list starts with and runs through all 
the Synthetic registers before trying a general register.  

Finally, in the routine that generates the register-representative character 
string to be concatenated to the current instruction,  I intercept Synthetic 
registers and generate appropriate base-offset addresses.  I know I do not 
know exactly what I am doing here, so it will take some sorting out.  The 
formulation is clear.  The RTL is not.

I do not expect this to be enough, but it is a start.

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06  6:39   ` Andy Walker
@ 2003-01-06  6:50     ` Daniel Berlin
  2003-01-06  9:00       ` Andy Walker
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Berlin @ 2003-01-06  6:50 UTC (permalink / raw)
  To: Andy Walker; +Cc: Tom Lord, dewar, gcc

>
> My goal is that the generated code with Synthetic registers will be
> significantly better (looking just at the instruction count).  If the 
> fields
> have individual values, in your example, then Reload imposes an 
> overhead.  It
> is a two step process: move the spilled data back into a register, 
> then use
> it.  With a Synthetic, you get to just use it.
>
If the new allocator determines the data in question is cheaper to 
recalculate than it would be to reload it from memory, it'll do so 
(though it might be a bit limited in what it *can* recalculate right 
now; this is fixable. I don't have numbers on how often it would have 
wanted to remat but couldn't due to these problems). It's known as 
"rematerialization".
Thus, in the example of vtable pointers, it'll simply recalculate the 
value into the new register, rather than reload it from memory.
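
In ia32 terms the difference is roughly this (a sketch, with esi standing 
in for the object pointer):

mov [ebp-24],eax ; spill the vtable pointer ...
mov eax,[ebp-24] ; ... and reload it later

versus rematerializing:

mov eax,[esi]    ; refetch the vtable pointer from the object itself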

> Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 19:42 ` Tom Lord
@ 2003-01-06  6:49   ` Andy Walker
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-06  6:49 UTC (permalink / raw)
  To: Tom Lord, dewar; +Cc: gcc

On Sunday 05 January 2003 07:59 am, Tom Lord wrote:
>        If you decide to allocate a base register, as was proposed at
>        one point,
>
> Bah.  I missed that (misread while skimming, actually).  Maybe his
> implementation approach is bogus after all.  Why not use the FP, if
> it's there, or SP when FP is omitted.

I used the ecx register initially, because, at the time, I did not know how 
"fixed" the frame pointer was.  If it changed during module execution, all 
the Synthetic regs would be hosed.  

The hard frame pointer is fixed, I am using it, the Synthetic regs are not 
hosed.

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 19:42 ` Tom Lord
@ 2003-01-06  6:39   ` Andy Walker
  2003-01-06  6:50     ` Daniel Berlin
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Walker @ 2003-01-06  6:39 UTC (permalink / raw)
  To: Tom Lord, dewar; +Cc: gcc

On Sunday 05 January 2003 07:33 am, Tom Lord wrote:
>        Remember that "retrieving from memory" is *EXACTLY*
>        the same code sequence as reading a synthetic register,
>        assuming both are on the current stack frame.

Better not be, or I have wasted a lot of time.

> Two replies:
>
> 1) I don't fully understand why synthregs aren't a common area rather
>    than part of stack frames.  A common area _adds_ code to
>    save/restore synthregs -- but it also increases the number and
>    frequency of references to synthregs.  I don't think L1 is the only
>    cache that can be used better by synthregs.

Threading processes using synthregs from a common area gets complicated, and 
I am avoiding complication.  

I am implementing a relatively simple and straightforward approach, with a 
primary goal of getting what Robert keeps requesting: real, measurable, 
output code.  If it is a complete failure, I will quit.  Experience tells me 
it will be a partial success.  I will learn from it what none of us knew, and 
build on it.

Another common area objection: I want to link synthreg compiled routines with 
non-synthreg compiled routines.  Have synthreg routines call non-synthreg 
routines which in turn call synthreg routines.  Now I have a problem of 
"who's on first?"  Every synthreg routine using a common area would have to 
test to see if the common area is allocated or if it exists.  Worse, this 
could easily be a memory access to (horrors) a location that is not in any 
cache.      

Finally, I do not see any benefit to common-area Synthetic registers.  If I 
get the alignment correct, the Synthetic registers will fully occupy two, or 
maybe one, cache buffer blocks (32 registers at four bytes each is 128 bytes: 
two 64-byte lines, or one 128-byte block).  The very first access will 
probably cost a processor stall/memory fetch.  However, this exact same cost 
will be added to the multiple register store instructions used for common-area 
Synthetics.  Where would we save the contents of the common-area Synthetic 
registers?  In exactly the same memory location I am using for the frame-based 
Synthetic registers.  Once the first access has occurred, the Synthetic 
registers will be in L1 cache, and will stay there because they are 
constantly hit during program execution.  

My estimate is that the _only_ difference will be the extra execution time 
for the register save/restore processes.

>
> 2) Same code sequence (or "worse"), yes.  Same cache interaction, no.
>
>    > It might eventually lead to some hw advances: give synthregs with
>    > absolute locations cache preference.  Or, if synthregs are on the
>    > stack, give locations near the frame pointer cache preference (or is
>    > that done already?).
>
>    I don't see that as a good idea at all. The stack frame indeed will
>    almost always be in cache with current designs, and locking cache
>    seems a bad idea.
>
> If you're right -- then storing additional non-intermediate values
> on the stack (as stack-based synthregs) may very well be a win.
>
> If I'm right, then the net effect of the proposed HW changes is to
> bump the number of registers, but to have some registers accessed by
> shorter code sequences than others.
>
> Either way, synthregs (plausibly at least) wins.
>
<snip>
> But, sketchingly, let's think of a function that manipulates a dozen
> C++ objects, each with a vtable.  It also manipulates some fields in
> each object.  The vtable pointers are going to be used lots of times --
> each field, just once.  (Maybe the fields are array elements and
> we're talking about a loop here.)
>
> I can't fit all those vtable pointers in regs, but I can fit them in
> synthregs.  Do agree that the reg allocator, applied to synthregs,
> will keep those vtable pointers in synthregs?
>
> So now the generated code (looking just at the instruction count) with
> synthregs will be slightly _worse_ than the code without synthregs --
> but if the synthregs really do wind up with noticeably better cache
> performance, it'll run faster.

My goal is that the generated code with Synthetic registers will be 
significantly better (looking just at the instruction count).  If the fields 
have individual values, in your example, then Reload imposes an overhead.  It 
is a two step process: move the spilled data back into a register, then use 
it.  With a Synthetic, you get to just use it.  

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-06 19:50 ` Tom Lord
@ 2003-01-06  6:29   ` Andy Walker
  2003-01-06 21:53   ` Michael S. Zick
  1 sibling, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-06  6:29 UTC (permalink / raw)
  To: Tom Lord; +Cc: dewar, gcc

Thank you.  You have clearly stated some things that I have been implying.

On Sunday 05 January 2003 06:24 am, Tom Lord wrote:
>        dewar:
>
> 	This is a bit of an odd statement. In practice on a machine
> 	like the x86, the current stack frame will typically be
> 	resident in L1 cache, and that's where the register allocator
> 	spills to. What some of us still don't see is the difference
> 	in final resulting code between your "synthetic registers" and
> 	normal spill locations from the register allocator.
>
>
> Register spills clearly don't equal synthetic registers.
>
> Presumably, the number of locations dedicated to register spills never
> exceeds (approximately) the maximum number of simultaneously live
> _intermediate_ values minus the number of general purpose registers.
> Any non-intermediate value (i.e., one that has a main memory
> location), rather than being spilled, will be written to its location.
> If that value is later re-used, it will be retrieved from memory.
>
> The number of synthetic registers can be much larger than the number
> of simultaneously live intermediate values.
>
> So, with synthetic registers, some values that are not intermediates
> can be retained (in synthetic registers).  Without synthetic
> registers, the next time those values are used, they have to be
> fetched from (non-special) memory.
>
> In other words, with synthregs, the CPU can ship some value off to
> memory and not care how long it takes to get there or to get back from
> there -- because it also ships it off to the synthreg, which it
> hypothetically has faster access to.
>
> In practice, that means that synthregs will store some values in
> memory twice: once in the location the program text says they go in;
> again in the synthetic register.  If the synthetic register is indeed
> cache-favored, maybe there's a performance win there -- and if so, a
> register allocator is the right algorithm to decide which values to
> keep duplicated in synthetic registers (so the proposed implementation
> strategy is sensible).
>
> (Another weird interaction is intermediate values that can be
> recalculated -- I don't know if GCC ever makes that trade-off --
> if it does, it needs to be tuned for synthregs.)
>
> So, does that hypothesis (that synthreg access is faster than general
> memory access) hold?  Quite possibly.  For example, a re-used synthreg
> inherits cache-presence (at all levels, not just L1) from the previous
> uses.  synthregs may win for some apps for more than just L1 reasons.
>
> This brings in new alignment issues, too.  If you can, you might want
> to make sure that your allocator locates its metadata where it will
> cache-collide with the synthregs, to help push allocated memory out of
> those locations (presuming here that allocator meta-data is relatively
> infrequently accessed).  It's probably not all that hard to do this
> "by accident".  Just in general: do things to protect the
> cache-presence of the synthregs.
>
> It might eventually lead to some hw advances: give synthregs with
> absolute locations cache preference.  Or, if synthregs are on the
> stack, give locations near the frame pointer cache preference (or is
> that done already?).
>
> I'd therefore guess it will be a very system-specific optimization --
> but that it will win often enough to be useful.  And given what I
> understand about trends in architecture, the cases in which it will
> win will sharply increase over time.
>
> No?
>
> -t
>
> p.s.: arch foo thinking about non-disruptive ways to improve gcc's
>       rev ctl practices:
>
>      
> http://lists.fifthvision.net/pipermail/arch-users/2003-January/001856.html
>
>       and some of the follow-ups.   It's a pretty "noisy" list,
>       though.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 11:41 Robert Dewar
  2003-01-05 16:30 ` Michael S. Zick
@ 2003-01-06  4:53 ` Andy Walker
  2003-01-06 19:50 ` Tom Lord
  2 siblings, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-06  4:53 UTC (permalink / raw)
  To: Robert Dewar, denisc; +Cc: gcc

On Sunday 05 January 2003 05:38 am, Robert Dewar wrote:
<snip>
> In practice on a machine like the x86,
> the current stack frame will typically be resident in L1 cache, and that's
> where the register allocator spills to. 

An interesting and valuable piece of information.  

Thank you.

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 13:13 Robert Dewar
@ 2003-01-06  4:40 ` Andy Walker
  2003-01-06 16:46   ` Michael S. Zick
  2003-01-06 19:42 ` Tom Lord
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Walker @ 2003-01-06  4:40 UTC (permalink / raw)
  To: Robert Dewar, lord; +Cc: denisc, gcc

Again, Thank You for your time and consideration in making responses.

First: you have made several requests that I somehow demonstrate that 
something like the Synthetic register approach is workable.  Fair enough.  I 
will make a stab at it, if for no other reason than that you are the only 
person to show me the courtesy of answering simple questions.  

How about this for a thought experiment: I call up Bell Labs and have them do 
it for me.  As far as I am concerned, their corporate integrity is at the 
top, and a paper from them is routinely more reliable than most reviewed 
works in scientific journals.  I will ask them to have one of their 
researchers modify an old version of gcc to use a few memory locations as 
artificial registers.  Then run some timing tests on compiled code just to 
see if it makes any difference.  If they report that it makes an improvement, 
I will go ahead with my investigations.  If not, I will not need to waste 
any more posts to this list on a silly, naive, and useless approach.  

OK.  Time is up.  The dear souls at Bell Labs used their crystal ball, 
figured out that I would need this information, and have conveniently posted 
it here: http://cm.bell-labs.com/cm/cs/what/smlnj/compiler-notes/k32.ps .
I am relieved to report that their investigation was a success.  Lal George's 
implementation is different from mine in several respects.  I am satisfied 
that it is near enough to "Synthetic registers" to validate my investigation.  
No guarantee of success or value, but a solid indication that value might 
exist.

On Sunday 05 January 2003 07:02 am, Robert Dewar wrote:
<snip>
> Remember that
> "retrieving from memory" is *EXACTLY* the same code sequence as reading
> a synthetic register, assuming both are on the current stack frame.

I am not at all convinced of this.  I surmise that Reload knows nothing about 
the meaning of a piece of data in a stack slot.  I conclude that because RTL 
does not keep that information.  Reload's only option, then, is to physically 
move the data back into the register and try the specified instruction.  This 
should pretty well obliterate any previous attempts at pipeline/instruction 
scheduling, and generate a tremendous number of pipeline stalls.  And gcc 
does exactly that.  

Simulated comparison of a loop end:

w/o Synth 
...
mov eax,[StackSlot27] ; Load the increment from spill   
mov edx,[StackSlot23] ; Load the index -- spilled for lack of registers
lea ecx,[eax + edx] ;      Nicely optimized "add" 
mov edx,[StackSlot28];  Load the loop limit from spill
cmp ecx,edx  ;             Compare the index to the loop limit.
...

w/ Synth
...
add ecx,[ebp -20] ;  Add synthreg 27, the increment, to the index.
cmp ecx,[ebp -16];  Compare the index to the loop limit in synthreg 28.
...

This is my concept.  Is it reality?  I will not know until I have tried it.  

<snip>
> Once again, I would just love to see one (1) example of what is being
> talked about here. 

Me too.

>Let's see a small kernel in source, the current GCC code
> being generated, and the amazing improved code that can be generated with
> synthetic registers (which are nothing more than local memory locations).
> At this stage I really can't imagine such an example, so, assuming this is
> a failure of my imagination (I am not the only one with this handicap),
> please enlighten with one convincing example :-)

IIUYC, you want me to hand compile a small kernel source, and compare it to 
GCC, after I have repeatedly stated that the smaller the module, the less 
value there is in Synthetic registers?  Or would you prefer that I hand 
compile a large kernel source, wildly guessing all along as to how gcc will 
REALLY do the allocations, to really demonstrate any value of the 
approach?

Thank you for your suggestion, but no.   

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 15:47 Robert Dewar
@ 2003-01-05 22:14 ` Tom Lord
  0 siblings, 0 replies; 85+ messages in thread
From: Tom Lord @ 2003-01-05 22:14 UTC (permalink / raw)
  To: dewar; +Cc: gcc, ja_walker



       > Otherwise, what you lose for locals/args (the dominant case) will
       > probably exceed what you gain for other values.

       Actually I think you will break even most of the time and
       generate essentially identical code.


Won't you still get stupid sequences such as a maybe-aliased local
being loaded repeatedly into a synthreg in a loop?
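
Sketch of the worry, assuming the local's address escapes (all names made 
up):

extern int *escaped;
extern void touch (void);

int
f (int n)
{
  int total = 0;
  int i;
  escaped = &total;   /* total is now maybe-aliased */
  for (i = 0; i < n; i++)
    {
      touch ();       /* may write *escaped, i.e. total */
      total += i;     /* a synthreg copy would be stale: reload each pass */
    }
  return total;
}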

If it weren't for problems like that then, yeah, sure, have a huge
number of potential synthregs, almost never spill, and get code that
looks just like usual -- except that it caches some values that are
stored in memory on the stack (pretty much the goal).

But no, if you just "trick" the reg allocator by raising the number of
apparent registers, it'll do stupid things like pointlessly moving a
local to a synthreg before operating on it.

So, the decision about what non-local, non-arg, non-intermediate
values to put in synthregs -- that _is_ isomorphic to the register
allocation problem -- but before tricking the register allocator into
that, the register allocator would have to recognize that real
registers are cheaper than everything and that synthregs are equal to
local/args/intermediates -- so that's presumably a generalization of
graph coloring.


-t

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 14:08 Robert Dewar
@ 2003-01-05 16:50 ` Michael S. Zick
  2003-01-06 19:42 ` Tom Lord
  1 sibling, 0 replies; 85+ messages in thread
From: Michael S. Zick @ 2003-01-05 16:50 UTC (permalink / raw)
  To: Robert Dewar, dewar, lord; +Cc: gcc, ja_walker

On Sunday 05 January 2003 08:05 am, Robert Dewar wrote:
> > Bah.  I missed that (misread while skimming, actually).  Maybe his
> > implementation approach is bogus after all.  Why not use the FP, if
> > it's there, or SP when FP is omitted.
>
> well indeed FP makes more sense, and that's why the discussion led there
>
> > But it can also improve both locality and the temporally proximate
> > re-use of memory locations.
>
> That's really not an issue for scalars. L1 caches are small but not that
> small. Once again, empirically most references to local stack frames are
> in cache anyway, so there's really not much to improve here.
>
> ja_walker is right: it's a worthwhile empirical question
>
Agreed.  
Further consider: any line of investigation that leads
to improvements in cache utilization is worthwhile.

Example, my several-years-out-of-date system:
CPU: L1 I-cache: 16K, L1 D-cache: 16K
CPU: L2 cache: 512K
Both running at core clock speed.
(Times two of them {SMP})

Ignoring the fact that they aren't linearly addressable memory,
and going by size alone...

That is enough memory to run PC-DOS, along with a few
memory-resident programs and a respectably sized application
such as a spreadsheet, and never touch main memory.

Mike

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-05 11:41 Robert Dewar
@ 2003-01-05 16:30 ` Michael S. Zick
  2003-01-06  4:53 ` Andy Walker
  2003-01-06 19:50 ` Tom Lord
  2 siblings, 0 replies; 85+ messages in thread
From: Michael S. Zick @ 2003-01-05 16:30 UTC (permalink / raw)
  To: Robert Dewar, ja_walker; +Cc: gcc

On Sunday 05 January 2003 05:38 am, Robert Dewar wrote:
> > Before I started this, I had never heard of an optimization technique
> > that tries to take advantage of L1 cache.  That may very well indicate
> > that the register allocator really is "just dumb".  (No flame wars,
> > please. Outstanding and brilliant developers did the best they could with
> > the algorithms they had.  I sincerely doubt that I could have done as
> > well).
>
> This is a bit of an odd statement. In practice on a machine like the x86,
> the current stack frame will typically be resident in L1 cache, and that's
> where the register allocator spills to. What some of us still don't see
> is the difference in final resulting code between your "synthetic
> registers" and normal spill locations from the register allocator.
>
> Perhaps you could give at least a small example of actual code. We all know
> that (even on the 486), register-register moves take the same time as
> register-local stack frame moves when the local stack frame is in cache,
- - ^ ^ ^ ^ ^ ^
> and the code that GCC generates now heavily depends on this.
Sort of.

The ia32 instruction set is not limited to load/store and push/pop of a data
item in the stack frame.

The ia32 does not provide a separate cache for stack instructions (S-cache).
Also, typical program memory layout places the stack/heap and data/bss in the 
same address space.

-> on the ia32, stack frame operations and data area operations both hit
the D-cache.

Which brings us back to the original question: 
"What is the difference between an allocated spill slot in the stack frame
and a synthetic register in the stack frame?"

Ans. 1) From the operation of the hardware - nothing, zip, zero, nada.
Ans. 2) From the domain of instructions GCC might generate, perhaps
a whole lot, but, then again perhaps not.

So, to quantify Ans. 2 as a matter of practical implementation: rather than
rewrite the spill slot allocation and the code generation selection for
spill slots, just define a new set of registers with a limited subset of
instructions - then run the compiler and examine the result.

Question: "Does GCC use spill slots as effectively as the synthetic
registers?"
If not - fix something, with the differences in usage as a guide.

Mike

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-05 15:47 Robert Dewar
  2003-01-05 22:14 ` Tom Lord
  0 siblings, 1 reply; 85+ messages in thread
From: Robert Dewar @ 2003-01-05 15:47 UTC (permalink / raw)
  To: dewar, lord; +Cc: gcc, ja_walker

> Otherwise, what you lose for locals/args (the dominant case) will
> probably exceed what you gain for other values.

Actually I think you will break even most of the time and generate
essentially identical code.

You will end up saying, "great, I can keep this local variable Q in a 
register, I don't need to store it in the stack frame" -- but then the
register turns out to be an SR, and in fact Q is right back there in
the stack frame, with identical instructions used to access it.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-05 14:08 Robert Dewar
  2003-01-05 16:50 ` Michael S. Zick
  2003-01-06 19:42 ` Tom Lord
  0 siblings, 2 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-05 14:08 UTC (permalink / raw)
  To: dewar, lord; +Cc: denisc, gcc, ja_walker

> Bah.  I missed that (misread while skimming, actually).  Maybe his
> implementation approach is bogus after all.  Why not use the FP, if
> it's there, or SP when FP is omitted.
> 

well indeed FP makes more sense, and that's why the discussion led there

> But it can also improve both locality and the temporally proximate
> re-use of memory locations.

That's really not an issue for scalars. L1 caches are small but not that
small. Once again, empirically most references to local stack frames are
in cache anyway, so there's really not much to improve here.

> ja_walker is right: it's a worthwhile empirical question

it's reasonable to ask the question, but the way to explore the answer is
to study some examples in detail. I think you will find it is very
difficult to provide even one semi-convincing example if you look at it
in detail.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-05 14:05 Robert Dewar
  2003-01-06 19:42 ` Tom Lord
  0 siblings, 1 reply; 85+ messages in thread
From: Robert Dewar @ 2003-01-05 14:05 UTC (permalink / raw)
  To: dewar, lord; +Cc: denisc, gcc, ja_walker

> 1) I don't fully understand why synthregs aren't a common area rather
>    than part of stack frames.  A common area _adds_ code to
>    save/restore synthregs -- but it also increases the number and
>    frequency of references to synthregs.  I don't think L1 is the only
>    cache that can be used better by synthregs.

But if you put the SR's in a global area, then indeed they WILL require
6-byte instructions for their access, which can greatly increase
pressure on the instruction cache and will likely slow things down.

Once again, you really can not discuss this in the abstract without looking
at the actual code sequences.

If you decide to allocate a base register, as was proposed at one point,
for the synth registers, then you are losing one of your real registers,
and that is a huge hit, you won't begin to buy that back.

With a change in the ABI and OS, you could use FS or GS as the base
register, but that still does not help much, since the code would
still be worse than normal access to local variables.

Remember again that the code to access SR's can be no better than the
code we generate right now for all accesses to local variables, function
arguments in memory, and spilled registers.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-05 13:13 Robert Dewar
  2003-01-06  4:40 ` Andy Walker
  2003-01-06 19:42 ` Tom Lord
  0 siblings, 2 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-05 13:13 UTC (permalink / raw)
  To: dewar, lord; +Cc: denisc, gcc, ja_walker

> So, with synthetic registers, some values that are not intermediates 
> can be retained (in synthetic registers).  Without synthetic
> registers, the next time those values are used, they have to be 
> fetched from (non-special) memory.

Well most certainly you should not get trapped into a situation where CSE
values *must* live in registers, but that's not a problem. Remember that
"retrieving from memory" is *EXACTLY* the same code sequence as reading
a synthetic register, assuming both are on the current stack frame. 
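
That is, the access is byte-for-byte the same instruction either way 
(sketch):

mov eax,[ebp-16] ; reload a local/spill from the stack frame
mov eax,[ebp-16] ; read a "synthetic register" -- the identical instruction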

> It might eventually lead to some hw advances: give synthregs with
> absolute locations cache preference.  Or, if synthregs are on the
> stack, give locations near the frame pointer cache preference (or is
> that done already?).

I don't see that as a good idea at all. The stack frame indeed will almost
always be in cache with current designs, and locking cache seems a bad idea.

Once again, I would just love to see one (1) example of what is being talked
about here. Let's see a small kernel in source, the current GCC code being
generated, and the amazing improved code that can be generated with synthetic
registers (which are nothing more than local memory locations). At this stage
I really can't imagine such an example, so, assuming this is a failure of
my imagination (I am not the only one with this handicap), please enlighten
with one convincing example :-)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-05 11:41 Robert Dewar
  2003-01-05 16:30 ` Michael S. Zick
                   ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-05 11:41 UTC (permalink / raw)
  To: denisc, dewar, ja_walker; +Cc: gcc

> Before I started this, I had never heard of an optimization technique that 
> tries to take advantage of L1 cache.  That may very well indicate that the 
> register allocator really is "just dumb".  (No flame wars, please.  
> Outstanding and brilliant developers did the best they could with the 
> algorithms they had.  I sincerely doubt that I could have done as well).  

This is a bit of an odd statement. In practice on a machine like the x86, 
the current stack frame will typically be resident in L1 cache, and that's
where the register allocator spills to. What some of us still don't see
is the difference in final resulting code between your "synthetic registers"
and normal spill locations from the register allocator. 

Perhaps you could give at least a small example of actual code. We all know
that (even on the 486), register-register moves take the same time as
register-local stack frame moves when the local stack frame is in cache,
and the code that GCC generates now heavily depends on this.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-04 18:00 ` Denis Chertykov
@ 2003-01-05  5:53   ` Andy Walker
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-05  5:53 UTC (permalink / raw)
  To: Denis Chertykov, Robert Dewar; +Cc: gcc

On Saturday 04 January 2003 11:59 am, Denis Chertykov wrote:
> dewar@gnat.com (Robert Dewar) writes:

<snip>

> > If "synthetic registers" help, it would
> > just indicate that the optimizer and code generator is operating very
> > poorly.
>
> It indicates that the register allocator is operating very poorly or just
> dumb.

Before I started this, I had never heard of an optimization technique that 
tries to take advantage of L1 cache.  That may very well indicate that the 
register allocator really is "just dumb".  (No flame wars, please.  
Outstanding and brilliant developers did the best they could with the 
algorithms they had.  I sincerely doubt that I could have done as well).  

My approach is sort of a second stab at taking advantage of L1 cache as an 
optimization technique.  For the first stab, see Lal George's paper 
http://cm.bell-labs.com/cm/cs/what/smlnj/compiler-notes/k32.ps.   

Please note that Mr. George's approach was successful in speeding up program 
execution.

Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-04 14:50 Robert Dewar
  2003-01-04 18:00 ` Denis Chertykov
@ 2003-01-05  5:43 ` Andy Walker
  1 sibling, 0 replies; 85+ messages in thread
From: Andy Walker @ 2003-01-05  5:43 UTC (permalink / raw)
  To: Robert Dewar, zack; +Cc: dnovillo, gcc, sabre

<disclaimer>  I am not the font of all knowledge and wisdom.  I can't even 
get my crystal ball to work.  Please accept my comments with this in mind. 
</disclaimer>

First: thank you for your comments.  I very much appreciate your and everyone 
else's efforts to read, review, and comment on my postings.  

I think you are missing the point of Synthetic registers.  

I am not talking about optimization or code generation passes.  As far as I 
can tell, these routines are completely unaffected by the number of natural 
registers available on the target machine.  I do not know how to improve the 
quality of these two passes.  From everything that I have read, both on this 
mailing list, and from other sources, my opinion is that these two passes 
perform at World Class levels.  I doubt that I could make any significant 
improvement in either of them.

Not all architectures are created equal.  The more general purpose registers 
there are, the easier it is for the compiler to allocate registers.  
Otherwise, everyone would build single-accumulator machines and we would not 
have this discussion.  For 4 GHz machines accessing a gigabyte of real 
memory, using only six general purpose registers is analogous to draining a 
railroad car of fine dry sand through an hourglass.  It takes time.

The IBM 360/390 architecture uses 16 general purpose registers.  My 
experience from hand-coding assembler in that environment was that 16 
registers was not quite enough.  Much better than the eight on my beloved 
PDP-11, but still not enough.  I concede that some of that may have been 
laziness on my part.  

Mr. Berlin says aspects of register allocation are NP-complete.  I had a 
hunch this was the case.  ( NP-complete == Programmer's Full Employment Act  
;o)  Since that is so, I expect experienced assembler programmers to be 
generally more efficient than compilers when allocating registers.  

Taking these two observations together, I estimate that a compiler would 
really need 20 to 25 registers for moderately sized programs.  

From my point of view, the register allocator has the task of cramming 
twenty-five pounds of register into a six-pound sock.  It really needs to 
carry around twenty-five things at the same time.  Picking up one more thing 
requires laying something else down, something that will have to be picked up 
again in the future.  By the time we get down to six registers, things get 
pretty bad.  The things we are laying down are references: the very things 
that tell us how to get back to what we have to pick up.  The work is 
doubled or tripled.  
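
A contrived scrap of C, purely for illustration, shows the squeeze:

    /* Contrived illustration: after the four multiplies, t1..t4 and
       all eight parameters are still live.  With only six allocatable
       x86 GPRs, the allocator must "lay down" (spill) several values
       to stack slots and pick them up again later. */
    int pressure(int a, int b, int c, int d,
                 int e, int f, int g, int h)
    {
        int t1 = a * b, t2 = c * d, t3 = e * f, t4 = g * h;

        return t1 + t2 + t3 + t4
             + (a - b) + (c - d) + (e - f) + (g - h);
    }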

The register allocator puts a great deal of effort into it, in part because 
it differentiates between registers and memory.  It assumes that all memory 
access is much more expensive than register access.

My observation is that not all memory is the same.  A limited number of 
four-byte word locations, aligned on L1 cache boundaries, can be frequently 
used.  Because they are frequently used, they remain in L1 cache.  Because 
they are in L1 cache, their access times will be similar or equal to register 
access times.  Because I am using mostly ModRM-encoded instructions from the 
x86 instruction set, the instruction length will be similar to that of 
register-to-register instructions (three bytes instead of two).
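
A hand-written GNU C scrap shows the size difference I mean.  It is an 
illustration only: the stack slot -8(%ebp) stands in for one Synthetic 
register, and the byte counts are the standard IA-32 encodings:

    /* Hand-written sketch (GNU C basic asm, x86), illustration only.
       The slot -8(%ebp) stands in for one Synthetic register that
       frequent use keeps resident in L1. */
    void encoding_demo(void)
    {
        __asm__ ("addl %ebx, %eax");      /* reg-reg: 01 d8    -- 2 bytes */
        __asm__ ("addl -8(%ebp), %eax");  /* reg-mem: 03 45 f8 -- 3 bytes:
                                             opcode, ModRM, disp8 */
    }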

Under these very narrow conditions, this limited set of memory locations 
performs nearly as well as registers.  On a processor without these 
properties, such as a 486, Synthetic registers are pointless.

My approach is to tell the compiler that these memory locations are really 
registers with a few special extra rules.  My hunch, and it is only a hunch, 
is that register content motion will be better handled overall, resulting in 
improved performance.  In plain English: faster throughput from less 
dithering about.  
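
For what it is worth, the rough shape of the change I am experimenting with 
is sketched below.  FIRST_PSEUDO_REGISTER, REGISTER_NAMES, and 
FIXED_REGISTERS are real GCC target macros from the i386 port; the counts 
and "synth" names are placeholders I have invented, not working code:

    /* Rough sketch only, i386.h-style; counts and names invented. */
    #define SYNTH_REGS 4     /* small for illustration; more in practice */

    /* The hard-register block, then the Synthetic block. */
    #define FIRST_PSEUDO_REGISTER (8 + SYNTH_REGS)

    #define REGISTER_NAMES \
      { "ax", "dx", "cx", "bx", "si", "di", "bp", "sp", \
        "synth0", "synth1", "synth2", "synth3" }

    /* Each synth also needs an entry in FIXED_REGISTERS, plus rules
       telling reload that it lives at a fixed, L1-aligned frame
       offset. */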

I could easily be completely wrong, for reasons I have not considered.  

That is what makes this whole thing such an experiment.  

Andy

On Saturday 04 January 2003 08:49 am, Robert Dewar wrote:
> > What I think Diego is trying to say is, creating synthetic registers
> > for the x86 isn't going to help much, possibly not at all, because the
> > optimizer passes that could benefit already have unlimited registers
> > to work with.
>
> I would put it a different way. If "synthetic registers" help, it would
> just indicate that the optimizer and code generator are operating very
> poorly. I certainly don't have the impression that this is the case,
> at least not at the level that this naive synthetic register approach
> would help.
>
> Wouldn't it be best to take some typical kernels, look at the code
> generated by GCC, and then try by hand to see how much help SRs would be?
> I am pretty sure this will quickly discourage the approach and save a lot
> of wasted effort in modifying gcc.
>
> An approach that might really be helpful is to have the register allocator
> and scheduler understand the existence and behavior of renamed registers.
> Quite often you see gcc-generated code use two registers when it could
> use one, under the illusion that this helps, when in fact it does not
> since the hardware would in any case use two registers using register
> renaming.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-04 18:12 Robert Dewar
  0 siblings, 0 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-04 18:12 UTC (permalink / raw)
  To: denisc, dewar; +Cc: dnovillo, gcc, ja_walker, sabre, zack

> It indicates that the register allocator is operating very poorly or is
> just dumb.

Exactly, since the implementation of "synthetic registers" is quite naive,
and the register allocator should be able to do at least this well.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
  2003-01-04 14:50 Robert Dewar
@ 2003-01-04 18:00 ` Denis Chertykov
  2003-01-05  5:53   ` Andy Walker
  2003-01-05  5:43 ` Andy Walker
  1 sibling, 1 reply; 85+ messages in thread
From: Denis Chertykov @ 2003-01-04 18:00 UTC (permalink / raw)
  To: Robert Dewar; +Cc: ja_walker, zack, dnovillo, gcc, sabre

dewar@gnat.com (Robert Dewar) writes:

> > What I think Diego is trying to say is, creating synthetic registers
> > for the x86 isn't going to help much, possibly not at all, because the
> > optimizer passes that could benefit already have unlimited registers
> > to work with.
> 
> I would put it a different way. If "synthetic registers" help, it would
> just indicate that the optimizer and code generator are operating very
> poorly.

It indicates that the register allocator is operating very poorly or is
just dumb.

Denis.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: An unusual Performance approach using Synthetic registers
@ 2003-01-04 14:50 Robert Dewar
  2003-01-04 18:00 ` Denis Chertykov
  2003-01-05  5:43 ` Andy Walker
  0 siblings, 2 replies; 85+ messages in thread
From: Robert Dewar @ 2003-01-04 14:50 UTC (permalink / raw)
  To: ja_walker, zack; +Cc: dnovillo, gcc, sabre

> What I think Diego is trying to say is, creating synthetic registers
> for the x86 isn't going to help much, possibly not at all, because the
> optimizer passes that could benefit already have unlimited registers
> to work with.

I would put it a different way. If "synthetic registers" help, it would
just indicate that the optimizer and code generator are operating very
poorly. I certainly don't have the impression that this is the case,
at least not at the level that this naive synthetic register approach
would help.

Wouldn't it be best to take some typical kernels, look at the code generated
by GCC, and then try by hand to see how much help SRs would be?  I am pretty
sure this will quickly discourage the approach and save a lot of wasted
effort in modifying gcc.
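
A kernel as small as the following (an illustrative example, not from any
particular benchmark) would do:

    /* One example of the sort of small kernel meant above; compile
       with "gcc -O2 -S" and inspect the register use by hand. */
    double dot(const double *a, const double *b, int n)
    {
        double s = 0.0;
        int i;

        for (i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }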

An approach that might really be helpful is to have the register allocator
and scheduler understand the existence and behavior of renamed registers.
Quite often you see gcc-generated code use two registers when it could
use one, under the illusion that this helps, when in fact it does not
since the hardware would in any case use two registers using register
renaming.
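
A hand-written illustration of that pattern (made up for this message, not
actual GCC output):

    /* Hand-written illustration, not GCC output: two short-lived
       temporaries occupy two architectural registers where one would
       do; the renaming hardware assigns separate physical registers
       to the two lifetimes either way. */
    void renaming_demo(void)
    {
        __asm__ ("movl (%esi), %eax\n\t"   /* temp 1 born here...        */
                 "addl %eax, %ecx\n\t"     /* ...and dead after this add */
                 "movl 4(%esi), %ebx\n\t"  /* temp 2 takes a new register */
                 "addl %ebx, %ecx");       /* %eax could have been reused */
    }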

^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2003-01-08 20:26 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-12-27  5:47 An unusual Performance approach using Synthetic registers Chris Lattner
2002-12-29  0:35 ` Andy Walker
2002-12-29  5:58   ` Chris Lattner
2002-12-29  6:26     ` Alexandre Oliva
2002-12-29 12:04     ` Andy Walker
2002-12-29 13:58       ` Daniel Berlin
2002-12-29 22:41         ` Andy Walker
2002-12-29 15:50       ` Diego Novillo
2002-12-29 22:44         ` Andy Walker
2002-12-30  1:30           ` Zack Weinberg
2002-12-30  2:57             ` Andy Walker
2002-12-30  7:52             ` Michael S. Zick
2002-12-29  7:44   ` Daniel Egger
2002-12-29 12:10     ` Andy Walker
2002-12-30 20:58       ` James Mansion
2002-12-31  3:56         ` Michael S. Zick
2002-12-30  1:09     ` Michael S. Zick
2002-12-30  7:27       ` Daniel Egger
2002-12-30 10:25         ` Michael S. Zick
2002-12-30 20:50         ` Daniel Berlin
2003-01-04 14:50 Robert Dewar
2003-01-04 18:00 ` Denis Chertykov
2003-01-05  5:53   ` Andy Walker
2003-01-05  5:43 ` Andy Walker
2003-01-04 18:12 Robert Dewar
2003-01-05 11:41 Robert Dewar
2003-01-05 16:30 ` Michael S. Zick
2003-01-06  4:53 ` Andy Walker
2003-01-06 19:50 ` Tom Lord
2003-01-06  6:29   ` Andy Walker
2003-01-06 21:53   ` Michael S. Zick
2003-01-07  6:02     ` Andy Walker
2003-01-07 17:41       ` Janis Johnson
2003-01-05 13:13 Robert Dewar
2003-01-06  4:40 ` Andy Walker
2003-01-06 16:46   ` Michael S. Zick
2003-01-07  5:20     ` Andy Walker
2003-01-06 19:42 ` Tom Lord
2003-01-06  6:39   ` Andy Walker
2003-01-06  6:50     ` Daniel Berlin
2003-01-06  9:00       ` Andy Walker
2003-01-05 14:05 Robert Dewar
2003-01-06 19:42 ` Tom Lord
2003-01-06  6:49   ` Andy Walker
2003-01-05 14:08 Robert Dewar
2003-01-05 16:50 ` Michael S. Zick
2003-01-06 19:42 ` Tom Lord
2003-01-06  8:06   ` Andy Walker
2003-01-06 22:45     ` Michael S. Zick
2003-01-07  6:04       ` Andy Walker
2003-01-05 15:47 Robert Dewar
2003-01-05 22:14 ` Tom Lord
2003-01-06 20:59 Robert Dewar
2003-01-07  5:29 ` Andy Walker
2003-01-07 21:49   ` Marcel Cox
2003-01-07 21:55     ` Branko Čibej
2003-01-07 21:55       ` Marcel Cox
2003-01-08 17:32 ` Tom Lord
2003-01-07 12:08 Robert Dewar
2003-01-07 12:10 ` Momchil Velikov
2003-01-07 12:32 Robert Dewar
2003-01-07 19:03 ` tm_gccmail
2003-01-07 19:20   ` tm_gccmail
2003-01-08  7:52     ` Andy Walker
2003-01-08 19:29       ` Michael S. Zick
2003-01-08 20:10         ` Michael S. Zick
2003-01-08 20:44         ` tm_gccmail
2003-01-08 21:34           ` Michael S. Zick
2003-01-08 22:05             ` tm_gccmail
2003-01-08  6:08 ` Andy Walker
2003-01-07 15:17 Robert Dewar
2003-01-07 17:02 ` Michael S. Zick
2003-01-08  6:56   ` Andy Walker
2003-01-08 12:14     ` Michael S. Zick
2003-01-07 17:19 Robert Dewar
2003-01-07 21:01 Marcel Cox
2003-01-07 22:53 ` tm_gccmail
2003-01-08  1:05   ` tm_gccmail
2003-01-08  1:22   ` tm_gccmail
2003-01-08 11:45   ` Marcel Cox
2003-01-08 17:29   ` Marcel Cox
2003-01-08  5:36 Robert Dewar
2003-01-08 12:13 Robert Dewar
2003-01-08 12:21 ` Lars Segerlund
2003-01-08 12:27 Robert Dewar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).