public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* register allocation versus scheduling
@ 2003-01-06  2:52 Brad Lucier
  2003-01-06  4:08 ` Daniel Berlin
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Brad Lucier @ 2003-01-06  2:52 UTC (permalink / raw)
  To: gcc; +Cc: Brad Lucier, feeley, matz

I've been playing around with -fnew-ra, -fno-trapping-math, and various
schedule options for 3.4 on powerpc-darwin.

For a molecular energy minization code, where an affine transformation
is applied consecutively to the location of each atom in the molecule, 

-O1 -fno-trapping-math -fschedule-insns2 -fnew-ra -mcpu=7400

works very well, the code is a bunch of overlapped loads, stores,
and floating-point operations, but

-O1 -fno-trapping-math -fschedule-insns -fschedule-insns2 -fnew-ra -mcpu=7400

which also schedules *before* register allocation is 50% slower, since
the schedule pass before hard register allocation loads *all* the x-y-z
information for all the atoms into pseudo-registers at the top of the routine,
and requires many moves between the stack and registers when these values
are actually needed for computations.

Perhaps, since -fno-trapping-math is a relatively new option, this is
a recent concern.

I've heard people on these lists talk about making scheduling smarter, so
it knows something about register pressure.  It seems that the general
solution would be to do register allocation and scheduling together.

Since we will have a new register allocator for 3.4 that is based on
graph coloring, perhaps one could add a new flag

--fuse-all-registers

that would not try to use the *minimum* number of colors to color an
interference graph, but, if the minimum number is less than the actual
number of available registers, it could use the actual numbers of
registers to color the graph, perhaps guided by liveness information.
Optimally, in the example I gave above, it should be quite possible to
load the location information of *two* atoms at the beginning of the
routine, and always be loading the location information for the next
atom while the current one is being processed.  This isn't done now,
because the same hard registers are assigned to the location information,
even though the full number of 32 FP registers are not used by the register
allocator.

This would at least generate better code with only the post-register-allocation
scheduling pass for naive source code of the form:

read data 1
operate on data 1
write answer 1
read data 2
operate on data 2
write answer 2
etc.

but wouldn't really hurt either where someone tries to write explicitly

read data 1
read data 2
operate on data 1
read data 3
write answer 1
operate on data 2
etc.

to try to get some instruction overlap.

Brad

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: register allocation versus scheduling
  2003-01-06  2:52 register allocation versus scheduling Brad Lucier
@ 2003-01-06  4:08 ` Daniel Berlin
  2003-01-06  4:08 ` Falk Hueffner
  2003-01-06 23:06 ` Geoff Keating
  2 siblings, 0 replies; 4+ messages in thread
From: Daniel Berlin @ 2003-01-06  4:08 UTC (permalink / raw)
  To: Brad Lucier; +Cc: gcc, feeley, matz

>
> that would not try to use the *minimum* number of colors to color an
> interference graph, but, if the minimum number is less than the actual
> number of available registers, it could use the actual numbers of
> registers to color the graph, perhaps guided by liveness information.

It does.
It keeps track of what set of registers are usable for a given web. It
doesn't care if it has to use all the registers, it uses whatever
registers it needs to use (be it all, the minimum number, or none) that it
*can* use.

The problem is that the information the reg allocator works with is too
conservative, so it is told it can't use certain regs in certain cases
when it can.  It won't allocate to things that would violate the register
class contraints.  That's why you generally see it not use all registers.
Because some web is determined to have a usable set of registers below
"all registers".

It also isn't completely bitwidth aware as it might be (though
there's actually a paper on this in POPL '03 entitled "Bitwidth aware global
register allocation")

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: register allocation versus scheduling
  2003-01-06  2:52 register allocation versus scheduling Brad Lucier
  2003-01-06  4:08 ` Daniel Berlin
@ 2003-01-06  4:08 ` Falk Hueffner
  2003-01-06 23:06 ` Geoff Keating
  2 siblings, 0 replies; 4+ messages in thread
From: Falk Hueffner @ 2003-01-06  4:08 UTC (permalink / raw)
  To: Brad Lucier; +Cc: gcc, feeley, matz

Brad Lucier <lucier@math.purdue.edu> writes:

> -O1 -fno-trapping-math -fschedule-insns -fschedule-insns2 -fnew-ra -mcpu=7400
> 
> which also schedules *before* register allocation is 50% slower,
> since the schedule pass before hard register allocation loads *all*
> the x-y-z information for all the atoms into pseudo-registers at the
> top of the routine, and requires many moves between the stack and
> registers when these values are actually needed for computations.
> 
> Perhaps, since -fno-trapping-math is a relatively new option, this
> is a recent concern.

No, this has been a problem for some time, see
e. g. http://gcc.gnu.org/ml/gcc/2002-07/msg00646.html. The test case
there is similar, many FP loads (more than registers), work on them,
writing back.

-- 
	Falk

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: register allocation versus scheduling
  2003-01-06  2:52 register allocation versus scheduling Brad Lucier
  2003-01-06  4:08 ` Daniel Berlin
  2003-01-06  4:08 ` Falk Hueffner
@ 2003-01-06 23:06 ` Geoff Keating
  2 siblings, 0 replies; 4+ messages in thread
From: Geoff Keating @ 2003-01-06 23:06 UTC (permalink / raw)
  To: Brad Lucier; +Cc: feeley, matz, gcc

Brad Lucier <lucier@math.purdue.edu> writes:

...
> For a molecular energy minization code, where an affine transformation
> is applied consecutively to the location of each atom in the molecule, 
> 
> -O1 -fno-trapping-math -fschedule-insns2 -fnew-ra -mcpu=7400
> 
> works very well, the code is a bunch of overlapped loads, stores,
> and floating-point operations, but
> 
> -O1 -fno-trapping-math -fschedule-insns -fschedule-insns2 -fnew-ra -mcpu=7400
> 
> which also schedules *before* register allocation is 50% slower,
> since the schedule pass before hard register allocation loads *all*
> the x-y-z information for all the atoms into pseudo-registers at the
> top of the routine, and requires many moves between the stack and
> registers when these values are actually needed for computations.

This is very much a known problem, and is not specific to FP code.
I've seen it happen to me when optimising code for the (purely
integer) SHA-1 hash algorithm; normally this fits entirely in the
powerpc register set, but if -fschedule-insns is used it needs twice
as many registers, more than available, and runs much slower.

-- 
- Geoffrey Keating <geoffk@geoffk.org>

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2003-01-06 22:57 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-01-06  2:52 register allocation versus scheduling Brad Lucier
2003-01-06  4:08 ` Daniel Berlin
2003-01-06  4:08 ` Falk Hueffner
2003-01-06 23:06 ` Geoff Keating

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).