Re: Multiplications on Pentium 4

public inbox for gcc@gcc.gnu.org

* Re: Multiplications on Pentium 4
@ 2001-09-08  8:31 dewar
  2001-09-08  9:17 ` Jan Hubicka
  0 siblings, 1 reply; 18+ messages in thread
From: dewar @ 2001-09-08  8:31 UTC (permalink / raw)
  To: jh, pfk; +Cc: gcc

It's amazing how poor the scaling lea's are on the Pentium 4, probably they
should never be generated.

(at least you could take care of this by simply removing them :-)


* Re: Multiplications on Pentium 4
  2001-09-08  8:31 Multiplications on Pentium 4 dewar
@ 2001-09-08  9:17 ` Jan Hubicka
  2001-09-08 17:13   ` Profiling and optimization Frank Klemm
  2001-09-08 19:08   ` long long / long long Frank Klemm
  0 siblings, 2 replies; 18+ messages in thread
From: Jan Hubicka @ 2001-09-08  9:17 UTC (permalink / raw)
  To: dewar; +Cc: jh, pfk, gcc

> It's amazing how poor the scaling lea's are on the Pentium 4, probably they
> should never be generated.
They are no worse than shifts - actually LEA is simply translated to trivial
operations.
It is amazing how bad shifts are :) but replacing them by ADDs is not always
a solution due to trace cache misses.

Honza
> 
> (at least you could take care of this by simply removing them :-)
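
For readers following the lea/shift/add trade-off, here is a minimal C sketch
of three equivalent ways to write a multiply by a small constant.  The function
names are invented for illustration.  Which instruction a given GCC selects for
each form (imul, a scaled lea, a shift plus add, or a chain of adds) depends on
the target tuning, so this only shows the source-level alternatives and is not
taken from the GCC sources.

```c
#include <stdint.h>

/* Plain multiply: the compiler is free to pick imul, lea, shifts or adds. */
uint32_t mul5_plain(uint32_t x)
{
    return x * 5u;
}

/* Shift plus add: x*4 + x, the shape a scaled lea can express directly. */
uint32_t mul5_shift_add(uint32_t x)
{
    return (x << 2) + x;
}

/* Adds only: the form that avoids shifts but needs more instructions. */
uint32_t mul5_adds_only(uint32_t x)
{
    uint32_t twice = x + x;        /* 2x */
    uint32_t four  = twice + twice; /* 4x */
    return four + x;               /* 5x */
}
```

All three agree for every 32-bit input because unsigned arithmetic wraps
modulo 2^32; only the instruction count and code size differ.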


* Profiling and optimization
  2001-09-08  9:17 ` Jan Hubicka
@ 2001-09-08 17:13   ` Frank Klemm
  2001-09-08 19:08   ` long long / long long Frank Klemm
  1 sibling, 0 replies; 18+ messages in thread
From: Frank Klemm @ 2001-09-08 17:13 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

Profiling for size <=> speed optimization:

Note that for deciding between optimization for size or for speed you do not
need the average call frequency, but something like the burst rate:

                        calls
        freq = -------------------------                    [Hz]
               calls
                Sum  timebetweencalls(i)
                i=1

                      1    calls
        burstrate = ----- * Sum  1/timebetweencalls(i)       [Hz]
                    calls   i=1

For equidistant calls both have the same value.
For short bursts they are different.
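
A small worked sketch of the two formulas above, assuming the gaps between
consecutive calls are available as an array of seconds.  The function and
parameter names are hypothetical and not part of any GCC profiling interface.

```c
#include <stddef.h>

/* Average call frequency: number of calls divided by the total elapsed time. */
double avg_frequency(const double *gaps, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += gaps[i];
    return (double)n / total;
}

/* Burst rate: mean of the instantaneous rates 1/gap, so short gaps dominate. */
double burst_rate(const double *gaps, size_t n)
{
    double sum_rates = 0.0;
    for (size_t i = 0; i < n; i++)
        sum_rates += 1.0 / gaps[i];
    return sum_rates / (double)n;
}
```

With four gaps of 0.5 s each, both functions return 2 Hz.  With gaps of
0.125, 0.125, 0.125 and 3.625 s the average frequency is 1 Hz while the burst
rate is about 6 Hz, which is the difference the note is pointing at.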

-- 
Frank Klemm


* long long / long long
  2001-09-08  9:17 ` Jan Hubicka
  2001-09-08 17:13   ` Profiling and optimization Frank Klemm
@ 2001-09-08 19:08   ` Frank Klemm
  2001-09-09 21:53     ` Joe Buck
  1 sibling, 1 reply; 18+ messages in thread
From: Frank Klemm @ 2001-09-08 19:08 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

---- Code ----------------------------------------------------

.text
.type   __divdi3,@function
.global __divdi3

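__divdi3:
        fildll  12(%esp)        # push divisor (second argument)
        fildll   4(%esp)        # push dividend (first argument)
        subl    $12,%esp        # scratch: two control words + 64-bit result
        movl    %esp,%ecx
        movw    $0x0C00,%ax     # rounding-control bits: truncate toward zero
        fnstcw  (%ecx)          # save the current control word
        orw     0(%ecx),%ax     # merge the truncate bits into it
        movw    %ax,2(%ecx)
        fldcw   2(%ecx)         # switch to truncation
        fdivp                   # dividend / divisor
        fistpll 4(%ecx)         # store the quotient as a 64-bit integer
        fldcw   0(%ecx)         # restore the original control word
        movl    4(%esp),%eax    # low half of the result
        movl    8(%esp),%edx    # high half of the result
        addl    $12,%esp
        ret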
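
For comparison, the same idea can be sketched in portable C.  This is not the
posted routine: the function name is made up, and it assumes that long double
is the x87 extended format with a 64-bit mantissa, so that every long long
converts exactly.  Where that does not hold, the sketch falls back to the
ordinary integer division.

```c
#include <float.h>

/* Hypothetical helper: signed 64-bit division carried out in long double. */
long long div64_via_long_double(long long a, long long b)
{
#if LDBL_MANT_DIG >= 64
    /* Both operands are representable exactly.  The cast back to
       long long truncates toward zero, which is what the 0x0C00
       rounding-control setting achieves in the assembly version. */
    return (long long)((long double)a / (long double)b);
#else
    /* long double is too narrow on this target: use the integer divide. */
    return a / b;
#endif
}
```

As with the integer operator, b == 0 and LLONG_MIN / -1 are not handled here.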



---- "Benchmark": Duration of a loop of --------------------------

    long long  x [1000];
    long long  y [1000];
    long long  s = 0;
    int        i;

    for (i = 0; i < 1000; i++)
        s += x[i] / y[i];


---- results ---------------------------------------------------- 
Old routine on Athlon:
	106 clocks including the outer loop and storing the arguments on the stack.

This routine on Athlon:
	79 clocks including the outer loop and storing the arguments on the stack.

  + shorter
  + can be inlined
  + sometimes the rounding control switch can be avoided by moving it outside a loop
  + faster for a lot of data
  - slower for trivial data (?)
  - does not work with SSE2 (needs 63 or 64 bit mantissa)

---- optimization -----------------------------------------------
This routine on Athlon after inlining and moving fstcw/fldcw outside the loop:
	21 clocks including the outer loop


Interested? Or is 64-bit arithmetic uninteresting for benchmarks?

-- 
Frank Klemm


Still remaining:
	long long % long long
	long long / long
	long long % long
	long long / const
	long long % const



* Re: long long / long long
  2001-09-08 19:08   ` long long / long long Frank Klemm
@ 2001-09-09 21:53     ` Joe Buck
  2001-09-09 22:21       ` John S. Dyson
                         ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Joe Buck @ 2001-09-09 21:53 UTC (permalink / raw)
  To: Frank Klemm; +Cc: Jan Hubicka, gcc

Frank Klemm writes:
[ improved long long code sequences]
> 
> Interested? Or are 64 bit are uninteresting for benchmarks?

Well, the Linux kernel developers found that they couldn't let gcc
do long long arithmetic because it did such a poor job, so they do
it in assembly or in C on pairs of 32 bit values instead.  So at
least some folks probably wouldn't mind seeing an improvement.


* Re: long long / long long
  2001-09-09 21:53     ` Joe Buck
@ 2001-09-09 22:21       ` John S. Dyson
  2001-09-10  6:20       ` Bernd Schmidt
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: John S. Dyson @ 2001-09-09 22:21 UTC (permalink / raw)
  To: Joe Buck, Frank Klemm; +Cc: Jan Hubicka, gcc

----- Original Message ----- 
From: "Joe Buck" <jbuck@synopsys.COM>
To: "Frank Klemm" <pfk@fuchs.offl.uni-jena.de>
Cc: "Jan Hubicka" <jh@suse.cz>; <gcc@gcc.gnu.org>
Sent: Sunday, September 09, 2001 11:51 PM
Subject: Re: long long / long long


> Frank Klemm writes:
> [ improved long long code sequences]
> > 
> > Interested? Or are 64 bit are uninteresting for benchmarks?
> 
> Well, the Linux kernel developers found that they couldn't let gcc
> do long long arithmetic because it did such a poor job, so they do
> it in assembly or in C on pairs of 32 bit values instead.  So at
> least some folks probably wouldn't mind seeing an improvement.
> 
When I was writing and rewriting parts of the FreeBSD code, I was
VERY careful to limit the 64bit long longs to exactly what was needed.

In fact, some consideration in data structure design was made to avoid
the use of 64bit offsets/addresses (for mapping files, etc).

John


* Re: long long / long long
  2001-09-09 21:53     ` Joe Buck
  2001-09-09 22:21       ` John S. Dyson
@ 2001-09-10  6:20       ` Bernd Schmidt
  2001-09-10 12:47         ` Michael Matz
  2001-09-10 10:21       ` Linus Torvalds
  2001-09-10 12:18       ` Florian Weimer
  3 siblings, 1 reply; 18+ messages in thread
From: Bernd Schmidt @ 2001-09-10  6:20 UTC (permalink / raw)
  To: Joe Buck; +Cc: Frank Klemm, Jan Hubicka, gcc

On Sun, 9 Sep 2001, Joe Buck wrote:

> Frank Klemm writes:
> [ improved long long code sequences]
> >
> > Interested? Or are 64 bit are uninteresting for benchmarks?
>
> Well, the Linux kernel developers found that they couldn't let gcc
> do long long arithmetic because it did such a poor job, so they do
> it in assembly or in C on pairs of 32 bit values instead.  So at
> least some folks probably wouldn't mind seeing an improvement.

The main problem probably is the register allocator's requirement that
DImode values be allocated to adjacent registers.  Unfortunately this
is not going to be easy to change.


Bernd


* Re: long long / long long
  2001-09-10  6:20       ` Bernd Schmidt
@ 2001-09-10 12:47         ` Michael Matz
  2001-09-10 19:55           ` Hans-Peter Nilsson
  2001-09-11  2:26           ` Jan Hubicka
  0 siblings, 2 replies; 18+ messages in thread
From: Michael Matz @ 2001-09-10 12:47 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc

Hi,

On Mon, 10 Sep 2001, Bernd Schmidt wrote:
> > Well, the Linux kernel developers found that they couldn't let gcc
> > do long long arithmetic because it did such a poor job, so they do
> > it in assembly or in C on pairs of 32 bit values instead.  So at
> > least some folks probably wouldn't mind seeing an improvement.
>
> The main problem probably is the register allocator's requirement that
> DImode values be allocated to adjacent registers.

Yes, that requirement creates many constraints.

> Unfortunately this is not going to be easy to change.

The sad thing is that it _is_ easy to change in the allocator, and in
fact would make the algorithm simpler (I'm talking only about the
new-regalloc) and the graph easier to color.  The thing which horrifies
me is the encoding of that requirement in the different machine
descriptions.  A first step would be to define a new rtx code MREG
("multi" reg), which can possibly contain a set of (disjoint) REG
or SUBREG expressions, including the then necessary handling of multi-reg
moves (with cycle breaking).  The occurrences of those MREG rtx's could
probably be limited to a few passes around the allocator.  Unfortunately,
all .md files would nevertheless need a good overhaul.  If only we had had
such a multi-reg rtx from the beginning ;-|
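
To make "multi-reg moves (with cycle breaking)" concrete, here is a minimal
sketch of turning a set of parallel register moves into a safe sequence.  It
is not code from the new-regalloc: the types, names and the single scratch
register are assumptions for illustration, and it expects all destinations to
be distinct and the scratch register to be otherwise unused.

```c
#include <stddef.h>

#define MAX_MOVES 8

enum move_state { TO_MOVE, BEING_MOVED, MOVED };

struct move { int dst, src; };            /* register numbers: dst <- src */

struct move_seq {
    struct move out[2 * MAX_MOVES];       /* room for one scratch move per input */
    size_t len;
};

static void emit(struct move_seq *s, int dst, int src)
{
    s->out[s->len].dst = dst;
    s->out[s->len].src = src;
    s->len++;
}

static void move_one(struct move *m, enum move_state *st, size_t n,
                     size_t i, int scratch, struct move_seq *s)
{
    if (m[i].src == m[i].dst) {           /* self-move: nothing to emit */
        st[i] = MOVED;
        return;
    }
    st[i] = BEING_MOVED;
    for (size_t j = 0; j < n; j++) {
        if (m[j].src != m[i].dst)
            continue;                     /* move j does not read our destination */
        if (st[j] == TO_MOVE) {
            move_one(m, st, n, j, scratch, s);  /* the reader must go first */
        } else if (st[j] == BEING_MOVED) {
            emit(s, scratch, m[j].src);   /* cycle found: park the value */
            m[j].src = scratch;
        }
    }
    emit(s, m[i].dst, m[i].src);
    st[i] = MOVED;
}

/* Returns the number of sequential moves written to *s (0 if n is too big). */
size_t sequentialize_moves(const struct move *par, size_t n,
                           int scratch, struct move_seq *s)
{
    struct move m[MAX_MOVES];
    enum move_state st[MAX_MOVES];

    s->len = 0;
    if (n > MAX_MOVES)
        return 0;
    for (size_t i = 0; i < n; i++) {
        m[i] = par[i];
        st[i] = TO_MOVE;
    }
    for (size_t i = 0; i < n; i++)
        if (st[i] == TO_MOVE)
            move_one(m, st, n, i, scratch, s);
    return s->len;
}

/* Executes the sequence on a simulated register file. */
void run_moves(const struct move_seq *s, int *regs)
{
    for (size_t k = 0; k < s->len; k++)
        regs[s->out[k].dst] = regs[s->out[k].src];
}
```

For a swap of registers 0 and 1 with scratch register 7 the result is three
moves: r7 <- r1, r1 <- r0, r0 <- r7.  On a target with an exchange instruction
the scratch register could be avoided, which this sketch does not attempt.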

Ciao,
Michael.


* Re: long long / long long
  2001-09-10 12:47         ` Michael Matz
@ 2001-09-10 19:55           ` Hans-Peter Nilsson
  2001-09-11  2:26           ` Jan Hubicka
  1 sibling, 0 replies; 18+ messages in thread
From: Hans-Peter Nilsson @ 2001-09-10 19:55 UTC (permalink / raw)
  To: Michael Matz; +Cc: Bernd Schmidt, gcc

On Mon, 10 Sep 2001, Michael Matz wrote:
[About (non-)adjacency of register parts in a multi-word
register value]
> The sad thing is, that it _is_ easy to change in the allocator, and in
> fact would make the algorithm simpler (I'm talking only about the
> new-regalloc) and the graph easier colorable.  The thing which horrifies
> me is the encoding of that requirement in the different machine
> descriptions.  A first step would be to define a new rtx code MREG
> ("multi" reg), which can possibly contain a set of (disjoint) REG
> or SUBREG expressions, including the then necessary handling of multi-reg
> moves (with cycle breaking).  The occurences of those MREG rtx's could
> probably be limited to few passes around the allocator.  Unfortunately
> nevertheless all .md files would need a good overhaul.  If we only had
> such a multi-reg rtx from the beginning ;-|

Couldn't you use a PARALLEL for the multiword-parts there, so
you don't need to dream up a new rtx thingy?  Perhaps assuming
"natural" layout, (element 0 being the least significant for a
little-endian target etc.).  The PARALLEL would carry the size
of the composite mode, equal to the sum of all registers:
(set (mem:DI (reg:SI 0))
     (minus:DI (parallel:DI [(reg:SI 3) (reg:SI 7)])
               (const_int 1)))
or maybe just expand the semantics of CONCAT to hold more than
two items.

Alternatively the elements would carry subreg-byte indicators as
with the parallels for FUNCTION_ARG and FUNCTION_VALUE, so you
can carry smaller-than-register parts in multiple registers if
needed.

brgds, H-P


* Re: long long / long long
  2001-09-10 12:47         ` Michael Matz
  2001-09-10 19:55           ` Hans-Peter Nilsson
@ 2001-09-11  2:26           ` Jan Hubicka
  1 sibling, 0 replies; 18+ messages in thread
From: Jan Hubicka @ 2001-09-11  2:26 UTC (permalink / raw)
  To: Michael Matz; +Cc: Bernd Schmidt, gcc

> Hi,
> 
> On Mon, 10 Sep 2001, Bernd Schmidt wrote:
> > > Well, the Linux kernel developers found that they couldn't let gcc
> > > do long long arithmetic because it did such a poor job, so they do
> > > it in assembly or in C on pairs of 32 bit values instead.  So at
> > > least some folks probably wouldn't mind seeing an improvement.
> >
> > The main problem probably is the register allocator's requirement that
> > DImode values be allocated to adjacent registers.
> 
> Yes, that requirement creates many constraints.
> 
> > Unfortunately this is not going to be easy to change.
> 
> The sad thing is, that it _is_ easy to change in the allocator, and in
> fact would make the algorithm simpler (I'm talking only about the
> new-regalloc) and the graph easier colorable.  The thing which horrifies
> me is the encoding of that requirement in the different machine
> descriptions.  A first step would be to define a new rtx code MREG
> ("multi" reg), which can possibly contain a set of (disjoint) REG
We sometimes represent such an MREG as a PARALLEL of multiple REGs, and we also
have the CONCAT expression.  I think these should be used to represent register
pairs too.

All you need is to teach register_operand and simplify_subreg about it.  Then
fix all those zillions of side effects the change will have (not so extremely
many that I can think of - flow analysis and register renaming probably; the
scheduler should be happy).  But with this as an option controllable by a target
macro (note that on i386, the 64bit multiply instruction requires the registers
in order), it can be tractable.  We will just convert those machine descriptions
where it makes sense.  There are still many where the pairs must be consecutive,
as the hardware dictates that.

Honza
> or SUBREG expressions, including the then necessary handling of multi-reg
> moves (with cycle breaking).  The occurences of those MREG rtx's could
> probably be limited to few passes around the allocator.  Unfortunately
> nevertheless all .md files would need a good overhaul.  If we only had
> such a multi-reg rtx from the beginning ;-|
> 
> 
> Ciao,
> Michael.


* Re: long long / long long
  2001-09-09 21:53     ` Joe Buck
  2001-09-09 22:21       ` John S. Dyson
  2001-09-10  6:20       ` Bernd Schmidt
@ 2001-09-10 10:21       ` Linus Torvalds
  2001-09-10 10:40         ` David Edelsohn
  2001-09-11  2:20         ` Jan Hubicka
  2001-09-10 12:18       ` Florian Weimer
  3 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2001-09-10 10:21 UTC (permalink / raw)
  To: jbuck, gcc

In article < 200109100451.VAA25485@racerx.synopsys.com >,
Joe Buck  <jbuck@synopsys.COM> wrote:
>Frank Klemm writes:
>[ improved long long code sequences]
>> 
>> Interested? Or are 64 bit are uninteresting for benchmarks?
>
>Well, the Linux kernel developers found that they couldn't let gcc
>do long long arithmetic because it did such a poor job, so they do
>it in assembly or in C on pairs of 32 bit values instead.  So at
>least some folks probably wouldn't mind seeing an improvement.

Well, the linux kernel people would also scream very loudly if the
compiler started using floating point for integer divides (Linux uses
-fno-fp-regs on architectures where it is needed/supported, but x86
doesn't even _have_ that flag right now).  In the kernel, we do NOT want
to pollute the (big) FP state, as the kernel doesn't want to
save/restore it all the time. 

Also, in the kernel we avoid things like "long long" divisions like the
plague anyway.  It's going to be slow however you do it, and there's
almost never any reason to do it at all.  It's somewhat more common to
have a 64/32->64 division, and Linux does that in inline assembly for
the (still fairly rare) cases that need it. 
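
The 64/32->64 case can be sketched in plain C as two chained divisions.  This
is an illustration of the arithmetic only, not the kernel's inline assembly;
the function name and the out-parameter for the remainder are assumptions.

```c
#include <stdint.h>

/* Divides a 64-bit value by a non-zero 32-bit divisor using two steps,
   each of which has a quotient that fits in 32 bits. */
uint64_t div_64_by_32(uint64_t n, uint32_t d, uint32_t *rem)
{
    uint32_t hi = (uint32_t)(n >> 32);
    uint32_t lo = (uint32_t)n;

    uint32_t q_hi = hi / d;                       /* high word of the quotient */
    uint64_t rest = ((uint64_t)(hi % d) << 32) | lo;

    /* hi % d < d, so rest / d < 2^32: on x86 this maps to a single divl. */
    uint32_t q_lo = (uint32_t)(rest / d);

    *rem = (uint32_t)(rest % d);
    return ((uint64_t)q_hi << 32) | q_lo;
}
```

The second step is the only one that needs a 64-bit dividend, and its quotient
cannot overflow 32 bits, which is why the operation is cheap compared with a
full 64/64 division.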

However, at the same time 64-bit ops _are_ getting more and more common,
simply because 32 bits are starting to be a big limitation in things
like disk block numbers (verily, 2 terabytes isn't as big a number as it
used to be, and 32-bit sector offsets are starting to get tight). 
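
The 2 terabyte figure follows from 32-bit sector numbers if one assumes the
usual 512-byte sector, an assumption not spelled out in the text.  A tiny
sketch of the arithmetic, with a hypothetical helper name:

```c
#include <stdint.h>

/* Bytes addressable with index_bits-wide sector numbers (index_bits < 64). */
uint64_t max_addressable_bytes(unsigned index_bits, uint64_t sector_size)
{
    return ((uint64_t)1 << index_bits) * sector_size;
}
```

With index_bits = 32 and sector_size = 512 this gives 2^41 bytes, which is
2 terabytes in binary units.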

So..

If gcc developers start looking at double-integer 64-bit things, the
highest priority by far should be making the _simple_ operations and the
spilling faster.  The code generated for many simple 64-bit ops is
horrible because gcc has a very strict notion of what a 64-bit entity is
on a 32-bit architecture.  And that notion doesn't always make much
sense. 

For example, gcc seems to be unable to think of a 64-bit entity as two
almost-independent 32-bit parts, and does some strange register
allocation (I _think_ gcc can't mix and match registers - it seems to
always use fixed pairings (ie eax:edx and ecx:ebx)).

See this trivial example to see what I'm talking about:

	unsigned long long a;

	int main(void)
	{
	        a &= ~1ULL;
	}

which really _should_ result in

	main:
		andl $-2,a
		ret

but instead results in

	main:
	        movl    a, %eax
	        andl    $-2, %eax
	        movl    a+4, %edx
	        movl    %eax, a
	        movl    %edx, a+4
	        ret

Notice how gcc loaded the high bits, and stored them again unchanged. 
Stupid.  Also note how gcc did _not_ use the immediate-to-memory format,
even though you'll see it do so if "a" had been just a regular 32-bit
entity... 

It would be much better to actually split up the 64-bit operations into
32-bit operations at a VERY early stage, and then allow them to be
optimized as regular 32-bit operations. So

	a &= ~1ULL;

should be split up early to

	a.high &= ~0UL;
	a.low &= ~1UL;

and then it is trivially simple to notice that the first operation is a
no-op, and the second operation is a perfectly normal one that gcc is
well able to optimize to the proper result (for normal 32-bit ops, gcc
does NOT generate the above stupid "load + op + store", but generates a
simple immediate "andl" to memory). 
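
A minimal sketch of that early split written out by hand in C, assuming a
little struct for the two halves.  The type and function names are invented
for illustration; gcc's internal representation is RTL, not a C struct.

```c
#include <stdint.h>

struct di_pair { uint32_t low, high; };   /* a 64-bit value as two SI halves */

struct di_pair di_split(uint64_t v)
{
    struct di_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

uint64_t di_join(struct di_pair p)
{
    return ((uint64_t)p.high << 32) | p.low;
}

/* 64-bit AND as two independent 32-bit ANDs. */
struct di_pair di_and(struct di_pair a, uint64_t mask)
{
    struct di_pair m = di_split(mask);
    a.high &= m.high;   /* for mask == ~1ULL this is "&= ~0UL": a no-op */
    a.low  &= m.low;    /* for mask == ~1ULL this is "&= ~1UL" */
    return a;
}
```

Once the operation is in this shape, an ordinary 32-bit optimizer can drop the
high-half AND with all ones and treat the low-half AND like any other 32-bit
operation.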

Yes, doing the above kind of splitting would mean that gcc _has_ to
understand about the carry flag in eflags, and would obviously require
creating a few new required patterns inside gcc (ie patterns for
"addsi3_c" and "addcsi3" etc to teach gcc about adds that generate carry
and adds that use carry).
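
What the two kinds of add would compute can be sketched in C, with the carry
modelled as an explicit value instead of the eflags bit.  The struct and
function names below are assumptions for illustration and do not correspond to
existing gcc patterns.

```c
#include <stdint.h>

struct di_pair { uint32_t low, high; };   /* a 64-bit value as two SI halves */

/* Add that generates a carry: returns the 32-bit sum, sets *carry to 0 or 1. */
uint32_t add32_carry_out(uint32_t a, uint32_t b, uint32_t *carry)
{
    uint32_t sum = a + b;
    *carry = sum < a;              /* wrapped around, so a carry was produced */
    return sum;
}

/* Add that uses a carry: a + b + carry, like the x86 adc instruction. */
uint32_t add32_carry_in(uint32_t a, uint32_t b, uint32_t carry)
{
    return a + b + carry;
}

/* 64-bit add built from the two 32-bit pieces above. */
struct di_pair di_add(struct di_pair a, struct di_pair b)
{
    struct di_pair r;
    uint32_t carry;
    r.low  = add32_carry_out(a.low, b.low, &carry);
    r.high = add32_carry_in(a.high, b.high, carry);
    return r;
}
```

The only link between the two halves is the carry, which is why the halves
can otherwise be allocated and spilled independently.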

But in return you'd get better code generation, and you could kill some
of the existing patterns (ie "adddi3" should just _go_away_).

There are very few cases where you don't want to think of DI as just
2*SI, I suspect.  So doing the split early would probably result in
uglier RTL ("what the heck is this code doing") but better code by
allowing it to spill just one half of the DI, for example.

(Note: you might be able to do part of this by defining adddi3 to be a
define_expand instead of a define_insn, but I think that would still be
late enough that all the optimization passes would not be able to work
with it as well as they should.  It would be a much smaller change,
though, and maybe it might make most of the bletcherousness go away). 

		Linus


* Re: long long / long long
  2001-09-10 10:21       ` Linus Torvalds
@ 2001-09-10 10:40         ` David Edelsohn
  2001-09-11  2:21           ` Jan Hubicka
  2001-09-11  2:20         ` Jan Hubicka
  1 sibling, 1 reply; 18+ messages in thread
From: David Edelsohn @ 2001-09-10 10:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jbuck, gcc

	Some of the problems have to do with value range propagation
optimizations which GCC currently does not have.  The VRP patch under
development is far too over-engineered, IMHO.

	What matters is whether the register will have a zero or non-zero
value.  Or at best, whether the subreg bytes will be zero or not.  Knowing
that the value will be between 33 and 34066415 really doesn't matter at
that level of detail.  For DImode as a pair of 32-bit registers, knowing
that one register will be zero allows a lot of simplifications to be
propagated throughout the computation.
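
	As a rough illustration of tracking only "zero or not" per 32-bit
half, the sketch below propagates two flags through a few operations.  It is
a toy model with invented names, not the VRP patch and not what combine does.

```c
#include <stdbool.h>

/* true means the half is known to be zero; false means nothing is known. */
struct di_zero { bool low_zero, high_zero; };

/* A 32-bit value zero-extended to 64 bits: the high half is known zero. */
struct di_zero di_zero_extend32(void)
{
    struct di_zero r = { false, true };
    return r;
}

/* No information about either half. */
struct di_zero di_unknown(void)
{
    struct di_zero r = { false, false };
    return r;
}

/* AND: a half is zero if it is zero in either operand. */
struct di_zero di_zero_and(struct di_zero a, struct di_zero b)
{
    struct di_zero r = { a.low_zero || b.low_zero, a.high_zero || b.high_zero };
    return r;
}

/* OR: a half is zero only if it is zero in both operands. */
struct di_zero di_zero_or(struct di_zero a, struct di_zero b)
{
    struct di_zero r = { a.low_zero && b.low_zero, a.high_zero && b.high_zero };
    return r;
}

/* Shift left by 32: the low half becomes zero, the old low half moves up. */
struct di_zero di_zero_shl32(struct di_zero a)
{
    struct di_zero r = { true, a.low_zero };
    return r;
}
```

Whenever a flag is true, the operation on that half can be replaced by a
constant or dropped, which is the kind of simplification described above.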

David


* Re: long long / long long
  2001-09-10 10:40         ` David Edelsohn
@ 2001-09-11  2:21           ` Jan Hubicka
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Hubicka @ 2001-09-11  2:21 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Linus Torvalds, jbuck, gcc

> 	Some of the problems have to do with value range propagation
> optimizations which GCC currently does not have.  The VRP patch under
> development is far too over-engineered, IMHO.
> 
> 	What matters is whether the register will have a zero or non-zero
> value.  Or at best, whether the subreg bytes will be zero or not.  Knowing
> that the value will be between 33 and 34066415 really doesn't matter at
> that level of detail.  For DImode as a pair of 32-bit registers, knowing
> that one register will be zero allows a lot of simplifications to be
> propagated throughout the computation.

We do have such VRP propagation in combine, but it is far too under-engineered
to work well, so I believe John's VRP patch is a win, if it ever gets in.

The problem is that we don't expose the SUBREGs to the compiler early enough
to make the simplification possible.

Honza
> 
> David


* Re: long long / long long
  2001-09-10 10:21       ` Linus Torvalds
  2001-09-10 10:40         ` David Edelsohn
@ 2001-09-11  2:20         ` Jan Hubicka
  2001-09-11  8:06           ` Linus Torvalds
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Hubicka @ 2001-09-11  2:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jbuck, gcc

> If gcc developers start looking at double-integer 64-bit things, the
> highest priority by far should be making the _simple_ operations and the
> spilling faster.  The code generated for many simple 64-bit ops is
> horrible because gcc has a very strict notion of what a 64-bit entity is
> on a 32-bit architecture.  And that notion doesn't always make much
> sense. 
> 
> For example, gcc seems to be unable to think of a 64-bit entity as two
> almost-independent 32-bit parts, and does some strange register
> allocation (I _think_ gcc can't mix and match registers - it seems to
> always use fixed pairings (ie eax:edx and ecx:ebx).

Actually the problem is that internally GCC represents the whole 64bit quantity
as a single register.  This means that the register allocator will assign it
two consecutive 32bit registers.

This is a big problem to avoid with the current design, but if Jeff's midlevel
RTL takes its place, I believe we can win by simply splitting the 64bit
quantities early and representing them as two 32bit registers in the lowlevel
RTL chain.

This kills the possibility of using SSE/i387 registers for 64bit integer
operations, which is unfortunate too...

I don't see any other sensible solution to the problem.  Even if we teach the
register allocator that the 32bit registers don't have to be consecutive,
we will still load unneeded parts of 64bit values into registers etc. etc.

Honza


* Re: long long / long long
  2001-09-11  2:20         ` Jan Hubicka
@ 2001-09-11  8:06           ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2001-09-11  8:06 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: jbuck, gcc

On Tue, 11 Sep 2001, Jan Hubicka wrote:
>
> Actually the problem is that internall GCC represent whole 64bit quantity
> as single register.  This means that register allocator will assign it
> two constuctive 32bit registers.
>
> This is big problem to avoid with current design, but if Jeff's midlevel
> RTL takes it's place, I believe we can win by simply splitting the 64bit
> quantities early and represent them as two 32bit registers in the lowlevel
> RTL chain.

That sounds good.

There are probably many tree-level optimizations that _should_ work on a
"single register" level (it gets really hard to do some simple
optimizations after the split), so it sounds eminently sensible to wait
for the intermediate tree form and do the split after that phase.

> This kills posibility of using SSE/i387 registers for 64bit integer operations
> that is unfortunate too...

I don't think it "kills" it. It just means that the SSE conversion should
(if appropriate) be done at a tree level - which I think is the right
thing anyway. There is _no_ point in doing SSE conversions on an
operation-by-operation basis - the costs of converting to and from SSE are
too big. So SSE conversion should be done at a higher level, so that you
either do _all_ operations in a chain in SSE, or you do none.

> I don't see other sensible solution to the problem.  Even if we teach
> register allocation, that the 32bit registers don't have to be consetuctive
> still we will load unneded parts of 64bit values to registers etc. etc.

Absolutely. I think splitting the 64-bit ops up fairly early (ie after
high-level tree optimizations, but long before any low-level RTL has been
worked on) is the only way to get sane code generation on 32-bit hosts.

Of course, one potentially valid approach is to just say "32-bit targets
are going away", but historically it has taken a _looong_ time for new
architectures to get dominant (in fact, historically they never _did_
become dominant at all ;), and I suspect that some day gcc will want to
have at least the option to make "long long" be 128 bits on 64-bit hosts.

In which case having the split infrastructure will be useful even in the
long run.

		Linus


* Re: long long / long long
  2001-09-09 21:53     ` Joe Buck
                         ` (2 preceding siblings ...)
  2001-09-10 10:21       ` Linus Torvalds
@ 2001-09-10 12:18       ` Florian Weimer
  3 siblings, 0 replies; 18+ messages in thread
From: Florian Weimer @ 2001-09-10 12:18 UTC (permalink / raw)
  To: gcc

Joe Buck <jbuck@synopsys.COM> writes:

> Well, the Linux kernel developers found that they couldn't let gcc
> do long long arithmetic because it did such a poor job, so they do
> it in assembly or in C on pairs of 32 bit values instead.  So at
> least some folks probably wouldn't mind seeing an improvement.

Anything up to and including GCC 2.95 is probably quite broken with
regard to 64 bit support on x86, performancewise.  For example, if you
pass a 64 bit constant in a function call, the constant is put into
some data segment and is copied to the stack from there.  IIRC, the
generated code is not even shorter if the constant is used multiple
times.

The kernel folks still work with ancient GCC (even gcc) versions, and
it is not clear if their experience carries over to newer GCCs.

Of course, Frank's floating point implementation cannot be used in
kernel space at the moment.  AFAIK, floating point is still not
possible there for technical reasons.

-- 
Florian Weimer 	                  Florian.Weimer@RUS.Uni-Stuttgart.DE
University of Stuttgart           http://cert.uni-stuttgart.de/
RUS-CERT                          +49-711-685-5973/fax +49-711-685-5898


* Re: long long / long long
@ 2001-09-10 10:48 mike stump
  0 siblings, 0 replies; 18+ messages in thread
From: mike stump @ 2001-09-10 10:48 UTC (permalink / raw)
  To: gcc, jbuck, torvalds

> From: Linus Torvalds <torvalds@transmeta.com>
> Date: Mon, 10 Sep 2001 10:16:22 -0700
> To: jbuck@synopsys.COM, gcc@gcc.gnu.org
> Cc: 

> Well, the linux kernel people would also scream very loudly if the
> compiler started using floating point for integer divides (Linux
> uses -fno-fp-regs on architectures where it is needed/supported, but
> x86 doesn't even _have_ that flag right now).  In the kernel, we do
> NOT want to pollute the (big) FP state, as the kernel doesn't want
> to save/restore it all the time.

I'll echo this point as well.  In our OS, we use gcc, and we need the
ability to generate code that avoids the FP resources, as long as
there isn't any user code that appears to use those resources.

So, for example:

  long long a,b;

  main() { a = b; }

should not use FP resources, but:

  double a,b;

  main() { a = b; }

can.  Presently, our best use of gcc has us using -msoft-float for
this purpose, but, that's not quite what we want.

Would be nice if gcc had a target independent way of doing this.


* Re: long long / long long
@ 2001-09-11  5:06 Benedetto Proietti
  0 siblings, 0 replies; 18+ messages in thread
From: Benedetto Proietti @ 2001-09-11  5:06 UTC (permalink / raw)
  To: bernds; +Cc: gcc

On Mon, 10 Sep 2001 21:45:10 +0200 (MET DST) Michael Matz wrote:
>
> Hi,
>
> On Mon, 10 Sep 2001, Bernd Schmidt wrote:
> > > Well, the Linux kernel developers found that they couldn't let gcc
> > > do long long arithmetic because it did such a poor job, so they do
> > > it in assembly or in C on pairs of 32 bit values instead.  So at
> > > least some folks probably wouldn't mind seeing an improvement.
> >
> > The main problem probably is the register allocator's requirement that
> > DImode values be allocated to adjacent registers.
> 
> Yes, that requirement creates many constraints.
> 
> > Unfortunately this is not going to be easy to change.
> 
> The sad thing is, that it _is_ easy to change in the allocator, and in
> fact would make the algorithm simpler (I'm talking only about the
> new-regalloc) and the graph easier colorable.  The thing which horrifies
> me is the encoding of that requirement in the different machine
> descriptions.  A first step would be to define a new rtx code MREG
> ("multi" reg), which can possibly contain a set of (disjoint) REG
> or SUBREG expressions, including the then necessary handling of multi-reg
> moves (with cycle breaking).  The occurences of those MREG rtx's could
> probably be limited to few passes around the allocator.  Unfortunately
> nevertheless all .md files would need a good overhaul.  If we only had
> such a multi-reg rtx from the beginning ;-|
> 

Hi,
in my thesis at university I did something like this.  I called it "REGSET"
(SET of REGisters) instead of MREG, but it sounds like the same thing.
In the .md I added "movblk" patterns like this:

(define_expand "movblk"
[(set (match_operand:BLK 0 "general_operand" "") 
       (match_operand:BLK 1 "general_operand" ""))]
....

(define_insn "hard_movblk_to_regset"
[(set (match_operand:BLK 0 "regset_operand" "") 
      (match_operand:BLK 1 "memory_operand" "m"))]
....

(define_insn "hard_movblk_from_regset"
[(set (match_operand:BLK 0 "memory_operand" "m") 
       (match_operand:BLK 1 "regset_operand" ""))]
....

Maybe a little primitive, but effective.
I also posted a first patch, but not in the *standard* way and not against
the latest release.
I hope to have time to do that soon.
Anyhow, many changes in the gcc code were necessary, some of them a little
hard! ;)
I did not implement attributes or anything else to specify consecutiveness of
registers, nor asm constraints to use with global register variables.
The idea came up as a way to handle memory access to big structures, because
our machine had lots of registers (512 per processor!).
My comment is: it is possible and not so difficult, but you would have to
agree on the syntax and the exact behaviour.

ciao
benedetto



> 
> Ciao,
> Michael.

