RE: floor on i386

* RE: floor on i386
@ 2001-09-12 12:03 Chris Lattner
  2001-09-12 16:16 ` Joe Buck
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Lattner @ 2001-09-12 12:03 UTC (permalink / raw)
  To: Brad Lucier; +Cc: gcc

> If one had tree nodes or RTL nodes to express the sequence

> Save rounding mode
> Change rounding mode to round-down
> Convert to integer
> Restore rounding mode

> then perhaps the rounding mode changes could be moved
> outside of loops where they occur, if there were no floating-point
> instructions in the loop that needed the currently set rounding mode.

Actually you could probably represent this by adding a "hard register"
that effectively is the rounding mode flags.  Doing this seems like a
natural way to allow existing optimizations to get rid of the
redundancies that are killing GCC here... and you wouldn't have to add any
new RTL semantics...

-Chris

http://www.nondot.org/~sabre/os/
http://www.nondot.org/MagicStats/

* Re: floor on i386
  2001-09-12 12:03 floor on i386 Chris Lattner
@ 2001-09-12 16:16 ` Joe Buck
  2001-09-24 10:52   ` Alexandre Oliva
  0 siblings, 1 reply; 13+ messages in thread
From: Joe Buck @ 2001-09-12 16:16 UTC (permalink / raw)
  To: Chris Lattner; +Cc: Brad Lucier, gcc

> > If one had tree nodes or RTL nodes to express the sequence
> 
> > Save rounding mode
> > Change rounding mode to round-down
> > Convert to integer
> > Restore rounding mode
> 
> > then perhaps the rounding mode changes could be moved
> > outside of loops where they occur, if there were no floating-point
> > instructions in the loop that needed the currently set rounding mode.

Chris Lattner writes:
> Actually you could probably represent this by adding a "hard register"
> that effectively is the rounding mode flags.  Doing this seems like a
> natural way to allow existing optimizations to get rid of the
> redundancies that are killing GCC here... and you wouldn't have to add any
> new RTL semantics...

Seems logical, but then I guess all FP instruction descriptions have to be
modified to indicate that they read these registers.  I'm not enough of an
expert to know which approach (extra RTL to represent rounding modes, or
explicit rounding mode registers) is less painful.  But it seems that
since changing the rounding mode registers flush the FP pipeline, it may
not be easy to just treat them as ordinary regs and still get the costs
right for scheduling.

* Re: floor on i386
  2001-09-12 16:16 ` Joe Buck
@ 2001-09-24 10:52   ` Alexandre Oliva
  2001-09-25  6:47     ` Jan Hubicka
  0 siblings, 1 reply; 13+ messages in thread
From: Alexandre Oliva @ 2001-09-24 10:52 UTC (permalink / raw)
  To: Joe Buck; +Cc: Chris Lattner, Brad Lucier, gcc

On Sep 12, 2001, Joe Buck <jbuck@synopsys.COM> wrote:

>> > If one had tree nodes or RTL nodes to express the sequence
>> 
>> > Save rounding mode
>> > Change rounding mode to round-down
>> > Convert to integer
>> > Restore rounding mode
>> 
>> > then perhaps the rounding mode changes could be moved
>> > outside of loops where they occur, if there were no floating-point
>> > instructions in the loop that needed the currently set rounding mode.

> Chris Lattner writes:
>> Actually you could probably represent this by adding a "hard register"
>> that effectively is the rounding mode flags.  Doing this seems like a
>> natural way to allow existing optimizations to get rid of the
>> redundancies that are killing GCC here... and you wouldn't have to add any
>> new RTL semantics...

> Seems logical, but then I guess all FP instruction descriptions have to be
> modified to indicate that they read these registers.  I'm not enough of an
> expert to know which approach (extra RTL to represent rounding modes, or
> explicit rounding mode registers) is less painful.  But it seems that
> since changing the rounding mode registers flush the FP pipeline, it may
> not be easy to just treat them as ordinary regs and still get the costs
> right for scheduling.

It would probably be best to introduce a hard register to indicate the
rounding mode, and use OPTIMIZE_MODE_SWITCHING to do as few mode
changes as possible.  For reference, have a look at the SH4
implementation of floating-point support, that defines an explicit
floating-point control register, mode-switching RTL and USEs that
register in all instructions that depend on the floating-point mode,
indicating in an attribute which mode the register is supposed to be
in.  The difference is that SH4 uses the floating-point control
register to switch between single- and double-precision operations,
that have the same encoding but different behavior depending on the
state of the control register.  Modeling mode switching for purposes
of rounding on x86 should be far simpler.  In fact, I'm not even sure
you'd need the hard register: just define unspec patterns that switch
back and forth and you're done.
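
To illustrate why doing "as few mode changes as possible" matters, here is a
toy model in C. It is only a sketch over a straight-line instruction list,
not GCC's actual mode-switching pass; the enum and function names are
hypothetical. Each instruction either needs a specific rounding mode or does
not care (MODE_ANY).

```c
#include <stddef.h>

/* Rounding mode an instruction needs; MODE_ANY means it does not care. */
enum fp_mode { MODE_ANY, MODE_NEAREST, MODE_DOWN };

/* Naive scheme: every instruction that needs a mode other than the
   entry mode switches to it and switches straight back. */
size_t switches_naive(const enum fp_mode *need, size_t n, enum fp_mode entry)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (need[i] != MODE_ANY && need[i] != entry)
            count += 2;
    return count;
}

/* Lazy scheme: track the mode currently in effect, switch only when an
   instruction needs a different one, and restore the entry mode at the end. */
size_t switches_lazy(const enum fp_mode *need, size_t n, enum fp_mode entry)
{
    enum fp_mode current = entry;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (need[i] != MODE_ANY && need[i] != current) {
            current = need[i];
            count++;
        }
    }
    if (current != entry)
        count++;
    return count;
}
```

For the sequence DOWN, ANY, DOWN, DOWN, NEAREST, DOWN entered in MODE_NEAREST,
the naive scheme needs 8 switches and the lazy one needs 4.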

-- 
Alexandre Oliva   Enjoy Guarana', see http://www.ic.unicamp.br/~oliva/
Red Hat GCC Developer                  aoliva@{cygnus.com, redhat.com}
CS PhD student at IC-Unicamp        oliva@{lsd.ic.unicamp.br, gnu.org}
Free Software Evangelist    *Please* write to mailing lists, not to me

* Re: floor on i386
  2001-09-24 10:52   ` Alexandre Oliva
@ 2001-09-25  6:47     ` Jan Hubicka
  2001-09-25  7:47       ` Brad Lucier
  2001-09-25 12:48       ` Frank Klemm
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Hubicka @ 2001-09-25  6:47 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: Joe Buck, Chris Lattner, Brad Lucier, gcc

> It would probably be best to introduce a hard register to indicate the
> rounding mode, and use OPTIMIZE_MODE_SWITCHING to do as few mode
> changes as possible.  For reference, have a look at the SH4
> implementation of floating-point support, that defines an explicit
> floating-point control register, mode-switching RTL and USEs that

The USEs themselves are a problem: you lose a lot of optimizations then.
The trick can be to lower the code before reload using pre-reload splitting.

A major problem still remains in reload.
If we don't want to get exact IEEE by setting the proper precision before each
mathematical operation (as SH4 does IMO), we will run into problems with spills,
since these can be put in places where the control word is set to some wrong value,
resulting in wrong rounding before storing.

That's the main reason why my original patch didn't contain it.

The problem can be solved by a mode-switching pass after reload, when all spills
are visible: you use the existing pass before reload to compute the control word
values, as these need pseudos, and after reload just insert fldcw/fstcw at
strategic places.

If you insert them at the last optimal position in the code, you will get them
after the lazy code to compute the control word inserted by the pre-reload pass.

As discussed with Timothy, the benefits are relatively small compared to the first
half (computing control word values optimally), as CPUs do have hardware bypass.

> register in all instructions that depend on the floating-point mode,
> indicating in an attribute which mode the register is supposed to be
> in.  The difference is that SH4 uses the floating-point control
> register to switch between single- and double-precision operations,
> that have the same encoding but different behavior depending on the
> state of the control register.  Modeling mode switching for purposes
> of rounding on x86 should be far simpler.  In fact, I'm not even sure
> you'd need the hard register: just define unspec patterns that switch
> back and forth and you're done.
You need a scheduling barrier, but it is a big problem.

Honza
> 
> -- 
> Alexandre Oliva   Enjoy Guarana', see http://www.ic.unicamp.br/~oliva/
> Red Hat GCC Developer                  aoliva@{cygnus.com, redhat.com}
> CS PhD student at IC-Unicamp        oliva@{lsd.ic.unicamp.br, gnu.org}
> Free Software Evangelist    *Please* write to mailing lists, not to me

* Re: floor on i386
  2001-09-25  6:47     ` Jan Hubicka
@ 2001-09-25  7:47       ` Brad Lucier
  2001-09-25  8:00         ` Jan Hubicka
  2001-09-25 12:48       ` Frank Klemm
  1 sibling, 1 reply; 13+ messages in thread
From: Brad Lucier @ 2001-09-25  7:47 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Alexandre Oliva, Joe Buck, Chris Lattner, Brad Lucier, gcc

> A major problem still remains in reload.
> If we don't want to get exact IEEE by setting the proper precision before each
> mathematical operation (as SH4 does IMO), we will run into problems with spills,
> since these can be put in places where the control word is set to some wrong value,
> resulting in wrong rounding before storing.

If spills spilled the extended precision value, which is needed
anyway for proper IEEE conformance, this wouldn't be an issue.

Brad

* Re: floor on i386
  2001-09-25  7:47       ` Brad Lucier
@ 2001-09-25  8:00         ` Jan Hubicka
  2001-09-25 12:52           ` Tim Prince
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Hubicka @ 2001-09-25  8:00 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Jan Hubicka, Alexandre Oliva, Joe Buck, Chris Lattner, gcc

> > A major problem still remains in reload.
> > If we don't want to get exact IEEE by setting the proper precision before each
> > mathematical operation (as SH4 does IMO), we will run into problems with spills,
> > since these can be put in places where the control word is set to some wrong value,
> > resulting in wrong rounding before storing.
> 
> If spills spilled the extended precision value, which is needed
> anyway for proper IEEE conformance, this wouldn't be an issue.

Yes, but it is a big performance problem when done, at least for AMD CPUs, where
XFmode spills cost a lot more than DFmode or SFmode ones, so it should not be
enabled unconditionally.
(I was trying to implement this idea in the past and it appears to be quite
difficult to do too :( )

Honza
> 
> Brad

* Re: floor on i386
  2001-09-25  6:47     ` Jan Hubicka
  2001-09-25  7:47       ` Brad Lucier
@ 2001-09-25 12:48       ` Frank Klemm
  2001-09-26  4:14         ` Jan Hubicka
  1 sibling, 1 reply; 13+ messages in thread
From: Frank Klemm @ 2001-09-25 12:48 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

On Tue, Sep 25, 2001 at 03:46:59PM +0200, Jan Hubicka wrote:
> > It would probably be best to introduce a hard register to indicate the
> > rounding mode, and use OPTIMIZE_MODE_SWITCHING to do as few mode
> > changes as possible.  For reference, have a look at the SH4
> > implementation of floating-point support, that defines an explicit
> > floating-point control register, mode-switching RTL and USEs that
> 
> The USEs themselves are a problem: you lose a lot of optimizations then.
> The trick can be to lower the code before reload using pre-reload splitting.
> 
> A major problem still remains in reload.
> If we don't want to get exact IEEE by setting the proper precision before each
> mathematical operation (as SH4 does IMO), we will run into problems with spills,
> since these can be put in places where the control word is set to some wrong value,
> resulting in wrong rounding before storing.
> 
> That's the main reason why my original patch didn't contain it.
> 
> The problem can be solved by a mode-switching pass after reload, when all spills
> are visible: you use the existing pass before reload to compute the control word
> values, as these need pseudos, and after reload just insert fldcw/fstcw at
> strategic places.
> 
> If you insert them at the last optimal position in the code, you will get them
> after the lazy code to compute the control word inserted by the pre-reload pass.
> 
> As discussed with Timothy, the benefits are relatively small compared to the first
> half (computing control word values optimally), as CPUs do have hardware bypass.
> 
> > register in all instructions that depend on the floating-point mode,
> > indicating in an attribute which mode the register is supposed to be
> > in.  The difference is that SH4 uses the floating-point control
> > register to switch between single- and double-precision operations,
> > that have the same encoding but different behavior depending on the
> > state of the control register.  Modeling mode switching for purposes
> > of rounding on x86 should be far simpler.  In fact, I'm not even sure
> > you'd need the hard register: just define unspec patterns that switch
> > back and forth and you're done.
> You need a scheduling barrier, but it is a big problem.
> 

Note that this optimization is necessary if gcc doesn't want to have 4% of
the performance of icc for Intel IA-32. For example, an MPEG-2 Layer 2 decoder
spends 65% of the execution time in rounding floats to integers (Athlon).
This is not a joke, it's a flaw of the compiler.

Currently gcc is unusable if you need fast float-to-int conversion
(Video).
______________________________________________________________________

Another work-around is the following. It can be implemented very quickly.

enum rounding_model_e {
    round_default = 0x0000,
    round_floor   = 0x0400,
    round_ceil    = 0x0800,
    round_trunc   = 0x0C00,
    round_round   = 0x0000
};

enum rounding_model_e  set_rounding_model ( enum rounding_model_e );

double         rint    ( double );
float          rintf   ( float );
long double    rintl   ( long double );
int            irint   ( double );		// 64 bit float to 32 bit int
long long      llrintl ( long double );		// 80 bit float to 64 bit int

Other target types (signed|unsigned, char|short|int|long|long long) are
also possible, as are other saturation models (wrap|saturate|integer infinity).
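
The enum values above line up with the rounding-control (RC) field of the
x87 control word, bits 10 and 11. A rough sketch of how a set_rounding_model
could compute the new control word value, as pure bit manipulation with no
actual fldcw; the helper names are hypothetical and not part of the proposal:

```c
/* Rounding-control (RC) field of the x87 control word: bits 10 and 11. */
#define X87_RC_MASK 0x0C00u

enum rounding_model_e {
    round_default = 0x0000,
    round_floor   = 0x0400,
    round_ceil    = 0x0800,
    round_trunc   = 0x0C00,
    round_round   = 0x0000
};

/* Return a copy of control word `cw` with its RC field set to `model`;
   all other bits (exception masks, precision control) are kept. */
unsigned short cw_with_rounding(unsigned short cw, enum rounding_model_e model)
{
    return (unsigned short)((cw & ~X87_RC_MASK) |
                            ((unsigned)model & X87_RC_MASK));
}

/* Extract the rounding model encoded in control word `cw`. */
enum rounding_model_e cw_rounding(unsigned short cw)
{
    return (enum rounding_model_e)(cw & X87_RC_MASK);
}
```

Starting from the x87 default control word 0x037F, selecting round_floor
gives 0x077F and selecting round_trunc gives 0x0F7F.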

-- 
Frank Klemm

PS: There are CPUs with the following mapping of 32 bit integers:
    0x80000001...0x7FFFFFFF:   -2^31+1 ... +2^31-1
    0x80000000:                integer NAN
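
A sketch of a conversion that respects such a mapping, assuming a saturating
model in which out-of-range inputs clamp to the largest representable
magnitude and NaN maps to the reserved pattern. The function name and the
choice of truncation for in-range values are assumptions made for
illustration:

```c
#include <stdint.h>

/* Reserved bit pattern 0x80000000, used here as the "integer NaN". */
#define INT_NAN INT32_MIN

/* Convert with saturation to -2^31+1 ... +2^31-1; NaN gives INT_NAN. */
int32_t to_int_saturating(double x)
{
    if (x != x)                 /* NaN compares unequal to itself */
        return INT_NAN;
    if (x >= 2147483647.0)
        return INT32_MAX;
    if (x <= -2147483647.0)
        return -INT32_MAX;
    return (int32_t)x;          /* in range: truncates toward zero */
}
```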


* Re: floor on i386
  2001-09-25  8:00         ` Jan Hubicka
@ 2001-09-25 12:52           ` Tim Prince
  0 siblings, 0 replies; 13+ messages in thread
From: Tim Prince @ 2001-09-25 12:52 UTC (permalink / raw)
  To: Jan Hubicka, Brad Lucier
  Cc: Jan Hubicka, Alexandre Oliva, Joe Buck, Chris Lattner, gcc

----- Original Message -----
From: "Jan Hubicka" <jh@suse.cz>
To: "Brad Lucier" <lucier@math.purdue.edu>
Cc: "Jan Hubicka" <jh@suse.cz>; "Alexandre Oliva"
<aoliva@redhat.com>; "Joe Buck" <jbuck@synopsys.COM>; "Chris
Lattner" <sabre@nondot.org>; <gcc@gcc.gnu.org>
Sent: Tuesday, September 25, 2001 8:00 AM
Subject: Re: floor on i386


> > > A major problem still remains in reload.
> > > If we don't want to get exact IEEE by setting the proper precision before each
> > > mathematical operation (as SH4 does IMO), we will run into problems with spills,
> > > since these can be put in places where the control word is set to some wrong value,
> > > resulting in wrong rounding before storing.
> >
> > If spills spilled the extended precision value, which is needed
> > anyway for proper IEEE conformance, this wouldn't be an issue.
>
> Yes, but it is a big performance problem when done, at least for AMD CPUs, where
> XFmode spills cost a lot more than DFmode or SFmode ones, so it should not be
> enabled unconditionally.
> (I was trying to implement this idea in the past and it appears to be quite
> difficult to do too :( )
>
> Honza
> >
> > Brad

XFmode spills should not be so expensive if 16-byte alignment
could be assured.  Those people who set the CPU into 53-bit
precision mode, as well as those who don't like the alignment
requirement, would want a way to keep the current scheme.

* Re: floor on i386
  2001-09-25 12:48       ` Frank Klemm
@ 2001-09-26  4:14         ` Jan Hubicka
  2001-09-26 15:45           ` Frank Klemm
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Hubicka @ 2001-09-26  4:14 UTC (permalink / raw)
  To: Frank Klemm; +Cc: Jan Hubicka, gcc

> Note that this optimization is necessary if gcc doesn't want to have 4% of
> the performance of icc for Intel IA-32. For example, an MPEG-2 Layer 2 decoder
> spends 65% of the execution time in rounding floats to integers (Athlon).
> This is not a joke, it's a flaw of the compiler.
Can't it just use the rint function, as discussed earlier?

Honza

* Re: floor on i386
  2001-09-26  4:14         ` Jan Hubicka
@ 2001-09-26 15:45           ` Frank Klemm
  0 siblings, 0 replies; 13+ messages in thread
From: Frank Klemm @ 2001-09-26 15:45 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

On Wed, Sep 26, 2001 at 01:14:36PM +0200, Jan Hubicka wrote:
> > Note that this optimization is necessary if gcc doesn't want to have 4% of
> > the performance of icc for Intel IA-32. For example, an MPEG-2 Layer 2 decoder
> > spends 65% of the execution time in rounding floats to integers (Athlon).
> > This is not a joke, it's a flaw of the compiler.
>
> Can't it just use the rint function, as discussed earlier?
> 
No.

1st:  rint() is faster, but still very slow (14x slower than optimum)
2nd:  There's no set of functions to select another standard rounding
      model.

-- 
Frank Klemm

* Re: floor on i386
  2001-09-26 16:19 dewar
@ 2001-09-27  7:09 ` Frank Klemm
  0 siblings, 0 replies; 13+ messages in thread
From: Frank Klemm @ 2001-09-27  7:09 UTC (permalink / raw)
  To: dewar; +Cc: gcc

On Wed, Sep 26, 2001 at 07:19:12PM -0400, dewar@gnat.com wrote:
>
> <<1st:  rint() is faster, but still very slow (14x slower than optimum)
> 2nd:  There's no set of functions to select another standard rounding
>       model.
> 
> Why isn't there a "rounding" function that *does* do the "optimal" 
> function?
>
I posted a function collection. This function collection is not ANSI C,
but with it, it is possible to write high-performance code. You must be
careful while writing the code, but you are able to write it.

Currently you can only write slow code. And slow code doesn't mean 70% of the
optimum performance, but something around 4%...7% of the optimum
performance (4% with floor, 7% with rint).

-- 
Frank Klemm

* Re: floor on i386
@ 2001-09-26 16:19 dewar
  2001-09-27  7:09 ` Frank Klemm
  0 siblings, 1 reply; 13+ messages in thread
From: dewar @ 2001-09-26 16:19 UTC (permalink / raw)
  To: jh, pfk; +Cc: gcc

<<1st:  rint() is faster, but still very slow (14x slower than optimum)
2nd:  There's no set of functions to select another standard rounding
      model.
>>

Why isn't there a "rounding" function that *does* do the "optimal" 
function?

* floor on i386
@ 2001-09-12  9:26 Brad Lucier
  0 siblings, 0 replies; 13+ messages in thread
From: Brad Lucier @ 2001-09-12  9:26 UTC (permalink / raw)
  To: gcc; +Cc: Brad Lucier

I just reviewed the various threads on gcc.gnu.org about slow code
generated for the floor intrinsic.

If one had tree nodes or RTL nodes to express the sequence

Save rounding mode
Change rounding mode to round-down
Convert to integer
Restore rounding mode

then perhaps the rounding mode changes could be moved
outside of loops where they occur, if there were no floating-point
instructions in the loop that needed the currently set rounding mode.

Brad Lucier
