* Optimizing floating point *(2^c) and /(2^c)
@ 2010-03-29 17:38 Jeroen Van Der Bossche
  2010-03-29 20:30 ` Geert Bosch
  0 siblings, 1 reply; 9+ messages in thread
From: Jeroen Van Der Bossche @ 2010-03-29 17:38 UTC (permalink / raw)
To: gcc

I've recently written a program where taking the average of 2 floating-point numbers was a real bottleneck. I've looked into the assembly generated by gcc -O3, and apparently gcc treats multiplication and division by a hard-coded 2 like any other multiplication by a constant. I think, however, that *(2^c) and /(2^c) for floating-point numbers, where c is known at compile time, should be able to be optimized with the following pseudo-code:

e = exponent bits of the number
if (e > c && e < (0b111...11)-c) {
    e += c or e -= c
} else {
    do regular multiplication
}

Even further optimizations may be possible, such as bit-shifting the significand when e=0. However, that would require checking for so many special cases, and so many conditional jumps, that it's most likely not going to be any faster.

I'm not skilled enough with assembly to write this myself and test whether it actually performs faster than the current implementation. Its performance will most likely also depend on the processor architecture, and I could only test this code on one machine. Therefore I ask those who are familiar with gcc's optimization routines to give this 2 seconds of thought, as this is probably rather easy to implement and many programs could benefit from it.

Greets,
Jeroen

^ permalink raw reply [flat|nested] 9+ messages in thread
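The pseudo-code above can be fleshed out as a C sketch for IEEE 754 binary64. Everything here is illustrative, not code from GCC: the helper name `mul_pow2` and the exact form of the range tests are assumptions, and the fallback uses the standard `ldexp` to cover the cases the fast path cannot handle.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the proposal: multiply x by 2^c by adding c to
 * the biased exponent field, falling back to ldexp() whenever the input
 * is zero, subnormal, infinite, or NaN, or when the adjusted exponent
 * would leave the normal range (overflow/underflow). */
static double mul_pow2(double x, int c)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* view the double as bits */
    int64_t e = (int64_t)((bits >> 52) & 0x7ff); /* biased 11-bit exponent */
    int64_t e2 = e + c;
    if (e != 0 && e != 0x7ff && e2 > 0 && e2 < 0x7ff) {
        /* Fast path: rewrite the exponent field in place. */
        bits = (bits & ~((uint64_t)0x7ff << 52)) | ((uint64_t)e2 << 52);
        memcpy(&x, &bits, sizeof bits);
        return x;
    }
    return ldexp(x, c); /* slow path: let the library handle the edge cases */
}
```

The `memcpy` calls are only there to avoid undefined type-punning at the C level; a compiler implementing this internally would emit the integer operations directly.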
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-29 17:38 Optimizing floating point *(2^c) and /(2^c) Jeroen Van Der Bossche
@ 2010-03-29 20:30 ` Geert Bosch
  2010-03-29 22:43   ` Tim Prince
  2010-03-30 19:03   ` Jeroen Van Der Bossche
  0 siblings, 2 replies; 9+ messages in thread
From: Geert Bosch @ 2010-03-29 20:30 UTC (permalink / raw)
To: Jeroen Van Der Bossche; +Cc: gcc

On Mar 29, 2010, at 13:19, Jeroen Van Der Bossche wrote:

> I've recently written a program where taking the average of 2 floating
> point numbers was a real bottleneck. I've looked into the assembly
> generated by gcc -O3 and apparently gcc treats multiplication and
> division by a hard-coded 2 like any other multiplication with a
> constant. I think, however, that *(2^c) and /(2^c) for floating
> points, where the c is known at compile-time, should be able to be
> optimized with the following pseudo-code:
>
> e = exponent bits of the number
> if (e > c && e < (0b111...11)-c) {
>     e += c or e -= c
> } else {
>     do regular multiplication
> }
>
> [...]

For any optimization suggestion, you should start by showing some real, compilable code with a performance problem that you think the compiler could address. Please include details about compilation options, GCC versions and target hardware, as well as observed performance numbers. How did you determine that averaging two floating-point numbers is a bottleneck? It should only be a single addition and multiplication, and will execute in a nanosecond or so on a moderately modern system.

Your particular suggestion is flawed. Floating-point multiplication is very fast on most targets. It is hard to see how, on any target with floating-point hardware, manual mucking with the representation can be a win. In particular, your sketch doesn't address underflow and overflow at all. A complete implementation would likely be many times slower than a floating-point multiply.

  -Geert
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-29 20:30 ` Geert Bosch
@ 2010-03-29 22:43   ` Tim Prince
  2010-03-31 15:29     ` Geert Bosch
  2010-03-30 19:03   ` Jeroen Van Der Bossche
  1 sibling, 1 reply; 9+ messages in thread
From: Tim Prince @ 2010-03-29 22:43 UTC (permalink / raw)
To: gcc

On 3/29/2010 10:51 AM, Geert Bosch wrote:
> On Mar 29, 2010, at 13:19, Jeroen Van Der Bossche wrote:
>> I've recently written a program where taking the average of 2 floating
>> point numbers was a real bottleneck. [...]
>
> For any optimization suggestion, you should start by showing some real,
> compilable code with a performance problem that you think the compiler
> could address. [...] Likely a complete implementation would be many
> times slower than a floating-point multiply.
>
> -Geert

gcc used to have the ability to replace division by a power of 2 with an fscale instruction, for appropriate targets (maybe it still does). Such targets have nearly disappeared from everyday usage. What remains is the possibility of replacing division by a constant power of 2 with multiplication, but it's generally considered that the programmer should have done that in the first place. icc has such a facility, but it's subject to -fp-model=fast (equivalent to gcc -ffast-math -fno-cx-limited-range), even though it's a totally safe conversion. As Geert indicated, it's almost inconceivable that a correct implementation which takes care of exceptions could match the floating-point hardware performance, even for a case which starts with operands in memory (but you mention the case following an addition).

--
Tim Prince
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-29 22:43 ` Tim Prince
@ 2010-03-31 15:29   ` Geert Bosch
  0 siblings, 0 replies; 9+ messages in thread
From: Geert Bosch @ 2010-03-31 15:29 UTC (permalink / raw)
To: tprince; +Cc: gcc

On Mar 29, 2010, at 16:30, Tim Prince wrote:
> gcc used to have the ability to replace division by a power of 2 with an
> fscale instruction, for appropriate targets (maybe it still does).

The problem (again) is that floating-point multiplication is just too damn fast. On x86, even though the latency may be 5 cycles, the multiplier is fully pipelined, so the throughput is one multiplication per clock cycle, and that's for non-vectorized code! For comparison, the fscale instruction breaks down to 30 µops or so, against a single µop for most forms of floating-point multiplication. Given that Jeroen also needs to do floating-point additions, just bouncing the values between integer and float registers will cost more than the entire multiplication does in the first place.

> Such targets have nearly disappeared from everyday usage. What remains is
> the possibility of replacing division by a constant power of 2 with
> multiplication, but it's generally considered that the programmer should
> have done that in the first place.

No, this is something the compiler does and should do. It is well understood that in binary floating point, division by a power of two is identical to multiplication by its reciprocal, and it's the compiler's job to select the fastest instruction.

  -Geert
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-29 20:30 ` Geert Bosch
  2010-03-29 22:43   ` Tim Prince
@ 2010-03-30 19:03   ` Jeroen Van Der Bossche
  2010-03-31  6:07     ` Vincent Lefevre
  1 sibling, 1 reply; 9+ messages in thread
From: Jeroen Van Der Bossche @ 2010-03-30 19:03 UTC (permalink / raw)
To: Geert Bosch; +Cc: gcc

On 29 March 2010 19:51, Geert Bosch <bosch@adacore.com> wrote:
> For any optimization suggestion, you should start by showing some real,
> compilable code with a performance problem that you think the compiler
> could address. [...]
>
> Your particular suggestion is flawed. Floating-point multiplication is
> very fast on most targets. It is hard to see how, on any target with
> floating-point hardware, manual mucking with the representation can be
> a win. In particular, your sketch doesn't address underflow and
> overflow at all. A complete implementation would likely be many times
> slower than a floating-point multiply.
>
> -Geert

The program in question was a scientific computation where most of the main loop was spent doing two floating-point operations, one of which was taking the average of two numbers. Granted, there is no "performance problem" with the code as it is now; it executes in as much time as would be expected. However, an optimization need not be a 10% performance gain to be called an optimization, right? If we could get this multiplication, which you say executes in about a nanosecond, to execute in half a nanosecond, why not do it?

With all due respect, your statement that my sketch "doesn't address underflow and overflow at all" is just plain wrong. Did you even look at the code? The if statement is there exactly to address under- and overflow and nothing else. It's not because it looks simple that I didn't think it through. I know exactly how floating-point multiplication works, and this implementation of *(2^c) doesn't violate the rules for under/overflow, denormalization when e=0, rounding, or NaN/infinity.

On my processor (an Intel Core 2, x86-64), a mulsd instruction takes about as much time as 5 add instructions, though I have to admit that my method of determining that was highly unscientific. Nevertheless, if someone competent at optimizing assembly could write the whole thing so that the if-true path uses 4 instructions taking as long as an add instruction, that would constitute a 20% gain.

Jeroen
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-30 19:03 ` Jeroen Van Der Bossche
@ 2010-03-31  6:07   ` Vincent Lefevre
  2010-03-31  9:25     ` Marc Glisse
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent Lefevre @ 2010-03-31 6:07 UTC (permalink / raw)
To: gcc

On 2010-03-30 20:36:04 +0200, Jeroen Van Der Bossche wrote:
> The if statement is there exactly to address under- and overflow and
> nothing else. It's not because it looks simple that I didn't think
> it through. I know exactly how floating point multiplication works,
> and this implementation of *(2^c) doesn't violate under/overflow,
> denormalisation when e=0, rounding rules or rules about
> NaN/infinity.

However, extracting the exponent and putting the new exponent back into the floating-point number probably takes significant time (not even mentioning the tests on the exponent) compared to what is done entirely in hardware. I think that before anyone looks at this, you should post benchmarks.

Also, the pipeline can stall due to the "if". Perhaps not in your case, but a compiler cannot know that (the other branch is taken when the value is 0, and that should not be regarded as rare).

--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-31  6:07 ` Vincent Lefevre
@ 2010-03-31  9:25   ` Marc Glisse
  2010-03-31 12:03     ` Vincent Lefevre
  0 siblings, 1 reply; 9+ messages in thread
From: Marc Glisse @ 2010-03-31 9:25 UTC (permalink / raw)
To: gcc

On Wed, 31 Mar 2010, Vincent Lefevre wrote:
> However, extracting the exponent and putting the new exponent back
> into the floating-point number probably takes significant time (not
> even mentioning the tests on the exponent) compared to what is done
> entirely in hardware. I think that before anyone looks at this, you
> should post benchmarks.
>
> Also, the pipeline can stall due to the "if". Perhaps not in your
> case, but a compiler cannot know that (the other branch is taken
> when the value is 0, and that should not be regarded as rare).

IMHO this transformation mostly makes sense for the -ffinite-math-only case, where you can replace "put a constant and multiply/divide" with "put a constant and add/sub" and never care about extracting the exponent, overflowing, or anything else. In that case, it does look like a nice optimization.

Otherwise, some old machines do have a multiplication slow enough that a soft-float implementation of /2^n can beat it, so it seems likely that some recent processors share that characteristic, but it may indeed not be worth the trouble.

--
Marc Glisse
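Under the assumptions Marc describes (no zeros, subnormals, infinities, NaNs, or exponent over/underflow), the whole transformation collapses to a single integer addition on the bit pattern. A hypothetical sketch (the helper name and the caller's responsibility for ruling out the excluded cases are assumptions):

```c
#include <stdint.h>
#include <string.h>

/* Finite-math-only sketch: multiply x by 2^c with one integer add on the
 * representation. Wrong for zero, subnormals, inf/NaN, and whenever the
 * exponent would leave its range -- exactly the cases the caller must
 * have ruled out, in the spirit of -ffinite-math-only. */
static double mul_pow2_finite(double x, int c)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += (uint64_t)(int64_t)c << 52;  /* unsigned wraparound handles c < 0 */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```

Note there is no exponent extraction and no branch at all, so the only remaining cost is moving the value between the FP and integer worlds.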
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-31  9:25 ` Marc Glisse
@ 2010-03-31 12:03   ` Vincent Lefevre
  2010-04-01 16:21     ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent Lefevre @ 2010-03-31 12:03 UTC (permalink / raw)
To: gcc

On 2010-03-31 11:04:03 +0200, Marc Glisse wrote:
> IMHO this transformation mostly makes sense for the
> -ffinite-math-only case, where you can replace "put a constant and
> multiply/divide" with "put a constant and add/sub" and never care
> about extracting the exponent, overflowing, or anything else. In
> that case, it does look like a nice optimization.

I suppose that this could be interesting only if the same register can be seen both as an FP register and as an integer one (for the addition on the exponent field), hoping that mixing both kinds of operations won't stall the pipeline.

--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-31 12:03 ` Vincent Lefevre
@ 2010-04-01 16:21   ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2010-04-01 16:21 UTC (permalink / raw)
To: gcc

On 03/31/2010 11:25 AM, Vincent Lefevre wrote:
> On 2010-03-31 11:04:03 +0200, Marc Glisse wrote:
>> IMHO this transformation mostly makes sense for the
>> -ffinite-math-only case, where you can replace "put a constant and
>> multiply/divide" with "put a constant and add/sub" and never care
>> about extracting the exponent, overflowing, or anything else. In
>> that case, it does look like a nice optimization.
>
> I suppose that this could be interesting only if the same register
> can be seen both as an FP register and as an integer one (for the
> addition on the exponent field), hoping that mixing both kinds of
> operations won't stall the pipeline.

Indeed, judging from the definition of TARGET_SSE_TYPELESS_STORES (search for X86_TUNE_SSE_TYPELESS_STORES in config/i386/i386.c), this is a problem on Intel chips.

Paolo
end of thread, other threads: [~2010-04-01 16:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-29 17:38 Optimizing floating point *(2^c) and /(2^c) Jeroen Van Der Bossche
2010-03-29 20:30 ` Geert Bosch
2010-03-29 22:43   ` Tim Prince
2010-03-31 15:29     ` Geert Bosch
2010-03-30 19:03   ` Jeroen Van Der Bossche
2010-03-31  6:07     ` Vincent Lefevre
2010-03-31  9:25       ` Marc Glisse
2010-03-31 12:03         ` Vincent Lefevre
2010-04-01 16:21           ` Paolo Bonzini