* Optimizing floating point *(2^c) and /(2^c)
@ 2010-03-29 17:38 Jeroen Van Der Bossche
  2010-03-29 20:30 ` Geert Bosch
  0 siblings, 1 reply; 9+ messages in thread
From: Jeroen Van Der Bossche @ 2010-03-29 17:38 UTC (permalink / raw)
To: gcc

I've recently written a program where taking the average of 2 floating-point numbers was a real bottleneck. I've looked into the assembly generated by gcc -O3, and apparently gcc treats multiplication and division by a hard-coded 2 like any other multiplication by a constant. I think, however, that *(2^c) and /(2^c) for floating-point numbers, where c is known at compile time, should be able to be optimized with the following pseudo-code:

e = exponent bits of the number
if (e > c && e < (0b111...11)-c) {
    e += c or e -= c
} else {
    do regular multiplication
}

Even further optimizations may be possible, such as bit-shifting the significand when e=0. However, that would require checking for so many special cases, and so many conditional jumps, that it's most likely not going to be any faster.

I'm not skilled enough with assembly to write this myself and test whether it actually performs faster than the current implementation. Its performance will most likely also depend on the processor architecture, and I could only test this code on one machine. Therefore I ask those who are familiar with gcc's optimization routines to give this 2 seconds of thought, as this is probably rather easy to implement and many programs could benefit from it.

Greets,
Jeroen

^ permalink raw reply [flat|nested] 9+ messages in thread
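The pseudo-code above can be fleshed out as a C sketch for IEEE 754 binary64. Everything here is illustrative, not code from GCC: the helper name `mul_pow2` and the exact form of the range tests are assumptions, and the fallback uses the standard `ldexp` to cover the cases the fast path cannot handle.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the proposal: multiply x by 2^c by adding c to
 * the biased exponent field, falling back to ldexp() whenever the input
 * is zero, subnormal, infinite, or NaN, or when the adjusted exponent
 * would leave the normal range (overflow/underflow). */
static double mul_pow2(double x, int c)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* view the double as bits */
    int64_t e = (int64_t)((bits >> 52) & 0x7ff); /* biased 11-bit exponent */
    int64_t e2 = e + c;
    if (e != 0 && e != 0x7ff && e2 > 0 && e2 < 0x7ff) {
        /* Fast path: rewrite the exponent field in place. */
        bits = (bits & ~((uint64_t)0x7ff << 52)) | ((uint64_t)e2 << 52);
        memcpy(&x, &bits, sizeof bits);
        return x;
    }
    return ldexp(x, c); /* slow path: let the library handle the edge cases */
}
```

The `memcpy` calls are only there to avoid undefined type-punning at the C level; a compiler implementing this internally would emit the integer operations directly.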
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-29 17:38 Optimizing floating point *(2^c) and /(2^c) Jeroen Van Der Bossche
@ 2010-03-29 20:30 ` Geert Bosch
  2010-03-29 22:43   ` Tim Prince
  2010-03-30 19:03   ` Jeroen Van Der Bossche
  0 siblings, 2 replies; 9+ messages in thread
From: Geert Bosch @ 2010-03-29 20:30 UTC (permalink / raw)
To: Jeroen Van Der Bossche; +Cc: gcc

On Mar 29, 2010, at 13:19, Jeroen Van Der Bossche wrote:

> I've recently written a program where taking the average of 2 floating
> point numbers was a real bottleneck. I've looked into the assembly
> generated by gcc -O3 and apparently gcc treats multiplication and
> division by a hard-coded 2 like any other multiplication with a
> constant. I think, however, that *(2^c) and /(2^c) for floating
> points, where the c is known at compile-time, should be able to be
> optimized with the following pseudo-code:
>
> e = exponent bits of the number
> if (e > c && e < (0b111...11)-c) {
>     e += c or e -= c
> } else {
>     do regular multiplication
> }
>
> [...]

For any optimization suggestion, you should start by showing some real, compilable code with a performance problem that you think the compiler could address. Please include details about compilation options, GCC versions and target hardware, as well as observed performance numbers. How did you determine that averaging two floating-point numbers is a bottleneck? It should only be a single addition and multiplication, and will execute in a nanosecond or so on a moderately modern system.

Your particular suggestion is flawed. Floating-point multiplication is very fast on most targets. It is hard to see how, on any target with floating-point hardware, manual mucking with the representation can be a win. In particular, your sketch doesn't address underflow and overflow at all. A complete implementation would likely be many times slower than a floating-point multiply.

  -Geert
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-29 20:30 ` Geert Bosch
@ 2010-03-29 22:43   ` Tim Prince
  2010-03-31 15:29     ` Geert Bosch
  2010-03-30 19:03   ` Jeroen Van Der Bossche
  1 sibling, 1 reply; 9+ messages in thread
From: Tim Prince @ 2010-03-29 22:43 UTC (permalink / raw)
To: gcc

On 3/29/2010 10:51 AM, Geert Bosch wrote:
> On Mar 29, 2010, at 13:19, Jeroen Van Der Bossche wrote:
>> I've recently written a program where taking the average of 2 floating
>> point numbers was a real bottleneck. [...]
>
> For any optimization suggestion, you should start by showing some real,
> compilable code with a performance problem that you think the compiler
> could address. [...] Likely a complete implementation would be many
> times slower than a floating-point multiply.
>
> -Geert

gcc used to have the ability to replace division by a power of 2 with an fscale instruction, for appropriate targets (maybe it still does). Such targets have nearly disappeared from everyday usage. What remains is the possibility of replacing division by a constant power of 2 with multiplication, but it's generally considered that the programmer should have done that in the first place. icc has such a facility, but it's subject to -fp-model=fast (equivalent to gcc -ffast-math -fno-cx-limited-range), even though it's a totally safe conversion. As Geert indicated, it's almost inconceivable that a correct implementation which takes care of exceptions could match the floating-point hardware performance, even for a case which starts with operands in memory (but you mention the case following an addition).

--
Tim Prince
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-29 22:43 ` Tim Prince
@ 2010-03-31 15:29   ` Geert Bosch
  0 siblings, 0 replies; 9+ messages in thread
From: Geert Bosch @ 2010-03-31 15:29 UTC (permalink / raw)
To: tprince; +Cc: gcc

On Mar 29, 2010, at 16:30, Tim Prince wrote:
> gcc used to have the ability to replace division by a power of 2 with an
> fscale instruction, for appropriate targets (maybe it still does).

The problem (again) is that floating-point multiplication is just too damn fast. On x86, even though the latency may be 5 cycles, the multiplier is fully pipelined, so the throughput is one multiplication per clock cycle, and that's for non-vectorized code! For comparison, the fscale instruction breaks down to 30 µops or so, against a single µop for most forms of floating-point multiplication. Given that Jeroen also needs to do floating-point additions, just bouncing the values between integer and float registers will cost more than the entire multiplication does in the first place.

> Such targets have nearly disappeared from everyday usage. What remains is
> the possibility of replacing division by a constant power of 2 with
> multiplication, but it's generally considered that the programmer should
> have done that in the first place.

No, this is something the compiler does and should do. It is well understood that in binary floating point, division by a power of two is identical to multiplication by its reciprocal, and it's the compiler's job to select the fastest instruction.

  -Geert
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-29 20:30 ` Geert Bosch
  2010-03-29 22:43   ` Tim Prince
@ 2010-03-30 19:03   ` Jeroen Van Der Bossche
  2010-03-31  6:07     ` Vincent Lefevre
  1 sibling, 1 reply; 9+ messages in thread
From: Jeroen Van Der Bossche @ 2010-03-30 19:03 UTC (permalink / raw)
To: Geert Bosch; +Cc: gcc

On 29 March 2010 19:51, Geert Bosch <bosch@adacore.com> wrote:
> For any optimization suggestion, you should start by showing some real,
> compilable code with a performance problem that you think the compiler
> could address. [...]
>
> Your particular suggestion is flawed. Floating-point multiplication is
> very fast on most targets. It is hard to see how, on any target with
> floating-point hardware, manual mucking with the representation can be
> a win. In particular, your sketch doesn't address underflow and
> overflow at all. A complete implementation would likely be many times
> slower than a floating-point multiply.
>
> -Geert

The program in question was a scientific computation where most of the main loop was spent doing two floating-point operations, one of which was taking the average of two numbers. Granted, there is no "performance problem" with the code as it is now; it executes in as much time as would be expected. However, an optimization need not be a 10% performance gain to be called an optimization, right? If we could get this multiplication, which you say executes in about a nanosecond, to execute in half a nanosecond, why not do it?

With all due respect, your statement that my sketch "doesn't address underflow and overflow at all" is just plain wrong. Did you even look at the code? The if statement is there exactly to address under- and overflow and nothing else. It's not because it looks simple that I didn't think it through. I know exactly how floating-point multiplication works, and this implementation of *(2^c) doesn't violate the rules for under/overflow, denormalization when e=0, rounding, or NaN/infinity.

On my processor (an Intel Core 2, x86-64), a mulsd instruction takes about as much time as 5 add instructions, though I have to admit that my method of determining that was highly unscientific. Nevertheless, if someone competent at optimizing assembly could write the whole thing so that the if-true path uses 4 instructions taking as long as an add instruction, that would constitute a 20% gain.

Jeroen
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-30 19:03 ` Jeroen Van Der Bossche
@ 2010-03-31  6:07   ` Vincent Lefevre
  2010-03-31  9:25     ` Marc Glisse
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent Lefevre @ 2010-03-31 6:07 UTC (permalink / raw)
To: gcc

On 2010-03-30 20:36:04 +0200, Jeroen Van Der Bossche wrote:
> The if statement is there exactly to address under- and overflow and
> nothing else. It's not because it looks simple that I didn't think
> it through. I know exactly how floating point multiplication works,
> and this implementation of *(2^c) doesn't violate under/overflow,
> denormalisation when e=0, rounding rules or rules about
> NaN/infinity.

However, extracting the exponent and putting the new exponent back into the floating-point number probably takes significant time (not even mentioning the tests on the exponent) compared to what is done entirely in hardware. I think that before anyone looks at this, you should post benchmarks.

Also, the pipeline can stall due to the "if". Perhaps not in your case, but a compiler cannot know that (the other branch is taken when the value is 0, and that should not be regarded as rare).

--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-31  6:07 ` Vincent Lefevre
@ 2010-03-31  9:25   ` Marc Glisse
  2010-03-31 12:03     ` Vincent Lefevre
  0 siblings, 1 reply; 9+ messages in thread
From: Marc Glisse @ 2010-03-31 9:25 UTC (permalink / raw)
To: gcc

On Wed, 31 Mar 2010, Vincent Lefevre wrote:
> However, extracting the exponent and putting the new exponent back
> into the floating-point number probably takes significant time (not
> even mentioning the tests on the exponent) compared to what is done
> entirely in hardware. I think that before anyone looks at this, you
> should post benchmarks.
>
> Also, the pipeline can stall due to the "if". Perhaps not in your
> case, but a compiler cannot know that (the other branch is taken
> when the value is 0, and that should not be regarded as rare).

IMHO this transformation mostly makes sense for the -ffinite-math-only case, where you can replace "put a constant and multiply/divide" with "put a constant and add/sub" and never care about extracting the exponent, overflowing, or anything else. In that case, it does look like a nice optimization.

Otherwise, some old machines do have a multiplication slow enough that a soft-float implementation of /2^n can beat it, so it seems likely that some recent processors share that characteristic, but it may indeed not be worth the trouble.

--
Marc Glisse
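Under the assumptions Marc describes (no zeros, subnormals, infinities, NaNs, or exponent over/underflow), the whole transformation collapses to a single integer addition on the bit pattern. A hypothetical sketch (the helper name and the caller's responsibility for ruling out the excluded cases are assumptions):

```c
#include <stdint.h>
#include <string.h>

/* Finite-math-only sketch: multiply x by 2^c with one integer add on the
 * representation. Wrong for zero, subnormals, inf/NaN, and whenever the
 * exponent would leave its range -- exactly the cases the caller must
 * have ruled out, in the spirit of -ffinite-math-only. */
static double mul_pow2_finite(double x, int c)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += (uint64_t)(int64_t)c << 52;  /* unsigned wraparound handles c < 0 */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```

Note there is no exponent extraction and no branch at all, so the only remaining cost is moving the value between the FP and integer worlds.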
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-31  9:25 ` Marc Glisse
@ 2010-03-31 12:03   ` Vincent Lefevre
  2010-04-01 16:21     ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent Lefevre @ 2010-03-31 12:03 UTC (permalink / raw)
To: gcc

On 2010-03-31 11:04:03 +0200, Marc Glisse wrote:
> IMHO this transformation mostly makes sense for the
> -ffinite-math-only case, where you can replace "put a constant and
> multiply/divide" with "put a constant and add/sub" and never care
> about extracting the exponent, overflowing, or anything else. In
> that case, it does look like a nice optimization.

I suppose that this could be interesting only if the same register can be seen both as an FP register and as an integer one (for the addition on the exponent field), hoping that mixing both kinds of operations won't stall the pipeline.

--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)
* Re: Optimizing floating point *(2^c) and /(2^c)
  2010-03-31 12:03 ` Vincent Lefevre
@ 2010-04-01 16:21   ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2010-04-01 16:21 UTC (permalink / raw)
To: gcc

On 03/31/2010 11:25 AM, Vincent Lefevre wrote:
> On 2010-03-31 11:04:03 +0200, Marc Glisse wrote:
>> IMHO this transformation mostly makes sense for the
>> -ffinite-math-only case, where you can replace "put a constant and
>> multiply/divide" with "put a constant and add/sub" and never care
>> about extracting the exponent, overflowing, or anything else. In
>> that case, it does look like a nice optimization.
>
> I suppose that this could be interesting only if the same register
> can be seen both as an FP register and as an integer one (for the
> addition on the exponent field), hoping that mixing both kinds of
> operations won't stall the pipeline.

Indeed, judging from the definition of TARGET_SSE_TYPELESS_STORES (search for X86_TUNE_SSE_TYPELESS_STORES in config/i386/i386.c), this is a problem on Intel chips.

Paolo
end of thread, other threads: [~2010-04-01 16:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-29 17:38 Optimizing floating point *(2^c) and /(2^c) Jeroen Van Der Bossche
2010-03-29 20:30 ` Geert Bosch
2010-03-29 22:43   ` Tim Prince
2010-03-31 15:29     ` Geert Bosch
2010-03-30 19:03   ` Jeroen Van Der Bossche
2010-03-31  6:07     ` Vincent Lefevre
2010-03-31  9:25       ` Marc Glisse
2010-03-31 12:03         ` Vincent Lefevre
2010-04-01 16:21           ` Paolo Bonzini