* (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele
To: gcc
Date: 2004-03-25 12:03 UTC

I think there is an obvious need for doing the optimization (a+b)+c -> a+(b+c) in, e.g., many scientific codes. Consider matrix multiply:

do k=1,N
  do j=1,N
    do i=1,N
      c(i,j)=c(i,j)+a(i,k)*b(k,j)
    enddo
  enddo
enddo

Good compilers (e.g. xlf90) will (at -O4) do higher-order transforms of the loop to introduce blocking, independent FMAs, etc., which make this little piece of code about 100 times faster at O4 than at O2 (what about LNO/SSA?). This can only be done if you allow (a+b)+c -> a+(b+c). It is basically what any optimized BLAS routine will do. Matrix multiply is a trivial example; if you want BLAS performance, call BLAS. But there are many other kernels like this in scientific code that are not BLAS, and you can't expect a scientist to hand-unroll and block every kernel to the appropriate depth for every machine. There needs to be a compiler option to do this, and it can only be done if you allow (a+b)+c -> a+(b+c).

Joost
* Re: (a+b)+c should be replaced by a+(b+c)
From: Robert Dewar
To: Joost VandeVondele; +Cc: gcc
Date: 2004-03-25 14:45 UTC

Joost VandeVondele wrote:

> good compilers (e.g. xlf90) will (at -O4) do higher order transforms of
> the loop to introduce blocking, independent FMAs, ... that makes this
> little piece of code about 100 times faster at O4 than O2 (what about
> LNO/SSA?). This can only be done if you allow (a+b)+c -> a+(b+c). It is
> basically what any optimized blas routine will do. Matrix multiply is a
> trivial example, if you want blas performance, call blas. There are many
> other kernels like this in e.g. scientific code that are not blas. You
> can't expect a scientist to hand unroll and block any kernel to the
> appropriate depth for any machine. There need to be a compiler option to
> do this. This can only be done if you allow (a+b)+c -> a+(b+c).

Can you really deduce this freedom from later versions of the Fortran standard?
* Re: (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele
To: Robert Dewar; +Cc: gcc
Date: 2004-03-25 15:07 UTC

On Thu, 25 Mar 2004, Robert Dewar wrote:

> Can you really deduce this freedom from later versions of the Fortran
> standard?

No. I'm only happy there are compilers that make my code 100 times faster without my doing a lot of work, keeping my code easy to maintain and read.

Another example that relies on this kind of optimization is OMP/MPI code. There is a large class of problems for which this optimization is just what is needed.

BTW, timings of the code below on an IBM SP4 with xlf90 (it would be useful to see how gfortran compares):

O2: 116.76s
O4: 2.4s
O5: 1.6s

Joost

INTEGER, PARAMETER :: N=1024
REAL*8 :: A(N,N), B(N,N), C(N,N)
REAL*8 :: t1,t2
A=0.1D0
B=0.1D0
C=0.0D0
CALL cpu_time(t1)
CALL mult(A,B,C,N)
CALL cpu_time(t2)
write(6,*) t2-t1,C(1,1)
END

SUBROUTINE mult(A,B,C,N)
REAL*8 :: A(N,N), B(N,N), C(N,N)
INTEGER :: I,J,K,N
DO J=1,N
  DO I=1,N
    DO K=1,N
      C(I,J)=C(I,J)+A(I,K)*B(K,J)
    ENDDO
  ENDDO
ENDDO
END
* Re: (a+b)+c should be replaced by a+(b+c)
From: Robert Dewar
To: Joost VandeVondele; +Cc: gcc
Date: 2004-03-25 15:18 UTC

Joost VandeVondele wrote:

> No, I'm only happy there are compilers that make my code 100 times faster
> without doing a lot of work myself, keeping my code easy to maintain and
> read.

Well, it is fine to have this kind of transformation available as an option, though in general it is better to rely on BLAS written by competent numerical programmers than on transformations of unknown impact.

> Another example that relies on this kind of optimization that comes to my
> mind is OMP/MPI code. There is just a large class of problems for which
> this optimization is just what is needed.

Please do not call this an optimization; call it a transformation.
* Re: (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele
To: Robert Dewar; +Cc: gcc
Date: 2004-03-25 15:32 UTC

Robert Dewar wrote:

> Well it is fine to have this kind of transformation available as an
> option, though in general it is better to rely on BLAS written by
> competent numerical programmers, than on transformations of unknown
> impact.

Obviously -- this was an example (and I referred to calling BLAS explicitly), meant to suggest that there is a wide range of computational kernels that benefit from (a+b)+c -> a+(b+c) being performed by the compiler. (I realize there are much better examples of this, but anyway.)

I like the name "transformation" as much as I dislike "unsafe-math". FYI, the following warning comes from IBM's compiler (at -O3, when optimizing -- oops, transforming -- expressions in a way that might lead to non-bitwise-identical results), and I think it is not badly worded:

"mytest.f90", 1500-036 (I) The NOSTRICT option (default at OPT(3)) has
the potential to alter the semantics of a program. Please refer to
documentation on the STRICT/NOSTRICT option for more information.

Joost
* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd
To: gcc mailing list
Date: 2004-03-25 15:59 UTC

Joost VandeVondele wrote:

> BTW, timing of the code below on IBM SP4 with xlf90, would be useful to
> see how gfortran performs.

Being in a benchmarking mood, I took your code and compiled it on a 2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in a very good light:

- - - - - - - - - - - - - - - - - - - - -

Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
IPO: using IR for /tmp/ifortyRX1Wg.o
IPO: performing single-file optimizations
matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
Tycho$ ./matmuli
   5.90410300000000        10.2399999999998

Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
Tycho$ ./matmulg
   71.4641360000000        10.2400000000000

Tycho$ icc -V
Intel(R) C++ Compiler for 32-bit applications, Version 8.0 Build 20031211Z
Package ID: l_cc_p_8.0.055_pe057
Copyright (C) 1985-2003 Intel Corporation. All rights reserved.
Tycho$ gfortran -v
Reading specs from /opt/gcc-tree-ssa/lib/gcc/i686-pc-linux-gnu/3.5-tree-ssa/specs
Configured with: ../gcc/configure --prefix=/opt/gcc-tree-ssa --disable-checking --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-languages=c,c++,f95
Thread model: posix
gcc version 3.5-tree-ssa 20040316 (merged 20040307)

- - - - - - - - - - - - - - - - - - - - -

The generated assembler from GCC looks like:

	.globl mult_
	.type	mult_, @function
mult_:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$36, %esp
	movl	20(%ebp), %eax
	movl	16(%ebp), %esi
	movl	8(%ebp), %edi
	movl	12(%ebp), %ecx
	movl	(%eax), %ebx
	movl	%ebx, -16(%ebp)
	xorl	$-1, %ebx
	sall	$3, %ebx
	movl	-16(%ebp), %eax
	movl	%ebx, -28(%ebp)
	addl	%ebx, %esi
	movl	-28(%ebp), %edx
	addl	%ebx, %edi
	addl	%ecx, %edx
	movl	%esi, -20(%ebp)
	movl	%edi, -24(%ebp)
	movl	%edx, -28(%ebp)
	testl	%eax, %eax
	jle	.L1
	movl	-16(%ebp), %edx
	movl	%edx, %ebx
	movl	%edx, -36(%ebp)
	movl	%edx, -44(%ebp)
	movl	%edx, %esi
	sall	$3, %ebx
	movl	%edx, %edi
.L4:
	movl	-28(%ebp), %eax
	movl	$1, -32(%ebp)
	movl	-20(%ebp), %edx
	leal	(%eax,%edi,8), %ecx
	movl	%ecx, -40(%ebp)
	movl	%esi, %ecx
	.p2align 4,,7
.L5:
	movl	-32(%ebp), %edi
	movl	-44(%ebp), %eax
	addl	%edi, %eax
	movl	-24(%ebp), %edi
	movl	%eax, -48(%ebp)
	fldl	(%edx,%eax,8)
	movl	-32(%ebp), %eax
	movl	-40(%ebp), %edx
	addl	%ecx, %eax
	addl	$8, %edx
	leal	(%edi,%eax,8), %eax
	.p2align 4,,7
.L6:
	fldl	(%edx)
	fmull	(%eax)
	decl	%ecx
	addl	%ebx, %eax
	addl	$8, %edx
	testl	%ecx, %ecx
	faddp	%st, %st(1)
	jg	.L6
.L7:
	movl	-32(%ebp), %ecx
	movl	-48(%ebp), %eax
	movl	-20(%ebp), %edx
	incl	%ecx
	decl	%esi
	movl	%ecx, -32(%ebp)
	fstpl	(%edx,%eax,8)
	testl	%esi, %esi
	jle	.L18
	movl	-16(%ebp), %ecx
	jmp	.L5
.L2:
.L1:
	addl	$36, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
.L8:
.L18:
	movl	-36(%ebp), %edx
	movl	-44(%ebp), %ecx
	decl	%edx
	movl	-16(%ebp), %edi
	movl	%edx, -36(%ebp)
	addl	%edi, %ecx
	movl	-36(%ebp), %esi
	movl	%ecx, -44(%ebp)
	testl	%esi, %esi
	jle	.L1
	movl	%edi, %esi
	movl	-44(%ebp), %edi
	jmp	.L4
	.size	mult_, .-mult_
	.local	c.2
	.comm	c.2,8388608,32
	.local	a.0
	.comm	a.0,8388608,32
	.local	b.1
	.comm	b.1,8388608,32
	.section	.rodata.str1.1,"aMS",@progbits,1

- - - - - - - - - - - - - - - - - - - - -

The generated assembler for Intel Fortran:

	.globl mult_
mult_:
# parameter 1: 28 + %esp
# parameter 2: 32 + %esp
# parameter 3: 36 + %esp
# parameter 4: 40 + %esp
..B2.1:                        # Preds ..B2.0
        pushl     %edi                # 15.17
        pushl     %esi                # 15.17
        pushl     %ebp                # 15.17
        pushl     %ebx                # 15.17
        subl      $8, %esp            # 15.17
        movl      40(%esp), %eax      # 1.0
        movl      (%eax), %ebp        # 15.17
        movl      $1, %ebx            # 18.6
        testl     %ebp, %ebp          # 18.6
        jle       ..B2.9              # Prob 1% # 18.6
                                      # LOE ebx ebp
..B2.2:                        # Preds ..B2.1
        movl      28(%esp), %esi
        movl      32(%esp), %edx
        movl      36(%esp), %edi
        lea       (%ebp,%ebp), %eax
        addl      %eax, %eax
        addl      %eax, %eax
        subl      %eax, %esi
        movl      %esi, (%esp)
        subl      %eax, %edx
        subl      %eax, %edi
        movl      %ebx, %ecx
        imull     %eax, %ecx
        addl      %edx, %ecx
        movl      %ebx, %edx
        imull     %eax, %edx
        addl      %edi, %edx
                                      # LOE eax edx ecx ebx ebp
..B2.3:                        # Preds ..B2.7 ..B2.2
        movl      (%esp), %esi        # 19.6
        movl      %ebx, 4(%esp)       # 19.6
        movl      $1, %edi            # 19.6
        lea       (%eax,%esi), %esi   # 19.6
                                      # LOE eax edx ecx ebp esi edi
..B2.4:                        # Preds ..B2.6 ..B2.3
        movsd     -8(%ecx,%edi,8), %xmm0 # 21.29
        movl      $1, %ebx            # 20.6
        .align    4,0x90
                                      # LOE eax edx ecx ebx ebp esi edi xmm0
..B2.5:                        # Preds ..B2.5 ..B2.4
        movsd     -8(%esi,%ebx,8), %xmm1 # 21.22
        mulsd     %xmm0, %xmm1        # 21.28
        addsd     -8(%edx,%ebx,8), %xmm1 # 21.21
        movsd     %xmm1, -8(%edx,%ebx,8) # 21.8
        addl      $1, %ebx            # 20.6
        cmpl      %ebp, %ebx          # 20.6
        jle       ..B2.5              # Prob 99% # 20.6
                                      # LOE eax edx ecx ebx ebp esi edi xmm0
..B2.6:                        # Preds ..B2.5
        addl      %eax, %esi          # 19.6
        addl      $1, %edi            # 19.6
        cmpl      %ebp, %edi          # 19.6
        jle       ..B2.4              # Prob 99% # 19.6
                                      # LOE eax edx ecx ebp esi edi
..B2.7:                        # Preds ..B2.6
        movl      4(%esp), %ebx
        addl      %eax, %ecx          # 18.6
        addl      %eax, %edx          # 18.6
        addl      $1, %ebx            # 18.6
        cmpl      %ebp, %ebx          # 18.6
        jle       ..B2.3              # Prob 99% # 18.6
                                      # LOE eax edx ecx ebx ebp
..B2.9:                        # Preds ..B2.7 ..B2.1
        addl      $8, %esp            # 26.6
        popl      %ebx                # 26.6
        popl      %ebp                # 26.6
        popl      %esi                # 26.6
        popl      %edi                # 26.6
        ret                           # 26.6

- - - - - - - - - - - - - - - - - - - - -
I think gfortran gets its tail stomped by Intel's effort in this comparison.

Side note: I assume you are aware that your code is a brute-force technique for matrix multiplication, and that other algorithms are much more efficient.

If anyone is interested, I can perform the same experiment with the Intel and GNU C compilers.

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing
* Re: (a+b)+c should be replaced by a+(b+c)
From: Jakub Jelinek
To: Scott Robert Ladd; +Cc: gcc mailing list
Date: 2004-03-25 16:18 UTC

On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:

> Being in a benchmarking mood, I took your code and compiled it on a
> 2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
> a very good light:
> [...]
> Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for

You forgot -mfpmath=sse. That is only the default for -m64.

	Jakub
* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd
To: Jakub Jelinek; +Cc: gcc mailing list
Date: 2004-03-25 16:38 UTC

Jakub Jelinek wrote:

> On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:
>> Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
>
> You forgot -mfpmath=sse. That is only the default for -m64.

Good point; I've been doing Opteron work for a week, and had gotten used to not having to declare certain flags explicitly. Also, a minimized browser was playing a &%$!! Flash animation in the background, so I'll run the numbers on a clean machine without that overhead.
And the compiler says:

- - - - - - - - - - - - - - - -
Tycho$ gfortran -o matmulg -O3 -march=pentium4 -ffast-math matmul.for
Tycho$ ./matmulg
   64.9091330000000        10.2400000000000
Tycho$ gfortran -o matmulg -O3 -march=pentium4 -ffast-math -mfpmath=sse matmul.for
Tycho$ ./matmulg
   64.6051790000000        10.2399999999998
Tycho$ gfortran -o matmulg -O3 -march=pentium4 -mfpmath=sse matmul.for
Tycho$ ./matmulg
   64.7361590000000        10.2399999999998
Tycho$ gfortran -o matmulg -O3 -march=pentium4 matmul.for
Tycho$ ./matmulg
   64.7751530000000        10.2400000000000
Tycho$
- - - - - - - - - - - - - - - -

[dry_sarcasm]
Well, we can see that -ffast-math *really* helps in this situation, huh?
[/dry_sarcasm]

Nor did -mfpmath=sse show much value for this test. In my experience, -mfpmath=sse often fails to produce faster code (with gfortran or gcc).

What about Intel Fortran with its -mp1 and -mp options?

- - - - - - - - - - - - - - - -
Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
Tycho$ ./matmuli
   4.85226200000000        10.2399999999998
Tycho$ ifort -O3 -tpp7 -xN -ipo -mp1 -o matmuli matmul.for
Tycho$ ./matmuli
   4.90425400000000        10.2399999999998
Tycho$ ifort -O3 -tpp7 -xN -ipo -mp -o matmuli matmul.for
Tycho$ ./matmuli
   66.0699560000000        10.2399999999998
- - - - - - - - - - - - - - - -

Forcing Intel to stick to the "rules" does slow its performance. Certainly some food for thought...

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing
* Re: (a+b)+c should be replaced by a+(b+c)
From: Laurent GUERBY
To: Scott Robert Ladd; +Cc: Jakub Jelinek, gcc mailing list
Date: 2004-03-25 19:47 UTC

On Thu, 2004-03-25 at 16:06, Scott Robert Ladd wrote:

> [dry_sarcasm]
> Well, we can see that -ffast-math *really* helps in this situation, huh?
> [/dry_sarcasm]

Could you try -funroll-all-loops and maybe a prefetch flag? On short numerical loops this usually makes a lot of difference.

Laurent
* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd
To: Laurent GUERBY; +Cc: Jakub Jelinek, gcc mailing list
Date: 2004-03-25 20:16 UTC

Laurent GUERBY wrote:

> Could you try -funroll-all-loops and maybe a prefetch flag? On short
> numerical loops this usually makes a lot of difference.

gfortran -o matmulg -O3 -ffast-math -march=pentium4 \
    -fprefetch-loop-arrays -funroll-all-loops -mfpmath=sse matmul.for

...does not improve matters. Perhaps this is due to the nascent nature of gfortran?

As it is, I'm going to stop doing on-the-spot benchmarks for this set of topics, having demonstrated my assertions about performance and accuracy. At this point we're using anecdotal evidence to pick compiler options; I prefer a more scientific approach. By Monday, my new Acovea runs (on both P4 and Opteron) will have had a chance to evolve optimal compiler option sets, showing us where the strengths and weaknesses lie. Or so I hope!

..Scott

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing
* Re: (a+b)+c should be replaced by a+(b+c)
From: Gabriel Paubert
To: Jakub Jelinek; +Cc: Scott Robert Ladd, gcc mailing list
Date: 2004-03-26 2:51 UTC

On Thu, Mar 25, 2004 at 01:19:27PM +0100, Jakub Jelinek wrote:

> On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:
>> Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
>
> You forgot -mfpmath=sse. That is only the default for -m64.

Isn't it rather -mfpmath=sse2, since he is using doubles? IIRC, -mfpmath=sse will only use SSE instructions for floats, not for doubles.

	Gabriel
* Re: (a+b)+c should be replaced by a+(b+c)
From: Jakub Jelinek
To: Gabriel Paubert; +Cc: Scott Robert Ladd, gcc mailing list
Date: 2004-03-26 3:17 UTC

On Thu, Mar 25, 2004 at 11:51:20PM +0100, Gabriel Paubert wrote:

>> You forgot -mfpmath=sse. That is only the default for -m64.
>
> Isn't it rather -mfpmath=sse2, since he is using doubles?

No, -mfpmath= only takes the "i387", "sse", "sse,i387" and "i387,sse" options. Whether doubles are done using SSE* or i387 insns depends on -msse2.

	Jakub