* (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele @ 2004-03-25 12:03 UTC (permalink / raw)
To: gcc
I think there is an obvious need for the optimization
(a+b)+c -> a+(b+c) in, e.g., many scientific codes.
Consider matrix multiply:
do k=1,N
  do j=1,N
    do i=1,N
      c(i,j)=c(i,j)+a(i,k)*b(k,j)
    enddo
  enddo
enddo
Good compilers (e.g. xlf90) will (at -O4) do higher-order transforms of
the loop to introduce blocking, independent FMAs, and so on, making this
little piece of code about 100 times faster at -O4 than at -O2 (what about
LNO/SSA?). This can only be done if you allow (a+b)+c -> a+(b+c). It is
basically what any optimized BLAS routine will do. Matrix multiply is a
trivial example; if you want BLAS performance, call BLAS. But there are many
other kernels like this in, e.g., scientific code that are not BLAS. You
can't expect a scientist to hand-unroll and block every kernel to the
appropriate depth for every machine. There needs to be a compiler option to
do this, and it can only be done if you allow (a+b)+c -> a+(b+c).
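(Editorial illustration, not part of the original mail.) The reason compilers refuse to reassociate by default is that floating-point addition is not associative: the two groupings can round differently. A small Python sketch makes this concrete:

```python
# Floating-point addition is not associative, which is why a compiler
# may not rewrite (a+b)+c as a+(b+c) without permission: the two
# groupings can round to different doubles.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # rounds to 0.6000000000000001
right = a + (b + c)  # rounds to 0.6

print(left == right)  # False: the groupings disagree in the last bit
```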
Joost
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: (a+b)+c should be replaced by a+(b+c)
From: Robert Dewar @ 2004-03-25 14:45 UTC (permalink / raw)
To: Joost VandeVondele; +Cc: gcc
Joost VandeVondele wrote:
> good compilers (e.g. xlf90) will (at -O4) do higher order transforms of
> the loop to introduce blocking, independent FMAs, ... that makes this
> little piece of code about 100 times faster at O4 than O2 (what about
> LNO/SSA?). This can only be done if you allow (a+b)+c -> a+(b+c). It is
> basically what any optimized blas routine will do. Matrix multiply is a
> trivial example, if you want blas performance, call blas. There are many
> other kernels like this in e.g. scientific code that are not blas. You
> can't expect a scientist to hand unroll and block any kernel to the
> appropriate depth for any machine. There need to be a compiler option to
> do this. This can only be done if you allow (a+b)+c -> a+(b+c).
Can you really deduce this freedom from later versions of the Fortran
standard?
* Re: (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele @ 2004-03-25 15:07 UTC (permalink / raw)
To: Robert Dewar; +Cc: gcc
On Thu, 25 Mar 2004, Robert Dewar wrote:
> Joost VandeVondele wrote:
>
> > good compilers (e.g. xlf90) will (at -O4) do higher order transforms of
> > the loop to introduce blocking, independent FMAs, ... that makes this
> > little piece of code about 100 times faster at O4 than O2 (what about
..
>
> Can you really deduce this freedom from later versions of the Fortran
> standard?
>
No; I'm simply happy that there are compilers that make my code 100 times
faster without my doing a lot of work, keeping the code easy to maintain
and read.
Another example that comes to mind that relies on this kind of
optimization is OMP/MPI code. There is a large class of problems for
which this optimization is exactly what is needed.
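(Editorial illustration, not part of the original mail.) The OMP/MPI point can be sketched as follows: a parallel reduction gives each worker a contiguous chunk to sum and then combines the partial sums, which implicitly reassociates the additions relative to a serial left-to-right sum. The function below is a hypothetical single-threaded model of that decomposition:

```python
import math

def chunked_sum(xs, nworkers=4):
    """Sum xs the way a parallel reduction would: each worker sums a
    contiguous chunk, then the partial sums are combined. This changes
    the association of the additions versus a serial left-to-right sum,
    so a compiler/runtime needs reassociation freedom to do it."""
    step = math.ceil(len(xs) / nworkers)
    partials = [sum(xs[i:i + step]) for i in range(0, len(xs), step)]
    return sum(partials)

xs = [0.1] * 1000
serial = sum(xs)
parallel = chunked_sum(xs)
# The two results agree only up to rounding in the last bits.
print(math.isclose(serial, parallel))
```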
BTW, here are timings of the code below on an IBM SP4 with xlf90; it
would be useful to see how gfortran compares:
-O2: 116.76s
-O4: 2.4s
-O5: 1.6s
Joost
INTEGER, PARAMETER :: N=1024
REAL*8 :: A(N,N), B(N,N), C(N,N)
REAL*8 :: t1,t2
A=0.1D0
B=0.1D0
C=0.0D0
CALL cpu_time(t1)
CALL mult(A,B,C,N)
CALL cpu_time(t2)
write(6,*) t2-t1,C(1,1)
END

SUBROUTINE mult(A,B,C,N)
  REAL*8 :: A(N,N), B(N,N), C(N,N)
  INTEGER :: I,J,K,N
  DO J=1,N
    DO I=1,N
      DO K=1,N
        C(I,J)=C(I,J)+A(I,K)*B(K,J)
      ENDDO
    ENDDO
  ENDDO
END
* Re: (a+b)+c should be replaced by a+(b+c)
From: Robert Dewar @ 2004-03-25 15:18 UTC (permalink / raw)
To: Joost VandeVondele; +Cc: gcc
Joost VandeVondele wrote:
> No, I'm only happy there are compilers that make my code 100 times faster
> without doing a lot of work myself, keeping my code easy to maintain and
> read.
Well, it is fine to have this kind of transformation available as an
option, though in general it is better to rely on BLAS written by
competent numerical programmers than on transformations of unknown
impact.
> Another example that relies on this kind of optimization that comes to my
> mind is OMP/MPI code. There is just a large class of problems for which
> this optimization is just what is needed.
Please do not call this an optimization; call it a transformation.
* Re: (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele @ 2004-03-25 15:32 UTC (permalink / raw)
To: Robert Dewar; +Cc: gcc
>
> > No, I'm only happy there are compilers that make my code 100 times faster
> > without doing a lot of work myself, keeping my code easy to maintain and
> > read.
>
> Well it is fine to have this kind of transformation available as an
> option, though in general it is better to rely on BLAS written by
> competent numerical programmers, than on transformations of unknown
> impact.
>
Obviously, this was an example (and I referred to calling BLAS
explicitly), meant to suggest that there exists a wide range of
computational kernels that benefit from (a+b)+c -> a+(b+c) being
performed by the compiler. (I realize there are much better examples of
this, but anyway.)
I like the name "transformation" about as much as I dislike
"unsafe-math". FYI, the following warning comes from IBM (emitted at -O3,
when it optimizes, oops, transforms, expressions in a way that might lead
to results that are not bitwise identical), and I think it is not badly
worded:
"mytest.f90", 1500-036 (I) The NOSTRICT option (default at OPT(3)) has the
potential to alter the semantics of a program. Please refer to
documentation on the STRICT/NOSTRICT option for more information.
Joost
* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd @ 2004-03-25 15:59 UTC (permalink / raw)
To: gcc mailing list
Joost VandeVondele wrote:
> BTW, timing of the code below on IBM SP4 with xlf90, would be useful to
> see how gfortran performs.
Being in a benchmarking mood, I took your code and compiled it on a
2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
a very good light:
- - - - - - - - - - - - - - - - - - - - -
Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
IPO: using IR for /tmp/ifortyRX1Wg.o
IPO: performing single-file optimizations
matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
Tycho:$ ./matmuli
5.90410300000000 10.2399999999998
Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
Tycho$ ./matmulg
71.4641360000000 10.2400000000000
Tycho$ icc -V
Intel(R) C++ Compiler for 32-bit applications, Version 8.0 Build
20031211Z Package ID: l_cc_p_8.0.055_pe057
Copyright (C) 1985-2003 Intel Corporation. All rights reserved.
Tycho$ gfortran -v
Reading specs from
/opt/gcc-tree-ssa/lib/gcc/i686-pc-linux-gnu/3.5-tree-ssa/specs
Configured with: ../gcc/configure --prefix=/opt/gcc-tree-ssa
--disable-checking --enable-shared --enable-threads=posix
--enable-__cxa_atexit --enable-languages=c,c++,f95
Thread model: posix
gcc version 3.5-tree-ssa 20040316 (merged 20040307)
- - - - - - - - - - - - - - - - - - - - -
The generated assembler from GCC looks like:
.globl mult_
.type mult_, @function
mult_:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $36, %esp
movl 20(%ebp), %eax
movl 16(%ebp), %esi
movl 8(%ebp), %edi
movl 12(%ebp), %ecx
movl (%eax), %ebx
movl %ebx, -16(%ebp)
xorl $-1, %ebx
sall $3, %ebx
movl -16(%ebp), %eax
movl %ebx, -28(%ebp)
addl %ebx, %esi
movl -28(%ebp), %edx
addl %ebx, %edi
addl %ecx, %edx
movl %esi, -20(%ebp)
movl %edi, -24(%ebp)
movl %edx, -28(%ebp)
testl %eax, %eax
jle .L1
movl -16(%ebp), %edx
movl %edx, %ebx
movl %edx, -36(%ebp)
movl %edx, -44(%ebp)
movl %edx, %esi
sall $3, %ebx
movl %edx, %edi
.L4:
movl -28(%ebp), %eax
movl $1, -32(%ebp)
movl -20(%ebp), %edx
leal (%eax,%edi,8), %ecx
movl %ecx, -40(%ebp)
movl %esi, %ecx
.p2align 4,,7
.L5:
movl -32(%ebp), %edi
movl -44(%ebp), %eax
addl %edi, %eax
movl -24(%ebp), %edi
movl %eax, -48(%ebp)
fldl (%edx,%eax,8)
movl -32(%ebp), %eax
movl -40(%ebp), %edx
addl %ecx, %eax
addl $8, %edx
leal (%edi,%eax,8), %eax
.p2align 4,,7
.L6:
fldl (%edx)
fmull (%eax)
decl %ecx
addl %ebx, %eax
addl $8, %edx
testl %ecx, %ecx
faddp %st, %st(1)
jg .L6
.L7:
movl -32(%ebp), %ecx
movl -48(%ebp), %eax
movl -20(%ebp), %edx
incl %ecx
decl %esi
movl %ecx, -32(%ebp)
fstpl (%edx,%eax,8)
testl %esi, %esi
jle .L18
movl -16(%ebp), %ecx
jmp .L5
.L2:
.L1:
addl $36, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.L8:
.L18:
movl -36(%ebp), %edx
movl -44(%ebp), %ecx
decl %edx
movl -16(%ebp), %edi
movl %edx, -36(%ebp)
addl %edi, %ecx
movl -36(%ebp), %esi
movl %ecx, -44(%ebp)
testl %esi, %esi
jle .L1
movl %edi, %esi
movl -44(%ebp), %edi
jmp .L4
.size mult_, .-mult_
.local c.2
.comm c.2,8388608,32
.local a.0
.comm a.0,8388608,32
.local b.1
.comm b.1,8388608,32
.section .rodata.str1.1,"aMS",@progbits,1
- - - - - - - - - - - - - - - - - - - - -
The generated assembler for Intel Fortran:
.globl mult_
mult_:
# parameter 1: 28 + %esp
# parameter 2: 32 + %esp
# parameter 3: 36 + %esp
# parameter 4: 40 + %esp
..B2.1: # Preds ..B2.0
pushl %edi #15.17
pushl %esi #15.17
pushl %ebp #15.17
pushl %ebx #15.17
subl $8, %esp #15.17
movl 40(%esp), %eax #1.0
movl (%eax), %ebp #15.17
movl $1, %ebx #18.6
testl %ebp, %ebp #18.6
jle ..B2.9 # Prob 1% #18.6
# LOE ebx ebp
..B2.2: # Preds ..B2.1
movl 28(%esp), %esi #
movl 32(%esp), %edx #
movl 36(%esp), %edi #
lea (%ebp,%ebp), %eax #
addl %eax, %eax #
addl %eax, %eax #
subl %eax, %esi #
movl %esi, (%esp) #
subl %eax, %edx #
subl %eax, %edi #
movl %ebx, %ecx #
imull %eax, %ecx #
addl %edx, %ecx #
movl %ebx, %edx #
imull %eax, %edx #
addl %edi, %edx #
# LOE eax edx ecx ebx ebp
..B2.3: # Preds ..B2.7 ..B2.2
movl (%esp), %esi #19.6
movl %ebx, 4(%esp) #19.6
movl $1, %edi #19.6
lea (%eax,%esi), %esi #19.6
# LOE eax edx ecx ebp esi edi
..B2.4: # Preds ..B2.6 ..B2.3
movsd -8(%ecx,%edi,8), %xmm0 #21.29
movl $1, %ebx #20.6
.align 4,0x90
# LOE eax edx ecx ebx ebp esi edi xmm0
..B2.5: # Preds ..B2.5 ..B2.4
movsd -8(%esi,%ebx,8), %xmm1 #21.22
mulsd %xmm0, %xmm1 #21.28
addsd -8(%edx,%ebx,8), %xmm1 #21.21
movsd %xmm1, -8(%edx,%ebx,8) #21.8
addl $1, %ebx #20.6
cmpl %ebp, %ebx #20.6
jle ..B2.5 # Prob 99% #20.6
# LOE eax edx ecx ebx ebp esi edi xmm0
..B2.6: # Preds ..B2.5
addl %eax, %esi #19.6
addl $1, %edi #19.6
cmpl %ebp, %edi #19.6
jle ..B2.4 # Prob 99% #19.6
# LOE eax edx ecx ebp esi edi
..B2.7: # Preds ..B2.6
movl 4(%esp), %ebx #
addl %eax, %ecx #18.6
addl %eax, %edx #18.6
addl $1, %ebx #18.6
cmpl %ebp, %ebx #18.6
jle ..B2.3 # Prob 99% #18.6
# LOE eax edx ecx ebx ebp
..B2.9: # Preds ..B2.7 ..B2.1
addl $8, %esp #26.6
popl %ebx #26.6
popl %ebp #26.6
popl %esi #26.6
popl %edi #26.6
ret #26.6
- - - - - - - - - - - - - - - - - - - - -
I think gfortran gets its tail stomped by Intel's effort in this comparison.
Side note: I assume you are aware that your code is a brute-force
technique for matrix multiplication, and that other algorithms are much
more efficient.
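(Editorial illustration, not part of the original mail.) The kind of higher-order loop transform the thread attributes to xlf90 is cache blocking (tiling). A hypothetical Python sketch, not the compiler's actual output, shows the shape of the transform; note that the tiled accumulation order differs from the naive triple loop, which is exactly the reassociation freedom under discussion:

```python
def matmul_blocked(A, B, n, bs=32):
    """Blocked (tiled) matrix multiply: the j and k loops are split into
    blocks of size bs so each tile of A, B and C stays cache-resident.
    Every (i, j, k) product is accumulated exactly once, but in a
    different order than the naive i/j/k triple loop."""
    C = [[0.0] * n for _ in range(n)]
    for jj in range(0, n, bs):
        for kk in range(0, n, bs):
            for i in range(n):
                for k in range(kk, min(kk + bs, n)):
                    aik = A[i][k]  # hoisted: constant over the inner j loop
                    for j in range(jj, min(jj + bs, n)):
                        C[i][j] += aik * B[k][j]
    return C
```

Real compilers (or hand-tuned BLAS) do this in combination with unrolling, prefetch, and SIMD; the tiling alone already changes the summation order.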
If anyone is interested, I can perform the same experiment with the
Intel and GNU C compilers.
--
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing
* Re: (a+b)+c should be replaced by a+(b+c)
From: Jakub Jelinek @ 2004-03-25 16:18 UTC (permalink / raw)
To: Scott Robert Ladd; +Cc: gcc mailing list
On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:
> Joost VandeVondele wrote:
> >BTW, timing of the code below on IBM SP4 with xlf90, would be useful to
> >see how gfortran performs.
>
> Being in a benchmarking mood, I took your code and compiled it on a
> 2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
> a very good light:
>
> - - - - - - - - - - - - - - - - - - - - -
>
> Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
> IPO: using IR for /tmp/ifortyRX1Wg.o
> IPO: performing single-file optimizations
> matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
> matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
> matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
> Tycho:$ ./matmuli
> 5.90410300000000 10.2399999999998
> Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
You forgot -mfpmath=sse. That is only the default for -m64.
Jakub
* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd @ 2004-03-25 16:38 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: gcc mailing list
Jakub Jelinek wrote:
> On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:
>
>>Joost VandeVondele wrote:
>>
>>>BTW, timing of the code below on IBM SP4 with xlf90, would be useful to
>>>see how gfortran performs.
>>
>>Being in a benchmarking mood, I took your code and compiled it on a
>>2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
>>a very good light:
>>
>>- - - - - - - - - - - - - - - - - - - - -
>>
>>Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
>>IPO: using IR for /tmp/ifortyRX1Wg.o
>>IPO: performing single-file optimizations
>>matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
>>matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
>>matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
>>Tycho:$ ./matmuli
>> 5.90410300000000 10.2399999999998
>>Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
>
>
> You forgot -mfpmath=sse. That is only the default for -m64.
>
> Jakub
>
Good point; I've been doing Opteron work for a week, and was getting
used to not explicitly declaring certain flags.
Also, a minimized browser was playing a &%$!! Flash animation in the
background, so I'll run numbers on a clean machine without the overhead.
And the compiler says:
- - - - - - - - - - - - - - - -
Tycho$ gfortran -o matmulg -O3 -march=pentium4 -ffast-math matmul.for
Tycho$ ./matmulg
64.9091330000000 10.2400000000000
Tycho$ gfortran -o matmulg -O3 -march=pentium4 -ffast-math -mfpmath=sse
matmul.for
Tycho$ ./matmulg
64.6051790000000 10.2399999999998
Tycho$ gfortran -o matmulg -O3 -march=pentium4 -mfpmath=sse matmul.for
Tycho$ ./matmulg
64.7361590000000 10.2399999999998
Tycho$ gfortran -o matmulg -O3 -march=pentium4 matmul.for
Tycho$ ./matmulg
64.7751530000000 10.2400000000000
Tycho$
- - - - - - - - - - - - - - - -
[dry_sarcasm]
Well, we can see that -ffast-math *really* helps in this situation, huh?
[/dry_sarcasm]
Nor did -mfpmath=sse show much value for this test. In my experience,
-mfpmath=sse often fails to produce faster code (with gfortran or gcc).
What about Intel Fortran with its -mp1 and -mp options?
- - - - - - - - - - - - - - - -
Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
Tycho$ ./matmuli
4.85226200000000 10.2399999999998
Tycho$ ifort -O3 -tpp7 -xN -ipo -mp1 -o matmuli matmul.for
Tycho:~/projects/spikes$ ./matmuli
4.90425400000000 10.2399999999998
Tycho$ ifort -O3 -tpp7 -xN -ipo -mp -o matmuli matmul.for
Tycho$ ./matmuli
66.0699560000000 10.2399999999998
- - - - - - - - - - - - - - - -
Forcing Intel to stick with the "rules" does slow its performance.
Certainly some food for thought...
--
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing
* Re: (a+b)+c should be replaced by a+(b+c)
From: Laurent GUERBY @ 2004-03-25 19:47 UTC (permalink / raw)
To: Scott Robert Ladd; +Cc: Jakub Jelinek, gcc mailing list
On Thu, 2004-03-25 at 16:06, Scott Robert Ladd wrote:
> [dry_sarcasm]
> Well, we can see the -ffast-math *really* helps in this suituation, huh?
> [/dry_sarcasm]
Could you try -funroll-all-loops and maybe a prefetch flag? On short
numerical loops this usually makes a lot of difference.
Laurent
* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd @ 2004-03-25 20:16 UTC (permalink / raw)
To: Laurent GUERBY; +Cc: Jakub Jelinek, gcc mailing list
Laurent GUERBY wrote:
> Could you try -funroll-all-loops and may be a prefetch flag? On short
> numerical loops this usually makes lots of difference.
gfortran -o matmulg -O3 -ffast-math -march=pentium4 \
-fprefetch-loop-arrays -funroll-all-loops -mfpmath=sse matmul.for
...does not improve matters. Perhaps this is due to the nascent nature of
gfortran?
As it is, I'm going to stop doing on-the-spot benchmarks for this set of
topics, having demonstrated my assertions about performance and
accuracy. At this point, we're using anecdotal evidence to pick compiler
options; I prefer a more scientific approach.
By Monday, my new Acovea runs (on both P4 and Opteron) will have had a
chance to evolve optimal compiler option sets, showing us where the
strengths and weaknesses lie.
Or so I hope!
..Scott
--
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing
* Re: (a+b)+c should be replaced by a+(b+c)
From: Gabriel Paubert @ 2004-03-26 2:51 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Scott Robert Ladd, gcc mailing list
On Thu, Mar 25, 2004 at 01:19:27PM +0100, Jakub Jelinek wrote:
> On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:
> > Joost VandeVondele wrote:
> > >BTW, timing of the code below on IBM SP4 with xlf90, would be useful to
> > >see how gfortran performs.
> >
> > Being in a benchmarking mood, I took your code and compiled it on a
> > 2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
> > a very good light:
> >
> > - - - - - - - - - - - - - - - - - - - - -
> >
> > Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
> > IPO: using IR for /tmp/ifortyRX1Wg.o
> > IPO: performing single-file optimizations
> > matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
> > matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
> > matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
> > Tycho:$ ./matmuli
> > 5.90410300000000 10.2399999999998
> > Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
>
> You forgot -mfpmath=sse. That is only the default for -m64.
Isn't it rather -mfpmath=sse2, since he is using doubles?
IIRC, -mfpmath=sse will only use SSE instructions for floats, not
for doubles.
Gabriel
* Re: (a+b)+c should be replaced by a+(b+c)
From: Jakub Jelinek @ 2004-03-26 3:17 UTC (permalink / raw)
To: Gabriel Paubert; +Cc: Scott Robert Ladd, gcc mailing list
On Thu, Mar 25, 2004 at 11:51:20PM +0100, Gabriel Paubert wrote:
> > You forgot -mfpmath=sse. That is only the default for -m64.
>
> Isn't it rather -mfpmath=sse2, since he is using doubles?
No, -mfpmath= only takes "i387", "sse", "sse,i387" and "i387,sse"
options. Whether doubles are done using SSE* or i387 insns
depends on -msse2.
Jakub