public inbox for gcc@gcc.gnu.org
* (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele @ 2004-03-25 12:03 UTC (permalink / raw)
  To: gcc

I think there is an obvious need for the optimization
(a+b)+c -> a+(b+c) in, e.g., many scientific codes.

Consider matrix multiplication:
do k=1,N
 do j=1,N
  do i=1,N
   c(i,j)=c(i,j)+a(i,k)*b(k,j)
  enddo
 enddo
enddo

Good compilers (e.g. xlf90) will, at -O4, do higher-order transforms of
the loop to introduce blocking, independent FMAs, etc., which makes this
little piece of code about 100 times faster at -O4 than at -O2 (what
about LNO/SSA?). This can only be done if you allow (a+b)+c -> a+(b+c).
It is basically what any optimized BLAS routine will do. Matrix multiply
is a trivial example; if you want BLAS performance, call BLAS. But there
are many other kernels like this in, e.g., scientific code that are not
BLAS. You can't expect a scientist to hand-unroll and block every kernel
to the appropriate depth for every machine. There needs to be a compiler
option to do this, and it can only be done if you allow (a+b)+c -> a+(b+c).
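
To make this concrete, what the compiler effectively has to generate for
the inner reduction is something like the hand-written sketch below (four
independent accumulators; for simplicity assume N is a multiple of 4). It
computes the same dot product only if floating-point addition is treated
as associative, i.e. only if (a+b)+c -> a+(b+c) is allowed:

s1=0.0d0; s2=0.0d0; s3=0.0d0; s4=0.0d0
do k=1,N,4
 s1 = s1 + a(i,k  )*b(k  ,j)  ! four independent summation chains keep the
 s2 = s2 + a(i,k+1)*b(k+1,j)  ! FPU/FMA pipelines busy instead of
 s3 = s3 + a(i,k+2)*b(k+2,j)  ! serializing on one running sum
 s4 = s4 + a(i,k+3)*b(k+3,j)
enddo
c(i,j) = c(i,j) + ((s1+s2)+(s3+s4))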

Joost


* Re: (a+b)+c should be replaced by a+(b+c)
From: Robert Dewar @ 2004-03-25 14:45 UTC (permalink / raw)
  To: Joost VandeVondele; +Cc: gcc

Joost VandeVondele wrote:

> Good compilers (e.g. xlf90) will, at -O4, do higher-order transforms of
> the loop to introduce blocking, independent FMAs, etc., which makes this
> little piece of code about 100 times faster at -O4 than at -O2 (what
> about LNO/SSA?). This can only be done if you allow (a+b)+c -> a+(b+c).
> It is basically what any optimized BLAS routine will do. Matrix multiply
> is a trivial example; if you want BLAS performance, call BLAS. But there
> are many other kernels like this in, e.g., scientific code that are not
> BLAS. You can't expect a scientist to hand-unroll and block every kernel
> to the appropriate depth for every machine. There needs to be a compiler
> option to do this, and it can only be done if you allow (a+b)+c -> a+(b+c).

Can you really deduce this freedom from later versions of the Fortran
standard?


* Re: (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele @ 2004-03-25 15:07 UTC (permalink / raw)
  To: Robert Dewar; +Cc: gcc

On Thu, 25 Mar 2004, Robert Dewar wrote:

> Joost VandeVondele wrote:
>
> > Good compilers (e.g. xlf90) will, at -O4, do higher-order transforms of
> > the loop to introduce blocking, independent FMAs, etc., which makes this
> > little piece of code about 100 times faster at -O4 than at -O2 (what
..
>
> Can you really deduce this freedom from later versions of the Fortran
> standard?
>
No, I'm only happy there are compilers that make my code 100 times faster
without doing a lot of work myself, keeping my code easy to maintain and
read.

Another example that comes to mind and that relies on this kind of
optimization is OMP/MPI code. There is a large class of problems for
which this optimization is exactly what is needed.
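
For instance, any parallel reduction reorders the additions: in an OpenMP
loop like the illustrative sketch below, each thread builds its own
partial sum and the partials are combined in whatever order the threads
finish, which only gives "the same" answer if reassociation is accepted:

s = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:s)
do i=1,N
  s = s + x(i)   ! per-thread partial sums, combined in unspecified order
enddo
!$OMP END PARALLEL DO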

BTW, here are timings of the code below on an IBM SP4 with xlf90; it
would be useful to see how gfortran compares.

-O2: 116.76s
-O4: 2.4s
-O5: 1.6s

Joost

INTEGER, PARAMETER :: N=1024
REAL*8 :: A(N,N), B(N,N), C(N,N)
REAL*8 :: t1,t2
A=0.1D0
B=0.1D0
C=0.0D0
CALL cpu_time(t1)
CALL mult(A,B,C,N)
CALL cpu_time(t2)
write(6,*) t2-t1,C(1,1)   ! CPU seconds and one element as a check value
END

SUBROUTINE mult(A,B,C,N)
INTEGER :: I,J,K,N
REAL*8 :: A(N,N), B(N,N), C(N,N)
! naive triple loop; the K loop is a sequential reduction into C(I,J)
DO J=1,N
  DO I=1,N
    DO K=1,N
      C(I,J)=C(I,J)+A(I,K)*B(K,J)
    ENDDO
  ENDDO
ENDDO
END


* Re: (a+b)+c should be replaced by a+(b+c)
From: Robert Dewar @ 2004-03-25 15:18 UTC (permalink / raw)
  To: Joost VandeVondele; +Cc: gcc

Joost VandeVondele wrote:

> No, I'm only happy there are compilers that make my code 100 times faster
> without doing a lot of work myself, keeping my code easy to maintain and
> read.

Well, it is fine to have this kind of transformation available as an
option, though in general it is better to rely on BLAS written by
competent numerical programmers than on transformations of unknown
impact.
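
(For the code in question that means a single call along the lines of

  CALL DGEMM('N','N',N,N,N,1.0D0,A,N,B,N,1.0D0,C,N)

-- sketched here assuming the standard Fortran BLAS interface -- where the
blocking and the reassociation have already been done, by hand, by someone
who knows the target machine.)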

> Another example that comes to mind and that relies on this kind of
> optimization is OMP/MPI code. There is a large class of problems for
> which this optimization is exactly what is needed.

Please do not call this an optimization; call it a transformation.



* Re: (a+b)+c should be replaced by a+(b+c)
From: Joost VandeVondele @ 2004-03-25 15:32 UTC (permalink / raw)
  To: Robert Dewar; +Cc: gcc

>
> > No, I'm only happy there are compilers that make my code 100 times faster
> > without doing a lot of work myself, keeping my code easy to maintain and
> > read.
>
> Well, it is fine to have this kind of transformation available as an
> option, though in general it is better to rely on BLAS written by
> competent numerical programmers than on transformations of unknown
> impact.
>
Obviously, this was an example (and I referred to calling BLAS
explicitly) to suggest that there exists a wide range of computational
kernels that benefit from (a+b)+c -> a+(b+c) being performed by the
compiler. (I realize that there are much better examples of this, but
anyway.)

I like the name "transformation" as much as I dislike "unsafe-math".
FYI, the following warning comes from IBM's compiler (at -O3, when
optimizing -- oops, transforming -- expressions in a way that might lead
to non-bitwise-identical results), and I think it is not badly worded:

"mytest.f90", 1500-036 (I) The NOSTRICT option (default at OPT(3)) has the
potential to alter the semantics of a program.  Please refer to
documentation on the STRICT/NOSTRICT option for more information.

Joost


* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd @ 2004-03-25 15:59 UTC (permalink / raw)
  To: gcc mailing list

Joost VandeVondele wrote:
> BTW, here are timings of the code below on an IBM SP4 with xlf90; it
> would be useful to see how gfortran compares.

Being in a benchmarking mood, I took your code and compiled it on a
2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
a very good light:

- - - - - - - - - - - - - - - - - - - - -

Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
IPO: using IR for /tmp/ifortyRX1Wg.o
IPO: performing single-file optimizations
matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
Tycho:$ ./matmuli
    5.90410300000000        10.2399999999998
Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
Tycho$ ./matmulg
    71.4641360000000         10.2400000000000

Tycho$ icc -V
Intel(R) C++ Compiler for 32-bit applications, Version 8.0   Build
20031211Z Package ID: l_cc_p_8.0.055_pe057
Copyright (C) 1985-2003 Intel Corporation.  All rights reserved.

Tycho$ gfortran -v
Reading specs from
/opt/gcc-tree-ssa/lib/gcc/i686-pc-linux-gnu/3.5-tree-ssa/specs
Configured with: ../gcc/configure --prefix=/opt/gcc-tree-ssa
--disable-checking --enable-shared --enable-threads=posix
--enable-__cxa_atexit --enable-languages=c,c++,f95
Thread model: posix
gcc version 3.5-tree-ssa 20040316 (merged 20040307)

- - - - - - - - - - - - - - - - - - - - -

The generated assembler from GCC looks like:

.globl mult_
	.type	mult_, @function
mult_:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$36, %esp
	movl	20(%ebp), %eax
	movl	16(%ebp), %esi
	movl	8(%ebp), %edi
	movl	12(%ebp), %ecx
	movl	(%eax), %ebx
	movl	%ebx, -16(%ebp)
	xorl	$-1, %ebx
	sall	$3, %ebx
	movl	-16(%ebp), %eax
	movl	%ebx, -28(%ebp)
	addl	%ebx, %esi
	movl	-28(%ebp), %edx
	addl	%ebx, %edi
	addl	%ecx, %edx
	movl	%esi, -20(%ebp)
	movl	%edi, -24(%ebp)
	movl	%edx, -28(%ebp)
	testl	%eax, %eax
	jle	.L1
	movl	-16(%ebp), %edx
	movl	%edx, %ebx
	movl	%edx, -36(%ebp)
	movl	%edx, -44(%ebp)
	movl	%edx, %esi
	sall	$3, %ebx
	movl	%edx, %edi
.L4:
	movl	-28(%ebp), %eax
	movl	$1, -32(%ebp)
	movl	-20(%ebp), %edx
	leal	(%eax,%edi,8), %ecx
	movl	%ecx, -40(%ebp)
	movl	%esi, %ecx
	.p2align 4,,7
.L5:
	movl	-32(%ebp), %edi
	movl	-44(%ebp), %eax
	addl	%edi, %eax
	movl	-24(%ebp), %edi
	movl	%eax, -48(%ebp)
	fldl	(%edx,%eax,8)
	movl	-32(%ebp), %eax
	movl	-40(%ebp), %edx
	addl	%ecx, %eax
	addl	$8, %edx
	leal	(%edi,%eax,8), %eax
	.p2align 4,,7
.L6:
	fldl	(%edx)
	fmull	(%eax)
	decl	%ecx
	addl	%ebx, %eax
	addl	$8, %edx
	testl	%ecx, %ecx
	faddp	%st, %st(1)
	jg	.L6
.L7:
	movl	-32(%ebp), %ecx
	movl	-48(%ebp), %eax
	movl	-20(%ebp), %edx
	incl	%ecx
	decl	%esi
	movl	%ecx, -32(%ebp)
	fstpl	(%edx,%eax,8)
	testl	%esi, %esi
	jle	.L18
	movl	-16(%ebp), %ecx
	jmp	.L5
.L2:
.L1:
	addl	$36, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
.L8:
.L18:
	movl	-36(%ebp), %edx
	movl	-44(%ebp), %ecx
	decl	%edx
	movl	-16(%ebp), %edi
	movl	%edx, -36(%ebp)
	addl	%edi, %ecx
	movl	-36(%ebp), %esi
	movl	%ecx, -44(%ebp)
	testl	%esi, %esi
	jle	.L1
	movl	%edi, %esi
	movl	-44(%ebp), %edi
	jmp	.L4
	.size	mult_, .-mult_
	.local	c.2
	.comm	c.2,8388608,32
	.local	a.0
	.comm	a.0,8388608,32
	.local	b.1
	.comm	b.1,8388608,32
	.section	.rodata.str1.1,"aMS",@progbits,1

- - - - - - - - - - - - - - - - - - - - -

The generated assembler for Intel Fortran:

	.globl mult_
mult_:
# parameter 1: 28 + %esp
# parameter 2: 32 + %esp
# parameter 3: 36 + %esp
# parameter 4: 40 + %esp
..B2.1:                         # Preds ..B2.0
         pushl     %edi                                          #15.17
         pushl     %esi                                          #15.17
         pushl     %ebp                                          #15.17
         pushl     %ebx                                          #15.17
         subl      $8, %esp                                      #15.17
         movl      40(%esp), %eax                                #1.0
         movl      (%eax), %ebp                                  #15.17
         movl      $1, %ebx                                      #18.6
         testl     %ebp, %ebp                                    #18.6
         jle       ..B2.9        # Prob 1%                       #18.6
                                 # LOE ebx ebp
..B2.2:                         # Preds ..B2.1
         movl      28(%esp), %esi                                #
         movl      32(%esp), %edx                                #
         movl      36(%esp), %edi                                #
         lea       (%ebp,%ebp), %eax                             #
         addl      %eax, %eax                                    #
         addl      %eax, %eax                                    #
         subl      %eax, %esi                                    #
         movl      %esi, (%esp)                                  #
         subl      %eax, %edx                                    #
         subl      %eax, %edi                                    #
         movl      %ebx, %ecx                                    #
         imull     %eax, %ecx                                    #
         addl      %edx, %ecx                                    #
         movl      %ebx, %edx                                    #
         imull     %eax, %edx                                    #
         addl      %edi, %edx                                    #
                                 # LOE eax edx ecx ebx ebp
..B2.3:                         # Preds ..B2.7 ..B2.2
         movl      (%esp), %esi                                  #19.6
         movl      %ebx, 4(%esp)                                 #19.6
         movl      $1, %edi                                      #19.6
         lea       (%eax,%esi), %esi                             #19.6
                                 # LOE eax edx ecx ebp esi edi
..B2.4:                         # Preds ..B2.6 ..B2.3
         movsd     -8(%ecx,%edi,8), %xmm0                        #21.29
         movl      $1, %ebx                                      #20.6
         .align    4,0x90
                                 # LOE eax edx ecx ebx ebp esi edi xmm0
..B2.5:                         # Preds ..B2.5 ..B2.4
         movsd     -8(%esi,%ebx,8), %xmm1                        #21.22
         mulsd     %xmm0, %xmm1                                  #21.28
         addsd     -8(%edx,%ebx,8), %xmm1                        #21.21
         movsd     %xmm1, -8(%edx,%ebx,8)                        #21.8
         addl      $1, %ebx                                      #20.6
         cmpl      %ebp, %ebx                                    #20.6
         jle       ..B2.5        # Prob 99%                      #20.6
                                 # LOE eax edx ecx ebx ebp esi edi xmm0
..B2.6:                         # Preds ..B2.5
         addl      %eax, %esi                                    #19.6
         addl      $1, %edi                                      #19.6
         cmpl      %ebp, %edi                                    #19.6
         jle       ..B2.4        # Prob 99%                      #19.6
                                 # LOE eax edx ecx ebp esi edi
..B2.7:                         # Preds ..B2.6
         movl      4(%esp), %ebx                                 #
         addl      %eax, %ecx                                    #18.6
         addl      %eax, %edx                                    #18.6
         addl      $1, %ebx                                      #18.6
         cmpl      %ebp, %ebx                                    #18.6
         jle       ..B2.3        # Prob 99%                      #18.6
                                 # LOE eax edx ecx ebx ebp
..B2.9:                         # Preds ..B2.7 ..B2.1
         addl      $8, %esp                                      #26.6
         popl      %ebx                                          #26.6
         popl      %ebp                                          #26.6
         popl      %esi                                          #26.6
         popl      %edi                                          #26.6
         ret                                                     #26.6


- - - - - - - - - - - - - - - - - - - - -

I think gfortran gets its tail stomped by Intel's effort in this comparison.

Side note: I assume you are aware that your code is a brute force
technique for matrix multiplies, and that other algorithms are much more
efficient.
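
(For the record, the usual first step beyond the naive triple loop is
cache blocking. A sketch, with a made-up block size NB that would have to
be tuned per machine, looks like the code below; note that the tiling by
itself does not even reorder the additions into C(I,J) -- only the further
unrolling into independent accumulators does.)

SUBROUTINE mult_blocked(A,B,C,N)
INTEGER :: N,I,J,K,II,JJ,KK
INTEGER, PARAMETER :: NB=64   ! illustrative block size
REAL*8 :: A(N,N), B(N,N), C(N,N)
DO JJ=1,N,NB
 DO KK=1,N,NB
  DO II=1,N,NB
   DO J=JJ,MIN(JJ+NB-1,N)
    DO K=KK,MIN(KK+NB-1,N)
     DO I=II,MIN(II+NB-1,N)
      C(I,J)=C(I,J)+A(I,K)*B(K,J)
     ENDDO
    ENDDO
   ENDDO
  ENDDO
 ENDDO
ENDDO
END SUBROUTINE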

If anyone is interested, I can perform the same experiment with the
Intel and GNU C compilers.

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing



* Re: (a+b)+c should be replaced by a+(b+c)
From: Jakub Jelinek @ 2004-03-25 16:18 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: gcc mailing list

On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:
> Joost VandeVondele wrote:
> >BTW, here are timings of the code below on an IBM SP4 with xlf90; it
> >would be useful to see how gfortran compares.
> 
> Being in a benchmarking mood, I took your code and compiled it on a
> 2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
> a very good light:
> 
> - - - - - - - - - - - - - - - - - - - - -
> 
> Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
> IPO: using IR for /tmp/ifortyRX1Wg.o
> IPO: performing single-file optimizations
> matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
> matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
> matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
> Tycho:$ ./matmuli
>    5.90410300000000        10.2399999999998
> Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for

You forgot -mfpmath=sse.  That is only the default for -m64.

	Jakub


* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd @ 2004-03-25 16:38 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc mailing list

Jakub Jelinek wrote:
> On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:
> 
>>Joost VandeVondele wrote:
>>
>>>BTW, here are timings of the code below on an IBM SP4 with xlf90; it
>>>would be useful to see how gfortran compares.
>>
>>Being in a benchmarking mood, I took your code and compiled it on a
>>2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
>>a very good light:
>>
>>- - - - - - - - - - - - - - - - - - - - -
>>
>>Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
>>IPO: using IR for /tmp/ifortyRX1Wg.o
>>IPO: performing single-file optimizations
>>matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
>>matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
>>matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
>>Tycho:$ ./matmuli
>>   5.90410300000000        10.2399999999998
>>Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
> 
> 
> You forgot -mfpmath=sse.  That is only the default for -m64.
> 
> 	Jakub
> 

Good point; I've been doing Opteron work for a week, and was getting 
used to not explicitly declaring certain flags.

Also, a minimized browser was playing a &%$!! Flash animation in the 
background, so I'll run numbers on a clean machine without the overhead.

And the compiler says:

  - - - - - - - - - - - - - - - -
Tycho$ gfortran -o matmulg -O3 -march=pentium4 -ffast-math matmul.for
Tycho$ ./matmulg
     64.9091330000000         10.2400000000000

Tycho$ gfortran -o matmulg -O3 -march=pentium4 -ffast-math -mfpmath=sse 
matmul.for
Tycho$ ./matmulg
     64.6051790000000         10.2399999999998

Tycho$ gfortran -o matmulg -O3 -march=pentium4 -mfpmath=sse matmul.for
Tycho$ ./matmulg
     64.7361590000000         10.2399999999998

Tycho$ gfortran -o matmulg -O3 -march=pentium4 matmul.for
Tycho$ ./matmulg
     64.7751530000000         10.2400000000000
Tycho$

  - - - - - - - - - - - - - - - -

[dry_sarcasm]
Well, we can see that -ffast-math *really* helps in this situation, huh?
[/dry_sarcasm]


Nor did -mfpmath=sse show much value for this test. In my experience,
-mfpmath=sse often fails to produce faster code (with gfortran or gcc).

What about Intel Fortran with their -mp1 and -mp options?

  - - - - - - - - - - - - - - - -

Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
Tycho$ ./matmuli
    4.85226200000000        10.2399999999998

Tycho$ ifort -O3 -tpp7 -xN -ipo -mp1 -o matmuli matmul.for
Tycho:~/projects/spikes$ ./matmuli
    4.90425400000000        10.2399999999998

Tycho$ ifort -O3 -tpp7 -xN -ipo -mp -o matmuli matmul.for
Tycho$ ./matmuli
    66.0699560000000        10.2399999999998

  - - - - - - - - - - - - - - - -

Forcing Intel to stick with the "rules" does slow its performance. 
Certainly some food for thought...


-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing


* Re: (a+b)+c should be replaced by a+(b+c)
From: Laurent GUERBY @ 2004-03-25 19:47 UTC (permalink / raw)
  To: Scott Robert Ladd; +Cc: Jakub Jelinek, gcc mailing list

On Thu, 2004-03-25 at 16:06, Scott Robert Ladd wrote:
> [dry_sarcasm]
> Well, we can see that -ffast-math *really* helps in this situation, huh?
> [/dry_sarcasm]

Could you try -funroll-all-loops and maybe a prefetch flag? On short
numerical loops this usually makes a lot of difference.

Laurent


* Re: (a+b)+c should be replaced by a+(b+c)
From: Scott Robert Ladd @ 2004-03-25 20:16 UTC (permalink / raw)
  To: Laurent GUERBY; +Cc: Jakub Jelinek, gcc mailing list

Laurent GUERBY wrote:
> Could you try -funroll-all-loops and maybe a prefetch flag? On short
>  numerical loops this usually makes a lot of difference.

gfortran -o matmulg -O3 -ffast-math -march=pentium4 \
     -fprefetch-loop-arrays -funroll-all-loops -mfpmath=sse matmul.for

..does not improve matters. Perhaps this is due to the nascent nature of 
gfortran?

As it is, I'm going to stop doing on-the-spot benchmarks for this set of 
topics, having demonstrated my assertions about performance and 
accuracy. At this point, we're using anecdotal evidence to pick compiler 
options; I prefer a more scientific approach.

By Monday, my new Acovea runs (on both P4 and Opteron) will have had a 
chance to evolve optimal compiler option sets, showing us where the 
strengths and weaknesses lie.

Or so I hope!

..Scott

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing


* Re: (a+b)+c should be replaced by a+(b+c)
From: Gabriel Paubert @ 2004-03-26  2:51 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Scott Robert Ladd, gcc mailing list

On Thu, Mar 25, 2004 at 01:19:27PM +0100, Jakub Jelinek wrote:
> On Thu, Mar 25, 2004 at 09:21:48AM -0500, Scott Robert Ladd wrote:
> > Joost VandeVondele wrote:
> > >BTW, here are timings of the code below on an IBM SP4 with xlf90; it
> > >would be useful to see how gfortran compares.
> > 
> > Being in a benchmarking mood, I took your code and compiled it on a
> > 2.8GHz Pentium 4 (Northwood core). The results did not show gfortran in
> > a very good light:
> > 
> > - - - - - - - - - - - - - - - - - - - - -
> > 
> > Tycho$ ifort -O3 -tpp7 -xN -ipo -o matmuli matmul.for
> > IPO: using IR for /tmp/ifortyRX1Wg.o
> > IPO: performing single-file optimizations
> > matmul.for(6) : (col. 6) remark: LOOP WAS VECTORIZED.
> > matmul.for(7) : (col. 6) remark: LOOP WAS VECTORIZED.
> > matmul.for(8) : (col. 6) remark: LOOP WAS VECTORIZED.
> > Tycho:$ ./matmuli
> >    5.90410300000000        10.2399999999998
> > Tycho$ gfortran -o matmulg -O3 -ffast-math -march=pentium4 matmul.for
> 
> You forgot -mfpmath=sse.  That is only the default for -m64.

Isn't it rather -mfpmath=sse2, since he is using doubles?

IIRC, -mfpmath=sse will only use sse instructions for floats, not
for doubles.

	Gabriel


* Re: (a+b)+c should be replaced by a+(b+c)
From: Jakub Jelinek @ 2004-03-26  3:17 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Scott Robert Ladd, gcc mailing list

On Thu, Mar 25, 2004 at 11:51:20PM +0100, Gabriel Paubert wrote:
> > You forgot -mfpmath=sse.  That is only the default for -m64.
> 
> Isn't it rather -mfpmath=sse2, since he is using doubles?

No, -mfpmath= only takes "i387", "sse", "sse,i387" and "i387,sse"
options.  Whether doubles are done using SSE* or i387 insns
depends on -msse2.
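
That is, something along the lines of this (illustrative) command line

  gfortran -O3 -march=pentium4 -msse2 -mfpmath=sse -ffast-math matmul.for

does the doubles in SSE2; -march=pentium4 already implies -msse2, the
explicit flag is only there to make the point.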

	Jakub


Thread overview: 12 messages
2004-03-25 12:03 (a+b)+c should be replaced by a+(b+c) Joost VandeVondele
2004-03-25 14:45 ` Robert Dewar
2004-03-25 15:07   ` Joost VandeVondele
2004-03-25 15:18     ` Robert Dewar
2004-03-25 15:32       ` Joost VandeVondele
2004-03-25 15:59     ` Scott Robert Ladd
2004-03-25 16:18       ` Jakub Jelinek
2004-03-25 16:38         ` Scott Robert Ladd
2004-03-25 19:47           ` Laurent GUERBY
2004-03-25 20:16             ` Scott Robert Ladd
2004-03-26  2:51         ` Gabriel Paubert
2004-03-26  3:17           ` Jakub Jelinek
