From: "dominiq at lps dot ens.fr"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libfortran/51119] MATMUL slow for large matrices
Date: Sat, 31 Oct 2015 14:15:00 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #12 from Dominique d'Humieres ---

Some new numbers for a four-core Core i7 (2.8 GHz, turbo boost 3.8 GHz, 1.6 GHz
DDR3) on x86_64-apple-darwin14.5, for the following test program t2:

      implicit none
      REAL time_begin, time_end
      integer, parameter :: n = 2000
      integer(8) :: ts, te, rate8, cmax8
      real(8) :: elapsed
      REAL(8) :: a(n,n), b(n,n), c(n,n)
      integer, parameter :: m = 100
      integer :: i
      call RANDOM_NUMBER(a)
      call RANDOM_NUMBER(b)
      ! time m products computed with the intrinsic MATMUL
      call cpu_time(time_begin)
      call SYSTEM_CLOCK (ts, rate8, cmax8)
      do i = 1,m
        a(1,1) = a(1,1) + 0.1
        c = MATMUL(a,b)
      enddo
      call SYSTEM_CLOCK (te, rate8, cmax8)
      call cpu_time(time_end)
      elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
      PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
               2*m*real(n, kind=8)**3/(10**9*elapsed)
      ! time m products computed with DGEMM from the linked BLAS
      call cpu_time(time_begin)
      call SYSTEM_CLOCK (ts, rate8, cmax8)
      do i = 1,m
        a(1,1) = a(1,1) + 0.1
        call dgemm('n','n', n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
      enddo
      call SYSTEM_CLOCK (te, rate8, cmax8)
      call cpu_time(time_end)
      elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
      PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
               2*m*real(n, kind=8)**3/(10**9*elapsed)
      end program

borrowed from
http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -fno-frontend-optimize
[Book15] f90/bug% time a.out
 Time, MATMUL:    374.027161      374.02889900000002       4.2777443247774283
 Time, MATMUL:    172.823853      23.073034000000000       69.345019818373260
546.427u 0.542s 6:37.24 137.6%  0+0k 1+0io 41pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    391.495880      391.49403500000000       4.0869077353886123
 Time, MATMUL:    169.313202      22.781099000000001       70.233661685944114
560.384u 0.544s 6:54.39 135.3%  0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -march=native
[Book15] f90/bug% time a.out
 Time, MATMUL:    367.570374      367.56880500000000       4.3529265221514102
 Time, MATMUL:    170.150818      22.837544000000001       70.060073009602078
537.306u 0.534s 6:30.53 137.7%  0+0k 0+0io 0pf+0w

where the last column is the speed in GFlops.
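For reference, the GFlops figure follows directly from the operation count used
in the program: one n x n matrix product costs 2*n**3 flops, so the m products
cost 2*m*n**3 = 2*100*2000**3 = 1.6e12 flops. Dividing by the elapsed time of
the first run gives 1.6e12 / (1e9 * 374.03 s) ~ 4.28 GFlops, which is exactly
the value printed in the last column.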
These numbers show that the library MATMUL is slightly faster than the inlined
version unless -march=native is used (AVX should be roughly twice as fast unless
limited by memory bandwidth).

[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    159.000992      21.450851000000000       74.589115368896088
 Time, MATMUL:    172.616943      23.029487000000000       69.476145951492541
331.281u 0.453s 0:44.60 743.7%  0+0k 0+0io 3pf+0w
... repeated several times in order to warm up the CPU ...
[Book15] f90/bug% time a.out
 Time, MATMUL:    179.624268      23.935708999999999       66.845732457726655
 Time, MATMUL:    178.685364      23.898668000000001       66.949337929628541
357.978u 0.447s 0:47.95 747.4%  0+0k 0+0io 0pf+0w

Thus the BLAS provided by darwin reaches ~67 GFlops out of the ~90 GFlops peak
(AVX on 4 cores), while the inlined MATMUL reaches ~4 GFlops out of a ~15 GFlops
peak (no AVX, one core with turbo boost), with little gain when AVX is used
(~30 GFlops peak).

I suppose most modern OSes provide such an optimized BLAS and, if not, one can
install a library such as ATLAS. So I wonder whether it would not be more
effective to allow configuring with something such as --with-blas="magic
incantation" and to use -fexternal-blas as the default, rather than reinventing
the wheel.

More than three years ago Janne Blomqvist wrote in comment 7:

> IIRC I reached about 30-40 % of peak flops which was a bit disappointing.

Would it be possible to have the patch to play with?
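For what it is worth, here is a minimal sketch of what the -fexternal-blas route
looks like from the user side. It assumes an optimized BLAS is available at link
time (Accelerate on darwin, or e.g. ATLAS/OpenBLAS elsewhere) and that the matrix
size is above the -fblas-matmul-limit threshold, so that gfortran turns the
MATMUL into a DGEMM call resolved by the linked library; file and program names
are only illustrative.

      ! matmul_blas_sketch.f90 -- illustrative only
      ! compile e.g. with
      !   gfortran -Ofast -fexternal-blas matmul_blas_sketch.f90 -framework Accelerate
      ! on darwin, or link -lopenblas / -lblas on systems providing those libraries
      program matmul_blas_sketch
      implicit none
      integer, parameter :: n = 2000
      real(8) :: a(n,n), b(n,n), c(n,n)
      call random_number(a)
      call random_number(b)
      ! with -fexternal-blas (and n above the -fblas-matmul-limit threshold)
      ! this MATMUL is compiled into a call to DGEMM from whatever BLAS is linked
      c = matmul(a, b)
      print *, c(1,1)
      end program matmul_blas_sketch

With a --with-blas configure option as suggested above, this would simply be the
default behaviour instead of requiring the user to pass the flags by hand.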