From: "dominiq at lps dot ens.fr"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libfortran/51119] MATMUL slow for large matrices
Date: Sat, 31 Oct 2015 14:15:00 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #12 from Dominique d'Humieres ---

Some new numbers for a four-core Core i7 (2.8 GHz, turbo boost 3.8 GHz, 1.6 GHz
DDR3) on x86_64-apple-darwin14.5, for the following test program t2:

      implicit none
      REAL time_begin, time_end
      integer, parameter :: n = 2000
      integer(8) :: ts, te, rate8, cmax8
      real(8) :: elapsed
      REAL(8) :: a(n,n), b(n,n), c(n,n)
      integer, parameter :: m = 100
      integer :: i
      call RANDOM_NUMBER(a)
      call RANDOM_NUMBER(b)
      ! time m products computed with the intrinsic MATMUL
      call cpu_time(time_begin)
      call SYSTEM_CLOCK (ts, rate8, cmax8)
      do i = 1,m
        a(1,1) = a(1,1) + 0.1
        c = MATMUL(a,b)
      enddo
      call SYSTEM_CLOCK (te, rate8, cmax8)
      call cpu_time(time_end)
      elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
      PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
               2*m*real(n, kind=8)**3/(10**9*elapsed)
      ! time m products computed with DGEMM from the linked BLAS
      call cpu_time(time_begin)
      call SYSTEM_CLOCK (ts, rate8, cmax8)
      do i = 1,m
        a(1,1) = a(1,1) + 0.1
        call dgemm('n','n', n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
      enddo
      call SYSTEM_CLOCK (te, rate8, cmax8)
      call cpu_time(time_end)
      elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
      PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
               2*m*real(n, kind=8)**3/(10**9*elapsed)
      end program

borrowed from
http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -fno-frontend-optimize
[Book15] f90/bug% time a.out
 Time, MATMUL:    374.027161      374.02889900000002       4.2777443247774283
 Time, MATMUL:    172.823853      23.073034000000000       69.345019818373260
546.427u 0.542s 6:37.24 137.6%  0+0k 1+0io 41pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    391.495880      391.49403500000000       4.0869077353886123
 Time, MATMUL:    169.313202      22.781099000000001       70.233661685944114
560.384u 0.544s 6:54.39 135.3%  0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -march=native
[Book15] f90/bug% time a.out
 Time, MATMUL:    367.570374      367.56880500000000       4.3529265221514102
 Time, MATMUL:    170.150818      22.837544000000001       70.060073009602078
537.306u 0.534s 6:30.53 137.7%  0+0k 0+0io 0pf+0w

where the last column is the speed in GFlops.
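For reference, the GFlops figure follows directly from the operation count used
in the program: one n x n matrix product costs 2*n**3 flops, so the m products
cost 2*m*n**3 = 2*100*2000**3 = 1.6e12 flops. Dividing by the elapsed time of
the first run gives 1.6e12 / (1e9 * 374.03 s) ~ 4.28 GFlops, which is exactly
the value printed in the last column.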
These numbers show that the library MATMUL is slightly faster than the inlined
version unless -march=native is used (AVX should be roughly twice as fast unless
limited by memory bandwidth).

[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    159.000992      21.450851000000000       74.589115368896088
 Time, MATMUL:    172.616943      23.029487000000000       69.476145951492541
331.281u 0.453s 0:44.60 743.7%  0+0k 0+0io 3pf+0w
... repeated several times in order to warm up the CPU ...
[Book15] f90/bug% time a.out
 Time, MATMUL:    179.624268      23.935708999999999       66.845732457726655
 Time, MATMUL:    178.685364      23.898668000000001       66.949337929628541
357.978u 0.447s 0:47.95 747.4%  0+0k 0+0io 0pf+0w

Thus the BLAS provided by darwin reaches ~67 GFlops out of the ~90 GFlops peak
(AVX on 4 cores), while the inlined MATMUL reaches ~4 GFlops out of a ~15 GFlops
peak (no AVX, one core with turbo boost), with little gain when AVX is used
(~30 GFlops peak).

I suppose most modern OSes provide such an optimized BLAS and, if not, one can
install a library such as ATLAS. So I wonder whether it would not be more
effective to allow configuring with something such as --with-blas="magic
incantation" and to use -fexternal-blas as the default, rather than reinventing
the wheel.

More than three years ago Janne Blomqvist wrote in comment 7:

> IIRC I reached about 30-40 % of peak flops which was a bit disappointing.

Would it be possible to have the patch to play with?
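For what it is worth, here is a minimal sketch of what the -fexternal-blas route
looks like from the user side. It assumes an optimized BLAS is available at link
time (Accelerate on darwin, or e.g. ATLAS/OpenBLAS elsewhere) and that the matrix
size is above the -fblas-matmul-limit threshold, so that gfortran turns the
MATMUL into a DGEMM call resolved by the linked library; file and program names
are only illustrative.

      ! matmul_blas_sketch.f90 -- illustrative only
      ! compile e.g. with
      !   gfortran -Ofast -fexternal-blas matmul_blas_sketch.f90 -framework Accelerate
      ! on darwin, or link -lopenblas / -lblas on systems providing those libraries
      program matmul_blas_sketch
      implicit none
      integer, parameter :: n = 2000
      real(8) :: a(n,n), b(n,n), c(n,n)
      call random_number(a)
      call random_number(b)
      ! with -fexternal-blas (and n above the -fblas-matmul-limit threshold)
      ! this MATMUL is compiled into a call to DGEMM from whatever BLAS is linked
      c = matmul(a, b)
      print *, c(1,1)
      end program matmul_blas_sketch

With a --with-blas configure option as suggested above, this would simply be the
default behaviour instead of requiring the user to pass the flags by hand.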