public inbox for gcc-bugs@sourceware.org
* [Bug libfortran/51119] New: MATMUL slow for large matrices
@ 2011-11-14  8:16 jb at gcc dot gnu.org
  2011-11-14  8:17 ` [Bug libfortran/51119] " jb at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2011-11-14  8:16 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

             Bug #: 51119
           Summary: MATMUL slow for large matrices
    Classification: Unclassified
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: libfortran
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: jb@gcc.gnu.org


Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices with
n > 256 is around a factor of 8 slower. 

While I don't think it's worth spending time on target-specific parameters
and/or an asm-coded inner kernel, as high-performance BLAS implementations do, I
suspect that a little effort towards cache blocking could improve things.
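
For reference, the kind of cache blocking meant here can be sketched in C
(libgfortran's matmul kernels are written in C). This is a hypothetical
illustration, not the library's actual code; the function names, the row-major
layout, and the tile size BS are all assumptions:

```c
#include <stddef.h>

/* Placeholder tile size; a real implementation would pick it to fit
   the L1/L2 cache. */
#define BS 64

/* Naive triple loop: c = a * b, n x n, row-major for brevity. */
static void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
}

/* Blocked (tiled) version: iterate over BS x BS tiles so each tile of
   a, b and c stays in cache while it is reused. */
static void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n*n; i++)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t k = kk; k < kk + BS && k < n; k++) {
                        double aik = a[i*n + k];
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
}
```

The blocked version does the same arithmetic in a cache-friendlier order; the
win grows with matrix size once the operands stop fitting in cache.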


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
@ 2011-11-14  8:17 ` jb at gcc dot gnu.org
  2011-11-14 13:56 ` burnus at gcc dot gnu.org
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2011-11-14  8:17 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Janne Blomqvist <jb at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2011-11-14
         AssignedTo|unassigned at gcc dot       |jb at gcc dot gnu.org
                   |gnu.org                     |
     Ever Confirmed|0                           |1

--- Comment #1 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-11-14 06:49:11 UTC ---
Assigning to myself.

I have a cunning plan.


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
  2011-11-14  8:17 ` [Bug libfortran/51119] " jb at gcc dot gnu.org
@ 2011-11-14 13:56 ` burnus at gcc dot gnu.org
  2011-11-15 12:35 ` Joost.VandeVondele at mat dot ethz.ch
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-11-14 13:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu.org

--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-11-14 13:08:49 UTC ---
(In reply to comment #0)
> Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices with
> n > 256 is around a factor of 8 slower. 

Side note: You can use -fexternal-blas -fblas-matmul-limit=<...> and link ATLAS
BLAS.

> Assigning to myself.
> I have a cunning plan.

I am looking forward to cunning ideas - at least if they are not too
convoluted, work on all targets and are middle-end friendly.


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
  2011-11-14  8:17 ` [Bug libfortran/51119] " jb at gcc dot gnu.org
  2011-11-14 13:56 ` burnus at gcc dot gnu.org
@ 2011-11-15 12:35 ` Joost.VandeVondele at mat dot ethz.ch
  2011-11-15 12:37 ` Joost.VandeVondele at mat dot ethz.ch
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-15 12:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #3 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-15 12:19:59 UTC ---
(In reply to comment #1)
> I have a cunning plan.

It is doable to come within a factor of 2 of highly efficient implementations
using a cache-oblivious matrix multiply, which is relatively easy to code. I'm
not sure this is worth the effort.
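
A cache-oblivious multiply along these lines can be sketched as follows. This
is a hypothetical C illustration of the general technique, not the libsmm
code; the name rec_mm, the CUTOFF value, and the row-major layout with a
shared leading dimension ld are assumptions. The recursion splits the largest
of the three dimensions until the subproblem fits in cache at every level of
the hierarchy, without any machine-specific tuning parameter:

```c
#include <stddef.h>

enum { CUTOFF = 32 };  /* base-case size, illustrative only */

/* C (m x n) += A (m x k) * B (k x n), row-major, leading dimension ld. */
static void rec_mm(size_t m, size_t n, size_t k,
                   const double *a, const double *b, double *c, size_t ld)
{
    if (m <= CUTOFF && n <= CUTOFF && k <= CUTOFF) {
        /* Small enough: plain loops. */
        for (size_t i = 0; i < m; i++)
            for (size_t p = 0; p < k; p++) {
                double aip = a[i*ld + p];
                for (size_t j = 0; j < n; j++)
                    c[i*ld + j] += aip * b[p*ld + j];
            }
    } else if (m >= n && m >= k) {        /* split the rows of A and C */
        size_t h = m / 2;
        rec_mm(h, n, k, a, b, c, ld);
        rec_mm(m - h, n, k, a + h*ld, b, c + h*ld, ld);
    } else if (n >= k) {                  /* split the columns of B and C */
        size_t h = n / 2;
        rec_mm(m, h, k, a, b, c, ld);
        rec_mm(m, n - h, k, a, b + h, c + h, ld);
    } else {                              /* split the inner dimension */
        size_t h = k / 2;
        rec_mm(m, n, h, a, b, c, ld);
        rec_mm(m, n, k - h, a + h, b + h*ld, c, ld);
    }
}
```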

I believe it would be more important to have actually highly efficient
(inlined) implementations for very small matrices. These would outperform
general libraries by a large factor. For CP2K I have written a specialized
small matrix multiply library generator which generates code that outperforms
e.g. MKL by a large factor for small matrices (<<32x32). The generation time
and library size do not make it a general-purpose tool. It also contains an
implementation of a recursive multiply of some sort (see
http://cvs.berlios.de/cgi-bin/viewvc.cgi/cp2k/cp2k/tools/build_libsmm/).


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2011-11-15 12:35 ` Joost.VandeVondele at mat dot ethz.ch
@ 2011-11-15 12:37 ` Joost.VandeVondele at mat dot ethz.ch
  2011-11-15 16:19 ` jb at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-15 12:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #4 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-15 12:31:10 UTC ---
Created attachment 25826
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25826
comparison in performance for small matrix multiplies (libsmm vs mkl)

Added some data showing the speedup of specialized matrix multiply code (small
matrices, known bounds, in cache) against general DGEMM (MKL).


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2011-11-15 12:37 ` Joost.VandeVondele at mat dot ethz.ch
@ 2011-11-15 16:19 ` jb at gcc dot gnu.org
  2012-06-28 11:58 ` Joost.VandeVondele at mat dot ethz.ch
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2011-11-15 16:19 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #5 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-11-15 15:47:54 UTC ---
(In reply to comment #3)
> I believe it would be more important to have actually highly efficient
> (inlined) implementations for very small matrices.

There's already PR 37131 for that.


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2011-11-15 16:19 ` jb at gcc dot gnu.org
@ 2012-06-28 11:58 ` Joost.VandeVondele at mat dot ethz.ch
  2012-06-28 12:15 ` jb at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2012-06-28 11:58 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #6 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-28 11:58:20 UTC ---
Janne, have you had a chance to look at this? For larger matrices MATMUL is
really slow. Anything that includes even the most basic blocking scheme should
be faster. I think this would be a valuable improvement.


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2012-06-28 11:58 ` Joost.VandeVondele at mat dot ethz.ch
@ 2012-06-28 12:15 ` jb at gcc dot gnu.org
  2012-06-29  7:19 ` Joost.VandeVondele at mat dot ethz.ch
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2012-06-28 12:15 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #7 from Janne Blomqvist <jb at gcc dot gnu.org> 2012-06-28 12:15:05 UTC ---
(In reply to comment #6)
> Janne, have you had a chance to look at this? For larger matrices MATMUL is
> really slow. Anything that includes even the most basic blocking scheme should
> be faster. I think this would be a valuable improvement.

I implemented a block-panel multiplication algorithm similar to GOTO BLAS and
Eigen, but I got side-tracked by other things and never found the time to fix
the corner-case bugs and tune performance. IIRC I reached about 30-40 % of peak
flops which was a bit disappointing.
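
The block-panel scheme used by GotoBLAS and Eigen boils down to packing blocks
of A and panels of B into contiguous scratch buffers before the inner kernel
runs, so the innermost loops stream over unit-stride memory. The sketch below
is a rough C illustration of that idea, not the actual patch; the block sizes
MC/KC/NC are placeholders, and a real implementation would replace the inner
triple loop with a register-blocked, vectorized micro-kernel:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { MC = 64, KC = 64, NC = 64 };  /* placeholder block sizes */

/* c = a * b, n x n, row-major. */
static void matmul_block_panel(size_t n, const double *a, const double *b,
                               double *c)
{
    double *ap = malloc(MC * KC * sizeof *ap);  /* packed block of A */
    double *bp = malloc(KC * NC * sizeof *bp);  /* packed panel of B */
    if (!ap || !bp) { free(ap); free(bp); return; }
    memset(c, 0, n * n * sizeof *c);
    for (size_t jj = 0; jj < n; jj += NC) {
        size_t nc = (n - jj < NC) ? n - jj : NC;
        for (size_t kk = 0; kk < n; kk += KC) {
            size_t kc = (n - kk < KC) ? n - kk : KC;
            /* Pack B(kk:kk+kc, jj:jj+nc) contiguously. */
            for (size_t p = 0; p < kc; p++)
                memcpy(bp + p*nc, b + (kk+p)*n + jj, nc * sizeof *bp);
            for (size_t ii = 0; ii < n; ii += MC) {
                size_t mc = (n - ii < MC) ? n - ii : MC;
                /* Pack A(ii:ii+mc, kk:kk+kc) contiguously. */
                for (size_t i = 0; i < mc; i++)
                    memcpy(ap + i*kc, a + (ii+i)*n + kk, kc * sizeof *ap);
                /* "Micro-kernel": multiply packed block by packed panel. */
                for (size_t i = 0; i < mc; i++)
                    for (size_t p = 0; p < kc; p++) {
                        double aip = ap[i*kc + p];
                        for (size_t j = 0; j < nc; j++)
                            c[(ii+i)*n + jj + j] += aip * bp[p*nc + j];
                    }
            }
        }
    }
    free(ap);
    free(bp);
}
```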


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2012-06-28 12:15 ` jb at gcc dot gnu.org
@ 2012-06-29  7:19 ` Joost.VandeVondele at mat dot ethz.ch
  2012-06-29 10:56 ` steven at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2012-06-29  7:19 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Joost.VandeVondele at mat
                   |                            |dot ethz.ch

--- Comment #8 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-29 07:19:03 UTC ---
(In reply to comment #7)
> (In reply to comment #6)
> > Janne, have you had a chance to look at this? For larger matrices MATMUL is
> > really slow. Anything that includes even the most basic blocking scheme should
> > be faster. I think this would be a valuable improvement.
> 
> I implemented a block-panel multiplication algorithm similar to GOTO BLAS and
> Eigen, but I got side-tracked by other things and never found the time to fix
> the corner-case bugs and tune performance. IIRC I reached about 30-40 % of peak
> flops which was a bit disappointing.

I think 30% of peak is a good improvement over the current version, which
reaches 7% of peak (92% for MKL) for a double-precision 8000x8000 matrix
multiplication on a Sandy Bridge.

In addition to blocking, is the Fortran runtime being compiled with a set of
compile options that enable vectorization? In an ideal world, gcc would
recognize the loop pattern in the runtime library code and do blocking,
vectorization etc. automagically.


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2012-06-29  7:19 ` Joost.VandeVondele at mat dot ethz.ch
@ 2012-06-29 10:56 ` steven at gcc dot gnu.org
  2013-03-29  8:47 ` Joost.VandeVondele at mat dot ethz.ch
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: steven at gcc dot gnu.org @ 2012-06-29 10:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |steven at gcc dot gnu.org

--- Comment #9 from Steven Bosscher <steven at gcc dot gnu.org> 2012-06-29 10:55:48 UTC ---
(In reply to comment #7)
> IIRC I reached about 30-40 % of peak
> flops which was a bit disappointing.

This sounds quite impressive to me, actually.

It would be interesting to investigate using the IFUNC mechanism to provide
optimized (e.g. vectorized) versions of some of the library functions.


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2012-06-29 10:56 ` steven at gcc dot gnu.org
@ 2013-03-29  8:47 ` Joost.VandeVondele at mat dot ethz.ch
  2013-04-01 15:59 ` tkoenig at gcc dot gnu.org
  2015-10-31 14:15 ` dominiq at lps dot ens.fr
  11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2013-03-29  8:47 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2011-11-14 00:00:00         |2013-03-29

--- Comment #10 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2013-03-29 08:47:39 UTC ---
What about compiling the Fortran runtime library with vectorization, and all
the fancy options that come with Graphite (loop blocking in particular)? If
they don't work for a matrix multiplication pattern... what's their use?
Further naivety would be to provide an LTO'ed runtime, allowing matrix
multiplication to be inlined for known small bounds... kind of the ultimate
dogfooding?


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2013-03-29  8:47 ` Joost.VandeVondele at mat dot ethz.ch
@ 2013-04-01 15:59 ` tkoenig at gcc dot gnu.org
  2015-10-31 14:15 ` dominiq at lps dot ens.fr
  11 siblings, 0 replies; 13+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2013-04-01 15:59 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Thomas Koenig <tkoenig at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |37131

--- Comment #11 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2013-04-01 15:58:52 UTC ---
A bit like PR 37131 (but I don't want to lose either audit trail).


* [Bug libfortran/51119] MATMUL slow for large matrices
  2011-11-14  8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2013-04-01 15:59 ` tkoenig at gcc dot gnu.org
@ 2015-10-31 14:15 ` dominiq at lps dot ens.fr
  11 siblings, 0 replies; 13+ messages in thread
From: dominiq at lps dot ens.fr @ 2015-10-31 14:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #12 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
Some new numbers for a four-core Core i7 (2.8 GHz, turbo boost 3.8 GHz, 1.6 GHz
DDR3) on x86_64-apple-darwin14.5 for the following test:

program t2
implicit none
REAL time_begin, time_end
integer, parameter :: n = 2000
integer, parameter :: m = 100
integer(8) :: ts, te, rate8, cmax8
real(8) :: elapsed
REAL(8) :: a(n,n), b(n,n), c(n,n)
integer :: i
call RANDOM_NUMBER(a)
call RANDOM_NUMBER(b)
call cpu_time(time_begin)
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m
    a(1,1) = a(1,1) + 0.1
    c = MATMUL(a,b)
enddo
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end)
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
         2*m*real(n, kind=8)**3/(10**9*elapsed)
call cpu_time(time_begin)
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m
    a(1,1) = a(1,1) + 0.1
    call dgemm('n','n', n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
enddo
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end)
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
         2*m*real(n, kind=8)**3/(10**9*elapsed)
end program

borrowed from
http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
-fno-frontend-optimize
[Book15] f90/bug% time a.out
 Time, MATMUL:    374.027161       374.02889900000002        4.2777443247774283 
 Time, MATMUL:    172.823853       23.073034000000000        69.345019818373260 
546.427u 0.542s 6:37.24 137.6%  0+0k 1+0io 41pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate 
[Book15] f90/bug% time a.out
 Time, MATMUL:    391.495880       391.49403500000000        4.0869077353886123 
 Time, MATMUL:    169.313202       22.781099000000001        70.233661685944114 
560.384u 0.544s 6:54.39 135.3%  0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
-march=native
[Book15] f90/bug% time a.out
 Time, MATMUL:    367.570374       367.56880500000000        4.3529265221514102 
 Time, MATMUL:    170.150818       22.837544000000001        70.060073009602078 
537.306u 0.534s 6:30.53 137.7%  0+0k 0+0io 0pf+0w

where the last column is the speed in GFlops. These numbers show that the
library MATMUL is slightly faster than the inlined version unless -march=native
is used (AVX should be twice as fast unless limited by memory bandwidth).

[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90
-framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    159.000992       21.450851000000000        74.589115368896088 
 Time, MATMUL:    172.616943       23.029487000000000        69.476145951492541 
331.281u 0.453s 0:44.60 743.7%  0+0k 0+0io 3pf+0w
... repeated several times in order to heat up the CPU
[Book15] f90/bug% time a.out
 Time, MATMUL:    179.624268       23.935708999999999        66.845732457726655 
 Time, MATMUL:    178.685364       23.898668000000001        66.949337929628541 
357.978u 0.447s 0:47.95 747.4%  0+0k 0+0io 0pf+0w

Thus the BLAS provided by darwin gets ~67 GFlops out of the ~90 GFlops peak
(AVX * 4 cores), while the inlined MATMUL gets ~4 GFlops out of the ~15 GFlops
peak (no AVX, one core with turbo boost), with little gain when using AVX
(~30 GFlops peak).

I suppose most modern OSes provide such an optimized BLAS and, if not, one can
install a library such as ATLAS. So I wonder if it would not be more effective
to be able to configure with something such as --with-blas="magic incantation"
and use -fexternal-blas as the default, rather than reinventing the wheel.

More than three years ago Janne Blomqvist (comment 7) wrote
> IIRC I reached about 30-40 % of peak flops which was a bit disappointing.

Would it be possible to have the patch to play with?

