public inbox for gcc-bugs@sourceware.org
* [Bug libfortran/51119] New: MATMUL slow for large matrices
From: jb at gcc dot gnu.org
Date: 2011-11-14 8:16 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

           Bug #: 51119
          Summary: MATMUL slow for large matrices
   Classification: Unclassified
          Product: gcc
          Version: unknown
           Status: UNCONFIRMED
         Severity: enhancement
         Priority: P3
        Component: libfortran
       AssignedTo: unassigned@gcc.gnu.org
       ReportedBy: jb@gcc.gnu.org

Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices with
n > 256 is around a factor of 8 slower. While I don't think it's worth spending
the time on target-specific parameters and/or an asm-coded inner kernel as
high-performance BLAS implementations do, I suspect that a little effort
towards cache blocking could improve things.
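The cache blocking the report alludes to can be sketched as follows. This is a minimal illustration in C (the language libgfortran's matmul kernels are written in), not the library's actual code; the block size BS is an assumed, untuned value, and a real implementation would derive it from cache geometry:

```c
#include <stddef.h>
#include <string.h>

#define BS 64  /* assumed block edge; a tuned value would depend on cache size */

/* C = A * B for n x n column-major matrices, processed in BS x BS tiles so
   that each tile of A, B and C stays cache-resident while it is reused. */
static void matmul_blocked(const double *a, const double *b, double *c, size_t n)
{
    memset(c, 0, n * n * sizeof *c);
    for (size_t jj = 0; jj < n; jj += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t ii = 0; ii < n; ii += BS) {
                size_t jmax = jj + BS < n ? jj + BS : n;
                size_t kmax = kk + BS < n ? kk + BS : n;
                size_t imax = ii + BS < n ? ii + BS : n;
                for (size_t j = jj; j < jmax; j++)
                    for (size_t k = kk; k < kmax; k++) {
                        double bkj = b[k + j * n];  /* hoist the B element */
                        for (size_t i = ii; i < imax; i++)
                            c[i + j * n] += a[i + k * n] * bkj;
                    }
            }
}
```

The loop nest does the same work as the naive triple loop; only the iteration order changes, which is why a factor-of-several speedup is plausible without any target-specific code.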
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Janne Blomqvist <jb at gcc dot gnu.org>
Date: 2011-11-14 8:17 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Janne Blomqvist <jb at gcc dot gnu.org> changed:

             What |Removed                       |Added
----------------------------------------------------------------------------
           Status |UNCONFIRMED                   |ASSIGNED
 Last reconfirmed |                              |2011-11-14
       AssignedTo |unassigned at gcc dot gnu.org |jb at gcc dot gnu.org
   Ever Confirmed |0                             |1

--- Comment #1 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-11-14 06:49:11 UTC ---
Assigning to myself. I have a cunning plan.
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Tobias Burnus <burnus at gcc dot gnu.org>
Date: 2011-11-14 13:56 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Tobias Burnus <burnus at gcc dot gnu.org> changed:

             What |Removed |Added
----------------------------------------------------------------------------
               CC |        |burnus at gcc dot gnu.org

--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-11-14 13:08:49 UTC ---
(In reply to comment #0)
> Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices
> with n > 256 is around a factor of 8 slower.

Side note: You can use -fexternal-blas -fblas-matmul-limit=<...> and link
ATLAS BLAS.

> Assigning to myself. I have a cunning plan.

I am looking forward to cunning ideas - at least if they are not too
convoluted, work on all targets and are middle-end friendly.
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch>
Date: 2011-11-15 12:35 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #3 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-15 12:19:59 UTC ---
(In reply to comment #1)
> I have a cunning plan.

It is doable to come within a factor of 2 of highly efficient implementations
using a cache-oblivious matrix multiply, which is relatively easy to code. I'm
not sure this is worth the effort. I believe it would be more important to have
actually highly efficient (inlined) implementations for very small matrices.
These would outperform general libraries by a large factor.

For CP2K I have written a specialized small-matrix-multiply library generator
which generates code that outperforms e.g. MKL by a large factor for small
matrices (<<32x32). The generation time and library size do not make it a
general-purpose tool. It also contains an implementation of a recursive
multiply of some sort (see
http://cvs.berlios.de/cgi-bin/viewvc.cgi/cp2k/cp2k/tools/build_libsmm/).
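The cache-oblivious multiply mentioned above recursively halves the largest dimension, so every sub-problem eventually fits in every level of the cache hierarchy without any tuned block size. A minimal sketch in C (an editor's illustration of the general technique, not CP2K's code; the base-case CUTOFF is an assumed value):

```c
#include <stddef.h>

#define CUTOFF 32  /* assumed size below which the plain loop nest is used */

/* C += A * B, column-major, with leading dimensions lda/ldb/ldc.
   Recursively split the largest of (m, n, k) until the base case. */
static void rec_matmul(size_t m, size_t n, size_t k,
                       const double *a, size_t lda,
                       const double *b, size_t ldb,
                       double *c, size_t ldc)
{
    if (m <= CUTOFF && n <= CUTOFF && k <= CUTOFF) {
        for (size_t j = 0; j < n; j++)
            for (size_t p = 0; p < k; p++) {
                double bpj = b[p + j * ldb];
                for (size_t i = 0; i < m; i++)
                    c[i + j * ldc] += a[i + p * lda] * bpj;
            }
    } else if (m >= n && m >= k) {          /* split the rows of A and C */
        size_t h = m / 2;
        rec_matmul(h, n, k, a, lda, b, ldb, c, ldc);
        rec_matmul(m - h, n, k, a + h, lda, b, ldb, c + h, ldc);
    } else if (n >= k) {                    /* split the columns of B and C */
        size_t h = n / 2;
        rec_matmul(m, h, k, a, lda, b, ldb, c, ldc);
        rec_matmul(m, n - h, k, a, lda, b + h * ldb, ldb, c + h * ldc, ldc);
    } else {                                /* split the inner dimension */
        size_t h = k / 2;
        rec_matmul(m, n, h, a, lda, b, ldb, c, ldc);
        rec_matmul(m, n, k - h, a + h * lda, lda, b + h, ldb, c, ldc);
    }
}
```

Because the recursion carries leading dimensions rather than copying data, the only cost over the naive loop is the call overhead, which the base-case cutoff amortizes.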
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch>
Date: 2011-11-15 12:37 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #4 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-15 12:31:10 UTC ---
Created attachment 25826
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25826
Comparison in performance for small matrix multiplies (libsmm vs. MKL)

Added some data showing the speedup of specialized matrix-multiply code (small
matrices, known bounds, in cache) against general DGEMM (MKL).
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Janne Blomqvist <jb at gcc dot gnu.org>
Date: 2011-11-15 16:19 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #5 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-11-15 15:47:54 UTC ---
(In reply to comment #3)
> I believe it would be more important to have actually highly efficient
> (inlined) implementations for very small matrices.

There's already PR 37131 for that.
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch>
Date: 2012-06-28 11:58 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #6 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-28 11:58:20 UTC ---
Janne, have you had a chance to look at this? For larger matrices MATMUL is
really slow; anything that includes even the most basic blocking scheme should
be faster. I think this would be a valuable improvement.
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Janne Blomqvist <jb at gcc dot gnu.org>
Date: 2012-06-28 12:15 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #7 from Janne Blomqvist <jb at gcc dot gnu.org> 2012-06-28 12:15:05 UTC ---
(In reply to comment #6)
> Janne, have you had a chance to look at this? For larger matrices MATMUL is
> really slow. Anything that includes even the most basic blocking scheme
> should be faster. I think this would be a valuable improvement.

I implemented a block-panel multiplication algorithm similar to GotoBLAS and
Eigen, but I got side-tracked by other things and never found the time to fix
the corner-case bugs and tune performance. IIRC I reached about 30-40 % of peak
flops, which was a bit disappointing.
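The key idea of the GotoBLAS/Eigen-style block-panel algorithm referred to above is to copy ("pack") a block of A into a small contiguous buffer, then multiply it against a whole panel of B while it is hot in cache, so the innermost loops stream over unit-stride memory. A minimal sketch (an editor's illustration, not the patch in question; block sizes MC/KC are assumed values, and real implementations also pack B and use a vectorized micro-kernel):

```c
#include <stddef.h>
#include <string.h>

enum { MC = 64, KC = 64 };  /* assumed; GotoBLAS derives these from cache sizes */

/* Pack an mc x kc block of column-major A (leading dimension n) into a
   contiguous buffer so the kernel reads it with unit stride. */
static void pack_a(const double *a, size_t n, size_t mc, size_t kc, double *buf)
{
    for (size_t p = 0; p < kc; p++)
        memcpy(buf + p * mc, a + p * n, mc * sizeof *buf);
}

/* C += A * B for n x n column-major matrices, block-panel style:
   for each KC-deep panel of B, pack each MC x KC block of A and
   multiply it against the whole panel while the block stays in cache. */
static void matmul_panel(const double *a, const double *b, double *c, size_t n)
{
    double abuf[MC * KC];
    for (size_t pp = 0; pp < n; pp += KC) {
        size_t kc = pp + KC < n ? KC : n - pp;
        for (size_t ii = 0; ii < n; ii += MC) {
            size_t mc = ii + MC < n ? MC : n - ii;
            pack_a(a + ii + pp * n, n, mc, kc, abuf);
            for (size_t j = 0; j < n; j++)        /* sweep the panel of B */
                for (size_t p = 0; p < kc; p++) {
                    double bpj = b[(pp + p) + j * n];
                    for (size_t i = 0; i < mc; i++)
                        c[(ii + i) + j * n] += abuf[i + p * mc] * bpj;
                }
        }
    }
}
```

Note that C must be zeroed by the caller; the packing copy is the extra O(n^2) work that buys O(n^3) worth of cache-friendly accesses.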
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch>
Date: 2012-06-29 7:19 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

             What |Removed |Added
----------------------------------------------------------------------------
               CC |        |Joost.VandeVondele at mat dot ethz.ch

--- Comment #8 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-29 07:19:03 UTC ---
(In reply to comment #7)
> I implemented a block-panel multiplication algorithm similar to GOTO BLAS
> and Eigen [...] IIRC I reached about 30-40 % of peak flops which was a bit
> disappointing.

I think 30% of peak is a good improvement over the current version, which
reaches 7% of peak (92% for MKL) for a double-precision 8000x8000 matrix
multiplication on Sandy Bridge. In addition to blocking, is the Fortran
runtime being compiled with a set of compile options that enables
vectorization?

In the ideal world, gcc would recognize the loop pattern in the runtime
library code, and do blocking, vectorization etc. automagically.
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Steven Bosscher <steven at gcc dot gnu.org>
Date: 2012-06-29 10:56 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Steven Bosscher <steven at gcc dot gnu.org> changed:

             What |Removed |Added
----------------------------------------------------------------------------
               CC |        |steven at gcc dot gnu.org

--- Comment #9 from Steven Bosscher <steven at gcc dot gnu.org> 2012-06-29 10:55:48 UTC ---
(In reply to comment #7)
> IIRC I reached about 30-40 % of peak flops which was a bit disappointing.

This sounds quite impressive to me, actually. It would be interesting to
investigate using the IFUNC mechanism to provide optimized (e.g. vectorized)
versions of some of the library functions.
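IFUNC lets the dynamic linker pick one of several implementations of a function once, at load time, based on the running CPU. A portable approximation of the idea, using a function pointer resolved on first call (an editor's sketch with made-up kernels; the real mechanism uses GCC's __attribute__((ifunc("resolver"))) with glibc support, and a real resolver would test CPU features, e.g. via __builtin_cpu_supports):

```c
#include <stddef.h>

/* Two variants of the same hypothetical kernel. */
static double dot_generic(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

static double dot_unrolled(const double *a, const double *b, size_t n)
{
    /* stands in for a vectorized (e.g. AVX) version */
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)
        s0 += a[i] * b[i];
    return s0 + s1;
}

static double dot_resolve(const double *, const double *, size_t);

/* The pointer starts at the resolver; the first call replaces it,
   which is what IFUNC does once at dynamic-link time instead. */
static double (*dot_impl)(const double *, const double *, size_t) = dot_resolve;

static double dot_resolve(const double *a, const double *b, size_t n)
{
#if defined(__AVX__)
    dot_impl = dot_unrolled;   /* pretend this is the AVX build's choice */
#else
    dot_impl = dot_generic;
#endif
    return dot_impl(a, b, n);
}

double dot(const double *a, const double *b, size_t n) { return dot_impl(a, b, n); }
```

Either variant computes the same result; only the dispatch cost differs, and after the first call it is a single indirect jump.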
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch>
Date: 2013-03-29 8:47 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

             What |Removed              |Added
----------------------------------------------------------------------------
 Last reconfirmed |2011-11-14 00:00:00  |2013-03-29

--- Comment #10 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2013-03-29 08:47:39 UTC ---
What about compiling the Fortran runtime library with vectorization, and all
the fancy options that come with Graphite (loop blocking in particular)? If
they don't work for a matrix-multiplication pattern, what's their use?

Further naivety would be to provide an LTO'ed runtime, allowing matrix
multiplication to be inlined for known small bounds - kind of the ultimate
dogfooding?
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Thomas Koenig <tkoenig at gcc dot gnu.org>
Date: 2013-04-01 15:59 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Thomas Koenig <tkoenig at gcc dot gnu.org> changed:

             What |Removed |Added
----------------------------------------------------------------------------
       Depends on |        |37131

--- Comment #11 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2013-04-01 15:58:52 UTC ---
A bit like PR 37131 (but I don't want to lose either audit trail).
* [Bug libfortran/51119] MATMUL slow for large matrices
From: Dominique d'Humieres <dominiq at lps dot ens.fr>
Date: 2015-10-31 14:15 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #12 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
Some new numbers for a four-core Core i7 at 2.8GHz (turboboost 3.8GHz, 1.6GHz
DDR3) on x86_64-apple-darwin14.5, for the following test program, borrowed from
http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

program t2
  implicit none
  REAL time_begin, time_end
  integer, parameter :: n = 2000
  integer(8) :: ts, te, rate8, cmax8
  real(8) :: elapsed
  REAL(8) :: a(n,n), b(n,n), c(n,n)
  integer, parameter :: m = 100
  integer :: i
  call RANDOM_NUMBER(a)
  call RANDOM_NUMBER(b)
  call cpu_time(time_begin)
  call SYSTEM_CLOCK (ts, rate8, cmax8)
  do i = 1,m
     a(1,1) = a(1,1) + 0.1
     c = MATMUL(a,b)
  enddo
  call SYSTEM_CLOCK (te, rate8, cmax8)
  call cpu_time(time_end)
  elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
  PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
       2*m*real(n, kind=8)**3/(10**9*elapsed)
  call cpu_time(time_begin)
  call SYSTEM_CLOCK (ts, rate8, cmax8)
  do i = 1,m
     a(1,1) = a(1,1) + 0.1
     call dgemm('n','n',n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
  enddo
  call SYSTEM_CLOCK (te, rate8, cmax8)
  call cpu_time(time_end)
  elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
  PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
       2*m*real(n, kind=8)**3/(10**9*elapsed)
end program

The first line of each output pair is the intrinsic MATMUL, the second is
DGEMM; the last column is the speed in GFlops.

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -fno-frontend-optimize
[Book15] f90/bug% time a.out
 Time, MATMUL:    374.027161       374.02889900000002        4.2777443247774283
 Time, MATMUL:    172.823853       23.073034000000000        69.345019818373260
546.427u 0.542s 6:37.24 137.6%  0+0k 1+0io 41pf+0w

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    391.495880       391.49403500000000        4.0869077353886123
 Time, MATMUL:    169.313202       22.781099000000001        70.233661685944114
560.384u 0.544s 6:54.39 135.3%  0+0k 0+0io 0pf+0w

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -march=native
[Book15] f90/bug% time a.out
 Time, MATMUL:    367.570374       367.56880500000000        4.3529265221514102
 Time, MATMUL:    170.150818       22.837544000000001        70.060073009602078
537.306u 0.534s 6:30.53 137.7%  0+0k 0+0io 0pf+0w

These numbers show that the library MATMUL is slightly faster than the inline
version unless -march=native is used (AVX should be twice as fast unless
limited by the memory bandwidth).

[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    159.000992       21.450851000000000        74.589115368896088
 Time, MATMUL:    172.616943       23.029487000000000        69.476145951492541
331.281u 0.453s 0:44.60 743.7%  0+0k 0+0io 3pf+0w

... repeated several times in order to heat the CPU ...

[Book15] f90/bug% time a.out
 Time, MATMUL:    179.624268       23.935708999999999        66.845732457726655
 Time, MATMUL:    178.685364       23.898668000000001        66.949337929628541
357.978u 0.447s 0:47.95 747.4%  0+0k 0+0io 0pf+0w

Thus the BLAS provided by Darwin gets ~67 GFlops out of the ~90 GFlops peak
(AVX * 4 cores), while the inlined MATMUL gets ~4 GFlops out of the ~15 GFlops
peak (no AVX, one core, and turboboost), with little gain when using AVX
(~30 GFlops peak).

I suppose most modern OSes provide such an optimized BLAS and, if not, one can
install a library such as ATLAS. So I wonder if it would not be more effective
to be able to configure with something such as --with-blas="magic incantation"
and use -fexternal-blas as the default, rather than reinventing the wheel.

More than three years ago Janne Blomqvist (comment 7) wrote:
> IIRC I reached about 30-40 % of peak flops which was a bit disappointing.

Would it be possible to have the patch to play with?