public inbox for gcc-bugs@sourceware.org
* [Bug libfortran/51119] New: MATMUL slow for large matrices
@ 2011-11-14 8:16 jb at gcc dot gnu.org
2011-11-14 8:17 ` [Bug libfortran/51119] " jb at gcc dot gnu.org
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2011-11-14 8:16 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Bug #: 51119
Summary: MATMUL slow for large matrices
Classification: Unclassified
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: libfortran
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: jb@gcc.gnu.org
Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices with
n > 256 is around a factor of 8 slower.
While I don't think it's worth spending the time on target-specific parameters
and/or an asm-coded inner kernel as high-performance BLAS implementations do, I
suspect that a little effort towards cache blocking could improve things.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
@ 2011-11-14 8:17 ` jb at gcc dot gnu.org
2011-11-14 13:56 ` burnus at gcc dot gnu.org
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2011-11-14 8:17 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Janne Blomqvist <jb at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed| |2011-11-14
AssignedTo|unassigned at gcc dot |jb at gcc dot gnu.org
|gnu.org |
Ever Confirmed|0 |1
--- Comment #1 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-11-14 06:49:11 UTC ---
Assigning to myself.
I have a cunning plan.
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
2011-11-14 8:17 ` [Bug libfortran/51119] " jb at gcc dot gnu.org
@ 2011-11-14 13:56 ` burnus at gcc dot gnu.org
2011-11-15 12:35 ` Joost.VandeVondele at mat dot ethz.ch
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-11-14 13:56 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Tobias Burnus <burnus at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |burnus at gcc dot gnu.org
--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-11-14 13:08:49 UTC ---
(In reply to comment #0)
> Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices with
> n > 256 is around a factor of 8 slower.
Side note: You can use -fexternal-blas -fblas-matmul-limit=<...> and link ATLAS
BLAS.
> Assigning to myself.
> I have a cunning plan.
I am looking forward to cunning ideas - at least if they are not too
convoluted, work on all targets, and are middle-end friendly.
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
2011-11-14 8:17 ` [Bug libfortran/51119] " jb at gcc dot gnu.org
2011-11-14 13:56 ` burnus at gcc dot gnu.org
@ 2011-11-15 12:35 ` Joost.VandeVondele at mat dot ethz.ch
2011-11-15 12:37 ` Joost.VandeVondele at mat dot ethz.ch
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-15 12:35 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #3 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-15 12:19:59 UTC ---
(In reply to comment #1)
> I have a cunning plan.
It is doable to come within a factor of 2 of highly efficient implementations
using a cache-oblivious matrix multiply, which is relatively easy to code. I'm
not sure this is worth the effort.
I believe it would be more important to actually have highly efficient
(inlined) implementations for very small matrices. These would outperform
general libraries by a large factor. For CP2K I have written a specialized
small-matrix multiply library generator which generates code that outperforms
e.g. MKL by a large factor for small matrices (<<32x32). The generation time
and library size do not make it a general purpose tool. It also contains an
implementation of a recursive multiply of sorts (see
http://cvs.berlios.de/cgi-bin/viewvc.cgi/cp2k/cp2k/tools/build_libsmm/)
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (2 preceding siblings ...)
2011-11-15 12:35 ` Joost.VandeVondele at mat dot ethz.ch
@ 2011-11-15 12:37 ` Joost.VandeVondele at mat dot ethz.ch
2011-11-15 16:19 ` jb at gcc dot gnu.org
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-15 12:37 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #4 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-15 12:31:10 UTC ---
Created attachment 25826
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25826
comparison in performance for small matrix multiplies (libsmm vs mkl)
Added some data showing the speedup of specialized matrix-multiply code (small
matrices, known bounds, in cache) over a general dgemm (MKL).
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (3 preceding siblings ...)
2011-11-15 12:37 ` Joost.VandeVondele at mat dot ethz.ch
@ 2011-11-15 16:19 ` jb at gcc dot gnu.org
2012-06-28 11:58 ` Joost.VandeVondele at mat dot ethz.ch
` (6 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2011-11-15 16:19 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #5 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-11-15 15:47:54 UTC ---
(In reply to comment #3)
> I believe it would be more important to have actually highly efficient
> (inlined) implementations for very small matrices.
There's already PR 37131 for that.
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (4 preceding siblings ...)
2011-11-15 16:19 ` jb at gcc dot gnu.org
@ 2012-06-28 11:58 ` Joost.VandeVondele at mat dot ethz.ch
2012-06-28 12:15 ` jb at gcc dot gnu.org
` (5 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2012-06-28 11:58 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #6 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-28 11:58:20 UTC ---
Janne, have you had a chance to look at this? For larger matrices MATMUL is
really slow. Anything that includes even the most basic blocking scheme should
be faster. I think this would be a valuable improvement.
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (5 preceding siblings ...)
2012-06-28 11:58 ` Joost.VandeVondele at mat dot ethz.ch
@ 2012-06-28 12:15 ` jb at gcc dot gnu.org
2012-06-29 7:19 ` Joost.VandeVondele at mat dot ethz.ch
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2012-06-28 12:15 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #7 from Janne Blomqvist <jb at gcc dot gnu.org> 2012-06-28 12:15:05 UTC ---
(In reply to comment #6)
> Janne, have you had a chance to look at this? For larger matrices MATMUL is
> really slow. Anything that includes even the most basic blocking scheme should
> be faster. I think this would be a valuable improvement.
I implemented a block-panel multiplication algorithm similar to GOTO BLAS and
Eigen, but I got side-tracked by other things and never found the time to fix
the corner-case bugs and tune performance. IIRC I reached about 30-40% of peak
flops, which was a bit disappointing.
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (6 preceding siblings ...)
2012-06-28 12:15 ` jb at gcc dot gnu.org
@ 2012-06-29 7:19 ` Joost.VandeVondele at mat dot ethz.ch
2012-06-29 10:56 ` steven at gcc dot gnu.org
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2012-06-29 7:19 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |Joost.VandeVondele at mat
| |dot ethz.ch
--- Comment #8 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-29 07:19:03 UTC ---
(In reply to comment #7)
> (In reply to comment #6)
> > Janne, have you had a chance to look at this? For larger matrices MATMUL is
> > really slow. Anything that includes even the most basic blocking scheme should
> > be faster. I think this would be a valuable improvement.
>
> I implemented a block-panel multiplication algorithm similar to GOTO BLAS and
> Eigen, but I got side-tracked by other things and never found the time to fix
> the corner-case bugs and tune performance. IIRC I reached about 30-40% of peak
> flops, which was a bit disappointing.
I think 30% of peak is a good improvement over the current version (which
reaches 7% of peak, vs. 92% for MKL, for a double precision 8000x8000 matrix
multiplication) on Sandy Bridge.
In addition to blocking, is the Fortran runtime being compiled with a set of
compile options that enable vectorization? In an ideal world, gcc would
recognize the loop pattern in the runtime library code and do blocking,
vectorization etc. automagically.
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (7 preceding siblings ...)
2012-06-29 7:19 ` Joost.VandeVondele at mat dot ethz.ch
@ 2012-06-29 10:56 ` steven at gcc dot gnu.org
2013-03-29 8:47 ` Joost.VandeVondele at mat dot ethz.ch
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: steven at gcc dot gnu.org @ 2012-06-29 10:56 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Steven Bosscher <steven at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |steven at gcc dot gnu.org
--- Comment #9 from Steven Bosscher <steven at gcc dot gnu.org> 2012-06-29 10:55:48 UTC ---
(In reply to comment #7)
> IIRC I reached about 30-40% of peak
> flops, which was a bit disappointing.
This sounds quite impressive to me, actually.
It would be interesting to investigate using the IFUNC mechanism to provide
optimized (e.g. vectorized) versions of some of the library functions.
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (8 preceding siblings ...)
2012-06-29 10:56 ` steven at gcc dot gnu.org
@ 2013-03-29 8:47 ` Joost.VandeVondele at mat dot ethz.ch
2013-04-01 15:59 ` tkoenig at gcc dot gnu.org
2015-10-31 14:15 ` dominiq at lps dot ens.fr
11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2013-03-29 8:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2011-11-14 00:00:00 |2013-03-29
--- Comment #10 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2013-03-29 08:47:39 UTC ---
What about compiling the Fortran runtime library with vectorization and all
the fancy options that come with Graphite (loop blocking in particular)? If
they don't work for a matrix multiplication pattern... what's their use?
Further naivety would be to provide an LTO'ed runtime, allowing matrix
multiplication to be inlined for known small bounds... kind of the ultimate
dogfooding?
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (9 preceding siblings ...)
2013-03-29 8:47 ` Joost.VandeVondele at mat dot ethz.ch
@ 2013-04-01 15:59 ` tkoenig at gcc dot gnu.org
2015-10-31 14:15 ` dominiq at lps dot ens.fr
11 siblings, 0 replies; 13+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2013-04-01 15:59 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Thomas Koenig <tkoenig at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Depends on| |37131
--- Comment #11 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2013-04-01 15:58:52 UTC ---
A bit like PR 37131 (but I don't want to lose either audit trail).
* [Bug libfortran/51119] MATMUL slow for large matrices
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
` (10 preceding siblings ...)
2013-04-01 15:59 ` tkoenig at gcc dot gnu.org
@ 2015-10-31 14:15 ` dominiq at lps dot ens.fr
11 siblings, 0 replies; 13+ messages in thread
From: dominiq at lps dot ens.fr @ 2015-10-31 14:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #12 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
Some new numbers for a four-core Core i7 at 2.8 GHz (turbo boost 3.8 GHz) with
1.6 GHz DDR3, on x86_64-apple-darwin14.5, for the following test
program t2
  implicit none
  REAL time_begin, time_end
  integer, parameter :: n = 2000
  integer(8) :: ts, te, rate8, cmax8
  real(8) :: elapsed
  REAL(8) :: a(n,n), b(n,n), c(n,n)
  integer, parameter :: m = 100
  integer :: i
  call RANDOM_NUMBER(a)
  call RANDOM_NUMBER(b)
  call cpu_time(time_begin)
  call SYSTEM_CLOCK(ts, rate8, cmax8)
  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    c = MATMUL(a,b)
  enddo
  call SYSTEM_CLOCK(te, rate8, cmax8)
  call cpu_time(time_end)
  elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
  PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
           2*m*real(n, kind=8)**3/(10**9*elapsed)
  call cpu_time(time_begin)
  call SYSTEM_CLOCK(ts, rate8, cmax8)
  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    call dgemm('n','n',n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
  enddo
  call SYSTEM_CLOCK(te, rate8, cmax8)
  call cpu_time(time_end)
  elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
  PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
           2*m*real(n, kind=8)**3/(10**9*elapsed)
end program
borrowed from
http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
-fno-frontend-optimize
[Book15] f90/bug% time a.out
Time, MATMUL: 374.027161 374.02889900000002 4.2777443247774283
Time, MATMUL: 172.823853 23.073034000000000 69.345019818373260
546.427u 0.542s 6:37.24 137.6% 0+0k 1+0io 41pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
Time, MATMUL: 391.495880 391.49403500000000 4.0869077353886123
Time, MATMUL: 169.313202 22.781099000000001 70.233661685944114
560.384u 0.544s 6:54.39 135.3% 0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
-march=native
[Book15] f90/bug% time a.out
Time, MATMUL: 367.570374 367.56880500000000 4.3529265221514102
Time, MATMUL: 170.150818 22.837544000000001 70.060073009602078
537.306u 0.534s 6:30.53 137.7% 0+0k 0+0io 0pf+0w
where the last column is the speed in GFlops. These numbers show that the
library MATMUL is slightly faster than the inlined version unless -march=native
is used (AVX should be twice as fast unless limited by memory bandwidth).
[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90
-framework Accelerate
[Book15] f90/bug% time a.out
Time, MATMUL: 159.000992 21.450851000000000 74.589115368896088
Time, MATMUL: 172.616943 23.029487000000000 69.476145951492541
331.281u 0.453s 0:44.60 743.7% 0+0k 0+0io 3pf+0w
... repeated several times in order to heat up the CPU
[Book15] f90/bug% time a.out
Time, MATMUL: 179.624268 23.935708999999999 66.845732457726655
Time, MATMUL: 178.685364 23.898668000000001 66.949337929628541
357.978u 0.447s 0:47.95 747.4% 0+0k 0+0io 0pf+0w
Thus the BLAS provided by Darwin gets ~67 GFlops out of the ~90 GFlops peak
(AVX * 4 cores), while the inlined MATMUL gets ~4 GFlops out of the ~15 GFlops
peak (no AVX, one core with turbo boost), with little gain when using AVX
(~30 GFlops peak).
I suppose most modern OSes provide such an optimized BLAS and, if not, one can
install libraries such as ATLAS. So I wonder whether it would not be more
effective to be able to configure with something such as --with-blas="magic
incantation" and use -fexternal-blas as the default rather than reinventing the
wheel.
More than three years ago Janne Blomqvist (comment 7) wrote
> IIRC I reached about 30-40% of peak flops, which was a bit disappointing.
Would it be possible to have the patch to play with?
end of thread, other threads:[~2015-10-31 14:15 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-14 8:16 [Bug libfortran/51119] New: MATMUL slow for large matrices jb at gcc dot gnu.org
2011-11-14 8:17 ` [Bug libfortran/51119] " jb at gcc dot gnu.org
2011-11-14 13:56 ` burnus at gcc dot gnu.org
2011-11-15 12:35 ` Joost.VandeVondele at mat dot ethz.ch
2011-11-15 12:37 ` Joost.VandeVondele at mat dot ethz.ch
2011-11-15 16:19 ` jb at gcc dot gnu.org
2012-06-28 11:58 ` Joost.VandeVondele at mat dot ethz.ch
2012-06-28 12:15 ` jb at gcc dot gnu.org
2012-06-29 7:19 ` Joost.VandeVondele at mat dot ethz.ch
2012-06-29 10:56 ` steven at gcc dot gnu.org
2013-03-29 8:47 ` Joost.VandeVondele at mat dot ethz.ch
2013-04-01 15:59 ` tkoenig at gcc dot gnu.org
2015-10-31 14:15 ` dominiq at lps dot ens.fr
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).