* [Bug fortran/37131] inline matmul for small matrix sizes
2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
@ 2008-08-16 22:56 ` pinskia at gcc dot gnu dot org
2008-08-23 13:20 ` tkoenig at gcc dot gnu dot org
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2008-08-16 22:56 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from pinskia at gcc dot gnu dot org 2008-08-16 22:55 -------
Confirmed.
--
pinskia at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
Status|UNCONFIRMED |NEW
Ever Confirmed|0 |1
Last reconfirmed|0000-00-00 00:00:00 |2008-08-16 22:55:22
date| |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: tkoenig at gcc dot gnu dot org @ 2008-08-23 13:20 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from tkoenig at gcc dot gnu dot org 2008-08-23 13:18 -------
Created an attachment (id=16134)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16134&action=view)
test case
Actually, the test cases were a bit unfair, because
the middle-end decided not to calculate the
values of c that were never used.
Attached is a better test case.
Timings on x86_64-unknown-linux-gnu:
matmul = 12.840802 s
subroutine without explicit interface: 0.88805580 s
subroutine with explicit interface: 0.87605572 s
inline with sum 2.0721283 s
While inlining is still much better than matmul, a hand-rolled
3*3 subroutine is much faster overall, which I find a bit surprising.
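For illustration only (the attached test case is not reproduced in this thread): a fixed-size kernel of the kind a "hand-rolled 3*3 subroutine" might correspond to, sketched in C. The name matmul3_fixed and the 0-based column-major layout (chosen to mirror Fortran storage) are assumptions. Because every bound is a compile-time constant, the compiler is free to unroll and schedule the whole product.

```c
#define N 3

/* c = a * b for 3x3 matrices stored column-major, so element (i,j)
   lives at index j*N + i.  The inner dot product is written out in
   full; c must not overlap a or b.  */
void
matmul3_fixed (double c[N * N], const double a[N * N], const double b[N * N])
{
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      c[j * N + i] = a[0 * N + i] * b[j * N + 0]
                   + a[1 * N + i] * b[j * N + 1]
                   + a[2 * N + i] * b[j * N + 2];
}
```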
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: burnus at gcc dot gnu dot org @ 2008-11-29 16:20 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from burnus at gcc dot gnu dot org 2008-11-29 16:18 -------
(In reply to comment #4)
> Timings on x86_64-unknown-linux-gnu:
> matmul = 12.840802 s
> subroutine without explicit interface: 0.88805580 s
> subroutine with explicit interface: 0.87605572 s
> inline with sum 2.0721283 s
With -O2 I get:
matmul = 10.724670 s
subroutine without explicit interface: 7.7324829 s
subroutine with explicit interface: 7.8684921 s
inline with sum 7.7684860 s
Only with -O3 -ffast-math -march=native on AMD64 do I get the following:
matmul = 10.656666 s
subroutine without explicit interface: 0.91205692 s
subroutine with explicit interface: 0.82805157 s
inline with sum 2.4521542 s
For comparison with ifort ("loop was vectorized" in lines 40, 41, 43):
matmul = 2.660166 s
subroutine without explicit interface: 0.0000000E+00 s
subroutine with explicit interface: 0.0000000E+00 s
inline with sum 0.0000000E+00 s
and openf95 -O3:
matmul = 1.26807904 s (-O2: 28.2537651 s)
subroutine without explicit interface: 1.07606697 s (4.07225418)
subroutine with explicit interface: 1.05206609 s (4.08025742)
inline with sum 0.748046875 s (3.7522316)
--
burnus at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |burnus at gcc dot gnu dot
| |org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: tkoenig at gcc dot gnu dot org @ 2008-12-04 20:00 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from tkoenig at gcc dot gnu dot org 2008-12-04 19:58 -------
(In reply to comment #5)
> For comparison with ifort ("loop was vectorized" in lines 40, 41, 43):
> matmul = 2.660166 s
> subroutine without explicit interface: 0.0000000E+00 s
> subroutine with explicit interface: 0.0000000E+00 s
> inline with sum 0.0000000E+00 s
ifort detects that the call to invalidate doesn't actually invalidate
anything and so just removes the whole matmul stuff.
Intelligent, but bad for benchmarks :-)
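A minimal C sketch of how a timing loop can be made harder to optimize away, assuming nothing about the attached Fortran test case. The names kernel3 and bench_matmul3 are hypothetical. The inputs depend on a volatile read, so they are unknown at compile time, and every element of every result feeds the returned checksum, so no part of the product is dead code. A sufficiently clever optimizer may still transform the loop, so this is a precaution rather than a guarantee.

```c
#include <stddef.h>

#define N 3

/* c = a * b for 3x3 column-major matrices; c must not overlap a or b.  */
static void
kernel3 (double c[N * N], const double a[N * N], const double b[N * N])
{
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      {
        double s = 0.0;
        for (int l = 0; l < N; l++)
          s += a[l * N + i] * b[j * N + l];
        c[j * N + i] = s;
      }
}

/* Run the product `reps` times and return a checksum that depends on
   every element of every result.  */
double
bench_matmul3 (size_t reps)
{
  volatile double seed = 1.0;   /* volatile: value not known to the optimizer */
  double a[N * N], b[N * N], c[N * N];
  double checksum = 0.0;

  for (int e = 0; e < N * N; e++)
    {
      a[e] = seed + e;
      b[e] = seed;
    }

  for (size_t r = 0; r < reps; r++)
    {
      a[0] = seed + (double) r;         /* change the input on each pass */
      kernel3 (c, a, b);
      for (int e = 0; e < N * N; e++)   /* consume the whole result */
        checksum += c[e];
    }
  return checksum;
}
```

Printing or otherwise using the returned checksum in the caller keeps the work observable.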
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: tkoenig at gcc dot gnu dot org @ 2008-12-09 22:13 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from tkoenig at gcc dot gnu dot org 2008-12-09 22:12 -------
Created an attachment (id=16866)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16866&action=view)
better test case
Thou shalt use IMPLICIT none, especially if you think you don't need it...
Here's a better test case, which actually tests the right thing.
Timings with 4.4 on i686-pc-linux-gnu:
matmul = 15.596974 s
subroutine with explicit interface: 3.6842318 s
unrolled subroutine with explicit interface: 3.3522091 s
inline with sum 3.3602085 s
--
tkoenig at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #16134|0 |1
is obsolete| |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: tkoenig at gcc dot gnu dot org @ 2010-05-14 9:15 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from tkoenig at gcc dot gnu dot org 2010-05-14 09:15 -------
New timings, on x86_64-unknown-linux-gnu. I split off the "invalidate"
subroutine to make sure the optimizers don't optimize this out:
ig25@linux-fd1f:/tmp> gfortran -O3 matmul.f90 invalidate.f90
ig25@linux-fd1f:/tmp> time ./a.out
matmul = 11.100311 s
subroutine with explicit interface: 2.0216932 s
unrolled subroutine with explicit interface: 1.9317064 s
inline with sum 1.9087105 s
real 0m16.971s
user 0m16.959s
sys 0m0.005s
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: tkoenig at gcc dot gnu dot org @ 2010-06-04 22:32 UTC (permalink / raw)
To: gcc-bugs
------- Comment #9 from tkoenig at gcc dot gnu dot org 2010-06-04 22:31 -------
I have thought a little bit about this, and the problem is
a bit daunting ;-) Of course, this is at least partly because
my experience with the scalarizer is close to non-existent, but you
have to learn sometime.
It seems that the functions for scalarizing do not help a lot
here, because (for example) we need three nested loops for implementing
the case where a and b are of rank 2.
The preferred way would therefore be to state the rank 2 * rank 2 problem as
  do i=1,m
    do j=1,n
      c(i,j) = sum(a(i,:) * b(:,j))
    end do
  end do
with the inner dot product implemented using the scalarizer (borrowing
from dot_product), and the outer loops using either hand-crafted
TREE code or calling the DO translation.
Comments? Is this reasonable?
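A minimal C sketch of the formulation above, with the two outer loops written out and the inner reduction factored into a helper. The names matmul_dot, row_col_dot and ELEM are hypothetical, and 0-based column-major indexing is assumed so that the layout mirrors Fortran storage. It only illustrates the loop structure and is not the code gfortran generates.

```c
#include <stddef.h>

/* Element (i,j) of a column-major matrix with leading dimension ld,
   i.e. the 0-based counterpart of Fortran's x(i,j).  */
#define ELEM(x, ld, i, j) ((x)[(size_t) (j) * (ld) + (i)])

/* Inner reduction, the counterpart of sum(a(i,:) * b(:,j)), over the
   shared dimension of length k.  a is m x k, b is k x n.  */
static double
row_col_dot (const double *a, size_t m, const double *b, size_t k,
             size_t i, size_t j)
{
  double s = 0.0;
  for (size_t l = 0; l < k; l++)
    s += ELEM (a, m, i, l) * ELEM (b, k, l, j);
  return s;
}

/* c (m x n) = a (m x k) * b (k x n); c must not overlap a or b.  */
void
matmul_dot (double *c, const double *a, const double *b,
            size_t m, size_t n, size_t k)
{
  for (size_t i = 0; i < m; i++)
    for (size_t j = 0; j < n; j++)
      ELEM (c, m, i, j) = row_col_dot (a, m, b, k, i, j);
}
```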
--
tkoenig at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |pault at gcc dot gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: paul dot richard dot thomas at gmail dot com @ 2010-06-05 6:55 UTC (permalink / raw)
To: gcc-bugs
------- Comment #10 from paul dot richard dot thomas at gmail dot com 2010-06-05 06:55 -------
Subject: Re: inline matmul for small matrix sizes
Dear Thomas,
> The preferred way would therefore be to state the rank 2 * rank 2 problem as
>
> do i=1,m
> do j=1,n
> c(i,j) = sum(a(i,:) * b(:,j))
> end do
> end do
>
> with the inner dot product borrowed using the scalarizer (borrowing
> from dot_product), and the outer loops using either hand-crafted
> TREE code or calling the DO translation.
Yes, that is reasonable. Alternatively, you could borrow a little trick
that I used for allocatable components (trans-array.c:6020):
  gfc_add_expr_to_block (&loopbody, tmp);
  /* Build the loop and return. */
  gfc_init_loopinfo (&loop);
  loop.dimen = 1;
  loop.from[0] = gfc_index_zero_node;
  loop.loopvar[0] = index;
  loop.to[0] = nelems;
  gfc_trans_scalarizing_loops (&loop, &loopbody);
  gfc_add_block_to_block (&fnblock, &loop.pre);
  tmp = gfc_finish_block (&fnblock);
  if (null_cond != NULL_TREE)
    tmp = build3_v (COND_EXPR, null_cond, tmp,
                    build_empty_stmt (input_location));
Here tmp in the first line is the expression or finished block within
the loop. Earlier on, you will find an expression involving the
index.
Cheers
Paul
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: tkoenig at gcc dot gnu dot org @ 2010-06-05 8:49 UTC (permalink / raw)
To: gcc-bugs
------- Comment #11 from tkoenig at gcc dot gnu dot org 2010-06-05 08:49 -------
Dear Paul,
thanks a lot for your helpful comments.
Just one thing: I currently don't see how to refer to multiple
indices for an array element.
In the code you pointed out, this is done with a single variable,
recursing for multiple dimensions (or that is how I read the code,
which might not be correct).
Any more hints?
Thomas
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: mikael at gcc dot gnu dot org @ 2010-06-05 9:31 UTC (permalink / raw)
To: gcc-bugs
------- Comment #12 from mikael at gcc dot gnu dot org 2010-06-05 09:31 -------
(In reply to comment #9)
> I have thought a little bit about this, and the problem is
> a bit daunting ;-) Of course, this is at least partly because
> my experience with the scalarizer is close to non-existant, but you
> have to learn sometime.
>
> It seems that the functions for scalarizing do not help a lot
> here, because (for example) we need three nested loops for implementing
> the case where a and b are of rank 2.
>
> The preferred way would therefore be to state the rank 2 * rank 2 problem as
>
> do i=1,m
> do j=1,n
> c(i,j) = sum(a(i,:) * b(:,j))
> end do
> end do
>
> with the inner dot product borrowed using the scalarizer (borrowing
> from dot_product), and the outer loops using either hand-crafted
> TREE code or calling the DO translation.
>
> Comments? Is this reasonable?
>
The downside is that you can't use the matmul result directly in an expression.
You will need a temporary.
I'm working on nested scalarization loops for the sum intrinsic (PR 43829);
inlining matmul should be straightforward after that.
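One way to read the temporary mentioned above, sketched in C under assumed names (product3 and add_matmul3 are hypothetical): for an expression such as d = x + matmul(a, b), the loop-nest formulation first materializes the whole product, and only then evaluates the surrounding expression element by element. 3x3 column-major matrices are assumed.

```c
#define N 3

/* c = a * b for 3x3 column-major matrices; c must not overlap a or b.  */
static void
product3 (double c[N * N], const double a[N * N], const double b[N * N])
{
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      {
        double s = 0.0;
        for (int l = 0; l < N; l++)
          s += a[l * N + i] * b[j * N + l];
        c[j * N + i] = s;
      }
}

/* d = x + matmul (a, b).  The product is stored in a temporary before
   the addition, so d may overlap a, b or x.  */
void
add_matmul3 (double d[N * N], const double x[N * N],
             const double a[N * N], const double b[N * N])
{
  double tmp[N * N];

  product3 (tmp, a, b);                 /* materialize the product first */
  for (int e = 0; e < N * N; e++)
    d[e] = x[e] + tmp[e];               /* then evaluate the expression */
}
```

A scalarizer that handles the nested loops itself could, in principle, fold the addition into the loop that produces each element and avoid the extra array.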
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
* [Bug fortran/37131] inline matmul for small matrix sizes
From: tkoenig at netcologne dot de @ 2010-06-05 18:27 UTC (permalink / raw)
To: gcc-bugs
------- Comment #13 from tkoenig at netcologne dot de 2010-06-05 18:27 -------
Subject: Re: inline matmul for small matrix sizes
mikael at gcc dot gnu dot org wrote:
> I'm working on nested scalarization loops for the sum intrinsic
> (pr43829) ;
> inlining matmul should be straightforward after that.
I agree that this is the best approach - teach the scalarizer about
multidimensional arrays.
I'll hold further work on this PR until your work on this is finished.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131