[Bug fortran/37131] New: inline matmul for small matrix sizes

public inbox for gcc-bugs@sourceware.org

* [Bug fortran/37131]  New: inline matmul for small matrix sizes
@ 2008-08-15 19:24 tkoenig at gcc dot gnu dot org
  2008-08-16 22:56 ` [Bug fortran/37131] " pinskia at gcc dot gnu dot org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: tkoenig at gcc dot gnu dot org @ 2008-08-15 19:24 UTC (permalink / raw)
  To: gcc-bugs

Ouch.

Calling matmul is about 20 times slower than the equivalent inline code
for this simple test case on my computer.

$ cat foo.f90
program main
  real, dimension(3,3) :: a,b,c
  call random_number(a)
  call random_number(b)
  do i=1,10**8
    c = matmul(a,b)
    a(1,1) = a(1,1) + b(1,1) - c(1,1)
  end do
  print *,c
end program main

$ gfortran -O3 foo.f90
$ time ./a.out
  0.34224379      0.27477881      0.48155165      0.76788843      0.65491939   
   1.2103429      0.38770726      0.38460296      0.87301219

real    0m20.733s
user    0m19.585s
sys     0m0.000s
$ cat bar.f90
program main
  real, dimension(3,3) :: a,b,c
  call random_number(a)
  call random_number(b)
  do i=1,10**8
    forall (i=1:3)
      forall (j=1:3)
        c(i,j) = sum(a(i,:) * b(:,j))
      end forall
    end forall
    a(1,1) = a(1,1) + b(1,1) - c(1,1)
  end do
  print *,c
end program main

$ gfortran -O3 bar.f90
$ time ./a.out
  0.34224379      0.27477881      0.48155165      0.76788843      0.65491939   
   1.2103429      0.38770726      0.38460296      0.87301219

real    0m1.075s
user    0m1.060s
sys     0m0.000s
$
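
For readers who want to confirm that the two programs above compute the
same product, here is a minimal sketch. The program and variable names
(check_inline_matmul, c_ref, c_inl) are illustrative assumptions and are
not part of the original report. It uses explicit DO loops instead of
FORALL, and compares with a tolerance because the summation order may
differ between the library routine and the inline code.

```fortran
program check_inline_matmul
  implicit none
  real, dimension(3,3) :: a, b, c_ref, c_inl
  integer :: i, j

  call random_number(a)
  call random_number(b)

  ! Library call, as in foo.f90.
  c_ref = matmul(a, b)

  ! Inline formulation, as in bar.f90, written with explicit DO loops.
  do j = 1, 3
    do i = 1, 3
      c_inl(i,j) = sum(a(i,:) * b(:,j))
    end do
  end do

  ! Summation order may differ, so compare with a tolerance.
  if (maxval(abs(c_ref - c_inl)) > 1.0e-5) error stop "results differ"
  print *, "max abs difference:", maxval(abs(c_ref - c_inl))
end program check_inline_matmul
```

This only checks equivalence; it says nothing about the timings reported
above.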


-- 
           Summary: inline matmul for small matrix sizes
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: fortran
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tkoenig at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
@ 2008-08-16 22:56 ` pinskia at gcc dot gnu dot org
  2008-08-23 13:20 ` tkoenig at gcc dot gnu dot org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2008-08-16 22:56 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from pinskia at gcc dot gnu dot org  2008-08-16 22:55 -------
Confirmed.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2008-08-16 22:55:22
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
  2008-08-16 22:56 ` [Bug fortran/37131] " pinskia at gcc dot gnu dot org
@ 2008-08-23 13:20 ` tkoenig at gcc dot gnu dot org
  2008-11-29 16:20 ` burnus at gcc dot gnu dot org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tkoenig at gcc dot gnu dot org @ 2008-08-23 13:20 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from tkoenig at gcc dot gnu dot org  2008-08-23 13:18 -------
Created an attachment (id=16134)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16134&action=view)
test case

Actually, the test cases were a bit unfair, because 
the middle-end decided not to calculate the
values of c that were never used.

Attached is a better test case.

Timings on x86_64-unknown-linux-gnu:

 matmul =    12.840802      s
 subroutine without explicit interface:   0.88805580      s
 subroutine with explicit interface:   0.87605572      s
 inline with sum   2.0721283      s

While inlining is still much better than matmul, a hand-rolled
3*3 subroutine is much faster overall, which I find a bit surprising.
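
The attachment itself is not reproduced in this thread, so the following
is only a guess at what a hand-rolled 3*3 subroutine with an explicit
interface might look like. The names mm3 and matmul33 are hypothetical.
Placing the routine in a module is one way to give it an explicit
interface. The program prints all of c so that no element can be
discarded as unused.

```fortran
module mm3
  implicit none
contains
  ! Hypothetical hand-rolled 3x3 product; the module provides the
  ! explicit interface.
  subroutine matmul33(a, b, c)
    real, dimension(3,3), intent(in)  :: a, b
    real, dimension(3,3), intent(out) :: c
    integer :: j
    ! Column j of c is a linear combination of the columns of a.
    do j = 1, 3
      c(:,j) = a(:,1)*b(1,j) + a(:,2)*b(2,j) + a(:,3)*b(3,j)
    end do
  end subroutine matmul33
end module mm3

program use_mm3
  use mm3, only: matmul33
  implicit none
  real, dimension(3,3) :: a, b, c

  call random_number(a)
  call random_number(b)
  call matmul33(a, b, c)

  ! Print every element so that all of c counts as used.
  print *, c
  if (maxval(abs(c - matmul(a, b))) > 1.0e-5) error stop "mismatch"
end program use_mm3
```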


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
  2008-08-16 22:56 ` [Bug fortran/37131] " pinskia at gcc dot gnu dot org
  2008-08-23 13:20 ` tkoenig at gcc dot gnu dot org
@ 2008-11-29 16:20 ` burnus at gcc dot gnu dot org
  2008-12-04 20:00 ` tkoenig at gcc dot gnu dot org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu dot org @ 2008-11-29 16:20 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from burnus at gcc dot gnu dot org  2008-11-29 16:18 -------
(In reply to comment #4)
> Timings on x86_64-unknown-linux-gnu:
>  matmul =    12.840802      s
>  subroutine without explicit interface:   0.88805580      s
>  subroutine with explicit interface:   0.87605572      s
>  inline with sum   2.0721283      s

With -O2 I get:
 matmul =    10.724670      s
 subroutine without explicit interface:    7.7324829      s
 subroutine with explicit interface:    7.8684921      s
 inline with sum   7.7684860      s

With -O3 -ffast-math -march=native on AMD64 I get the following:
 matmul =    10.656666      s
 subroutine without explicit interface:   0.91205692      s
 subroutine with explicit interface:   0.82805157      s
 inline with sum   2.4521542      s

For comparison with ifort ("loop was vectorized" in lines 40, 41, 43):
 matmul =    2.660166      s
 subroutine without explicit interface:   0.0000000E+00  s
 subroutine with explicit interface:   0.0000000E+00  s
 inline with sum  0.0000000E+00  s
and openf95 -O3:
 matmul =  1.26807904  s  (-O2: 28.2537651  s)
 subroutine without explicit interface:  1.07606697  s (4.07225418)
 subroutine with explicit interface:  1.05206609  s (4.08025742)
 inline with sum 0.748046875  s (3.7522316)


-- 

burnus at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
                   ` (2 preceding siblings ...)
  2008-11-29 16:20 ` burnus at gcc dot gnu dot org
@ 2008-12-04 20:00 ` tkoenig at gcc dot gnu dot org
  2008-12-09 22:13 ` tkoenig at gcc dot gnu dot org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tkoenig at gcc dot gnu dot org @ 2008-12-04 20:00 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from tkoenig at gcc dot gnu dot org  2008-12-04 19:58 -------
(In reply to comment #5)

> For comparison with ifort ("loop was vectorized" in lines 40, 41, 43):
>  matmul =    2.660166      s
>  subroutine without explicit interface:   0.0000000E+00  s
>  subroutine with explicit interface:   0.0000000E+00  s
>  inline with sum  0.0000000E+00  s

ifort detects that the call to invalidate doesn't actually invalidate
anything and so just removes the whole matmul stuff.

Intelligent, but bad for benchmarks :-)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
                   ` (3 preceding siblings ...)
  2008-12-04 20:00 ` tkoenig at gcc dot gnu dot org
@ 2008-12-09 22:13 ` tkoenig at gcc dot gnu dot org
  2010-05-14  9:15 ` tkoenig at gcc dot gnu dot org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tkoenig at gcc dot gnu dot org @ 2008-12-09 22:13 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from tkoenig at gcc dot gnu dot org  2008-12-09 22:12 -------
Created an attachment (id=16866)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16866&action=view)
better test case

Thou shalt use IMPLICIT none, especially if you think you don't need it...

Here's a better test case, which actually tests the right thing.

Timings with 4.4 on i686-pc-linux-gnu:

 matmul =    15.596974      s
 subroutine with explicit interface:    3.6842318      s
 unrolled subroutine with explicit interface:    3.3522091      s
 inline with sum   3.3602085      s


-- 

tkoenig at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #16134|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
                   ` (4 preceding siblings ...)
  2008-12-09 22:13 ` tkoenig at gcc dot gnu dot org
@ 2010-05-14  9:15 ` tkoenig at gcc dot gnu dot org
  2010-06-04 22:32 ` tkoenig at gcc dot gnu dot org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tkoenig at gcc dot gnu dot org @ 2010-05-14  9:15 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from tkoenig at gcc dot gnu dot org  2010-05-14 09:15 -------
New timings, on x86_64-unknown-linux-gnu.  I split off the "invalidate"
subroutine to make sure the optimizers don't optimize this out:

ig25@linux-fd1f:/tmp> gfortran -O3 matmul.f90 invalidate.f90
ig25@linux-fd1f:/tmp> time ./a.out
 matmul =    11.100311      s
 subroutine with explicit interface:    2.0216932      s
 unrolled subroutine with explicit interface:    1.9317064      s
 inline with sum   1.9087105      s

real    0m16.971s
user    0m16.959s
sys     0m0.005s
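
The files matmul.f90 and invalidate.f90 are not shown in this thread.
As a rough sketch of the idea only, a timing loop with an opaque
"invalidate" call could look like the code below. The program name
time_matmul, the iteration count, and the body of invalidate are
assumptions. Both units are shown in one block for brevity; as described
above, invalidate would have to live in its own file for the optimizer
not to see through it.

```fortran
! In the real setup this would be invalidate.f90, compiled separately,
! so that the optimizer cannot tell that the arguments are untouched.
subroutine invalidate(a, b, c)
  implicit none
  real, dimension(3,3) :: a, b, c
end subroutine invalidate

program time_matmul
  implicit none
  integer, parameter :: n = 100000   ! assumed iteration count
  real, dimension(3,3) :: a, b, c
  real :: t1, t2
  integer :: i
  external :: invalidate

  call random_number(a)
  call random_number(b)

  call cpu_time(t1)
  do i = 1, n
    c = matmul(a, b)
    ! The call may, as far as the caller knows, read c and modify a and b,
    ! so the product has to be recomputed in every iteration.
    call invalidate(a, b, c)
  end do
  call cpu_time(t2)

  print *, "matmul = ", t2 - t1, " s"
end program time_matmul
```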


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
                   ` (5 preceding siblings ...)
  2010-05-14  9:15 ` tkoenig at gcc dot gnu dot org
@ 2010-06-04 22:32 ` tkoenig at gcc dot gnu dot org
  2010-06-05  6:55 ` paul dot richard dot thomas at gmail dot com
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tkoenig at gcc dot gnu dot org @ 2010-06-04 22:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from tkoenig at gcc dot gnu dot org  2010-06-04 22:31 -------
I have thought a little bit about this, and the problem is
a bit daunting ;-)  Of course, this is at least partly because
my experience with the scalarizer is close to non-existent, but you
have to learn sometime.

It seems that the functions for scalarizing do not help a lot
here, because (for example) we need three nested loops for implementing
the case where a and b are of rank 2.

The preferred way would therefore be to state the rank 2 * rank 2 problem as

  do i=1,m
     do j=1,n
        c(i,j) = sum(a(i,:) * b(:,j))
     end do
  end do

with the inner dot product borrowed using the scalarizer (borrowing
from dot_product), and the outer loops using either hand-crafted
TREE code or calling the DO translation.

Comments?  Is this reasonable?
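
At the source level, the proposed lowering can be checked for
non-square shapes with a small sketch like the one below. The extents
m, k and n and the program name are arbitrary choices made for
illustration. The dot_product call stands in for the part that would be
handed to the scalarizer, and the two DO loops stand in for the outer
loops.

```fortran
program loop_nest_matmul
  implicit none
  integer, parameter :: m = 4, k = 3, n = 2   ! arbitrary extents
  real :: a(m,k), b(k,n), c(m,n)
  integer :: i, j

  call random_number(a)
  call random_number(b)

  do i = 1, m
    do j = 1, n
      ! The reduction over the shared dimension is the third loop.
      c(i,j) = dot_product(a(i,:), b(:,j))
    end do
  end do

  if (any(shape(c) /= shape(matmul(a, b)))) error stop "shape mismatch"
  if (maxval(abs(c - matmul(a, b))) > 1.0e-5) error stop "value mismatch"
  print *, "loop nest agrees with matmul for shape", shape(c)
end program loop_nest_matmul
```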


-- 

tkoenig at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pault at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
                   ` (6 preceding siblings ...)
  2010-06-04 22:32 ` tkoenig at gcc dot gnu dot org
@ 2010-06-05  6:55 ` paul dot richard dot thomas at gmail dot com
  2010-06-05  8:49 ` tkoenig at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: paul dot richard dot thomas at gmail dot com @ 2010-06-05  6:55 UTC (permalink / raw)
  To: gcc-bugs




------- Comment #10 from paul dot richard dot thomas at gmail dot com  2010-06-05 06:55 -------
Subject: Re:  inline matmul for small matrix sizes

Dear Thomas,


> The preferred way would therefore be to state the rank 2 * rank 2 problem as
>
>  do i=1,m
>     do j=1,n
>        c(i,j) = sum(a(i,:) * b(:,j))
>     end do
>  end do
>
> with the inner dot product borrowed using the scalarizer (borrowing
> from dot_product), and the outer loops using either hand-crafted
> TREE code or calling the DO translation.

Yes, that is reasonable.  Otherwise, you could borrow a little trick
that I used in allocatable components: trans-array.c:6020

      gfc_add_expr_to_block (&loopbody, tmp);

      /* Build the loop and return.  */
      gfc_init_loopinfo (&loop);
      loop.dimen = 1;
      loop.from[0] = gfc_index_zero_node;
      loop.loopvar[0] = index;
      loop.to[0] = nelems;
      gfc_trans_scalarizing_loops (&loop, &loopbody);
      gfc_add_block_to_block (&fnblock, &loop.pre);

      tmp = gfc_finish_block (&fnblock);
      if (null_cond != NULL_TREE)
        tmp = build3_v (COND_EXPR, null_cond, tmp,
                        build_empty_stmt (input_location));

Here tmp in the first line is the expression or finished block within
the loop.  Earlier on, you will find an expression involving the
index.

Cheers

Paul


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
                   ` (7 preceding siblings ...)
  2010-06-05  6:55 ` paul dot richard dot thomas at gmail dot com
@ 2010-06-05  8:49 ` tkoenig at gcc dot gnu dot org
  2010-06-05  9:31 ` mikael at gcc dot gnu dot org
  2010-06-05 18:27 ` tkoenig at netcologne dot de
  10 siblings, 0 replies; 12+ messages in thread
From: tkoenig at gcc dot gnu dot org @ 2010-06-05  8:49 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from tkoenig at gcc dot gnu dot org  2010-06-05 08:49 -------
Dear Paul,

thanks a lot for your helpful comments.

Just one thing:  I currently don't see how to refer to multiple
indices for an array element.

In the code you pointed out, this is done with a single variable,
recursing for multiple dimensions (or that is how I read the code,
which might not be correct).

Any more hints?

Thomas
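
As background to the question, the general idea of reaching every
element of a multi-dimensional array with a single loop variable can be
shown at the Fortran level. This is only an illustration of
column-major element ordering, not a description of what trans-array.c
does. All names are made up for the example.

```fortran
program linear_index
  implicit none
  integer, parameter :: m = 3, n = 4
  integer :: a(m,n), flat(m*n)
  integer :: idx, i, j

  ! Fill a(i,j) with a value that encodes both indices.
  do j = 1, n
    do i = 1, m
      a(i,j) = 10*i + j
    end do
  end do

  ! reshape keeps Fortran array element order (first index fastest).
  flat = reshape(a, [m*n])

  ! One zero-based index recovers both subscripts.
  do idx = 0, m*n - 1
    i = mod(idx, m) + 1
    j = idx / m + 1
    if (flat(idx+1) /= a(i,j)) error stop "index mapping is wrong"
  end do
  print *, "single zero-based index covers all", m*n, "elements"
end program linear_index
```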


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
                   ` (8 preceding siblings ...)
  2010-06-05  8:49 ` tkoenig at gcc dot gnu dot org
@ 2010-06-05  9:31 ` mikael at gcc dot gnu dot org
  2010-06-05 18:27 ` tkoenig at netcologne dot de
  10 siblings, 0 replies; 12+ messages in thread
From: mikael at gcc dot gnu dot org @ 2010-06-05  9:31 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #12 from mikael at gcc dot gnu dot org  2010-06-05 09:31 -------
(In reply to comment #9)
> I have thought a little bit about this, and the problem is
> a bit daunting ;-)  Of course, this is at least partly because
> my experience with the scalarizer is close to non-existent, but you
> have to learn sometime.
> 
> It seems that the functions for scalarizing do not help a lot
> here, because (for example) we need three nested loops for implementing
> the case where a and b are of rank 2.
> 
> The preferred way would therefore be to state the rank 2 * rank 2 problem as
> 
>   do i=1,m
>      do j=1,n
>         c(i,j) = sum(a(i,:) * b(:,j))
>      end do
>   end do
> 
> with the inner dot product borrowed using the scalarizer (borrowing
> from dot_product), and the outer loops using either hand-crafted
> TREE code or calling the DO translation.
> 
> Comments?  Is this reasonable?
> 

The downside is that you can't use the matmul result directly in an
expression.  You will need a temporary.

I'm working on nested scalarization loops for the sum intrinsic (pr43829);
inlining matmul should be straightforward after that.
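
To make the point about the temporary concrete, here is a source-level
sketch. The expression, the array names, and the program name are
assumptions chosen for illustration. With the loop-nest formulation, an
expression that uses the matmul result has to be split into two steps,
with the product stored in between.

```fortran
program matmul_temporary
  implicit none
  integer, parameter :: n = 3
  real :: a(n,n), b(n,n), e(n,n), d(n,n), d_ref(n,n), tmp(n,n)
  integer :: i, j

  call random_number(a)
  call random_number(b)
  call random_number(e)

  ! What the user writes: the matmul result is used inside an expression.
  d_ref = matmul(a, b) + 2.0 * e

  ! Step 1: the loop nest writes the product into a temporary.
  do j = 1, n
    do i = 1, n
      tmp(i,j) = sum(a(i,:) * b(:,j))
    end do
  end do

  ! Step 2: the rest of the expression is an ordinary elementwise assignment.
  d = tmp + 2.0 * e

  if (maxval(abs(d - d_ref)) > 1.0e-5) error stop "mismatch"
  print *, "two-step evaluation matches the direct expression"
end program matmul_temporary
```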


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131



* [Bug fortran/37131] inline matmul for small matrix sizes
  2008-08-15 19:24 [Bug fortran/37131] New: inline matmul for small matrix sizes tkoenig at gcc dot gnu dot org
                   ` (9 preceding siblings ...)
  2010-06-05  9:31 ` mikael at gcc dot gnu dot org
@ 2010-06-05 18:27 ` tkoenig at netcologne dot de
  10 siblings, 0 replies; 12+ messages in thread
From: tkoenig at netcologne dot de @ 2010-06-05 18:27 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #13 from tkoenig at netcologne dot de  2010-06-05 18:27 -------
Subject: Re:  inline matmul for small matrix sizes

mikael at gcc dot gnu dot org wrote:

> I'm working on nested scalarization loops for the sum intrinsic
> (pr43829) ;
> inlining matmul should be straightforward after that. 

I agree that this is the best approach - teach the scalarizer about
multidimensional arrays.

I'll hold further work on this PR until your work on this is finished.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131


