[Bug fortran/106565] New: Using a transposed matrix in matmul (GCC-10.3.0) is very slow

* [Bug fortran/106565] New: Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: quanhua.liu at noaa dot gov @ 2022-08-08 20:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

            Bug ID: 106565
           Summary: Using a transposed matrix in matmul (GCC-10.3.0) is
                    very slow
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: quanhua.liu at noaa dot gov
  Target Milestone: ---

gcc version 10.3.0 (GCC), Linux

Method (2):
    BB = transpose(B)
    C = matmul(A, BB)
is 5 times faster than method (1):
    C = matmul(A, transpose(B))

ifort 19 doesn't have the problem.


      PROGRAM test_matrixCal
! ------------------------------------------------------
! This code tests
!  (1)   C = matmul(A, transpose(B))
!  against
!  (2)   BB = transpose(B)
!        C = matmul(A, BB)
!  (2) is 5 times faster than (1)
!   gfortran -O3 test_matrixCal.f90
!   time a.out
! ------------------------------------------------------
      INTEGER, PARAMETER :: m = 200, n = 300, nn = 150
      REAL :: A(m,n), B(nn,n), C(m,nn), BB(n,nn)
      INTEGER :: i, j, k, L
      A(:,:) = 3.0
      B(:,:) = 1.7

      iterative_loop: DO L = 1, 1000
         A(:,10) = A(:,10) + 0.0001*L
!         C = matmul(A, transpose(B))
         BB = transpose(B)
         C = matmul(A, BB)
         IF (mod(L,50) == 0) print *, L, C(10,20)
      END DO iterative_loop
      STOP
      END PROGRAM test_matrixCal

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: rguenth at gcc dot gnu.org @ 2022-08-09  7:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-08-09
      Known to fail|                            |12.1.0
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
            Version|unknown                     |10.3.0

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed also with gfortran 12.  The issue is that with the combined
matmul+transpose we invoke matmul with an array descriptor representing the
transpose operation which results in suboptimal memory access patterns.

Can you check whether ifort does the transpose separately or whether its
matmul library routine simply special-cases the situation?
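
To make the access-pattern point concrete, here is a minimal sketch. It is not libgfortran's matmul; the program name and the tolerance are made up for illustration, and only the array shapes come from the test case. Fortran arrays are column-major, so reading B through its transpose steps through memory with a stride of nn elements, while the explicit copy BB is read contiguously:

```fortran
program transpose_stride_sketch
  implicit none
  integer, parameter :: m = 200, n = 300, nn = 150
  real :: a(m,n), b(nn,n), bb(n,nn)
  real :: c1(m,nn), c2(m,nn)
  integer :: i, j, k

  call random_number(a)
  call random_number(b)
  bb = transpose(b)            ! explicit copy: bb(k,j) == b(j,k)

  c1 = 0.0
  c2 = 0.0
  do j = 1, nn
     do i = 1, m
        do k = 1, n
           ! b(j,k): each step in k jumps nn elements in memory (strided).
           c1(i,j) = c1(i,j) + a(i,k) * b(j,k)
           ! bb(k,j): each step in k moves to the adjacent element.
           c2(i,j) = c2(i,j) + a(i,k) * bb(k,j)
        end do
     end do
  end do

  ! Both loops compute the same product; allow for rounding differences
  ! (the 1.0e-3 relative tolerance is an arbitrary illustrative choice).
  if (maxval(abs(c1 - c2)) > 1.0e-3 * maxval(abs(c1))) then
     error stop "strided and contiguous results differ"
  end if
  print *, "max |c1 - matmul(a, bb)| =", maxval(abs(c1 - matmul(a, bb)))
end program transpose_stride_sketch
```

The two accumulations are mathematically identical; only the order in which memory is touched differs, which is the effect described above.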

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: quanhua.liu at noaa dot gov @ 2022-08-09 14:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

--- Comment #2 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
I modified the application code (see below) and use the "method" as a control
variable from command line.
I use the same code for both gfortran 10.3.0 and ifort 19.0.5.281
  gfortran -O3 matrixCal.f90
  time a.out  1
  time a.out  2
  ifort -O3 matrixCal.f90
  time a.out  1
  time a.out  2
where method 1, C = matmul(A, transpose(B) )
             method 2, BB = transpose(B),  C = matmul(A, BB)
  The timing is given in the table below.
As you can see, with gfortran, method '2' is about 8 times faster than method
'1' (0.79 s vs. 6.28 s).
With ifort, the two methods take nearly the same time; '1' is slightly faster,
possibly because '2' has to copy B to BB.

Timing (real time, seconds)

compiler      gfortran            ifort
method        1       2           1       2
real (s)      6.28    0.79        0.80    0.83
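
The modified source was not included in the message. A driver of the kind described, with the method chosen on the command line, might look like the sketch below. The program name, the default method, and the error message are assumptions; the array sizes and the loop body are taken from the original test program:

```fortran
program matrixcal_sketch
  implicit none
  integer, parameter :: m = 200, n = 300, nn = 150
  real :: a(m,n), b(nn,n), c(m,nn), bb(n,nn)
  integer :: l, method, ios
  character(len=16) :: arg

  ! Read the method (1 or 2) from the first command-line argument.
  ! Defaulting to method 1 is an assumption of this sketch.
  method = 1
  if (command_argument_count() >= 1) then
     call get_command_argument(1, arg)
     read (arg, *, iostat=ios) method
     if (ios /= 0) error stop "usage: a.out [1|2]"
     if (method /= 1 .and. method /= 2) error stop "usage: a.out [1|2]"
  end if

  a = 3.0
  b = 1.7

  do l = 1, 1000
     a(:,10) = a(:,10) + 0.0001*l
     if (method == 1) then
        ! Method 1: transpose passed directly to matmul.
        c = matmul(a, transpose(b))
     else
        ! Method 2: explicit transposed copy, then matmul.
        bb = transpose(b)
        c = matmul(a, bb)
     end if
     if (mod(l,50) == 0) print *, l, c(10,20)
  end do
end program matrixcal_sketch
```

Built with gfortran -O3, it would be run as "time a.out 1" and "time a.out 2", matching the commands above.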

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: kargl at gcc dot gnu.org @ 2022-08-09 15:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

kargl at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kargl at gcc dot gnu.org

--- Comment #3 from kargl at gcc dot gnu.org ---

>       INTEGER, PARAMETER :: m = 200, n = 300, nn = 150
>       REAL :: A(m,n), B(nn,n), C(m,nn), BB(n,nn)
>       INTEGER :: i, j, k, L


If you are doing a problem of this size or larger, you want to use the
-fexternal-blas option and link in OpenBLAS.

I added timing code and replicated the loop so that both methods run in one go.

% gfcx -o z -O3 -march=native a.f90 && ./z
   1.16500998       1615.08594    
   5.32258606       1615.08020    
% gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lopenblas && ./z
   2.44668889       1615.08301    
   1.99379802       1615.08301
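
The timing code itself was not posted. The sketch below shows one way such a harness could be written; it is an assumption, not the code that produced the numbers above. The second column of the output is consistent with the final value of C(10,20): 1.7 * (299 * 3.0 + 53.05) is about 1615.085. The ordering (explicit-transpose loop first, matmul(A, transpose(B)) second) is inferred from the timings and is also an assumption:

```fortran
program timing_sketch
  implicit none
  integer, parameter :: m = 200, n = 300, nn = 150
  real :: a(m,n), b(nn,n), c(m,nn), bb(n,nn)
  real :: t0, t1
  integer :: l

  b = 1.7

  ! First loop: explicit transpose into bb, then matmul (method 2).
  a = 3.0
  call cpu_time(t0)
  do l = 1, 1000
     a(:,10) = a(:,10) + 0.0001*l
     bb = transpose(b)
     c = matmul(a, bb)
  end do
  call cpu_time(t1)
  print *, t1 - t0, c(10,20)   ! elapsed CPU seconds, final C(10,20)

  ! Second loop: transpose passed directly to matmul (method 1).
  a = 3.0                      ! reset so both loops see the same input
  call cpu_time(t0)
  do l = 1, 1000
     a(:,10) = a(:,10) + 0.0001*l
     c = matmul(a, transpose(b))
  end do
  call cpu_time(t1)
  print *, t1 - t0, c(10,20)
end program timing_sketch
```

cpu_time reports processor time; system_clock would be the choice if wall-clock time is wanted instead.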

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: kargl at gcc dot gnu.org @ 2022-08-09 15:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

kargl at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P4

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: quanhua.liu at noaa dot gov @ 2022-08-09 17:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

--- Comment #4 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
Using 
gfortran -O3 -fexternal-blas -L/..... -lblas testmatrixCal.f90
time a.out  1
real:  6.14 (s)
time a.out  2
real: 5.41

It is 6 times slower than
  BB = transpose(B)
  C = matmul(A, BB)

ifort doesn't have the problem.

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: quanhua.liu at noaa dot gov @ 2022-08-09 17:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

--- Comment #5 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
Hi Richard,

Using -fexternal-blas with gfortran v10.3.0 is much slower than
method 2:
   BB = transpose(B)
   C = matmul(A, BB)

How about on your machine?

Thanks,

Quanhua Liu

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: sgk at troutmask dot apl.washington.edu @ 2022-08-09 17:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

--- Comment #6 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
On Tue, Aug 09, 2022 at 05:14:16PM +0000, quanhua.liu at noaa dot gov wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
> 
> --- Comment #4 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
> Using 
> gfortran -O3 -fexternal-blas -L/..... -lblas testmatrixCal.f90

Which BLAS are you using?  If you are using BLAS from
Netlib, then of course you'll likely get poor results
as the Netlib BLAS is not tuned. 

I specifically wrote **** use OpenBLAS ****

OpenBLAS is likely tuned for whatever hardware you have.

% gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lopenblas \
   -fdump-tree-optimized && ./z
   2.44969702       1615.08301    
   2.00995278       1615.08301    

The use of matmul(..., transpose()) is the fastest on an AMD FX(tm)-8350:

% grep gemm z-a.f90.252t.optimized 
  sgemm (&"N"[1]{lb: 1 sz: 1}, &"N"[1]{lb: 1 sz: 1}, &C.4300, &C.4301, &C.4302,
&C.4303, &a, &C.4304, &bb, &C.4305, &C.4306, &c, &C.4307, 1, 1);
  sgemm (&"N"[1]{lb: 1 sz: 1}, &"T"[1]{lb: 1 sz: 1}, &C.4379, &C.4380, &C.4381,
&C.4382, &a, &C.4383, &b, &C.4384, &C.4385, &c, &C.4386, 1, 1);
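
In the dump, the second call passes b itself with the "T" flag, so no transposed copy is needed. A hand-written equivalent is sketched below. The argument values are derived from the array shapes in the test program, not copied from the dump, and the program name is made up. It must be linked against a BLAS library (for example with -lopenblas, as above):

```fortran
program sgemm_transpose_sketch
  implicit none
  integer, parameter :: m = 200, n = 300, nn = 150
  real :: a(m,n), b(nn,n), c(m,nn), c_ref(m,nn)
  external :: sgemm            ! single-precision GEMM from the linked BLAS

  call random_number(a)
  call random_number(b)

  ! C := 1.0 * A * B**T + 0.0 * C
  ! TRANSA = 'N', TRANSB = 'T'; M = m, N = nn, K = n;
  ! leading dimensions: LDA = m, LDB = nn, LDC = m.
  call sgemm('N', 'T', m, nn, n, 1.0, a, m, b, nn, 0.0, c, m)

  ! Compare against the intrinsic for a sanity check.
  c_ref = matmul(a, transpose(b))
  print *, "max abs difference:", maxval(abs(c - c_ref))
end program sgemm_transpose_sketch
```

With TRANSB = 'T', the BLAS routine reads B in its stored nn-by-n layout and handles the transposition internally.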

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: sgk at troutmask dot apl.washington.edu @ 2022-08-09 17:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

--- Comment #7 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
On Tue, Aug 09, 2022 at 05:17:57PM +0000, quanhua.liu at noaa dot gov wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
> 
> --- Comment #5 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
> Hi Richard,
> 
> Using -fexternal-blas for gfortran v10.3.0 is much slower than
> the method 2:
>    BB = transpose(B)
>    C = matmul(A, BB)
> 
> How about on your machine?
> 
> >
> > If you are doing a problem of this size or larger, you want to use the
> > -fexternal-blas option and link in OpenBLAS.


I wrote "and link in OpenBLAS".

> > I added timing code and replicated the loop to both in one go.
> >
> > % gfcx -o z -O3 -march=native a.f90 && ./z
> >     1.16500998       1615.08594
> >     5.32258606       1615.08020


> > % gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lopenblas && ./z
> >     2.44668889       1615.08301
> >     1.99379802       1615.08301

Method 1 is faster with OpenBLAS.

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: sgk at troutmask dot apl.washington.edu @ 2022-08-09 17:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

--- Comment #8 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
On Tue, Aug 09, 2022 at 05:51:51PM +0000, sgk at troutmask dot
apl.washington.edu wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
> 
> --- Comment #6 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
> On Tue, Aug 09, 2022 at 05:14:16PM +0000, quanhua.liu at noaa dot gov wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
> > 
> > --- Comment #4 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
> > Using 
> > gfortran -O3 -fexternal-blas -L/..... -lblas testmatrixCal.f90
> 
> Which BLAS are you using?  If you are using BLAS from
> Netlib, then of course you'll likely get poor results
> as the Netlib BLAS is not tuned. 
> 

Even Netlib BLAS is OK.

% gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lblas \
   -fdump-tree-optimized && ./z
   1.41149306       1615.08020    
   1.50036991       1615.08020

* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
From: quanhua.liu at noaa dot gov @ 2022-08-09 18:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565

--- Comment #9 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
Hi Richard,

It seems that I cannot add a comment online to the ticket.
I tried
    gfortran -o z -O3 -march=native test_matrixCal.f90 -fexternal-blas \
        -lblas -fdump-tree-optimized
    time a.out 1
and
    time a.out 2
Both are very slow (6 s, compared with the previous 0.8 s using method 2).
I don't know which BLAS is on my machine.

On your machine, could you help test
   BB = transpose(B)
   C = matmul(A,BB)
using
   gfortran -O3 test_matrixCal.f90
   time a.out  2
against
   C = matmul(A, transpose(B) )
using any options or BLAS?

The timing depends on the machine. It would be very helpful if you could
provide the timing for the two methods from your site.

Thank you!

Quanhua Liu
