* [Bug fortran/106565] Using a transposed matrix in matmul (GCC-10.3.0) is very slow
2022-08-08 20:04 [Bug fortran/106565] New: Using a transposed matrix in matmul (GCC-10.3.0) is very slow quanhua.liu at noaa dot gov
@ 2022-08-09 7:50 ` rguenth at gcc dot gnu.org
2022-08-09 14:01 ` quanhua.liu at noaa dot gov
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-08-09 7:50 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-08-09
      Known to fail|                            |12.1.0
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
            Version|unknown                     |10.3.0
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed also with gfortran 12. The issue is that with the combined
matmul+transpose we invoke matmul with an array descriptor representing the
transpose operation which results in suboptimal memory access patterns.
Can you check whether ifort does the transpose separately or whether its
matmul library routine simply special-cases the situation?
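To illustrate the access pattern described above, here is a minimal sketch of the product written as explicit loops. It is not libgfortran's actual matmul implementation; the program name and the `cref` array are illustrative assumptions, and the array shapes are the ones used in the reporter's test case. Element (k,j) of transpose(B) is B(j,k), and Fortran stores arrays column-major, so stepping k at fixed j walks through B with a stride of nn elements instead of contiguously.

```fortran
program stride_sketch
  implicit none
  ! Shapes taken from the test case in this report.
  integer, parameter :: m = 200, n = 300, nn = 150
  real :: a(m,n), b(nn,n), c(m,nn), cref(m,nn)
  integer :: i, j, k

  call random_number(a)
  call random_number(b)

  ! C = A * transpose(B) without forming the transpose:
  ! transpose(b)(k,j) is simply b(j,k).
  c = 0.0
  do j = 1, nn
    do k = 1, n
      ! b(j,k) and b(j,k+1) are nn elements apart in memory,
      ! so this k loop reads B with a non-unit stride.
      do i = 1, m
        c(i,j) = c(i,j) + a(i,k) * b(j,k)
      end do
    end do
  end do

  ! Compare against the intrinsic formulation from the report.
  cref = matmul(a, transpose(b))
  print *, 'max abs difference:', maxval(abs(c - cref))
end program stride_sketch
```

The printed difference should be small but need not be exactly zero, because the summation order may differ from the one the intrinsic uses.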
From: quanhua.liu at noaa dot gov @ 2022-08-09 14:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
--- Comment #2 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
I modified the application code (see below) so that "method" is a control
variable read from the command line.
I use the same code for both gfortran 10.3.0 and ifort 19.0.5.281
gfortran -O3 matrixCal.f90
time a.out 1
time a.out 2
ifort -O3 matrixCal.f90
time a.out 1
time a.out 2
where method 1 is C = matmul(A, transpose(B))
and method 2 is BB = transpose(B), C = matmul(A, BB)
The timing is given in the table below.
As you can see, using gfortran, method 2 is about 8 times faster than method 1
(6.28 s versus 0.79 s).
Using ifort, method 2 is very similar to method 1. Method 1 is slightly faster,
possibly because method 2 copies B to BB.
Timing ("real" from time, in seconds)
compiler   gfortran       ifort
method     1      2       1      2
real       6.28   0.79    0.80   0.83
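The original matrixCal.f90 is not reproduced in this thread, so the following is only a sketch of how a harness selecting the method from the command line might look. The program name, the repetition count `nrep`, and the `checksum` variable are assumptions made for illustration; only the array shapes and the two methods come from the report.

```fortran
program matrix_cal_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: m = 200, n = 300, nn = 150
  integer, parameter :: nrep = 1000        ! hypothetical repetition count
  real :: a(m,n), b(nn,n), c(m,nn), bb(n,nn)
  real :: checksum
  integer :: method, rep, ios
  integer(int64) :: t0, t1, rate
  character(len=16) :: arg

  ! Read the method (1 or 2) from the first command-line argument.
  method = 1
  if (command_argument_count() >= 1) then
    call get_command_argument(1, arg)
    read (arg, *, iostat=ios) method
    if (ios /= 0) method = 1
  end if

  call random_number(a)
  call random_number(b)
  checksum = 0.0

  call system_clock(t0, rate)
  do rep = 1, nrep
    if (method == 1) then
      ! Method 1: transpose passed directly to matmul.
      c = matmul(a, transpose(b))
    else
      ! Method 2: explicit transposed copy, then matmul.
      bb = transpose(b)
      c = matmul(a, bb)
    end if
    ! Use the result so the work cannot be discarded.
    checksum = checksum + c(1,1)
  end do
  call system_clock(t1)

  print '(a,i0,a,f0.3,a,es12.5)', 'method ', method, ': ', &
        real(t1 - t0) / real(rate), ' s, checksum ', checksum
end program matrix_cal_sketch
```

Running the resulting binary with argument 1 or 2 selects the method, mirroring the `time a.out 1` and `time a.out 2` invocations above. Timings depend on the machine, compiler version, and flags.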
From: kargl at gcc dot gnu.org @ 2022-08-09 15:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
kargl at gcc dot gnu.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kargl at gcc dot gnu.org
--- Comment #3 from kargl at gcc dot gnu.org ---
> INTEGER, PARAMETER :: m = 200, n = 300, nn = 150
> REAL :: A(m,n), B(nn,n), C(m,nn), BB(n,nn)
> INTEGER :: i, j, k, L
If you are doing a problem of this size or larger, you want to use the
-fexternal-blas option and link in OpenBLAS.
I added timing code and replicated the loop to do both in one go.
% gfcx -o z -O3 -march=native a.f90 && ./z
1.16500998 1615.08594
5.32258606 1615.08020
% gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lopenblas && ./z
2.44668889 1615.08301
1.99379802 1615.08301
From: kargl at gcc dot gnu.org @ 2022-08-09 15:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
kargl at gcc dot gnu.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P4
From: quanhua.liu at noaa dot gov @ 2022-08-09 17:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
--- Comment #4 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
Using
gfortran -O3 -fexternal-blas -L/..... -lblas testmatrixCal.f90
time a.out 1
real: 6.14 (s)
time a.out 2
real: 5.41
It is 6 times slower than
BB = transpose(B)
C = matmul(A, BB)
ifort doesn't have the problem.
From: quanhua.liu at noaa dot gov @ 2022-08-09 17:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
--- Comment #5 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
Hi Richard,
Using -fexternal-blas for gfortran v10.3.0 is much slower than
the method 2:
BB = transpose(B)
C = matmul(A, BB)
How about on your machine?
Thanks,
Quanhua Liu
From: sgk at troutmask dot apl.washington.edu @ 2022-08-09 17:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
--- Comment #6 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
On Tue, Aug 09, 2022 at 05:14:16PM +0000, quanhua.liu at noaa dot gov wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
>
> --- Comment #4 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
> Using
> gfortran -O3 -fexternal-blas -L/..... -lblas testmatrixCal.f90
Which BLAS are you using? If you are using BLAS from
Netlib, then of course you'll likely get poor results
as the Netlib BLAS is not tuned.
I specifically wrote **** use OpenBLAS ****
OpenBLAS is likely tuned for whatever hardware you have.
% gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lopenblas \
-fdump-tree-optimized && ./z
2.44969702 1615.08301
2.00995278 1615.08301
The use of matmul(..., transpose()) is the fastest on an AMD FX(tm)-8350.
% grep gemm z-a.f90.252t.optimized
sgemm (&"N"[1]{lb: 1 sz: 1}, &"N"[1]{lb: 1 sz: 1}, &C.4300, &C.4301, &C.4302,
&C.4303, &a, &C.4304, &bb, &C.4305, &C.4306, &c, &C.4307, 1, 1);
sgemm (&"N"[1]{lb: 1 sz: 1}, &"T"[1]{lb: 1 sz: 1}, &C.4379, &C.4380, &C.4381,
&C.4382, &a, &C.4383, &b, &C.4384, &C.4385, &c, &C.4386, 1, 1);
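For comparison, the second call in the dump corresponds roughly to invoking the BLAS routine by hand with TRANSB = 'T', so that no transposed copy of B is ever formed. The sketch below is an assumption-laden illustration rather than what the compiler emits verbatim: the program name and `cref` are made up, the shapes are those of the test case, and an external BLAS providing `sgemm` must be supplied at link time (for example with -lopenblas or -lblas, as in the commands above).

```fortran
program sgemm_transb_sketch
  implicit none
  integer, parameter :: m = 200, n = 300, nn = 150
  real :: a(m,n), b(nn,n), c(m,nn), cref(m,nn)
  external :: sgemm   ! BLAS level-3 routine, provided by the linked library

  call random_number(a)
  call random_number(b)

  ! C := 1.0 * A * B**T + 0.0 * C
  ! A is m x n (leading dimension m), B is nn x n (leading dimension nn),
  ! C is m x nn (leading dimension m); the inner dimension is n.
  call sgemm('N', 'T', m, nn, n, 1.0, a, m, b, nn, 0.0, c, m)

  ! Cross-check against the intrinsic formulation.
  cref = matmul(a, transpose(b))
  print *, 'max abs difference vs matmul:', maxval(abs(c - cref))
end program sgemm_transb_sketch
```

How fast this runs depends entirely on which BLAS library is linked.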
From: sgk at troutmask dot apl.washington.edu @ 2022-08-09 17:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
--- Comment #7 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
On Tue, Aug 09, 2022 at 05:17:57PM +0000, quanhua.liu at noaa dot gov wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
>
> --- Comment #5 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
> Hi Richard,
>
> Using -fexternal-blas for gfortran v10.3.0 is much slower than
> the method 2:
> BB = transpose(B)
> C = matmul(A, BB)
>
> How about on your machine?
>
> >
> > If you are doing a problem of this size or larger, you want to use the
> > -fexternal-blas option and link in OpenBLAS.
I wrote "and link in OpenBLAS".
> > I added timing code and replicated the loop to do both in one go.
> >
> > % gfcx -o z -O3 -march=native a.f90 && ./z
> > 1.16500998 1615.08594
> > 5.32258606 1615.08020
> > % gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lopenblas && ./z
> > 2.44668889 1615.08301
> > 1.99379802 1615.08301
Method 1 is faster with OpenBLAS.
From: sgk at troutmask dot apl.washington.edu @ 2022-08-09 17:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
--- Comment #8 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
On Tue, Aug 09, 2022 at 05:51:51PM +0000, sgk at troutmask dot apl.washington.edu wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
>
> --- Comment #6 from Steve Kargl <sgk at troutmask dot apl.washington.edu> ---
> On Tue, Aug 09, 2022 at 05:14:16PM +0000, quanhua.liu at noaa dot gov wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
> >
> > --- Comment #4 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
> > Using
> > gfortran -O3 -fexternal-blas -L/..... -lblas testmatrixCal.f90
>
> Which BLAS are you using? If you are using BLAS from
> Netlib, then of course you'll likely get poor results
> as the Netlib BLAS is not tuned.
>
Even netlib blas is ok.
gfcx -o z -O3 -march=native a.f90 -fexternal-blas -lblas -fdump-tree-optimized && ./z
1.41149306 1615.08020
1.50036991 1615.08020
From: quanhua.liu at noaa dot gov @ 2022-08-09 18:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106565
--- Comment #9 from Quanhua Liu <quanhua.liu at noaa dot gov> ---
Hi Richard,
It seems that I cannot add a comment online to the ticket.
I tried
gfortran -o z -O3 -march=native test_matrixCal.f90 -fexternal-blas -lblas -fdump-tree-optimized
time a.out 1
and
time a.out 2
Both are very slow (6 s, compared with the previous 0.8 s using method 2).
I don't know which BLAS is on my machine.
On your machine, could you test
BB = transpose(B)
C = matmul(A,BB)
using gfortran -O3 test_matrixCal.f90
time a.out 2
against
C = matmul(A, transpose(B) )
using any options or BLAS?
The timing depends on the machine. It would be very helpful if you could
provide the timings for the two methods on your machine.
Thank you!
Quanhua Liu