MATMUL broken with frontend optimization.

public inbox for fortran@gcc.gnu.org
 help / color / mirror / Atom feed

* MATMUL broken with frontend optimization.
@ 2021-03-18  7:48 Steve Kargl
  2021-03-18 12:35 ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Kargl @ 2021-03-18  7:48 UTC (permalink / raw)
  To: fortran

It seems that gfortran will inline MATMUL with optimization.
This  produce very poor performance.  In fact, gfortran will
inline MATMUL even if one specifies -fexternal-blas.  This is
very bad.

% cat a.f90
program main

   implicit none

   integer, parameter :: imax = 20000, jmax = 10000
   real, allocatable :: inVect(:), matrix(:,:), outVect(:)
   real :: start, finish

   allocate(invect(imax), matrix(imax,jmax), outvect(jmax))

   call random_number(inVect)
   call random_number(matrix)
        
   call cpu_time(start)
   outVect = matmul(inVect, matrix)
   call cpu_time(finish)

   print '("Time = ",f10.7," seconds. – First Value = ",f10.4)',finish-start,outVect(1)
end program main

% gfcx -o z -O0 a.f90 && ./z
Time =  0.2234111 seconds. – First Value =  4982.6362
% nm z | grep matmul
                 U _gfortran_matmul_r4@@GFORTRAN_8
% gfcx -o z -O1 a.f90 && ./z
Time =  0.3295890 seconds. – First Value =  4971.0962
% nm z | grep matmul
% gfcx -o z -O2 a.f90 && ./z
Time =  0.3299561 seconds. – First Value =  5025.4902
% nm z | grep matmul
% gfcx -o z -O2 -fexternal-blas a.f90 && ./z
Time =  0.3295580 seconds. – First Value =  5022.8291

This last one is definitely broken.  I did not link with
an external BLAS library.  Please fix before 11.1 is 
released.

-- 
Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18  7:48 MATMUL broken with frontend optimization Steve Kargl
@ 2021-03-18 12:35 ` Richard Biener
  2021-03-18 14:48   ` Tobias Burnus
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2021-03-18 12:35 UTC (permalink / raw)
  To: Steve Kargl; +Cc: fortran

On Thu, Mar 18, 2021 at 8:49 AM Steve Kargl via Fortran
<fortran@gcc.gnu.org> wrote:
>
> It seems that gfortran will inline MATMUL with optimization.
> This  produce very poor performance.  In fact, gfortran will
> inline MATMUL even if one specifies -fexternal-blas.  This is
> very bad.
>
> % cat a.f90
> program main
>
>    implicit none
>
>    integer, parameter :: imax = 20000, jmax = 10000
>    real, allocatable :: inVect(:), matrix(:,:), outVect(:)
>    real :: start, finish
>
>    allocate(invect(imax), matrix(imax,jmax), outvect(jmax))
>
>    call random_number(inVect)
>    call random_number(matrix)
>
>    call cpu_time(start)
>    outVect = matmul(inVect, matrix)
>    call cpu_time(finish)
>
>    print '("Time = ",f10.7," seconds. – First Value = ",f10.4)',finish-start,outVect(1)
> end program main
>
> % gfcx -o z -O0 a.f90 && ./z
> Time =  0.2234111 seconds. – First Value =  4982.6362
> % nm z | grep matmul
>                  U _gfortran_matmul_r4@@GFORTRAN_8
> % gfcx -o z -O1 a.f90 && ./z
> Time =  0.3295890 seconds. – First Value =  4971.0962
> % nm z | grep matmul
> % gfcx -o z -O2 a.f90 && ./z
> Time =  0.3299561 seconds. – First Value =  5025.4902
> % nm z | grep matmul
> % gfcx -o z -O2 -fexternal-blas a.f90 && ./z
> Time =  0.3295580 seconds. – First Value =  5022.8291
>
> This last one is definitely broken.  I did not link with
> an external BLAS library.  Please fix before 11.1 is
> released.

Since the libgfortran MATMUL should be vectorized
I think it's not reasonable to inline any but _very_ small
MATMUL at optimization levels that do not enable vectorization.

Richard.

>
> --
> Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 12:35 ` Richard Biener
@ 2021-03-18 14:48   ` Tobias Burnus
  2021-03-18 15:05     ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Tobias Burnus @ 2021-03-18 14:48 UTC (permalink / raw)
  To: Richard Biener, Steve Kargl; +Cc: fortran, Thomas Koenig

Richard,

On 18.03.21 13:35, Richard Biener via Fortran wrote:
> [...]
> Since the libgfortran MATMUL should be vectorized
> I think it's not reasonable to inline any but _very_ small
> MATMUL at optimization levels that do not enable vectorization.

Besides the obvious if (!flag_external_blas) which should always prevent
inlining (possibly except for tiny N like N=1), your idea is 'if (N
small || flag_tree_loop_vectorize)'?

Or are you thinking of a different or additional flag_... than
flag_tree_loop_vectorize for making this choice?

Tobias

-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 14:48   ` Tobias Burnus
@ 2021-03-18 15:05     ` Richard Biener
  2021-03-18 16:13       ` Steve Kargl
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2021-03-18 15:05 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Steve Kargl, fortran, Thomas Koenig

On Thu, Mar 18, 2021 at 3:48 PM Tobias Burnus <tobias@codesourcery.com> wrote:
>
> Richard,
>
> On 18.03.21 13:35, Richard Biener via Fortran wrote:
> > [...]
> > Since the libgfortran MATMUL should be vectorized
> > I think it's not reasonable to inline any but _very_ small
> > MATMUL at optimization levels that do not enable vectorization.
>
> Besides the obvious if (!flag_external_blas) which should always prevent
> inlining (possibly except for tiny N like N=1), your idea is 'if (N
> small || flag_tree_loop_vectorize)'?
>
> Or are you thinking of a different or additional flag_... than
> flag_tree_loop_vectorize for making this choice?

Yes, I was thinking of flag_tree_loop_vectorize.  Of course libgfortran
is far from having micro-optimized matmul for various architectures
but IIRC it uses attribute(target) to provide several overloads.  So
maybe only ever inlining tiny matmul makes sense as well (does the
runtime have specializations for small sizes?)

Richard.

> Tobias
>
> -----------------
> Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 15:05     ` Richard Biener
@ 2021-03-18 16:13       ` Steve Kargl
  2021-03-18 18:16         ` Thomas Koenig
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Kargl @ 2021-03-18 16:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: Tobias Burnus, fortran, Thomas Koenig

On Thu, Mar 18, 2021 at 04:05:40PM +0100, Richard Biener wrote:
> On Thu, Mar 18, 2021 at 3:48 PM Tobias Burnus <tobias@codesourcery.com> wrote:
> >
> > Richard,
> >
> > On 18.03.21 13:35, Richard Biener via Fortran wrote:
> > > [...]
> > > Since the libgfortran MATMUL should be vectorized
> > > I think it's not reasonable to inline any but _very_ small
> > > MATMUL at optimization levels that do not enable vectorization.
> >
> > Besides the obvious if (!flag_external_blas) which should always prevent
> > inlining (possibly except for tiny N like N=1), your idea is 'if (N
> > small || flag_tree_loop_vectorize)'?
> >
> > Or are you thinking of a different or additional flag_... than
> > flag_tree_loop_vectorize for making this choice?
> 
> Yes, I was thinking of flag_tree_loop_vectorize.  Of course libgfortran
> is far from having micro-optimized matmul for various architectures
> but IIRC it uses attribute(target) to provide several overloads.  So
> maybe only ever inlining tiny matmul makes sense as well (does the
> runtime have specializations for small sizes?)
> 

With -fexternal-blas, there is a cross-over value of N=30,
which can be changed by -fblas-matmul-limit=N option.

I forgot the important example, but Thomas seems to be aware.

% gfcx -o z -O2 -fno-frontend-optimize -fexternal-blas a.f90 && ./z
/usr/local/bin/ld: /tmp/ccOe3VoD.o: in function `MAIN__':
a.f90:(.text+0x156): undefined reference to `sgemm_'
collect2: error: ld returned 1 exit status

sgemm_ would come from a tuned BLAS library such as OpenBLAS.

I was going to suggest adding a testcase that scans a dump
for sgemm.  It seems matmul_blas_1.f tests the -fexternal-blas
and -fblas-matmul-limit=N options, but it doesn't look for sgemm.  
This, I believe, does the checking

diff --git a/gcc/testsuite/gfortran.dg/matmul_blas_1.f b/gcc/testsuite/gfortran.dg/matmul_blas_1.f
index 6a88981c9d7..52298d09cce 100644
--- a/gcc/testsuite/gfortran.dg/matmul_blas_1.f
+++ b/gcc/testsuite/gfortran.dg/matmul_blas_1.f
@@ -237,4 +237,4 @@ C Test calling of BLAS routines
       if (any (c /= cres)) stop 20
 
       end
-! { dg-final { scan-tree-dump-times "_gfortran_matmul" 0 "optimized" } }
+! { dg-final { scan-tree-dump "sgemm" "optimized" } }

-- 
Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 16:13       ` Steve Kargl
@ 2021-03-18 18:16         ` Thomas Koenig
  2021-03-18 18:24           ` Thomas Koenig
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Koenig @ 2021-03-18 18:16 UTC (permalink / raw)
  To: Steve Kargl, Richard Biener; +Cc: Tobias Burnus, fortran

OK, so I've had a bit of time to look at the actual test case.  I
missed one very important detail before:  This is a vector-matrix
operation.

For this, we do not have a good library routine (Harald just
removed it because of a bug in buffering), and -fexternal-blas
does not work because we do not handle calls to anything but
*GEMM.

The idea is that, for a vector-matrix-multiplication, the
compiler should have enough information about the information

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 18:16         ` Thomas Koenig
@ 2021-03-18 18:24           ` Thomas Koenig
  2021-03-18 20:22             ` Steve Kargl
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Koenig @ 2021-03-18 18:24 UTC (permalink / raw)
  To: Steve Kargl, Richard Biener; +Cc: Tobias Burnus, fortran

I didn't finish the previous mail before hitting "send", so here
is the postscript...

> OK, so I've had a bit of time to look at the actual test case.  I
> missed one very important detail before:  This is a vector-matrix
> operation.
> 
> For this, we do not have a good library routine (Harald just
> removed it because of a bug in buffering), and -fexternal-blas
> does not work because we do not handle calls to anything but
> *GEMM.

A vector-matrix multiplicatin would be a call to *GEMV, a worthy
goal, but out of scope so close to a release.

> The idea is that, for a vector-matrix-multiplication, the
> compiler should have enough information about the information
about how to optimize for the relevant architecture, especially
if the user compilers with the right flags.

So, the current idea is that, if we optimize, we can inline.

What would a better heuristic be?

Best regards

	Thomas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 18:24           ` Thomas Koenig
@ 2021-03-18 20:22             ` Steve Kargl
  2021-03-18 20:55               ` Thomas Koenig
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Kargl @ 2021-03-18 20:22 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: Richard Biener, Tobias Burnus, fortran

On Thu, Mar 18, 2021 at 07:24:21PM +0100, Thomas Koenig wrote:
> I didn't finish the previous mail before hitting "send", so here
> is the postscript...
> 
> > OK, so I've had a bit of time to look at the actual test case.  I
> > missed one very important detail before:  This is a vector-matrix
> > operation.
> > 
> > For this, we do not have a good library routine (Harald just
> > removed it because of a bug in buffering), and -fexternal-blas
> > does not work because we do not handle calls to anything but
> > *GEMM.
> 
> A vector-matrix multiplicatin would be a call to *GEMV, a worthy
> goal, but out of scope so close to a release.

Agreed.

> > The idea is that, for a vector-matrix-multiplication, the
> > compiler should have enough information about the information
> about how to optimize for the relevant architecture, especially
> if the user compilers with the right flags.
> 
> So, the current idea is that, if we optimize, we can inline.
> 
> What would a better heuristic be?
> 

Does _gfortran_matmul_r4 (and friends) work for vector-matrix
products?  I haven't checked.  If so, how about disabling
in-lining MATMUL for 11.1; then, for 11.2, this can be revisited
where a small N can be chosen for in-lining.  With -fexternal-blas
and *gemm, the default cross-over is N = 30.

BTW, I cam across this in StackOverflow.

https://stackoverflow.com/questions/66682180/why-is-matmul-slower-with-gfortran-compiler-optimization-turned-on

-- 
Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 20:22             ` Steve Kargl
@ 2021-03-18 20:55               ` Thomas Koenig
  2021-03-18 22:07                 ` Steve Kargl
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Koenig @ 2021-03-18 20:55 UTC (permalink / raw)
  To: Steve Kargl; +Cc: Richard Biener, Tobias Burnus, fortran


Am 18.03.21 um 21:22 schrieb Steve Kargl:
> On Thu, Mar 18, 2021 at 07:24:21PM +0100, Thomas Koenig wrote:
>> I didn't finish the previous mail before hitting "send", so here
>> is the postscript...
>>
>>> OK, so I've had a bit of time to look at the actual test case.  I
>>> missed one very important detail before:  This is a vector-matrix
>>> operation.
>>>
>>> For this, we do not have a good library routine (Harald just
>>> removed it because of a bug in buffering), and -fexternal-blas
>>> does not work because we do not handle calls to anything but
>>> *GEMM.
>>
>> A vector-matrix multiplicatin would be a call to *GEMV, a worthy
>> goal, but out of scope so close to a release.
> 
> Agreed.
> 
>>> The idea is that, for a vector-matrix-multiplication, the
>>> compiler should have enough information about the information
>> about how to optimize for the relevant architecture, especially
>> if the user compilers with the right flags.
>>
>> So, the current idea is that, if we optimize, we can inline.
>>
>> What would a better heuristic be?
>>
> 
> Does _gfortran_matmul_r4 (and friends) work for vector-matrix
> products?

Yes.

> I haven't checked.  If so, how about disabling
> in-lining MATMUL for 11.1;

Absolutely not for the general case. This would cause a huge regression
in execution time for 2*2 matrices, and also for small matrix-vector
multiplications.

What we could do is only to enable the inlining for vector*matrix
at -O2 or higher. Again, this will mean a penalty for smaller loops,
but at less than -O2, people probably don't care too much.

If there is agreement on that, I will prepare a patch.

Regards

	Thomas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 20:55               ` Thomas Koenig
@ 2021-03-18 22:07                 ` Steve Kargl
  2021-03-19  6:19                   ` Thomas Koenig
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Kargl @ 2021-03-18 22:07 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: Richard Biener, Tobias Burnus, fortran

On Thu, Mar 18, 2021 at 09:55:27PM +0100, Thomas Koenig wrote:
> 
> > I haven't checked.  If so, how about disabling
> > in-lining MATMUL for 11.1;
> 
> Absolutely not for the general case. This would cause a huge regression
> in execution time for 2*2 matrices, and also for small matrix-vector
> multiplications.
> 
> What we could do is only to enable the inlining for vector*matrix
> at -O2 or higher. Again, this will mean a penalty for smaller loops,
> but at less than -O2, people probably don't care too much.
> 

On my old core2 cpu, a quick test with N=1000 and NxN matrix
suggest a cross over near N=1000 for REAL(4).  This cpu doesn't
have any AVX* instruction, so YMMV.  Program follows .sig

-- 
Steve

program t

   implicit none

   character(len=10) str
   integer i, j
   integer, parameter :: &
   &  n(10) = [100, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 10000]
   real t0, t1, t3, t4
   real, allocatable :: a(:), b(:,:), c(:)
   !
   ! Loop over n(j) array.  Run each test 5 times and average.
   !
   do j = 1, 10

      allocate(a(n(j)), b(n(j),n(j)), c(n(j)))

      a = 1
      b = 1

      t3 = 0
      do i = 1, 5
         call cpu_time(t0)
         c = matmul(a, b)
         call cpu_time(t1)
         t3 = t3 + (t1 - t0)
         if (c(1) /= n(j)) stop 1
      end do
 
      t4 = 0
      do i = 1, 5
         call cpu_time(t0)
         c = matmul(b, a)
         call cpu_time(t1)
         t4 = t4 + (t1 - t0)
         if (c(1) /= n(j)) stop 2
      end do

      print '(I5,1X,2(F8.4,1X))', n(j), (t3/5) * 1000, (t4/5) * 1000

      deallocate(a, b, c)
   end do

end program t


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-18 22:07                 ` Steve Kargl
@ 2021-03-19  6:19                   ` Thomas Koenig
  2021-03-19  6:36                     ` Thomas Koenig
  2021-03-19  7:03                     ` Steve Kargl
  0 siblings, 2 replies; 14+ messages in thread
From: Thomas Koenig @ 2021-03-19  6:19 UTC (permalink / raw)
  To: Steve Kargl; +Cc: Richard Biener, Tobias Burnus, fortran


Hi Steve,

> On my old core2 cpu, a quick test with N=1000 and NxN matrix
> suggest a cross over near N=1000 for REAL(4).  This cpu doesn't
> have any AVX* instruction, so YMMV.  Program follows .sig

Looking at your data with AVX (which I think we can mostly count
on now),

- The library is always faster for matmul(vector,matrix) for any n >=100
- For matmul(matrix,vector) there is no appreciable difference

So, putting in the same inline limits for matmul(vector,matrix)
that we have for matmul(matrix,matrix), and leaving
mamul(matrix,vector) alone, seems like a reasonable thing to do.

I'll work on a patch.

Regards

	Thomas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-19  6:19                   ` Thomas Koenig
@ 2021-03-19  6:36                     ` Thomas Koenig
  2021-03-19 19:12                       ` Steve Kargl
  2021-03-19  7:03                     ` Steve Kargl
  1 sibling, 1 reply; 14+ messages in thread
From: Thomas Koenig @ 2021-03-19  6:36 UTC (permalink / raw)
  To: Steve Kargl; +Cc: Richard Biener, Tobias Burnus, fortran

[-- Attachment #1: Type: text/plain, Size: 212 bytes --]


Am 19.03.21 um 07:19 schrieb Thomas Koenig:

> I'll work on a patch.

So, here's a concept patch.  It still needs a ChangeLog and testsuite
adjustment, but this is what I would propose to use.

Regards

	Thomas

[-- Attachment #2: p1.diff --]
[-- Type: text/x-patch, Size: 2733 bytes --]

diff --git a/gcc/fortran/frontend-passes.c b/gcc/fortran/frontend-passes.c
index cfc47471cf1..7d3eae67632 100644
--- a/gcc/fortran/frontend-passes.c
+++ b/gcc/fortran/frontend-passes.c
@@ -3307,7 +3307,7 @@ get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2)
    removed by DCE. Only called for rank-two matrices A and B.  */
 
 static gfc_code *
-inline_limit_check (gfc_expr *a, gfc_expr *b, int limit)
+inline_limit_check (gfc_expr *a, gfc_expr *b, int limit, int rank_a)
 {
   gfc_expr *inline_limit;
   gfc_code *if_1, *if_2, *else_2;
@@ -3315,16 +3315,28 @@ inline_limit_check (gfc_expr *a, gfc_expr *b, int limit)
   gfc_typespec ts;
   gfc_expr *cond;
 
+  gcc_assert (rank_a == 1 || rank_a == 2);
+
   /* Calculation is done in real to avoid integer overflow.  */
 
   inline_limit = gfc_get_constant_expr (BT_REAL, gfc_default_real_kind,
 					&a->where);
   mpfr_set_si (inline_limit->value.real, limit, GFC_RND_MODE);
-  mpfr_pow_ui (inline_limit->value.real, inline_limit->value.real, 3,
+
+  /* Set the limit according to the rank.  */
+  mpfr_pow_ui (inline_limit->value.real, inline_limit->value.real, rank_a + 1,
 	       GFC_RND_MODE);
 
   a1 = get_array_inq_function (GFC_ISYM_SIZE, a, 1);
-  a2 = get_array_inq_function (GFC_ISYM_SIZE, a, 2);
+
+  /* For a_rank = 1, must use one as the size of a along the second
+     dimension as to avoid too much code duplication.  */
+
+  if (rank_a == 2)
+    a2 = get_array_inq_function (GFC_ISYM_SIZE, a, 2);
+  else
+    a2 = gfc_get_int_expr (gfc_index_integer_kind, &a->where, 1);
+
   b2 = get_array_inq_function (GFC_ISYM_SIZE, b, 2);
 
   gfc_clear_ts (&ts);
@@ -4243,11 +4255,13 @@ inline_matmul_assign (gfc_code **c, int *walk_subtrees,
   /* Take care of the inline flag.  If the limit check evaluates to a
      constant, dead code elimination will eliminate the unneeded branch.  */
 
-  if (flag_inline_matmul_limit > 0 && matrix_a->rank == 2
+  if (flag_inline_matmul_limit > 0
+      && (matrix_a->rank == 1 || matrix_a->rank == 2)
       && matrix_b->rank == 2)
     {
       if_limit = inline_limit_check (matrix_a, matrix_b,
-				     flag_inline_matmul_limit);
+				     flag_inline_matmul_limit,
+				     matrix_a->rank);
 
       /* Insert the original statement into the else branch.  */
       if_limit->block->block->next = co;
@@ -4757,7 +4771,7 @@ call_external_blas (gfc_code **c, int *walk_subtrees ATTRIBUTE_UNUSED,
     return 0;
 
   /* Generate the if statement and hang it into the tree.  */
-  if_limit = inline_limit_check (matrix_a, matrix_b, flag_blas_matmul_limit);
+  if_limit = inline_limit_check (matrix_a, matrix_b, flag_blas_matmul_limit, 2);
   co_next = co->next;
   (*current_code) = if_limit;
   co->next = NULL;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-19  6:19                   ` Thomas Koenig
  2021-03-19  6:36                     ` Thomas Koenig
@ 2021-03-19  7:03                     ` Steve Kargl
  1 sibling, 0 replies; 14+ messages in thread
From: Steve Kargl @ 2021-03-19  7:03 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: Richard Biener, Tobias Burnus, fortran

On Fri, Mar 19, 2021 at 07:19:16AM +0100, Thomas Koenig wrote:
> 
> Hi Steve,
> 
> > On my old core2 cpu, a quick test with N=1000 and NxN matrix
> > suggest a cross over near N=1000 for REAL(4).  This cpu doesn't
> > have any AVX* instruction, so YMMV.  Program follows .sig
> 
> Looking at your data with AVX (which I think we can mostly count
> on now),
> 
> - The library is always faster for matmul(vector,matrix) for any n >=100
> - For matmul(matrix,vector) there is no appreciable difference
> 
> So, putting in the same inline limits for matmul(vector,matrix)
> that we have for matmul(matrix,matrix), and leaving
> mamul(matrix,vector) alone, seems like a reasonable thing to do.
> 
> I'll work on a patch.
> 

Thanks for working on this in such short notice.  I agree the core2
is old, and gfortran should look toward the future so using the same
inline threshold seems right.

I saw your other email with the concept patch.  I'll look through
it tomorrow when I'm a little more cogent (it's after midnight here).

-- 
Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: MATMUL broken with frontend optimization.
  2021-03-19  6:36                     ` Thomas Koenig
@ 2021-03-19 19:12                       ` Steve Kargl
  0 siblings, 0 replies; 14+ messages in thread
From: Steve Kargl @ 2021-03-19 19:12 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: Richard Biener, Tobias Burnus, fortran

On Fri, Mar 19, 2021 at 07:36:22AM +0100, Thomas Koenig wrote:
> 
> Am 19.03.21 um 07:19 schrieb Thomas Koenig:
> 
> > I'll work on a patch.
> 
> So, here's a concept patch.  It still needs a ChangeLog and testsuite
> adjustment, but this is what I would propose to use.
> 

I see that you've submitted the patch for review.  AFAICT,
it looks correct to me.  Thanks for q quick fix.

-- 
Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-03-19 19:12 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-18  7:48 MATMUL broken with frontend optimization Steve Kargl
2021-03-18 12:35 ` Richard Biener
2021-03-18 14:48   ` Tobias Burnus
2021-03-18 15:05     ` Richard Biener
2021-03-18 16:13       ` Steve Kargl
2021-03-18 18:16         ` Thomas Koenig
2021-03-18 18:24           ` Thomas Koenig
2021-03-18 20:22             ` Steve Kargl
2021-03-18 20:55               ` Thomas Koenig
2021-03-18 22:07                 ` Steve Kargl
2021-03-19  6:19                   ` Thomas Koenig
2021-03-19  6:36                     ` Thomas Koenig
2021-03-19 19:12                       ` Steve Kargl
2021-03-19  7:03                     ` Steve Kargl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).