* [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
@ 2017-03-01 21:00 Thomas Koenig
  2017-03-02  3:22 ` Jerry DeLisle
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-01 21:00 UTC (permalink / raw)
  To: fortran, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]

Hello world,

the attached patch enables FMA for the AVX2 and AVX512F variants of
matmul.  This should bring a very nice speedup (although I have been
unable to run benchmarks due to lack of a suitable machine).

Question: Is this still appropriate for the current state of trunk?
Or rather, OK for when gcc 8 opens (which might still be some time
in the future)?

2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/78379
	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate
	for reals.  Add fma to target options.
	(matmul_'rtype_code`_avx512f): Add fma to target options.
	(matmul_'rtype_code`): Call AVX2 and AVX512F only if FMA is
	available.
	* generated/matmul_c10.c: Regenerated.
	* generated/matmul_c16.c: Regenerated.
	* generated/matmul_c4.c: Regenerated.
	* generated/matmul_c8.c: Regenerated.
	* generated/matmul_i1.c: Regenerated.
	* generated/matmul_i16.c: Regenerated.
	* generated/matmul_i2.c: Regenerated.
	* generated/matmul_i4.c: Regenerated.
	* generated/matmul_i8.c: Regenerated.
	* generated/matmul_r10.c: Regenerated.
	* generated/matmul_r16.c: Regenerated.
	* generated/matmul_r4.c: Regenerated.
	* generated/matmul_r8.c: Regenerated.

Regards

	Thomas

[-- Attachment #2: p1-fma.diff --]
[-- Type: text/x-patch, Size: 2139 bytes --]

Index: m4/matmul.m4
===================================================================
--- m4/matmul.m4	(Revision 245760)
+++ m4/matmul.m4	(Arbeitskopie)
@@ -75,14 +75,6 @@
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_'rtype_code`);
 
-'ifelse(rtype_letter,`r',dnl
-`#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2.  Only generate
-   an AVX2 function if we are dealing with integer.  */
-#undef HAVE_AVX2
-#endif')
-`
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -101,7 +93,7 @@
 `static void
 'matmul_name` ('rtype` * const restrict retarray,
 	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static' include(matmul_internal.m4)dnl
 `#endif /* HAVE_AVX2 */
 
@@ -110,7 +102,7 @@
 `static void
 'matmul_name` ('rtype` * const restrict retarray,
 	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
 static' include(matmul_internal.m4)dnl
 `#endif /* HAVE_AVX512F */
 
@@ -138,7 +130,9 @@
 {
   /* Run down the available processors in order of preference.  */
 #ifdef HAVE_AVX512F
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
     {
       matmul_p = matmul_'rtype_code`_avx512f;
       goto tailcall;
@@ -147,7 +141,8 @@
 #endif /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_'rtype_code`_avx2;
       goto tailcall;
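For readers not familiar with the dispatch scheme this patch extends, here is
a minimal standalone sketch of the same pattern -- a hypothetical axpy kernel
stands in for matmul, and the public __builtin_cpu_supports() builtin stands
in for libgcc's internal __cpu_model; the function names are invented for
illustration:

/* Sketch of libgfortran's runtime-dispatch pattern, reduced to a
   hypothetical axpy kernel.  Assumes GCC on x86-64; not the actual
   library code.  */

static void
axpy_avx2 (double * restrict y, const double * restrict x, double a, int n)
	__attribute__((__target__("avx2,fma")));

static void
axpy_avx2 (double * restrict y, const double * restrict x, double a, int n)
{
  /* With -O2, a * x[i] + y[i] can contract to vfmadd* instructions here,
     because this definition inherits the avx2,fma target attribute from
     the declaration above.  */
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

static void
axpy_generic (double * restrict y, const double * restrict x, double a, int n)
{
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

void
axpy (double * restrict y, const double * restrict x, double a, int n)
{
  static void (*axpy_p) (double * restrict, const double * restrict,
			 double, int) = 0;

  if (axpy_p == 0)
    {
      /* Require FMA in addition to AVX2, as the patch does: the AVX2
	 variant is compiled with FMA enabled, so selecting it on a CPU
	 without FMA could fault with an illegal instruction.  */
      if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
	axpy_p = axpy_avx2;
      else
	axpy_p = axpy_generic;
    }
  axpy_p (y, x, a, n);
}

Compiled with plain -O2, only axpy_avx2 is built with AVX2+FMA enabled; the
selection happens once at run time, which is how a single libgfortran binary
can serve both old and new CPUs.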
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
@ 2017-03-02  3:22 ` Jerry DeLisle
  2017-03-02  6:15   ` Thomas Koenig
  2017-03-02  7:32 ` Janne Blomqvist
  2017-03-02  8:43 ` Jakub Jelinek
  2 siblings, 1 reply; 17+ messages in thread
From: Jerry DeLisle @ 2017-03-02  3:22 UTC (permalink / raw)
  To: fortran; +Cc: GCC Patches

On 03/01/2017 01:00 PM, Thomas Koenig wrote:
> Hello world,
>
> the attached patch enables FMA for the AVX2 and AVX512F variants of
> matmul.  This should bring a very nice speedup (although I have
> been unable to run benchmarks due to lack of a suitable machine).
>
> Question: Is this still appropriate for the current state of trunk?
> Or rather, OK for when gcc 8 opens (which might still be some time
> in the future)?

I think it may be appropriate now, because you are making an adjustment
to the just-added new feature.

I would prefer that it be tested on the actual expected platform.  Does
anyone anywhere on this list have access to one of these machines to
test?

Jerry
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul 2017-03-02 3:22 ` Jerry DeLisle @ 2017-03-02 6:15 ` Thomas Koenig 0 siblings, 0 replies; 17+ messages in thread From: Thomas Koenig @ 2017-03-02 6:15 UTC (permalink / raw) To: Jerry DeLisle, fortran; +Cc: GCC Patches [-- Attachment #1: Type: text/plain, Size: 305 bytes --] Hi Jerry, > I would prefer that it was tested on the actual expected platform. Does > anyone anywhere on this list have access to one of these machines to test? If anybody wants to test who does not have --enable-maintainer-mode activated, here is a patch that works "out of the box". Regards Thomas [-- Attachment #2: p1-fma-total.diff --] [-- Type: text/x-patch, Size: 34073 bytes --] Index: generated/matmul_c10.c =================================================================== --- generated/matmul_c10.c (Revision 245760) +++ generated/matmul_c10.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_c10 (gfc_array_c10 * const rest int blas_limit, blas_call gemm); export_proto(matmul_c10); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_c10_avx (gfc_array_c10 * const restrict ret static void matmul_c10_avx2 (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_c10_avx2 (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_c10_avx2 (gfc_array_c10 * const restrict re static void matmul_c10_avx512f (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_c10_avx512f (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_c10 (gfc_array_c10 * const restrict re { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_c10_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_c10_avx2; goto tailcall; Index: generated/matmul_c16.c =================================================================== --- generated/matmul_c16.c (Revision 245760) +++ generated/matmul_c16.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_c16 (gfc_array_c16 * const rest int blas_limit, blas_call gemm); export_proto(matmul_c16); - - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_c16_avx (gfc_array_c16 * const restrict ret static void matmul_c16_avx2 (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_c16_avx2 (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_c16_avx2 (gfc_array_c16 * const restrict re static void matmul_c16_avx512f (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_c16_avx512f (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_c16 (gfc_array_c16 * const restrict re { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_c16_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_c16_avx2; goto tailcall; Index: generated/matmul_c4.c =================================================================== --- generated/matmul_c4.c (Revision 245760) +++ generated/matmul_c4.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_c4 (gfc_array_c4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_c4); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_c4_avx (gfc_array_c4 * const restrict retar static void matmul_c4_avx2 (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_c4_avx2 (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_c4_avx2 (gfc_array_c4 * const restrict reta static void matmul_c4_avx512f (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_c4_avx512f (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_c4 (gfc_array_c4 * const restrict reta { /* Run down the available processors in order of preference. 
*/ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_c4_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_c4_avx2; goto tailcall; Index: generated/matmul_c8.c =================================================================== --- generated/matmul_c8.c (Revision 245760) +++ generated/matmul_c8.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_c8 (gfc_array_c8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_c8); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_c8_avx (gfc_array_c8 * const restrict retar static void matmul_c8_avx2 (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_c8_avx2 (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_c8_avx2 (gfc_array_c8 * const restrict reta static void matmul_c8_avx512f (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_c8_avx512f (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_c8 (gfc_array_c8 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_c8_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_c8_avx2; goto tailcall; Index: generated/matmul_i1.c =================================================================== --- generated/matmul_i1.c (Revision 245760) +++ generated/matmul_i1.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i1 (gfc_array_i1 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i1); - - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i1_avx (gfc_array_i1 * const restrict retar static void matmul_i1_avx2 (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i1_avx2 (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i1_avx2 (gfc_array_i1 * const restrict reta static void matmul_i1_avx512f (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i1_avx512f (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i1 (gfc_array_i1 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i1_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i1_avx2; goto tailcall; Index: generated/matmul_i16.c =================================================================== --- generated/matmul_i16.c (Revision 245760) +++ generated/matmul_i16.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i16 (gfc_array_i16 * const rest int blas_limit, blas_call gemm); export_proto(matmul_i16); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i16_avx (gfc_array_i16 * const restrict ret static void matmul_i16_avx2 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i16_avx2 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i16_avx2 (gfc_array_i16 * const restrict re static void matmul_i16_avx512f (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i16_avx512f (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i16 (gfc_array_i16 * const restrict re { /* Run down the available processors in order of preference. 
*/ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i16_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i16_avx2; goto tailcall; Index: generated/matmul_i2.c =================================================================== --- generated/matmul_i2.c (Revision 245760) +++ generated/matmul_i2.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i2 (gfc_array_i2 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i2); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i2_avx (gfc_array_i2 * const restrict retar static void matmul_i2_avx2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i2_avx2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i2_avx2 (gfc_array_i2 * const restrict reta static void matmul_i2_avx512f (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i2_avx512f (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i2 (gfc_array_i2 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i2_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i2_avx2; goto tailcall; Index: generated/matmul_i4.c =================================================================== --- generated/matmul_i4.c (Revision 245760) +++ generated/matmul_i4.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i4 (gfc_array_i4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i4); - - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i4_avx (gfc_array_i4 * const restrict retar static void matmul_i4_avx2 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i4_avx2 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i4_avx2 (gfc_array_i4 * const restrict reta static void matmul_i4_avx512f (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i4_avx512f (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i4 (gfc_array_i4 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i4_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i4_avx2; goto tailcall; Index: generated/matmul_i8.c =================================================================== --- generated/matmul_i8.c (Revision 245760) +++ generated/matmul_i8.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i8 (gfc_array_i8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i8); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i8_avx (gfc_array_i8 * const restrict retar static void matmul_i8_avx2 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i8_avx2 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i8_avx2 (gfc_array_i8 * const restrict reta static void matmul_i8_avx512f (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i8_avx512f (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i8 (gfc_array_i8 * const restrict reta { /* Run down the available processors in order of preference. 
*/ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i8_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i8_avx2; goto tailcall; Index: generated/matmul_r10.c =================================================================== --- generated/matmul_r10.c (Revision 245760) +++ generated/matmul_r10.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r10 (gfc_array_r10 * const rest int blas_limit, blas_call gemm); export_proto(matmul_r10); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r10_avx (gfc_array_r10 * const restrict ret static void matmul_r10_avx2 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r10_avx2 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, @@ -1175,7 +1168,7 @@ matmul_r10_avx2 (gfc_array_r10 * const restrict re static void matmul_r10_avx512f (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_r10_avx512f (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, @@ -2272,7 +2265,9 @@ void matmul_r10 (gfc_array_r10 * const restrict re { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_r10_avx512f; goto tailcall; @@ -2281,7 +2276,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r10_avx2; goto tailcall; Index: generated/matmul_r16.c =================================================================== --- generated/matmul_r16.c (Revision 245760) +++ generated/matmul_r16.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r16 (gfc_array_r16 * const rest int blas_limit, blas_call gemm); export_proto(matmul_r16); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. 
*/ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r16_avx (gfc_array_r16 * const restrict ret static void matmul_r16_avx2 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r16_avx2 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, @@ -1175,7 +1168,7 @@ matmul_r16_avx2 (gfc_array_r16 * const restrict re static void matmul_r16_avx512f (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_r16_avx512f (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, @@ -2272,7 +2265,9 @@ void matmul_r16 (gfc_array_r16 * const restrict re { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_r16_avx512f; goto tailcall; @@ -2281,7 +2276,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r16_avx2; goto tailcall; Index: generated/matmul_r4.c =================================================================== --- generated/matmul_r4.c (Revision 245760) +++ generated/matmul_r4.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r4 (gfc_array_r4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_r4); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r4_avx (gfc_array_r4 * const restrict retar static void matmul_r4_avx2 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r4_avx2 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, @@ -1175,7 +1168,7 @@ matmul_r4_avx2 (gfc_array_r4 * const restrict reta static void matmul_r4_avx512f (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_r4_avx512f (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, @@ -2272,7 +2265,9 @@ void matmul_r4 (gfc_array_r4 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_r4_avx512f; goto tailcall; @@ -2281,7 +2276,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r4_avx2; goto tailcall; Index: generated/matmul_r8.c =================================================================== --- generated/matmul_r8.c (Revision 245760) +++ generated/matmul_r8.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r8 (gfc_array_r8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_r8); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r8_avx (gfc_array_r8 * const restrict retar static void matmul_r8_avx2 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r8_avx2 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, @@ -1175,7 +1168,7 @@ matmul_r8_avx2 (gfc_array_r8 * const restrict reta static void matmul_r8_avx512f (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_r8_avx512f (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, @@ -2272,7 +2265,9 @@ void matmul_r8 (gfc_array_r8 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_r8_avx512f; goto tailcall; @@ -2281,7 +2276,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r8_avx2; goto tailcall; Index: m4/matmul.m4 =================================================================== --- m4/matmul.m4 (Revision 245760) +++ m4/matmul.m4 (Arbeitskopie) @@ -75,14 +75,6 @@ extern void matmul_'rtype_code` ('rtype` * const r int blas_limit, blas_call gemm); export_proto(matmul_'rtype_code`); -'ifelse(rtype_letter,`r',dnl -`#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif') -` - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -101,7 +93,7 @@ static' include(matmul_internal.m4)dnl `static void 'matmul_name` ('rtype` * const restrict retarray, 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static' include(matmul_internal.m4)dnl `#endif /* HAVE_AVX2 */ @@ -110,7 +102,7 @@ static' include(matmul_internal.m4)dnl `static void 'matmul_name` ('rtype` * const restrict retarray, 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static' include(matmul_internal.m4)dnl `#endif /* HAVE_AVX512F */ @@ -138,7 +130,9 @@ void matmul_'rtype_code` ('rtype` * const restrict { /* Run down the available processors in order of preference. 
*/ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_'rtype_code`_avx512f; goto tailcall; @@ -147,7 +141,8 @@ void matmul_'rtype_code` ('rtype` * const restrict #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_'rtype_code`_avx2; goto tailcall;
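Independent of rebuilding libgfortran, a few lines of C mirroring the feature
tests of the patch as posted show which variant the dispatcher would select on
a given test machine.  This is an illustrative sketch, not part of the patch;
it uses the public __builtin_cpu_supports() rather than __cpu_model, and the
variant names merely echo the function suffixes:

/* Print the matmul variant that the patched dispatch logic would pick
   on the machine this runs on.  Illustrative only.  Build with a GCC
   recent enough to know the "avx512f" feature string.  */
#include <stdio.h>

int
main (void)
{
  const char *variant = "generic";

  if (__builtin_cpu_supports ("avx512f") && __builtin_cpu_supports ("fma"))
    variant = "avx512f";
  else if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
    variant = "avx2";
  else if (__builtin_cpu_supports ("avx"))
    variant = "avx";

  printf ("matmul variant: %s\n", variant);
  return 0;
}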
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
  2017-03-02  3:22 ` Jerry DeLisle
@ 2017-03-02  7:32 ` Janne Blomqvist
  2017-03-02  7:50   ` Thomas Koenig
  2017-03-02  8:43 ` Jakub Jelinek
  2 siblings, 1 reply; 17+ messages in thread
From: Janne Blomqvist @ 2017-03-02  7:32 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> Hello world,
>
> the attached patch enables FMA for the AVX2 and AVX512F variants of
> matmul.  This should bring a very nice speedup (although I have
> been unable to run benchmarks due to lack of a suitable machine).

In lieu of benchmarks, have you looked at the generated asm to verify
that fma is actually used?

> Question: Is this still appropriate for the current state of trunk?

Yes, looks pretty safe.

-- 
Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  7:32 ` Janne Blomqvist
@ 2017-03-02  7:50   ` Thomas Koenig
  2017-03-02  8:09     ` Janne Blomqvist
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02  7:50 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: fortran, gcc-patches

On 02.03.2017 at 08:32, Janne Blomqvist wrote:
> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de> wrote:
>> Hello world,
>>
>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>> matmul.  This should bring a very nice speedup (although I have
>> been unable to run benchmarks due to lack of a suitable machine).
>
> In lieu of benchmarks, have you looked at the generated asm to verify
> that fma is actually used?

Yes, I did.

Here's something from the new matmul_r8_avx2:

    156c:  c4 62 e5 b8 fd          vfmadd231pd %ymm5,%ymm3,%ymm15
    1571:  c4 c1 79 10 04 06       vmovupd (%r14,%rax,1),%xmm0
    1577:  c4 62 dd b8 db          vfmadd231pd %ymm3,%ymm4,%ymm11
    157c:  c4 c3 7d 18 44 06 10    vinsertf128 $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
    1583:  01
    1584:  c4 62 ed b8 ed          vfmadd231pd %ymm5,%ymm2,%ymm13
    1589:  c4 e2 ed b8 fc          vfmadd231pd %ymm4,%ymm2,%ymm7
    158e:  c4 e2 fd a8 ad 30 ff    vfmadd213pd -0x800d0(%rbp),%ymm0,%ymm5

... and here from matmul_r8_avx512f:

    1da8:  c4 a1 7b 10 14 d6       vmovsd (%rsi,%r10,8),%xmm2
    1dae:  c4 c2 b1 b9 f0          vfmadd231sd %xmm8,%xmm9,%xmm6
    1db3:  62 62 ed 08 b9 e5       vfmadd231sd %xmm5,%xmm2,%xmm28
    1db9:  62 62 ed 08 b9 ec       vfmadd231sd %xmm4,%xmm2,%xmm29
    1dbf:  62 62 ed 08 b9 f3       vfmadd231sd %xmm3,%xmm2,%xmm30
    1dc5:  c4 e2 91 99 e8          vfmadd132sd %xmm0,%xmm13,%xmm5
    1dca:  c4 e2 99 99 e0          vfmadd132sd %xmm0,%xmm12,%xmm4
    1dcf:  c4 e2 a1 99 d8          vfmadd132sd %xmm0,%xmm11,%xmm3
    1dd4:  c4 c2 a9 99 d1          vfmadd132sd %xmm9,%xmm10,%xmm2
    1dd9:  c4 c2 89 99 c1          vfmadd132sd %xmm9,%xmm14,%xmm0
    1dde:  0f 8e d3 fe ff ff       jle 1cb7 <matmul_r8_avx512f+0x1cb7>

... so this is looking pretty good.

Regards

	Thomas
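For reference, vfmadd231pd multiplies two packed-double operands and
accumulates into the destination with a single rounding; in the AT&T syntax
above, vfmadd231pd %ymm5,%ymm3,%ymm15 computes ymm15 = ymm3*ymm5 + ymm15 on
four doubles at once.  A rough intrinsics equivalent of one such step
(assuming <immintrin.h> and compilation with, e.g., gcc -O2 -mfma; the
function name is invented) is:

#include <immintrin.h>

/* One FMA step: acc = a * b + acc, fused into a single instruction
   with a single rounding.  */
__m256d
fma_step (__m256d acc, __m256d a, __m256d b)
{
  return _mm256_fmadd_pd (a, b, acc);
}

The single rounding is also why FMA can change results slightly compared to a
separate multiply and add.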
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  7:50 ` Thomas Koenig
@ 2017-03-02  8:09   ` Janne Blomqvist
  2017-03-02  8:14     ` Richard Biener
  2017-03-02  8:16     ` Jakub Jelinek
  0 siblings, 2 replies; 17+ messages in thread
From: Janne Blomqvist @ 2017-03-02  8:09 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 2, 2017 at 9:50 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> On 02.03.2017 at 08:32, Janne Blomqvist wrote:
>>
>> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de>
>> wrote:
>>>
>>> Hello world,
>>>
>>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>>> matmul.  This should bring a very nice speedup (although I have
>>> been unable to run benchmarks due to lack of a suitable machine).
>>
>>
>> In lieu of benchmarks, have you looked at the generated asm to verify
>> that fma is actually used?
>
>
> Yes, I did.
>
> Here's something from the new matmul_r8_avx2:
>
>     156c:  c4 62 e5 b8 fd          vfmadd231pd %ymm5,%ymm3,%ymm15
>     1571:  c4 c1 79 10 04 06       vmovupd (%r14,%rax,1),%xmm0
>     1577:  c4 62 dd b8 db          vfmadd231pd %ymm3,%ymm4,%ymm11
>     157c:  c4 c3 7d 18 44 06 10    vinsertf128
> $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
>     1583:  01
>     1584:  c4 62 ed b8 ed          vfmadd231pd %ymm5,%ymm2,%ymm13
>     1589:  c4 e2 ed b8 fc          vfmadd231pd %ymm4,%ymm2,%ymm7
>     158e:  c4 e2 fd a8 ad 30 ff    vfmadd213pd
> -0x800d0(%rbp),%ymm0,%ymm5

Great, looks good!

> ... and here from matmul_r8_avx512f:
>
>     1da8:  c4 a1 7b 10 14 d6       vmovsd (%rsi,%r10,8),%xmm2
>     1dae:  c4 c2 b1 b9 f0          vfmadd231sd %xmm8,%xmm9,%xmm6
>     1db3:  62 62 ed 08 b9 e5       vfmadd231sd %xmm5,%xmm2,%xmm28
>     1db9:  62 62 ed 08 b9 ec       vfmadd231sd %xmm4,%xmm2,%xmm29
>     1dbf:  62 62 ed 08 b9 f3       vfmadd231sd %xmm3,%xmm2,%xmm30
>     1dc5:  c4 e2 91 99 e8          vfmadd132sd %xmm0,%xmm13,%xmm5
>     1dca:  c4 e2 99 99 e0          vfmadd132sd %xmm0,%xmm12,%xmm4
>     1dcf:  c4 e2 a1 99 d8          vfmadd132sd %xmm0,%xmm11,%xmm3
>     1dd4:  c4 c2 a9 99 d1          vfmadd132sd %xmm9,%xmm10,%xmm2
>     1dd9:  c4 c2 89 99 c1          vfmadd132sd %xmm9,%xmm14,%xmm0
>     1dde:  0f 8e d3 fe ff ff       jle 1cb7
> <matmul_r8_avx512f+0x1cb7>

Good, it's using fma, but why is this using xmm registers? That would
mean it's operating only on 128 bit blocks at a time so no better than
plain AVX. AFAIU avx512 should use zmm registers to operate on 512 bit
chunks.

I guess this is not due to your patch, but some other issue.

-- 
Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  8:09 ` Janne Blomqvist
@ 2017-03-02  8:14   ` Richard Biener
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Biener @ 2017-03-02  8:14 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: Thomas Koenig, fortran, gcc-patches

On Thu, Mar 2, 2017 at 9:09 AM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> On Thu, Mar 2, 2017 at 9:50 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
>> On 02.03.2017 at 08:32, Janne Blomqvist wrote:
>>>
>>> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de>
>>> wrote:
>>>>
>>>> Hello world,
>>>>
>>>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>>>> matmul.  This should bring a very nice speedup (although I have
>>>> been unable to run benchmarks due to lack of a suitable machine).
>>>
>>>
>>> In lieu of benchmarks, have you looked at the generated asm to verify
>>> that fma is actually used?
>>
>>
>> Yes, I did.
>>
>> Here's something from the new matmul_r8_avx2:
>>
>>     156c:  c4 62 e5 b8 fd          vfmadd231pd %ymm5,%ymm3,%ymm15
>>     1571:  c4 c1 79 10 04 06       vmovupd (%r14,%rax,1),%xmm0
>>     1577:  c4 62 dd b8 db          vfmadd231pd %ymm3,%ymm4,%ymm11
>>     157c:  c4 c3 7d 18 44 06 10    vinsertf128
>> $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
>>     1583:  01
>>     1584:  c4 62 ed b8 ed          vfmadd231pd %ymm5,%ymm2,%ymm13
>>     1589:  c4 e2 ed b8 fc          vfmadd231pd %ymm4,%ymm2,%ymm7
>>     158e:  c4 e2 fd a8 ad 30 ff    vfmadd213pd
>> -0x800d0(%rbp),%ymm0,%ymm5
>
> Great, looks good!
>
>> ... and here from matmul_r8_avx512f:
>>
>>     1da8:  c4 a1 7b 10 14 d6       vmovsd (%rsi,%r10,8),%xmm2
>>     1dae:  c4 c2 b1 b9 f0          vfmadd231sd %xmm8,%xmm9,%xmm6
>>     1db3:  62 62 ed 08 b9 e5       vfmadd231sd %xmm5,%xmm2,%xmm28
>>     1db9:  62 62 ed 08 b9 ec       vfmadd231sd %xmm4,%xmm2,%xmm29
>>     1dbf:  62 62 ed 08 b9 f3       vfmadd231sd %xmm3,%xmm2,%xmm30
>>     1dc5:  c4 e2 91 99 e8          vfmadd132sd %xmm0,%xmm13,%xmm5
>>     1dca:  c4 e2 99 99 e0          vfmadd132sd %xmm0,%xmm12,%xmm4
>>     1dcf:  c4 e2 a1 99 d8          vfmadd132sd %xmm0,%xmm11,%xmm3
>>     1dd4:  c4 c2 a9 99 d1          vfmadd132sd %xmm9,%xmm10,%xmm2
>>     1dd9:  c4 c2 89 99 c1          vfmadd132sd %xmm9,%xmm14,%xmm0
>>     1dde:  0f 8e d3 fe ff ff       jle 1cb7
>> <matmul_r8_avx512f+0x1cb7>
>
> Good, it's using fma, but why is this using xmm registers? That would
> mean it's operating only on 128 bit blocks at a time so no better than
> plain AVX. AFAIU avx512 should use zmm registers to operate on 512 bit
> chunks.
>
> I guess this is not due to your patch, but some other issue.

The question is, was it using %zmm before the patch?

> -- 
> Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  8:09 ` Janne Blomqvist
  2017-03-02  8:14   ` Richard Biener
@ 2017-03-02  8:16   ` Jakub Jelinek
  1 sibling, 0 replies; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02  8:16 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: Thomas Koenig, fortran, gcc-patches

On Thu, Mar 02, 2017 at 10:09:31AM +0200, Janne Blomqvist wrote:
> > Here's something from the new matmul_r8_avx2:
> >
> >     156c:  c4 62 e5 b8 fd          vfmadd231pd %ymm5,%ymm3,%ymm15
> >     1571:  c4 c1 79 10 04 06       vmovupd (%r14,%rax,1),%xmm0
> >     1577:  c4 62 dd b8 db          vfmadd231pd %ymm3,%ymm4,%ymm11
> >     157c:  c4 c3 7d 18 44 06 10    vinsertf128
> > $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
> >     1583:  01
> >     1584:  c4 62 ed b8 ed          vfmadd231pd %ymm5,%ymm2,%ymm13
> >     1589:  c4 e2 ed b8 fc          vfmadd231pd %ymm4,%ymm2,%ymm7
> >     158e:  c4 e2 fd a8 ad 30 ff    vfmadd213pd
> > -0x800d0(%rbp),%ymm0,%ymm5
> 
> Great, looks good!
> 
> > ... and here from matmul_r8_avx512f:
> >
> >     1da8:  c4 a1 7b 10 14 d6       vmovsd (%rsi,%r10,8),%xmm2
> >     1dae:  c4 c2 b1 b9 f0          vfmadd231sd %xmm8,%xmm9,%xmm6
> >     1db3:  62 62 ed 08 b9 e5       vfmadd231sd %xmm5,%xmm2,%xmm28
> >     1db9:  62 62 ed 08 b9 ec       vfmadd231sd %xmm4,%xmm2,%xmm29
> >     1dbf:  62 62 ed 08 b9 f3       vfmadd231sd %xmm3,%xmm2,%xmm30
> >     1dc5:  c4 e2 91 99 e8          vfmadd132sd %xmm0,%xmm13,%xmm5
> >     1dca:  c4 e2 99 99 e0          vfmadd132sd %xmm0,%xmm12,%xmm4
> >     1dcf:  c4 e2 a1 99 d8          vfmadd132sd %xmm0,%xmm11,%xmm3
> >     1dd4:  c4 c2 a9 99 d1          vfmadd132sd %xmm9,%xmm10,%xmm2
> >     1dd9:  c4 c2 89 99 c1          vfmadd132sd %xmm9,%xmm14,%xmm0
> >     1dde:  0f 8e d3 fe ff ff       jle 1cb7
> > <matmul_r8_avx512f+0x1cb7>
> 
> Good, it's using fma, but why is this using xmm registers? That would
> mean it's operating only on 128 bit blocks at a time so no better than
> plain AVX. AFAIU avx512 should use zmm registers to operate on 512 bit
> chunks.

Well, it uses sd, i.e. the scalar fma, not pd, so those are always xmm
regs and only a single double in them, this must be some scalar epilogue
loop or whatever; but matmul_r8_avx512f also has:
    140c:  62 72 e5 40 98 c1       vfmadd132pd %zmm1,%zmm19,%zmm8
    1412:  62 72 e5 40 98 cd       vfmadd132pd %zmm5,%zmm19,%zmm9
    1418:  62 72 e5 40 98 d1       vfmadd132pd %zmm1,%zmm19,%zmm10
    141e:  62 72 e5 40 98 de       vfmadd132pd %zmm6,%zmm19,%zmm11
    1424:  62 72 e5 40 98 e1       vfmadd132pd %zmm1,%zmm19,%zmm12
    142a:  62 e2 e5 40 98 c6       vfmadd132pd %zmm6,%zmm19,%zmm16
    1430:  62 f2 e5 40 98 c8       vfmadd132pd %zmm0,%zmm19,%zmm1
    1436:  62 f2 e5 40 98 f0       vfmadd132pd %zmm0,%zmm19,%zmm6
    143c:  62 72 e5 40 98 fd       vfmadd132pd %zmm5,%zmm19,%zmm15
    1442:  62 72 e5 40 98 f4       vfmadd132pd %zmm4,%zmm19,%zmm14
    1448:  62 72 e5 40 98 eb       vfmadd132pd %zmm3,%zmm19,%zmm13
    144e:  62 f2 e5 40 98 d0       vfmadd132pd %zmm0,%zmm19,%zmm2
    1454:  62 b2 e5 40 98 ec       vfmadd132pd %zmm20,%zmm19,%zmm5
    145a:  62 b2 e5 40 98 e4       vfmadd132pd %zmm20,%zmm19,%zmm4
    1460:  62 b2 e5 40 98 dc       vfmadd132pd %zmm20,%zmm19,%zmm3
    1466:  62 b2 e5 40 98 c4       vfmadd132pd %zmm20,%zmm19,%zmm0
etc. where 8 doubles in zmm regs are processed together.

	Jakub
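The mix of packed and scalar FMAs Jakub describes is the usual vectorizer
output: a main loop processing full vectors plus a scalar epilogue for the
leftover iterations.  Schematically, with a hypothetical dot product rather
than the actual libgfortran loop:

/* Why both vfmadd...pd on %zmm and vfmadd...sd on %xmm can appear in
   one function: under e.g. -O3 -mavx512f -ffast-math (the reduction
   needs FP reassociation to vectorize), the compiler turns the loop
   below into a packed-FMA main loop handling 8 doubles per iteration,
   plus a compiler-generated scalar epilogue for the n % 8 tail, which
   uses the scalar vfmadd...sd forms.  Hypothetical example only.  */
double
dot (const double *a, const double *b, long n)
{
  double s = 0.0;
  for (long i = 0; i < n; i++)
    s += a[i] * b[i];
  return s;
}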
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
  2017-03-02  3:22 ` Jerry DeLisle
  2017-03-02  7:32 ` Janne Blomqvist
@ 2017-03-02  8:43 ` Jakub Jelinek
  2017-03-02  9:03   ` Thomas Koenig
  2 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02  8:43 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
> @@ -101,7 +93,7 @@
>  `static void
>  'matmul_name` ('rtype` * const restrict retarray,
>  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> -	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
> +	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
>  static' include(matmul_internal.m4)dnl
>  `#endif /* HAVE_AVX2 */
> 

I guess the question here is if there are any CPUs that have AVX2 but
don't have FMA3.  If there are none, then this is not controversial; if
there are some, it depends on how widely they are used compared to ones
that have both AVX2 and FMA3.  Going just from our -march= bitsets, it
seems that if there is PTA_AVX2, then there is also PTA_FMA: haswell,
broadwell, skylake, skylake-avx512, knl, bdver4, znver1.  There are CPUs
that have just PTA_AVX and not PTA_AVX2 and still have PTA_FMA: bdver2,
bdver3 (but that is not relevant to this patch).

> @@ -110,7 +102,7 @@
>  `static void
>  'matmul_name` ('rtype` * const restrict retarray,
>  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> -	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
> +	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
>  static' include(matmul_internal.m4)dnl
>  `#endif /* HAVE_AVX512F */
> 

I think this change is not needed, because the EVEX encoded
VFMADD???[SP][DS] instructions etc. are in AVX512F ISA, not in FMA3 ISA
(which has just the VEX encoded ones).
Which is why I'm seeing the fmas in my libgfortran even without your patch.
Thus I think you should remove this from your patch.

> @@ -147,7 +141,8 @@
>  #endif /* HAVE_AVX512F */
> 
>  #ifdef HAVE_AVX2
> -  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> +  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> +      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
>      {
>        matmul_p = matmul_'rtype_code`_avx2;
>        goto tailcall;

and this too.

	Jakub
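Jakub's point about the EVEX encodings can be checked with a few lines of C:
even without "fma" in the target string, GCC should contract the multiply-add
below into an FMA instruction, because the EVEX-encoded forms belong to
AVX512F itself rather than to the VEX-only FMA3 extension.  A sketch, to be
verified by inspecting the output of gcc -O2 -S (the function name is
invented; FP contraction is on by default in GNU C):

/* No "fma" in the target attribute, yet a * b + c is expected to
   contract to an EVEX-encoded vfmadd...sd, since those encodings are
   part of the AVX512F ISA.  */
__attribute__((__target__("avx512f"))) double
fused (double a, double b, double c)
{
  return a * b + c;
}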
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  8:43 ` Jakub Jelinek
@ 2017-03-02  9:03   ` Thomas Koenig
  2017-03-02  9:08     ` Jakub Jelinek
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02  9:03 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: fortran, gcc-patches

On 02.03.2017 at 09:43, Jakub Jelinek wrote:
> On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
>> @@ -101,7 +93,7 @@
>>  `static void
>>  'matmul_name` ('rtype` * const restrict retarray,
>>  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
>> -	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
>> +	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
>>  static' include(matmul_internal.m4)dnl
>>  `#endif /* HAVE_AVX2 */
>>
>
> I guess the question here is if there are any CPUs that have AVX2 but don't
> have FMA3.  If there are none, then this is not controversial, if there are
> some, it depends on how widely they are used compared to ones that have both
> AVX2 and FMA3.  Going just from our -march= bitsets, it seems if there is
> PTA_AVX2, then there is also PTA_FMA: haswell, broadwell, skylake, skylake-avx512, knl,
> bdver4, znver1, there are CPUs that have just PTA_AVX and not PTA_AVX2 and
> still have PTA_FMA: bdver2, bdver3 (but that is not relevant to this patch).

In a previous incantation of the patch, I saw that the compiler
generated the same floating point code for AVX and AVX2 (which is why
there currently is no AVX2 floating point version).  I could also
generate an AVX+FMA version for floating point and an AVX2 version
for integer (if anybody cares about integer matmul).

Or I could just leave it as it is.

>> @@ -110,7 +102,7 @@
>>  `static void
>>  'matmul_name` ('rtype` * const restrict retarray,
>>  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
>> -	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
>> +	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
>>  static' include(matmul_internal.m4)dnl
>>  `#endif /* HAVE_AVX512F */
>>
>
> I think this change is not needed, because the EVEX encoded
> VFMADD???[SP][DS] instructions etc. are in AVX512F ISA, not in FMA3 ISA
> (which has just the VEX encoded ones).
> Which is why I'm seeing the fmas in my libgfortran even without your patch.
> Thus I think you should remove this from your patch.

OK, I'll remove it.

>> @@ -147,7 +141,8 @@
>>  #endif /* HAVE_AVX512F */
>>
>>  #ifdef HAVE_AVX2
>> -  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
>> +  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
>> +      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
>>      {
>>        matmul_p = matmul_'rtype_code`_avx2;
>>        goto tailcall;
>
> and this too.

Will do.
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  9:03 ` Thomas Koenig
@ 2017-03-02  9:08   ` Jakub Jelinek
  2017-03-02 10:46     ` Thomas Koenig
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02  9:08 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 02, 2017 at 10:03:28AM +0100, Thomas Koenig wrote:
> On 02.03.2017 at 09:43, Jakub Jelinek wrote:
> > On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
> > > @@ -101,7 +93,7 @@
> > >  `static void
> > >  'matmul_name` ('rtype` * const restrict retarray,
> > >  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> > > -	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
> > > +	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
> > >  static' include(matmul_internal.m4)dnl
> > >  `#endif /* HAVE_AVX2 */
> > > 
> > 
> > I guess the question here is if there are any CPUs that have AVX2 but don't
> > have FMA3.  If there are none, then this is not controversial, if there are
> > some, it depends on how widely they are used compared to ones that have both
> > AVX2 and FMA3.  Going just from our -march= bitsets, it seems if there is
> > PTA_AVX2, then there is also PTA_FMA: haswell, broadwell, skylake, skylake-avx512, knl,
> > bdver4, znver1, there are CPUs that have just PTA_AVX and not PTA_AVX2 and
> > still have PTA_FMA: bdver2, bdver3 (but that is not relevant to this patch).
> 
> In a previous incantation of the patch, I saw that the compiler
> generated the same floating point code for AVX and AVX2 (which is why
> there currently is no AVX2 floating point version).  I could also
> generate an AVX+FMA version for floating point and an AVX2 version
> for integer (if anybody cares about integer matmul).

I think having another avx,fma version is not worth it, avx+fma is far
less common than avx without fma.

> > > @@ -147,7 +141,8 @@
> > >  #endif /* HAVE_AVX512F */
> > > 
> > >  #ifdef HAVE_AVX2
> > > -  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> > > +  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> > > +      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
> > >      {
> > >        matmul_p = matmul_'rtype_code`_avx2;
> > >        goto tailcall;
> > 
> > and this too.
> 
> Will do.

Note I meant obviously the FEATURE_AVX512F related hunk, not this one,
sorry.

	Jakub
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  9:08 ` Jakub Jelinek
@ 2017-03-02 10:46   ` Thomas Koenig
  2017-03-02 10:48     ` Jakub Jelinek
  2017-03-02 11:02     ` Jakub Jelinek
  0 siblings, 2 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 10:46 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: fortran, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

Here's the updated version, which just uses FMA for AVX2.

OK for trunk?

Regards

	Thomas

2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/78379
	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate
	for reals.  Add fma to target options.
	(matmul_'rtype_code`): Call AVX2 only if FMA is available.
	* generated/matmul_c10.c: Regenerated.
	* generated/matmul_c16.c: Regenerated.
	* generated/matmul_c4.c: Regenerated.
	* generated/matmul_c8.c: Regenerated.
	* generated/matmul_i1.c: Regenerated.
	* generated/matmul_i16.c: Regenerated.
	* generated/matmul_i2.c: Regenerated.
	* generated/matmul_i4.c: Regenerated.
	* generated/matmul_i8.c: Regenerated.
	* generated/matmul_r10.c: Regenerated.
	* generated/matmul_r16.c: Regenerated.
	* generated/matmul_r4.c: Regenerated.
	* generated/matmul_r8.c: Regenerated.

[-- Attachment #2: p2-fma.diff --]
[-- Type: text/x-patch, Size: 20305 bytes --]

Index: generated/matmul_c10.c
===================================================================
--- generated/matmul_c10.c	(Revision 245760)
+++ generated/matmul_c10.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c10 (gfc_array_c10 * const rest
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_c10);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c10_avx (gfc_array_c10 * const restrict ret
 static void
 matmul_c10_avx2 (gfc_array_c10 * const restrict retarray,
 	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_c10_avx2 (gfc_array_c10 * const restrict retarray,
 	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_c10_avx2;
       goto tailcall;
Index: generated/matmul_c16.c
===================================================================
--- generated/matmul_c16.c	(Revision 245760)
+++ generated/matmul_c16.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c16 (gfc_array_c16 * const rest
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_c16);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c16_avx (gfc_array_c16 * const restrict ret
 static void
 matmul_c16_avx2 (gfc_array_c16 * const restrict retarray,
 	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_c16_avx2 (gfc_array_c16 * const restrict retarray,
 	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_c16_avx2;
       goto tailcall;
Index: generated/matmul_c4.c
===================================================================
--- generated/matmul_c4.c	(Revision 245760)
+++ generated/matmul_c4.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c4 (gfc_array_c4 * const restri
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_c4);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c4_avx (gfc_array_c4 * const restrict retar
 static void
 matmul_c4_avx2 (gfc_array_c4 * const restrict retarray,
 	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_c4_avx2 (gfc_array_c4 * const restrict retarray,
 	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_c4_avx2;
       goto tailcall;
Index: generated/matmul_c8.c
===================================================================
--- generated/matmul_c8.c	(Revision 245760)
+++ generated/matmul_c8.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c8 (gfc_array_c8 * const restri
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_c8);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c8_avx (gfc_array_c8 * const restrict retar
 static void
 matmul_c8_avx2 (gfc_array_c8 * const restrict retarray,
 	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_c8_avx2 (gfc_array_c8 * const restrict retarray,
 	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_c8_avx2;
       goto tailcall;
Index: generated/matmul_i1.c
===================================================================
--- generated/matmul_i1.c	(Revision 245760)
+++ generated/matmul_i1.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i1 (gfc_array_i1 * const restri
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_i1);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i1_avx (gfc_array_i1 * const restrict retar
 static void
 matmul_i1_avx2 (gfc_array_i1 * const restrict retarray,
 	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_i1_avx2 (gfc_array_i1 * const restrict retarray,
 	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_i1_avx2;
       goto tailcall;
Index: generated/matmul_i16.c
===================================================================
--- generated/matmul_i16.c	(Revision 245760)
+++ generated/matmul_i16.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i16 (gfc_array_i16 * const rest
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_i16);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i16_avx (gfc_array_i16 * const restrict ret static void matmul_i16_avx2 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i16_avx2 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, @@ -2277,7 +2274,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i16_avx2; goto tailcall; Index: generated/matmul_i2.c =================================================================== --- generated/matmul_i2.c (Revision 245760) +++ generated/matmul_i2.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i2 (gfc_array_i2 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i2); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i2_avx (gfc_array_i2 * const restrict retar static void matmul_i2_avx2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i2_avx2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, @@ -2277,7 +2274,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i2_avx2; goto tailcall; Index: generated/matmul_i4.c =================================================================== --- generated/matmul_i4.c (Revision 245760) +++ generated/matmul_i4.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i4 (gfc_array_i4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i4); - - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i4_avx (gfc_array_i4 * const restrict retar static void matmul_i4_avx2 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i4_avx2 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, @@ -2277,7 +2274,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i4_avx2; goto tailcall; Index: generated/matmul_i8.c =================================================================== --- generated/matmul_i8.c (Revision 245760) +++ generated/matmul_i8.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i8 (gfc_array_i8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i8); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i8_avx (gfc_array_i8 * const restrict retar static void matmul_i8_avx2 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i8_avx2 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, @@ -2277,7 +2274,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i8_avx2; goto tailcall; Index: generated/matmul_r10.c =================================================================== --- generated/matmul_r10.c (Revision 245760) +++ generated/matmul_r10.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r10 (gfc_array_r10 * const rest int blas_limit, blas_call gemm); export_proto(matmul_r10); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r10_avx (gfc_array_r10 * const restrict ret static void matmul_r10_avx2 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r10_avx2 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, @@ -2281,7 +2274,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r10_avx2; goto tailcall; Index: generated/matmul_r16.c =================================================================== --- generated/matmul_r16.c (Revision 245760) +++ generated/matmul_r16.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r16 (gfc_array_r16 * const rest int blas_limit, blas_call gemm); export_proto(matmul_r16); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r16_avx (gfc_array_r16 * const restrict ret static void matmul_r16_avx2 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r16_avx2 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, @@ -2281,7 +2274,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r16_avx2; goto tailcall; Index: generated/matmul_r4.c =================================================================== --- generated/matmul_r4.c (Revision 245760) +++ generated/matmul_r4.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r4 (gfc_array_r4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_r4); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r4_avx (gfc_array_r4 * const restrict retar static void matmul_r4_avx2 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r4_avx2 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, @@ -2281,7 +2274,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r4_avx2; goto tailcall; Index: generated/matmul_r8.c =================================================================== --- generated/matmul_r8.c (Revision 245760) +++ generated/matmul_r8.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r8 (gfc_array_r8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_r8); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r8_avx (gfc_array_r8 * const restrict retar static void matmul_r8_avx2 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r8_avx2 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, @@ -2281,7 +2274,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r8_avx2; goto tailcall; Index: m4/matmul.m4 =================================================================== --- m4/matmul.m4 (Revision 245760) +++ m4/matmul.m4 (Arbeitskopie) @@ -75,14 +75,6 @@ extern void matmul_'rtype_code` ('rtype` * const r int blas_limit, blas_call gemm); export_proto(matmul_'rtype_code`); -'ifelse(rtype_letter,`r',dnl -`#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif') -` - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -101,7 +93,7 @@ static' include(matmul_internal.m4)dnl `static void 'matmul_name` ('rtype` * const restrict retarray, 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static' include(matmul_internal.m4)dnl `#endif /* HAVE_AVX2 */ @@ -147,7 +139,8 @@ void matmul_'rtype_code` ('rtype` * const restrict #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_'rtype_code`_avx2; goto tailcall; ^ permalink raw reply [flat|nested] 17+ messages in thread
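A brief aside on the mechanism this patch relies on: libgfortran itself is
compiled for a generic x86-64 baseline, and the
__attribute__((__target__("avx2,fma"))) on the forward declaration tells GCC
to compile just that one kernel as if -mavx2 -mfma were in effect, so the
vectorizer may emit fused multiply-add instructions there. A self-contained
sketch of the same pattern; the function name and loop are invented for
illustration:

  /* Build with plain gcc -O3 (no -mavx2/-mfma): only this function may
     use AVX2/FMA instructions, so the caller has to verify at run time
     that the CPU supports them before calling it.  */
  static void
  daxpy_avx2_fma (double *restrict y, const double *restrict x,
                  double a, int n)
      __attribute__ ((__target__ ("avx2,fma")));

  static void
  daxpy_avx2_fma (double *restrict y, const double *restrict x,
                  double a, int n)
  {
    /* a * x[i] + y[i] is the textbook candidate for a fused multiply-add.  */
    for (int i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];
  }

  int
  main (void)
  {
    double x[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }, y[8] = { 0 };
    if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
      daxpy_avx2_fma (y, x, 2.0, 8);  /* Features verified first, as in the patch.  */
    return (int) y[7];
  }

As in the patch, the attribute sits on a declaration and the definition that
follows picks it up, which is what lets the m4 machinery reuse one function
body for all the ISA variants.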
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02 10:46 ` Thomas Koenig
@ 2017-03-02 10:48   ` Jakub Jelinek
  2017-03-02 11:02   ` Jakub Jelinek
  1 sibling, 0 replies; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 10:48 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
>
> OK for trunk?
>
> Regards
>
>	Thomas
>
> 2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>
>
>	PR fortran/78379
>	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
>	reals.  Add fma to target options.
>	(matmul_'rtype_code`): Call AVX2 only if FMA is available.
>	* generated/matmul_c10.c: Regenerated.
>	* generated/matmul_c16.c: Regenerated.
>	* generated/matmul_c4.c: Regenerated.
>	* generated/matmul_c8.c: Regenerated.
>	* generated/matmul_i1.c: Regenerated.
>	* generated/matmul_i16.c: Regenerated.
>	* generated/matmul_i2.c: Regenerated.
>	* generated/matmul_i4.c: Regenerated.
>	* generated/matmul_i8.c: Regenerated.
>	* generated/matmul_r10.c: Regenerated.
>	* generated/matmul_r16.c: Regenerated.
>	* generated/matmul_r4.c: Regenerated.
>	* generated/matmul_r8.c: Regenerated.

Ok, thanks.

	Jakub

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02 10:46 ` Thomas Koenig
  2017-03-02 10:48   ` Jakub Jelinek
@ 2017-03-02 11:02   ` Jakub Jelinek
  2017-03-02 11:57     ` Thomas Koenig
  1 sibling, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 11:02 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
>
> OK for trunk?
>
> Regards
>
>	Thomas
>
> 2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>
>
>	PR fortran/78379
>	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
>	reals.  Add fma to target options.
>	(matmul_'rtype_code`): Call AVX2 only if FMA is available.
>	* generated/matmul_c10.c: Regenerated.
>	* generated/matmul_c16.c: Regenerated.
>	* generated/matmul_c4.c: Regenerated.
>	* generated/matmul_c8.c: Regenerated.
>	* generated/matmul_i1.c: Regenerated.
>	* generated/matmul_i16.c: Regenerated.
>	* generated/matmul_i2.c: Regenerated.
>	* generated/matmul_i4.c: Regenerated.
>	* generated/matmul_i8.c: Regenerated.
>	* generated/matmul_r10.c: Regenerated.
>	* generated/matmul_r16.c: Regenerated.
>	* generated/matmul_r4.c: Regenerated.
>	* generated/matmul_r8.c: Regenerated.

Actually, I see a problem, but not related to this patch.
I bet e.g. tsan would complain heavily about the wrappers, because the code
is racy:

  static void (*matmul_p) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm) = NULL;

  if (matmul_p == NULL)
    {
      matmul_p = matmul_'rtype_code`_vanilla;
      if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
	{
	  /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
	  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
	    {
	      matmul_p = matmul_'rtype_code`_avx512f;
	      goto tailcall;
	    }
#endif  /* HAVE_AVX512F */
	  ...
	}

tailcall:
  (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);

So, even when assuming all matmul_p = stores are atomic, e.g. if you call
matmul from two or more threads at about the same time for the first time,
it could be that the first one sets matmul_p to vanilla and then another
thread runs it (uselessly slow), etc.

As you don't care about the if (matmul_p == NULL) part being done in
multiple threads concurrently, I guess you could e.g. do:

  static void (*matmul_p) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm); // <--- No need for NULL initializer for static var

  void (*matmul_fn) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm);

  matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
  if (matmul_fn == NULL)
    {
      matmul_fn = matmul_'rtype_code`_vanilla;
      if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
	{
	  /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
	  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
	    {
	      matmul_fn = matmul_'rtype_code`_avx512f;
	      goto finish;
	    }
#endif  /* HAVE_AVX512F */
	  ...
    finish:
      __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
    }

  (*matmul_fn) (retarray, a, b, try_blas, blas_limit, gemm);

(i.e. make sure you read matmul_p in each call exactly once and store at
most once per thread).

	Jakub

^ permalink raw reply	[flat|nested] 17+ messages in thread
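Jakub's suggestion is an instance of a generic lock-free one-time
initialization pattern for function pointers. A standalone, runnable version
of the same idea (all names here are hypothetical; this is not libgfortran
code):

  #include <stddef.h>
  #include <stdio.h>

  typedef int (*impl_fn) (int);

  static int vanilla_impl (int x) { return x + 1; }
  static int fancy_impl (int x) { return x + 1; }  /* Stand-in for an AVX2+FMA kernel.  */

  int
  dispatch (int x)
  {
    static impl_fn impl_p;  /* Static, hence zero-initialized (NULL).  */
    impl_fn fn = __atomic_load_n (&impl_p, __ATOMIC_RELAXED);

    if (fn == NULL)
      {
        fn = vanilla_impl;
        if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
          fn = fancy_impl;
        __atomic_store_n (&impl_p, fn, __ATOMIC_RELAXED);
      }
    return fn (x);  /* Call through the local copy, never through impl_p again.  */
  }

  int
  main (void)
  {
    printf ("%d\n", dispatch (41));
    return 0;
  }

Relaxed ordering is sufficient here because every thread that races through
the detection block computes the same pointer value; the worst case is a
redundant store, never a torn or stale dispatch.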
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul 2017-03-02 11:02 ` Jakub Jelinek @ 2017-03-02 11:57 ` Thomas Koenig 2017-03-02 12:02 ` Jakub Jelinek 0 siblings, 1 reply; 17+ messages in thread From: Thomas Koenig @ 2017-03-02 11:57 UTC (permalink / raw) To: Jakub Jelinek; +Cc: fortran, gcc-patches [-- Attachment #1: Type: text/plain, Size: 1135 bytes --] Hi Jakub, > Actually, I see a problem, but not related to this patch. > I bet e.g. tsan would complain heavily on the wrappers, because the code > is racy: Here is a patch implementing your suggestion. Tested at least so far that all matmul test cases pass on my machine. OK for trunk? Regards Thomas 2017-03-02 Thomas Koenig <tkoenig@gcc.gnu.org> Jakub Jelinek <jakub@redhat.com> * m4/matmul.m4 (matmul_'rtype_code`_avx2): Avoid race condition on storing function pointer. * generated/matmul_c10.c: Regenerated. * generated/matmul_c16.c: Regenerated. * generated/matmul_c4.c: Regenerated. * generated/matmul_c8.c: Regenerated. * generated/matmul_i1.c: Regenerated. * generated/matmul_i16.c: Regenerated. * generated/matmul_i2.c: Regenerated. * generated/matmul_i4.c: Regenerated. * generated/matmul_i8.c: Regenerated. * generated/matmul_r10.c: Regenerated. * generated/matmul_r16.c: Regenerated. * generated/matmul_r4.c: Regenerated. * generated/matmul_r8.c: Regenerated. [-- Attachment #2: p1-race.diff --] [-- Type: text/x-patch, Size: 28758 bytes --] Index: generated/matmul_c10.c =================================================================== --- generated/matmul_c10.c (Revision 245836) +++ generated/matmul_c10.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_c10 (gfc_array_c10 * const restrict re gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_c10 * const restrict retarray, + gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_c10_vanilla; + matmul_fn = matmul_c10_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_c10_avx512f; - goto tailcall; + matmul_fn = matmul_c10_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_c10_avx2; - goto tailcall; + matmul_fn = matmul_c10_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_c10 (gfc_array_c10 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_c10_avx; - goto tailcall; + matmul_fn = matmul_c10_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_c16.c =================================================================== --- generated/matmul_c16.c (Revision 245836) +++ generated/matmul_c16.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_c16 (gfc_array_c16 * const restrict re gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_c16 * const restrict retarray, + gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_c16_vanilla; + matmul_fn = matmul_c16_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_c16_avx512f; - goto tailcall; + matmul_fn = matmul_c16_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_c16_avx2; - goto tailcall; + matmul_fn = matmul_c16_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_c16 (gfc_array_c16 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_c16_avx; - goto tailcall; + matmul_fn = matmul_c16_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_c4.c =================================================================== --- generated/matmul_c4.c (Revision 245836) +++ generated/matmul_c4.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_c4 (gfc_array_c4 * const restrict reta gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_c4 * const restrict retarray, + gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_c4_vanilla; + matmul_fn = matmul_c4_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_c4_avx512f; - goto tailcall; + matmul_fn = matmul_c4_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_c4_avx2; - goto tailcall; + matmul_fn = matmul_c4_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_c4 (gfc_array_c4 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_c4_avx; - goto tailcall; + matmul_fn = matmul_c4_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_c8.c =================================================================== --- generated/matmul_c8.c (Revision 245836) +++ generated/matmul_c8.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_c8 (gfc_array_c8 * const restrict reta gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_c8 * const restrict retarray, + gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_c8_vanilla; + matmul_fn = matmul_c8_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_c8_avx512f; - goto tailcall; + matmul_fn = matmul_c8_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_c8_avx2; - goto tailcall; + matmul_fn = matmul_c8_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_c8 (gfc_array_c8 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_c8_avx; - goto tailcall; + matmul_fn = matmul_c8_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i1.c =================================================================== --- generated/matmul_i1.c (Revision 245836) +++ generated/matmul_i1.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i1 (gfc_array_i1 * const restrict reta gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i1 * const restrict retarray, + gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i1_vanilla; + matmul_fn = matmul_i1_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i1_avx512f; - goto tailcall; + matmul_fn = matmul_i1_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i1_avx2; - goto tailcall; + matmul_fn = matmul_i1_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i1 (gfc_array_i1 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i1_avx; - goto tailcall; + matmul_fn = matmul_i1_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i16.c =================================================================== --- generated/matmul_i16.c (Revision 245836) +++ generated/matmul_i16.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i16 (gfc_array_i16 * const restrict re gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i16 * const restrict retarray, + gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i16_vanilla; + matmul_fn = matmul_i16_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i16_avx512f; - goto tailcall; + matmul_fn = matmul_i16_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i16_avx2; - goto tailcall; + matmul_fn = matmul_i16_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i16 (gfc_array_i16 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i16_avx; - goto tailcall; + matmul_fn = matmul_i16_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i2.c =================================================================== --- generated/matmul_i2.c (Revision 245836) +++ generated/matmul_i2.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i2 (gfc_array_i2 * const restrict reta gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i2 * const restrict retarray, + gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i2_vanilla; + matmul_fn = matmul_i2_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i2_avx512f; - goto tailcall; + matmul_fn = matmul_i2_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i2_avx2; - goto tailcall; + matmul_fn = matmul_i2_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i2 (gfc_array_i2 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i2_avx; - goto tailcall; + matmul_fn = matmul_i2_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i4.c =================================================================== --- generated/matmul_i4.c (Revision 245836) +++ generated/matmul_i4.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i4 (gfc_array_i4 * const restrict reta gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i4 * const restrict retarray, + gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i4_vanilla; + matmul_fn = matmul_i4_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i4_avx512f; - goto tailcall; + matmul_fn = matmul_i4_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i4_avx2; - goto tailcall; + matmul_fn = matmul_i4_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i4 (gfc_array_i4 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i4_avx; - goto tailcall; + matmul_fn = matmul_i4_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i8.c =================================================================== --- generated/matmul_i8.c (Revision 245836) +++ generated/matmul_i8.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i8 (gfc_array_i8 * const restrict reta gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i8 * const restrict retarray, + gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i8_vanilla; + matmul_fn = matmul_i8_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i8_avx512f; - goto tailcall; + matmul_fn = matmul_i8_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i8_avx2; - goto tailcall; + matmul_fn = matmul_i8_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i8 (gfc_array_i8 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i8_avx; - goto tailcall; + matmul_fn = matmul_i8_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_r10.c =================================================================== --- generated/matmul_r10.c (Revision 245836) +++ generated/matmul_r10.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_r10 (gfc_array_r10 * const restrict re gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_r10 * const restrict retarray, + gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_r10_vanilla; + matmul_fn = matmul_r10_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_r10_avx512f; - goto tailcall; + matmul_fn = matmul_r10_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_r10_avx2; - goto tailcall; + matmul_fn = matmul_r10_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_r10 (gfc_array_r10 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_r10_avx; - goto tailcall; + matmul_fn = matmul_r10_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_r16.c =================================================================== --- generated/matmul_r16.c (Revision 245836) +++ generated/matmul_r16.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_r16 (gfc_array_r16 * const restrict re gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_r16 * const restrict retarray, + gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_r16_vanilla; + matmul_fn = matmul_r16_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_r16_avx512f; - goto tailcall; + matmul_fn = matmul_r16_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_r16_avx2; - goto tailcall; + matmul_fn = matmul_r16_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_r16 (gfc_array_r16 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_r16_avx; - goto tailcall; + matmul_fn = matmul_r16_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_r4.c =================================================================== --- generated/matmul_r4.c (Revision 245836) +++ generated/matmul_r4.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_r4 (gfc_array_r4 * const restrict reta gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_r4 * const restrict retarray, + gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_r4_vanilla; + matmul_fn = matmul_r4_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_r4_avx512f; - goto tailcall; + matmul_fn = matmul_r4_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_r4_avx2; - goto tailcall; + matmul_fn = matmul_r4_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_r4 (gfc_array_r4 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_r4_avx; - goto tailcall; + matmul_fn = matmul_r4_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_r8.c =================================================================== --- generated/matmul_r8.c (Revision 245836) +++ generated/matmul_r8.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_r8 (gfc_array_r8 * const restrict reta gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_r8 * const restrict retarray, + gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_r8_vanilla; + matmul_fn = matmul_r8_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_r8_avx512f; - goto tailcall; + matmul_fn = matmul_r8_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_r8_avx2; - goto tailcall; + matmul_fn = matmul_r8_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_r8 (gfc_array_r8 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_r8_avx; - goto tailcall; + matmul_fn = matmul_r8_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: m4/matmul.m4 =================================================================== --- m4/matmul.m4 (Revision 245836) +++ m4/matmul.m4 (Arbeitskopie) @@ -123,9 +123,14 @@ void matmul_'rtype_code` ('rtype` * const restrict 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) ('rtype` * const restrict retarray, + 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_'rtype_code`_vanilla; + matmul_fn = matmul_'rtype_code`_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. */ @@ -132,8 +137,8 @@ void matmul_'rtype_code` ('rtype` * const restrict #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_'rtype_code`_avx512f; - goto tailcall; + matmul_fn = matmul_'rtype_code`_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -142,8 +147,8 @@ void matmul_'rtype_code` ('rtype` * const restrict if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_'rtype_code`_avx2; - goto tailcall; + matmul_fn = matmul_'rtype_code`_avx2; + goto store; } #endif @@ -151,14 +156,15 @@ void matmul_'rtype_code` ('rtype` * const restrict #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_'rtype_code`_avx; - goto tailcall; + matmul_fn = matmul_'rtype_code`_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02 11:57 ` Thomas Koenig
@ 2017-03-02 12:02   ` Jakub Jelinek
  2017-03-02 13:01     ` Thomas Koenig
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 12:02 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 02, 2017 at 12:57:05PM +0100, Thomas Koenig wrote:
> --- m4/matmul.m4	(Revision 245836)
> +++ m4/matmul.m4	(Arbeitskopie)
> @@ -123,9 +123,14 @@ void matmul_'rtype_code` ('rtype` * const restrict
>	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
>	int blas_limit, blas_call gemm) = NULL;

Please drop the " = NULL" here

> +  void (*matmul_fn) ('rtype` * const restrict retarray,
> +	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> +	int blas_limit, blas_call gemm) = NULL;

and here as well.  The first one because static vars are zero-initialized
by default, the latter because it makes no sense to initialize it and then
immediately overwrite it in the next stmt.

> +
> +  matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
>    if (matmul_p == NULL)

This needs to test matmul_fn == NULL instead of matmul_p == NULL.

> @@ -151,14 +156,15 @@ void matmul_'rtype_code` ('rtype` * const restrict
>  #ifdef HAVE_AVX
>        if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
>	{
> -	  matmul_p = matmul_'rtype_code`_avx;
> -	  goto tailcall;
> +	  matmul_fn = matmul_'rtype_code`_avx;
> +	  goto store;
>	}
>  #endif  /* HAVE_AVX */
>	}
> +   store:
> +      __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
>      }
>
> -tailcall:
>    (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);

And this needs to use *matmul_fn instead of *matmul_p too.
The whole point is that matmul_p is only loaded using __atomic_load_n
and only optionally stored using __atomic_store_n.

Ok with those changes.

	Jakub

^ permalink raw reply	[flat|nested] 17+ messages in thread
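To see what these review points buy in practice, the hypothetical dispatch()
sketch given after Jakub's 11:02 message can be hammered from several
threads, which is exactly the first-call scenario he describes. With the
single-load/single-store discipline a thread sanitizer should stay quiet,
whereas the original double read of matmul_p is the kind of access tsan
would likely flag. A smoke test, assuming dispatch() from that sketch is
linked into the same program (build with -pthread, optionally
-fsanitize=thread):

  #include <pthread.h>
  #include <stdio.h>

  int dispatch (int x);  /* From the standalone sketch above (assumed).  */

  static void *
  worker (void *arg)
  {
    (void) arg;
    for (int i = 0; i < 100000; i++)
      (void) dispatch (i);  /* First calls race benignly; later calls hit the cache.  */
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
      pthread_create (&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
      pthread_join (t[i], NULL);
    puts ("done");
    return 0;
  }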
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02 12:02 ` Jakub Jelinek
@ 2017-03-02 13:01   ` Thomas Koenig
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 13:01 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: fortran, gcc-patches

On 02.03.2017 13:02, Jakub Jelinek wrote:
> And this needs to use *matmul_fn instead of *matmul_p too.
> The whole point is that matmul_p is only loaded using __atomic_load_n
> and only optionally stored using __atomic_store_n.
>
> Ok with those changes.

Thanks! Committed as

https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=245839

Regards

	Thomas

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2017-03-02 13:01 UTC | newest]

Thread overview: 17+ messages
2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
2017-03-02  3:22 ` Jerry DeLisle
2017-03-02  6:15 ` Thomas Koenig
2017-03-02  7:32 ` Janne Blomqvist
2017-03-02  7:50 ` Thomas Koenig
2017-03-02  8:09 ` Janne Blomqvist
2017-03-02  8:14 ` Richard Biener
2017-03-02  8:16 ` Jakub Jelinek
2017-03-02  8:43 ` Jakub Jelinek
2017-03-02  9:03 ` Thomas Koenig
2017-03-02  9:08 ` Jakub Jelinek
2017-03-02 10:46 ` Thomas Koenig
2017-03-02 10:48 ` Jakub Jelinek
2017-03-02 11:02 ` Jakub Jelinek
2017-03-02 11:57 ` Thomas Koenig
2017-03-02 12:02 ` Jakub Jelinek
2017-03-02 13:01 ` Thomas Koenig