* [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
@ 2017-03-01 21:00 Thomas Koenig
2017-03-02 3:22 ` Jerry DeLisle
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-01 21:00 UTC
To: fortran, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]
Hello world,
the attached patch enables FMA for the AVX2 and AVX512F variants of
matmul. This should bring a very nice speedup (although I have
been unable to run benchmarks due to lack of a suitable machine).
Question: Is this still appropriate for the current state of trunk?
Or rather, OK for when gcc 8 opens (which might still be some time
in the future)?
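To show what the change amounts to in practice: adding "fma" to the
target options lets GCC contract the multiply-adds in the kernels into
vfmadd instructions. A minimal standalone sketch (illustrative only, not
part of the patch, and assuming the default -ffp-contract=fast at -O2 or
higher):

/* Without "fma" in the target string this loop needs separate
   vmul/vadd instructions; with it, GCC may contract the multiply-add
   into a single vfmadd.  */
__attribute__((__target__("avx2,fma")))
static double
dot_sketch (const double *a, const double *b, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i] * b[i];
  return s;
}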
2017-03-01 Thomas Koenig <tkoenig@gcc.gnu.org>
PR fortran/78379
* m4/matmul.m4 (matmul_'rtype_code`_avx2): Also generate for
reals. Add fma to target options.
(matmul_'rtype_code`_avx512f): Add fma to target options.
(matmul_'rtype_code`): Call AVX2 and AVX512F only if
FMA is available.
* generated/matmul_c10.c: Regenerated.
* generated/matmul_c16.c: Regenerated.
* generated/matmul_c4.c: Regenerated.
* generated/matmul_c8.c: Regenerated.
* generated/matmul_i1.c: Regenerated.
* generated/matmul_i16.c: Regenerated.
* generated/matmul_i2.c: Regenerated.
* generated/matmul_i4.c: Regenerated.
* generated/matmul_i8.c: Regenerated.
* generated/matmul_r10.c: Regenerated.
* generated/matmul_r16.c: Regenerated.
* generated/matmul_r4.c: Regenerated.
* generated/matmul_r8.c: Regenerated.
Regards
Thomas
[-- Attachment #2: p1-fma.diff --]
[-- Type: text/x-patch, Size: 2139 bytes --]
Index: m4/matmul.m4
===================================================================
--- m4/matmul.m4 (revision 245760)
+++ m4/matmul.m4 (working copy)
@@ -75,14 +75,6 @@
int blas_limit, blas_call gemm);
export_proto(matmul_'rtype_code`);
-'ifelse(rtype_letter,`r',dnl
-`#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif')
-`
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -101,7 +93,7 @@
`static void
'matmul_name` ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static' include(matmul_internal.m4)dnl
`#endif /* HAVE_AVX2 */
@@ -110,7 +102,7 @@
`static void
'matmul_name` ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static' include(matmul_internal.m4)dnl
`#endif /* HAVE_AVX512F */
@@ -138,7 +130,9 @@
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_'rtype_code`_avx512f;
goto tailcall;
@@ -147,7 +141,8 @@
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_'rtype_code`_avx2;
goto tailcall;
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
@ 2017-03-02 3:22 ` Jerry DeLisle
2017-03-02 6:15 ` Thomas Koenig
2017-03-02 7:32 ` Janne Blomqvist
2017-03-02 8:43 ` Jakub Jelinek
2 siblings, 1 reply; 17+ messages in thread
From: Jerry DeLisle @ 2017-03-02 3:22 UTC
To: fortran; +Cc: GCC Patches
On 03/01/2017 01:00 PM, Thomas Koenig wrote:
> Hello world,
>
> the attached patch enables FMA for the AVX2 and AVX512F variants of
> matmul. This should bring a very nice speedup (although I have
> been unable to run benchmarks due to lack of a suitable machine).
>
> Question: Is this still appropriate for the current state of trunk?
> Or rather, OK for when gcc 8 opens (which might still be some time
> in the future)?
I think it may be appropriate now because you are making an adjustment to the
newly added feature.
I would prefer that it was tested on the actual expected platform. Does anyone
anywhere on this list have access to one of these machines to test?
Jerry
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 3:22 ` Jerry DeLisle
@ 2017-03-02 6:15 ` Thomas Koenig
0 siblings, 0 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 6:15 UTC
To: Jerry DeLisle, fortran; +Cc: GCC Patches
[-- Attachment #1: Type: text/plain, Size: 305 bytes --]
Hi Jerry,
> I would prefer that it was tested on the actual expected platform. Does
> anyone anywhere on this list have access to one of these machines to test?
If anybody who does not have --enable-maintainer-mode activated wants to
test, here is a patch that works "out of the box".
Regards
Thomas
[-- Attachment #2: p1-fma-total.diff --]
[-- Type: text/x-patch, Size: 34073 bytes --]
Index: generated/matmul_c10.c
===================================================================
--- generated/matmul_c10.c (revision 245760)
+++ generated/matmul_c10.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_c10 (gfc_array_c10 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_c10);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c10_avx (gfc_array_c10 * const restrict ret
static void
matmul_c10_avx2 (gfc_array_c10 * const restrict retarray,
gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_c10_avx2 (gfc_array_c10 * const restrict retarray,
gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_c10_avx2 (gfc_array_c10 * const restrict re
static void
matmul_c10_avx512f (gfc_array_c10 * const restrict retarray,
gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_c10_avx512f (gfc_array_c10 * const restrict retarray,
gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_c10 (gfc_array_c10 * const restrict re
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_c10_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_c10_avx2;
goto tailcall;
Index: generated/matmul_c16.c
===================================================================
--- generated/matmul_c16.c (revision 245760)
+++ generated/matmul_c16.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_c16 (gfc_array_c16 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_c16);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c16_avx (gfc_array_c16 * const restrict ret
static void
matmul_c16_avx2 (gfc_array_c16 * const restrict retarray,
gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_c16_avx2 (gfc_array_c16 * const restrict retarray,
gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_c16_avx2 (gfc_array_c16 * const restrict re
static void
matmul_c16_avx512f (gfc_array_c16 * const restrict retarray,
gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_c16_avx512f (gfc_array_c16 * const restrict retarray,
gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_c16 (gfc_array_c16 * const restrict re
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_c16_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_c16_avx2;
goto tailcall;
Index: generated/matmul_c4.c
===================================================================
--- generated/matmul_c4.c (revision 245760)
+++ generated/matmul_c4.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_c4 (gfc_array_c4 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_c4);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c4_avx (gfc_array_c4 * const restrict retar
static void
matmul_c4_avx2 (gfc_array_c4 * const restrict retarray,
gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_c4_avx2 (gfc_array_c4 * const restrict retarray,
gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_c4_avx2 (gfc_array_c4 * const restrict reta
static void
matmul_c4_avx512f (gfc_array_c4 * const restrict retarray,
gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_c4_avx512f (gfc_array_c4 * const restrict retarray,
gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_c4_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_c4_avx2;
goto tailcall;
Index: generated/matmul_c8.c
===================================================================
--- generated/matmul_c8.c (revision 245760)
+++ generated/matmul_c8.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_c8 (gfc_array_c8 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_c8);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c8_avx (gfc_array_c8 * const restrict retar
static void
matmul_c8_avx2 (gfc_array_c8 * const restrict retarray,
gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_c8_avx2 (gfc_array_c8 * const restrict retarray,
gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_c8_avx2 (gfc_array_c8 * const restrict reta
static void
matmul_c8_avx512f (gfc_array_c8 * const restrict retarray,
gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_c8_avx512f (gfc_array_c8 * const restrict retarray,
gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_c8_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_c8_avx2;
goto tailcall;
Index: generated/matmul_i1.c
===================================================================
--- generated/matmul_i1.c (revision 245760)
+++ generated/matmul_i1.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_i1 (gfc_array_i1 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_i1);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i1_avx (gfc_array_i1 * const restrict retar
static void
matmul_i1_avx2 (gfc_array_i1 * const restrict retarray,
gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i1_avx2 (gfc_array_i1 * const restrict retarray,
gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_i1_avx2 (gfc_array_i1 * const restrict reta
static void
matmul_i1_avx512f (gfc_array_i1 * const restrict retarray,
gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_i1_avx512f (gfc_array_i1 * const restrict retarray,
gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_i1_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i1_avx2;
goto tailcall;
Index: generated/matmul_i16.c
===================================================================
--- generated/matmul_i16.c (revision 245760)
+++ generated/matmul_i16.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_i16 (gfc_array_i16 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_i16);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i16_avx (gfc_array_i16 * const restrict ret
static void
matmul_i16_avx2 (gfc_array_i16 * const restrict retarray,
gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i16_avx2 (gfc_array_i16 * const restrict retarray,
gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_i16_avx2 (gfc_array_i16 * const restrict re
static void
matmul_i16_avx512f (gfc_array_i16 * const restrict retarray,
gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_i16_avx512f (gfc_array_i16 * const restrict retarray,
gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_i16 (gfc_array_i16 * const restrict re
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_i16_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i16_avx2;
goto tailcall;
Index: generated/matmul_i2.c
===================================================================
--- generated/matmul_i2.c (revision 245760)
+++ generated/matmul_i2.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_i2 (gfc_array_i2 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_i2);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i2_avx (gfc_array_i2 * const restrict retar
static void
matmul_i2_avx2 (gfc_array_i2 * const restrict retarray,
gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i2_avx2 (gfc_array_i2 * const restrict retarray,
gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_i2_avx2 (gfc_array_i2 * const restrict reta
static void
matmul_i2_avx512f (gfc_array_i2 * const restrict retarray,
gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_i2_avx512f (gfc_array_i2 * const restrict retarray,
gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_i2 (gfc_array_i2 * const restrict reta
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_i2_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i2_avx2;
goto tailcall;
Index: generated/matmul_i4.c
===================================================================
--- generated/matmul_i4.c (revision 245760)
+++ generated/matmul_i4.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_i4 (gfc_array_i4 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_i4);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i4_avx (gfc_array_i4 * const restrict retar
static void
matmul_i4_avx2 (gfc_array_i4 * const restrict retarray,
gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i4_avx2 (gfc_array_i4 * const restrict retarray,
gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_i4_avx2 (gfc_array_i4 * const restrict reta
static void
matmul_i4_avx512f (gfc_array_i4 * const restrict retarray,
gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_i4_avx512f (gfc_array_i4 * const restrict retarray,
gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_i4 (gfc_array_i4 * const restrict reta
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_i4_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i4_avx2;
goto tailcall;
Index: generated/matmul_i8.c
===================================================================
--- generated/matmul_i8.c (revision 245760)
+++ generated/matmul_i8.c (working copy)
@@ -74,9 +74,6 @@ extern void matmul_i8 (gfc_array_i8 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_i8);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i8_avx (gfc_array_i8 * const restrict retar
static void
matmul_i8_avx2 (gfc_array_i8 * const restrict retarray,
gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i8_avx2 (gfc_array_i8 * const restrict retarray,
gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
@@ -1171,7 +1168,7 @@ matmul_i8_avx2 (gfc_array_i8 * const restrict reta
static void
matmul_i8_avx512f (gfc_array_i8 * const restrict retarray,
gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_i8_avx512f (gfc_array_i8 * const restrict retarray,
gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
@@ -2268,7 +2265,9 @@ void matmul_i8 (gfc_array_i8 * const restrict reta
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_i8_avx512f;
goto tailcall;
@@ -2277,7 +2276,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i8_avx2;
goto tailcall;
Index: generated/matmul_r10.c
===================================================================
--- generated/matmul_r10.c (revision 245760)
+++ generated/matmul_r10.c (working copy)
@@ -74,13 +74,6 @@ extern void matmul_r10 (gfc_array_r10 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_r10);
-#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -632,7 +625,7 @@ matmul_r10_avx (gfc_array_r10 * const restrict ret
static void
matmul_r10_avx2 (gfc_array_r10 * const restrict retarray,
gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_r10_avx2 (gfc_array_r10 * const restrict retarray,
gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
@@ -1175,7 +1168,7 @@ matmul_r10_avx2 (gfc_array_r10 * const restrict re
static void
matmul_r10_avx512f (gfc_array_r10 * const restrict retarray,
gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_r10_avx512f (gfc_array_r10 * const restrict retarray,
gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
@@ -2272,7 +2265,9 @@ void matmul_r10 (gfc_array_r10 * const restrict re
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_r10_avx512f;
goto tailcall;
@@ -2281,7 +2276,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_r10_avx2;
goto tailcall;
Index: generated/matmul_r16.c
===================================================================
--- generated/matmul_r16.c (revision 245760)
+++ generated/matmul_r16.c (working copy)
@@ -74,13 +74,6 @@ extern void matmul_r16 (gfc_array_r16 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_r16);
-#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -632,7 +625,7 @@ matmul_r16_avx (gfc_array_r16 * const restrict ret
static void
matmul_r16_avx2 (gfc_array_r16 * const restrict retarray,
gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_r16_avx2 (gfc_array_r16 * const restrict retarray,
gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
@@ -1175,7 +1168,7 @@ matmul_r16_avx2 (gfc_array_r16 * const restrict re
static void
matmul_r16_avx512f (gfc_array_r16 * const restrict retarray,
gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_r16_avx512f (gfc_array_r16 * const restrict retarray,
gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
@@ -2272,7 +2265,9 @@ void matmul_r16 (gfc_array_r16 * const restrict re
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_r16_avx512f;
goto tailcall;
@@ -2281,7 +2276,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_r16_avx2;
goto tailcall;
Index: generated/matmul_r4.c
===================================================================
--- generated/matmul_r4.c (revision 245760)
+++ generated/matmul_r4.c (working copy)
@@ -74,13 +74,6 @@ extern void matmul_r4 (gfc_array_r4 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_r4);
-#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -632,7 +625,7 @@ matmul_r4_avx (gfc_array_r4 * const restrict retar
static void
matmul_r4_avx2 (gfc_array_r4 * const restrict retarray,
gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_r4_avx2 (gfc_array_r4 * const restrict retarray,
gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
@@ -1175,7 +1168,7 @@ matmul_r4_avx2 (gfc_array_r4 * const restrict reta
static void
matmul_r4_avx512f (gfc_array_r4 * const restrict retarray,
gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_r4_avx512f (gfc_array_r4 * const restrict retarray,
gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
@@ -2272,7 +2265,9 @@ void matmul_r4 (gfc_array_r4 * const restrict reta
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_r4_avx512f;
goto tailcall;
@@ -2281,7 +2276,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_r4_avx2;
goto tailcall;
Index: generated/matmul_r8.c
===================================================================
--- generated/matmul_r8.c (revision 245760)
+++ generated/matmul_r8.c (working copy)
@@ -74,13 +74,6 @@ extern void matmul_r8 (gfc_array_r8 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_r8);
-#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -632,7 +625,7 @@ matmul_r8_avx (gfc_array_r8 * const restrict retar
static void
matmul_r8_avx2 (gfc_array_r8 * const restrict retarray,
gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_r8_avx2 (gfc_array_r8 * const restrict retarray,
gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
@@ -1175,7 +1168,7 @@ matmul_r8_avx2 (gfc_array_r8 * const restrict reta
static void
matmul_r8_avx512f (gfc_array_r8 * const restrict retarray,
gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static void
matmul_r8_avx512f (gfc_array_r8 * const restrict retarray,
gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
@@ -2272,7 +2265,9 @@ void matmul_r8 (gfc_array_r8 * const restrict reta
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_r8_avx512f;
goto tailcall;
@@ -2281,7 +2276,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_r8_avx2;
goto tailcall;
Index: m4/matmul.m4
===================================================================
--- m4/matmul.m4 (revision 245760)
+++ m4/matmul.m4 (working copy)
@@ -75,14 +75,6 @@ extern void matmul_'rtype_code` ('rtype` * const r
int blas_limit, blas_call gemm);
export_proto(matmul_'rtype_code`);
-'ifelse(rtype_letter,`r',dnl
-`#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif')
-`
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -101,7 +93,7 @@ static' include(matmul_internal.m4)dnl
`static void
'matmul_name` ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static' include(matmul_internal.m4)dnl
`#endif /* HAVE_AVX2 */
@@ -110,7 +102,7 @@ static' include(matmul_internal.m4)dnl
`static void
'matmul_name` ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
static' include(matmul_internal.m4)dnl
`#endif /* HAVE_AVX512F */
@@ -138,7 +130,9 @@ void matmul_'rtype_code` ('rtype` * const restrict
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
{
matmul_p = matmul_'rtype_code`_avx512f;
goto tailcall;
@@ -147,7 +141,8 @@ void matmul_'rtype_code` ('rtype` * const restrict
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_'rtype_code`_avx2;
goto tailcall;
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
2017-03-02 3:22 ` Jerry DeLisle
@ 2017-03-02 7:32 ` Janne Blomqvist
2017-03-02 7:50 ` Thomas Koenig
2017-03-02 8:43 ` Jakub Jelinek
2 siblings, 1 reply; 17+ messages in thread
From: Janne Blomqvist @ 2017-03-02 7:32 UTC
To: Thomas Koenig; +Cc: fortran, gcc-patches
On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> Hello world,
>
> the attached patch enables FMA for the AVX2 and AVX512F variants of
> matmul. This should bring a very nice speedup (although I have
> been unable to run benchmarks due to lack of a suitable machine).
In lieu of benchmarks, have you looked at the generated asm to verify
that fma is actually used?
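(A quick way to check, as a sketch: compile something like the
following with "gcc -O2 -S" and grep the assembly for vfmadd; the
function name is made up:)

/* If "fma" in the target string takes effect, this becomes a single
   vfmadd132sd (or similar) instead of vmulsd followed by vaddsd.  */
__attribute__((__target__("avx2,fma")))
double
fma_check (double a, double b, double c)
{
  return a * b + c;
}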
> Question: Is this still appropriate for the current state of trunk?
Yes, looks pretty safe.
--
Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 7:32 ` Janne Blomqvist
@ 2017-03-02 7:50 ` Thomas Koenig
2017-03-02 8:09 ` Janne Blomqvist
0 siblings, 1 reply; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 7:50 UTC
To: Janne Blomqvist; +Cc: fortran, gcc-patches
On 02.03.2017 at 08:32, Janne Blomqvist wrote:
> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de> wrote:
>> Hello world,
>>
>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>> matmul. This should bring a very nice speedup (although I have
>> been unable to run benchmarks due to lack of a suitable machine).
>
> In lieu of benchmarks, have you looked at the generated asm to verify
> that fma is actually used?
Yes, I did.
Here's something from the new matmul_r8_avx2:
156c: c4 62 e5 b8 fd vfmadd231pd %ymm5,%ymm3,%ymm15
1571: c4 c1 79 10 04 06 vmovupd (%r14,%rax,1),%xmm0
1577: c4 62 dd b8 db vfmadd231pd %ymm3,%ymm4,%ymm11
157c: c4 c3 7d 18 44 06 10 vinsertf128 $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
1583: 01
1584: c4 62 ed b8 ed vfmadd231pd %ymm5,%ymm2,%ymm13
1589: c4 e2 ed b8 fc vfmadd231pd %ymm4,%ymm2,%ymm7
158e: c4 e2 fd a8 ad 30 ff vfmadd213pd -0x800d0(%rbp),%ymm0,%ymm5
... and here from matmul_r8_avx512f:
1da8: c4 a1 7b 10 14 d6 vmovsd (%rsi,%r10,8),%xmm2
1dae: c4 c2 b1 b9 f0 vfmadd231sd %xmm8,%xmm9,%xmm6
1db3: 62 62 ed 08 b9 e5 vfmadd231sd %xmm5,%xmm2,%xmm28
1db9: 62 62 ed 08 b9 ec vfmadd231sd %xmm4,%xmm2,%xmm29
1dbf: 62 62 ed 08 b9 f3 vfmadd231sd %xmm3,%xmm2,%xmm30
1dc5: c4 e2 91 99 e8 vfmadd132sd %xmm0,%xmm13,%xmm5
1dca: c4 e2 99 99 e0 vfmadd132sd %xmm0,%xmm12,%xmm4
1dcf: c4 e2 a1 99 d8 vfmadd132sd %xmm0,%xmm11,%xmm3
1dd4: c4 c2 a9 99 d1 vfmadd132sd %xmm9,%xmm10,%xmm2
1dd9: c4 c2 89 99 c1 vfmadd132sd %xmm9,%xmm14,%xmm0
1dde: 0f 8e d3 fe ff ff jle 1cb7 <matmul_r8_avx512f+0x1cb7>
... so this is looking pretty good.
Regards
Thomas
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 7:50 ` Thomas Koenig
@ 2017-03-02 8:09 ` Janne Blomqvist
2017-03-02 8:14 ` Richard Biener
2017-03-02 8:16 ` Jakub Jelinek
0 siblings, 2 replies; 17+ messages in thread
From: Janne Blomqvist @ 2017-03-02 8:09 UTC
To: Thomas Koenig; +Cc: fortran, gcc-patches
On Thu, Mar 2, 2017 at 9:50 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> On 02.03.2017 at 08:32, Janne Blomqvist wrote:
>>
>> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de>
>> wrote:
>>>
>>> Hello world,
>>>
>>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>>> matmul. This should bring a very nice speedup (although I have
>>> been unable to run benchmarks due to lack of a suitable machine).
>>
>>
>> In lieu of benchmarks, have you looked at the generated asm to verify
>> that fma is actually used?
>
>
> Yes, I did.
>
> Here's something from the new matmul_r8_avx2:
>
> 156c: c4 62 e5 b8 fd vfmadd231pd %ymm5,%ymm3,%ymm15
> 1571: c4 c1 79 10 04 06 vmovupd (%r14,%rax,1),%xmm0
> 1577: c4 62 dd b8 db vfmadd231pd %ymm3,%ymm4,%ymm11
> 157c: c4 c3 7d 18 44 06 10 vinsertf128 $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
> 1583: 01
> 1584: c4 62 ed b8 ed vfmadd231pd %ymm5,%ymm2,%ymm13
> 1589: c4 e2 ed b8 fc vfmadd231pd %ymm4,%ymm2,%ymm7
> 158e: c4 e2 fd a8 ad 30 ff vfmadd213pd -0x800d0(%rbp),%ymm0,%ymm5
Great, looks good!
> ... and here from matmul_r8_avx512f:
>
> 1da8: c4 a1 7b 10 14 d6 vmovsd (%rsi,%r10,8),%xmm2
> 1dae: c4 c2 b1 b9 f0 vfmadd231sd %xmm8,%xmm9,%xmm6
> 1db3: 62 62 ed 08 b9 e5 vfmadd231sd %xmm5,%xmm2,%xmm28
> 1db9: 62 62 ed 08 b9 ec vfmadd231sd %xmm4,%xmm2,%xmm29
> 1dbf: 62 62 ed 08 b9 f3 vfmadd231sd %xmm3,%xmm2,%xmm30
> 1dc5: c4 e2 91 99 e8 vfmadd132sd %xmm0,%xmm13,%xmm5
> 1dca: c4 e2 99 99 e0 vfmadd132sd %xmm0,%xmm12,%xmm4
> 1dcf: c4 e2 a1 99 d8 vfmadd132sd %xmm0,%xmm11,%xmm3
> 1dd4: c4 c2 a9 99 d1 vfmadd132sd %xmm9,%xmm10,%xmm2
> 1dd9: c4 c2 89 99 c1 vfmadd132sd %xmm9,%xmm14,%xmm0
> 1dde: 0f 8e d3 fe ff ff jle 1cb7 <matmul_r8_avx512f+0x1cb7>
Good, it's using fma, but why is this using xmm registers? That would
mean it's operating only on 128-bit blocks at a time, so it's no better
than plain AVX. AFAIU avx512 should use zmm registers to operate on
512-bit chunks.
I guess this is not due to your patch, but some other issue.
--
Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 8:09 ` Janne Blomqvist
@ 2017-03-02 8:14 ` Richard Biener
2017-03-02 8:16 ` Jakub Jelinek
1 sibling, 0 replies; 17+ messages in thread
From: Richard Biener @ 2017-03-02 8:14 UTC
To: Janne Blomqvist; +Cc: Thomas Koenig, fortran, gcc-patches
On Thu, Mar 2, 2017 at 9:09 AM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> On Thu, Mar 2, 2017 at 9:50 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
>> On 02.03.2017 at 08:32, Janne Blomqvist wrote:
>>>
>>> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de>
>>> wrote:
>>>>
>>>> Hello world,
>>>>
>>>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>>>> matmul. This should bring a very nice speedup (although I have
>>>> been unable to run benchmarks due to lack of a suitable machine).
>>>
>>>
>>> In lieu of benchmarks, have you looked at the generated asm to verify
>>> that fma is actually used?
>>
>>
>> Yes, I did.
>>
>> Here's something from the new matmul_r8_avx2:
>>
>> 156c: c4 62 e5 b8 fd vfmadd231pd %ymm5,%ymm3,%ymm15
>> 1571: c4 c1 79 10 04 06 vmovupd (%r14,%rax,1),%xmm0
>> 1577: c4 62 dd b8 db vfmadd231pd %ymm3,%ymm4,%ymm11
>> 157c: c4 c3 7d 18 44 06 10 vinsertf128 $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
>> 1583: 01
>> 1584: c4 62 ed b8 ed vfmadd231pd %ymm5,%ymm2,%ymm13
>> 1589: c4 e2 ed b8 fc vfmadd231pd %ymm4,%ymm2,%ymm7
>> 158e: c4 e2 fd a8 ad 30 ff vfmadd213pd -0x800d0(%rbp),%ymm0,%ymm5
>
> Great, looks good!
>
>> ... and here from matmul_r8_avx512f:
>>
>> 1da8: c4 a1 7b 10 14 d6 vmovsd (%rsi,%r10,8),%xmm2
>> 1dae: c4 c2 b1 b9 f0 vfmadd231sd %xmm8,%xmm9,%xmm6
>> 1db3: 62 62 ed 08 b9 e5 vfmadd231sd %xmm5,%xmm2,%xmm28
>> 1db9: 62 62 ed 08 b9 ec vfmadd231sd %xmm4,%xmm2,%xmm29
>> 1dbf: 62 62 ed 08 b9 f3 vfmadd231sd %xmm3,%xmm2,%xmm30
>> 1dc5: c4 e2 91 99 e8 vfmadd132sd %xmm0,%xmm13,%xmm5
>> 1dca: c4 e2 99 99 e0 vfmadd132sd %xmm0,%xmm12,%xmm4
>> 1dcf: c4 e2 a1 99 d8 vfmadd132sd %xmm0,%xmm11,%xmm3
>> 1dd4: c4 c2 a9 99 d1 vfmadd132sd %xmm9,%xmm10,%xmm2
>> 1dd9: c4 c2 89 99 c1 vfmadd132sd %xmm9,%xmm14,%xmm0
>> 1dde: 0f 8e d3 fe ff ff jle 1cb7 <matmul_r8_avx512f+0x1cb7>
>
> Good, it's using fma, but why is this using xmm registers? That would
> mean it's operating only on 128-bit blocks at a time, so it's no better
> than plain AVX. AFAIU avx512 should use zmm registers to operate on
> 512-bit chunks.
>
> I guess this is not due to your patch, but some other issue.
The question is, was it using %zmm before the patch?
>
> --
> Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 8:09 ` Janne Blomqvist
2017-03-02 8:14 ` Richard Biener
@ 2017-03-02 8:16 ` Jakub Jelinek
1 sibling, 0 replies; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 8:16 UTC
To: Janne Blomqvist; +Cc: Thomas Koenig, fortran, gcc-patches
On Thu, Mar 02, 2017 at 10:09:31AM +0200, Janne Blomqvist wrote:
> > Here's something from the new matmul_r8_avx2:
> >
> > 156c: c4 62 e5 b8 fd vfmadd231pd %ymm5,%ymm3,%ymm15
> > 1571: c4 c1 79 10 04 06 vmovupd (%r14,%rax,1),%xmm0
> > 1577: c4 62 dd b8 db vfmadd231pd %ymm3,%ymm4,%ymm11
> > 157c: c4 c3 7d 18 44 06 10 vinsertf128 $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
> > 1583: 01
> > 1584: c4 62 ed b8 ed vfmadd231pd %ymm5,%ymm2,%ymm13
> > 1589: c4 e2 ed b8 fc vfmadd231pd %ymm4,%ymm2,%ymm7
> > 158e: c4 e2 fd a8 ad 30 ff vfmadd213pd -0x800d0(%rbp),%ymm0,%ymm5
>
> Great, looks good!
>
> > ... and here from matmul_r8_avx512f:
> >
> > 1da8: c4 a1 7b 10 14 d6 vmovsd (%rsi,%r10,8),%xmm2
> > 1dae: c4 c2 b1 b9 f0 vfmadd231sd %xmm8,%xmm9,%xmm6
> > 1db3: 62 62 ed 08 b9 e5 vfmadd231sd %xmm5,%xmm2,%xmm28
> > 1db9: 62 62 ed 08 b9 ec vfmadd231sd %xmm4,%xmm2,%xmm29
> > 1dbf: 62 62 ed 08 b9 f3 vfmadd231sd %xmm3,%xmm2,%xmm30
> > 1dc5: c4 e2 91 99 e8 vfmadd132sd %xmm0,%xmm13,%xmm5
> > 1dca: c4 e2 99 99 e0 vfmadd132sd %xmm0,%xmm12,%xmm4
> > 1dcf: c4 e2 a1 99 d8 vfmadd132sd %xmm0,%xmm11,%xmm3
> > 1dd4: c4 c2 a9 99 d1 vfmadd132sd %xmm9,%xmm10,%xmm2
> > 1dd9: c4 c2 89 99 c1 vfmadd132sd %xmm9,%xmm14,%xmm0
> > 1dde: 0f 8e d3 fe ff ff jle 1cb7 <matmul_r8_avx512f+0x1cb7>
>
> Good, it's using fma, but why is this using xmm registers? That would
> mean it's operating only on 128-bit blocks at a time, so it's no better
> than plain AVX. AFAIU avx512 should use zmm registers to operate on
> 512-bit chunks.
Well, it uses sd, i.e. the scalar fma, not pd, so those are always xmm regs
with only a single double in them; this must be some scalar epilogue loop or
similar. But matmul_r8_avx512f also has:
140c: 62 72 e5 40 98 c1 vfmadd132pd %zmm1,%zmm19,%zmm8
1412: 62 72 e5 40 98 cd vfmadd132pd %zmm5,%zmm19,%zmm9
1418: 62 72 e5 40 98 d1 vfmadd132pd %zmm1,%zmm19,%zmm10
141e: 62 72 e5 40 98 de vfmadd132pd %zmm6,%zmm19,%zmm11
1424: 62 72 e5 40 98 e1 vfmadd132pd %zmm1,%zmm19,%zmm12
142a: 62 e2 e5 40 98 c6 vfmadd132pd %zmm6,%zmm19,%zmm16
1430: 62 f2 e5 40 98 c8 vfmadd132pd %zmm0,%zmm19,%zmm1
1436: 62 f2 e5 40 98 f0 vfmadd132pd %zmm0,%zmm19,%zmm6
143c: 62 72 e5 40 98 fd vfmadd132pd %zmm5,%zmm19,%zmm15
1442: 62 72 e5 40 98 f4 vfmadd132pd %zmm4,%zmm19,%zmm14
1448: 62 72 e5 40 98 eb vfmadd132pd %zmm3,%zmm19,%zmm13
144e: 62 f2 e5 40 98 d0 vfmadd132pd %zmm0,%zmm19,%zmm2
1454: 62 b2 e5 40 98 ec vfmadd132pd %zmm20,%zmm19,%zmm5
145a: 62 b2 e5 40 98 e4 vfmadd132pd %zmm20,%zmm19,%zmm4
1460: 62 b2 e5 40 98 dc vfmadd132pd %zmm20,%zmm19,%zmm3
1466: 62 b2 e5 40 98 c4 vfmadd132pd %zmm20,%zmm19,%zmm0
etc. where 8 doubles in zmm regs are processed together.
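(The xmm/sd instructions are thus the scalar tail of the vectorized
loop. A rough sketch of why both forms show up in one function,
assuming vectorization is enabled on an avx512f target:)

/* The vectorizer emits a main loop that handles 8 doubles per
   iteration in %zmm registers (vfmadd...pd) and, for the leftover
   n % 8 elements, a scalar epilogue in %xmm registers (vfmadd...sd).  */
void
axpy_sketch (double *restrict y, const double *restrict x, double a, int n)
{
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}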
Jakub
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
2017-03-02 3:22 ` Jerry DeLisle
2017-03-02 7:32 ` Janne Blomqvist
@ 2017-03-02 8:43 ` Jakub Jelinek
2017-03-02 9:03 ` Thomas Koenig
2 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 8:43 UTC
To: Thomas Koenig; +Cc: fortran, gcc-patches
On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
> @@ -101,7 +93,7 @@
> `static void
> 'matmul_name` ('rtype` * const restrict retarray,
> 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> - int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
> + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
> static' include(matmul_internal.m4)dnl
> `#endif /* HAVE_AVX2 */
>
I guess the question here is whether there are any CPUs that have AVX2 but
don't have FMA3. If there are none, then this is not controversial; if there
are some, it depends on how widely they are used compared to ones that have
both AVX2 and FMA3. Going just from our -march= bitsets, it seems that
wherever there is PTA_AVX2 there is also PTA_FMA: haswell, broadwell, skylake,
skylake-avx512, knl, bdver4, znver1. There are CPUs that have just PTA_AVX and
not PTA_AVX2 and still have PTA_FMA: bdver2 and bdver3 (but that is not
relevant to this patch).
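(As an aside, a sketch of the same runtime test written with the
documented __builtin_cpu_supports instead of reading __cpu_model
directly; the kernel names here are hypothetical:)

extern void matmul_generic (void);
extern void matmul_avx2_fma (void);

typedef void (*matmul_fn) (void);

static matmul_fn
pick_kernel (void)
{
  /* Require both feature bits before selecting the AVX2+FMA kernel,
     mirroring the check in the patch.  */
  if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
    return matmul_avx2_fma;
  return matmul_generic;
}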
> @@ -110,7 +102,7 @@
> `static void
> 'matmul_name` ('rtype` * const restrict retarray,
> 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
> + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
> static' include(matmul_internal.m4)dnl
> `#endif /* HAVE_AVX512F */
>
I think this change is not needed, because the EVEX-encoded
VFMADD???[SP][DS] instructions etc. are in the AVX512F ISA, not in the
FMA3 ISA (which has just the VEX-encoded ones).
Which is why I'm seeing the fmas in my libgfortran even without your patch.
Thus I think you should remove this from your patch.
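(A sketch that demonstrates this: compiled with only avx512f in the
target string, the multiply-add below should still become an
EVEX-encoded vfmadd, since those encodings are part of AVX512F itself:)

__attribute__((__target__("avx512f")))
double
fma_avx512f_sketch (double a, double b, double c)
{
  return a * b + c;
}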
> @@ -147,7 +141,8 @@
> #endif /* HAVE_AVX512F */
>
> #ifdef HAVE_AVX2
> - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
> {
> matmul_p = matmul_'rtype_code`_avx2;
> goto tailcall;
and this too.
Jakub
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 8:43 ` Jakub Jelinek
@ 2017-03-02 9:03 ` Thomas Koenig
2017-03-02 9:08 ` Jakub Jelinek
0 siblings, 1 reply; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 9:03 UTC
To: Jakub Jelinek; +Cc: fortran, gcc-patches
On 02.03.2017 at 09:43, Jakub Jelinek wrote:
> On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
>> @@ -101,7 +93,7 @@
>> `static void
>> 'matmul_name` ('rtype` * const restrict retarray,
>> 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
>> - int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
>> + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
>> static' include(matmul_internal.m4)dnl
>> `#endif /* HAVE_AVX2 */
>>
>
> I guess the question here is whether there are any CPUs that have AVX2 but
> don't have FMA3. If there are none, then this is not controversial; if there
> are some, it depends on how widely they are used compared to ones that have
> both AVX2 and FMA3. Going just from our -march= bitsets, it seems that
> wherever there is PTA_AVX2 there is also PTA_FMA: haswell, broadwell,
> skylake, skylake-avx512, knl, bdver4, znver1. There are CPUs that have just
> PTA_AVX and not PTA_AVX2 and still have PTA_FMA: bdver2 and bdver3 (but that
> is not relevant to this patch).
In a previous incarnation of the patch, I saw that the compiler
generated the same floating-point code for AVX and AVX2 (which is why
there currently is no AVX2 floating-point version). I could also
generate an AVX+FMA version for floating point and an AVX2 version
for integer (if anybody cares about integer matmul).
Or I could just leave it as it is.
>> @@ -110,7 +102,7 @@
>> `static void
>> 'matmul_name` ('rtype` * const restrict retarray,
>> 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
>> - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
>> + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
>> static' include(matmul_internal.m4)dnl
>> `#endif /* HAVE_AVX512F */
>>
>
> I think this change is not needed, because the EVEX-encoded
> VFMADD???[SP][DS] instructions etc. are in the AVX512F ISA, not in the
> FMA3 ISA (which has just the VEX-encoded ones).
> Which is why I'm seeing the fmas in my libgfortran even without your patch.
> Thus I think you should remove this from your patch.
OK, I'll remove it.
>
>> @@ -147,7 +141,8 @@
>> #endif /* HAVE_AVX512F */
>>
>> #ifdef HAVE_AVX2
>> - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
>> + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
>> + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
>> {
>> matmul_p = matmul_'rtype_code`_avx2;
>> goto tailcall;
>
> and this too.
Will do.
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 9:03 ` Thomas Koenig
@ 2017-03-02 9:08 ` Jakub Jelinek
2017-03-02 10:46 ` Thomas Koenig
0 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 9:08 UTC
To: Thomas Koenig; +Cc: fortran, gcc-patches
On Thu, Mar 02, 2017 at 10:03:28AM +0100, Thomas Koenig wrote:
> On 02.03.2017 at 09:43, Jakub Jelinek wrote:
> > On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
> > > @@ -101,7 +93,7 @@
> > > `static void
> > > 'matmul_name` ('rtype` * const restrict retarray,
> > > 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> > > - int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
> > > + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
> > > static' include(matmul_internal.m4)dnl
> > > `#endif /* HAVE_AVX2 */
> > >
> >
> > I guess the question here is whether there are any CPUs that have AVX2 but
> > don't have FMA3. If there are none, then this is not controversial; if
> > there are some, it depends on how widely they are used compared to ones
> > that have both AVX2 and FMA3. Going just from our -march= bitsets, it
> > seems that wherever there is PTA_AVX2 there is also PTA_FMA: haswell,
> > broadwell, skylake, skylake-avx512, knl, bdver4, znver1. There are CPUs
> > that have just PTA_AVX and not PTA_AVX2 and still have PTA_FMA: bdver2 and
> > bdver3 (but that is not relevant to this patch).
>
> In a previous incarnation of the patch, I saw that the compiler
> generated the same floating-point code for AVX and AVX2 (which is why
> there currently is no AVX2 floating-point version). I could also
> generate an AVX+FMA version for floating point and an AVX2 version
> for integer (if anybody cares about integer matmul).
I think having another avx,fma version is not worth it; avx+fma is far less
common than avx without fma.
> > > @@ -147,7 +141,8 @@
> > > #endif /* HAVE_AVX512F */
> > >
> > > #ifdef HAVE_AVX2
> > > - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> > > + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> > > + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
> > > {
> > > matmul_p = matmul_'rtype_code`_avx2;
> > > goto tailcall;
> >
> > and this too.
>
> Will do.
Note I obviously meant the FEATURE_AVX512F-related hunk, not this one,
sorry.
Jakub
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 9:08 ` Jakub Jelinek
@ 2017-03-02 10:46 ` Thomas Koenig
2017-03-02 10:48 ` Jakub Jelinek
2017-03-02 11:02 ` Jakub Jelinek
0 siblings, 2 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 10:46 UTC
To: Jakub Jelinek; +Cc: fortran, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 969 bytes --]
Here's the updated version, which just uses FMA for AVX2.
OK for trunk?
Regards
Thomas
2017-03-01 Thomas Koenig <tkoenig@gcc.gnu.org>
PR fortran/78379
* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
reals. Add fma to target options.
(matmul_'rtype_code`): Call AVX2 only if FMA is available.
* generated/matmul_c10.c: Regenerated.
* generated/matmul_c16.c: Regenerated.
* generated/matmul_c4.c: Regenerated.
* generated/matmul_c8.c: Regenerated.
* generated/matmul_i1.c: Regenerated.
* generated/matmul_i16.c: Regenerated.
* generated/matmul_i2.c: Regenerated.
* generated/matmul_i4.c: Regenerated.
* generated/matmul_i8.c: Regenerated.
* generated/matmul_r10.c: Regenerated.
* generated/matmul_r16.c: Regenerated.
* generated/matmul_r4.c: Regenerated.
* generated/matmul_r8.c: Regenerated.
[-- Attachment #2: p2-fma.diff --]
[-- Type: text/x-patch, Size: 20305 bytes --]
Index: generated/matmul_c10.c
===================================================================
--- generated/matmul_c10.c (Revision 245760)
+++ generated/matmul_c10.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c10 (gfc_array_c10 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_c10);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c10_avx (gfc_array_c10 * const restrict ret
static void
matmul_c10_avx2 (gfc_array_c10 * const restrict retarray,
gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_c10_avx2 (gfc_array_c10 * const restrict retarray,
gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_c10_avx2;
goto tailcall;
Index: generated/matmul_c16.c
===================================================================
--- generated/matmul_c16.c (Revision 245760)
+++ generated/matmul_c16.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c16 (gfc_array_c16 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_c16);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c16_avx (gfc_array_c16 * const restrict ret
static void
matmul_c16_avx2 (gfc_array_c16 * const restrict retarray,
gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_c16_avx2 (gfc_array_c16 * const restrict retarray,
gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_c16_avx2;
goto tailcall;
Index: generated/matmul_c4.c
===================================================================
--- generated/matmul_c4.c (Revision 245760)
+++ generated/matmul_c4.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c4 (gfc_array_c4 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_c4);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c4_avx (gfc_array_c4 * const restrict retar
static void
matmul_c4_avx2 (gfc_array_c4 * const restrict retarray,
gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_c4_avx2 (gfc_array_c4 * const restrict retarray,
gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_c4_avx2;
goto tailcall;
Index: generated/matmul_c8.c
===================================================================
--- generated/matmul_c8.c (Revision 245760)
+++ generated/matmul_c8.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c8 (gfc_array_c8 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_c8);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c8_avx (gfc_array_c8 * const restrict retar
static void
matmul_c8_avx2 (gfc_array_c8 * const restrict retarray,
gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_c8_avx2 (gfc_array_c8 * const restrict retarray,
gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_c8_avx2;
goto tailcall;
Index: generated/matmul_i1.c
===================================================================
--- generated/matmul_i1.c (Revision 245760)
+++ generated/matmul_i1.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i1 (gfc_array_i1 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_i1);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i1_avx (gfc_array_i1 * const restrict retar
static void
matmul_i1_avx2 (gfc_array_i1 * const restrict retarray,
gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i1_avx2 (gfc_array_i1 * const restrict retarray,
gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i1_avx2;
goto tailcall;
Index: generated/matmul_i16.c
===================================================================
--- generated/matmul_i16.c (Revision 245760)
+++ generated/matmul_i16.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i16 (gfc_array_i16 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_i16);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i16_avx (gfc_array_i16 * const restrict ret
static void
matmul_i16_avx2 (gfc_array_i16 * const restrict retarray,
gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i16_avx2 (gfc_array_i16 * const restrict retarray,
gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i16_avx2;
goto tailcall;
Index: generated/matmul_i2.c
===================================================================
--- generated/matmul_i2.c (Revision 245760)
+++ generated/matmul_i2.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i2 (gfc_array_i2 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_i2);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i2_avx (gfc_array_i2 * const restrict retar
static void
matmul_i2_avx2 (gfc_array_i2 * const restrict retarray,
gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i2_avx2 (gfc_array_i2 * const restrict retarray,
gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i2_avx2;
goto tailcall;
Index: generated/matmul_i4.c
===================================================================
--- generated/matmul_i4.c (Revision 245760)
+++ generated/matmul_i4.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i4 (gfc_array_i4 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_i4);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i4_avx (gfc_array_i4 * const restrict retar
static void
matmul_i4_avx2 (gfc_array_i4 * const restrict retarray,
gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i4_avx2 (gfc_array_i4 * const restrict retarray,
gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i4_avx2;
goto tailcall;
Index: generated/matmul_i8.c
===================================================================
--- generated/matmul_i8.c (Revision 245760)
+++ generated/matmul_i8.c (Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i8 (gfc_array_i8 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_i8);
-
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i8_avx (gfc_array_i8 * const restrict retar
static void
matmul_i8_avx2 (gfc_array_i8 * const restrict retarray,
gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_i8_avx2 (gfc_array_i8 * const restrict retarray,
gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_i8_avx2;
goto tailcall;
Index: generated/matmul_r10.c
===================================================================
--- generated/matmul_r10.c (Revision 245760)
+++ generated/matmul_r10.c (Arbeitskopie)
@@ -74,13 +74,6 @@ extern void matmul_r10 (gfc_array_r10 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_r10);
-#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -632,7 +625,7 @@ matmul_r10_avx (gfc_array_r10 * const restrict ret
static void
matmul_r10_avx2 (gfc_array_r10 * const restrict retarray,
gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_r10_avx2 (gfc_array_r10 * const restrict retarray,
gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
@@ -2281,7 +2274,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_r10_avx2;
goto tailcall;
Index: generated/matmul_r16.c
===================================================================
--- generated/matmul_r16.c (Revision 245760)
+++ generated/matmul_r16.c (Arbeitskopie)
@@ -74,13 +74,6 @@ extern void matmul_r16 (gfc_array_r16 * const rest
int blas_limit, blas_call gemm);
export_proto(matmul_r16);
-#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -632,7 +625,7 @@ matmul_r16_avx (gfc_array_r16 * const restrict ret
static void
matmul_r16_avx2 (gfc_array_r16 * const restrict retarray,
gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_r16_avx2 (gfc_array_r16 * const restrict retarray,
gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
@@ -2281,7 +2274,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_r16_avx2;
goto tailcall;
Index: generated/matmul_r4.c
===================================================================
--- generated/matmul_r4.c (Revision 245760)
+++ generated/matmul_r4.c (Arbeitskopie)
@@ -74,13 +74,6 @@ extern void matmul_r4 (gfc_array_r4 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_r4);
-#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -632,7 +625,7 @@ matmul_r4_avx (gfc_array_r4 * const restrict retar
static void
matmul_r4_avx2 (gfc_array_r4 * const restrict retarray,
gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_r4_avx2 (gfc_array_r4 * const restrict retarray,
gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
@@ -2281,7 +2274,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_r4_avx2;
goto tailcall;
Index: generated/matmul_r8.c
===================================================================
--- generated/matmul_r8.c (Revision 245760)
+++ generated/matmul_r8.c (Arbeitskopie)
@@ -74,13 +74,6 @@ extern void matmul_r8 (gfc_array_r8 * const restri
int blas_limit, blas_call gemm);
export_proto(matmul_r8);
-#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif
-
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -632,7 +625,7 @@ matmul_r8_avx (gfc_array_r8 * const restrict retar
static void
matmul_r8_avx2 (gfc_array_r8 * const restrict retarray,
gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static void
matmul_r8_avx2 (gfc_array_r8 * const restrict retarray,
gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
@@ -2281,7 +2274,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_r8_avx2;
goto tailcall;
Index: m4/matmul.m4
===================================================================
--- m4/matmul.m4 (Revision 245760)
+++ m4/matmul.m4 (Arbeitskopie)
@@ -75,14 +75,6 @@ extern void matmul_'rtype_code` ('rtype` * const r
int blas_limit, blas_call gemm);
export_proto(matmul_'rtype_code`);
-'ifelse(rtype_letter,`r',dnl
-`#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2. Only generate
- an AVX2 function if we are dealing with integer. */
-#undef HAVE_AVX2
-#endif')
-`
-
/* Put exhaustive list of possible architectures here here, ORed together. */
#if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -101,7 +93,7 @@ static' include(matmul_internal.m4)dnl
`static void
'matmul_name` ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
- int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+ int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
static' include(matmul_internal.m4)dnl
`#endif /* HAVE_AVX2 */
@@ -147,7 +139,8 @@ void matmul_'rtype_code` ('rtype` * const restrict
#endif /* HAVE_AVX512F */
#ifdef HAVE_AVX2
- if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+ && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
matmul_p = matmul_'rtype_code`_avx2;
goto tailcall;
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 10:46 ` Thomas Koenig
@ 2017-03-02 10:48 ` Jakub Jelinek
2017-03-02 11:02 ` Jakub Jelinek
1 sibling, 0 replies; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 10:48 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches
On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
>
> OK for trunk?
>
> Regards
>
> Thomas
>
> 2017-03-01 Thomas Koenig <tkoenig@gcc.gnu.org>
>
> PR fortran/78379
> * m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
> reals. Add fma to target options.
> (matmul_'rtype_code`): Call AVX2 only if FMA is available.
> * generated/matmul_c10.c: Regenerated.
> * generated/matmul_c16.c: Regenerated.
> * generated/matmul_c4.c: Regenerated.
> * generated/matmul_c8.c: Regenerated.
> * generated/matmul_i1.c: Regenerated.
> * generated/matmul_i16.c: Regenerated.
> * generated/matmul_i2.c: Regenerated.
> * generated/matmul_i4.c: Regenerated.
> * generated/matmul_i8.c: Regenerated.
> * generated/matmul_r10.c: Regenerated.
> * generated/matmul_r16.c: Regenerated.
> * generated/matmul_r4.c: Regenerated.
> * generated/matmul_r8.c: Regenerated.
Ok, thanks.
Jakub
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 10:46 ` Thomas Koenig
2017-03-02 10:48 ` Jakub Jelinek
@ 2017-03-02 11:02 ` Jakub Jelinek
2017-03-02 11:57 ` Thomas Koenig
1 sibling, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 11:02 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches
On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
>
> OK for trunk?
>
> Regards
>
> Thomas
>
> 2017-03-01 Thomas Koenig <tkoenig@gcc.gnu.org>
>
> PR fortran/78379
> * m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
> reals. Add fma to target options.
> (matmul_'rtype_code`): Call AVX2 only if FMA is available.
> * generated/matmul_c10.c: Regenerated.
> * generated/matmul_c16.c: Regenerated.
> * generated/matmul_c4.c: Regenerated.
> * generated/matmul_c8.c: Regenerated.
> * generated/matmul_i1.c: Regenerated.
> * generated/matmul_i16.c: Regenerated.
> * generated/matmul_i2.c: Regenerated.
> * generated/matmul_i4.c: Regenerated.
> * generated/matmul_i8.c: Regenerated.
> * generated/matmul_r10.c: Regenerated.
> * generated/matmul_r16.c: Regenerated.
> * generated/matmul_r4.c: Regenerated.
> * generated/matmul_r8.c: Regenerated.
Actually, I see a problem, though not one related to this patch.
I bet e.g. tsan would complain heavily about the wrappers, because the code
is racy:
static void (*matmul_p) ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
if (matmul_p == NULL)
{
matmul_p = matmul_'rtype_code`_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
matmul_p = matmul_'rtype_code`_avx512f;
goto tailcall;
}
#endif /* HAVE_AVX512F */
...
}
tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
So, even assuming all the matmul_p = ... stores are atomic: if you call
matmul from two or more threads at about the same time for the first time,
it could be that the first one sets matmul_p to vanilla and then another
thread runs the (uselessly slow) vanilla version, etc.
As you don't care about the if (matmul_p == NULL) part being done in
multiple threads concurrently, I guess you could e.g. do:
static void (*matmul_p) ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
int blas_limit, blas_call gemm); // <--- No need for NULL initializer for static var
void (*matmul_fn) ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
int blas_limit, blas_call gemm);
matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_fn == NULL)
{
matmul_fn = matmul_'rtype_code`_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
matmul_fn = matmul_'rtype_code`_avx512f;
goto finish;
}
#endif /* HAVE_AVX512F */
...
finish:
__atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
(*matmul_fn) (retarray, a, b, try_blas, blas_limit, gemm);
(i.e. make sure you read matmul_p in each call exactly once and store at
most once per thread).
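(Restated as a self-contained sketch with made-up names, for readers
skimming the thread:)

typedef void (*impl_fn) (void);

extern void impl_vanilla (void);
extern void impl_avx512f (void);
extern int have_avx512f (void);

void
dispatch (void)
{
  static impl_fn resolved;  /* static, hence implicitly NULL */
  impl_fn fn = __atomic_load_n (&resolved, __ATOMIC_RELAXED);

  if (fn == NULL)
    {
      fn = have_avx512f () ? impl_avx512f : impl_vanilla;
      /* Concurrent first calls may all store, but they all store the
         same value, so a relaxed store suffices.  */
      __atomic_store_n (&resolved, fn, __ATOMIC_RELAXED);
    }

  (*fn) ();
}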
Jakub
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 11:02 ` Jakub Jelinek
@ 2017-03-02 11:57 ` Thomas Koenig
2017-03-02 12:02 ` Jakub Jelinek
0 siblings, 1 reply; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 11:57 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: fortran, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 1135 bytes --]
Hi Jakub,
> Actually, I see a problem, but not related to this patch.
> I bet e.g. tsan would complain heavily on the wrappers, because the code
> is racy:
Here is a patch implementing your suggestion. Tested at least to the
extent that all matmul test cases pass on my machine.
OK for trunk?
Regards
Thomas
2017-03-02 Thomas Koenig <tkoenig@gcc.gnu.org>
Jakub Jelinek <jakub@redhat.com>
* m4/matmul.m4 (matmul_'rtype_code`): Avoid race condition
when storing the function pointer.
* generated/matmul_c10.c: Regenerated.
* generated/matmul_c16.c: Regenerated.
* generated/matmul_c4.c: Regenerated.
* generated/matmul_c8.c: Regenerated.
* generated/matmul_i1.c: Regenerated.
* generated/matmul_i16.c: Regenerated.
* generated/matmul_i2.c: Regenerated.
* generated/matmul_i4.c: Regenerated.
* generated/matmul_i8.c: Regenerated.
* generated/matmul_r10.c: Regenerated.
* generated/matmul_r16.c: Regenerated.
* generated/matmul_r4.c: Regenerated.
* generated/matmul_r8.c: Regenerated.
[-- Attachment #2: p1-race.diff --]
[-- Type: text/x-patch, Size: 28758 bytes --]
Index: generated/matmul_c10.c
===================================================================
--- generated/matmul_c10.c (Revision 245836)
+++ generated/matmul_c10.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_c10 (gfc_array_c10 * const restrict re
gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_c10 * const restrict retarray,
+ gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_c10_vanilla;
+ matmul_fn = matmul_c10_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_c10_avx512f;
- goto tailcall;
+ matmul_fn = matmul_c10_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_c10_avx2;
- goto tailcall;
+ matmul_fn = matmul_c10_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_c10 (gfc_array_c10 * const restrict re
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_c10_avx;
- goto tailcall;
+ matmul_fn = matmul_c10_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_c16.c
===================================================================
--- generated/matmul_c16.c (Revision 245836)
+++ generated/matmul_c16.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_c16 (gfc_array_c16 * const restrict re
gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_c16 * const restrict retarray,
+ gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_c16_vanilla;
+ matmul_fn = matmul_c16_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_c16_avx512f;
- goto tailcall;
+ matmul_fn = matmul_c16_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_c16_avx2;
- goto tailcall;
+ matmul_fn = matmul_c16_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_c16 (gfc_array_c16 * const restrict re
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_c16_avx;
- goto tailcall;
+ matmul_fn = matmul_c16_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_c4.c
===================================================================
--- generated/matmul_c4.c (Revision 245836)
+++ generated/matmul_c4.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_c4 * const restrict retarray,
+ gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_c4_vanilla;
+ matmul_fn = matmul_c4_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_c4_avx512f;
- goto tailcall;
+ matmul_fn = matmul_c4_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_c4_avx2;
- goto tailcall;
+ matmul_fn = matmul_c4_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_c4_avx;
- goto tailcall;
+ matmul_fn = matmul_c4_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_c8.c
===================================================================
--- generated/matmul_c8.c (Revision 245836)
+++ generated/matmul_c8.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_c8 * const restrict retarray,
+ gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_c8_vanilla;
+ matmul_fn = matmul_c8_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_c8_avx512f;
- goto tailcall;
+ matmul_fn = matmul_c8_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_c8_avx2;
- goto tailcall;
+ matmul_fn = matmul_c8_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_c8_avx;
- goto tailcall;
+ matmul_fn = matmul_c8_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_i1.c
===================================================================
--- generated/matmul_i1.c (Revision 245836)
+++ generated/matmul_i1.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_i1 * const restrict retarray,
+ gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_i1_vanilla;
+ matmul_fn = matmul_i1_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_i1_avx512f;
- goto tailcall;
+ matmul_fn = matmul_i1_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_i1_avx2;
- goto tailcall;
+ matmul_fn = matmul_i1_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_i1_avx;
- goto tailcall;
+ matmul_fn = matmul_i1_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_i16.c
===================================================================
--- generated/matmul_i16.c (Revision 245836)
+++ generated/matmul_i16.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_i16 (gfc_array_i16 * const restrict re
gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_i16 * const restrict retarray,
+ gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_i16_vanilla;
+ matmul_fn = matmul_i16_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_i16_avx512f;
- goto tailcall;
+ matmul_fn = matmul_i16_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_i16_avx2;
- goto tailcall;
+ matmul_fn = matmul_i16_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_i16 (gfc_array_i16 * const restrict re
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_i16_avx;
- goto tailcall;
+ matmul_fn = matmul_i16_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_i2.c
===================================================================
--- generated/matmul_i2.c (Revision 245836)
+++ generated/matmul_i2.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_i2 (gfc_array_i2 * const restrict reta
gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_i2 * const restrict retarray,
+ gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_i2_vanilla;
+ matmul_fn = matmul_i2_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_i2_avx512f;
- goto tailcall;
+ matmul_fn = matmul_i2_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_i2_avx2;
- goto tailcall;
+ matmul_fn = matmul_i2_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_i2 (gfc_array_i2 * const restrict reta
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_i2_avx;
- goto tailcall;
+ matmul_fn = matmul_i2_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_i4.c
===================================================================
--- generated/matmul_i4.c (Revision 245836)
+++ generated/matmul_i4.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_i4 (gfc_array_i4 * const restrict reta
gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_i4 * const restrict retarray,
+ gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_i4_vanilla;
+ matmul_fn = matmul_i4_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_i4_avx512f;
- goto tailcall;
+ matmul_fn = matmul_i4_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_i4_avx2;
- goto tailcall;
+ matmul_fn = matmul_i4_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_i4 (gfc_array_i4 * const restrict reta
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_i4_avx;
- goto tailcall;
+ matmul_fn = matmul_i4_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_i8.c
===================================================================
--- generated/matmul_i8.c (Revision 245836)
+++ generated/matmul_i8.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_i8 (gfc_array_i8 * const restrict reta
gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_i8 * const restrict retarray,
+ gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_i8_vanilla;
+ matmul_fn = matmul_i8_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_i8_avx512f;
- goto tailcall;
+ matmul_fn = matmul_i8_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_i8_avx2;
- goto tailcall;
+ matmul_fn = matmul_i8_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_i8 (gfc_array_i8 * const restrict reta
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_i8_avx;
- goto tailcall;
+ matmul_fn = matmul_i8_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_r10.c
===================================================================
--- generated/matmul_r10.c (Revision 245836)
+++ generated/matmul_r10.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_r10 (gfc_array_r10 * const restrict re
gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_r10 * const restrict retarray,
+ gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_r10_vanilla;
+ matmul_fn = matmul_r10_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_r10_avx512f;
- goto tailcall;
+ matmul_fn = matmul_r10_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_r10_avx2;
- goto tailcall;
+ matmul_fn = matmul_r10_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_r10 (gfc_array_r10 * const restrict re
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_r10_avx;
- goto tailcall;
+ matmul_fn = matmul_r10_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_r16.c
===================================================================
--- generated/matmul_r16.c (Revision 245836)
+++ generated/matmul_r16.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_r16 (gfc_array_r16 * const restrict re
gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_r16 * const restrict retarray,
+ gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_r16_vanilla;
+ matmul_fn = matmul_r16_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_r16_avx512f;
- goto tailcall;
+ matmul_fn = matmul_r16_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_r16_avx2;
- goto tailcall;
+ matmul_fn = matmul_r16_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_r16 (gfc_array_r16 * const restrict re
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_r16_avx;
- goto tailcall;
+ matmul_fn = matmul_r16_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_r4.c
===================================================================
--- generated/matmul_r4.c (Revision 245836)
+++ generated/matmul_r4.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_r4 (gfc_array_r4 * const restrict reta
gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_r4 * const restrict retarray,
+ gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_r4_vanilla;
+ matmul_fn = matmul_r4_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_r4_avx512f;
- goto tailcall;
+ matmul_fn = matmul_r4_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_r4_avx2;
- goto tailcall;
+ matmul_fn = matmul_r4_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_r4 (gfc_array_r4 * const restrict reta
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_r4_avx;
- goto tailcall;
+ matmul_fn = matmul_r4_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: generated/matmul_r8.c
===================================================================
--- generated/matmul_r8.c (Revision 245836)
+++ generated/matmul_r8.c (Arbeitskopie)
@@ -2258,9 +2258,14 @@ void matmul_r8 (gfc_array_r8 * const restrict reta
gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) (gfc_array_r8 * const restrict retarray,
+ gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_r8_vanilla;
+ matmul_fn = matmul_r8_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -2267,8 +2272,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_r8_avx512f;
- goto tailcall;
+ matmul_fn = matmul_r8_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -2277,8 +2282,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_r8_avx2;
- goto tailcall;
+ matmul_fn = matmul_r8_avx2;
+ goto store;
}
#endif
@@ -2286,14 +2291,15 @@ void matmul_r8 (gfc_array_r8 * const restrict reta
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_r8_avx;
- goto tailcall;
+ matmul_fn = matmul_r8_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
Index: m4/matmul.m4
===================================================================
--- m4/matmul.m4 (Revision 245836)
+++ m4/matmul.m4 (Arbeitskopie)
@@ -123,9 +123,14 @@ void matmul_'rtype_code` ('rtype` * const restrict
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
int blas_limit, blas_call gemm) = NULL;
+ void (*matmul_fn) ('rtype` * const restrict retarray,
+ 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
+ int blas_limit, blas_call gemm) = NULL;
+
+ matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
if (matmul_p == NULL)
{
- matmul_p = matmul_'rtype_code`_vanilla;
+ matmul_fn = matmul_'rtype_code`_vanilla;
if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
{
/* Run down the available processors in order of preference. */
@@ -132,8 +137,8 @@ void matmul_'rtype_code` ('rtype` * const restrict
#ifdef HAVE_AVX512F
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
{
- matmul_p = matmul_'rtype_code`_avx512f;
- goto tailcall;
+ matmul_fn = matmul_'rtype_code`_avx512f;
+ goto store;
}
#endif /* HAVE_AVX512F */
@@ -142,8 +147,8 @@ void matmul_'rtype_code` ('rtype` * const restrict
if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
&& (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
{
- matmul_p = matmul_'rtype_code`_avx2;
- goto tailcall;
+ matmul_fn = matmul_'rtype_code`_avx2;
+ goto store;
}
#endif
@@ -151,14 +156,15 @@ void matmul_'rtype_code` ('rtype` * const restrict
#ifdef HAVE_AVX
if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
{
- matmul_p = matmul_'rtype_code`_avx;
- goto tailcall;
+ matmul_fn = matmul_'rtype_code`_avx;
+ goto store;
}
#endif /* HAVE_AVX */
}
+ store:
+ __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
}
-tailcall:
(*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
}
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 11:57 ` Thomas Koenig
@ 2017-03-02 12:02 ` Jakub Jelinek
2017-03-02 13:01 ` Thomas Koenig
0 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 12:02 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches
On Thu, Mar 02, 2017 at 12:57:05PM +0100, Thomas Koenig wrote:
> --- m4/matmul.m4 (Revision 245836)
> +++ m4/matmul.m4 (Arbeitskopie)
> @@ -123,9 +123,14 @@ void matmul_'rtype_code` ('rtype` * const restrict
> 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> int blas_limit, blas_call gemm) = NULL;
Please drop the " = NULL" here
> + void (*matmul_fn) ('rtype` * const restrict retarray,
> + 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> + int blas_limit, blas_call gemm) = NULL;
and here as well. The first one because static vars are zero initialized
by default, the latter because it makes no sense to initialize it and then
immediately overwrite it in the next stmt.
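(The C rule being invoked here, as a tiny sketch:)

#include <assert.h>

static int zeroed;  /* static storage duration: implicitly zero-initialized */

int
main (void)
{
  assert (zeroed == 0);  /* guaranteed by the language, no "= 0" needed */
  int n = 42;            /* a local: initialize it where it is first set */
  return n - 42;
}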
> +
> + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
> if (matmul_p == NULL)
This needs to test matmul_fn == NULL instead of matmul_p == NULL.
> @@ -151,14 +156,15 @@ void matmul_'rtype_code` ('rtype` * const restrict
> #ifdef HAVE_AVX
> if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
> {
> - matmul_p = matmul_'rtype_code`_avx;
> - goto tailcall;
> + matmul_fn = matmul_'rtype_code`_avx;
> + goto store;
> }
> #endif /* HAVE_AVX */
> }
> + store:
> + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
> }
>
> -tailcall:
> (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);
And this needs to use *matmul_fn instead of *matmul_p too.
The whole point is that matmul_p is only loaded using __atomic_load_n
and only optionally stored using __atomic_store_n.
Ok with those changes.
Jakub
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
2017-03-02 12:02 ` Jakub Jelinek
@ 2017-03-02 13:01 ` Thomas Koenig
0 siblings, 0 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 13:01 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: fortran, gcc-patches
Am 02.03.2017 um 13:02 schrieb Jakub Jelinek:
> And this needs to use *matmul_fn instead of *matmul_p too.
> The whole point is that matmul_p is only loaded using __atomic_load_n
> and only optionally stored using __atomic_store_n.
>
> Ok with those changes.
Thanks! Committed as
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=245839
Regards
Thomas
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2017-03-02 13:01 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
2017-03-02 3:22 ` Jerry DeLisle
2017-03-02 6:15 ` Thomas Koenig
2017-03-02 7:32 ` Janne Blomqvist
2017-03-02 7:50 ` Thomas Koenig
2017-03-02 8:09 ` Janne Blomqvist
2017-03-02 8:14 ` Richard Biener
2017-03-02 8:16 ` Jakub Jelinek
2017-03-02 8:43 ` Jakub Jelinek
2017-03-02 9:03 ` Thomas Koenig
2017-03-02 9:08 ` Jakub Jelinek
2017-03-02 10:46 ` Thomas Koenig
2017-03-02 10:48 ` Jakub Jelinek
2017-03-02 11:02 ` Jakub Jelinek
2017-03-02 11:57 ` Thomas Koenig
2017-03-02 12:02 ` Jakub Jelinek
2017-03-02 13:01 ` Thomas Koenig