* [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
@ 2017-03-01 21:00 Thomas Koenig
  2017-03-02  3:22 ` Jerry DeLisle
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-01 21:00 UTC (permalink / raw)
  To: fortran, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]

Hello world,

the attached patch enables FMA for the AVX2 and AVX512F variants of
matmul.  This should bring a very nice speedup (although I have been
unable to run benchmarks due to lack of a suitable machine).

Question: Is this still appropriate for the current state of trunk?
Or rather, OK for when gcc 8 opens (which might still be some time
in the future)?

2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/78379
	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate
	for reals.  Add fma to target options.
	(matmul_'rtype_code`_avx512f): Add fma to target options.
	(matmul_'rtype_code`): Call AVX2 and AVX512F only if FMA is
	available.
	* generated/matmul_c10.c: Regenerated.
	* generated/matmul_c16.c: Regenerated.
	* generated/matmul_c4.c: Regenerated.
	* generated/matmul_c8.c: Regenerated.
	* generated/matmul_i1.c: Regenerated.
	* generated/matmul_i16.c: Regenerated.
	* generated/matmul_i2.c: Regenerated.
	* generated/matmul_i4.c: Regenerated.
	* generated/matmul_i8.c: Regenerated.
	* generated/matmul_r10.c: Regenerated.
	* generated/matmul_r16.c: Regenerated.
	* generated/matmul_r4.c: Regenerated.
	* generated/matmul_r8.c: Regenerated.

Regards

	Thomas

[-- Attachment #2: p1-fma.diff --]
[-- Type: text/x-patch, Size: 2139 bytes --]

Index: m4/matmul.m4
===================================================================
--- m4/matmul.m4	(Revision 245760)
+++ m4/matmul.m4	(Arbeitskopie)
@@ -75,14 +75,6 @@
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_'rtype_code`);
 
-'ifelse(rtype_letter,`r',dnl
-`#if defined(HAVE_AVX) && defined(HAVE_AVX2)
-/* REAL types generate identical code for AVX and AVX2.  Only generate
-   an AVX2 function if we are dealing with integer.  */
-#undef HAVE_AVX2
-#endif')
-`
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -101,7 +93,7 @@
 `static void
 'matmul_name` ('rtype` * const restrict retarray,
 	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static' include(matmul_internal.m4)dnl
 `#endif /* HAVE_AVX2 */
 
@@ -110,7 +102,7 @@
 `static void
 'matmul_name` ('rtype` * const restrict retarray,
 	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
 static' include(matmul_internal.m4)dnl
 `#endif /* HAVE_AVX512F */
 
@@ -138,7 +130,9 @@
 {
   /* Run down the available processors in order of preference.  */
 #ifdef HAVE_AVX512F
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
+
     {
       matmul_p = matmul_'rtype_code`_avx512f;
       goto tailcall;
@@ -147,7 +141,8 @@
 #endif /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_'rtype_code`_avx2;
       goto tailcall;
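For readers not familiar with the dispatch scheme this patch extends, here is
a minimal standalone sketch of the same pattern -- a hypothetical axpy kernel
stands in for matmul, and the public __builtin_cpu_supports() builtin stands
in for libgcc's internal __cpu_model; the function names are invented for
illustration:

/* Sketch of libgfortran's runtime-dispatch pattern, reduced to a
   hypothetical axpy kernel.  Assumes GCC on x86-64; not the actual
   library code.  */

static void
axpy_avx2 (double * restrict y, const double * restrict x, double a, int n)
	__attribute__((__target__("avx2,fma")));

static void
axpy_avx2 (double * restrict y, const double * restrict x, double a, int n)
{
  /* With -O2, a * x[i] + y[i] can contract to vfmadd* instructions here,
     because this definition inherits the avx2,fma target attribute from
     the declaration above.  */
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

static void
axpy_generic (double * restrict y, const double * restrict x, double a, int n)
{
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

void
axpy (double * restrict y, const double * restrict x, double a, int n)
{
  static void (*axpy_p) (double * restrict, const double * restrict,
			 double, int) = 0;

  if (axpy_p == 0)
    {
      /* Require FMA in addition to AVX2, as the patch does: the AVX2
	 variant is compiled with FMA enabled, so selecting it on a CPU
	 without FMA could fault with an illegal instruction.  */
      if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
	axpy_p = axpy_avx2;
      else
	axpy_p = axpy_generic;
    }
  axpy_p (y, x, a, n);
}

Compiled with plain -O2, only axpy_avx2 is built with AVX2+FMA enabled; the
selection happens once at run time, which is how a single libgfortran binary
can serve both old and new CPUs.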
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
@ 2017-03-02  3:22 ` Jerry DeLisle
  2017-03-02  6:15   ` Thomas Koenig
  2017-03-02  7:32 ` Janne Blomqvist
  2017-03-02  8:43 ` Jakub Jelinek
  2 siblings, 1 reply; 17+ messages in thread
From: Jerry DeLisle @ 2017-03-02  3:22 UTC (permalink / raw)
  To: fortran; +Cc: GCC Patches

On 03/01/2017 01:00 PM, Thomas Koenig wrote:
> Hello world,
>
> the attached patch enables FMA for the AVX2 and AVX512F variants of
> matmul.  This should bring a very nice speedup (although I have
> been unable to run benchmarks due to lack of a suitable machine).
>
> Question: Is this still appropriate for the current state of trunk?
> Or rather, OK for when gcc 8 opens (which might still be some time
> in the future)?

I think it may be appropriate now, because you are making an adjustment
to the just-added new feature.

I would prefer that it be tested on the actual expected platform.  Does
anyone anywhere on this list have access to one of these machines to
test?

Jerry
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul 2017-03-02 3:22 ` Jerry DeLisle @ 2017-03-02 6:15 ` Thomas Koenig 0 siblings, 0 replies; 17+ messages in thread From: Thomas Koenig @ 2017-03-02 6:15 UTC (permalink / raw) To: Jerry DeLisle, fortran; +Cc: GCC Patches [-- Attachment #1: Type: text/plain, Size: 305 bytes --] Hi Jerry, > I would prefer that it was tested on the actual expected platform. Does > anyone anywhere on this list have access to one of these machines to test? If anybody wants to test who does not have --enable-maintainer-mode activated, here is a patch that works "out of the box". Regards Thomas [-- Attachment #2: p1-fma-total.diff --] [-- Type: text/x-patch, Size: 34073 bytes --] Index: generated/matmul_c10.c =================================================================== --- generated/matmul_c10.c (Revision 245760) +++ generated/matmul_c10.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_c10 (gfc_array_c10 * const rest int blas_limit, blas_call gemm); export_proto(matmul_c10); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_c10_avx (gfc_array_c10 * const restrict ret static void matmul_c10_avx2 (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_c10_avx2 (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_c10_avx2 (gfc_array_c10 * const restrict re static void matmul_c10_avx512f (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_c10_avx512f (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_c10 (gfc_array_c10 * const restrict re { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_c10_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_c10_avx2; goto tailcall; Index: generated/matmul_c16.c =================================================================== --- generated/matmul_c16.c (Revision 245760) +++ generated/matmul_c16.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_c16 (gfc_array_c16 * const rest int blas_limit, blas_call gemm); export_proto(matmul_c16); - - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_c16_avx (gfc_array_c16 * const restrict ret static void matmul_c16_avx2 (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_c16_avx2 (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_c16_avx2 (gfc_array_c16 * const restrict re static void matmul_c16_avx512f (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_c16_avx512f (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_c16 (gfc_array_c16 * const restrict re { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_c16_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_c16_avx2; goto tailcall; Index: generated/matmul_c4.c =================================================================== --- generated/matmul_c4.c (Revision 245760) +++ generated/matmul_c4.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_c4 (gfc_array_c4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_c4); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_c4_avx (gfc_array_c4 * const restrict retar static void matmul_c4_avx2 (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_c4_avx2 (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_c4_avx2 (gfc_array_c4 * const restrict reta static void matmul_c4_avx512f (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_c4_avx512f (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_c4 (gfc_array_c4 * const restrict reta { /* Run down the available processors in order of preference. 
*/ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_c4_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_c4_avx2; goto tailcall; Index: generated/matmul_c8.c =================================================================== --- generated/matmul_c8.c (Revision 245760) +++ generated/matmul_c8.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_c8 (gfc_array_c8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_c8); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_c8_avx (gfc_array_c8 * const restrict retar static void matmul_c8_avx2 (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_c8_avx2 (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_c8_avx2 (gfc_array_c8 * const restrict reta static void matmul_c8_avx512f (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_c8_avx512f (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_c8 (gfc_array_c8 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_c8_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_c8_avx2; goto tailcall; Index: generated/matmul_i1.c =================================================================== --- generated/matmul_i1.c (Revision 245760) +++ generated/matmul_i1.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i1 (gfc_array_i1 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i1); - - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i1_avx (gfc_array_i1 * const restrict retar static void matmul_i1_avx2 (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i1_avx2 (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i1_avx2 (gfc_array_i1 * const restrict reta static void matmul_i1_avx512f (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i1_avx512f (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i1 (gfc_array_i1 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i1_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i1_avx2; goto tailcall; Index: generated/matmul_i16.c =================================================================== --- generated/matmul_i16.c (Revision 245760) +++ generated/matmul_i16.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i16 (gfc_array_i16 * const rest int blas_limit, blas_call gemm); export_proto(matmul_i16); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i16_avx (gfc_array_i16 * const restrict ret static void matmul_i16_avx2 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i16_avx2 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i16_avx2 (gfc_array_i16 * const restrict re static void matmul_i16_avx512f (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i16_avx512f (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i16 (gfc_array_i16 * const restrict re { /* Run down the available processors in order of preference. 
*/ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i16_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i16_avx2; goto tailcall; Index: generated/matmul_i2.c =================================================================== --- generated/matmul_i2.c (Revision 245760) +++ generated/matmul_i2.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i2 (gfc_array_i2 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i2); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i2_avx (gfc_array_i2 * const restrict retar static void matmul_i2_avx2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i2_avx2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i2_avx2 (gfc_array_i2 * const restrict reta static void matmul_i2_avx512f (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i2_avx512f (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i2 (gfc_array_i2 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i2_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i2_avx2; goto tailcall; Index: generated/matmul_i4.c =================================================================== --- generated/matmul_i4.c (Revision 245760) +++ generated/matmul_i4.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i4 (gfc_array_i4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i4); - - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i4_avx (gfc_array_i4 * const restrict retar static void matmul_i4_avx2 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i4_avx2 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i4_avx2 (gfc_array_i4 * const restrict reta static void matmul_i4_avx512f (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i4_avx512f (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i4 (gfc_array_i4 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i4_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i4_avx2; goto tailcall; Index: generated/matmul_i8.c =================================================================== --- generated/matmul_i8.c (Revision 245760) +++ generated/matmul_i8.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i8 (gfc_array_i8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i8); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i8_avx (gfc_array_i8 * const restrict retar static void matmul_i8_avx2 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i8_avx2 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, @@ -1171,7 +1168,7 @@ matmul_i8_avx2 (gfc_array_i8 * const restrict reta static void matmul_i8_avx512f (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_i8_avx512f (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, @@ -2268,7 +2265,9 @@ void matmul_i8 (gfc_array_i8 * const restrict reta { /* Run down the available processors in order of preference. 
*/ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_i8_avx512f; goto tailcall; @@ -2277,7 +2276,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i8_avx2; goto tailcall; Index: generated/matmul_r10.c =================================================================== --- generated/matmul_r10.c (Revision 245760) +++ generated/matmul_r10.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r10 (gfc_array_r10 * const rest int blas_limit, blas_call gemm); export_proto(matmul_r10); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r10_avx (gfc_array_r10 * const restrict ret static void matmul_r10_avx2 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r10_avx2 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, @@ -1175,7 +1168,7 @@ matmul_r10_avx2 (gfc_array_r10 * const restrict re static void matmul_r10_avx512f (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_r10_avx512f (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, @@ -2272,7 +2265,9 @@ void matmul_r10 (gfc_array_r10 * const restrict re { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_r10_avx512f; goto tailcall; @@ -2281,7 +2276,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r10_avx2; goto tailcall; Index: generated/matmul_r16.c =================================================================== --- generated/matmul_r16.c (Revision 245760) +++ generated/matmul_r16.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r16 (gfc_array_r16 * const rest int blas_limit, blas_call gemm); export_proto(matmul_r16); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. 
*/ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r16_avx (gfc_array_r16 * const restrict ret static void matmul_r16_avx2 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r16_avx2 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, @@ -1175,7 +1168,7 @@ matmul_r16_avx2 (gfc_array_r16 * const restrict re static void matmul_r16_avx512f (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_r16_avx512f (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, @@ -2272,7 +2265,9 @@ void matmul_r16 (gfc_array_r16 * const restrict re { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_r16_avx512f; goto tailcall; @@ -2281,7 +2276,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r16_avx2; goto tailcall; Index: generated/matmul_r4.c =================================================================== --- generated/matmul_r4.c (Revision 245760) +++ generated/matmul_r4.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r4 (gfc_array_r4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_r4); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r4_avx (gfc_array_r4 * const restrict retar static void matmul_r4_avx2 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r4_avx2 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, @@ -1175,7 +1168,7 @@ matmul_r4_avx2 (gfc_array_r4 * const restrict reta static void matmul_r4_avx512f (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_r4_avx512f (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, @@ -2272,7 +2265,9 @@ void matmul_r4 (gfc_array_r4 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_r4_avx512f; goto tailcall; @@ -2281,7 +2276,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r4_avx2; goto tailcall; Index: generated/matmul_r8.c =================================================================== --- generated/matmul_r8.c (Revision 245760) +++ generated/matmul_r8.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r8 (gfc_array_r8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_r8); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r8_avx (gfc_array_r8 * const restrict retar static void matmul_r8_avx2 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r8_avx2 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, @@ -1175,7 +1168,7 @@ matmul_r8_avx2 (gfc_array_r8 * const restrict reta static void matmul_r8_avx512f (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static void matmul_r8_avx512f (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, @@ -2272,7 +2265,9 @@ void matmul_r8 (gfc_array_r8 * const restrict reta { /* Run down the available processors in order of preference. */ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_r8_avx512f; goto tailcall; @@ -2281,7 +2276,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r8_avx2; goto tailcall; Index: m4/matmul.m4 =================================================================== --- m4/matmul.m4 (Revision 245760) +++ m4/matmul.m4 (Arbeitskopie) @@ -75,14 +75,6 @@ extern void matmul_'rtype_code` ('rtype` * const r int blas_limit, blas_call gemm); export_proto(matmul_'rtype_code`); -'ifelse(rtype_letter,`r',dnl -`#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif') -` - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -101,7 +93,7 @@ static' include(matmul_internal.m4)dnl `static void 'matmul_name` ('rtype` * const restrict retarray, 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static' include(matmul_internal.m4)dnl `#endif /* HAVE_AVX2 */ @@ -110,7 +102,7 @@ static' include(matmul_internal.m4)dnl `static void 'matmul_name` ('rtype` * const restrict retarray, 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx512f"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma"))); static' include(matmul_internal.m4)dnl `#endif /* HAVE_AVX512F */ @@ -138,7 +130,9 @@ void matmul_'rtype_code` ('rtype` * const restrict { /* Run down the available processors in order of preference. 
*/ #ifdef HAVE_AVX512F - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) + { matmul_p = matmul_'rtype_code`_avx512f; goto tailcall; @@ -147,7 +141,8 @@ void matmul_'rtype_code` ('rtype` * const restrict #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_'rtype_code`_avx2; goto tailcall;
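Independent of rebuilding libgfortran, a few lines of C mirroring the feature
tests of the patch as posted show which variant the dispatcher would select on
a given test machine.  This is an illustrative sketch, not part of the patch;
it uses the public __builtin_cpu_supports() rather than __cpu_model, and the
variant names merely echo the function suffixes:

/* Print the matmul variant that the patched dispatch logic would pick
   on the machine this runs on.  Illustrative only.  Build with a GCC
   recent enough to know the "avx512f" feature string.  */
#include <stdio.h>

int
main (void)
{
  const char *variant = "generic";

  if (__builtin_cpu_supports ("avx512f") && __builtin_cpu_supports ("fma"))
    variant = "avx512f";
  else if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
    variant = "avx2";
  else if (__builtin_cpu_supports ("avx"))
    variant = "avx";

  printf ("matmul variant: %s\n", variant);
  return 0;
}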
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
  2017-03-02  3:22 ` Jerry DeLisle
@ 2017-03-02  7:32 ` Janne Blomqvist
  2017-03-02  7:50   ` Thomas Koenig
  2017-03-02  8:43 ` Jakub Jelinek
  2 siblings, 1 reply; 17+ messages in thread
From: Janne Blomqvist @ 2017-03-02  7:32 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> Hello world,
>
> the attached patch enables FMA for the AVX2 and AVX512F variants of
> matmul.  This should bring a very nice speedup (although I have
> been unable to run benchmarks due to lack of a suitable machine).

In lieu of benchmarks, have you looked at the generated asm to verify
that fma is actually used?

> Question: Is this still appropriate for the current state of trunk?

Yes, looks pretty safe.

-- 
Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  7:32 ` Janne Blomqvist
@ 2017-03-02  7:50   ` Thomas Koenig
  2017-03-02  8:09     ` Janne Blomqvist
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02  7:50 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: fortran, gcc-patches

On 02.03.2017 at 08:32, Janne Blomqvist wrote:
> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de> wrote:
>> Hello world,
>>
>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>> matmul.  This should bring a very nice speedup (although I have
>> been unable to run benchmarks due to lack of a suitable machine).
>
> In lieu of benchmarks, have you looked at the generated asm to verify
> that fma is actually used?

Yes, I did.

Here's something from the new matmul_r8_avx2:

    156c:  c4 62 e5 b8 fd          vfmadd231pd %ymm5,%ymm3,%ymm15
    1571:  c4 c1 79 10 04 06       vmovupd (%r14,%rax,1),%xmm0
    1577:  c4 62 dd b8 db          vfmadd231pd %ymm3,%ymm4,%ymm11
    157c:  c4 c3 7d 18 44 06 10    vinsertf128 $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
    1583:  01
    1584:  c4 62 ed b8 ed          vfmadd231pd %ymm5,%ymm2,%ymm13
    1589:  c4 e2 ed b8 fc          vfmadd231pd %ymm4,%ymm2,%ymm7
    158e:  c4 e2 fd a8 ad 30 ff    vfmadd213pd -0x800d0(%rbp),%ymm0,%ymm5

... and here from matmul_r8_avx512f:

    1da8:  c4 a1 7b 10 14 d6       vmovsd (%rsi,%r10,8),%xmm2
    1dae:  c4 c2 b1 b9 f0          vfmadd231sd %xmm8,%xmm9,%xmm6
    1db3:  62 62 ed 08 b9 e5       vfmadd231sd %xmm5,%xmm2,%xmm28
    1db9:  62 62 ed 08 b9 ec       vfmadd231sd %xmm4,%xmm2,%xmm29
    1dbf:  62 62 ed 08 b9 f3       vfmadd231sd %xmm3,%xmm2,%xmm30
    1dc5:  c4 e2 91 99 e8          vfmadd132sd %xmm0,%xmm13,%xmm5
    1dca:  c4 e2 99 99 e0          vfmadd132sd %xmm0,%xmm12,%xmm4
    1dcf:  c4 e2 a1 99 d8          vfmadd132sd %xmm0,%xmm11,%xmm3
    1dd4:  c4 c2 a9 99 d1          vfmadd132sd %xmm9,%xmm10,%xmm2
    1dd9:  c4 c2 89 99 c1          vfmadd132sd %xmm9,%xmm14,%xmm0
    1dde:  0f 8e d3 fe ff ff       jle 1cb7 <matmul_r8_avx512f+0x1cb7>

... so this is looking pretty good.

Regards

	Thomas
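For reference, vfmadd231pd multiplies two packed-double operands and
accumulates into the destination with a single rounding; in the AT&T syntax
above, vfmadd231pd %ymm5,%ymm3,%ymm15 computes ymm15 = ymm3*ymm5 + ymm15 on
four doubles at once.  A rough intrinsics equivalent of one such step
(assuming <immintrin.h> and compilation with, e.g., gcc -O2 -mfma; the
function name is invented) is:

#include <immintrin.h>

/* One FMA step: acc = a * b + acc, fused into a single instruction
   with a single rounding.  */
__m256d
fma_step (__m256d acc, __m256d a, __m256d b)
{
  return _mm256_fmadd_pd (a, b, acc);
}

The single rounding is also why FMA can change results slightly compared to a
separate multiply and add.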
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  7:50 ` Thomas Koenig
@ 2017-03-02  8:09   ` Janne Blomqvist
  2017-03-02  8:14     ` Richard Biener
  2017-03-02  8:16     ` Jakub Jelinek
  0 siblings, 2 replies; 17+ messages in thread
From: Janne Blomqvist @ 2017-03-02  8:09 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 2, 2017 at 9:50 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> On 02.03.2017 at 08:32, Janne Blomqvist wrote:
>>
>> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de>
>> wrote:
>>>
>>> Hello world,
>>>
>>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>>> matmul.  This should bring a very nice speedup (although I have
>>> been unable to run benchmarks due to lack of a suitable machine).
>>
>>
>> In lieu of benchmarks, have you looked at the generated asm to verify
>> that fma is actually used?
>
>
> Yes, I did.
>
> Here's something from the new matmul_r8_avx2:
>
>     156c:  c4 62 e5 b8 fd          vfmadd231pd %ymm5,%ymm3,%ymm15
>     1571:  c4 c1 79 10 04 06       vmovupd (%r14,%rax,1),%xmm0
>     1577:  c4 62 dd b8 db          vfmadd231pd %ymm3,%ymm4,%ymm11
>     157c:  c4 c3 7d 18 44 06 10    vinsertf128
> $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
>     1583:  01
>     1584:  c4 62 ed b8 ed          vfmadd231pd %ymm5,%ymm2,%ymm13
>     1589:  c4 e2 ed b8 fc          vfmadd231pd %ymm4,%ymm2,%ymm7
>     158e:  c4 e2 fd a8 ad 30 ff    vfmadd213pd
> -0x800d0(%rbp),%ymm0,%ymm5

Great, looks good!

> ... and here from matmul_r8_avx512f:
>
>     1da8:  c4 a1 7b 10 14 d6       vmovsd (%rsi,%r10,8),%xmm2
>     1dae:  c4 c2 b1 b9 f0          vfmadd231sd %xmm8,%xmm9,%xmm6
>     1db3:  62 62 ed 08 b9 e5       vfmadd231sd %xmm5,%xmm2,%xmm28
>     1db9:  62 62 ed 08 b9 ec       vfmadd231sd %xmm4,%xmm2,%xmm29
>     1dbf:  62 62 ed 08 b9 f3       vfmadd231sd %xmm3,%xmm2,%xmm30
>     1dc5:  c4 e2 91 99 e8          vfmadd132sd %xmm0,%xmm13,%xmm5
>     1dca:  c4 e2 99 99 e0          vfmadd132sd %xmm0,%xmm12,%xmm4
>     1dcf:  c4 e2 a1 99 d8          vfmadd132sd %xmm0,%xmm11,%xmm3
>     1dd4:  c4 c2 a9 99 d1          vfmadd132sd %xmm9,%xmm10,%xmm2
>     1dd9:  c4 c2 89 99 c1          vfmadd132sd %xmm9,%xmm14,%xmm0
>     1dde:  0f 8e d3 fe ff ff       jle 1cb7
> <matmul_r8_avx512f+0x1cb7>

Good, it's using fma, but why is this using xmm registers? That would
mean it's operating only on 128 bit blocks at a time so no better than
plain AVX. AFAIU avx512 should use zmm registers to operate on 512 bit
chunks.

I guess this is not due to your patch, but some other issue.

-- 
Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  8:09 ` Janne Blomqvist
@ 2017-03-02  8:14   ` Richard Biener
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Biener @ 2017-03-02  8:14 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: Thomas Koenig, fortran, gcc-patches

On Thu, Mar 2, 2017 at 9:09 AM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> On Thu, Mar 2, 2017 at 9:50 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
>> On 02.03.2017 at 08:32, Janne Blomqvist wrote:
>>>
>>> On Wed, Mar 1, 2017 at 11:00 PM, Thomas Koenig <tkoenig@netcologne.de>
>>> wrote:
>>>>
>>>> Hello world,
>>>>
>>>> the attached patch enables FMA for the AVX2 and AVX512F variants of
>>>> matmul.  This should bring a very nice speedup (although I have
>>>> been unable to run benchmarks due to lack of a suitable machine).
>>>
>>>
>>> In lieu of benchmarks, have you looked at the generated asm to verify
>>> that fma is actually used?
>>
>>
>> Yes, I did.
>>
>> Here's something from the new matmul_r8_avx2:
>>
>>     156c:  c4 62 e5 b8 fd          vfmadd231pd %ymm5,%ymm3,%ymm15
>>     1571:  c4 c1 79 10 04 06       vmovupd (%r14,%rax,1),%xmm0
>>     1577:  c4 62 dd b8 db          vfmadd231pd %ymm3,%ymm4,%ymm11
>>     157c:  c4 c3 7d 18 44 06 10    vinsertf128
>> $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
>>     1583:  01
>>     1584:  c4 62 ed b8 ed          vfmadd231pd %ymm5,%ymm2,%ymm13
>>     1589:  c4 e2 ed b8 fc          vfmadd231pd %ymm4,%ymm2,%ymm7
>>     158e:  c4 e2 fd a8 ad 30 ff    vfmadd213pd
>> -0x800d0(%rbp),%ymm0,%ymm5
>
> Great, looks good!
>
>> ... and here from matmul_r8_avx512f:
>>
>>     1da8:  c4 a1 7b 10 14 d6       vmovsd (%rsi,%r10,8),%xmm2
>>     1dae:  c4 c2 b1 b9 f0          vfmadd231sd %xmm8,%xmm9,%xmm6
>>     1db3:  62 62 ed 08 b9 e5       vfmadd231sd %xmm5,%xmm2,%xmm28
>>     1db9:  62 62 ed 08 b9 ec       vfmadd231sd %xmm4,%xmm2,%xmm29
>>     1dbf:  62 62 ed 08 b9 f3       vfmadd231sd %xmm3,%xmm2,%xmm30
>>     1dc5:  c4 e2 91 99 e8          vfmadd132sd %xmm0,%xmm13,%xmm5
>>     1dca:  c4 e2 99 99 e0          vfmadd132sd %xmm0,%xmm12,%xmm4
>>     1dcf:  c4 e2 a1 99 d8          vfmadd132sd %xmm0,%xmm11,%xmm3
>>     1dd4:  c4 c2 a9 99 d1          vfmadd132sd %xmm9,%xmm10,%xmm2
>>     1dd9:  c4 c2 89 99 c1          vfmadd132sd %xmm9,%xmm14,%xmm0
>>     1dde:  0f 8e d3 fe ff ff       jle 1cb7
>> <matmul_r8_avx512f+0x1cb7>
>
> Good, it's using fma, but why is this using xmm registers? That would
> mean it's operating only on 128 bit blocks at a time so no better than
> plain AVX. AFAIU avx512 should use zmm registers to operate on 512 bit
> chunks.
>
> I guess this is not due to your patch, but some other issue.

The question is, was it using %zmm before the patch?

> -- 
> Janne Blomqvist
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  8:09 ` Janne Blomqvist
  2017-03-02  8:14   ` Richard Biener
@ 2017-03-02  8:16   ` Jakub Jelinek
  1 sibling, 0 replies; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02  8:16 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: Thomas Koenig, fortran, gcc-patches

On Thu, Mar 02, 2017 at 10:09:31AM +0200, Janne Blomqvist wrote:
> > Here's something from the new matmul_r8_avx2:
> >
> >     156c:  c4 62 e5 b8 fd          vfmadd231pd %ymm5,%ymm3,%ymm15
> >     1571:  c4 c1 79 10 04 06       vmovupd (%r14,%rax,1),%xmm0
> >     1577:  c4 62 dd b8 db          vfmadd231pd %ymm3,%ymm4,%ymm11
> >     157c:  c4 c3 7d 18 44 06 10    vinsertf128
> > $0x1,0x10(%r14,%rax,1),%ymm0,%ymm0
> >     1583:  01
> >     1584:  c4 62 ed b8 ed          vfmadd231pd %ymm5,%ymm2,%ymm13
> >     1589:  c4 e2 ed b8 fc          vfmadd231pd %ymm4,%ymm2,%ymm7
> >     158e:  c4 e2 fd a8 ad 30 ff    vfmadd213pd
> > -0x800d0(%rbp),%ymm0,%ymm5
> 
> Great, looks good!
> 
> > ... and here from matmul_r8_avx512f:
> >
> >     1da8:  c4 a1 7b 10 14 d6       vmovsd (%rsi,%r10,8),%xmm2
> >     1dae:  c4 c2 b1 b9 f0          vfmadd231sd %xmm8,%xmm9,%xmm6
> >     1db3:  62 62 ed 08 b9 e5       vfmadd231sd %xmm5,%xmm2,%xmm28
> >     1db9:  62 62 ed 08 b9 ec       vfmadd231sd %xmm4,%xmm2,%xmm29
> >     1dbf:  62 62 ed 08 b9 f3       vfmadd231sd %xmm3,%xmm2,%xmm30
> >     1dc5:  c4 e2 91 99 e8          vfmadd132sd %xmm0,%xmm13,%xmm5
> >     1dca:  c4 e2 99 99 e0          vfmadd132sd %xmm0,%xmm12,%xmm4
> >     1dcf:  c4 e2 a1 99 d8          vfmadd132sd %xmm0,%xmm11,%xmm3
> >     1dd4:  c4 c2 a9 99 d1          vfmadd132sd %xmm9,%xmm10,%xmm2
> >     1dd9:  c4 c2 89 99 c1          vfmadd132sd %xmm9,%xmm14,%xmm0
> >     1dde:  0f 8e d3 fe ff ff       jle 1cb7
> > <matmul_r8_avx512f+0x1cb7>
> 
> Good, it's using fma, but why is this using xmm registers? That would
> mean it's operating only on 128 bit blocks at a time so no better than
> plain AVX. AFAIU avx512 should use zmm registers to operate on 512 bit
> chunks.

Well, it uses sd, i.e. the scalar fma, not pd, so those are always xmm
regs and only a single double in them, this must be some scalar epilogue
loop or whatever; but matmul_r8_avx512f also has:
    140c:  62 72 e5 40 98 c1       vfmadd132pd %zmm1,%zmm19,%zmm8
    1412:  62 72 e5 40 98 cd       vfmadd132pd %zmm5,%zmm19,%zmm9
    1418:  62 72 e5 40 98 d1       vfmadd132pd %zmm1,%zmm19,%zmm10
    141e:  62 72 e5 40 98 de       vfmadd132pd %zmm6,%zmm19,%zmm11
    1424:  62 72 e5 40 98 e1       vfmadd132pd %zmm1,%zmm19,%zmm12
    142a:  62 e2 e5 40 98 c6       vfmadd132pd %zmm6,%zmm19,%zmm16
    1430:  62 f2 e5 40 98 c8       vfmadd132pd %zmm0,%zmm19,%zmm1
    1436:  62 f2 e5 40 98 f0       vfmadd132pd %zmm0,%zmm19,%zmm6
    143c:  62 72 e5 40 98 fd       vfmadd132pd %zmm5,%zmm19,%zmm15
    1442:  62 72 e5 40 98 f4       vfmadd132pd %zmm4,%zmm19,%zmm14
    1448:  62 72 e5 40 98 eb       vfmadd132pd %zmm3,%zmm19,%zmm13
    144e:  62 f2 e5 40 98 d0       vfmadd132pd %zmm0,%zmm19,%zmm2
    1454:  62 b2 e5 40 98 ec       vfmadd132pd %zmm20,%zmm19,%zmm5
    145a:  62 b2 e5 40 98 e4       vfmadd132pd %zmm20,%zmm19,%zmm4
    1460:  62 b2 e5 40 98 dc       vfmadd132pd %zmm20,%zmm19,%zmm3
    1466:  62 b2 e5 40 98 c4       vfmadd132pd %zmm20,%zmm19,%zmm0
etc. where 8 doubles in zmm regs are processed together.

	Jakub
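The mix of packed and scalar FMAs Jakub describes is the usual vectorizer
output: a main loop processing full vectors plus a scalar epilogue for the
leftover iterations.  Schematically, with a hypothetical dot product rather
than the actual libgfortran loop:

/* Why both vfmadd...pd on %zmm and vfmadd...sd on %xmm can appear in
   one function: under e.g. -O3 -mavx512f -ffast-math (the reduction
   needs FP reassociation to vectorize), the compiler turns the loop
   below into a packed-FMA main loop handling 8 doubles per iteration,
   plus a compiler-generated scalar epilogue for the n % 8 tail, which
   uses the scalar vfmadd...sd forms.  Hypothetical example only.  */
double
dot (const double *a, const double *b, long n)
{
  double s = 0.0;
  for (long i = 0; i < n; i++)
    s += a[i] * b[i];
  return s;
}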
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
  2017-03-02  3:22 ` Jerry DeLisle
  2017-03-02  7:32 ` Janne Blomqvist
@ 2017-03-02  8:43 ` Jakub Jelinek
  2017-03-02  9:03   ` Thomas Koenig
  2 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02  8:43 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
> @@ -101,7 +93,7 @@
>  `static void
>  'matmul_name` ('rtype` * const restrict retarray,
>  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> -	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
> +	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
>  static' include(matmul_internal.m4)dnl
>  `#endif /* HAVE_AVX2 */
> 

I guess the question here is if there are any CPUs that have AVX2 but
don't have FMA3.  If there are none, then this is not controversial; if
there are some, it depends on how widely they are used compared to ones
that have both AVX2 and FMA3.  Going just from our -march= bitsets, it
seems that if there is PTA_AVX2, then there is also PTA_FMA: haswell,
broadwell, skylake, skylake-avx512, knl, bdver4, znver1.  There are CPUs
that have just PTA_AVX and not PTA_AVX2 and still have PTA_FMA: bdver2,
bdver3 (but that is not relevant to this patch).

> @@ -110,7 +102,7 @@
>  `static void
>  'matmul_name` ('rtype` * const restrict retarray,
>  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> -	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
> +	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
>  static' include(matmul_internal.m4)dnl
>  `#endif /* HAVE_AVX512F */
> 

I think this change is not needed, because the EVEX encoded
VFMADD???[SP][DS] instructions etc. are in AVX512F ISA, not in FMA3 ISA
(which has just the VEX encoded ones).
Which is why I'm seeing the fmas in my libgfortran even without your patch.
Thus I think you should remove this from your patch.

> @@ -147,7 +141,8 @@
>  #endif /* HAVE_AVX512F */
> 
>  #ifdef HAVE_AVX2
> -  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> +  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> +      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
>      {
>        matmul_p = matmul_'rtype_code`_avx2;
>        goto tailcall;

and this too.

	Jakub
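Jakub's point about the EVEX encodings can be checked with a few lines of C:
even without "fma" in the target string, GCC should contract the multiply-add
below into an FMA instruction, because the EVEX-encoded forms belong to
AVX512F itself rather than to the VEX-only FMA3 extension.  A sketch, to be
verified by inspecting the output of gcc -O2 -S (the function name is
invented; FP contraction is on by default in GNU C):

/* No "fma" in the target attribute, yet a * b + c is expected to
   contract to an EVEX-encoded vfmadd...sd, since those encodings are
   part of the AVX512F ISA.  */
__attribute__((__target__("avx512f"))) double
fused (double a, double b, double c)
{
  return a * b + c;
}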
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  8:43 ` Jakub Jelinek
@ 2017-03-02  9:03   ` Thomas Koenig
  2017-03-02  9:08     ` Jakub Jelinek
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02  9:03 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: fortran, gcc-patches

On 02.03.2017 at 09:43, Jakub Jelinek wrote:
> On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
>> @@ -101,7 +93,7 @@
>>  `static void
>>  'matmul_name` ('rtype` * const restrict retarray,
>>  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
>> -	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
>> +	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
>>  static' include(matmul_internal.m4)dnl
>>  `#endif /* HAVE_AVX2 */
>>
>
> I guess the question here is if there are any CPUs that have AVX2 but don't
> have FMA3.  If there are none, then this is not controversial, if there are
> some, it depends on how widely they are used compared to ones that have both
> AVX2 and FMA3.  Going just from our -march= bitsets, it seems if there is
> PTA_AVX2, then there is also PTA_FMA: haswell, broadwell, skylake, skylake-avx512, knl,
> bdver4, znver1, there are CPUs that have just PTA_AVX and not PTA_AVX2 and
> still have PTA_FMA: bdver2, bdver3 (but that is not relevant to this patch).

In a previous incantation of the patch, I saw that the compiler
generated the same floating point code for AVX and AVX2 (which is why
there currently is no AVX2 floating point version).  I could also
generate an AVX+FMA version for floating point and an AVX2 version
for integer (if anybody cares about integer matmul).

Or I could just leave it as it is.

>> @@ -110,7 +102,7 @@
>>  `static void
>>  'matmul_name` ('rtype` * const restrict retarray,
>>  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
>> -	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f")));
>> +	int blas_limit, blas_call gemm) __attribute__((__target__("avx512f,fma")));
>>  static' include(matmul_internal.m4)dnl
>>  `#endif /* HAVE_AVX512F */
>>
>
> I think this change is not needed, because the EVEX encoded
> VFMADD???[SP][DS] instructions etc. are in AVX512F ISA, not in FMA3 ISA
> (which has just the VEX encoded ones).
> Which is why I'm seeing the fmas in my libgfortran even without your patch.
> Thus I think you should remove this from your patch.

OK, I'll remove it.

>> @@ -147,7 +141,8 @@
>>  #endif /* HAVE_AVX512F */
>>
>>  #ifdef HAVE_AVX2
>> -  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
>> +  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
>> +      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
>>      {
>>        matmul_p = matmul_'rtype_code`_avx2;
>>        goto tailcall;
>
> and this too.

Will do.
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  9:03 ` Thomas Koenig
@ 2017-03-02  9:08   ` Jakub Jelinek
  2017-03-02 10:46     ` Thomas Koenig
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02  9:08 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 02, 2017 at 10:03:28AM +0100, Thomas Koenig wrote:
> On 02.03.2017 at 09:43, Jakub Jelinek wrote:
> > On Wed, Mar 01, 2017 at 10:00:08PM +0100, Thomas Koenig wrote:
> > > @@ -101,7 +93,7 @@
> > >  `static void
> > >  'matmul_name` ('rtype` * const restrict retarray,
> > >  	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> > > -	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
> > > +	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
> > >  static' include(matmul_internal.m4)dnl
> > >  `#endif /* HAVE_AVX2 */
> > > 
> > 
> > I guess the question here is if there are any CPUs that have AVX2 but don't
> > have FMA3.  If there are none, then this is not controversial, if there are
> > some, it depends on how widely they are used compared to ones that have both
> > AVX2 and FMA3.  Going just from our -march= bitsets, it seems if there is
> > PTA_AVX2, then there is also PTA_FMA: haswell, broadwell, skylake, skylake-avx512, knl,
> > bdver4, znver1, there are CPUs that have just PTA_AVX and not PTA_AVX2 and
> > still have PTA_FMA: bdver2, bdver3 (but that is not relevant to this patch).
> 
> In a previous incantation of the patch, I saw that the compiler
> generated the same floating point code for AVX and AVX2 (which is why
> there currently is no AVX2 floating point version).  I could also
> generate an AVX+FMA version for floating point and an AVX2 version
> for integer (if anybody cares about integer matmul).

I think having another avx,fma version is not worth it, avx+fma is far
less common than avx without fma.

> > > @@ -147,7 +141,8 @@
> > >  #endif /* HAVE_AVX512F */
> > > 
> > >  #ifdef HAVE_AVX2
> > > -  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> > > +  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
> > > +      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
> > >      {
> > >        matmul_p = matmul_'rtype_code`_avx2;
> > >        goto tailcall;
> > 
> > and this too.
> 
> Will do.

Note I meant obviously the FEATURE_AVX512F related hunk, not this one,
sorry.

	Jakub
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02  9:08 ` Jakub Jelinek
@ 2017-03-02 10:46   ` Thomas Koenig
  2017-03-02 10:48     ` Jakub Jelinek
  2017-03-02 11:02     ` Jakub Jelinek
  0 siblings, 2 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 10:46 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: fortran, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

Here's the updated version, which just uses FMA for AVX2.

OK for trunk?

Regards

	Thomas

2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/78379
	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate
	for reals.  Add fma to target options.
	(matmul_'rtype_code`): Call AVX2 only if FMA is available.
	* generated/matmul_c10.c: Regenerated.
	* generated/matmul_c16.c: Regenerated.
	* generated/matmul_c4.c: Regenerated.
	* generated/matmul_c8.c: Regenerated.
	* generated/matmul_i1.c: Regenerated.
	* generated/matmul_i16.c: Regenerated.
	* generated/matmul_i2.c: Regenerated.
	* generated/matmul_i4.c: Regenerated.
	* generated/matmul_i8.c: Regenerated.
	* generated/matmul_r10.c: Regenerated.
	* generated/matmul_r16.c: Regenerated.
	* generated/matmul_r4.c: Regenerated.
	* generated/matmul_r8.c: Regenerated.

[-- Attachment #2: p2-fma.diff --]
[-- Type: text/x-patch, Size: 20305 bytes --]

Index: generated/matmul_c10.c
===================================================================
--- generated/matmul_c10.c	(Revision 245760)
+++ generated/matmul_c10.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c10 (gfc_array_c10 * const rest
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_c10);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c10_avx (gfc_array_c10 * const restrict ret
 static void
 matmul_c10_avx2 (gfc_array_c10 * const restrict retarray,
 	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_c10_avx2 (gfc_array_c10 * const restrict retarray,
 	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_c10_avx2;
       goto tailcall;
Index: generated/matmul_c16.c
===================================================================
--- generated/matmul_c16.c	(Revision 245760)
+++ generated/matmul_c16.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c16 (gfc_array_c16 * const rest
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_c16);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c16_avx (gfc_array_c16 * const restrict ret
 static void
 matmul_c16_avx2 (gfc_array_c16 * const restrict retarray,
 	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_c16_avx2 (gfc_array_c16 * const restrict retarray,
 	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_c16_avx2;
       goto tailcall;
Index: generated/matmul_c4.c
===================================================================
--- generated/matmul_c4.c	(Revision 245760)
+++ generated/matmul_c4.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c4 (gfc_array_c4 * const restri
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_c4);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c4_avx (gfc_array_c4 * const restrict retar
 static void
 matmul_c4_avx2 (gfc_array_c4 * const restrict retarray,
 	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_c4_avx2 (gfc_array_c4 * const restrict retarray,
 	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_c4_avx2;
       goto tailcall;
Index: generated/matmul_c8.c
===================================================================
--- generated/matmul_c8.c	(Revision 245760)
+++ generated/matmul_c8.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_c8 (gfc_array_c8 * const restri
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_c8);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_c8_avx (gfc_array_c8 * const restrict retar
 static void
 matmul_c8_avx2 (gfc_array_c8 * const restrict retarray,
 	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_c8_avx2 (gfc_array_c8 * const restrict retarray,
 	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_c8_avx2;
       goto tailcall;
Index: generated/matmul_i1.c
===================================================================
--- generated/matmul_i1.c	(Revision 245760)
+++ generated/matmul_i1.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i1 (gfc_array_i1 * const restri
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_i1);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.  */
 
 #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
@@ -628,7 +625,7 @@ matmul_i1_avx (gfc_array_i1 * const restrict retar
 static void
 matmul_i1_avx2 (gfc_array_i1 * const restrict retarray,
 	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
-	int blas_limit, blas_call gemm) __attribute__((__target__("avx2")));
+	int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma")));
 static void
 matmul_i1_avx2 (gfc_array_i1 * const restrict retarray,
 	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
@@ -2277,7 +2274,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta
 #endif  /* HAVE_AVX512F */
 
 #ifdef HAVE_AVX2
-  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+  if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2))
+      && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA)))
     {
       matmul_p = matmul_i1_avx2;
       goto tailcall;
Index: generated/matmul_i16.c
===================================================================
--- generated/matmul_i16.c	(Revision 245760)
+++ generated/matmul_i16.c	(Arbeitskopie)
@@ -74,9 +74,6 @@ extern void matmul_i16 (gfc_array_i16 * const rest
 	     int blas_limit, blas_call gemm);
 export_proto(matmul_i16);
 
-
-
-
 /* Put exhaustive list of possible architectures here here, ORed
    together.
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i16_avx (gfc_array_i16 * const restrict ret static void matmul_i16_avx2 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i16_avx2 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, @@ -2277,7 +2274,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i16_avx2; goto tailcall; Index: generated/matmul_i2.c =================================================================== --- generated/matmul_i2.c (Revision 245760) +++ generated/matmul_i2.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i2 (gfc_array_i2 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i2); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i2_avx (gfc_array_i2 * const restrict retar static void matmul_i2_avx2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i2_avx2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, @@ -2277,7 +2274,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i2_avx2; goto tailcall; Index: generated/matmul_i4.c =================================================================== --- generated/matmul_i4.c (Revision 245760) +++ generated/matmul_i4.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i4 (gfc_array_i4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i4); - - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i4_avx (gfc_array_i4 * const restrict retar static void matmul_i4_avx2 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i4_avx2 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, @@ -2277,7 +2274,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i4_avx2; goto tailcall; Index: generated/matmul_i8.c =================================================================== --- generated/matmul_i8.c (Revision 245760) +++ generated/matmul_i8.c (Arbeitskopie) @@ -74,9 +74,6 @@ extern void matmul_i8 (gfc_array_i8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_i8); - - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -628,7 +625,7 @@ matmul_i8_avx (gfc_array_i8 * const restrict retar static void matmul_i8_avx2 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_i8_avx2 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, @@ -2277,7 +2274,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_i8_avx2; goto tailcall; Index: generated/matmul_r10.c =================================================================== --- generated/matmul_r10.c (Revision 245760) +++ generated/matmul_r10.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r10 (gfc_array_r10 * const rest int blas_limit, blas_call gemm); export_proto(matmul_r10); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r10_avx (gfc_array_r10 * const restrict ret static void matmul_r10_avx2 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r10_avx2 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, @@ -2281,7 +2274,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r10_avx2; goto tailcall; Index: generated/matmul_r16.c =================================================================== --- generated/matmul_r16.c (Revision 245760) +++ generated/matmul_r16.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r16 (gfc_array_r16 * const rest int blas_limit, blas_call gemm); export_proto(matmul_r16); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r16_avx (gfc_array_r16 * const restrict ret static void matmul_r16_avx2 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r16_avx2 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, @@ -2281,7 +2274,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r16_avx2; goto tailcall; Index: generated/matmul_r4.c =================================================================== --- generated/matmul_r4.c (Revision 245760) +++ generated/matmul_r4.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r4 (gfc_array_r4 * const restri int blas_limit, blas_call gemm); export_proto(matmul_r4); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r4_avx (gfc_array_r4 * const restrict retar static void matmul_r4_avx2 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r4_avx2 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, @@ -2281,7 +2274,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r4_avx2; goto tailcall; Index: generated/matmul_r8.c =================================================================== --- generated/matmul_r8.c (Revision 245760) +++ generated/matmul_r8.c (Arbeitskopie) @@ -74,13 +74,6 @@ extern void matmul_r8 (gfc_array_r8 * const restri int blas_limit, blas_call gemm); export_proto(matmul_r8); -#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif - - /* Put exhaustive list of possible architectures here here, ORed together. */ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -632,7 +625,7 @@ matmul_r8_avx (gfc_array_r8 * const restrict retar static void matmul_r8_avx2 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static void matmul_r8_avx2 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, @@ -2281,7 +2274,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_r8_avx2; goto tailcall; Index: m4/matmul.m4 =================================================================== --- m4/matmul.m4 (Revision 245760) +++ m4/matmul.m4 (Arbeitskopie) @@ -75,14 +75,6 @@ extern void matmul_'rtype_code` ('rtype` * const r int blas_limit, blas_call gemm); export_proto(matmul_'rtype_code`); -'ifelse(rtype_letter,`r',dnl -`#if defined(HAVE_AVX) && defined(HAVE_AVX2) -/* REAL types generate identical code for AVX and AVX2. Only generate - an AVX2 function if we are dealing with integer. */ -#undef HAVE_AVX2 -#endif') -` - /* Put exhaustive list of possible architectures here here, ORed together. 
*/ #if defined(HAVE_AVX) || defined(HAVE_AVX2) || defined(HAVE_AVX512F) @@ -101,7 +93,7 @@ static' include(matmul_internal.m4)dnl `static void 'matmul_name` ('rtype` * const restrict retarray, 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, - int blas_limit, blas_call gemm) __attribute__((__target__("avx2"))); + int blas_limit, blas_call gemm) __attribute__((__target__("avx2,fma"))); static' include(matmul_internal.m4)dnl `#endif /* HAVE_AVX2 */ @@ -147,7 +139,8 @@ void matmul_'rtype_code` ('rtype` * const restrict #endif /* HAVE_AVX512F */ #ifdef HAVE_AVX2 - if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) + && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { matmul_p = matmul_'rtype_code`_avx2; goto tailcall; ^ permalink raw reply [flat|nested] 17+ messages in thread
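A brief aside on the mechanism this patch relies on: libgfortran itself is
compiled for a generic x86-64 baseline, and the
__attribute__((__target__("avx2,fma"))) on the forward declaration tells GCC
to compile just that one kernel as if -mavx2 -mfma were in effect, so the
vectorizer may emit fused multiply-add instructions there. A self-contained
sketch of the same pattern; the function name and loop are invented for
illustration:

  /* Build with plain gcc -O3 (no -mavx2/-mfma): only this function may
     use AVX2/FMA instructions, so the caller has to verify at run time
     that the CPU supports them before calling it.  */
  static void
  daxpy_avx2_fma (double *restrict y, const double *restrict x,
                  double a, int n)
      __attribute__ ((__target__ ("avx2,fma")));

  static void
  daxpy_avx2_fma (double *restrict y, const double *restrict x,
                  double a, int n)
  {
    /* a * x[i] + y[i] is the textbook candidate for a fused multiply-add.  */
    for (int i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];
  }

  int
  main (void)
  {
    double x[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }, y[8] = { 0 };
    if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
      daxpy_avx2_fma (y, x, 2.0, 8);  /* Features verified first, as in the patch.  */
    return (int) y[7];
  }

As in the patch, the attribute sits on a declaration and the definition that
follows picks it up, which is what lets the m4 machinery reuse one function
body for all the ISA variants.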
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02 10:46 ` Thomas Koenig
@ 2017-03-02 10:48   ` Jakub Jelinek
  2017-03-02 11:02   ` Jakub Jelinek
  1 sibling, 0 replies; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 10:48 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
>
> OK for trunk?
>
> Regards
>
>	Thomas
>
> 2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>
>
>	PR fortran/78379
>	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
>	reals.  Add fma to target options.
>	(matmul_'rtype_code`): Call AVX2 only if FMA is available.
>	* generated/matmul_c10.c: Regenerated.
>	* generated/matmul_c16.c: Regenerated.
>	* generated/matmul_c4.c: Regenerated.
>	* generated/matmul_c8.c: Regenerated.
>	* generated/matmul_i1.c: Regenerated.
>	* generated/matmul_i16.c: Regenerated.
>	* generated/matmul_i2.c: Regenerated.
>	* generated/matmul_i4.c: Regenerated.
>	* generated/matmul_i8.c: Regenerated.
>	* generated/matmul_r10.c: Regenerated.
>	* generated/matmul_r16.c: Regenerated.
>	* generated/matmul_r4.c: Regenerated.
>	* generated/matmul_r8.c: Regenerated.

Ok, thanks.

	Jakub

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02 10:46 ` Thomas Koenig
  2017-03-02 10:48   ` Jakub Jelinek
@ 2017-03-02 11:02   ` Jakub Jelinek
  2017-03-02 11:57     ` Thomas Koenig
  1 sibling, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 11:02 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 02, 2017 at 11:45:59AM +0100, Thomas Koenig wrote:
> Here's the updated version, which just uses FMA for AVX2.
>
> OK for trunk?
>
> Regards
>
>	Thomas
>
> 2017-03-01  Thomas Koenig  <tkoenig@gcc.gnu.org>
>
>	PR fortran/78379
>	* m4/matmul.m4: (matmul_'rtype_code`_avx2): Also generate for
>	reals.  Add fma to target options.
>	(matmul_'rtype_code`): Call AVX2 only if FMA is available.
>	* generated/matmul_c10.c: Regenerated.
>	* generated/matmul_c16.c: Regenerated.
>	* generated/matmul_c4.c: Regenerated.
>	* generated/matmul_c8.c: Regenerated.
>	* generated/matmul_i1.c: Regenerated.
>	* generated/matmul_i16.c: Regenerated.
>	* generated/matmul_i2.c: Regenerated.
>	* generated/matmul_i4.c: Regenerated.
>	* generated/matmul_i8.c: Regenerated.
>	* generated/matmul_r10.c: Regenerated.
>	* generated/matmul_r16.c: Regenerated.
>	* generated/matmul_r4.c: Regenerated.
>	* generated/matmul_r8.c: Regenerated.

Actually, I see a problem, but not related to this patch.
I bet e.g. tsan would complain heavily about the wrappers, because the code
is racy:

  static void (*matmul_p) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm) = NULL;

  if (matmul_p == NULL)
    {
      matmul_p = matmul_'rtype_code`_vanilla;
      if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
	{
	  /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
	  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
	    {
	      matmul_p = matmul_'rtype_code`_avx512f;
	      goto tailcall;
	    }
#endif  /* HAVE_AVX512F */
	  ...
	}

tailcall:
  (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);

So, even when assuming all matmul_p = stores are atomic, e.g. if you call
matmul from two or more threads at about the same time for the first time,
it could be that the first one sets matmul_p to vanilla and then another
thread runs it (uselessly slow), etc.

As you don't care about the if (matmul_p == NULL) part being done in
multiple threads concurrently, I guess you could e.g. do:

  static void (*matmul_p) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm); // <--- No need for NULL initializer for static var

  void (*matmul_fn) ('rtype` * const restrict retarray,
	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
	int blas_limit, blas_call gemm);

  matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
  if (matmul_fn == NULL)
    {
      matmul_fn = matmul_'rtype_code`_vanilla;
      if (__cpu_model.__cpu_vendor == VENDOR_INTEL)
	{
	  /* Run down the available processors in order of preference.  */
#ifdef HAVE_AVX512F
	  if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F))
	    {
	      matmul_fn = matmul_'rtype_code`_avx512f;
	      goto finish;
	    }
#endif  /* HAVE_AVX512F */
	  ...
    finish:
      __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
    }

  (*matmul_fn) (retarray, a, b, try_blas, blas_limit, gemm);

(i.e. make sure you read matmul_p in each call exactly once and store at
most once per thread).

	Jakub

^ permalink raw reply	[flat|nested] 17+ messages in thread
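Jakub's suggestion is an instance of a generic lock-free one-time
initialization pattern for function pointers. A standalone, runnable version
of the same idea (all names here are hypothetical; this is not libgfortran
code):

  #include <stddef.h>
  #include <stdio.h>

  typedef int (*impl_fn) (int);

  static int vanilla_impl (int x) { return x + 1; }
  static int fancy_impl (int x) { return x + 1; }  /* Stand-in for an AVX2+FMA kernel.  */

  int
  dispatch (int x)
  {
    static impl_fn impl_p;  /* Static, hence zero-initialized (NULL).  */
    impl_fn fn = __atomic_load_n (&impl_p, __ATOMIC_RELAXED);

    if (fn == NULL)
      {
        fn = vanilla_impl;
        if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
          fn = fancy_impl;
        __atomic_store_n (&impl_p, fn, __ATOMIC_RELAXED);
      }
    return fn (x);  /* Call through the local copy, never through impl_p again.  */
  }

  int
  main (void)
  {
    printf ("%d\n", dispatch (41));
    return 0;
  }

Relaxed ordering is sufficient here because every thread that races through
the detection block computes the same pointer value; the worst case is a
redundant store, never a torn or stale dispatch.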
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul 2017-03-02 11:02 ` Jakub Jelinek @ 2017-03-02 11:57 ` Thomas Koenig 2017-03-02 12:02 ` Jakub Jelinek 0 siblings, 1 reply; 17+ messages in thread From: Thomas Koenig @ 2017-03-02 11:57 UTC (permalink / raw) To: Jakub Jelinek; +Cc: fortran, gcc-patches [-- Attachment #1: Type: text/plain, Size: 1135 bytes --] Hi Jakub, > Actually, I see a problem, but not related to this patch. > I bet e.g. tsan would complain heavily on the wrappers, because the code > is racy: Here is a patch implementing your suggestion. Tested at least so far that all matmul test cases pass on my machine. OK for trunk? Regards Thomas 2017-03-02 Thomas Koenig <tkoenig@gcc.gnu.org> Jakub Jelinek <jakub@redhat.com> * m4/matmul.m4 (matmul_'rtype_code`_avx2): Avoid race condition on storing function pointer. * generated/matmul_c10.c: Regenerated. * generated/matmul_c16.c: Regenerated. * generated/matmul_c4.c: Regenerated. * generated/matmul_c8.c: Regenerated. * generated/matmul_i1.c: Regenerated. * generated/matmul_i16.c: Regenerated. * generated/matmul_i2.c: Regenerated. * generated/matmul_i4.c: Regenerated. * generated/matmul_i8.c: Regenerated. * generated/matmul_r10.c: Regenerated. * generated/matmul_r16.c: Regenerated. * generated/matmul_r4.c: Regenerated. * generated/matmul_r8.c: Regenerated. [-- Attachment #2: p1-race.diff --] [-- Type: text/x-patch, Size: 28758 bytes --] Index: generated/matmul_c10.c =================================================================== --- generated/matmul_c10.c (Revision 245836) +++ generated/matmul_c10.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_c10 (gfc_array_c10 * const restrict re gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_c10 * const restrict retarray, + gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_c10_vanilla; + matmul_fn = matmul_c10_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_c10_avx512f; - goto tailcall; + matmul_fn = matmul_c10_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_c10 (gfc_array_c10 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_c10_avx2; - goto tailcall; + matmul_fn = matmul_c10_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_c10 (gfc_array_c10 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_c10_avx; - goto tailcall; + matmul_fn = matmul_c10_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_c16.c =================================================================== --- generated/matmul_c16.c (Revision 245836) +++ generated/matmul_c16.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_c16 (gfc_array_c16 * const restrict re gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_c16 * const restrict retarray, + gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_c16_vanilla; + matmul_fn = matmul_c16_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_c16_avx512f; - goto tailcall; + matmul_fn = matmul_c16_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_c16 (gfc_array_c16 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_c16_avx2; - goto tailcall; + matmul_fn = matmul_c16_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_c16 (gfc_array_c16 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_c16_avx; - goto tailcall; + matmul_fn = matmul_c16_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_c4.c =================================================================== --- generated/matmul_c4.c (Revision 245836) +++ generated/matmul_c4.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_c4 (gfc_array_c4 * const restrict reta gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_c4 * const restrict retarray, + gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_c4_vanilla; + matmul_fn = matmul_c4_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_c4_avx512f; - goto tailcall; + matmul_fn = matmul_c4_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_c4 (gfc_array_c4 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_c4_avx2; - goto tailcall; + matmul_fn = matmul_c4_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_c4 (gfc_array_c4 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_c4_avx; - goto tailcall; + matmul_fn = matmul_c4_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_c8.c =================================================================== --- generated/matmul_c8.c (Revision 245836) +++ generated/matmul_c8.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_c8 (gfc_array_c8 * const restrict reta gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_c8 * const restrict retarray, + gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_c8_vanilla; + matmul_fn = matmul_c8_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_c8_avx512f; - goto tailcall; + matmul_fn = matmul_c8_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_c8 (gfc_array_c8 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_c8_avx2; - goto tailcall; + matmul_fn = matmul_c8_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_c8 (gfc_array_c8 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_c8_avx; - goto tailcall; + matmul_fn = matmul_c8_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i1.c =================================================================== --- generated/matmul_i1.c (Revision 245836) +++ generated/matmul_i1.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i1 (gfc_array_i1 * const restrict reta gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i1 * const restrict retarray, + gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i1_vanilla; + matmul_fn = matmul_i1_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i1_avx512f; - goto tailcall; + matmul_fn = matmul_i1_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i1 (gfc_array_i1 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i1_avx2; - goto tailcall; + matmul_fn = matmul_i1_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i1 (gfc_array_i1 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i1_avx; - goto tailcall; + matmul_fn = matmul_i1_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i16.c =================================================================== --- generated/matmul_i16.c (Revision 245836) +++ generated/matmul_i16.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i16 (gfc_array_i16 * const restrict re gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i16 * const restrict retarray, + gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i16_vanilla; + matmul_fn = matmul_i16_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i16_avx512f; - goto tailcall; + matmul_fn = matmul_i16_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i16 (gfc_array_i16 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i16_avx2; - goto tailcall; + matmul_fn = matmul_i16_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i16 (gfc_array_i16 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i16_avx; - goto tailcall; + matmul_fn = matmul_i16_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i2.c =================================================================== --- generated/matmul_i2.c (Revision 245836) +++ generated/matmul_i2.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i2 (gfc_array_i2 * const restrict reta gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i2 * const restrict retarray, + gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i2_vanilla; + matmul_fn = matmul_i2_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i2_avx512f; - goto tailcall; + matmul_fn = matmul_i2_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i2 (gfc_array_i2 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i2_avx2; - goto tailcall; + matmul_fn = matmul_i2_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i2 (gfc_array_i2 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i2_avx; - goto tailcall; + matmul_fn = matmul_i2_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i4.c =================================================================== --- generated/matmul_i4.c (Revision 245836) +++ generated/matmul_i4.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i4 (gfc_array_i4 * const restrict reta gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i4 * const restrict retarray, + gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i4_vanilla; + matmul_fn = matmul_i4_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i4_avx512f; - goto tailcall; + matmul_fn = matmul_i4_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i4 (gfc_array_i4 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i4_avx2; - goto tailcall; + matmul_fn = matmul_i4_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i4 (gfc_array_i4 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i4_avx; - goto tailcall; + matmul_fn = matmul_i4_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_i8.c =================================================================== --- generated/matmul_i8.c (Revision 245836) +++ generated/matmul_i8.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_i8 (gfc_array_i8 * const restrict reta gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_i8 * const restrict retarray, + gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_i8_vanilla; + matmul_fn = matmul_i8_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_i8_avx512f; - goto tailcall; + matmul_fn = matmul_i8_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_i8 (gfc_array_i8 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_i8_avx2; - goto tailcall; + matmul_fn = matmul_i8_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_i8 (gfc_array_i8 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_i8_avx; - goto tailcall; + matmul_fn = matmul_i8_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_r10.c =================================================================== --- generated/matmul_r10.c (Revision 245836) +++ generated/matmul_r10.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_r10 (gfc_array_r10 * const restrict re gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_r10 * const restrict retarray, + gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_r10_vanilla; + matmul_fn = matmul_r10_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_r10_avx512f; - goto tailcall; + matmul_fn = matmul_r10_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_r10 (gfc_array_r10 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_r10_avx2; - goto tailcall; + matmul_fn = matmul_r10_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_r10 (gfc_array_r10 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_r10_avx; - goto tailcall; + matmul_fn = matmul_r10_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_r16.c =================================================================== --- generated/matmul_r16.c (Revision 245836) +++ generated/matmul_r16.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_r16 (gfc_array_r16 * const restrict re gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_r16 * const restrict retarray, + gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_r16_vanilla; + matmul_fn = matmul_r16_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_r16_avx512f; - goto tailcall; + matmul_fn = matmul_r16_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_r16 (gfc_array_r16 * const restrict re if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_r16_avx2; - goto tailcall; + matmul_fn = matmul_r16_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_r16 (gfc_array_r16 * const restrict re #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_r16_avx; - goto tailcall; + matmul_fn = matmul_r16_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_r4.c =================================================================== --- generated/matmul_r4.c (Revision 245836) +++ generated/matmul_r4.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_r4 (gfc_array_r4 * const restrict reta gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_r4 * const restrict retarray, + gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_r4_vanilla; + matmul_fn = matmul_r4_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_r4_avx512f; - goto tailcall; + matmul_fn = matmul_r4_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_r4 (gfc_array_r4 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_r4_avx2; - goto tailcall; + matmul_fn = matmul_r4_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_r4 (gfc_array_r4 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_r4_avx; - goto tailcall; + matmul_fn = matmul_r4_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: generated/matmul_r8.c =================================================================== --- generated/matmul_r8.c (Revision 245836) +++ generated/matmul_r8.c (Arbeitskopie) @@ -2258,9 +2258,14 @@ void matmul_r8 (gfc_array_r8 * const restrict reta gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) (gfc_array_r8 * const restrict retarray, + gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_r8_vanilla; + matmul_fn = matmul_r8_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. 
*/ @@ -2267,8 +2272,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_r8_avx512f; - goto tailcall; + matmul_fn = matmul_r8_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -2277,8 +2282,8 @@ void matmul_r8 (gfc_array_r8 * const restrict reta if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_r8_avx2; - goto tailcall; + matmul_fn = matmul_r8_avx2; + goto store; } #endif @@ -2286,14 +2291,15 @@ void matmul_r8 (gfc_array_r8 * const restrict reta #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_r8_avx; - goto tailcall; + matmul_fn = matmul_r8_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } Index: m4/matmul.m4 =================================================================== --- m4/matmul.m4 (Revision 245836) +++ m4/matmul.m4 (Arbeitskopie) @@ -123,9 +123,14 @@ void matmul_'rtype_code` ('rtype` * const restrict 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, int blas_limit, blas_call gemm) = NULL; + void (*matmul_fn) ('rtype` * const restrict retarray, + 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, + int blas_limit, blas_call gemm) = NULL; + + matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED); if (matmul_p == NULL) { - matmul_p = matmul_'rtype_code`_vanilla; + matmul_fn = matmul_'rtype_code`_vanilla; if (__cpu_model.__cpu_vendor == VENDOR_INTEL) { /* Run down the available processors in order of preference. */ @@ -132,8 +137,8 @@ void matmul_'rtype_code` ('rtype` * const restrict #ifdef HAVE_AVX512F if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX512F)) { - matmul_p = matmul_'rtype_code`_avx512f; - goto tailcall; + matmul_fn = matmul_'rtype_code`_avx512f; + goto store; } #endif /* HAVE_AVX512F */ @@ -142,8 +147,8 @@ void matmul_'rtype_code` ('rtype` * const restrict if ((__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX2)) && (__cpu_model.__cpu_features[0] & (1 << FEATURE_FMA))) { - matmul_p = matmul_'rtype_code`_avx2; - goto tailcall; + matmul_fn = matmul_'rtype_code`_avx2; + goto store; } #endif @@ -151,14 +156,15 @@ void matmul_'rtype_code` ('rtype` * const restrict #ifdef HAVE_AVX if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX)) { - matmul_p = matmul_'rtype_code`_avx; - goto tailcall; + matmul_fn = matmul_'rtype_code`_avx; + goto store; } #endif /* HAVE_AVX */ } + store: + __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED); } -tailcall: (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm); } ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02 11:57 ` Thomas Koenig
@ 2017-03-02 12:02   ` Jakub Jelinek
  2017-03-02 13:01     ` Thomas Koenig
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Jelinek @ 2017-03-02 12:02 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches

On Thu, Mar 02, 2017 at 12:57:05PM +0100, Thomas Koenig wrote:
> --- m4/matmul.m4	(Revision 245836)
> +++ m4/matmul.m4	(Arbeitskopie)
> @@ -123,9 +123,14 @@ void matmul_'rtype_code` ('rtype` * const restrict
>	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
>	int blas_limit, blas_call gemm) = NULL;

Please drop the " = NULL" here

> +  void (*matmul_fn) ('rtype` * const restrict retarray,
> +	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
> +	int blas_limit, blas_call gemm) = NULL;

and here as well.  The first one because static vars are zero-initialized
by default, the latter because it makes no sense to initialize it and then
immediately overwrite it in the next stmt.

> +
> +  matmul_fn = __atomic_load_n (&matmul_p, __ATOMIC_RELAXED);
>    if (matmul_p == NULL)

This needs to test matmul_fn == NULL instead of matmul_p == NULL.

> @@ -151,14 +156,15 @@ void matmul_'rtype_code` ('rtype` * const restrict
>  #ifdef HAVE_AVX
>        if (__cpu_model.__cpu_features[0] & (1 << FEATURE_AVX))
>	{
> -	  matmul_p = matmul_'rtype_code`_avx;
> -	  goto tailcall;
> +	  matmul_fn = matmul_'rtype_code`_avx;
> +	  goto store;
>	}
>  #endif  /* HAVE_AVX */
>	}
> +   store:
> +      __atomic_store_n (&matmul_p, matmul_fn, __ATOMIC_RELAXED);
>      }
>
> -tailcall:
>    (*matmul_p) (retarray, a, b, try_blas, blas_limit, gemm);

And this needs to use *matmul_fn instead of *matmul_p too.
The whole point is that matmul_p is only loaded using __atomic_load_n
and only optionally stored using __atomic_store_n.

Ok with those changes.

	Jakub

^ permalink raw reply	[flat|nested] 17+ messages in thread
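To see what these review points buy in practice, the hypothetical dispatch()
sketch given after Jakub's 11:02 message can be hammered from several
threads, which is exactly the first-call scenario he describes. With the
single-load/single-store discipline a thread sanitizer should stay quiet,
whereas the original double read of matmul_p is the kind of access tsan
would likely flag. A smoke test, assuming dispatch() from that sketch is
linked into the same program (build with -pthread, optionally
-fsanitize=thread):

  #include <pthread.h>
  #include <stdio.h>

  int dispatch (int x);  /* From the standalone sketch above (assumed).  */

  static void *
  worker (void *arg)
  {
    (void) arg;
    for (int i = 0; i < 100000; i++)
      (void) dispatch (i);  /* First calls race benignly; later calls hit the cache.  */
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
      pthread_create (&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
      pthread_join (t[i], NULL);
    puts ("done");
    return 0;
  }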
* Re: [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul
  2017-03-02 12:02 ` Jakub Jelinek
@ 2017-03-02 13:01   ` Thomas Koenig
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Koenig @ 2017-03-02 13:01 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: fortran, gcc-patches

On 02.03.2017 13:02, Jakub Jelinek wrote:
> And this needs to use *matmul_fn instead of *matmul_p too.
> The whole point is that matmul_p is only loaded using __atomic_load_n
> and only optionally stored using __atomic_store_n.
>
> Ok with those changes.

Thanks! Committed as

https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=245839

Regards

	Thomas

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2017-03-02 13:01 UTC | newest]

Thread overview: 17+ messages
2017-03-01 21:00 [patch, fortran] Enable FMA for AVX2 and AVX512F for matmul Thomas Koenig
2017-03-02  3:22 ` Jerry DeLisle
2017-03-02  6:15 ` Thomas Koenig
2017-03-02  7:32 ` Janne Blomqvist
2017-03-02  7:50 ` Thomas Koenig
2017-03-02  8:09 ` Janne Blomqvist
2017-03-02  8:14 ` Richard Biener
2017-03-02  8:16 ` Jakub Jelinek
2017-03-02  8:43 ` Jakub Jelinek
2017-03-02  9:03 ` Thomas Koenig
2017-03-02  9:08 ` Jakub Jelinek
2017-03-02 10:46 ` Thomas Koenig
2017-03-02 10:48 ` Jakub Jelinek
2017-03-02 11:02 ` Jakub Jelinek
2017-03-02 11:57 ` Thomas Koenig
2017-03-02 12:02 ` Jakub Jelinek
2017-03-02 13:01 ` Thomas Koenig