[PATCH] amdgcn: Enable SIMD vectorization of math functions

public inbox for gcc-patches@gcc.gnu.org

* [PATCH] amdgcn: Enable SIMD vectorization of math functions
@ 2023-02-28 23:01 Kwok Cheung Yeung
  2023-02-28 23:06 ` Andrew Pinski
  2023-03-01 10:01 ` Andrew Stubbs
  0 siblings, 2 replies; 9+ messages in thread
From: Kwok Cheung Yeung @ 2023-02-28 23:01 UTC
  To: gcc-patches, ams

Hello

This patch implements the TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
target hook for the AMD GCN architecture, so that, when code is
vectorized, calls to builtin standard math functions such as asinf, exp
and pow are converted to calls to the recently added vectorized math
functions for GCN in Newlib. The -fno-math-errno flag is required in
addition to the usual vectorization optimization flags for this to
occur, and some of the math functions (the larger double-precision
ones) require a large stack to work properly (the new testcases use
-mstack-size=3000000).
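
As an illustration of the renaming described above, here is a minimal
standalone sketch (not the GCC code itself) of how the patch derives the
name of the vectorized routine from the scalar builtin: it skips the
10-character "__builtin_" prefix and prepends "v<lanes>sf_" or
"v<lanes>df_". The helper name gcn_simd_math_name and its parameters are
hypothetical; only the naming scheme is taken from the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the naming scheme used in the patch.
   BUILTIN_NAME is a scalar builtin name such as "__builtin_sinf",
   LANES is the number of vector lanes and IS_DOUBLE selects "df"
   over "sf".  Returns 0 on success, or -1 if the input does not
   start with "__builtin_" or the result does not fit in BUF.  */
int
gcn_simd_math_name (char *buf, size_t size, const char *builtin_name,
                    int lanes, int is_double)
{
  static const char prefix[] = "__builtin_";
  size_t prefix_len = sizeof (prefix) - 1;  /* 10 characters.  */

  if (strncmp (builtin_name, prefix, prefix_len) != 0)
    return -1;

  int n = snprintf (buf, size, "v%d%s_%s", lanes,
                    is_double ? "df" : "sf", builtin_name + prefix_len);
  return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

With 64 lanes, "__builtin_sinf" maps to "v64sf_sinf" and "__builtin_pow"
to "v64df_pow", which are the names the gcc.target/gcn testcase scans
for. Unlike the fixed-size sprintf in the patch, this sketch checks the
buffer length, which is a design choice of the sketch only.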

This patch needs the GCN vector math functions in Newlib in order to
work; these were included in the recent 4.3.0.20230120 snapshot. As that
snapshot has been the minimum requirement since the patch 'amdgcn,
libgomp: Manually allocated stacks', this should not be a problem.

I have added new testcases to the testsuite that compare the output of
the vectorized math functions against that of the scalar versions,
passing if the results are sufficiently close. The testcase for
standalone GCN (without libgomp) in gcc.target/gcn/ has a problem:
gcn-run currently cannot set the stack size correctly in DejaGnu
testing, so I have made it a compile-only test for now. It is still
useful for checking that calls to the correct functions are being made.
Runtime correctness is still covered by the libgomp test.
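
The "sufficiently close" criterion can be sketched as follows for the
float case. This is a simplified restatement of the check in the
testcases, not a copy of it: a result is rejected if its NaN or infinity
status differs from the expected value, or if the absolute difference
exceeds 1e-5 and the highest differing bit of the two representations
lies above bit position 10. The name close_enough_float and the
MAX_DEVIATION macro are hypothetical, and the sketch assumes a 32-bit
unsigned int.

```c
#include <math.h>

#define EPSILON_float 1e-5
#define MAX_DEVIATION 10

/* Return the 1-based position (counting from the least significant
   bit) of the highest bit at which the representations of X and Y
   differ, or 0 if they are bit-identical.  */
int
deviation_float (float x, float y)
{
  union { float f; unsigned u; } a, b;
  a.f = x;
  b.f = y;

  unsigned diff = a.u ^ b.u;
  int pos = 0;
  while (diff)
    {
      pos++;
      diff >>= 1;
    }
  return pos;
}

/* Return nonzero if ACTUAL is acceptably close to EXPECTED.  */
int
close_enough_float (float expected, float actual)
{
  /* NaN and infinity status must agree exactly.  */
  if (!isnan (actual) != !isnan (expected)
      || !isinf (actual) != !isinf (expected))
    return 0;

  float diff = expected > actual ? expected - actual : actual - expected;
  return !(diff > EPSILON_float
           && deviation_float (expected, actual) > MAX_DEVIATION);
}
```

Note that both conditions must hold for a failure, so a large absolute
difference is tolerated when only the low-order bits differ, and a high
differing bit is tolerated when the absolute difference is tiny.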

Okay for trunk?

Thanks

Kwok

[-- Attachment #2: 0001-amdgcn-Enable-SIMD-vectorization-of-math-functions.patch --]
[-- Type: text/plain, Size: 23876 bytes --]

From 69d13dc898ff7c70e80299a92dc895a89a9e679b Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung <kcy@codesourcery.com>
Date: Tue, 28 Feb 2023 14:15:47 +0000
Subject: [PATCH] amdgcn: Enable SIMD vectorization of math functions

Calls to vectorized versions of routines in the math library will now
be inserted when vectorizing code containing supported math functions.

2023-02-28  Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Paul-Antoine Arras  <pa@codesourcery.com>

	gcc/
	* builtins.cc (mathfn_built_in_explicit): New.
	* config/gcn/gcn.cc: Include case-cfn-macros.h.
	(mathfn_built_in_explicit): Add prototype.
	(gcn_vectorize_builtin_vectorized_function): New.
	(gcn_libc_has_function): New.
	(TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION): Define.
	(TARGET_LIBC_HAS_FUNCTION): Define.

	gcc/testsuite/
	* gcc.target/gcn/simd-math-1.c: New testcase.

	libgomp/
	* testsuite/libgomp.c/simd-math-1.c: New testcase.
---
 gcc/builtins.cc                            |   8 +
 gcc/config/gcn/gcn.cc                      | 110 +++++++++++
 gcc/testsuite/gcc.target/gcn/simd-math-1.c | 210 ++++++++++++++++++++
 libgomp/testsuite/libgomp.c/simd-math-1.c  | 217 +++++++++++++++++++++
 4 files changed, 545 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/gcn/simd-math-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/simd-math-1.c

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 4d467c8c5c1..305c65c29be 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -2089,6 +2089,14 @@ mathfn_built_in (tree type, combined_fn fn)
   return mathfn_built_in_1 (type, fn, /*implicit=*/ 1);
 }
 
+/* Like mathfn_built_in_1, but always use the explicit array.  */
+
+tree
+mathfn_built_in_explicit (tree type, combined_fn fn)
+{
+  return mathfn_built_in_1 (type, fn, /*implicit=*/ 0);
+}
+
 /* Like mathfn_built_in_1, but take a built_in_function and
    always use the implicit array.  */
 
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 23ab01e75d8..d99bb63d4c0 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -53,6 +53,7 @@
 #include "dwarf2.h"
 #include "gimple.h"
 #include "cgraph.h"
+#include "case-cfn-macros.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -5240,6 +5241,110 @@ gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node))
   return 0;
 }
 
+tree mathfn_built_in_explicit (tree, combined_fn);
+
+/* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION.
+   Return the function declaration of the vectorized version of the builtin
+   in the math library if available.  */
+
+tree
+gcn_vectorize_builtin_vectorized_function (unsigned int fn, tree type_out,
+					   tree type_in)
+{
+  if (TREE_CODE (type_out) != VECTOR_TYPE
+      || TREE_CODE (type_in) != VECTOR_TYPE)
+    return NULL_TREE;
+
+  machine_mode out_mode = TYPE_MODE (TREE_TYPE (type_out));
+  int out_n = TYPE_VECTOR_SUBPARTS (type_out);
+  machine_mode in_mode = TYPE_MODE (TREE_TYPE (type_in));
+  int in_n = TYPE_VECTOR_SUBPARTS (type_in);
+  combined_fn cfn = combined_fn (fn);
+
+  /* Keep this consistent with the list of vectorized math routines.  */
+  int implicit_p;
+  switch (fn)
+    {
+    CASE_CFN_ACOS:
+    CASE_CFN_ACOSH:
+    CASE_CFN_ASIN:
+    CASE_CFN_ASINH:
+    CASE_CFN_ATAN:
+    CASE_CFN_ATAN2:
+    CASE_CFN_ATANH:
+    CASE_CFN_COPYSIGN:
+    CASE_CFN_COS:
+    CASE_CFN_COSH:
+    CASE_CFN_ERF:
+    CASE_CFN_EXP:
+    CASE_CFN_EXP2:
+    CASE_CFN_FINITE:
+    CASE_CFN_FMOD:
+    CASE_CFN_GAMMA:
+    CASE_CFN_HYPOT:
+    CASE_CFN_ISNAN:
+    CASE_CFN_LGAMMA:
+    CASE_CFN_LOG:
+    CASE_CFN_LOG10:
+    CASE_CFN_LOG2:
+    CASE_CFN_POW:
+    CASE_CFN_REMAINDER:
+    CASE_CFN_RINT:
+    CASE_CFN_SIN:
+    CASE_CFN_SINH:
+    CASE_CFN_SQRT:
+    CASE_CFN_TAN:
+    CASE_CFN_TANH:
+    CASE_CFN_TGAMMA:
+      implicit_p = 1;
+      break;
+
+    CASE_CFN_SCALB:
+    CASE_CFN_SIGNIFICAND:
+      implicit_p = 0;
+      break;
+
+    default:
+      return NULL_TREE;
+    }
+
+  tree out_t_node = (out_mode == DFmode) ? double_type_node : float_type_node;
+  tree fndecl = implicit_p ? mathfn_built_in (out_t_node, cfn)
+			   : mathfn_built_in_explicit (out_t_node, cfn);
+
+  const char *bname = IDENTIFIER_POINTER (DECL_NAME (fndecl));
+  char name[20];
+  sprintf (name, out_mode == DFmode ? "v%ddf_%s" : "v%dsf_%s",
+	   out_n, bname + 10);
+
+  unsigned arity = 0;
+  for (tree args = DECL_ARGUMENTS (fndecl); args; args = TREE_CHAIN (args))
+    arity++;
+
+  tree fntype = (arity == 1)
+		? build_function_type_list (type_out, type_in, NULL)
+		: build_function_type_list (type_out, type_in, type_in, NULL);
+
+  /* Build a function declaration for the vectorized function.  */
+  tree new_fndecl = build_decl (BUILTINS_LOCATION,
+				FUNCTION_DECL, get_identifier (name), fntype);
+  TREE_PUBLIC (new_fndecl) = 1;
+  DECL_EXTERNAL (new_fndecl) = 1;
+  DECL_IS_NOVOPS (new_fndecl) = 1;
+  TREE_READONLY (new_fndecl) = 1;
+
+  return new_fndecl;
+}
+
+/* Implement TARGET_LIBC_HAS_FUNCTION.  */
+
+bool
+gcn_libc_has_function (enum function_class fn_class,
+		       tree type)
+{
+  return bsd_libc_has_function (fn_class, type);
+}
+
 /* }}}  */
 /* {{{ md_reorg pass.  */
 
@@ -7324,6 +7429,11 @@ gcn_dwarf_register_span (rtx rtl)
   gcn_simd_clone_compute_vecsize_and_simdlen
 #undef  TARGET_SIMD_CLONE_USABLE
 #define TARGET_SIMD_CLONE_USABLE gcn_simd_clone_usable
+#undef TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
+#define TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION \
+  gcn_vectorize_builtin_vectorized_function
+#undef TARGET_LIBC_HAS_FUNCTION
+#define TARGET_LIBC_HAS_FUNCTION gcn_libc_has_function
 #undef  TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P
 #define TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P \
   gcn_small_register_classes_for_mode_p
diff --git a/gcc/testsuite/gcc.target/gcn/simd-math-1.c b/gcc/testsuite/gcc.target/gcn/simd-math-1.c
new file mode 100644
index 00000000000..54e8761f720
--- /dev/null
+++ b/gcc/testsuite/gcc.target/gcn/simd-math-1.c
@@ -0,0 +1,210 @@
+/* Check that the SIMD versions of math routines give the same (or
+   sufficiently close) results as their scalar equivalents, and that the
+   calls to the vectorized math functions are actually emitted.  */
+
+/* Ideally this test should be run, but the math routines require a large
+   stack and gcn-run currently does not respect the stack-size parameter.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-math-errno -mstack-size=3000000 -fdump-tree-vect" } */
+
+
+#undef PRINT_RESULT
+#define VERBOSE 0
+#define EARLY_EXIT 1
+
+#include <math.h>
+#include <stdlib.h>
+
+#ifdef PRINT_RESULT
+  #include <stdio.h>
+  #define PRINTF printf
+#else
+  static void null_printf (const char *f, ...) { }
+
+  #define PRINTF null_printf
+#endif
+
+#define N 512
+#define EPSILON_float 1e-5
+#define EPSILON_double 1e-10
+
+static int failed = 0;
+
+int deviation_float (float x, float y)
+{
+  union {
+    float f;
+    unsigned u;
+  } u, v;
+
+  u.f = x;
+  v.f = y;
+
+  unsigned mask = 0x80000000U;
+  int i;
+
+  for (i = 32; i > 0; i--)
+    if ((u.u ^ v.u) & mask)
+      break;
+    else
+      mask >>= 1;
+
+  return i;
+}
+
+int deviation_double (double x, double y)
+{
+  union {
+    double d;
+    unsigned long long u;
+  } u, v;
+
+  u.d = x;
+  v.d = y;
+
+  unsigned long long mask = 0x8000000000000000ULL;
+  int i;
+
+  for (i = 64; i > 0; i--)
+    if ((u.u ^ v.u) & mask)
+      break;
+    else
+      mask >>= 1;
+
+  return i;
+}
+
+#define TEST_FUN(TFLOAT, LOW, HIGH, FUN) \
+__attribute__((optimize("no-tree-vectorize"))) \
+__attribute__((optimize("no-unsafe-math-optimizations"))) \
+void check_##FUN (TFLOAT res[N], TFLOAT a[N]) \
+{ \
+  int failed = 0; \
+  for (int i = 0; i < N; i++) { \
+    TFLOAT expected = FUN (a[i]); \
+    TFLOAT diff = __builtin_fabs (expected - res[i]); \
+    int deviation = deviation_##TFLOAT (expected, res[i]); \
+    int fail = isnan (res[i]) != isnan (expected) \
+               || isinf (res[i]) != isinf (expected) \
+               || (diff > EPSILON_##TFLOAT && deviation > 10); \
+    failed |= fail; \
+    if (VERBOSE || fail) \
+      PRINTF (#FUN "(%f) = %f, expected = %f, diff = %f, deviation = %d %s\n", \
+              a[i], res[i], expected, diff, deviation, fail ? "(!)" : ""); \
+    if (EARLY_EXIT && fail) \
+      exit (1); \
+  } \
+} \
+void test_##FUN (void) \
+{ \
+  TFLOAT res[N], a[N]; \
+  for (int i = 0; i < N; i++) \
+    a[i] = LOW + ((HIGH - LOW) / N) * i; \
+  for (int i = 0; i < N; i++) \
+    res[i] = FUN (a[i]); \
+  check_##FUN (res, a); \
+}\
+test_##FUN ();
+
+#define TEST_FUN2(TFLOAT, LOW1, HIGH1, LOW2, HIGH2, FUN) \
+__attribute__((optimize("no-tree-vectorize"))) \
+__attribute__((optimize("no-unsafe-math-optimizations"))) \
+void check_##FUN (TFLOAT res[N], TFLOAT a[N], TFLOAT b[N]) \
+{ \
+  int failed = 0; \
+  for (int i = 0; i < N; i++) { \
+    TFLOAT expected = FUN (a[i], b[i]); \
+    TFLOAT diff = __builtin_fabs (expected - res[i]); \
+    int deviation = deviation_##TFLOAT (expected, res[i]); \
+    int fail = isnan (res[i]) != isnan (expected) \
+               || isinf (res[i]) != isinf (expected) \
+               || (diff > EPSILON_##TFLOAT && deviation > 10); \
+    failed |= fail; \
+    if (VERBOSE || fail) \
+      PRINTF (#FUN "(%f,%f) = %f, expected = %f, diff = %f, deviation = %d %s\n", \
+              a[i], b[i], res[i], expected, diff, deviation, fail ? "(!)" : ""); \
+    if (EARLY_EXIT && fail) \
+      exit (1); \
+  } \
+} \
+void test_##FUN (void) \
+{ \
+  TFLOAT res[N], a[N], b[N]; \
+  for (int i = 0; i < N; i++) { \
+    a[i] = LOW1 + ((HIGH1 - LOW1) / N) * i; \
+    b[i] = LOW2 + ((HIGH2 - LOW2) / N) * i; \
+  } \
+  for (int i = 0; i < N; i++) \
+    res[i] = FUN (a[i], b[i]); \
+  check_##FUN (res, a, b); \
+}\
+test_##FUN ();
+
+int main (void)
+{
+  TEST_FUN (float, -1.1, 1.1, acosf); /* { dg-final { scan-tree-dump "v64sf_acosf" "vect" } }*/
+  TEST_FUN (float, -10, 10, acoshf); /* { dg-final { scan-tree-dump "v64sf_acoshf" "vect" } }*/
+  TEST_FUN (float, -1.1, 1.1, asinf); /* { dg-final { scan-tree-dump "v64sf_asinf" "vect" } }*/
+  TEST_FUN (float, -10, 10, asinhf); /* { dg-final { scan-tree-dump "v64sf_asinhf" "vect" } }*/
+  TEST_FUN (float, -1.1, 1.1, atanf); /* { dg-final { scan-tree-dump "v64sf_atanf" "vect" } }*/
+  TEST_FUN2 (float, -2.0, 2.0, 2.0, -2.0, atan2f); /* { dg-final { scan-tree-dump "v64sf_atan2f" "vect" } }*/
+  TEST_FUN (float, -2.0, 2.0, atanhf); /* { dg-final { scan-tree-dump "v64sf_atanhf" "vect" } }*/
+  TEST_FUN2 (float, -10.0, 10.0, 5.0, -15.0, copysignf); /* { dg-final { scan-tree-dump "v64sf_copysignf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, cosf); /* { dg-final { scan-tree-dump "v64sf_cosf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, coshf); /* { dg-final { scan-tree-dump "v64sf_coshf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, erff);  /* { dg-final { scan-tree-dump "v64sf_erff" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, expf); /* { dg-final { scan-tree-dump "v64sf_expf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, exp2f); /* { dg-final { scan-tree-dump "v64sf_exp2f" "vect" } }*/
+  TEST_FUN2 (float, -10.0, 10.0, 100.0, -25.0, fmodf); /* { dg-final { scan-tree-dump "v64sf_fmodf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, gammaf); /* { dg-final { scan-tree-dump "v64sf_gammaf" "vect" { xfail *-*-*} } }*/
+  TEST_FUN2 (float, -10.0, 10.0, 15.0, -5.0, hypotf); /* { dg-final { scan-tree-dump "v64sf_hypotf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, lgammaf); /* { dg-final { scan-tree-dump "v64sf_lgammaf" "vect" { xfail *-*-*} } }*/
+  TEST_FUN (float, -1.0, 50.0, logf); /* { dg-final { scan-tree-dump "v64sf_logf" "vect" } }*/
+  TEST_FUN (float, -1.0, 500.0, log10f); /* { dg-final { scan-tree-dump "v64sf_log10f" "vect" } }*/
+  TEST_FUN (float, -1.0, 64.0, log2f); /* { dg-final { scan-tree-dump "v64sf_log2f" "vect" } }*/
+  TEST_FUN2 (float, -100.0, 100.0, 100.0, -100.0, powf); /* { dg-final { scan-tree-dump "v64sf_powf" "vect" } }*/
+  TEST_FUN2 (float, -50.0, 100.0, -2.0, 40.0, remainderf); /* { dg-final { scan-tree-dump "v64sf_remainderf" "vect" } }*/
+  TEST_FUN (float, -50.0, 50.0, rintf);  /* { dg-final { scan-tree-dump "v64sf_rintf" "vect" } }*/
+  TEST_FUN2 (float, -50.0, 50.0, -10.0, 32.0, __builtin_scalbf); /* { dg-final { scan-tree-dump "v64sf_scalbf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, __builtin_significandf); /* { dg-final { scan-tree-dump "v64sf_significandf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, sinf); /* { dg-final { scan-tree-dump "v64sf_sinf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, sinhf); /* { dg-final { scan-tree-dump "v64sf_sinhf" "vect" } }*/
+  TEST_FUN (float, -0.1, 10000.0, sqrtf); /* { dg-final { scan-tree-dump "v64sf_sqrtf" "vect" } }*/
+  TEST_FUN (float, -5.0, 5.0, tanf); /* { dg-final { scan-tree-dump "v64sf_tanf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, tanhf); /* { dg-final { scan-tree-dump "v64sf_tanhf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, tgammaf); /* { dg-final { scan-tree-dump "v64sf_tgammaf" "vect" } }*/
+
+  TEST_FUN (double, -1.1, 1.1, acos); /* { dg-final { scan-tree-dump "v64df_acos" "vect" } }*/
+  TEST_FUN (double, -10, 10, acosh); /* { dg-final { scan-tree-dump "v64df_acosh" "vect" } }*/
+  TEST_FUN (double, -1.1, 1.1, asin); /* { dg-final { scan-tree-dump "v64df_asin" "vect" } }*/
+  TEST_FUN (double, -10, 10, asinh); /* { dg-final { scan-tree-dump "v64df_asinh" "vect" } }*/
+  TEST_FUN (double, -1.1, 1.1, atan); /* { dg-final { scan-tree-dump "v64df_atan" "vect" } }*/
+  TEST_FUN2 (double, -2.0, 2.0, 2.0, -2.0, atan2); /* { dg-final { scan-tree-dump "v64df_atan2" "vect" } }*/
+  TEST_FUN (double, -2.0, 2.0, atanh); /* { dg-final { scan-tree-dump "v64df_atanh" "vect" } }*/
+  TEST_FUN2 (double, -10.0, 10.0, 5.0, -15.0, copysign); /* { dg-final { scan-tree-dump "v64df_copysign" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, cos); /* { dg-final { scan-tree-dump "v64df_cos" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, cosh); /* { dg-final { scan-tree-dump "v64df_cosh" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, erf); /* { dg-final { scan-tree-dump "v64df_erf" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, exp); /* { dg-final { scan-tree-dump "v64df_exp" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, exp2); /* { dg-final { scan-tree-dump "v64df_exp2" "vect" } }*/
+  TEST_FUN2 (double, -10.0, 10.0, 100.0, -25.0, fmod); /* { dg-final { scan-tree-dump "v64df_fmod" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, gamma); /* { dg-final { scan-tree-dump "v64df_gamma" "vect" { xfail *-*-*} } }*/
+  TEST_FUN2 (double, -10.0, 10.0, 15.0, -5.0, hypot); /* { dg-final { scan-tree-dump "v64df_hypot" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, lgamma); /* { dg-final { scan-tree-dump "v64df_lgamma" "vect" { xfail *-*-*} } }*/
+  TEST_FUN (double, -1.0, 50.0, log); /* { dg-final { scan-tree-dump "v64df_log" "vect" } }*/
+  TEST_FUN (double, -1.0, 500.0, log10); /* { dg-final { scan-tree-dump "v64df_log10" "vect" } }*/
+  TEST_FUN (double, -1.0, 64.0, log2); /* { dg-final { scan-tree-dump "v64df_log2" "vect" { xfail *-*-*} } }*/
+  TEST_FUN2 (double, -100.0, 100.0, 100.0, -100.0, pow); /* { dg-final { scan-tree-dump "v64df_pow" "vect" } }*/
+  TEST_FUN2 (double, -50.0, 100.0, -2.0, 40.0, remainder); /* { dg-final { scan-tree-dump "v64df_remainder" "vect" } }*/
+  TEST_FUN (double, -50.0, 50.0, rint); /* { dg-final { scan-tree-dump "v64df_rint" "vect" } }*/
+  TEST_FUN2 (double, -50.0, 50.0, -10.0, 32.0, __builtin_scalb); /* { dg-final { scan-tree-dump "v64df_scalb" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, __builtin_significand); /* { dg-final { scan-tree-dump "v64df_significand" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, sin); /* { dg-final { scan-tree-dump "v64df_sin" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, sinh); /* { dg-final { scan-tree-dump "v64df_sinh" "vect" } }*/
+  TEST_FUN (double, -0.1, 10000.0, sqrt); /* { dg-final { scan-tree-dump "v64df_sqrt" "vect" } }*/
+  TEST_FUN (double, -5.0, 5.0, tan); /* { dg-final { scan-tree-dump "v64df_tan" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, tanh); /* { dg-final { scan-tree-dump "v64df_tanh" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, tgamma); /* { dg-final { scan-tree-dump "v64df_tgamma" "vect" } }*/
+
+  return failed;
+}
diff --git a/libgomp/testsuite/libgomp.c/simd-math-1.c b/libgomp/testsuite/libgomp.c/simd-math-1.c
new file mode 100644
index 00000000000..947bf606e36
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/simd-math-1.c
@@ -0,0 +1,217 @@
+/* Check that the SIMD versions of math routines give the same (or
+   sufficiently close) results as their scalar equivalents.  */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-vectorize -fno-math-errno" } */
+/* { dg-additional-options -foffload-options=amdgcn-amdhsa=-mstack-size=3000000 { target offload_target_amdgcn } } */
+/* { dg-additional-options -foffload-options=-lm } */
+
+#undef PRINT_RESULT
+#define VERBOSE 0
+#define EARLY_EXIT 1
+
+#include <math.h>
+#include <stdlib.h>
+
+#ifdef PRINT_RESULT
+  #include <stdio.h>
+  #define PRINTF printf
+#else
+  static void null_printf (const char *f, ...) { }
+
+  #define PRINTF null_printf
+#endif
+
+#define N 512
+#define EPSILON_float 1e-5
+#define EPSILON_double 1e-10
+
+static int xfail = 0;
+static int failed = 0;
+
+int deviation_float (float x, float y)
+{
+  union {
+    float f;
+    unsigned u;
+  } u, v;
+
+  u.f = x;
+  v.f = y;
+
+  unsigned mask = 0x80000000U;
+  int i;
+
+  for (i = 32; i > 0; i--)
+    if ((u.u ^ v.u) & mask)
+      break;
+    else
+      mask >>= 1;
+
+  return i;
+}
+
+int deviation_double (double x, double y)
+{
+  union {
+    double d;
+    unsigned long long u;
+  } u, v;
+
+  u.d = x;
+  v.d = y;
+
+  unsigned long long mask = 0x8000000000000000ULL;
+  int i;
+
+  for (i = 64; i > 0; i--)
+    if ((u.u ^ v.u) & mask)
+      break;
+    else
+      mask >>= 1;
+
+  return i;
+}
+
+#define TEST_FUN_XFAIL(TFLOAT, LOW, HIGH, FUN) \
+  xfail = 1; \
+  TEST_FUN (TFLOAT, LOW, HIGH, FUN); \
+  xfail = 0;
+
+#define TEST_FUN(TFLOAT, LOW, HIGH, FUN) \
+__attribute__((optimize("no-tree-vectorize"))) \
+__attribute__((optimize("no-unsafe-math-optimizations"))) \
+void check_##FUN (TFLOAT res[N], TFLOAT a[N]) \
+{ \
+  for (int i = 0; i < N; i++) { \
+    TFLOAT expected = FUN (a[i]); \
+    TFLOAT diff = __builtin_fabs (expected - res[i]); \
+    int deviation = deviation_##TFLOAT (expected, res[i]); \
+    int fail = isnan (res[i]) != isnan (expected) \
+	       || isinf (res[i]) != isinf (expected) \
+	       || (diff > EPSILON_##TFLOAT && deviation > 10); \
+    if (VERBOSE || fail) \
+      PRINTF (#FUN "(%f) = %f, expected = %f, diff = %f, deviation = %d %s\n", \
+	      a[i], res[i], expected, diff, deviation, fail ? "(!)" : ""); \
+    failed |= (fail && !xfail); \
+    if (EARLY_EXIT && failed) \
+      exit (1); \
+  } \
+} \
+void test_##FUN (void) \
+{ \
+  TFLOAT res[N], a[N]; \
+  for (int i = 0; i < N; i++) \
+    a[i] = LOW + ((HIGH - LOW) / N) * i; \
+  _Pragma ("omp target parallel for simd map(to:a) map(from:res)") \
+    for (int i = 0; i < N; i++) \
+      res[i] = FUN (a[i]); \
+  check_##FUN (res, a); \
+}\
+test_##FUN ();
+
+#define TEST_FUN2(TFLOAT, LOW1, HIGH1, LOW2, HIGH2, FUN) \
+__attribute__((optimize("no-tree-vectorize"))) \
+__attribute__((optimize("no-unsafe-math-optimizations"))) \
+void check_##FUN (TFLOAT res[N], TFLOAT a[N], TFLOAT b[N]) \
+{ \
+  int failed = 0; \
+  for (int i = 0; i < N; i++) { \
+    TFLOAT expected = FUN (a[i], b[i]); \
+    TFLOAT diff = __builtin_fabs (expected - res[i]); \
+    int deviation = deviation_##TFLOAT (expected, res[i]); \
+    int fail = isnan (res[i]) != isnan (expected) \
+	       || isinf (res[i]) != isinf (expected) \
+	       || (diff > EPSILON_##TFLOAT && deviation > 10); \
+    failed |= fail; \
+    if (VERBOSE || fail) \
+      PRINTF (#FUN "(%f,%f) = %f, expected = %f, diff = %f, deviation = %d %s\n", \
+	      a[i], b[i], res[i], expected, diff, deviation, fail ? "(!)" : ""); \
+    if (EARLY_EXIT && fail) \
+      exit (1); \
+  } \
+} \
+void test_##FUN (void) \
+{ \
+  TFLOAT res[N], a[N], b[N]; \
+  for (int i = 0; i < N; i++) { \
+    a[i] = LOW1 + ((HIGH1 - LOW1) / N) * i; \
+    b[i] = LOW2 + ((HIGH2 - LOW2) / N) * i; \
+  } \
+  _Pragma ("omp target parallel for simd map(to:a) map(from:res)") \
+    for (int i = 0; i < N; i++) \
+      res[i] = FUN (a[i], b[i]); \
+  check_##FUN (res, a, b); \
+}\
+test_##FUN ();
+
+int main (void)
+{
+  TEST_FUN (float, -1.1, 1.1, acosf);
+  TEST_FUN (float, -10, 10, acoshf);
+  TEST_FUN (float, -1.1, 1.1, asinf);
+  TEST_FUN (float, -10, 10, asinhf);
+  TEST_FUN (float, -1.1, 1.1, atanf);
+  TEST_FUN2 (float, -2.0, 2.0, 2.0, -2.0, atan2f);
+  TEST_FUN (float, -2.0, 2.0, atanhf);
+  TEST_FUN2 (float, -10.0, 10.0, 5.0, -15.0, copysignf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, cosf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, coshf);
+  TEST_FUN (float, -10.0, 10.0, erff);
+  TEST_FUN (float, -10.0, 10.0, expf);
+  TEST_FUN (float, -10.0, 10.0, exp2f);
+  TEST_FUN2 (float, -10.0, 10.0, 100.0, -25.0, fmodf);
+  TEST_FUN (float, -10.0, 10.0, gammaf);
+  TEST_FUN2 (float, -10.0, 10.0, 15.0, -5.0, hypotf);
+  TEST_FUN (float, -10.0, 10.0, lgammaf);
+  TEST_FUN (float, -1.0, 50.0, logf);
+  TEST_FUN (float, -1.0, 500.0, log10f);
+  TEST_FUN (float, -1.0, 64.0, log2f);
+  TEST_FUN2 (float, -100.0, 100.0, 100.0, -100.0, powf);
+  TEST_FUN2 (float, -50.0, 100.0, -2.0, 40.0, remainderf);
+  TEST_FUN (float, -50.0, 50.0, rintf);
+  TEST_FUN2 (float, -50.0, 50.0, -10.0, 32.0, __builtin_scalbf);
+  TEST_FUN (float, -10.0, 10.0, __builtin_significandf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, sinf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, sinhf);
+  TEST_FUN (float, -0.1, 10000.0, sqrtf);
+  TEST_FUN (float, -5.0, 5.0, tanf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, tanhf);
+  /* Newlib's version of tgammaf is known to have poor accuracy.  */
+  TEST_FUN_XFAIL (float, -10.0, 10.0, tgammaf);
+
+  TEST_FUN (double, -1.1, 1.1, acos);
+  TEST_FUN (double, -10, 10, acosh);
+  TEST_FUN (double, -1.1, 1.1, asin);
+  TEST_FUN (double, -10, 10, asinh);
+  TEST_FUN (double, -1.1, 1.1, atan);
+  TEST_FUN2 (double, -2.0, 2.0, 2.0, -2.0, atan2);
+  TEST_FUN (double, -2.0, 2.0, atanh);
+  TEST_FUN2 (double, -10.0, 10.0, 5.0, -15.0, copysign);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, cos);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, cosh);
+  TEST_FUN (double, -10.0, 10.0, erf);
+  TEST_FUN (double, -10.0, 10.0, exp);
+  TEST_FUN (double, -10.0, 10.0, exp2);
+  TEST_FUN2 (double, -10.0, 10.0, 100.0, -25.0, fmod);
+  TEST_FUN (double, -10.0, 10.0, gamma);
+  TEST_FUN2 (double, -10.0, 10.0, 15.0, -5.0, hypot);
+  TEST_FUN (double, -10.0, 10.0, lgamma);
+  TEST_FUN (double, -1.0, 50.0, log);
+  TEST_FUN (double, -1.0, 500.0, log10);
+  TEST_FUN (double, -1.0, 64.0, log2);
+  TEST_FUN2 (double, -100.0, 100.0, 100.0, -100.0, pow);
+  TEST_FUN2 (double, -50.0, 100.0, -2.0, 40.0, remainder);
+  TEST_FUN (double, -50.0, 50.0, rint);
+  TEST_FUN2 (double, -50.0, 50.0, -10.0, 32.0, __builtin_scalb);
+  TEST_FUN (double, -10.0, 10.0, __builtin_significand);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, sin);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, sinh);
+  TEST_FUN (double, -0.1, 10000.0, sqrt);
+  TEST_FUN (double, -5.0, 5.0, tan);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, tanh);
+  /* Newlib's version of tgamma is known to have poor accuracy.  */
+  TEST_FUN_XFAIL (double, -10.0, 10.0, tgamma);
+
+  return failed;
+}
-- 
2.25.1



* Re: [PATCH] amdgcn: Enable SIMD vectorization of math functions
  2023-02-28 23:01 [PATCH] amdgcn: Enable SIMD vectorization of math functions Kwok Cheung Yeung
@ 2023-02-28 23:06 ` Andrew Pinski
  2023-03-01  8:18   ` Richard Biener
  2023-03-01 10:01 ` Andrew Stubbs
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Pinski @ 2023-02-28 23:06 UTC
  To: Kwok Cheung Yeung; +Cc: gcc-patches, ams

On Tue, Feb 28, 2023 at 3:02 PM Kwok Cheung Yeung <kcy@codesourcery.com> wrote:
>
> Hello
>
> This patch implements the TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> target hook for the AMD GCN architecture, such that when vectorized,
> calls to builtin standard math functions such as asinf, exp, pow etc.
> are converted to calls to the recently added vectorized math functions
> for GCN in Newlib. The -fno-math-errno flag is required in addition to
> the usual vectorization optimization flags for this to occur, and some
> of the math functions (the larger double-precision ones) require a large
> stack size to function properly.
>
> This patch requires the GCN vector math functions in Newlib to function
> - these were included in the recent 4.3.0.20230120 snapshot. As this was
> a minimum requirement starting from the patch 'amdgcn, libgomp: Manually
> allocated stacks', this should not be a problem.
>
> I have added new testcases in the testsuite that compare the output of
> the vectorized math functions against the scalar, passing if they are
> sufficiently close. With the testcase for standalone GCN (without
> libgomp) in gcc.target/gcn/, there is a problem since gcn-run currently
> cannot set the stack size correctly in DejaGnu testing, so I have made
> it a compile test for now - it is still useful to check that calls to
> the correct functions are being made. The runtime correctness is still
> covered by the libgomp test.

I thought we were moving towards using the simd attribute instead and
moving away from this kind of patch.
Though since gcn is a special target, where math.h is normally not
included for offloading, this might still be useful.


>
> Okay for trunk?

We are in stage 4 of the GCC 13 release cycle; I suspect we want to wait
until GCC 13 branches off before applying this.

Thanks,
Andrew Pinski

>
> Thanks
>
> Kwok


* Re: [PATCH] amdgcn: Enable SIMD vectorization of math functions
  2023-02-28 23:06 ` Andrew Pinski
@ 2023-03-01  8:18   ` Richard Biener
  2023-03-01  8:57     ` Tobias Burnus
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2023-03-01  8:18 UTC
  To: Andrew Pinski; +Cc: Kwok Cheung Yeung, gcc-patches, ams

On Wed, Mar 1, 2023 at 12:07 AM Andrew Pinski via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Tue, Feb 28, 2023 at 3:02 PM Kwok Cheung Yeung <kcy@codesourcery.com> wrote:
> >
> > Hello
> >
> > This patch implements the TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> > target hook for the AMD GCN architecture, such that when vectorized,
> > calls to builtin standard math functions such as asinf, exp, pow etc.
> > are converted to calls to the recently added vectorized math functions
> > for GCN in Newlib. The -fno-math-errno flag is required in addition to
> > the usual vectorization optimization flags for this to occur, and some
> > of the math functions (the larger double-precision ones) require a large
> > stack size to function properly.
> >
> > This patch requires the GCN vector math functions in Newlib to function
> > - these were included in the recent 4.3.0.20230120 snapshot. As this was
> > a minimum requirement starting from the patch 'amdgcn, libgomp: Manually
> > allocated stacks', this should not be a problem.
> >
> > I have added new testcases in the testsuite that compare the output of
> > the vectorized math functions against the scalar, passing if they are
> > sufficiently close. With the testcase for standalone GCN (without
> > libgomp) in gcc.target/gcn/, there is a problem since gcn-run currently
> > cannot set the stack size correctly in DejaGnu testing, so I have made
> > it a compile test for now - it is still useful to check that calls to
> > the correct functions are being made. The runtime correctness is still
> > covered by the libgomp test.
>
> I thought we were moving towards using the simd attribute instead and
> moving away from this kind of patch.
> Though since gcn is a special target, where math.h is normally not
> included for offloading, this might still be useful.

Yes, this particular target hook is considered legacy.  See, for example,
how glibc provides a math-vector-fortran.h file that announces its vector
math functions to the Fortran compiler, in case you were wondering how to
target non-C-family front ends.

Richard.

>
> >
> > Okay for trunk?
>
> We are in stage 4 of the GCC 13 release cycle; I suspect we want to wait
> until GCC 13 branches off before applying this.
>
> Thanks,
> Andrew Pinski
>
> >
> > Thanks
> >
> > Kwok


* Re: [PATCH] amdgcn: Enable SIMD vectorization of math functions
  2023-03-01  8:18   ` Richard Biener
@ 2023-03-01  8:57     ` Tobias Burnus
  0 siblings, 0 replies; 9+ messages in thread
From: Tobias Burnus @ 2023-03-01  8:57 UTC
  To: Richard Biener, Andrew Pinski; +Cc: Kwok Cheung Yeung, gcc-patches, ams

Hi Richard, hi all,

On 01.03.23 09:18, Richard Biener via Gcc-patches wrote:

>> I thought we were moving towards using the simd attribute instead and
>> moving away from this kind of patch.
>> Though since gcn is a special target, where math.h is normally not
>> included for offloading, this might still be useful.
> Yes, this particular target hook is considered legacy.  See, for example,
> how glibc provides a math-vector-fortran.h file that announces its vector
> math functions to the Fortran compiler, in case you were wondering how to
> target non-C-family front ends.

[Not having looked in depth at the patch, the Newlib libm patches, or this suggestion.]

With offloading there is the problem that we parse the C/C++/Fortran code
only once, targeting the host compiler, and then use the intermediate
representation both for generating the code for that system (like: x86-64) and
for the offload region (like: gcn).

At the moment, I fail to see how to easily handle this. For instance, 'gamma' is not offered
(as SIMD) for x86-64 by glibc's libm, but Newlib for gcn has it (according to Kwok's patch).

I could imagine some dance with
  omp declare variant match(device={arch(gcn)})
combined with 'declare simd' for the arch variants to get this working.

Ignoring some potential issues with declare variant, getting the code active only on
one device and not on the host, and the issues surrounding that, we still need to get this
into the compiler while parsing.

I think this means creating some 'math-extra.h' included in GCC; this then needs to be
included at parse time into C/C++/Fortran (at least when compiling with -fopenmp or
-fopenacc, if ENABLE_OFFLOADING) and will provide the 'declare variant' + 'declare simd'.

At least I don't see any other way - and as we have functions in both GLIBC and Newlib,
I don't see how any header file not maintained by + shipping with GCC would work.
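
To make the 'declare variant' + 'declare simd' idea concrete, here is a minimal
sketch of what such a header might declare. The names demo_gamma, demo_gamma_gcn
and demo_apply are made up for illustration, the bodies are trivial stand-ins
rather than real math routines, and no such header exists in GCC as far as this
thread shows. Without -fopenmp the pragmas are ignored and this is plain C.

```c
/* Hypothetical 'math-extra.h' sketch.  All names are illustrative
   assumptions; this is not an existing GCC or Newlib header.  */

/* Stand-in for a vectorizable GCN routine.  'declare simd' asks the
   compiler to create SIMD clones of it (ignored without -fopenmp).  */
#pragma omp declare simd notinbranch
double demo_gamma_gcn (double x)
{
  return x * 0.5;
}

/* Base function.  When compiling for a device whose arch is 'gcn', calls
   are meant to be redirected to demo_gamma_gcn; elsewhere the base
   version is used.  Both bodies compute the same value here.  */
#pragma omp declare variant (demo_gamma_gcn) match (device = {arch (gcn)})
double demo_gamma (double x)
{
  return x * 0.5;
}

/* A loop whose calls the vectorizer could replace with SIMD clones.  */
void demo_apply (double *out, const double *in, int n)
{
  #pragma omp simd
  for (int i = 0; i < n; i++)
    out[i] = demo_gamma (in[i]);
}
```

The open question from above remains: something still has to make these
declarations visible at parse time for every front end.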

Thoughts on this?

Regarding the current SIMD use, see:

For Fortran, GLIBC has:
   /usr/include/finclude/math-vector-fortran.h
   !GCC$ builtin (pow) attributes simd (notinbranch) if('x86_64')
→ https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/fpu/finclude/math-vector-fortran.h;hb=refs/heads/master

For "math.h" users (C, C++), GLIBC has:
   /usr/include/x86_64-linux-gnu/bits/math-vector.h
   #if defined __x86_64__ && defined __FAST_MATH__
...
#  define __DECL_SIMD_x86_64 _Pragma ("omp declare simd notinbranch")
# elif __GNUC_PREREQ (6,0)
/* W/o OpenMP use GCC 6.* __attribute__ ((__simd__)).  */
#  define __DECL_SIMD_x86_64 __attribute__ ((__simd__ ("notinbranch")))
...
#  define __DECL_SIMD_cos __DECL_SIMD_x86_64

→ https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/fpu/bits/math-vector.h;hb=refs/heads/master

For completeness, a while ago (2-3 years) there were some patches to extend the SIMD support for AMD.

And LLVM, which parses the source files multiple times, partially avoids this issue
by including the offload target's header files; in addition, it has some target-specific
.h files of its own, like:
   https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h
   https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_hip_math.h
which call the vendor math functions directly.

Tobias


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] amdgcn: Enable SIMD vectorization of math functions
  2023-02-28 23:01 [PATCH] amdgcn: Enable SIMD vectorization of math functions Kwok Cheung Yeung
  2023-02-28 23:06 ` Andrew Pinski
@ 2023-03-01 10:01 ` Andrew Stubbs
  2023-03-01 10:52   ` Andre Vieira (lists)
  2023-03-02 15:07   ` Kwok Cheung Yeung
  1 sibling, 2 replies; 9+ messages in thread
From: Andrew Stubbs @ 2023-03-01 10:01 UTC (permalink / raw)
  To: Kwok Cheung Yeung, gcc-patches

On 28/02/2023 23:01, Kwok Cheung Yeung wrote:
> Hello
> 
> This patch implements the TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION 
> target hook for the AMD GCN architecture, such that when vectorized, 
> calls to builtin standard math functions such as asinf, exp, pow etc. 
> are converted to calls to the recently added vectorized math functions 
> for GCN in Newlib. The -fno-math-errno flag is required in addition to 
> the usual vectorization optimization flags for this to occur, and some 
> of the math functions (the larger double-precision ones) require a large 
> stack size to function properly.
> 
> This patch requires the GCN vector math functions in Newlib to function 
> - these were included in the recent 4.3.0.20230120 snapshot. As this was 
> a minimum requirement starting from the patch 'amdgcn, libgomp: Manually 
> allocated stacks', this should not be a problem.
> 
> I have added new testcases in the testsuite that compare the output of 
> the vectorized math functions against the scalar, passing if they are 
> sufficiently close. With the testcase for standalone GCN (without 
> libgomp) in gcc.target/gcn/, there is a problem since gcn-run currently 
> cannot set the stack size correctly in DejaGnu testing, so I have made 
> it a compile test for now - it is still useful to check that calls to 
> the correct functions are being made. The runtime correctness is still 
> covered by the libgomp test.
> 
> Okay for trunk?

The main part of the patch is OK, with the small changes below.
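
As a reading aid for the main part: the hook maps each scalar builtin to a
Newlib routine whose name encodes the lane count and element type, which is
why the testcase scans for names such as v64sf_acosf and v64df_pow. The helper
below is only a sketch of that naming scheme. demo_simd_name is a made-up
name, and it uses a bounds-checked snprintf where the patch uses sprintf into
a fixed buffer.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative helper, not part of the patch: build the name of the
   vectorized routine for a scalar builtin, e.g. "__builtin_acosf" with
   64 float lanes gives "v64sf_acosf".  Returns 0 on success, or -1 if
   the buffer is too small.  */
int demo_simd_name (char *buf, size_t size, int lanes, int is_double,
		    const char *builtin_name)
{
  static const char prefix[] = "__builtin_";
  const size_t prefix_len = sizeof prefix - 1;

  /* Drop the "__builtin_" prefix when present.  */
  if (strncmp (builtin_name, prefix, prefix_len) == 0)
    builtin_name += prefix_len;

  int n = snprintf (buf, size, "v%d%s_%s", lanes,
		    is_double ? "df" : "sf", builtin_name);
  return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```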

Others have pointed out that "omp declare simd" exists, but you and I 
have been all through that verbally, long ago, and as Tobias says the 
offload compiler cannot rely on markup in the host compiler's header 
files to solve this problem.

> @@ -7324,6 +7429,11 @@ gcn_dwarf_register_span (rtx rtl)
>    gcn_simd_clone_compute_vecsize_and_simdlen
>  #undef  TARGET_SIMD_CLONE_USABLE
>  #define TARGET_SIMD_CLONE_USABLE gcn_simd_clone_usable
> +#undef TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> +#define TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION \
> +  gcn_vectorize_builtin_vectorized_function
> +#undef TARGET_LIBC_HAS_FUNCTION
> +#define TARGET_LIBC_HAS_FUNCTION gcn_libc_has_function
>  #undef  TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P
>  #define TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P \
>    gcn_small_register_classes_for_mode_p

Please keep these in alphabetical order.

> +/* Ideally this test should be run, but the math routines require a large
> +   stack and gcn-run currently does not respect the stack-size parameter.  */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fno-math-errno -mstack-size=3000000 -fdump-tree-vect" } */

This isn't ideal. The dg-set-target-env-var directive (I think this is 
it?) can set GCN_STACK_SIZE, which gcn-run does honour, but I realise 
that doesn't work with remote test targets (like ours).

I suggest adding an additional test that sets the envvar and #includes 
the code from this one; one test to scan the dumps, one test to run it. 
Like this .... (untested, syntax uncertain).

/* { dg-do run } */
/* { dg-options "-O2 -ftree-vectorize -fno-math-errno" } */
/* { dg-set-target-env-var "GCN_STACK_SIZE" "3000000" } */
#include "simd-math-1.c"

The run test will get skipped in our test environment (and anyone else 
using remote), but the libgomp test should make up for that.
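
For anyone unfamiliar with the envvar route, this is roughly what honouring
such a variable involves on the launcher side. It is a sketch only, not
gcn-run's actual code: demo_stack_size is a made-up helper and the fallback
value is whatever the caller chooses.

```c
#include <errno.h>
#include <stdlib.h>

/* Illustrative helper, not taken from gcn-run: turn the text of an
   environment variable into a stack size, returning FALLBACK when the
   variable is unset, empty, zero, or not a plain decimal number.  */
unsigned long demo_stack_size (const char *value, unsigned long fallback)
{
  /* Reject unset/empty values and anything not starting with a digit
     (strtoul would otherwise accept a leading sign or whitespace).  */
  if (value == NULL || value[0] < '0' || value[0] > '9')
    return fallback;

  char *end;
  errno = 0;
  unsigned long size = strtoul (value, &end, 10);
  if (errno != 0 || *end != '\0' || size == 0)
    return fallback;

  return size;
}
```

A launcher would call it as demo_stack_size (getenv ("GCN_STACK_SIZE"), some_default),
which is why the variable has to be set in the environment of the process that
actually launches the kernel, and why remote targets do not see it.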

Andrew

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] amdgcn: Enable SIMD vectorization of math functions
  2023-03-01 10:01 ` Andrew Stubbs
@ 2023-03-01 10:52   ` Andre Vieira (lists)
  2023-03-01 12:13     ` Andrew Stubbs
  2023-03-02 15:07   ` Kwok Cheung Yeung
  1 sibling, 1 reply; 9+ messages in thread
From: Andre Vieira (lists) @ 2023-03-01 10:52 UTC (permalink / raw)
  To: Andrew Stubbs, Kwok Cheung Yeung, gcc-patches



On 01/03/2023 10:01, Andrew Stubbs wrote:
 > The main part of the patch is OK, with the small changes below.
 >
 > Others have pointed out that "omp declare simd" exists, but you and I
 > have been all through that verbally, long ago, and as Tobias says the
 > offload compiler cannot rely on markup in the host compiler's header
 > files to solve this problem.

For what it's worth, I am currently working on enabling "omp declare 
simd" for SVE and, more importantly, teaching GCC to use "omp declare 
variant"s with simd constructs as simdclones during autovect. This 
gives a bit more control over which simdclones you advertise as available. 
I hope to have some RFCs on here soon. I obviously am not familiar with 
your constraints but just wanted to let you know.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] amdgcn: Enable SIMD vectorization of math functions
  2023-03-01 10:52   ` Andre Vieira (lists)
@ 2023-03-01 12:13     ` Andrew Stubbs
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Stubbs @ 2023-03-01 12:13 UTC (permalink / raw)
  To: Andre Vieira (lists), Kwok Cheung Yeung, gcc-patches

On 01/03/2023 10:52, Andre Vieira (lists) wrote:
> For what it's worth, I am currently working on enabling "omp declare 
> simd" for SVE and, more importantly, teaching GCC to use "omp declare 
> variant"s with simd constructs as simdclones during autovect. This 
> gives a bit more control over which simdclones you advertise as available. 
> I hope to have some RFCs on here soon. I obviously am not familiar with 
> your constraints but just wanted to let you know.

We can use "omp declare simd" (or whatever the exact form is) to 
create SIMD clones of user functions, but this doesn't work for libm 
functions (or any library) unless the header files for both host and 
offload device have matching markup. Given that the x86_64 Linux host is 
using Glibc and the offload device compiler is using Newlib this is not 
likely to be the case.
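
For the user-function case, a minimal sketch of the markup might look like
the following. The function names are made up, and the exact combination of
directives is my assumption of the usual form: 'declare simd' on a function
inside a 'declare target' region. Without -fopenmp the pragmas are ignored
and the loop simply runs on the host.

```c
/* Hypothetical user code; demo_saxpy_elem and demo_saxpy are illustrative
   names only.  The function is marked for offload ('declare target') and
   for SIMD cloning ('declare simd') in the same translation unit, so host
   and device compilers see identical markup.  */
#pragma omp declare target
#pragma omp declare simd notinbranch
float demo_saxpy_elem (float a, float x, float y)
{
  return a * x + y;
}
#pragma omp end declare target

/* y[i] = a * x[i] + y[i], offloaded when a device is available.  */
void demo_saxpy (float a, const float *x, float *y, int n)
{
  #pragma omp target teams distribute parallel for simd \
	  map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = demo_saxpy_elem (a, x[i], y[i]);
}
```

This works because the user owns the one declaration both compilers parse;
for libm there is no such shared declaration.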

I suppose the variants thing could do something about this, but with this 
patch we don't need it to, for now.

Andrew

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] amdgcn: Enable SIMD vectorization of math functions
  2023-03-01 10:01 ` Andrew Stubbs
  2023-03-01 10:52   ` Andre Vieira (lists)
@ 2023-03-02 15:07   ` Kwok Cheung Yeung
  2023-03-02 17:20     ` Andrew Stubbs
  1 sibling, 1 reply; 9+ messages in thread
From: Kwok Cheung Yeung @ 2023-03-02 15:07 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3684 bytes --]

Hello

I've made the suggested changes. Should I hold off on committing this 
until GCC 13 has been branched off?

Kwok

On 01/03/2023 10:01 am, Andrew Stubbs wrote:
> On 28/02/2023 23:01, Kwok Cheung Yeung wrote:
>> Hello
>>
>> This patch implements the TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION 
>> target hook for the AMD GCN architecture, such that when vectorized, 
>> calls to builtin standard math functions such as asinf, exp, pow etc. 
>> are converted to calls to the recently added vectorized math functions 
>> for GCN in Newlib. The -fno-math-errno flag is required in addition to 
>> the usual vectorization optimization flags for this to occur, and some 
>> of the math functions (the larger double-precision ones) require a 
>> large stack size to function properly.
>>
>> This patch requires the GCN vector math functions in Newlib to 
>> function - these were included in the recent 4.3.0.20230120 snapshot. 
>> As this was a minimum requirement starting from the patch 'amdgcn, 
>> libgomp: Manually allocated stacks', this should not be a problem.
>>
>> I have added new testcases in the testsuite that compare the output of 
>> the vectorized math functions against the scalar, passing if they are 
>> sufficiently close. With the testcase for standalone GCN (without 
>> libgomp) in gcc.target/gcn/, there is a problem since gcn-run 
>> currently cannot set the stack size correctly in DejaGnu testing, so I 
>> have made it a compile test for now - it is still useful to check that 
>> calls to the correct functions are being made. The runtime correctness 
>> is still covered by the libgomp test.
>>
>> Okay for trunk?
> 
> The main part of the patch is OK, with the small changes below.
> 
> Others have pointed out that "omp declare simd" exists, but you and I 
> have been all through that verbally, long ago, and as Tobias says the 
> offload compiler cannot rely on markup in the host compiler's header 
> files to solve this problem.
> 
>> @@ -7324,6 +7429,11 @@ gcn_dwarf_register_span (rtx rtl)
>>    gcn_simd_clone_compute_vecsize_and_simdlen
>>  #undef  TARGET_SIMD_CLONE_USABLE
>>  #define TARGET_SIMD_CLONE_USABLE gcn_simd_clone_usable
>> +#undef TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>> +#define TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION \
>> +  gcn_vectorize_builtin_vectorized_function
>> +#undef TARGET_LIBC_HAS_FUNCTION
>> +#define TARGET_LIBC_HAS_FUNCTION gcn_libc_has_function
>>  #undef  TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P
>>  #define TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P \
>>    gcn_small_register_classes_for_mode_p
> 
> Please keep these in alphabetical order.
> 
>> +/* Ideally this test should be run, but the math routines require a large
>> +   stack and gcn-run currently does not respect the stack-size parameter.  */
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -ftree-vectorize -fno-math-errno -mstack-size=3000000 -fdump-tree-vect" } */
> 
> This isn't ideal. The dg-set-target-env-var directive (I think this is 
> it?) can set GCN_STACK_SIZE, which gcn-run does honour, but I realise 
> that doesn't work with remote test targets (like ours).
> 
> I suggest adding an additional test that sets the envvar and #includes 
> the code from this one; one test to scan the dumps, one test to run it. 
> Like this .... (untested, syntax uncertain).
> 
> /* { dg-do run } */
> /* { dg-options "-O2 -ftree-vectorize -fno-math-errno" } */
> /* { dg-set-target-env-var "GCN_STACK_SIZE" "3000000" } */
> #include "simd-math-1.c"
> 
> The run test will get skipped in our test environment (and anyone else 
> using remote), but the libgomp test should make up for that.
> 
> Andrew

[-- Attachment #2: 0001-amdgcn-Enable-SIMD-vectorization-of-math-functions.patch --]
[-- Type: text/plain, Size: 24659 bytes --]

From 0b43ef3c2d6afd4aecfc03fd1d2df675626e017b Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung <kcy@codesourcery.com>
Date: Tue, 28 Feb 2023 14:15:47 +0000
Subject: [PATCH] amdgcn: Enable SIMD vectorization of math functions

Calls to vectorized versions of routines in the math library will now
be inserted when vectorizing code containing supported math functions.

2023-02-28  Kwok Cheung Yeung  <kcy@codesourcery.com>
	    Paul-Antoine Arras  <pa@codesourcery.com>

	gcc/
	* builtins.cc (mathfn_built_in_explicit): New.
	* config/gcn/gcn.cc: Include case-cfn-macros.h.
	(mathfn_built_in_explicit): Add prototype.
	(gcn_vectorize_builtin_vectorized_function): New.
	(gcn_libc_has_function): New.
	(TARGET_LIBC_HAS_FUNCTION): Define.
	(TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION): Define.

	gcc/testsuite/
	* gcc.target/gcn/simd-math-1.c: New testcase.
	* gcc.target/gcn/simd-math-2.c: New testcase.

	libgomp/
	* testsuite/libgomp.c/simd-math-1.c: New testcase.
---
 gcc/builtins.cc                            |   8 +
 gcc/config/gcn/gcn.cc                      | 110 +++++++++++
 gcc/testsuite/gcc.target/gcn/simd-math-1.c | 206 +++++++++++++++++++
 gcc/testsuite/gcc.target/gcn/simd-math-2.c |   8 +
 libgomp/testsuite/libgomp.c/simd-math-1.c  | 217 +++++++++++++++++++++
 5 files changed, 549 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/gcn/simd-math-1.c
 create mode 100644 gcc/testsuite/gcc.target/gcn/simd-math-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/simd-math-1.c

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 4d467c8c5c1..305c65c29be 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -2089,6 +2089,14 @@ mathfn_built_in (tree type, combined_fn fn)
   return mathfn_built_in_1 (type, fn, /*implicit=*/ 1);
 }
 
+/* Like mathfn_built_in_1, but always use the explicit array.  */
+
+tree
+mathfn_built_in_explicit (tree type, combined_fn fn)
+{
+  return mathfn_built_in_1 (type, fn, /*implicit=*/ 0);
+}
+
 /* Like mathfn_built_in_1, but take a built_in_function and
    always use the implicit array.  */
 
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 23ab01e75d8..6f0a90a4904 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -53,6 +53,7 @@
 #include "dwarf2.h"
 #include "gimple.h"
 #include "cgraph.h"
+#include "case-cfn-macros.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -5240,6 +5241,110 @@ gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node))
   return 0;
 }
 
+tree mathfn_built_in_explicit (tree, combined_fn);
+
+/* Implement TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION.
+   Return the function declaration of the vectorized version of the builtin
+   in the math library if available.  */
+
+tree
+gcn_vectorize_builtin_vectorized_function (unsigned int fn, tree type_out,
+					   tree type_in)
+{
+  if (TREE_CODE (type_out) != VECTOR_TYPE
+      || TREE_CODE (type_in) != VECTOR_TYPE)
+    return NULL_TREE;
+
+  machine_mode out_mode = TYPE_MODE (TREE_TYPE (type_out));
+  int out_n = TYPE_VECTOR_SUBPARTS (type_out);
+  machine_mode in_mode = TYPE_MODE (TREE_TYPE (type_in));
+  int in_n = TYPE_VECTOR_SUBPARTS (type_in);
+  combined_fn cfn = combined_fn (fn);
+
+  /* Keep this consistent with the list of vectorized math routines.  */
+  int implicit_p;
+  switch (fn)
+    {
+    CASE_CFN_ACOS:
+    CASE_CFN_ACOSH:
+    CASE_CFN_ASIN:
+    CASE_CFN_ASINH:
+    CASE_CFN_ATAN:
+    CASE_CFN_ATAN2:
+    CASE_CFN_ATANH:
+    CASE_CFN_COPYSIGN:
+    CASE_CFN_COS:
+    CASE_CFN_COSH:
+    CASE_CFN_ERF:
+    CASE_CFN_EXP:
+    CASE_CFN_EXP2:
+    CASE_CFN_FINITE:
+    CASE_CFN_FMOD:
+    CASE_CFN_GAMMA:
+    CASE_CFN_HYPOT:
+    CASE_CFN_ISNAN:
+    CASE_CFN_LGAMMA:
+    CASE_CFN_LOG:
+    CASE_CFN_LOG10:
+    CASE_CFN_LOG2:
+    CASE_CFN_POW:
+    CASE_CFN_REMAINDER:
+    CASE_CFN_RINT:
+    CASE_CFN_SIN:
+    CASE_CFN_SINH:
+    CASE_CFN_SQRT:
+    CASE_CFN_TAN:
+    CASE_CFN_TANH:
+    CASE_CFN_TGAMMA:
+      implicit_p = 1;
+      break;
+
+    CASE_CFN_SCALB:
+    CASE_CFN_SIGNIFICAND:
+      implicit_p = 0;
+      break;
+
+    default:
+      return NULL_TREE;
+    }
+
+  tree out_t_node = (out_mode == DFmode) ? double_type_node : float_type_node;
+  tree fndecl = implicit_p ? mathfn_built_in (out_t_node, cfn)
+			   : mathfn_built_in_explicit (out_t_node, cfn);
+
+  const char *bname = IDENTIFIER_POINTER (DECL_NAME (fndecl));
+  char name[20];
+  sprintf (name, out_mode == DFmode ? "v%ddf_%s" : "v%dsf_%s",
+	   out_n, bname + 10);
+
+  unsigned arity = 0;
+  for (tree args = DECL_ARGUMENTS (fndecl); args; args = TREE_CHAIN (args))
+    arity++;
+
+  tree fntype = (arity == 1)
+		? build_function_type_list (type_out, type_in, NULL)
+		: build_function_type_list (type_out, type_in, type_in, NULL);
+
+  /* Build a function declaration for the vectorized function.  */
+  tree new_fndecl = build_decl (BUILTINS_LOCATION,
+				FUNCTION_DECL, get_identifier (name), fntype);
+  TREE_PUBLIC (new_fndecl) = 1;
+  DECL_EXTERNAL (new_fndecl) = 1;
+  DECL_IS_NOVOPS (new_fndecl) = 1;
+  TREE_READONLY (new_fndecl) = 1;
+
+  return new_fndecl;
+}
+
+/* Implement TARGET_LIBC_HAS_FUNCTION.  */
+
+bool
+gcn_libc_has_function (enum function_class fn_class,
+		       tree type)
+{
+  return bsd_libc_has_function (fn_class, type);
+}
+
 /* }}}  */
 /* {{{ md_reorg pass.  */
 
@@ -7290,6 +7395,8 @@ gcn_dwarf_register_span (rtx rtl)
   gcn_ira_change_pseudo_allocno_class
 #undef  TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P gcn_legitimate_constant_p
+#undef  TARGET_LIBC_HAS_FUNCTION
+#define TARGET_LIBC_HAS_FUNCTION gcn_libc_has_function
 #undef  TARGET_LRA_P
 #define TARGET_LRA_P hook_bool_void_true
 #undef  TARGET_MACHINE_DEPENDENT_REORG
@@ -7337,6 +7444,9 @@ gcn_dwarf_register_span (rtx rtl)
 #define TARGET_TRULY_NOOP_TRUNCATION gcn_truly_noop_truncation
 #undef  TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
 #define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST gcn_vectorization_cost
+#undef  TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
+#define TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION \
+  gcn_vectorize_builtin_vectorized_function
 #undef  TARGET_VECTORIZE_GET_MASK_MODE
 #define TARGET_VECTORIZE_GET_MASK_MODE gcn_vectorize_get_mask_mode
 #undef  TARGET_VECTORIZE_PREFERRED_SIMD_MODE
diff --git a/gcc/testsuite/gcc.target/gcn/simd-math-1.c b/gcc/testsuite/gcc.target/gcn/simd-math-1.c
new file mode 100644
index 00000000000..6868ccb2c54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/gcn/simd-math-1.c
@@ -0,0 +1,206 @@
+/* Check that calls to the vectorized math functions are actually emitted.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-math-errno -mstack-size=3000000 -fdump-tree-vect" } */
+
+
+#undef PRINT_RESULT
+#define VERBOSE 0
+#define EARLY_EXIT 1
+
+#include <math.h>
+#include <stdlib.h>
+
+#ifdef PRINT_RESULT
+  #include <stdio.h>
+  #define PRINTF printf
+#else
+  static void null_printf (const char *f, ...) { }
+
+  #define PRINTF null_printf
+#endif
+
+#define N 512
+#define EPSILON_float 1e-5
+#define EPSILON_double 1e-10
+
+static int failed = 0;
+
+int deviation_float (float x, float y)
+{
+  union {
+    float f;
+    unsigned u;
+  } u, v;
+
+  u.f = x;
+  v.f = y;
+
+  unsigned mask = 0x80000000U; 
+  int i;
+
+  for (i = 32; i > 0; i--)
+    if ((u.u ^ v.u) & mask)
+      break;
+    else
+      mask >>= 1;
+
+  return i;
+}
+
+int deviation_double (double x, double y)
+{
+  union {
+    double d;
+    unsigned long long u;
+  } u, v;
+
+  u.d = x;
+  v.d = y;
+
+  unsigned long long mask = 0x8000000000000000ULL;
+  int i;
+
+  for (i = 64; i > 0; i--)
+    if ((u.u ^ v.u) & mask)
+      break;
+    else
+      mask >>= 1;
+
+  return i;
+}
+
+#define TEST_FUN(TFLOAT, LOW, HIGH, FUN) \
+__attribute__((optimize("no-tree-vectorize"))) \
+__attribute__((optimize("no-unsafe-math-optimizations"))) \
+void check_##FUN (TFLOAT res[N], TFLOAT a[N]) \
+{ \
+  int failed = 0; \
+  for (int i = 0; i < N; i++) { \
+    TFLOAT expected = FUN (a[i]); \
+    TFLOAT diff = __builtin_fabs (expected - res[i]); \
+    int deviation = deviation_##TFLOAT (expected, res[i]); \
+    int fail = isnan (res[i]) != isnan (expected) \
+               || isinf (res[i]) != isinf (expected) \
+               || (diff > EPSILON_##TFLOAT && deviation > 10); \
+    failed |= fail; \
+    if (VERBOSE || fail) \
+      PRINTF (#FUN "(%f) = %f, expected = %f, diff = %f, deviation = %d %s\n", \
+              a[i], res[i], expected, diff, deviation, fail ? "(!)" : ""); \
+    if (EARLY_EXIT && fail) \
+      exit (1); \
+  } \
+} \
+void test_##FUN (void) \
+{ \
+  TFLOAT res[N], a[N]; \
+  for (int i = 0; i < N; i++) \
+    a[i] = LOW + ((HIGH - LOW) / N) * i; \
+  for (int i = 0; i < N; i++) \
+    res[i] = FUN (a[i]); \
+  check_##FUN (res, a); \
+}\
+test_##FUN ();
+
+#define TEST_FUN2(TFLOAT, LOW1, HIGH1, LOW2, HIGH2, FUN) \
+__attribute__((optimize("no-tree-vectorize"))) \
+__attribute__((optimize("no-unsafe-math-optimizations"))) \
+void check_##FUN (TFLOAT res[N], TFLOAT a[N], TFLOAT b[N]) \
+{ \
+  int failed = 0; \
+  for (int i = 0; i < N; i++) { \
+    TFLOAT expected = FUN (a[i], b[i]); \
+    TFLOAT diff = __builtin_fabs (expected - res[i]); \
+    int deviation = deviation_##TFLOAT (expected, res[i]); \
+    int fail = isnan (res[i]) != isnan (expected) \
+               || isinf (res[i]) != isinf (expected) \
+               || (diff > EPSILON_##TFLOAT && deviation > 10); \
+    failed |= fail; \
+    if (VERBOSE || fail) \
+      PRINTF (#FUN "(%f,%f) = %f, expected = %f, diff = %f, deviation = %d %s\n", \
+              a[i], b[i], res[i], expected, diff, deviation, fail ? "(!)" : ""); \
+    if (EARLY_EXIT && fail) \
+      exit (1); \
+  } \
+} \
+void test_##FUN (void) \
+{ \
+  TFLOAT res[N], a[N], b[N]; \
+  for (int i = 0; i < N; i++) { \
+    a[i] = LOW1 + ((HIGH1 - LOW1) / N) * i; \
+    b[i] = LOW2 + ((HIGH2 - LOW2) / N) * i; \
+  } \
+  for (int i = 0; i < N; i++) \
+    res[i] = FUN (a[i], b[i]); \
+  check_##FUN (res, a, b); \
+}\
+test_##FUN ();
+
+int main (void)
+{
+  TEST_FUN (float, -1.1, 1.1, acosf); /* { dg-final { scan-tree-dump "v64sf_acosf" "vect" } }*/
+  TEST_FUN (float, -10, 10, acoshf); /* { dg-final { scan-tree-dump "v64sf_acoshf" "vect" } }*/
+  TEST_FUN (float, -1.1, 1.1, asinf); /* { dg-final { scan-tree-dump "v64sf_asinf" "vect" } }*/
+  TEST_FUN (float, -10, 10, asinhf); /* { dg-final { scan-tree-dump "v64sf_asinhf" "vect" } }*/
+  TEST_FUN (float, -1.1, 1.1, atanf); /* { dg-final { scan-tree-dump "v64sf_atanf" "vect" } }*/
+  TEST_FUN2 (float, -2.0, 2.0, 2.0, -2.0, atan2f); /* { dg-final { scan-tree-dump "v64sf_atan2f" "vect" } }*/
+  TEST_FUN (float, -2.0, 2.0, atanhf); /* { dg-final { scan-tree-dump "v64sf_atanhf" "vect" } }*/
+  TEST_FUN2 (float, -10.0, 10.0, 5.0, -15.0, copysignf); /* { dg-final { scan-tree-dump "v64sf_copysignf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, cosf); /* { dg-final { scan-tree-dump "v64sf_cosf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, coshf); /* { dg-final { scan-tree-dump "v64sf_coshf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, erff);  /* { dg-final { scan-tree-dump "v64sf_erff" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, expf); /* { dg-final { scan-tree-dump "v64sf_expf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, exp2f); /* { dg-final { scan-tree-dump "v64sf_exp2f" "vect" } }*/
+  TEST_FUN2 (float, -10.0, 10.0, 100.0, -25.0, fmodf); /* { dg-final { scan-tree-dump "v64sf_fmodf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, gammaf); /* { dg-final { scan-tree-dump "v64sf_gammaf" "vect" { xfail *-*-*} } }*/
+  TEST_FUN2 (float, -10.0, 10.0, 15.0, -5.0,hypotf); /* { dg-final { scan-tree-dump "v64sf_hypotf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, lgammaf); /* { dg-final { scan-tree-dump "v64sf_lgammaf" "vect" { xfail *-*-*} } }*/
+  TEST_FUN (float, -1.0, 50.0, logf); /* { dg-final { scan-tree-dump "v64sf_logf" "vect" } }*/
+  TEST_FUN (float, -1.0, 500.0, log10f); /* { dg-final { scan-tree-dump "v64sf_log10f" "vect" } }*/
+  TEST_FUN (float, -1.0, 64.0, log2f); /* { dg-final { scan-tree-dump "v64sf_log2f" "vect" } }*/
+  TEST_FUN2 (float, -100.0, 100.0, 100.0, -100.0, powf); /* { dg-final { scan-tree-dump "v64sf_powf" "vect" } }*/
+  TEST_FUN2 (float, -50.0, 100.0, -2.0, 40.0, remainderf); /* { dg-final { scan-tree-dump "v64sf_remainderf" "vect" } }*/
+  TEST_FUN (float, -50.0, 50.0, rintf);  /* { dg-final { scan-tree-dump "v64sf_rintf" "vect" } }*/
+  TEST_FUN2 (float, -50.0, 50.0, -10.0, 32.0, __builtin_scalbf); /* { dg-final { scan-tree-dump "v64sf_scalbf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, __builtin_significandf); /* { dg-final { scan-tree-dump "v64sf_significandf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, sinf); /* { dg-final { scan-tree-dump "v64sf_sinf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, sinhf); /* { dg-final { scan-tree-dump "v64sf_sinhf" "vect" } }*/
+  TEST_FUN (float, -0.1, 10000.0, sqrtf); /* { dg-final { scan-tree-dump "v64sf_sqrtf" "vect" } }*/
+  TEST_FUN (float, -5.0, 5.0, tanf); /* { dg-final { scan-tree-dump "v64sf_tanf" "vect" } }*/
+  TEST_FUN (float, -3.14159265359, 3.14159265359, tanhf); /* { dg-final { scan-tree-dump "v64sf_tanhf" "vect" } }*/
+  TEST_FUN (float, -10.0, 10.0, tgammaf); /* { dg-final { scan-tree-dump "v64sf_tgammaf" "vect" } }*/
+
+  TEST_FUN (double, -1.1, 1.1, acos); /* { dg-final { scan-tree-dump "v64df_acos" "vect" } }*/
+  TEST_FUN (double, -10, 10, acosh); /* { dg-final { scan-tree-dump "v64df_acosh" "vect" } }*/
+  TEST_FUN (double, -1.1, 1.1, asin); /* { dg-final { scan-tree-dump "v64df_asin" "vect" } }*/
+  TEST_FUN (double, -10, 10, asinh); /* { dg-final { scan-tree-dump "v64df_asinh" "vect" } }*/
+  TEST_FUN (double, -1.1, 1.1, atan); /* { dg-final { scan-tree-dump "v64df_atan" "vect" } }*/
+  TEST_FUN2 (double, -2.0, 2.0, 2.0, -2.0, atan2); /* { dg-final { scan-tree-dump "v64df_atan2" "vect" } }*/
+  TEST_FUN (double, -2.0, 2.0, atanh); /* { dg-final { scan-tree-dump "v64df_atanh" "vect" } }*/
+  TEST_FUN2 (double, -10.0, 10.0, 5.0, -15.0, copysign); /* { dg-final { scan-tree-dump "v64df_copysign" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, cos); /* { dg-final { scan-tree-dump "v64df_cos" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, cosh); /* { dg-final { scan-tree-dump "v64df_cosh" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, erf); /* { dg-final { scan-tree-dump "v64df_erf" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, exp); /* { dg-final { scan-tree-dump "v64df_exp" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, exp2); /* { dg-final { scan-tree-dump "v64df_exp2" "vect" } }*/
+  TEST_FUN2 (double, -10.0, 10.0, 100.0, -25.0, fmod); /* { dg-final { scan-tree-dump "v64df_fmod" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, gamma); /* { dg-final { scan-tree-dump "v64df_gamma" "vect" { xfail *-*-*} } }*/
+  TEST_FUN2 (double, -10.0, 10.0, 15.0, -5.0, hypot); /* { dg-final { scan-tree-dump "v64df_hypot" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, lgamma); /* { dg-final { scan-tree-dump "v64df_lgamma" "vect" { xfail *-*-*} } }*/
+  TEST_FUN (double, -1.0, 50.0, log); /* { dg-final { scan-tree-dump "v64df_log" "vect" } }*/
+  TEST_FUN (double, -1.0, 500.0, log10); /* { dg-final { scan-tree-dump "v64df_log10" "vect" } }*/
+  TEST_FUN (double, -1.0, 64.0, log2); /* { dg-final { scan-tree-dump "v64df_log2" "vect" { xfail *-*-*} } }*/
+  TEST_FUN2 (double, -100.0, 100.0, 100.0, -100.0, pow); /* { dg-final { scan-tree-dump "v64df_pow" "vect" } }*/
+  TEST_FUN2 (double, -50.0, 100.0, -2.0, 40.0, remainder); /* { dg-final { scan-tree-dump "v64df_remainder" "vect" } }*/
+  TEST_FUN (double, -50.0, 50.0, rint); /* { dg-final { scan-tree-dump "v64df_rint" "vect" } }*/
+  TEST_FUN2 (double, -50.0, 50.0, -10.0, 32.0, __builtin_scalb); /* { dg-final { scan-tree-dump "v64df_scalb" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, __builtin_significand); /* { dg-final { scan-tree-dump "v64df_significand" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, sin); /* { dg-final { scan-tree-dump "v64df_sin" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, sinh); /* { dg-final { scan-tree-dump "v64df_sinh" "vect" } }*/
+  TEST_FUN (double, -0.1, 10000.0, sqrt); /* { dg-final { scan-tree-dump "v64df_sqrt" "vect" } }*/
+  TEST_FUN (double, -5.0, 5.0, tan); /* { dg-final { scan-tree-dump "v64df_tan" "vect" } }*/
+  TEST_FUN (double, -3.14159265359, 3.14159265359, tanh); /* { dg-final { scan-tree-dump "v64df_tanh" "vect" } }*/
+  TEST_FUN (double, -10.0, 10.0, tgamma); /* { dg-final { scan-tree-dump "v64df_tgamma" "vect" } }*/
+
+  return failed;
+}
diff --git a/gcc/testsuite/gcc.target/gcn/simd-math-2.c b/gcc/testsuite/gcc.target/gcn/simd-math-2.c
new file mode 100644
index 00000000000..375a2ad9263
--- /dev/null
+++ b/gcc/testsuite/gcc.target/gcn/simd-math-2.c
@@ -0,0 +1,8 @@
+/* Check that the SIMD versions of math routines give the same (or
+   sufficiently close) results as their scalar equivalents.  */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-vectorize -fno-math-errno" } */
+/* { dg-set-target-env-var "GCN_STACK_SIZE" "3000000" } */
+
+#include "simd-math-1.c"
diff --git a/libgomp/testsuite/libgomp.c/simd-math-1.c b/libgomp/testsuite/libgomp.c/simd-math-1.c
new file mode 100644
index 00000000000..947bf606e36
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/simd-math-1.c
@@ -0,0 +1,217 @@
+/* Check that the SIMD versions of math routines give the same (or
+   sufficiently close) results as their scalar equivalents.  */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-vectorize -fno-math-errno" } */
+/* { dg-additional-options -foffload-options=amdgcn-amdhsa=-mstack-size=3000000 { target offload_target_amdgcn } } */
+/* { dg-additional-options -foffload-options=-lm } */
+
+#undef PRINT_RESULT
+#define VERBOSE 0
+#define EARLY_EXIT 1
+
+#include <math.h>
+#include <stdlib.h>
+
+#ifdef PRINT_RESULT
+  #include <stdio.h>
+  #define PRINTF printf
+#else
+  static void null_printf (const char *f, ...) { }
+
+  #define PRINTF null_printf
+#endif
+
+#define N 512
+#define EPSILON_float 1e-5
+#define EPSILON_double 1e-10
+
+static int xfail = 0;
+static int failed = 0;
+
+int deviation_float (float x, float y)
+{
+  union {
+    float f;
+    unsigned u;
+  } u, v;
+
+  u.f = x;
+  v.f = y;
+
+  unsigned mask = 0x80000000U;
+  int i;
+
+  for (i = 32; i > 0; i--)
+    if ((u.u ^ v.u) & mask)
+      break;
+    else
+      mask >>= 1;
+
+  return i;
+}
+
+int deviation_double (double x, double y)
+{
+  union {
+    double d;
+    unsigned long long u;
+  } u, v;
+
+  u.d = x;
+  v.d = y;
+
+  unsigned long long mask = 0x8000000000000000ULL;
+  int i;
+
+  for (i = 64; i > 0; i--)
+    if ((u.u ^ v.u) & mask)
+      break;
+    else
+      mask >>= 1;
+
+  return i;
+}
+
+#define TEST_FUN_XFAIL(TFLOAT, LOW, HIGH, FUN) \
+  xfail = 1; \
+  TEST_FUN (TFLOAT, LOW, HIGH, FUN); \
+  xfail = 0;
+
+#define TEST_FUN(TFLOAT, LOW, HIGH, FUN) \
+__attribute__((optimize("no-tree-vectorize"))) \
+__attribute__((optimize("no-unsafe-math-optimizations"))) \
+void check_##FUN (TFLOAT res[N], TFLOAT a[N]) \
+{ \
+  for (int i = 0; i < N; i++) { \
+    TFLOAT expected = FUN (a[i]); \
+    TFLOAT diff = __builtin_fabs (expected - res[i]); \
+    int deviation = deviation_##TFLOAT (expected, res[i]); \
+    int fail = isnan (res[i]) != isnan (expected) \
+	       || isinf (res[i]) != isinf (expected) \
+	       || (diff > EPSILON_##TFLOAT && deviation > 10); \
+    if (VERBOSE || fail) \
+      PRINTF (#FUN "(%f) = %f, expected = %f, diff = %f, deviation = %d %s\n", \
+	      a[i], res[i], expected, diff, deviation, fail ? "(!)" : ""); \
+    failed |= (fail && !xfail); \
+    if (EARLY_EXIT && failed) \
+      exit (1); \
+  } \
+} \
+void test_##FUN (void) \
+{ \
+  TFLOAT res[N], a[N]; \
+  for (int i = 0; i < N; i++) \
+    a[i] = LOW + ((HIGH - LOW) / N) * i; \
+  _Pragma ("omp target parallel for simd map(to:a) map(from:res)") \
+    for (int i = 0; i < N; i++) \
+      res[i] = FUN (a[i]); \
+  check_##FUN (res, a); \
+}\
+test_##FUN ();
+
+#define TEST_FUN2(TFLOAT, LOW1, HIGH1, LOW2, HIGH2, FUN) \
+__attribute__((optimize("no-tree-vectorize"))) \
+__attribute__((optimize("no-unsafe-math-optimizations"))) \
+void check_##FUN (TFLOAT res[N], TFLOAT a[N], TFLOAT b[N]) \
+{ \
+  /* Failures are recorded in the file-scope 'failed'.  */ \
+  for (int i = 0; i < N; i++) { \
+    TFLOAT expected = FUN (a[i], b[i]); \
+    TFLOAT diff = __builtin_fabs (expected - res[i]); \
+    int deviation = deviation_##TFLOAT (expected, res[i]); \
+    int fail = isnan (res[i]) != isnan (expected) \
+	       || isinf (res[i]) != isinf (expected) \
+	       || (diff > EPSILON_##TFLOAT && deviation > 10); \
+    failed |= fail; \
+    if (VERBOSE || fail) \
+      PRINTF (#FUN "(%f,%f) = %f, expected = %f, diff = %f, deviation = %d %s\n", \
+	      a[i], b[i], res[i], expected, diff, deviation, fail ? "(!)" : ""); \
+    if (EARLY_EXIT && fail) \
+      exit (1); \
+  } \
+} \
+void test_##FUN (void) \
+{ \
+  TFLOAT res[N], a[N], b[N]; \
+  for (int i = 0; i < N; i++) { \
+    a[i] = LOW1 + ((HIGH1 - LOW1) / N) * i; \
+    b[i] = LOW2 + ((HIGH2 - LOW2) / N) * i; \
+  } \
+  _Pragma ("omp target parallel for simd map(to:a, b) map(from:res)") \
+    for (int i = 0; i < N; i++) \
+      res[i] = FUN (a[i], b[i]); \
+  check_##FUN (res, a, b); \
+}\
+test_##FUN ();
+
+int main (void)
+{
+  TEST_FUN (float, -1.1, 1.1, acosf);
+  TEST_FUN (float, -10, 10, acoshf);
+  TEST_FUN (float, -1.1, 1.1, asinf);
+  TEST_FUN (float, -10, 10, asinhf);
+  TEST_FUN (float, -1.1, 1.1, atanf);
+  TEST_FUN2 (float, -2.0, 2.0, 2.0, -2.0, atan2f);
+  TEST_FUN (float, -2.0, 2.0, atanhf);
+  TEST_FUN2 (float, -10.0, 10.0, 5.0, -15.0, copysignf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, cosf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, coshf);
+  TEST_FUN (float, -10.0, 10.0, erff);
+  TEST_FUN (float, -10.0, 10.0, expf);
+  TEST_FUN (float, -10.0, 10.0, exp2f);
+  TEST_FUN2 (float, -10.0, 10.0, 100.0, -25.0, fmodf);
+  TEST_FUN (float, -10.0, 10.0, gammaf);
+  TEST_FUN2 (float, -10.0, 10.0, 15.0, -5.0, hypotf);
+  TEST_FUN (float, -10.0, 10.0, lgammaf);
+  TEST_FUN (float, -1.0, 50.0, logf);
+  TEST_FUN (float, -1.0, 500.0, log10f);
+  TEST_FUN (float, -1.0, 64.0, log2f);
+  TEST_FUN2 (float, -100.0, 100.0, 100.0, -100.0, powf);
+  TEST_FUN2 (float, -50.0, 100.0, -2.0, 40.0, remainderf);
+  TEST_FUN (float, -50.0, 50.0, rintf);
+  TEST_FUN2 (float, -50.0, 50.0, -10.0, 32.0, __builtin_scalbf);
+  TEST_FUN (float, -10.0, 10.0, __builtin_significandf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, sinf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, sinhf);
+  TEST_FUN (float, -0.1, 10000.0, sqrtf);
+  TEST_FUN (float, -5.0, 5.0, tanf);
+  TEST_FUN (float, -3.14159265359, 3.14159265359, tanhf);
+  /* Newlib's version of tgammaf is known to have poor accuracy.  */
+  TEST_FUN_XFAIL (float, -10.0, 10.0, tgammaf);
+
+  TEST_FUN (double, -1.1, 1.1, acos);
+  TEST_FUN (double, -10, 10, acosh);
+  TEST_FUN (double, -1.1, 1.1, asin);
+  TEST_FUN (double, -10, 10, asinh);
+  TEST_FUN (double, -1.1, 1.1, atan);
+  TEST_FUN2 (double, -2.0, 2.0, 2.0, -2.0, atan2);
+  TEST_FUN (double, -2.0, 2.0, atanh);
+  TEST_FUN2 (double, -10.0, 10.0, 5.0, -15.0, copysign);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, cos);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, cosh);
+  TEST_FUN (double, -10.0, 10.0, erf);
+  TEST_FUN (double, -10.0, 10.0, exp);
+  TEST_FUN (double, -10.0, 10.0, exp2);
+  TEST_FUN2 (double, -10.0, 10.0, 100.0, -25.0, fmod);
+  TEST_FUN (double, -10.0, 10.0, gamma);
+  TEST_FUN2 (double, -10.0, 10.0, 15.0, -5.0, hypot);
+  TEST_FUN (double, -10.0, 10.0, lgamma);
+  TEST_FUN (double, -1.0, 50.0, log);
+  TEST_FUN (double, -1.0, 500.0, log10);
+  TEST_FUN (double, -1.0, 64.0, log2);
+  TEST_FUN2 (double, -100.0, 100.0, 100.0, -100.0, pow);
+  TEST_FUN2 (double, -50.0, 100.0, -2.0, 40.0, remainder);
+  TEST_FUN (double, -50.0, 50.0, rint);
+  TEST_FUN2 (double, -50.0, 50.0, -10.0, 32.0, __builtin_scalb);
+  TEST_FUN (double, -10.0, 10.0, __builtin_significand);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, sin);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, sinh);
+  TEST_FUN (double, -0.1, 10000.0, sqrt);
+  TEST_FUN (double, -5.0, 5.0, tan);
+  TEST_FUN (double, -3.14159265359, 3.14159265359, tanh);
+  /* Newlib's version of tgamma is known to have poor accuracy.  */
+  TEST_FUN_XFAIL (double, -10.0, 10.0, tgamma);
+
+  return failed;
+}
-- 
2.25.1
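
For readers following the acceptance test in the check_##FUN macros above,
here is a minimal standalone sketch of the float case. It is not part of the
patch: the names highest_differing_bit and close_enough are made up for
illustration, and memcpy is used instead of the union purely as a stylistic
choice. The logic is meant to mirror deviation_float and the 'fail'
expression: a result is rejected only if its NaN/Inf class differs from the
scalar result, or if it is both more than EPSILON_float away and differs
above the low 10 bits of the representation.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for EPSILON_float in the testcase.  */
#define SKETCH_EPSILON_FLOAT 1e-5

/* 1-based position, counted from the least significant bit, of the
   highest bit at which the two representations differ; 0 if the two
   values are bit-identical.  Hypothetical helper, intended to return
   the same value as deviation_float.  */
static int
highest_differing_bit (float x, float y)
{
  uint32_t a, b;
  memcpy (&a, &x, sizeof a);
  memcpy (&b, &y, sizeof b);

  uint32_t diff = a ^ b;
  int pos = 0;
  while (diff != 0)
    {
      pos++;
      diff >>= 1;
    }
  return pos;
}

/* Returns 1 if ACTUAL would be accepted against EXPECTED, else 0.
   Hypothetical helper; the inverse of the 'fail' expression.  */
static int
close_enough (float expected, float actual)
{
  if (!!isnan (expected) != !!isnan (actual))
    return 0;
  if (!!isinf (expected) != !!isinf (actual))
    return 0;

  float diff = fabsf (expected - actual);
  /* Reject only when BOTH the absolute and the bitwise test fail.  */
  return !(diff > SKETCH_EPSILON_FLOAT
	   && highest_differing_bit (expected, actual) > 10);
}
```

Combining the two criteria means that large results, where one unit in the
last place already exceeds the absolute epsilon, still pass as long as only
low-order mantissa bits differ, while small results pass on the absolute
difference alone.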



* Re: [PATCH] amdgcn: Enable SIMD vectorization of math functions
  2023-03-02 15:07   ` Kwok Cheung Yeung
@ 2023-03-02 17:20     ` Andrew Stubbs
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Stubbs @ 2023-03-02 17:20 UTC (permalink / raw)
  To: Kwok Cheung Yeung, gcc-patches

On 02/03/2023 15:07, Kwok Cheung Yeung wrote:
> Hello
> 
> I've made the suggested changes. Should I hold off on committing this 
> until GCC 13 has been branched off?

No need, amdgcn is not a primary target and this stuff won't affect 
anyone else. Please go ahead and commit.

Andrew


end of thread: 2023-03-02 17:20 UTC

Thread overview: 9+ messages
2023-02-28 23:01 [PATCH] amdgcn: Enable SIMD vectorization of math functions Kwok Cheung Yeung
2023-02-28 23:06 ` Andrew Pinski
2023-03-01  8:18   ` Richard Biener
2023-03-01  8:57     ` Tobias Burnus
2023-03-01 10:01 ` Andrew Stubbs
2023-03-01 10:52   ` Andre Vieira (lists)
2023-03-01 12:13     ` Andrew Stubbs
2023-03-02 15:07   ` Kwok Cheung Yeung
2023-03-02 17:20     ` Andrew Stubbs
