public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
@ 2015-06-18 11:57 Benedikt Huber
  2015-06-18 12:03 ` [PATCH] 2015-06-15 Benedikt Huber <benedikt.huber@theobroma-systems.com> Benedikt Huber
                   ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Benedikt Huber @ 2015-06-18 11:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: benedikt.huber, philipp.tomsich

arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation and
a Newton-Raphson step, respectively.
There are ARMv8 implementations where this is faster than using fdiv and rsqrt.
It runs three steps for double and two steps for float to achieve the needed precision.

There is one caveat and open question.
Since -ffast-math enables flush to zero intermediate values between approximation steps
will be flushed to zero if they are denormal.
E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
The test cases pass, but it is unclear to me whether this is expected behavior with -ffast-math.

The patch applies to commit:
svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470

Please consider including this patch.
Thank you and best regards,
Benedikt Huber

Benedikt Huber (1):
  2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>

 gcc/ChangeLog                            |   9 +++
 gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
 gcc/config/aarch64/aarch64-protos.h      |   2 +
 gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
 gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
 gcc/config/aarch64/aarch64.md            |   3 +
 gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113 +++++++++++++++++++++++++++++++
 7 files changed, 277 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
  2015-06-18 11:57 [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math Benedikt Huber
@ 2015-06-18 12:03 ` Benedikt Huber
  2015-06-27  8:12   ` Andrew Pinski
  2015-06-18 12:36 ` [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math Kumar, Venkataramanan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Benedikt Huber @ 2015-06-18 12:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: benedikt.huber, philipp.tomsich

       * config/aarch64/aarch64-builtins.c: Builtins for rsqrt and rsqrtf.
       * config/aarch64/aarch64-protos.h: Declare.
       * config/aarch64/aarch64-simd.md: Matching expressions for frsqrte and frsqrts.
       * config/aarch64/aarch64.c: New functions. Emit rsqrt estimation code in fast math mode.
       * config/aarch64/aarch64.md: Added enum entry.
       * testsuite/gcc.target/aarch64/rsqrt.c: Tests for single and double.
---
 gcc/ChangeLog                            |   9 +++
 gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
 gcc/config/aarch64/aarch64-protos.h      |   2 +
 gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
 gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
 gcc/config/aarch64/aarch64.md            |   3 +
 gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113 +++++++++++++++++++++++++++++++
 7 files changed, 277 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index c9b156f..690ebba 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,12 @@
+2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
+
+	* config/aarch64/aarch64-builtins.c: Builtins for rsqrt and rsqrtf.
+	* config/aarch64/aarch64-protos.h: Declare.
+	* config/aarch64/aarch64-simd.md: Matching expressions for frsqrte and frsqrts.
+	* config/aarch64/aarch64.c: New functions. Emit rsqrt estimation code in fast math mode.
+	* config/aarch64/aarch64.md: Added enum entry.
+	* testsuite/gcc.target/aarch64/rsqrt.c: Tests for single and double.
+
 2015-06-14  Richard Sandiford  <richard.sandiford@arm.com>
 
 	* rtl.h (classify_insn): Declare.
diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index f7a39ec..484bb84 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -342,6 +342,8 @@ enum aarch64_builtins
   AARCH64_BUILTIN_GET_FPSR,
   AARCH64_BUILTIN_SET_FPSR,
 
+  AARCH64_BUILTIN_RSQRT,
+  AARCH64_BUILTIN_RSQRTF,
   AARCH64_SIMD_BUILTIN_BASE,
   AARCH64_SIMD_BUILTIN_LANE_CHECK,
 #include "aarch64-simd-builtins.def"
@@ -831,6 +833,32 @@ aarch64_init_crc32_builtins ()
 }
 
 void
+aarch64_add_builtin_rsqrt (void)
+{
+  tree fndecl = NULL;
+  tree ftype = NULL;
+  ftype = build_function_type_list (double_type_node, double_type_node, NULL_TREE);
+
+  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt",
+                                 ftype,
+                                 AARCH64_BUILTIN_RSQRT,
+                                 BUILT_IN_MD,
+                                 NULL,
+                                 NULL_TREE);
+  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT] = fndecl;
+
+  tree ftypef = NULL;
+  ftypef = build_function_type_list (float_type_node, float_type_node, NULL_TREE);
+  fndecl = add_builtin_function ("__builtin_aarch64_rsqrtf",
+                                 ftypef,
+                                 AARCH64_BUILTIN_RSQRTF,
+                                 BUILT_IN_MD,
+                                 NULL,
+                                 NULL_TREE);
+  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRTF] = fndecl;
+}
+
+void
 aarch64_init_builtins (void)
 {
   tree ftype_set_fpr
@@ -855,6 +883,7 @@ aarch64_init_builtins (void)
     aarch64_init_simd_builtins ();
   if (TARGET_CRC32)
     aarch64_init_crc32_builtins ();
+  aarch64_add_builtin_rsqrt ();
 }
 
 tree
@@ -1099,6 +1128,23 @@ aarch64_crc32_expand_builtin (int fcode, tree exp, rtx target)
   return target;
 }
 
+static rtx
+aarch64_expand_builtin_rsqrt (int fcode, tree exp, rtx target)
+{
+  rtx pat;
+  tree arg0 = CALL_EXPR_ARG (exp, 0);
+  rtx op0 = expand_normal (arg0);
+
+  enum insn_code c = CODE_FOR_rsqrtdf;
+  if (fcode == AARCH64_BUILTIN_RSQRTF)
+    c = CODE_FOR_rsqrtsf;
+
+  pat = GEN_FCN (c) (target, op0);
+  emit_insn (pat);
+
+  return target;
+}
+
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient.  */
 rtx
@@ -1146,6 +1192,11 @@ aarch64_expand_builtin (tree exp,
   else if (fcode >= AARCH64_CRC32_BUILTIN_BASE && fcode <= AARCH64_CRC32_BUILTIN_MAX)
     return aarch64_crc32_expand_builtin (fcode, exp, target);
 
+  if (fcode == AARCH64_BUILTIN_RSQRT ||
+      fcode == AARCH64_BUILTIN_RSQRTF)
+    return aarch64_expand_builtin_rsqrt (fcode, exp, target);
+
+  return NULL_RTX;
   gcc_unreachable ();
 }
 
@@ -1303,6 +1354,15 @@ aarch64_builtin_vectorized_function (tree fndecl, tree type_out, tree type_in)
   return NULL_TREE;
 }
 
+tree
+aarch64_builtin_rsqrt (bool is_float)
+{
+  if (is_float)
+    return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRTF];
+  else
+    return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT];
+}
+
 #undef VAR1
 #define VAR1(T, N, MAP, A) \
   case AARCH64_SIMD_BUILTIN_##T##_##N##A:
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 965a11b..4f1c8ce 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -270,6 +270,8 @@ void aarch64_print_operand (FILE *, rtx, char);
 void aarch64_print_operand_address (FILE *, rtx);
 void aarch64_emit_call_insn (rtx);
 
+void aarch64_emit_swrsqrt (rtx, rtx);
+
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index b90f938..266800a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -353,6 +353,33 @@
   [(set_attr "type" "neon_fp_mul_d_scalar_q")]
 )
 
+(define_insn "*rsqrte_simd"
+  [(set (match_operand:VALLF 0 "register_operand" "=w")
+	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")]
+		     UNSPEC_RSQRTE))]
+  "TARGET_SIMD"
+  "frsqrte\\t%<v>0<Vmtype>, %<v>1<Vmtype>"
+  [(set_attr "type" "neon_fp_rsqrte_<Vetype><q>")])
+
+(define_insn "*rsqrts_simd"
+  [(set (match_operand:VALLF 0 "register_operand" "=w")
+	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")
+               (match_operand:VALLF 2 "register_operand" "w")]
+		     UNSPEC_RSQRTS))]
+  "TARGET_SIMD"
+  "frsqrts\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>"
+  [(set_attr "type" "neon_fp_rsqrts_<Vetype><q>")])
+
+(define_expand "rsqrt<mode>"
+  [(set (match_operand:GPF 0 "register_operand" "=w")
+	(unspec:GPF [(match_operand:GPF 1 "register_operand" "w")]
+		     UNSPEC_RSQRT))]
+  "TARGET_FLOAT"
+{
+  aarch64_emit_swrsqrt (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*aarch64_mul3_elt_to_64v2df"
   [(set (match_operand:DF 0 "register_operand" "=w")
      (mult:DF
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index c3c2795..281f0b2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -6816,6 +6816,66 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
   return aarch64_tune_params->memmov_cost;
 }
 
+extern tree aarch64_builtin_rsqrt (bool is_float);
+
+static tree
+aarch64_builtin_reciprocal (unsigned int fn,
+                            bool md_fn ATTRIBUTE_UNUSED,
+                            bool sqrt ATTRIBUTE_UNUSED)
+{
+  if (!fast_math_flags_set_p (&global_options))
+    return NULL_TREE;
+
+  if (fn == BUILT_IN_SQRTF)
+    return aarch64_builtin_rsqrt (true);
+  else if (fn == BUILT_IN_SQRT)
+    return aarch64_builtin_rsqrt (false);
+  else
+    return NULL_TREE;
+}
+
+void
+aarch64_emit_swrsqrt (rtx dst, rtx src)
+{
+  enum machine_mode mode = GET_MODE (src);
+  rtx xsrc = gen_reg_rtx (mode);
+  emit_set_insn (xsrc, src);
+
+  rtx x0 = gen_reg_rtx (mode);
+  emit_insn (gen_rtx_SET (x0,
+                          gen_rtx_UNSPEC (mode, gen_rtvec (1, xsrc),
+                                          UNSPEC_RSQRTE)));
+
+  bool double_mode = (DFmode   == mode ||
+                      V1DFmode == mode ||
+                      V2DFmode == mode ||
+                      V4DFmode == mode ||
+                      V6DFmode == mode ||
+                      V8DFmode == mode);
+
+  int iterations = 2;
+  if (double_mode)
+    iterations = 3;
+
+  for (int i = 0; i < iterations; ++i)
+    {
+      rtx x1 = gen_reg_rtx (mode);
+      rtx x2 = gen_reg_rtx (mode);
+      rtx x3 = gen_reg_rtx (mode);
+      emit_insn (gen_rtx_SET (x2,
+                              gen_rtx_MULT (mode, x0, x0)));
+      emit_insn (gen_rtx_SET (x3,
+                              gen_rtx_UNSPEC (mode, gen_rtvec (2, xsrc, x2),
+                                              UNSPEC_RSQRTS)));
+      emit_insn (gen_rtx_SET (x1,
+                              gen_rtx_MULT (mode, x0, x3)));
+      x0 = x1;
+    }
+
+  emit_move_insn (dst, x0);
+  return;
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
@@ -11747,6 +11807,9 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
 #undef TARGET_USE_BLOCKS_FOR_CONSTANT_P
 #define TARGET_USE_BLOCKS_FOR_CONSTANT_P aarch64_use_blocks_for_constant_p
 
+#undef TARGET_BUILTIN_RECIPROCAL
+#define TARGET_BUILTIN_RECIPROCAL aarch64_builtin_reciprocal
+
 #undef TARGET_VECTOR_MODE_SUPPORTED_P
 #define TARGET_VECTOR_MODE_SUPPORTED_P aarch64_vector_mode_supported_p
 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 11123d6..7272d05 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -120,6 +120,9 @@
     UNSPEC_VSTRUCTDUMMY
     UNSPEC_SP_SET
     UNSPEC_SP_TEST
+    UNSPEC_RSQRT
+    UNSPEC_RSQRTE
+    UNSPEC_RSQRTS
 ])
 
 (define_c_enum "unspecv" [
diff --git a/gcc/testsuite/gcc.target/aarch64/rsqrt.c b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
new file mode 100644
index 0000000..6607483
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
@@ -0,0 +1,113 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -fno-inline --save-temps -ffast-math" } */
+
+#include <math.h>
+#include <stdio.h>
+
+#include <values.h>
+
+#define PI    3.141592653589793
+#define SQRT2 1.4142135623730951
+
+#define PI_4 0.7853981633974483
+#define SQRT1_2 0.7071067811865475
+
+/* 2^25+1, float has 24 significand bits
+ *       according to Single-precision floating-point format.  */
+#define TESTA8_FLT 33554433
+/* 2^54+1, double has 53 significand bits
+ *       according to Double-precision floating-point format.  */
+#define TESTA8_DBL 18014398509481985
+
+#define SD(a, b) t_double ((#a), (a), (b));
+#define SF(a, b) t_float ((#a), (a), (b));
+
+#define EPSILON_double __DBL_EPSILON__
+#define EPSILON_float __FLT_EPSILON__
+#define ABS_double fabs
+#define ABS_float fabsf
+#define SQRT_double sqrt
+#define SQRT_float sqrtf
+
+extern void abort (void);
+
+#define TESTTYPE(TYPE)                                                     \
+TYPE rsqrt_##TYPE (TYPE a)                                                 \
+{                                                                          \
+  return 1.0/SQRT_##TYPE(a);                                               \
+}                                                                          \
+                                                                           \
+int equals_##TYPE (TYPE a, TYPE b)                                         \
+{                                                                          \
+  return (a == b ||                                                        \
+          (isnan (a) && isnan (b)) ||                                      \
+          (ABS_##TYPE (a - b) < EPSILON_##TYPE));                          \
+}                                                                          \
+                                                                           \
+void t_##TYPE (const char *s, TYPE a, TYPE result)                         \
+{                                                                          \
+  TYPE r = rsqrt_##TYPE (a);                                               \
+  if (!equals_##TYPE (r, result))                                          \
+  {                                                                        \
+    abort ();                                                              \
+  }                                                                        \
+}                                                                          \
+
+//  printf ("Problem in %20s: %30.18A should be %30.18A\n", s, r, result); \
+
+TESTTYPE(double)
+TESTTYPE(float)
+
+/* { dg-final { scan-assembler-times "frsqrte\\td\[0-9\]+, d\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "frsqrts\\td\[0-9\]+, d\[0-9\]+, d\[0-9\]+" 3 } } */
+
+/* { dg-final { scan-assembler-times "frsqrte\\ts\[0-9\]+, s\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "frsqrts\\ts\[0-9\]+, s\[0-9\]+, s\[0-9\]+" 2 } } */
+
+int main ()
+{
+  SD(    1.0/256, 0X1.00000000000000P+4  );
+  SD(        1.0, 0X1.00000000000000P+0  );
+  SD(       -1.0,                     NAN);
+  SD(       11.0, 0X1.34BF63D1568260P-2  );
+  SD(        0.0,                INFINITY);
+  SD(   INFINITY, 0X0.00000000000000P+0  );
+  SD(        NAN,                     NAN);
+  SD(       -NAN,                    -NAN);
+  SD(    DBL_MAX, 0X1.00000000000010P-512);
+  SD(    DBL_MIN, 0X1.00000000000000P+511);
+  SD(         PI, 0X1.20DD750429B6D0P-1  );
+  SD(       PI_4, 0X1.20DD750429B6D0P+0  );
+  SD(      SQRT2, 0X1.AE89F995AD3AE0P-1  );
+  SD(    SQRT1_2, 0X1.306FE0A31B7150P+0  );
+  SD(        -PI,                     NAN);
+  SD(     -SQRT2,                     NAN);
+  SD( TESTA8_DBL, 0X1.00000000000000P-27 );
+
+  SF(    1.0/256, 0X1.00000000000000P+4  );
+  SF(        1.0, 0X1.00000000000000P+0  );
+  SF(       -1.0,                     NAN);
+  SF(       11.0, 0X1.34BF6400000000P-2  );
+  SF(        0.0,                INFINITY);
+  SF(   INFINITY, 0X0.00000000000000P+0  );
+  SF(        NAN,                     NAN);
+  SF(       -NAN,                    -NAN);
+  SF(    FLT_MAX, 0X1.00000200000000P-64 );
+  SF(    FLT_MIN, 0X1.00000000000000P+63 );
+  SF(         PI, 0X1.20DD7400000000P-1  );
+  SF(       PI_4, 0X1.20DD7400000000P+0  );
+  SF(      SQRT2, 0X1.AE89FA00000000P-1  );
+  SF(    SQRT1_2, 0X1.306FE000000000P+0  );
+  SF(        -PI,                     NAN);
+  SF(     -SQRT2,                     NAN);
+  SF( TESTA8_FLT, 0X1.6A09E600000000P-13 );
+
+//   With -ffast-math these return positive INF.
+//   SD(       -0.0,               -INFINITY);
+//   SF(       -0.0,               -INFINITY);
+//   The reason here is that -ffast-math flushes to zero.
+//   SD(DBL_MIN/256, 0X1.00000000000000P+515);
+//   SF(FLT_MIN/256, 0X1.00000000000000P+67 );
+
+  return 0;
+}
-- 
1.9.1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-18 11:57 [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math Benedikt Huber
  2015-06-18 12:03 ` [PATCH] 2015-06-15 Benedikt Huber <benedikt.huber@theobroma-systems.com> Benedikt Huber
@ 2015-06-18 12:36 ` Kumar, Venkataramanan
  2015-06-24 16:49 ` Evandro Menezes
  2015-06-25  7:03 ` pinskia
  3 siblings, 0 replies; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-06-18 12:36 UTC (permalink / raw)
  To: Benedikt Huber, gcc-patches
  Cc: philipp.tomsich, James Greenhalgh (james.greenhalgh@arm.com)

Hi, 

is there a plan to support  -mrecip=rsqrt for Aarch64?

Regards,
Venkat.

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> Sent: Thursday, June 18, 2015 5:34 PM
> To: gcc-patches@gcc.gnu.org
> Cc: benedikt.huber@theobroma-systems.com;
> philipp.tomsich@theobroma-systems.com
> Subject: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation and a
> Newton-Raphson step, respectively.
> There are ARMv8 implementations where this is faster than using fdiv and
> rsqrt.
> It runs three steps for double and two steps for float to achieve the needed
> precision.
> 
> There is one caveat and open question.
> Since -ffast-math enables flush to zero intermediate values between
> approximation steps will be flushed to zero if they are denormal.
> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
> The test cases pass, but it is unclear to me whether this is expected behavior
> with -ffast-math.
> 
> The patch applies to commit:
> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
> 
> Please consider including this patch.
> Thank you and best regards,
> Benedikt Huber
> 
> Benedikt Huber (1):
>   2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
> 
>  gcc/ChangeLog                            |   9 +++
>  gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>  gcc/config/aarch64/aarch64-protos.h      |   2 +
>  gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>  gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>  gcc/config/aarch64/aarch64.md            |   3 +
>  gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
> +++++++++++++++++++++++++++++++
>  7 files changed, 277 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
> 
> --
> 1.9.1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-18 11:57 [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math Benedikt Huber
  2015-06-18 12:03 ` [PATCH] 2015-06-15 Benedikt Huber <benedikt.huber@theobroma-systems.com> Benedikt Huber
  2015-06-18 12:36 ` [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math Kumar, Venkataramanan
@ 2015-06-24 16:49 ` Evandro Menezes
  2015-06-24 16:55   ` Dr. Philipp Tomsich
  2015-06-25  7:03 ` pinskia
  3 siblings, 1 reply; 36+ messages in thread
From: Evandro Menezes @ 2015-06-24 16:49 UTC (permalink / raw)
  To: 'Benedikt Huber', gcc-patches; +Cc: philipp.tomsich

[-- Attachment #1: Type: text/plain, Size: 2283 bytes --]

Benedikt,

You beat me to it! :-)  Do you have the implementation for dividing using
the Newton series as well?

I'm not sure that the series is always for all data types and on all
processors.  It would be useful to allow each AArch64 processor to enable
this or not depending on the data type.  BTW, do you have some tests showing
the speed up?

Thank you,

-- 
Evandro Menezes                              Austin, TX

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org]
On
> Behalf Of Benedikt Huber
> Sent: Thursday, June 18, 2015 7:04
> To: gcc-patches@gcc.gnu.org
> Cc: benedikt.huber@theobroma-systems.com; philipp.tomsich@theobroma-
> systems.com
> Subject: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation
and
> a Newton-Raphson step, respectively.
> There are ARMv8 implementations where this is faster than using fdiv and
> rsqrt.
> It runs three steps for double and two steps for float to achieve the
needed
> precision.
> 
> There is one caveat and open question.
> Since -ffast-math enables flush to zero intermediate values between
> approximation steps will be flushed to zero if they are denormal.
> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
> The test cases pass, but it is unclear to me whether this is expected
> behavior with -ffast-math.
> 
> The patch applies to commit:
> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
> 
> Please consider including this patch.
> Thank you and best regards,
> Benedikt Huber
> 
> Benedikt Huber (1):
>   2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
> 
>  gcc/ChangeLog                            |   9 +++
>  gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>  gcc/config/aarch64/aarch64-protos.h      |   2 +
>  gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>  gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>  gcc/config/aarch64/aarch64.md            |   3 +
>  gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
> +++++++++++++++++++++++++++++++
>  7 files changed, 277 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
> 
> --
> 1.9.1

[-- Attachment #2: Type: message/rfc822, Size: 16261 bytes --]

From: "Benedikt Huber" <benedikt.huber@theobroma-systems.com>
To: <gcc-patches@gcc.gnu.org>
Cc: <benedikt.huber@theobroma-systems.com>, <philipp.tomsich@theobroma-systems.com>
Subject: [PATCH] 2015-06-15  Benedikt Huber <benedikt.huber@theobroma-systems.com>
Date: Thu, 18 Jun 2015 07:04:05 -0500
Message-ID: <1434629045-24650-2-git-send-email-benedikt.huber@theobroma-systems.com>

       * config/aarch64/aarch64-builtins.c: Builtins for rsqrt and rsqrtf.
       * config/aarch64/aarch64-protos.h: Declare.
       * config/aarch64/aarch64-simd.md: Matching expressions for frsqrte
and frsqrts.
       * config/aarch64/aarch64.c: New functions. Emit rsqrt estimation code
in fast math mode.
       * config/aarch64/aarch64.md: Added enum entry.
       * testsuite/gcc.target/aarch64/rsqrt.c: Tests for single and double.
---
 gcc/ChangeLog                            |   9 +++
 gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
 gcc/config/aarch64/aarch64-protos.h      |   2 +
 gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
 gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
 gcc/config/aarch64/aarch64.md            |   3 +
 gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
+++++++++++++++++++++++++++++++
 7 files changed, 277 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index c9b156f..690ebba 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,12 @@
+2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
+
+	* config/aarch64/aarch64-builtins.c: Builtins for rsqrt and rsqrtf.
+	* config/aarch64/aarch64-protos.h: Declare.
+	* config/aarch64/aarch64-simd.md: Matching expressions for frsqrte
and frsqrts.
+	* config/aarch64/aarch64.c: New functions. Emit rsqrt estimation
code in fast math mode.
+	* config/aarch64/aarch64.md: Added enum entry.
+	* testsuite/gcc.target/aarch64/rsqrt.c: Tests for single and double.
+
 2015-06-14  Richard Sandiford  <richard.sandiford@arm.com>
 
 	* rtl.h (classify_insn): Declare.
diff --git a/gcc/config/aarch64/aarch64-builtins.c
b/gcc/config/aarch64/aarch64-builtins.c
index f7a39ec..484bb84 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -342,6 +342,8 @@ enum aarch64_builtins
   AARCH64_BUILTIN_GET_FPSR,
   AARCH64_BUILTIN_SET_FPSR,
 
+  AARCH64_BUILTIN_RSQRT,
+  AARCH64_BUILTIN_RSQRTF,
   AARCH64_SIMD_BUILTIN_BASE,
   AARCH64_SIMD_BUILTIN_LANE_CHECK,
 #include "aarch64-simd-builtins.def"
@@ -831,6 +833,32 @@ aarch64_init_crc32_builtins ()
 }
 
 void
+aarch64_add_builtin_rsqrt (void)
+{
+  tree fndecl = NULL;
+  tree ftype = NULL;
+  ftype = build_function_type_list (double_type_node, double_type_node,
NULL_TREE);
+
+  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt",
+                                 ftype,
+                                 AARCH64_BUILTIN_RSQRT,
+                                 BUILT_IN_MD,
+                                 NULL,
+                                 NULL_TREE);
+  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT] = fndecl;
+
+  tree ftypef = NULL;
+  ftypef = build_function_type_list (float_type_node, float_type_node,
NULL_TREE);
+  fndecl = add_builtin_function ("__builtin_aarch64_rsqrtf",
+                                 ftypef,
+                                 AARCH64_BUILTIN_RSQRTF,
+                                 BUILT_IN_MD,
+                                 NULL,
+                                 NULL_TREE);
+  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRTF] = fndecl;
+}
+
+void
 aarch64_init_builtins (void)
 {
   tree ftype_set_fpr
@@ -855,6 +883,7 @@ aarch64_init_builtins (void)
     aarch64_init_simd_builtins ();
   if (TARGET_CRC32)
     aarch64_init_crc32_builtins ();
+  aarch64_add_builtin_rsqrt ();
 }
 
 tree
@@ -1099,6 +1128,23 @@ aarch64_crc32_expand_builtin (int fcode, tree exp,
rtx target)
   return target;
 }
 
+static rtx
+aarch64_expand_builtin_rsqrt (int fcode, tree exp, rtx target)
+{
+  rtx pat;
+  tree arg0 = CALL_EXPR_ARG (exp, 0);
+  rtx op0 = expand_normal (arg0);
+
+  enum insn_code c = CODE_FOR_rsqrtdf;
+  if (fcode == AARCH64_BUILTIN_RSQRTF)
+    c = CODE_FOR_rsqrtsf;
+
+  pat = GEN_FCN (c) (target, op0);
+  emit_insn (pat);
+
+  return target;
+}
+
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient.  */
 rtx
@@ -1146,6 +1192,11 @@ aarch64_expand_builtin (tree exp,
   else if (fcode >= AARCH64_CRC32_BUILTIN_BASE && fcode <=
AARCH64_CRC32_BUILTIN_MAX)
     return aarch64_crc32_expand_builtin (fcode, exp, target);
 
+  if (fcode == AARCH64_BUILTIN_RSQRT ||
+      fcode == AARCH64_BUILTIN_RSQRTF)
+    return aarch64_expand_builtin_rsqrt (fcode, exp, target);
+
+  return NULL_RTX;
   gcc_unreachable ();
 }
 
@@ -1303,6 +1354,15 @@ aarch64_builtin_vectorized_function (tree fndecl,
tree type_out, tree type_in)
   return NULL_TREE;
 }
 
+tree
+aarch64_builtin_rsqrt (bool is_float)
+{
+  if (is_float)
+    return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRTF];
+  else
+    return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT];
+}
+
 #undef VAR1
 #define VAR1(T, N, MAP, A) \
   case AARCH64_SIMD_BUILTIN_##T##_##N##A:
diff --git a/gcc/config/aarch64/aarch64-protos.h
b/gcc/config/aarch64/aarch64-protos.h
index 965a11b..4f1c8ce 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -270,6 +270,8 @@ void aarch64_print_operand (FILE *, rtx, char);
 void aarch64_print_operand_address (FILE *, rtx);
 void aarch64_emit_call_insn (rtx);
 
+void aarch64_emit_swrsqrt (rtx, rtx);
+
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
 
diff --git a/gcc/config/aarch64/aarch64-simd.md
b/gcc/config/aarch64/aarch64-simd.md
index b90f938..266800a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -353,6 +353,33 @@
   [(set_attr "type" "neon_fp_mul_d_scalar_q")]
 )
 
+(define_insn "*rsqrte_simd"
+  [(set (match_operand:VALLF 0 "register_operand" "=w")
+	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")]
+		     UNSPEC_RSQRTE))]
+  "TARGET_SIMD"
+  "frsqrte\\t%<v>0<Vmtype>, %<v>1<Vmtype>"
+  [(set_attr "type" "neon_fp_rsqrte_<Vetype><q>")])
+
+(define_insn "*rsqrts_simd"
+  [(set (match_operand:VALLF 0 "register_operand" "=w")
+	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")
+               (match_operand:VALLF 2 "register_operand" "w")]
+		     UNSPEC_RSQRTS))]
+  "TARGET_SIMD"
+  "frsqrts\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>"
+  [(set_attr "type" "neon_fp_rsqrts_<Vetype><q>")])
+
+(define_expand "rsqrt<mode>"
+  [(set (match_operand:GPF 0 "register_operand" "=w")
+	(unspec:GPF [(match_operand:GPF 1 "register_operand" "w")]
+		     UNSPEC_RSQRT))]
+  "TARGET_FLOAT"
+{
+  aarch64_emit_swrsqrt (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*aarch64_mul3_elt_to_64v2df"
   [(set (match_operand:DF 0 "register_operand" "=w")
      (mult:DF
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index c3c2795..281f0b2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -6816,6 +6816,66 @@ aarch64_memory_move_cost (machine_mode mode
ATTRIBUTE_UNUSED,
   return aarch64_tune_params->memmov_cost;
 }
 
+extern tree aarch64_builtin_rsqrt (bool is_float);
+
+static tree
+aarch64_builtin_reciprocal (unsigned int fn,
+                            bool md_fn ATTRIBUTE_UNUSED,
+                            bool sqrt ATTRIBUTE_UNUSED)
+{
+  if (!fast_math_flags_set_p (&global_options))
+    return NULL_TREE;
+
+  if (fn == BUILT_IN_SQRTF)
+    return aarch64_builtin_rsqrt (true);
+  else if (fn == BUILT_IN_SQRT)
+    return aarch64_builtin_rsqrt (false);
+  else
+    return NULL_TREE;
+}
+
+void
+aarch64_emit_swrsqrt (rtx dst, rtx src)
+{
+  enum machine_mode mode = GET_MODE (src);
+  rtx xsrc = gen_reg_rtx (mode);
+  emit_set_insn (xsrc, src);
+
+  rtx x0 = gen_reg_rtx (mode);
+  emit_insn (gen_rtx_SET (x0,
+                          gen_rtx_UNSPEC (mode, gen_rtvec (1, xsrc),
+                                          UNSPEC_RSQRTE)));
+
+  bool double_mode = (DFmode   == mode ||
+                      V1DFmode == mode ||
+                      V2DFmode == mode ||
+                      V4DFmode == mode ||
+                      V6DFmode == mode ||
+                      V8DFmode == mode);
+
+  int iterations = 2;
+  if (double_mode)
+    iterations = 3;
+
+  for (int i = 0; i < iterations; ++i)
+    {
+      rtx x1 = gen_reg_rtx (mode);
+      rtx x2 = gen_reg_rtx (mode);
+      rtx x3 = gen_reg_rtx (mode);
+      emit_insn (gen_rtx_SET (x2,
+                              gen_rtx_MULT (mode, x0, x0)));
+      emit_insn (gen_rtx_SET (x3,
+                              gen_rtx_UNSPEC (mode, gen_rtvec (2, xsrc,
x2),
+                                              UNSPEC_RSQRTS)));
+      emit_insn (gen_rtx_SET (x1,
+                              gen_rtx_MULT (mode, x0, x3)));
+      x0 = x1;
+    }
+
+  emit_move_insn (dst, x0);
+  return;
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
@@ -11747,6 +11807,9 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool
load,
 #undef TARGET_USE_BLOCKS_FOR_CONSTANT_P
 #define TARGET_USE_BLOCKS_FOR_CONSTANT_P aarch64_use_blocks_for_constant_p
 
+#undef TARGET_BUILTIN_RECIPROCAL
+#define TARGET_BUILTIN_RECIPROCAL aarch64_builtin_reciprocal
+
 #undef TARGET_VECTOR_MODE_SUPPORTED_P
 #define TARGET_VECTOR_MODE_SUPPORTED_P aarch64_vector_mode_supported_p
 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 11123d6..7272d05 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -120,6 +120,9 @@
     UNSPEC_VSTRUCTDUMMY
     UNSPEC_SP_SET
     UNSPEC_SP_TEST
+    UNSPEC_RSQRT
+    UNSPEC_RSQRTE
+    UNSPEC_RSQRTS
 ])
 
 (define_c_enum "unspecv" [
diff --git a/gcc/testsuite/gcc.target/aarch64/rsqrt.c
b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
new file mode 100644
index 0000000..6607483
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
@@ -0,0 +1,113 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -fno-inline --save-temps -ffast-math" } */
+
+#include <math.h>
+#include <stdio.h>
+
+#include <values.h>
+
+#define PI    3.141592653589793
+#define SQRT2 1.4142135623730951
+
+#define PI_4 0.7853981633974483
+#define SQRT1_2 0.7071067811865475
+
+/* 2^25+1, float has 24 significand bits
+ *       according to Single-precision floating-point format.  */
+#define TESTA8_FLT 33554433
+/* 2^54+1, double has 53 significand bits
+ *       according to Double-precision floating-point format.  */
+#define TESTA8_DBL 18014398509481985
+
+#define SD(a, b) t_double ((#a), (a), (b));
+#define SF(a, b) t_float ((#a), (a), (b));
+
+#define EPSILON_double __DBL_EPSILON__
+#define EPSILON_float __FLT_EPSILON__
+#define ABS_double fabs
+#define ABS_float fabsf
+#define SQRT_double sqrt
+#define SQRT_float sqrtf
+
+extern void abort (void);
+
+#define TESTTYPE(TYPE)
\
+TYPE rsqrt_##TYPE (TYPE a)
\
+{
\
+  return 1.0/SQRT_##TYPE(a);
\
+}
\
+
\
+int equals_##TYPE (TYPE a, TYPE b)
\
+{
\
+  return (a == b ||
\
+          (isnan (a) && isnan (b)) ||
\
+          (ABS_##TYPE (a - b) < EPSILON_##TYPE));
\
+}
\
+
\
+void t_##TYPE (const char *s, TYPE a, TYPE result)
\
+{
\
+  TYPE r = rsqrt_##TYPE (a);
\
+  if (!equals_##TYPE (r, result))
\
+  {
\
+    abort ();
\
+  }
\
+}
\
+
+//  printf ("Problem in %20s: %30.18A should be %30.18A\n", s, r, result);
\
+
+TESTTYPE(double)
+TESTTYPE(float)
+
+/* { dg-final { scan-assembler-times "frsqrte\\td\[0-9\]+, d\[0-9\]+" 1 } }
*/
+/* { dg-final { scan-assembler-times "frsqrts\\td\[0-9\]+, d\[0-9\]+,
d\[0-9\]+" 3 } } */
+
+/* { dg-final { scan-assembler-times "frsqrte\\ts\[0-9\]+, s\[0-9\]+" 1 } }
*/
+/* { dg-final { scan-assembler-times "frsqrts\\ts\[0-9\]+, s\[0-9\]+,
s\[0-9\]+" 2 } } */
+
+int main ()
+{
+  SD(    1.0/256, 0X1.00000000000000P+4  );
+  SD(        1.0, 0X1.00000000000000P+0  );
+  SD(       -1.0,                     NAN);
+  SD(       11.0, 0X1.34BF63D1568260P-2  );
+  SD(        0.0,                INFINITY);
+  SD(   INFINITY, 0X0.00000000000000P+0  );
+  SD(        NAN,                     NAN);
+  SD(       -NAN,                    -NAN);
+  SD(    DBL_MAX, 0X1.00000000000010P-512);
+  SD(    DBL_MIN, 0X1.00000000000000P+511);
+  SD(         PI, 0X1.20DD750429B6D0P-1  );
+  SD(       PI_4, 0X1.20DD750429B6D0P+0  );
+  SD(      SQRT2, 0X1.AE89F995AD3AE0P-1  );
+  SD(    SQRT1_2, 0X1.306FE0A31B7150P+0  );
+  SD(        -PI,                     NAN);
+  SD(     -SQRT2,                     NAN);
+  SD( TESTA8_DBL, 0X1.00000000000000P-27 );
+
+  SF(    1.0/256, 0X1.00000000000000P+4  );
+  SF(        1.0, 0X1.00000000000000P+0  );
+  SF(       -1.0,                     NAN);
+  SF(       11.0, 0X1.34BF6400000000P-2  );
+  SF(        0.0,                INFINITY);
+  SF(   INFINITY, 0X0.00000000000000P+0  );
+  SF(        NAN,                     NAN);
+  SF(       -NAN,                    -NAN);
+  SF(    FLT_MAX, 0X1.00000200000000P-64 );
+  SF(    FLT_MIN, 0X1.00000000000000P+63 );
+  SF(         PI, 0X1.20DD7400000000P-1  );
+  SF(       PI_4, 0X1.20DD7400000000P+0  );
+  SF(      SQRT2, 0X1.AE89FA00000000P-1  );
+  SF(    SQRT1_2, 0X1.306FE000000000P+0  );
+  SF(        -PI,                     NAN);
+  SF(     -SQRT2,                     NAN);
+  SF( TESTA8_FLT, 0X1.6A09E600000000P-13 );
+
+//   With -ffast-math these return positive INF.
+//   SD(       -0.0,               -INFINITY);
+//   SF(       -0.0,               -INFINITY);
+//   The reason here is that -ffast-math flushes to zero.
+//   SD(DBL_MIN/256, 0X1.00000000000000P+515);
+//   SF(FLT_MIN/256, 0X1.00000000000000P+67 );
+
+  return 0;
+}
-- 
1.9.1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-24 16:49 ` Evandro Menezes
@ 2015-06-24 16:55   ` Dr. Philipp Tomsich
  2015-06-24 17:16     ` Benedikt Huber
  2015-06-25  7:01     ` Kumar, Venkataramanan
  0 siblings, 2 replies; 36+ messages in thread
From: Dr. Philipp Tomsich @ 2015-06-24 16:55 UTC (permalink / raw)
  To: Evandro Menezes; +Cc: Benedikt Huber, gcc-patches

Evandro,

We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal sqrt.

Also, the “reciprocal divide” patches are floating around in various of our git-tree, but 
aren’t ready for public consumption, yet… I’ll leave Benedikt to comment on potential 
timelines for getting that pushed out.

Best,
Philipp.

> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
> 
> Benedikt,
> 
> You beat me to it! :-)  Do you have the implementation for dividing using
> the Newton series as well?
> 
> I'm not sure that the series is always for all data types and on all
> processors.  It would be useful to allow each AArch64 processor to enable
> this or not depending on the data type.  BTW, do you have some tests showing
> the speed up?
> 
> Thank you,
> 
> -- 
> Evandro Menezes                              Austin, TX
> 
>> -----Original Message-----
>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org]
> On
>> Behalf Of Benedikt Huber
>> Sent: Thursday, June 18, 2015 7:04
>> To: gcc-patches@gcc.gnu.org
>> Cc: benedikt.huber@theobroma-systems.com; philipp.tomsich@theobroma-
>> systems.com
>> Subject: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation
> and
>> a Newton-Raphson step, respectively.
>> There are ARMv8 implementations where this is faster than using fdiv and
>> rsqrt.
>> It runs three steps for double and two steps for float to achieve the
> needed
>> precision.
>> 
>> There is one caveat and open question.
>> Since -ffast-math enables flush to zero intermediate values between
>> approximation steps will be flushed to zero if they are denormal.
>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>> The test cases pass, but it is unclear to me whether this is expected
>> behavior with -ffast-math.
>> 
>> The patch applies to commit:
>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>> 
>> Please consider including this patch.
>> Thank you and best regards,
>> Benedikt Huber
>> 
>> Benedikt Huber (1):
>>  2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>> 
>> gcc/ChangeLog                            |   9 +++
>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>> gcc/config/aarch64/aarch64.md            |   3 +
>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>> +++++++++++++++++++++++++++++++
>> 7 files changed, 277 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>> 
>> --
>> 1.9.1
> <Mail Attachment.eml>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-24 16:55   ` Dr. Philipp Tomsich
@ 2015-06-24 17:16     ` Benedikt Huber
  2015-06-24 18:37       ` Evandro Menezes
  2015-06-25  7:01     ` Kumar, Venkataramanan
  1 sibling, 1 reply; 36+ messages in thread
From: Benedikt Huber @ 2015-06-24 17:16 UTC (permalink / raw)
  To: Dr. Philipp Tomsich; +Cc: Evandro Menezes, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3392 bytes --]

Evandro,

Yes, we also have the 1/x approximation.
However we do not have the test cases yet, and it also would need some clean up.
I am going to provide a patch for that soon (say next week).
Also, for this optimization we have *not* yet found a benchmark with significant improvements.

Best Regards,
Benedikt


> On 24 Jun 2015, at 18:52, Dr. Philipp Tomsich <philipp.tomsich@theobroma-systems.com> wrote:
> 
> Evandro,
> 
> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal sqrt.
> 
> Also, the “reciprocal divide” patches are floating around in various of our git-tree, but
> aren’t ready for public consumption, yet… I’ll leave Benedikt to comment on potential
> timelines for getting that pushed out.
> 
> Best,
> Philipp.
> 
>> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
>> 
>> Benedikt,
>> 
>> You beat me to it! :-)  Do you have the implementation for dividing using
>> the Newton series as well?
>> 
>> I'm not sure that the series is always for all data types and on all
>> processors.  It would be useful to allow each AArch64 processor to enable
>> this or not depending on the data type.  BTW, do you have some tests showing
>> the speed up?
>> 
>> Thank you,
>> 
>> --
>> Evandro Menezes                              Austin, TX
>> 
>>> -----Original Message-----
>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org]
>> On
>>> Behalf Of Benedikt Huber
>>> Sent: Thursday, June 18, 2015 7:04
>>> To: gcc-patches@gcc.gnu.org
>>> Cc: benedikt.huber@theobroma-systems.com; philipp.tomsich@theobroma-
>>> systems.com
>>> Subject: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>>> estimation in -ffast-math
>>> 
>>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation
>> and
>>> a Newton-Raphson step, respectively.
>>> There are ARMv8 implementations where this is faster than using fdiv and
>>> rsqrt.
>>> It runs three steps for double and two steps for float to achieve the
>> needed
>>> precision.
>>> 
>>> There is one caveat and open question.
>>> Since -ffast-math enables flush to zero intermediate values between
>>> approximation steps will be flushed to zero if they are denormal.
>>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>>> The test cases pass, but it is unclear to me whether this is expected
>>> behavior with -ffast-math.
>>> 
>>> The patch applies to commit:
>>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>>> 
>>> Please consider including this patch.
>>> Thank you and best regards,
>>> Benedikt Huber
>>> 
>>> Benedikt Huber (1):
>>> 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>>> 
>>> gcc/ChangeLog                            |   9 +++
>>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>>> gcc/config/aarch64/aarch64.md            |   3 +
>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>>> +++++++++++++++++++++++++++++++
>>> 7 files changed, 277 insertions(+)
>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>> 
>>> --
>>> 1.9.1
>> <Mail Attachment.eml>
> 


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-24 17:16     ` Benedikt Huber
@ 2015-06-24 18:37       ` Evandro Menezes
  2015-06-24 20:11         ` Dr. Philipp Tomsich
  0 siblings, 1 reply; 36+ messages in thread
From: Evandro Menezes @ 2015-06-24 18:37 UTC (permalink / raw)
  To: 'Benedikt Huber', 'Dr. Philipp Tomsich'; +Cc: gcc-patches

Benedikt,

Are you developing the reciprocal approximation just for 1/x proper or for any division, as in x/y = x * 1/y?

Thank you,

-- 
Evandro Menezes                              Austin, TX


> -----Original Message-----
> From: Benedikt Huber [mailto:benedikt.huber@theobroma-systems.com]
> Sent: Wednesday, June 24, 2015 12:11
> To: Dr. Philipp Tomsich
> Cc: Evandro Menezes; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Evandro,
> 
> Yes, we also have the 1/x approximation.
> However we do not have the test cases yet, and it also would need some clean
> up.
> I am going to provide a patch for that soon (say next week).
> Also, for this optimization we have *not* yet found a benchmark with
> significant improvements.
> 
> Best Regards,
> Benedikt
> 
> 
> > On 24 Jun 2015, at 18:52, Dr. Philipp Tomsich <philipp.tomsich@theobroma-
> systems.com> wrote:
> >
> > Evandro,
> >
> > We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal
> sqrt.
> >
> > Also, the “reciprocal divide” patches are floating around in various
> > of our git-tree, but aren’t ready for public consumption, yet… I’ll
> > leave Benedikt to comment on potential timelines for getting that pushed
> out.
> >
> > Best,
> > Philipp.
> >
> >> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
> >>
> >> Benedikt,
> >>
> >> You beat me to it! :-)  Do you have the implementation for dividing
> >> using the Newton series as well?
> >>
> >> I'm not sure that the series is always for all data types and on all
> >> processors.  It would be useful to allow each AArch64 processor to
> >> enable this or not depending on the data type.  BTW, do you have some
> >> tests showing the speed up?
> >>
> >> Thank you,
> >>
> >> --
> >> Evandro Menezes                              Austin, TX
> >>
> >>> -----Original Message-----
> >>> From: gcc-patches-owner@gcc.gnu.org
> >>> [mailto:gcc-patches-owner@gcc.gnu.org]
> >> On
> >>> Behalf Of Benedikt Huber
> >>> Sent: Thursday, June 18, 2015 7:04
> >>> To: gcc-patches@gcc.gnu.org
> >>> Cc: benedikt.huber@theobroma-systems.com; philipp.tomsich@theobroma-
> >>> systems.com
> >>> Subject: [PATCH] [aarch64] Implemented reciprocal square root
> >>> (rsqrt) estimation in -ffast-math
> >>>
> >>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt
> >>> estimation
> >> and
> >>> a Newton-Raphson step, respectively.
> >>> There are ARMv8 implementations where this is faster than using fdiv
> >>> and rsqrt.
> >>> It runs three steps for double and two steps for float to achieve
> >>> the
> >> needed
> >>> precision.
> >>>
> >>> There is one caveat and open question.
> >>> Since -ffast-math enables flush to zero intermediate values between
> >>> approximation steps will be flushed to zero if they are denormal.
> >>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
> >>> The test cases pass, but it is unclear to me whether this is
> >>> expected behavior with -ffast-math.
> >>>
> >>> The patch applies to commit:
> >>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
> >>>
> >>> Please consider including this patch.
> >>> Thank you and best regards,
> >>> Benedikt Huber
> >>>
> >>> Benedikt Huber (1):
> >>> 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
> >>>
> >>> gcc/ChangeLog                            |   9 +++
> >>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
> >>> gcc/config/aarch64/aarch64-protos.h      |   2 +
> >>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
> >>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
> >>> gcc/config/aarch64/aarch64.md            |   3 +
> >>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
> >>> +++++++++++++++++++++++++++++++
> >>> 7 files changed, 277 insertions(+)
> >>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
> >>>
> >>> --
> >>> 1.9.1
> >> <Mail Attachment.eml>
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-24 18:37       ` Evandro Menezes
@ 2015-06-24 20:11         ` Dr. Philipp Tomsich
  2015-06-24 20:54           ` Evandro Menezes
  0 siblings, 1 reply; 36+ messages in thread
From: Dr. Philipp Tomsich @ 2015-06-24 20:11 UTC (permalink / raw)
  To: Evandro Menezes; +Cc: Benedikt Huber, gcc-patches

Evandro,

Shouldn't ‘execute_cse_reciprocals_1’ take care of this, once the reciprocal-division is implemented?
Do you think there’s additional work needed to catch all cases/opportunities?

Best,
Philipp.

> On 24 Jun 2015, at 20:19, Evandro Menezes <e.menezes@samsung.com> wrote:
> 
> Benedikt,
> 
> Are you developing the reciprocal approximation just for 1/x proper or for any division, as in x/y = x * 1/y?
> 
> Thank you,
> 
> -- 
> Evandro Menezes                              Austin, TX
> 
> 
>> -----Original Message-----
>> From: Benedikt Huber [mailto:benedikt.huber@theobroma-systems.com]
>> Sent: Wednesday, June 24, 2015 12:11
>> To: Dr. Philipp Tomsich
>> Cc: Evandro Menezes; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> Evandro,
>> 
>> Yes, we also have the 1/x approximation.
>> However we do not have the test cases yet, and it also would need some clean
>> up.
>> I am going to provide a patch for that soon (say next week).
>> Also, for this optimization we have *not* yet found a benchmark with
>> significant improvements.
>> 
>> Best Regards,
>> Benedikt
>> 
>> 
>>> On 24 Jun 2015, at 18:52, Dr. Philipp Tomsich <philipp.tomsich@theobroma-
>> systems.com> wrote:
>>> 
>>> Evandro,
>>> 
>>> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal
>> sqrt.
>>> 
>>> Also, the “reciprocal divide” patches are floating around in various
>>> of our git-tree, but aren’t ready for public consumption, yet… I’ll
>>> leave Benedikt to comment on potential timelines for getting that pushed
>> out.
>>> 
>>> Best,
>>> Philipp.
>>> 
>>>> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
>>>> 
>>>> Benedikt,
>>>> 
>>>> You beat me to it! :-)  Do you have the implementation for dividing
>>>> using the Newton series as well?
>>>> 
>>>> I'm not sure that the series is always for all data types and on all
>>>> processors.  It would be useful to allow each AArch64 processor to
>>>> enable this or not depending on the data type.  BTW, do you have some
>>>> tests showing the speed up?
>>>> 
>>>> Thank you,
>>>> 
>>>> --
>>>> Evandro Menezes                              Austin, TX
>>>> 
>>>>> -----Original Message-----
>>>>> From: gcc-patches-owner@gcc.gnu.org
>>>>> [mailto:gcc-patches-owner@gcc.gnu.org]
>>>> On
>>>>> Behalf Of Benedikt Huber
>>>>> Sent: Thursday, June 18, 2015 7:04
>>>>> To: gcc-patches@gcc.gnu.org
>>>>> Cc: benedikt.huber@theobroma-systems.com; philipp.tomsich@theobroma-
>>>>> systems.com
>>>>> Subject: [PATCH] [aarch64] Implemented reciprocal square root
>>>>> (rsqrt) estimation in -ffast-math
>>>>> 
>>>>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt
>>>>> estimation
>>>> and
>>>>> a Newton-Raphson step, respectively.
>>>>> There are ARMv8 implementations where this is faster than using fdiv
>>>>> and rsqrt.
>>>>> It runs three steps for double and two steps for float to achieve
>>>>> the
>>>> needed
>>>>> precision.
>>>>> 
>>>>> There is one caveat and open question.
>>>>> Since -ffast-math enables flush to zero intermediate values between
>>>>> approximation steps will be flushed to zero if they are denormal.
>>>>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>>>>> The test cases pass, but it is unclear to me whether this is
>>>>> expected behavior with -ffast-math.
>>>>> 
>>>>> The patch applies to commit:
>>>>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>>>>> 
>>>>> Please consider including this patch.
>>>>> Thank you and best regards,
>>>>> Benedikt Huber
>>>>> 
>>>>> Benedikt Huber (1):
>>>>> 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>>>>> 
>>>>> gcc/ChangeLog                            |   9 +++
>>>>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>>>>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>>>>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>>>>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>>>>> gcc/config/aarch64/aarch64.md            |   3 +
>>>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>>>>> +++++++++++++++++++++++++++++++
>>>>> 7 files changed, 277 insertions(+)
>>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>>>> 
>>>>> --
>>>>> 1.9.1
>>>> <Mail Attachment.eml>
>>> 
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-24 20:11         ` Dr. Philipp Tomsich
@ 2015-06-24 20:54           ` Evandro Menezes
  2015-06-25 11:52             ` Benedikt Huber
  0 siblings, 1 reply; 36+ messages in thread
From: Evandro Menezes @ 2015-06-24 20:54 UTC (permalink / raw)
  To: 'Dr. Philipp Tomsich'; +Cc: 'Benedikt Huber', gcc-patches

Philipp,

I think that execute_cse_reciprocals_1() applies only when the denominator is known at compile-time, otherwise the division stays.  It doesn't seem to know whether the target supports the approximate reciprocal or not.

Cheers,

-- 
Evandro Menezes                              Austin, TX


> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Wednesday, June 24, 2015 15:08
> To: Evandro Menezes
> Cc: Benedikt Huber; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Evandro,
> 
> Shouldn't ‘execute_cse_reciprocals_1’ take care of this, once the reciprocal-
> division is implemented?
> Do you think there’s additional work needed to catch all cases/opportunities?
> 
> Best,
> Philipp.
> 
> > On 24 Jun 2015, at 20:19, Evandro Menezes <e.menezes@samsung.com> wrote:
> >
> > Benedikt,
> >
> > Are you developing the reciprocal approximation just for 1/x proper or for
> any division, as in x/y = x * 1/y?
> >
> > Thank you,
> >
> > --
> > Evandro Menezes                              Austin, TX
> >
> >
> >> -----Original Message-----
> >> From: Benedikt Huber [mailto:benedikt.huber@theobroma-systems.com]
> >> Sent: Wednesday, June 24, 2015 12:11
> >> To: Dr. Philipp Tomsich
> >> Cc: Evandro Menezes; gcc-patches@gcc.gnu.org
> >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >> (rsqrt) estimation in -ffast-math
> >>
> >> Evandro,
> >>
> >> Yes, we also have the 1/x approximation.
> >> However we do not have the test cases yet, and it also would need
> >> some clean up.
> >> I am going to provide a patch for that soon (say next week).
> >> Also, for this optimization we have *not* yet found a benchmark with
> >> significant improvements.
> >>
> >> Best Regards,
> >> Benedikt
> >>
> >>
> >>> On 24 Jun 2015, at 18:52, Dr. Philipp Tomsich
> >>> <philipp.tomsich@theobroma-
> >> systems.com> wrote:
> >>>
> >>> Evandro,
> >>>
> >>> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar)
> >>> reciprocal
> >> sqrt.
> >>>
> >>> Also, the “reciprocal divide” patches are floating around in various
> >>> of our git-tree, but aren’t ready for public consumption, yet… I’ll
> >>> leave Benedikt to comment on potential timelines for getting that
> >>> pushed
> >> out.
> >>>
> >>> Best,
> >>> Philipp.
> >>>
> >>>> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
> >>>>
> >>>> Benedikt,
> >>>>
> >>>> You beat me to it! :-)  Do you have the implementation for dividing
> >>>> using the Newton series as well?
> >>>>
> >>>> I'm not sure that the series is always for all data types and on
> >>>> all processors.  It would be useful to allow each AArch64 processor
> >>>> to enable this or not depending on the data type.  BTW, do you have
> >>>> some tests showing the speed up?
> >>>>
> >>>> Thank you,
> >>>>
> >>>> --
> >>>> Evandro Menezes                              Austin, TX
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: gcc-patches-owner@gcc.gnu.org
> >>>>> [mailto:gcc-patches-owner@gcc.gnu.org]
> >>>> On
> >>>>> Behalf Of Benedikt Huber
> >>>>> Sent: Thursday, June 18, 2015 7:04
> >>>>> To: gcc-patches@gcc.gnu.org
> >>>>> Cc: benedikt.huber@theobroma-systems.com;
> >>>>> philipp.tomsich@theobroma- systems.com
> >>>>> Subject: [PATCH] [aarch64] Implemented reciprocal square root
> >>>>> (rsqrt) estimation in -ffast-math
> >>>>>
> >>>>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt
> >>>>> estimation
> >>>> and
> >>>>> a Newton-Raphson step, respectively.
> >>>>> There are ARMv8 implementations where this is faster than using
> >>>>> fdiv and rsqrt.
> >>>>> It runs three steps for double and two steps for float to achieve
> >>>>> the
> >>>> needed
> >>>>> precision.
> >>>>>
> >>>>> There is one caveat and open question.
> >>>>> Since -ffast-math enables flush to zero intermediate values
> >>>>> between approximation steps will be flushed to zero if they are
> denormal.
> >>>>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
> >>>>> The test cases pass, but it is unclear to me whether this is
> >>>>> expected behavior with -ffast-math.
> >>>>>
> >>>>> The patch applies to commit:
> >>>>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
> >>>>>
> >>>>> Please consider including this patch.
> >>>>> Thank you and best regards,
> >>>>> Benedikt Huber
> >>>>>
> >>>>> Benedikt Huber (1):
> >>>>> 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
> >>>>>
> >>>>> gcc/ChangeLog                            |   9 +++
> >>>>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
> >>>>> gcc/config/aarch64/aarch64-protos.h      |   2 +
> >>>>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
> >>>>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
> >>>>> gcc/config/aarch64/aarch64.md            |   3 +
> >>>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
> >>>>> +++++++++++++++++++++++++++++++
> >>>>> 7 files changed, 277 insertions(+) create mode 100644
> >>>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c
> >>>>>
> >>>>> --
> >>>>> 1.9.1
> >>>> <Mail Attachment.eml>
> >>>
> >
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-24 16:55   ` Dr. Philipp Tomsich
  2015-06-24 17:16     ` Benedikt Huber
@ 2015-06-25  7:01     ` Kumar, Venkataramanan
  1 sibling, 0 replies; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-06-25  7:01 UTC (permalink / raw)
  To: Dr. Philipp Tomsich, Evandro Menezes; +Cc: Benedikt Huber, gcc-patches

Hi, 

If I understand correct, current implementation replaces 

fdiv 
fsqrt

 by  
 frsqrte
for i=0 to 3
fmul
frsqrts  
fmul

So I think gains depends latency of  frsqrts  insn.

I see patch has patterns for  vector versions of frsqrts, but does not enable them?

Regards,
Venkat.

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
> Sent: Wednesday, June 24, 2015 10:22 PM
> To: Evandro Menezes
> Cc: Benedikt Huber; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Evandro,
> 
> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal
> sqrt.
> 
> Also, the “reciprocal divide” patches are floating around in various of our git-
> tree, but aren’t ready for public consumption, yet… I’ll leave Benedikt to
> comment on potential timelines for getting that pushed out.
> 
> Best,
> Philipp.
> 
> > On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com>
> wrote:
> >
> > Benedikt,
> >
> > You beat me to it! :-)  Do you have the implementation for dividing
> > using the Newton series as well?
> >
> > I'm not sure that the series is always for all data types and on all
> > processors.  It would be useful to allow each AArch64 processor to
> > enable this or not depending on the data type.  BTW, do you have some
> > tests showing the speed up?
> >
> > Thank you,
> >
> > --
> > Evandro Menezes                              Austin, TX
> >
> >> -----Original Message-----
> >> From: gcc-patches-owner@gcc.gnu.org
> >> [mailto:gcc-patches-owner@gcc.gnu.org]
> > On
> >> Behalf Of Benedikt Huber
> >> Sent: Thursday, June 18, 2015 7:04
> >> To: gcc-patches@gcc.gnu.org
> >> Cc: benedikt.huber@theobroma-systems.com;
> philipp.tomsich@theobroma-
> >> systems.com
> >> Subject: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> >> estimation in -ffast-math
> >>
> >> arch64 offers the instructions frsqrte and frsqrts, for rsqrt
> >> estimation
> > and
> >> a Newton-Raphson step, respectively.
> >> There are ARMv8 implementations where this is faster than using fdiv
> >> and rsqrt.
> >> It runs three steps for double and two steps for float to achieve the
> > needed
> >> precision.
> >>
> >> There is one caveat and open question.
> >> Since -ffast-math enables flush to zero intermediate values between
> >> approximation steps will be flushed to zero if they are denormal.
> >> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
> >> The test cases pass, but it is unclear to me whether this is expected
> >> behavior with -ffast-math.
> >>
> >> The patch applies to commit:
> >> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
> >>
> >> Please consider including this patch.
> >> Thank you and best regards,
> >> Benedikt Huber
> >>
> >> Benedikt Huber (1):
> >>  2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-
> systems.com>
> >>
> >> gcc/ChangeLog                            |   9 +++
> >> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
> >> gcc/config/aarch64/aarch64-protos.h      |   2 +
> >> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
> >> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
> >> gcc/config/aarch64/aarch64.md            |   3 +
> >> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
> >> +++++++++++++++++++++++++++++++
> >> 7 files changed, 277 insertions(+)
> >> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
> >>
> >> --
> >> 1.9.1
> > <Mail Attachment.eml>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-18 11:57 [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math Benedikt Huber
                   ` (2 preceding siblings ...)
  2015-06-24 16:49 ` Evandro Menezes
@ 2015-06-25  7:03 ` pinskia
  2015-06-25  9:43   ` Ramana Radhakrishnan
  2015-06-25 11:07   ` Benedikt Huber
  3 siblings, 2 replies; 36+ messages in thread
From: pinskia @ 2015-06-25  7:03 UTC (permalink / raw)
  To: Benedikt Huber; +Cc: gcc-patches, philipp.tomsich





> On Jun 18, 2015, at 5:04 AM, Benedikt Huber <benedikt.huber@theobroma-systems.com> wrote:
> 
> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation and
> a Newton-Raphson step, respectively.
> There are ARMv8 implementations where this is faster than using fdiv and rsqrt.
> It runs three steps for double and two steps for float to achieve the needed precision.

This is NOT a win on thunderX at least for single precision because you have to do the divide and sqrt in the same time as it takes 5 multiples (estimate and step are multiplies in the thunderX pipeline).  Doubles is 10 multiplies which is just the same as what the patch does (but it is really slightly less than 10, I rounded up). So in the end this is NOT a win at all for thunderX unless we do one less step for both single and double. 

Thanks,
Andrew


> 
> There is one caveat and open question.
> Since -ffast-math enables flush to zero intermediate values between approximation steps
> will be flushed to zero if they are denormal.
> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
> The test cases pass, but it is unclear to me whether this is expected behavior with -ffast-math.
> 
> The patch applies to commit:
> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
> 
> Please consider including this patch.
> Thank you and best regards,
> Benedikt Huber
> 
> Benedikt Huber (1):
>  2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
> 
> gcc/ChangeLog                            |   9 +++
> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
> gcc/config/aarch64/aarch64-protos.h      |   2 +
> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
> gcc/config/aarch64/aarch64.md            |   3 +
> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113 +++++++++++++++++++++++++++++++
> 7 files changed, 277 insertions(+)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
> 
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-25  7:03 ` pinskia
@ 2015-06-25  9:43   ` Ramana Radhakrishnan
  2015-06-27  2:01     ` Andrew Pinski
  2015-06-25 11:07   ` Benedikt Huber
  1 sibling, 1 reply; 36+ messages in thread
From: Ramana Radhakrishnan @ 2015-06-25  9:43 UTC (permalink / raw)
  To: Benedikt Huber; +Cc: pinskia, gcc-patches, philipp.tomsich

Benedikt,

On 25/06/15 08:01, pinskia@gmail.com wrote:
>
>
>
>
>> On Jun 18, 2015, at 5:04 AM, Benedikt Huber <benedikt.huber@theobroma-systems.com> wrote:
>>
>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation and
>> a Newton-Raphson step, respectively.
>> There are ARMv8 implementations where this is faster than using fdiv and rsqrt.
>> It runs three steps for double and two steps for float to achieve the needed precision.
>
> This is NOT a win on thunderX at least for single precision because you have to do the divide and sqrt in the same time as it takes 5 multiples (estimate and step are multiplies in the thunderX pipeline).  Doubles is 10 multiplies which is just the same as what the patch does (but it is really slightly less than 10, I rounded up). So in the end this is NOT a win at all for thunderX unless we do one less step for both single and double.
>


Have you seen this 
https://gcc.gnu.org/ml/gcc-patches/2015-03/msg00164.html ? Really this 
is something that should be gated by the costs infrastructure .


regards
Ramana





> Thanks,
> Andrew
>
>
>>
>> There is one caveat and open question.
>> Since -ffast-math enables flush to zero intermediate values between approximation steps
>> will be flushed to zero if they are denormal.
>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>> The test cases pass, but it is unclear to me whether this is expected behavior with -ffast-math.
>>
>> The patch applies to commit:
>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>>
>> Please consider including this patch.
>> Thank you and best regards,
>> Benedikt Huber
>>
>> Benedikt Huber (1):
>>   2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>>
>> gcc/ChangeLog                            |   9 +++
>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>> gcc/config/aarch64/aarch64.md            |   3 +
>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113 +++++++++++++++++++++++++++++++
>> 7 files changed, 277 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>
>> --
>> 1.9.1
>>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-25  7:03 ` pinskia
  2015-06-25  9:43   ` Ramana Radhakrishnan
@ 2015-06-25 11:07   ` Benedikt Huber
  2015-06-25 13:27     ` Michael Matz
  2015-06-25 15:43     ` Kumar, Venkataramanan
  1 sibling, 2 replies; 36+ messages in thread
From: Benedikt Huber @ 2015-06-25 11:07 UTC (permalink / raw)
  To: pinskia; +Cc: gcc-patches, philipp.tomsich

[-- Attachment #1: Type: text/plain, Size: 822 bytes --]

Andrew,

> This is NOT a win on thunderX at least for single precision because you have to do the divide and sqrt in the same time as it takes 5 multiples (estimate and step are multiplies in the thunderX pipeline).  Doubles is 10 multiplies which is just the same as what the patch does (but it is really slightly less than 10, I rounded up). So in the end this is NOT a win at all for thunderX unless we do one less step for both single and double.

Yes, the expected benefit from rsqrt estimation is implementation specific. If one has a better initial rsqrte or an application that can trade precision for execution time, we could offer a command line option to do only 2 steps for doulbe and 1 step for float; similar to -mrecip-precision for PowerPC.
What are your thoughts on that?

Best regards,
Benedikt

[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-24 20:54           ` Evandro Menezes
@ 2015-06-25 11:52             ` Benedikt Huber
  0 siblings, 0 replies; 36+ messages in thread
From: Benedikt Huber @ 2015-06-25 11:52 UTC (permalink / raw)
  To: Evandro Menezes; +Cc: Dr. Philipp Tomsich, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 6243 bytes --]

I did not look at execute_cse_reciprocals_1(), yet.

However, with the recip-patch applied:

double recipd (double a, double b)
{
  return a/b;
}

translates to

recipd:
  frecpe  d2, d1
  frecps  d3, d2, d1
  fmul  d2, d2, d3
  frecps  d3, d2, d1
  fmul  d2, d2, d3
  frecps  d1, d2, d1
  fmul  d2, d2, d1
  fmul  d0, d2, d0
  ret

float recipf (float a, float b)
{
  return a/b;
}

translates to

recipf:
  frecpe  s2, s1
  frecps  s3, s2, s1
  fmul  s2, s2, s3
  frecps  s1, s2, s1
  fmul  s2, s2, s1
  fmul  s0, s2, s0
  ret

So it seems, that it works also for a generic division.

Best regards,
Benedikt

> On 24 Jun 2015, at 22:39, Evandro Menezes <e.menezes@samsung.com> wrote:
> 
> Philipp,
> 
> I think that execute_cse_reciprocals_1() applies only when the denominator is known at compile-time, otherwise the division stays.  It doesn't seem to know whether the target supports the approximate reciprocal or not.
> 
> Cheers,
> 
> --
> Evandro Menezes                              Austin, TX
> 
> 
>> -----Original Message-----
>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
>> Behalf Of Dr. Philipp Tomsich
>> Sent: Wednesday, June 24, 2015 15:08
>> To: Evandro Menezes
>> Cc: Benedikt Huber; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> Evandro,
>> 
>> Shouldn't ‘execute_cse_reciprocals_1’ take care of this, once the reciprocal-
>> division is implemented?
>> Do you think there’s additional work needed to catch all cases/opportunities?
>> 
>> Best,
>> Philipp.
>> 
>>> On 24 Jun 2015, at 20:19, Evandro Menezes <e.menezes@samsung.com> wrote:
>>> 
>>> Benedikt,
>>> 
>>> Are you developing the reciprocal approximation just for 1/x proper or for
>> any division, as in x/y = x * 1/y?
>>> 
>>> Thank you,
>>> 
>>> --
>>> Evandro Menezes                              Austin, TX
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Benedikt Huber [mailto:benedikt.huber@theobroma-systems.com]
>>>> Sent: Wednesday, June 24, 2015 12:11
>>>> To: Dr. Philipp Tomsich
>>>> Cc: Evandro Menezes; gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
>>>> (rsqrt) estimation in -ffast-math
>>>> 
>>>> Evandro,
>>>> 
>>>> Yes, we also have the 1/x approximation.
>>>> However we do not have the test cases yet, and it also would need
>>>> some clean up.
>>>> I am going to provide a patch for that soon (say next week).
>>>> Also, for this optimization we have *not* yet found a benchmark with
>>>> significant improvements.
>>>> 
>>>> Best Regards,
>>>> Benedikt
>>>> 
>>>> 
>>>>> On 24 Jun 2015, at 18:52, Dr. Philipp Tomsich
>>>>> <philipp.tomsich@theobroma-
>>>> systems.com> wrote:
>>>>> 
>>>>> Evandro,
>>>>> 
>>>>> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar)
>>>>> reciprocal
>>>> sqrt.
>>>>> 
>>>>> Also, the “reciprocal divide” patches are floating around in various
>>>>> of our git-tree, but aren’t ready for public consumption, yet… I’ll
>>>>> leave Benedikt to comment on potential timelines for getting that
>>>>> pushed
>>>> out.
>>>>> 
>>>>> Best,
>>>>> Philipp.
>>>>> 
>>>>>> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
>>>>>> 
>>>>>> Benedikt,
>>>>>> 
>>>>>> You beat me to it! :-)  Do you have the implementation for dividing
>>>>>> using the Newton series as well?
>>>>>> 
>>>>>> I'm not sure that the series is always for all data types and on
>>>>>> all processors.  It would be useful to allow each AArch64 processor
>>>>>> to enable this or not depending on the data type.  BTW, do you have
>>>>>> some tests showing the speed up?
>>>>>> 
>>>>>> Thank you,
>>>>>> 
>>>>>> --
>>>>>> Evandro Menezes                              Austin, TX
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: gcc-patches-owner@gcc.gnu.org
>>>>>>> [mailto:gcc-patches-owner@gcc.gnu.org]
>>>>>> On
>>>>>>> Behalf Of Benedikt Huber
>>>>>>> Sent: Thursday, June 18, 2015 7:04
>>>>>>> To: gcc-patches@gcc.gnu.org
>>>>>>> Cc: benedikt.huber@theobroma-systems.com;
>>>>>>> philipp.tomsich@theobroma- systems.com
>>>>>>> Subject: [PATCH] [aarch64] Implemented reciprocal square root
>>>>>>> (rsqrt) estimation in -ffast-math
>>>>>>> 
>>>>>>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt
>>>>>>> estimation
>>>>>> and
>>>>>>> a Newton-Raphson step, respectively.
>>>>>>> There are ARMv8 implementations where this is faster than using
>>>>>>> fdiv and rsqrt.
>>>>>>> It runs three steps for double and two steps for float to achieve
>>>>>>> the
>>>>>> needed
>>>>>>> precision.
>>>>>>> 
>>>>>>> There is one caveat and open question.
>>>>>>> Since -ffast-math enables flush to zero intermediate values
>>>>>>> between approximation steps will be flushed to zero if they are
>> denormal.
>>>>>>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>>>>>>> The test cases pass, but it is unclear to me whether this is
>>>>>>> expected behavior with -ffast-math.
>>>>>>> 
>>>>>>> The patch applies to commit:
>>>>>>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>>>>>>> 
>>>>>>> Please consider including this patch.
>>>>>>> Thank you and best regards,
>>>>>>> Benedikt Huber
>>>>>>> 
>>>>>>> Benedikt Huber (1):
>>>>>>> 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>>>>>>> 
>>>>>>> gcc/ChangeLog                            |   9 +++
>>>>>>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>>>>>>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>>>>>>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>>>>>>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>>>>>>> gcc/config/aarch64/aarch64.md            |   3 +
>>>>>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>> 7 files changed, 277 insertions(+) create mode 100644
>>>>>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>>>>>> 
>>>>>>> --
>>>>>>> 1.9.1
>>>>>> <Mail Attachment.eml>
>>>>> 
>>> 
>>> 
> 


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-25 11:07   ` Benedikt Huber
@ 2015-06-25 13:27     ` Michael Matz
  2015-06-25 15:43     ` Kumar, Venkataramanan
  1 sibling, 0 replies; 36+ messages in thread
From: Michael Matz @ 2015-06-25 13:27 UTC (permalink / raw)
  To: Benedikt Huber; +Cc: pinskia, gcc-patches, philipp.tomsich

Hi,

On Thu, 25 Jun 2015, Benedikt Huber wrote:

> > This is NOT a win on thunderX at least for single precision because 
> > you have to do the divide and sqrt in the same time as it takes 5 
> > multiples (estimate and step are multiplies in the thunderX pipeline).  
> > Doubles is 10 multiplies which is just the same as what the patch does 
> > (but it is really slightly less than 10, I rounded up). So in the end 
> > this is NOT a win at all for thunderX unless we do one less step for 
> > both single and double.
> 
> Yes, the expected benefit from rsqrt estimation is implementation 
> specific. If one has a better initial rsqrte or an application that can 
> trade precision for execution time, we could offer a command line option 
> to do only 2 steps for doulbe and 1 step for float; similar to 
> -mrecip-precision for PowerPC. What are your thoughts on that?

On x86-64, under -ffast-math we only do one NR step.  Generally the 
rule-of-thumb take on fast-math is, that common benchmarks should still 
validate with that option in effect.

(And yes, I also never found a speedup for approximated reciprocals so 
that benchmarks would still generally validate, you always had to do two 
NR steps, and then it became as slow as a general divide).  See also 
http://gcc.gnu.org/ml/gcc-patches/2009-11/msg00099.html and the followup 
thread.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-25 11:07   ` Benedikt Huber
  2015-06-25 13:27     ` Michael Matz
@ 2015-06-25 15:43     ` Kumar, Venkataramanan
  2015-06-25 15:52       ` Dr. Philipp Tomsich
  1 sibling, 1 reply; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-06-25 15:43 UTC (permalink / raw)
  To: Benedikt Huber, pinskia; +Cc: gcc-patches, philipp.tomsich

Changing to  "1 step for float" and "2 steps for double" gives better gains now for gromacs on cortex-a57.

Regards,
Venkat.
> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> Sent: Thursday, June 25, 2015 4:09 PM
> To: pinskia@gmail.com
> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-systems.com
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Andrew,
> 
> > This is NOT a win on thunderX at least for single precision because you have
> to do the divide and sqrt in the same time as it takes 5 multiples (estimate
> and step are multiplies in the thunderX pipeline).  Doubles is 10 multiplies
> which is just the same as what the patch does (but it is really slightly less than
> 10, I rounded up). So in the end this is NOT a win at all for thunderX unless
> we do one less step for both single and double.
> 
> Yes, the expected benefit from rsqrt estimation is implementation specific. If
> one has a better initial rsqrte or an application that can trade precision for
> execution time, we could offer a command line option to do only 2 steps for
> doulbe and 1 step for float; similar to -mrecip-precision for PowerPC.
> What are your thoughts on that?
> 
> Best regards,
> Benedikt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-25 15:43     ` Kumar, Venkataramanan
@ 2015-06-25 15:52       ` Dr. Philipp Tomsich
  2015-06-25 16:47         ` Kumar, Venkataramanan
  0 siblings, 1 reply; 36+ messages in thread
From: Dr. Philipp Tomsich @ 2015-06-25 15:52 UTC (permalink / raw)
  To: Kumar, Venkataramanan; +Cc: Benedikt Huber, pinskia, gcc-patches

Kumar,

what is the relative gain that you see on Cortex-A57?

Thanks,
Philipp.

> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan <Venkataramanan.Kumar@amd.com> wrote:
> 
> Changing to  "1 step for float" and "2 steps for double" gives better gains now for gromacs on cortex-a57.
> 
> Regards,
> Venkat.
>> -----Original Message-----
>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
>> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
>> Sent: Thursday, June 25, 2015 4:09 PM
>> To: pinskia@gmail.com
>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-systems.com
>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> Andrew,
>> 
>>> This is NOT a win on thunderX at least for single precision because you have
>> to do the divide and sqrt in the same time as it takes 5 multiples (estimate
>> and step are multiplies in the thunderX pipeline).  Doubles is 10 multiplies
>> which is just the same as what the patch does (but it is really slightly less than
>> 10, I rounded up). So in the end this is NOT a win at all for thunderX unless
>> we do one less step for both single and double.
>> 
>> Yes, the expected benefit from rsqrt estimation is implementation specific. If
>> one has a better initial rsqrte or an application that can trade precision for
>> execution time, we could offer a command line option to do only 2 steps for
>> doulbe and 1 step for float; similar to -mrecip-precision for PowerPC.
>> What are your thoughts on that?
>> 
>> Best regards,
>> Benedikt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-25 15:52       ` Dr. Philipp Tomsich
@ 2015-06-25 16:47         ` Kumar, Venkataramanan
  2015-06-28 15:13           ` pinskia
  0 siblings, 1 reply; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-06-25 16:47 UTC (permalink / raw)
  To: Dr. Philipp Tomsich; +Cc: Benedikt Huber, pinskia, gcc-patches

I got around ~12% gain with -Ofast -mcpu=cortex-a57.

Regards,
Venkat.

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-	
> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
> Sent: Thursday, June 25, 2015 9:13 PM
> To: Kumar, Venkataramanan
> Cc: Benedikt Huber; pinskia@gmail.com; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Kumar,
> 
> what is the relative gain that you see on Cortex-A57?
> 
> Thanks,
> Philipp.
> 
> > On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
> <Venkataramanan.Kumar@amd.com> wrote:
> >
> > Changing to  "1 step for float" and "2 steps for double" gives better gains
> now for gromacs on cortex-a57.
> >
> > Regards,
> > Venkat.
> >> -----Original Message-----
> >> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> >> Sent: Thursday, June 25, 2015 4:09 PM
> >> To: pinskia@gmail.com
> >> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-systems.com
> >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >> (rsqrt) estimation in -ffast-math
> >>
> >> Andrew,
> >>
> >>> This is NOT a win on thunderX at least for single precision because
> >>> you have
> >> to do the divide and sqrt in the same time as it takes 5 multiples
> >> (estimate and step are multiplies in the thunderX pipeline).  Doubles
> >> is 10 multiplies which is just the same as what the patch does (but
> >> it is really slightly less than 10, I rounded up). So in the end this
> >> is NOT a win at all for thunderX unless we do one less step for both single
> and double.
> >>
> >> Yes, the expected benefit from rsqrt estimation is implementation
> >> specific. If one has a better initial rsqrte or an application that
> >> can trade precision for execution time, we could offer a command line
> >> option to do only 2 steps for doulbe and 1 step for float; similar to -
> mrecip-precision for PowerPC.
> >> What are your thoughts on that?
> >>
> >> Best regards,
> >> Benedikt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-25  9:43   ` Ramana Radhakrishnan
@ 2015-06-27  2:01     ` Andrew Pinski
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Pinski @ 2015-06-27  2:01 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Benedikt Huber, gcc-patches, philipp.tomsich

On Thu, Jun 25, 2015 at 1:24 AM, Ramana Radhakrishnan
<ramana.radhakrishnan@foss.arm.com> wrote:
> Benedikt,
>
> On 25/06/15 08:01, pinskia@gmail.com wrote:
>>
>>
>>
>>
>>
>>> On Jun 18, 2015, at 5:04 AM, Benedikt Huber
>>> <benedikt.huber@theobroma-systems.com> wrote:
>>>
>>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation
>>> and
>>> a Newton-Raphson step, respectively.
>>> There are ARMv8 implementations where this is faster than using fdiv and
>>> rsqrt.
>>> It runs three steps for double and two steps for float to achieve the
>>> needed precision.
>>
>>
>> This is NOT a win on thunderX at least for single precision because you
>> have to do the divide and sqrt in the same time as it takes 5 multiples
>> (estimate and step are multiplies in the thunderX pipeline).  Doubles is 10
>> multiplies which is just the same as what the patch does (but it is really
>> slightly less than 10, I rounded up). So in the end this is NOT a win at all
>> for thunderX unless we do one less step for both single and double.
>>
>
>
> Have you seen this https://gcc.gnu.org/ml/gcc-patches/2015-03/msg00164.html
> ? Really this is something that should be gated by the costs infrastructure


Yes I saw that in fact I did not look into the latencies of our core
until this patch came out.  But yes this should be gated by a cost
infrastructure and most likely not as part of the -mcpu=generic cost
(well the rsqrt if we change it to 1 iterations and 2 iterations).

Thanks,
Andrew

> .
>
>
> regards
> Ramana
>
>
>
>
>
>
>> Thanks,
>> Andrew
>>
>>
>>>
>>> There is one caveat and open question.
>>> Since -ffast-math enables flush to zero intermediate values between
>>> approximation steps
>>> will be flushed to zero if they are denormal.
>>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>>> The test cases pass, but it is unclear to me whether this is expected
>>> behavior with -ffast-math.
>>>
>>> The patch applies to commit:
>>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>>>
>>> Please consider including this patch.
>>> Thank you and best regards,
>>> Benedikt Huber
>>>
>>> Benedikt Huber (1):
>>>   2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>>>
>>> gcc/ChangeLog                            |   9 +++
>>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>>> gcc/config/aarch64/aarch64.md            |   3 +
>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>>> +++++++++++++++++++++++++++++++
>>> 7 files changed, 277 insertions(+)
>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>>
>>> --
>>> 1.9.1
>>>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] 2015-06-15 Benedikt Huber <benedikt.huber@theobroma-systems.com>
  2015-06-18 12:03 ` [PATCH] 2015-06-15 Benedikt Huber <benedikt.huber@theobroma-systems.com> Benedikt Huber
@ 2015-06-27  8:12   ` Andrew Pinski
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Pinski @ 2015-06-27  8:12 UTC (permalink / raw)
  To: Benedikt Huber; +Cc: GCC Patches, Philipp Tomsich

On Thu, Jun 18, 2015 at 5:04 AM, Benedikt Huber
<benedikt.huber@theobroma-systems.com> wrote:
A few comments about the patch itself besides the cost model issue and
iteration issue mentioned in the other email thread.

>        * config/aarch64/aarch64-builtins.c: Builtins for rsqrt and rsqrtf.
>        * config/aarch64/aarch64-protos.h: Declare.
>        * config/aarch64/aarch64-simd.md: Matching expressions for frsqrte and frsqrts.
>        * config/aarch64/aarch64.c: New functions. Emit rsqrt estimation code in fast math mode.
>        * config/aarch64/aarch64.md: Added enum entry.
>        * testsuite/gcc.target/aarch64/rsqrt.c: Tests for single and double.
> ---
>  gcc/ChangeLog                            |   9 +++
>  gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>  gcc/config/aarch64/aarch64-protos.h      |   2 +
>  gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>  gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>  gcc/config/aarch64/aarch64.md            |   3 +
>  gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113 +++++++++++++++++++++++++++++++
>  7 files changed, 277 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index c9b156f..690ebba 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,12 @@
> +2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
> +
> +       * config/aarch64/aarch64-builtins.c: Builtins for rsqrt and rsqrtf.
> +       * config/aarch64/aarch64-protos.h: Declare.
> +       * config/aarch64/aarch64-simd.md: Matching expressions for frsqrte and frsqrts.
> +       * config/aarch64/aarch64.c: New functions. Emit rsqrt estimation code in fast math mode.
> +       * config/aarch64/aarch64.md: Added enum entry.
> +       * testsuite/gcc.target/aarch64/rsqrt.c: Tests for single and double.
> +
>  2015-06-14  Richard Sandiford  <richard.sandiford@arm.com>
>
>         * rtl.h (classify_insn): Declare.
> diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
> index f7a39ec..484bb84 100644
> --- a/gcc/config/aarch64/aarch64-builtins.c
> +++ b/gcc/config/aarch64/aarch64-builtins.c
> @@ -342,6 +342,8 @@ enum aarch64_builtins
>    AARCH64_BUILTIN_GET_FPSR,
>    AARCH64_BUILTIN_SET_FPSR,
>
> +  AARCH64_BUILTIN_RSQRT,
> +  AARCH64_BUILTIN_RSQRTF,
>    AARCH64_SIMD_BUILTIN_BASE,
>    AARCH64_SIMD_BUILTIN_LANE_CHECK,
>  #include "aarch64-simd-builtins.def"
> @@ -831,6 +833,32 @@ aarch64_init_crc32_builtins ()
>  }
>
>  void
> +aarch64_add_builtin_rsqrt (void)
> +{
> +  tree fndecl = NULL;
> +  tree ftype = NULL;
> +  ftype = build_function_type_list (double_type_node, double_type_node, NULL_TREE);
> +
> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt",
> +                                 ftype,
> +                                 AARCH64_BUILTIN_RSQRT,
> +                                 BUILT_IN_MD,
> +                                 NULL,
> +                                 NULL_TREE);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT] = fndecl;
> +
> +  tree ftypef = NULL;
> +  ftypef = build_function_type_list (float_type_node, float_type_node, NULL_TREE);
> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrtf",
> +                                 ftypef,
> +                                 AARCH64_BUILTIN_RSQRTF,
> +                                 BUILT_IN_MD,
> +                                 NULL,
> +                                 NULL_TREE);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRTF] = fndecl;
> +}
> +
> +void
>  aarch64_init_builtins (void)
>  {
>    tree ftype_set_fpr
> @@ -855,6 +883,7 @@ aarch64_init_builtins (void)
>      aarch64_init_simd_builtins ();
>    if (TARGET_CRC32)
>      aarch64_init_crc32_builtins ();
> +  aarch64_add_builtin_rsqrt ();
>  }
>
>  tree
> @@ -1099,6 +1128,23 @@ aarch64_crc32_expand_builtin (int fcode, tree exp, rtx target)
>    return target;
>  }
>
> +static rtx
> +aarch64_expand_builtin_rsqrt (int fcode, tree exp, rtx target)
> +{
> +  rtx pat;
> +  tree arg0 = CALL_EXPR_ARG (exp, 0);
> +  rtx op0 = expand_normal (arg0);
> +
> +  enum insn_code c = CODE_FOR_rsqrtdf;
> +  if (fcode == AARCH64_BUILTIN_RSQRTF)
> +    c = CODE_FOR_rsqrtsf;
> +
> +  pat = GEN_FCN (c) (target, op0);
> +  emit_insn (pat);
> +
> +  return target;
> +}
> +
>  /* Expand an expression EXP that calls a built-in function,
>     with result going to TARGET if that's convenient.  */
>  rtx
> @@ -1146,6 +1192,11 @@ aarch64_expand_builtin (tree exp,
>    else if (fcode >= AARCH64_CRC32_BUILTIN_BASE && fcode <= AARCH64_CRC32_BUILTIN_MAX)
>      return aarch64_crc32_expand_builtin (fcode, exp, target);
>
> +  if (fcode == AARCH64_BUILTIN_RSQRT ||
> +      fcode == AARCH64_BUILTIN_RSQRTF)
> +    return aarch64_expand_builtin_rsqrt (fcode, exp, target);
> +
> +  return NULL_RTX;
>    gcc_unreachable ();
>  }
>
> @@ -1303,6 +1354,15 @@ aarch64_builtin_vectorized_function (tree fndecl, tree type_out, tree type_in)
>    return NULL_TREE;
>  }
>
> +tree
> +aarch64_builtin_rsqrt (bool is_float)
> +{
> +  if (is_float)
> +    return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRTF];
> +  else
> +    return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT];
> +}
> +
>  #undef VAR1
>  #define VAR1(T, N, MAP, A) \
>    case AARCH64_SIMD_BUILTIN_##T##_##N##A:
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 965a11b..4f1c8ce 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -270,6 +270,8 @@ void aarch64_print_operand (FILE *, rtx, char);
>  void aarch64_print_operand_address (FILE *, rtx);
>  void aarch64_emit_call_insn (rtx);
>
> +void aarch64_emit_swrsqrt (rtx, rtx);
> +
>  /* Initialize builtins for SIMD intrinsics.  */
>  void init_aarch64_simd_builtins (void);
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index b90f938..266800a 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -353,6 +353,33 @@
>    [(set_attr "type" "neon_fp_mul_d_scalar_q")]
>  )
>
> +(define_insn "*rsqrte_simd"
> +  [(set (match_operand:VALLF 0 "register_operand" "=w")
> +       (unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")]
> +                    UNSPEC_RSQRTE))]
> +  "TARGET_SIMD"
> +  "frsqrte\\t%<v>0<Vmtype>, %<v>1<Vmtype>"
> +  [(set_attr "type" "neon_fp_rsqrte_<Vetype><q>")])

I think it might be useful to have the mode on the pattern name.

> +
> +(define_insn "*rsqrts_simd"

Likewise.

> +  [(set (match_operand:VALLF 0 "register_operand" "=w")
> +       (unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")
> +               (match_operand:VALLF 2 "register_operand" "w")]
> +                    UNSPEC_RSQRTS))]
> +  "TARGET_SIMD"
> +  "frsqrts\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>"
> +  [(set_attr "type" "neon_fp_rsqrts_<Vetype><q>")])
> +
> +(define_expand "rsqrt<mode>"
> +  [(set (match_operand:GPF 0 "register_operand" "=w")
> +       (unspec:GPF [(match_operand:GPF 1 "register_operand" "w")]
> +                    UNSPEC_RSQRT))]
> +  "TARGET_FLOAT"
> +{
> +  aarch64_emit_swrsqrt (operands[0], operands[1]);
> +  DONE;
> +})
> +
>  (define_insn "*aarch64_mul3_elt_to_64v2df"
>    [(set (match_operand:DF 0 "register_operand" "=w")
>       (mult:DF
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index c3c2795..281f0b2 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -6816,6 +6816,66 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
>    return aarch64_tune_params->memmov_cost;
>  }
>
> +extern tree aarch64_builtin_rsqrt (bool is_float);
> +
> +static tree
> +aarch64_builtin_reciprocal (unsigned int fn,
> +                            bool md_fn ATTRIBUTE_UNUSED,
> +                            bool sqrt ATTRIBUTE_UNUSED)

You don't need ATTRIBUTE_UNUSED any more if you are not using the
variable at all.  Just remove the variable name, this is C++ code
after all :).

> +{
> +  if (!fast_math_flags_set_p (&global_options))
> +    return NULL_TREE;

Is this correct check here?  I suspect you want
!(flag_finite_math_only && !flag_trapping_math &&
flag_unsafe_math_optimizations) instead so someone can do -ffast-math
-fno-finite-math-only.


is there a reason why you don't check md_fn here at all because if
md_fn is true, and you get the same value for fn as
BUILT_IN_SQRTF/BUILT_IN_SQRT, we would get incorrect code.
Also I think you should check md_fn and support the vectorized functions here.


> +
> +  if (fn == BUILT_IN_SQRTF)
> +    return aarch64_builtin_rsqrt (true);
> +  else if (fn == BUILT_IN_SQRT)
> +    return aarch64_builtin_rsqrt (false);
> +  else
> +    return NULL_TREE;
> +}
> +
> +void
> +aarch64_emit_swrsqrt (rtx dst, rtx src)
> +{
> +  enum machine_mode mode = GET_MODE (src);
> +  rtx xsrc = gen_reg_rtx (mode);
> +  emit_set_insn (xsrc, src);

Why do you use emit_set_insn here instead of emit_move_insn?

> +
> +  rtx x0 = gen_reg_rtx (mode);
> +  emit_insn (gen_rtx_SET (x0,
> +                          gen_rtx_UNSPEC (mode, gen_rtvec (1, xsrc),
> +                                          UNSPEC_RSQRTE)));

And not emit_set_insn here?

> +
> +  bool double_mode = (DFmode   == mode ||
> +                      V1DFmode == mode ||
> +                      V2DFmode == mode ||
> +                      V4DFmode == mode ||
> +                      V6DFmode == mode ||
> +                      V8DFmode == mode);

Why not use:
mode == DFmode || (VECTOR_MODE_P (mode) && GET_MODE_INNER (mode) == DFmode);
At least it becomes more obvious what is done here.
Or just:
mode == DFmode || mode == V2DFmode || mode == V4DFmode

Also it might make sense to have an assert on the modes we accept at
the beginning of the function.

> +
> +  int iterations = 2;
> +  if (double_mode)
> +    iterations = 3;
> +
> +  for (int i = 0; i < iterations; ++i)
> +    {
> +      rtx x1 = gen_reg_rtx (mode);
> +      rtx x2 = gen_reg_rtx (mode);
> +      rtx x3 = gen_reg_rtx (mode);
> +      emit_insn (gen_rtx_SET (x2,
> +                              gen_rtx_MULT (mode, x0, x0)));
> +      emit_insn (gen_rtx_SET (x3,
> +                              gen_rtx_UNSPEC (mode, gen_rtvec (2, xsrc, x2),
> +                                              UNSPEC_RSQRTS)));
> +      emit_insn (gen_rtx_SET (x1,
> +                              gen_rtx_MULT (mode, x0, x3)));

More of emit_set_insn usage.

> +      x0 = x1;
> +    }
> +
> +  emit_move_insn (dst, x0);
> +  return;
> +}
> +
>  /* Return the number of instructions that can be issued per cycle.  */
>  static int
>  aarch64_sched_issue_rate (void)
> @@ -11747,6 +11807,9 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
>  #undef TARGET_USE_BLOCKS_FOR_CONSTANT_P
>  #define TARGET_USE_BLOCKS_FOR_CONSTANT_P aarch64_use_blocks_for_constant_p
>
> +#undef TARGET_BUILTIN_RECIPROCAL
> +#define TARGET_BUILTIN_RECIPROCAL aarch64_builtin_reciprocal
> +
>  #undef TARGET_VECTOR_MODE_SUPPORTED_P
>  #define TARGET_VECTOR_MODE_SUPPORTED_P aarch64_vector_mode_supported_p
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 11123d6..7272d05 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -120,6 +120,9 @@
>      UNSPEC_VSTRUCTDUMMY
>      UNSPEC_SP_SET
>      UNSPEC_SP_TEST
> +    UNSPEC_RSQRT
> +    UNSPEC_RSQRTE
> +    UNSPEC_RSQRTS
>  ])
>
>  (define_c_enum "unspecv" [
> diff --git a/gcc/testsuite/gcc.target/aarch64/rsqrt.c b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
> new file mode 100644
> index 0000000..6607483
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
> @@ -0,0 +1,113 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -fno-inline --save-temps -ffast-math" } */

Maybe instead of -fno-inline, you can add some volatile to some
variables.  We really should have separate tests for the assembly scan
vs the runtime tests too.

> +
> +#include <math.h>
> +#include <stdio.h>
> +
> +#include <values.h>

Is values.h a standard header file.  I suspect you should be using
float.h instead which is the standard header file.
> +
> +#define PI    3.141592653589793
> +#define SQRT2 1.4142135623730951
> +
> +#define PI_4 0.7853981633974483
> +#define SQRT1_2 0.7071067811865475
> +
> +/* 2^25+1, float has 24 significand bits
> + *       according to Single-precision floating-point format.  */
> +#define TESTA8_FLT 33554433
> +/* 2^54+1, double has 53 significand bits
> + *       according to Double-precision floating-point format.  */
> +#define TESTA8_DBL 18014398509481985
> +
> +#define SD(a, b) t_double ((#a), (a), (b));
> +#define SF(a, b) t_float ((#a), (a), (b));
> +
> +#define EPSILON_double __DBL_EPSILON__
> +#define EPSILON_float __FLT_EPSILON__
> +#define ABS_double fabs
> +#define ABS_float fabsf
> +#define SQRT_double sqrt
> +#define SQRT_float sqrtf

I suspect you should have those using the builtin versions of the
above functions instead of the named ones.

Thanks,
Andrew Pinski

> +
> +extern void abort (void);
> +
> +#define TESTTYPE(TYPE)                                                     \
> +TYPE rsqrt_##TYPE (TYPE a)                                                 \
> +{                                                                          \
> +  return 1.0/SQRT_##TYPE(a);                                               \
> +}                                                                          \
> +                                                                           \
> +int equals_##TYPE (TYPE a, TYPE b)                                         \
> +{                                                                          \
> +  return (a == b ||                                                        \
> +          (isnan (a) && isnan (b)) ||                                      \
> +          (ABS_##TYPE (a - b) < EPSILON_##TYPE));                          \
> +}                                                                          \
> +                                                                           \
> +void t_##TYPE (const char *s, TYPE a, TYPE result)                         \
> +{                                                                          \
> +  TYPE r = rsqrt_##TYPE (a);                                               \
> +  if (!equals_##TYPE (r, result))                                          \
> +  {                                                                        \
> +    abort ();                                                              \
> +  }                                                                        \
> +}                                                                          \
> +
> +//  printf ("Problem in %20s: %30.18A should be %30.18A\n", s, r, result); \
> +
> +TESTTYPE(double)
> +TESTTYPE(float)
> +
> +/* { dg-final { scan-assembler-times "frsqrte\\td\[0-9\]+, d\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "frsqrts\\td\[0-9\]+, d\[0-9\]+, d\[0-9\]+" 3 } } */
> +
> +/* { dg-final { scan-assembler-times "frsqrte\\ts\[0-9\]+, s\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "frsqrts\\ts\[0-9\]+, s\[0-9\]+, s\[0-9\]+" 2 } } */
> +
> +int main ()
> +{
> +  SD(    1.0/256, 0X1.00000000000000P+4  );
> +  SD(        1.0, 0X1.00000000000000P+0  );
> +  SD(       -1.0,                     NAN);
> +  SD(       11.0, 0X1.34BF63D1568260P-2  );
> +  SD(        0.0,                INFINITY);
> +  SD(   INFINITY, 0X0.00000000000000P+0  );
> +  SD(        NAN,                     NAN);
> +  SD(       -NAN,                    -NAN);
> +  SD(    DBL_MAX, 0X1.00000000000010P-512);
> +  SD(    DBL_MIN, 0X1.00000000000000P+511);
> +  SD(         PI, 0X1.20DD750429B6D0P-1  );
> +  SD(       PI_4, 0X1.20DD750429B6D0P+0  );
> +  SD(      SQRT2, 0X1.AE89F995AD3AE0P-1  );
> +  SD(    SQRT1_2, 0X1.306FE0A31B7150P+0  );
> +  SD(        -PI,                     NAN);
> +  SD(     -SQRT2,                     NAN);
> +  SD( TESTA8_DBL, 0X1.00000000000000P-27 );
> +
> +  SF(    1.0/256, 0X1.00000000000000P+4  );
> +  SF(        1.0, 0X1.00000000000000P+0  );
> +  SF(       -1.0,                     NAN);
> +  SF(       11.0, 0X1.34BF6400000000P-2  );
> +  SF(        0.0,                INFINITY);
> +  SF(   INFINITY, 0X0.00000000000000P+0  );
> +  SF(        NAN,                     NAN);
> +  SF(       -NAN,                    -NAN);
> +  SF(    FLT_MAX, 0X1.00000200000000P-64 );
> +  SF(    FLT_MIN, 0X1.00000000000000P+63 );
> +  SF(         PI, 0X1.20DD7400000000P-1  );
> +  SF(       PI_4, 0X1.20DD7400000000P+0  );
> +  SF(      SQRT2, 0X1.AE89FA00000000P-1  );
> +  SF(    SQRT1_2, 0X1.306FE000000000P+0  );
> +  SF(        -PI,                     NAN);
> +  SF(     -SQRT2,                     NAN);
> +  SF( TESTA8_FLT, 0X1.6A09E600000000P-13 );
> +
> +//   With -ffast-math these return positive INF.
> +//   SD(       -0.0,               -INFINITY);
> +//   SF(       -0.0,               -INFINITY);
> +//   The reason here is that -ffast-math flushes to zero.
> +//   SD(DBL_MIN/256, 0X1.00000000000000P+515);
> +//   SF(FLT_MIN/256, 0X1.00000000000000P+67 );
> +
> +  return 0;
> +}
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-25 16:47         ` Kumar, Venkataramanan
@ 2015-06-28 15:13           ` pinskia
  2015-06-29  8:30             ` Kumar, Venkataramanan
  0 siblings, 1 reply; 36+ messages in thread
From: pinskia @ 2015-06-28 15:13 UTC (permalink / raw)
  To: Kumar, Venkataramanan; +Cc: Dr. Philipp Tomsich, Benedikt Huber, gcc-patches





> On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan <Venkataramanan.Kumar@amd.com> wrote:
> 
> I got around ~12% gain with -Ofast -mcpu=cortex-a57.

I get around 11/12% on thunderX with the patch and the decreasing the iterations change (1/2) compared to without the patch. 

Thanks,
Andrew


> 
> Regards,
> Venkat.
> 
>> -----Original Message-----
>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-    
>> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
>> Sent: Thursday, June 25, 2015 9:13 PM
>> To: Kumar, Venkataramanan
>> Cc: Benedikt Huber; pinskia@gmail.com; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> Kumar,
>> 
>> what is the relative gain that you see on Cortex-A57?
>> 
>> Thanks,
>> Philipp.
>> 
>>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
>>> <Venkataramanan.Kumar@amd.com> wrote:
>>> 
>>> Changing to  "1 step for float" and "2 steps for double" gives better gains
>> now for gromacs on cortex-a57.
>>> 
>>> Regards,
>>> Venkat.
>>>> -----Original Message-----
>>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
>>>> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
>>>> Sent: Thursday, June 25, 2015 4:09 PM
>>>> To: pinskia@gmail.com
>>>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-systems.com
>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
>>>> (rsqrt) estimation in -ffast-math
>>>> 
>>>> Andrew,
>>>> 
>>>>> This is NOT a win on thunderX at least for single precision because
>>>>> you have
>>>> to do the divide and sqrt in the same time as it takes 5 multiples
>>>> (estimate and step are multiplies in the thunderX pipeline).  Doubles
>>>> is 10 multiplies which is just the same as what the patch does (but
>>>> it is really slightly less than 10, I rounded up). So in the end this
>>>> is NOT a win at all for thunderX unless we do one less step for both single
>> and double.
>>>> 
>>>> Yes, the expected benefit from rsqrt estimation is implementation
>>>> specific. If one has a better initial rsqrte or an application that
>>>> can trade precision for execution time, we could offer a command line
>>>> option to do only 2 steps for doulbe and 1 step for float; similar to -
>> mrecip-precision for PowerPC.
>>>> What are your thoughts on that?
>>>> 
>>>> Best regards,
>>>> Benedikt
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-28 15:13           ` pinskia
@ 2015-06-29  8:30             ` Kumar, Venkataramanan
  2015-06-29  9:07               ` Dr. Philipp Tomsich
                                 ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-06-29  8:30 UTC (permalink / raw)
  To: pinskia; +Cc: Dr. Philipp Tomsich, Benedikt Huber, gcc-patches


Hmm,  Reducing the iterations to "1 step for float" and "2 steps for double"

 I got VE (miscompares) on following benchmarks
416.gamess 
453.povray         
454.calculix   
459.GemsFDTD  

Benedikt , I have ICE for 444.namd with your patch,  not sure if something wrong in my local tree.  

Regards,
Venkat.

> -----Original Message-----
> From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> Sent: Sunday, June 28, 2015 8:35 PM
> To: Kumar, Venkataramanan
> Cc: Dr. Philipp Tomsich; Benedikt Huber; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> 
> 
> 
> 
> > On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan
> <Venkataramanan.Kumar@amd.com> wrote:
> >
> > I got around ~12% gain with -Ofast -mcpu=cortex-a57.
> 
> I get around 11/12% on thunderX with the patch and the decreasing the
> iterations change (1/2) compared to without the patch.
> 
> Thanks,
> Andrew
> 
> 
> >
> > Regards,
> > Venkat.
> >
> >> -----Original Message-----
> >> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
> >> Sent: Thursday, June 25, 2015 9:13 PM
> >> To: Kumar, Venkataramanan
> >> Cc: Benedikt Huber; pinskia@gmail.com; gcc-patches@gcc.gnu.org
> >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >> (rsqrt) estimation in -ffast-math
> >>
> >> Kumar,
> >>
> >> what is the relative gain that you see on Cortex-A57?
> >>
> >> Thanks,
> >> Philipp.
> >>
> >>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
> >>> <Venkataramanan.Kumar@amd.com> wrote:
> >>>
> >>> Changing to  "1 step for float" and "2 steps for double" gives
> >>> better gains
> >> now for gromacs on cortex-a57.
> >>>
> >>> Regards,
> >>> Venkat.
> >>>> -----Original Message-----
> >>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >>>> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> >>>> Sent: Thursday, June 25, 2015 4:09 PM
> >>>> To: pinskia@gmail.com
> >>>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-
> systems.com
> >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>> (rsqrt) estimation in -ffast-math
> >>>>
> >>>> Andrew,
> >>>>
> >>>>> This is NOT a win on thunderX at least for single precision
> >>>>> because you have
> >>>> to do the divide and sqrt in the same time as it takes 5 multiples
> >>>> (estimate and step are multiplies in the thunderX pipeline).
> >>>> Doubles is 10 multiplies which is just the same as what the patch
> >>>> does (but it is really slightly less than 10, I rounded up). So in
> >>>> the end this is NOT a win at all for thunderX unless we do one less
> >>>> step for both single
> >> and double.
> >>>>
> >>>> Yes, the expected benefit from rsqrt estimation is implementation
> >>>> specific. If one has a better initial rsqrte or an application that
> >>>> can trade precision for execution time, we could offer a command
> >>>> line option to do only 2 steps for doulbe and 1 step for float;
> >>>> similar to -
> >> mrecip-precision for PowerPC.
> >>>> What are your thoughts on that?
> >>>>
> >>>> Best regards,
> >>>> Benedikt
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29  8:30             ` Kumar, Venkataramanan
@ 2015-06-29  9:07               ` Dr. Philipp Tomsich
  2015-06-29  9:22                 ` Kumar, Venkataramanan
  2015-07-14 22:20                 ` Evandro Menezes
  2015-06-29 14:20               ` Benedikt Huber
  2015-06-29 17:35               ` Benedikt Huber
  2 siblings, 2 replies; 36+ messages in thread
From: Dr. Philipp Tomsich @ 2015-06-29  9:07 UTC (permalink / raw)
  To: Kumar, Venkataramanan; +Cc: pinskia, Benedikt Huber, gcc-patches

Kumar,

This does not come unexpected, as the initial estimation and each iteration will add an architecturally-defined number of bits of precision (ARMv8 guarantuees only a minimum number of bits provided per operation… the exact number is specific to each micro-arch, though).
Depending on your architecture and on the required number of precise bits by any given benchmark, one may see miscompares.

Do you know the exact number of bits that the initial estimate and the subsequent refinement steps add for your micro-arch?

Thanks,
Philipp.

> On 29 Jun 2015, at 10:17, Kumar, Venkataramanan <Venkataramanan.Kumar@amd.com> wrote:
> 
> 
> Hmm,  Reducing the iterations to "1 step for float" and "2 steps for double"
> 
> I got VE (miscompares) on following benchmarks
> 416.gamess 
> 453.povray         
> 454.calculix   
> 459.GemsFDTD  
> 
> Benedikt , I have ICE for 444.namd with your patch,  not sure if something wrong in my local tree.  
> 
> Regards,
> Venkat.
> 
>> -----Original Message-----
>> From: pinskia@gmail.com [mailto:pinskia@gmail.com]
>> Sent: Sunday, June 28, 2015 8:35 PM
>> To: Kumar, Venkataramanan
>> Cc: Dr. Philipp Tomsich; Benedikt Huber; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> 
>> 
>> 
>> 
>>> On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan
>> <Venkataramanan.Kumar@amd.com> wrote:
>>> 
>>> I got around ~12% gain with -Ofast -mcpu=cortex-a57.
>> 
>> I get around 11/12% on thunderX with the patch and the decreasing the
>> iterations change (1/2) compared to without the patch.
>> 
>> Thanks,
>> Andrew
>> 
>> 
>>> 
>>> Regards,
>>> Venkat.
>>> 
>>>> -----Original Message-----
>>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
>>>> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
>>>> Sent: Thursday, June 25, 2015 9:13 PM
>>>> To: Kumar, Venkataramanan
>>>> Cc: Benedikt Huber; pinskia@gmail.com; gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
>>>> (rsqrt) estimation in -ffast-math
>>>> 
>>>> Kumar,
>>>> 
>>>> what is the relative gain that you see on Cortex-A57?
>>>> 
>>>> Thanks,
>>>> Philipp.
>>>> 
>>>>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
>>>>> <Venkataramanan.Kumar@amd.com> wrote:
>>>>> 
>>>>> Changing to  "1 step for float" and "2 steps for double" gives
>>>>> better gains
>>>> now for gromacs on cortex-a57.
>>>>> 
>>>>> Regards,
>>>>> Venkat.
>>>>>> -----Original Message-----
>>>>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
>>>>>> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
>>>>>> Sent: Thursday, June 25, 2015 4:09 PM
>>>>>> To: pinskia@gmail.com
>>>>>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-
>> systems.com
>>>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
>>>>>> (rsqrt) estimation in -ffast-math
>>>>>> 
>>>>>> Andrew,
>>>>>> 
>>>>>>> This is NOT a win on thunderX at least for single precision
>>>>>>> because you have
>>>>>> to do the divide and sqrt in the same time as it takes 5 multiples
>>>>>> (estimate and step are multiplies in the thunderX pipeline).
>>>>>> Doubles is 10 multiplies which is just the same as what the patch
>>>>>> does (but it is really slightly less than 10, I rounded up). So in
>>>>>> the end this is NOT a win at all for thunderX unless we do one less
>>>>>> step for both single
>>>> and double.
>>>>>> 
>>>>>> Yes, the expected benefit from rsqrt estimation is implementation
>>>>>> specific. If one has a better initial rsqrte or an application that
>>>>>> can trade precision for execution time, we could offer a command
>>>>>> line option to do only 2 steps for doulbe and 1 step for float;
>>>>>> similar to -
>>>> mrecip-precision for PowerPC.
>>>>>> What are your thoughts on that?
>>>>>> 
>>>>>> Best regards,
>>>>>> Benedikt
>>> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29  9:07               ` Dr. Philipp Tomsich
@ 2015-06-29  9:22                 ` Kumar, Venkataramanan
  2015-06-29 11:44                   ` James Greenhalgh
  2015-07-14 22:20                 ` Evandro Menezes
  1 sibling, 1 reply; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-06-29  9:22 UTC (permalink / raw)
  To: Dr. Philipp Tomsich, pinskia,
	James Greenhalgh (james.greenhalgh@arm.com)
  Cc: Benedikt Huber, gcc-patches, Marcus Shawcroft (marcus.shawcroft@arm.com)


> -----Original Message-----
> From: Dr. Philipp Tomsich [mailto:philipp.tomsich@theobroma-systems.com]
> Sent: Monday, June 29, 2015 2:17 PM
> To: Kumar, Venkataramanan
> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Kumar,
> 
> This does not come unexpected, as the initial estimation and each iteration
> will add an architecturally-defined number of bits of precision (ARMv8
> guarantuees only a minimum number of bits provided per operation… the
> exact number is specific to each micro-arch, though).
> Depending on your architecture and on the required number of precise bits
> by any given benchmark, one may see miscompares.

True.  
 

> 
> Do you know the exact number of bits that the initial estimate and the
> subsequent refinement steps add for your micro-arch?

I am not sure on this. I  need to check for cortex-a57 case.  

But best thing for "cortex-a57" case is not to use this optimization by default. 
If we get an -mrecip-sqrt command line , then we can add it for "gromacs" kind application to get gains.

Any thoughts on this?
    
Regards,
Venkat.
 

> 
> Thanks,
> Philipp.
> 
> > On 29 Jun 2015, at 10:17, Kumar, Venkataramanan
> <Venkataramanan.Kumar@amd.com> wrote:
> >
> >
> > Hmm,  Reducing the iterations to "1 step for float" and "2 steps for double"
> >
> > I got VE (miscompares) on following benchmarks 416.gamess
> > 453.povray
> > 454.calculix
> > 459.GemsFDTD
> >
> > Benedikt , I have ICE for 444.namd with your patch,  not sure if something
> wrong in my local tree.
> >
> > Regards,
> > Venkat.
> >
> >> -----Original Message-----
> >> From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> >> Sent: Sunday, June 28, 2015 8:35 PM
> >> To: Kumar, Venkataramanan
> >> Cc: Dr. Philipp Tomsich; Benedikt Huber; gcc-patches@gcc.gnu.org
> >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >> (rsqrt) estimation in -ffast-math
> >>
> >>
> >>
> >>
> >>
> >>> On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan
> >> <Venkataramanan.Kumar@amd.com> wrote:
> >>>
> >>> I got around ~12% gain with -Ofast -mcpu=cortex-a57.
> >>
> >> I get around 11/12% on thunderX with the patch and the decreasing the
> >> iterations change (1/2) compared to without the patch.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>
> >>>
> >>> Regards,
> >>> Venkat.
> >>>
> >>>> -----Original Message-----
> >>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >>>> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
> >>>> Sent: Thursday, June 25, 2015 9:13 PM
> >>>> To: Kumar, Venkataramanan
> >>>> Cc: Benedikt Huber; pinskia@gmail.com; gcc-patches@gcc.gnu.org
> >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>> (rsqrt) estimation in -ffast-math
> >>>>
> >>>> Kumar,
> >>>>
> >>>> what is the relative gain that you see on Cortex-A57?
> >>>>
> >>>> Thanks,
> >>>> Philipp.
> >>>>
> >>>>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
> >>>>> <Venkataramanan.Kumar@amd.com> wrote:
> >>>>>
> >>>>> Changing to  "1 step for float" and "2 steps for double" gives
> >>>>> better gains
> >>>> now for gromacs on cortex-a57.
> >>>>>
> >>>>> Regards,
> >>>>> Venkat.
> >>>>>> -----Original Message-----
> >>>>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >>>>>> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> >>>>>> Sent: Thursday, June 25, 2015 4:09 PM
> >>>>>> To: pinskia@gmail.com
> >>>>>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-
> >> systems.com
> >>>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>>>> (rsqrt) estimation in -ffast-math
> >>>>>>
> >>>>>> Andrew,
> >>>>>>
> >>>>>>> This is NOT a win on thunderX at least for single precision
> >>>>>>> because you have
> >>>>>> to do the divide and sqrt in the same time as it takes 5
> >>>>>> multiples (estimate and step are multiplies in the thunderX pipeline).
> >>>>>> Doubles is 10 multiplies which is just the same as what the patch
> >>>>>> does (but it is really slightly less than 10, I rounded up). So
> >>>>>> in the end this is NOT a win at all for thunderX unless we do one
> >>>>>> less step for both single
> >>>> and double.
> >>>>>>
> >>>>>> Yes, the expected benefit from rsqrt estimation is implementation
> >>>>>> specific. If one has a better initial rsqrte or an application
> >>>>>> that can trade precision for execution time, we could offer a
> >>>>>> command line option to do only 2 steps for doulbe and 1 step for
> >>>>>> float; similar to -
> >>>> mrecip-precision for PowerPC.
> >>>>>> What are your thoughts on that?
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Benedikt
> >>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29  9:22                 ` Kumar, Venkataramanan
@ 2015-06-29 11:44                   ` James Greenhalgh
  2015-06-29 11:56                     ` Dr. Philipp Tomsich
  0 siblings, 1 reply; 36+ messages in thread
From: James Greenhalgh @ 2015-06-29 11:44 UTC (permalink / raw)
  To: Kumar, Venkataramanan
  Cc: philipp.tomsich, pinskia, Benedikt Huber, gcc-patches,
	Marcus Shawcroft, Ramana Radhakrishnan, Richard Earnshaw

On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> 
> > -----Original Message-----
> > From: Dr. Philipp Tomsich [mailto:philipp.tomsich@theobroma-systems.com]
> > Sent: Monday, June 29, 2015 2:17 PM
> > To: Kumar, Venkataramanan
> > Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> > estimation in -ffast-math
> > 
> > Kumar,
> > 
> > This does not come unexpected, as the initial estimation and each iteration
> > will add an architecturally-defined number of bits of precision (ARMv8
> > guarantuees only a minimum number of bits provided per operation… the
> > exact number is specific to each micro-arch, though).
> > Depending on your architecture and on the required number of precise bits
> > by any given benchmark, one may see miscompares.
> 
> True.  

I would be very uncomfortable with this approach.

From Richard Biener's post in the thread Michael Matz linked earlier
in the thread:

    It would follow existing practice of things we allow in
    -funsafe-math-optimizations.  Existing practice in that we
    want to allow -ffast-math use with common benchmarks we care
    about.

    https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html

With the solution you seem to be converging on (2-steps for some
microarchitectures, 3 for others), a binary generated for one micro-arch
may drop below a minimum guarantee of precision when run on another. This
seems to go against the spirit of the practice above. I would only support
adding this optimization to -Ofast if we could keep to architectural
guarantees of precision in the generated code (i.e. 3-steps everywhere).

I don't object to adding a "-mlow-precision-recip-sqrt" style option,
which would be off by default, would enable the 2-step mode, and would
need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I don't
see what this buys you beyond the Gromacs boost (and even there you would
be creating an Invalid Run as optimization flags must be applied across
all workloads). 

For the 3-step optimization, it is clear to me that for "generic" tuning
we don't want this to be enabled by default experimental results and advice
in this thread argues against it for thunderx and cortex-a57 targets.
However, enabling it based on the CPU tuning selected seems fine to me.

Thanks,
James

> > 
> > Do you know the exact number of bits that the initial estimate and the
> > subsequent refinement steps add for your micro-arch?
> 
> I am not sure on this. I  need to check for cortex-a57 case.  
> 
> But best thing for "cortex-a57" case is not to use this optimization by default. 
> If we get an -mrecip-sqrt command line , then we can add it for "gromacs" kind application to get gains.
> 
> Any thoughts on this?
>     
> Regards,
> Venkat.
>  
> 
> > 
> > Thanks,
> > Philipp.
> > 
> > > On 29 Jun 2015, at 10:17, Kumar, Venkataramanan
> > <Venkataramanan.Kumar@amd.com> wrote:
> > >
> > >
> > > Hmm,  Reducing the iterations to "1 step for float" and "2 steps for double"
> > >
> > > I got VE (miscompares) on following benchmarks 416.gamess
> > > 453.povray
> > > 454.calculix
> > > 459.GemsFDTD
> > >
> > > Benedikt , I have ICE for 444.namd with your patch,  not sure if something
> > wrong in my local tree.
> > >
> > > Regards,
> > > Venkat.
> > >
> > >> -----Original Message-----
> > >> From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> > >> Sent: Sunday, June 28, 2015 8:35 PM
> > >> To: Kumar, Venkataramanan
> > >> Cc: Dr. Philipp Tomsich; Benedikt Huber; gcc-patches@gcc.gnu.org
> > >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > >> (rsqrt) estimation in -ffast-math
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>> On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan
> > >> <Venkataramanan.Kumar@amd.com> wrote:
> > >>>
> > >>> I got around ~12% gain with -Ofast -mcpu=cortex-a57.
> > >>
> > >> I get around 11/12% on thunderX with the patch and the decreasing the
> > >> iterations change (1/2) compared to without the patch.
> > >>
> > >> Thanks,
> > >> Andrew
> > >>
> > >>
> > >>>
> > >>> Regards,
> > >>> Venkat.
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> > >>>> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
> > >>>> Sent: Thursday, June 25, 2015 9:13 PM
> > >>>> To: Kumar, Venkataramanan
> > >>>> Cc: Benedikt Huber; pinskia@gmail.com; gcc-patches@gcc.gnu.org
> > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > >>>> (rsqrt) estimation in -ffast-math
> > >>>>
> > >>>> Kumar,
> > >>>>
> > >>>> what is the relative gain that you see on Cortex-A57?
> > >>>>
> > >>>> Thanks,
> > >>>> Philipp.
> > >>>>
> > >>>>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
> > >>>>> <Venkataramanan.Kumar@amd.com> wrote:
> > >>>>>
> > >>>>> Changing to  "1 step for float" and "2 steps for double" gives
> > >>>>> better gains
> > >>>> now for gromacs on cortex-a57.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Venkat.
> > >>>>>> -----Original Message-----
> > >>>>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> > >>>>>> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> > >>>>>> Sent: Thursday, June 25, 2015 4:09 PM
> > >>>>>> To: pinskia@gmail.com
> > >>>>>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-
> > >> systems.com
> > >>>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > >>>>>> (rsqrt) estimation in -ffast-math
> > >>>>>>
> > >>>>>> Andrew,
> > >>>>>>
> > >>>>>>> This is NOT a win on thunderX at least for single precision
> > >>>>>>> because you have
> > >>>>>> to do the divide and sqrt in the same time as it takes 5
> > >>>>>> multiples (estimate and step are multiplies in the thunderX pipeline).
> > >>>>>> Doubles is 10 multiplies which is just the same as what the patch
> > >>>>>> does (but it is really slightly less than 10, I rounded up). So
> > >>>>>> in the end this is NOT a win at all for thunderX unless we do one
> > >>>>>> less step for both single
> > >>>> and double.
> > >>>>>>
> > >>>>>> Yes, the expected benefit from rsqrt estimation is implementation
> > >>>>>> specific. If one has a better initial rsqrte or an application
> > >>>>>> that can trade precision for execution time, we could offer a
> > >>>>>> command line option to do only 2 steps for doulbe and 1 step for
> > >>>>>> float; similar to -
> > >>>> mrecip-precision for PowerPC.
> > >>>>>> What are your thoughts on that?
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>> Benedikt
> > >>>
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29 11:44                   ` James Greenhalgh
@ 2015-06-29 11:56                     ` Dr. Philipp Tomsich
  2015-06-29 16:57                       ` pinskia
  2015-07-13 19:09                       ` Evandro Menezes
  0 siblings, 2 replies; 36+ messages in thread
From: Dr. Philipp Tomsich @ 2015-06-29 11:56 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Kumar, Venkataramanan, pinskia, Benedikt Huber, gcc-patches,
	Marcus Shawcroft, Ramana Radhakrishnan, Richard Earnshaw

James,

On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
>> 
>>> -----Original Message-----
>>> From: Dr. Philipp Tomsich [mailto:philipp.tomsich@theobroma-systems.com]
>>> Sent: Monday, June 29, 2015 2:17 PM
>>> To: Kumar, Venkataramanan
>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>>> estimation in -ffast-math
>>> 
>>> Kumar,
>>> 
>>> This does not come unexpected, as the initial estimation and each iteration
>>> will add an architecturally-defined number of bits of precision (ARMv8
>>> guarantuees only a minimum number of bits provided per operation… the
>>> exact number is specific to each micro-arch, though).
>>> Depending on your architecture and on the required number of precise bits
>>> by any given benchmark, one may see miscompares.
>> 
>> True.  
> 
> I would be very uncomfortable with this approach.

Same here. The default must be safe. Always.
Unlike other architectures, we don’t have a problem with making the proper
defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of
precise bits per iteration.

> From Richard Biener's post in the thread Michael Matz linked earlier
> in the thread:
> 
>    It would follow existing practice of things we allow in
>    -funsafe-math-optimizations.  Existing practice in that we
>    want to allow -ffast-math use with common benchmarks we care
>    about.
> 
>    https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> 
> With the solution you seem to be converging on (2-steps for some
> microarchitectures, 3 for others), a binary generated for one micro-arch
> may drop below a minimum guarantee of precision when run on another. This
> seems to go against the spirit of the practice above. I would only support
> adding this optimization to -Ofast if we could keep to architectural
> guarantees of precision in the generated code (i.e. 3-steps everywhere).
> 
> I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> which would be off by default, would enable the 2-step mode, and would
> need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I don't
> see what this buys you beyond the Gromacs boost (and even there you would
> be creating an Invalid Run as optimization flags must be applied across
> all workloads). 

Any flag that reduces precision (and thus breaks IEEE floating-point semantics)
needs to be gated with an “unsafe” flag (i.e. one that is never on by default).
As a consequence, the “peak”-tuning for SPEC will turn this on… but barely 
anyone else would.

> For the 3-step optimization, it is clear to me that for "generic" tuning
> we don't want this to be enabled by default experimental results and advice
> in this thread argues against it for thunderx and cortex-a57 targets.
> However, enabling it based on the CPU tuning selected seems fine to me.

I do not agree on this one, as I would like to see the safe form (i.e. 3 and 5
iterations respectively) to become the default. Most “server-type” chips
should not see a performance regression, while it will be easier to optimise for
this in hardware than for a (potentially microcoded) sqrt-instruction (and 
subsequent, dependent divide).

I have not heard anyone claim a performance regression (either on thunderx
or on cortex-a57), but merely heard a “no speed-up”.

So I am strongly in favor of defaulting to the ‘safe’ number of iterations, even
when compiling for a generic target.

Best,
Philipp.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29  8:30             ` Kumar, Venkataramanan
  2015-06-29  9:07               ` Dr. Philipp Tomsich
@ 2015-06-29 14:20               ` Benedikt Huber
  2015-06-29 17:35               ` Benedikt Huber
  2 siblings, 0 replies; 36+ messages in thread
From: Benedikt Huber @ 2015-06-29 14:20 UTC (permalink / raw)
  To: Kumar, Venkataramanan; +Cc: pinskia, Dr. Philipp Tomsich, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 513 bytes --]

> Hmm,  Reducing the iterations to "1 step for float" and "2 steps for double"
> 
> I got VE (miscompares) on following benchmarks
> 416.gamess
> 453.povray
> 454.calculix
> 459.GemsFDTD
> 
> Benedikt , I have ICE for 444.namd with your patch,  not sure if something wrong in my local tree.

I could not reproduce the ICE for namd.
Could you provide the spec config entry or the command line options, please.
Which ICE did you get? Please provide the output.

Thank you and best regards,
Benedikt

[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29 11:56                     ` Dr. Philipp Tomsich
@ 2015-06-29 16:57                       ` pinskia
  2015-06-29 19:07                         ` Kumar, Venkataramanan
  2015-07-13 19:09                       ` Evandro Menezes
  1 sibling, 1 reply; 36+ messages in thread
From: pinskia @ 2015-06-29 16:57 UTC (permalink / raw)
  To: Dr. Philipp Tomsich
  Cc: James Greenhalgh, Kumar, Venkataramanan, Benedikt Huber,
	gcc-patches, Marcus Shawcroft, Ramana Radhakrishnan,
	Richard Earnshaw





> On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich <philipp.tomsich@theobroma-systems.com> wrote:
> 
> James,
> 
>> On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>> 
>>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
>>> 
>>>> -----Original Message-----
>>>> From: Dr. Philipp Tomsich [mailto:philipp.tomsich@theobroma-systems.com]
>>>> Sent: Monday, June 29, 2015 2:17 PM
>>>> To: Kumar, Venkataramanan
>>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>>>> estimation in -ffast-math
>>>> 
>>>> Kumar,
>>>> 
>>>> This does not come unexpected, as the initial estimation and each iteration
>>>> will add an architecturally-defined number of bits of precision (ARMv8
>>>> guarantuees only a minimum number of bits provided per operation… the
>>>> exact number is specific to each micro-arch, though).
>>>> Depending on your architecture and on the required number of precise bits
>>>> by any given benchmark, one may see miscompares.
>>> 
>>> True.  
>> 
>> I would be very uncomfortable with this approach.
> 
> Same here. The default must be safe. Always.
> Unlike other architectures, we don’t have a problem with making the proper
> defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
> 
>> From Richard Biener's post in the thread Michael Matz linked earlier
>> in the thread:
>> 
>>   It would follow existing practice of things we allow in
>>   -funsafe-math-optimizations.  Existing practice in that we
>>   want to allow -ffast-math use with common benchmarks we care
>>   about.
>> 
>>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
>> 
>> With the solution you seem to be converging on (2-steps for some
>> microarchitectures, 3 for others), a binary generated for one micro-arch
>> may drop below a minimum guarantee of precision when run on another. This
>> seems to go against the spirit of the practice above. I would only support
>> adding this optimization to -Ofast if we could keep to architectural
>> guarantees of precision in the generated code (i.e. 3-steps everywhere).
>> 
>> I don't object to adding a "-mlow-precision-recip-sqrt" style option,
>> which would be off by default, would enable the 2-step mode, and would
>> need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I don't
>> see what this buys you beyond the Gromacs boost (and even there you would
>> be creating an Invalid Run as optimization flags must be applied across
>> all workloads).
> 
> Any flag that reduces precision (and thus breaks IEEE floating-point semantics)
> needs to be gated with an “unsafe” flag (i.e. one that is never on by default).
> As a consequence, the “peak”-tuning for SPEC will turn this on… but barely 
> anyone else would.
> 
>> For the 3-step optimization, it is clear to me that for "generic" tuning
>> we don't want this to be enabled by default experimental results and advice
>> in this thread argues against it for thunderx and cortex-a57 targets.
>> However, enabling it based on the CPU tuning selected seems fine to me.
> 
> I do not agree on this one, as I would like to see the safe form (i.e. 3 and 5
> iterations respectively) to become the default. Most “server-type” chips
> should not see a performance regression, while it will be easier to optimise for
> this in hardware than for a (potentially microcoded) sqrt-instruction (and 
> subsequent, dependent divide).
> 
> I have not heard anyone claim a performance regression (either on thunderx
> or on cortex-a57), but merely heard a “no speed-up”.

Actually it does regress performance on thunderX, I just assumed that when I said not going to be a win it was taken as a slow down. It regress gromacs by more than 10% on thunderX but I can't remember how much as i had someone else run it. The latency difference is also over 40%; for example single precision: 29 cycles with div (12) sqrt(17) directly vs 42 cycles with the rsqrte and 2 iterations of 2mul/rsqrts (double is 53 vs 60). That is huge difference right there.  ThunderX has a fast div and a fast sqrt for 32bit and a reasonable one for double.   So again this is not just not a win but rather a regression for thunderX. I suspect cortex-a57 is also true. 

Thanks,
Andrew

> 
> So I am strongly in favor of defaulting to the ‘safe’ number of iterations, even
> when compiling for a generic target.
> 
> Best,
> Philipp.
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29  8:30             ` Kumar, Venkataramanan
  2015-06-29  9:07               ` Dr. Philipp Tomsich
  2015-06-29 14:20               ` Benedikt Huber
@ 2015-06-29 17:35               ` Benedikt Huber
  2015-06-29 17:44                 ` Kumar, Venkataramanan
  2 siblings, 1 reply; 36+ messages in thread
From: Benedikt Huber @ 2015-06-29 17:35 UTC (permalink / raw)
  To: Kumar, Venkataramanan; +Cc: pinskia, Dr. Philipp Tomsich, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 270 bytes --]

> 
> Benedikt , I have ICE for 444.namd with your patch,  not sure if something wrong in my local tree.

Venkat, now I could reproduce it.
Strangely it does not happen with -flto.
I will try to find out the reason for that.

Thank you for the catch,
Benedikt


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29 17:35               ` Benedikt Huber
@ 2015-06-29 17:44                 ` Kumar, Venkataramanan
  0 siblings, 0 replies; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-06-29 17:44 UTC (permalink / raw)
  To: Benedikt Huber; +Cc: pinskia, Dr. Philipp Tomsich, gcc-patches

Hi,

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> Sent: Monday, June 29, 2015 11:04 PM
> To: Kumar, Venkataramanan
> Cc: pinskia@gmail.com; Dr. Philipp Tomsich; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> >
> > Benedikt , I have ICE for 444.namd with your patch,  not sure if something
> wrong in my local tree.
> 
> Venkat, now I could reproduce it.
> Strangely it does not happen with -flto.
> I will try to find out the reason for that.
> 
Sure,  I kind of remember that RTX for target in builtin rsqrt was 0.  But not sure if we need to create reg rtx in such cases.
 

> Thank you for the catch,
> Benedikt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29 16:57                       ` pinskia
@ 2015-06-29 19:07                         ` Kumar, Venkataramanan
  2015-07-14 22:26                           ` Evandro Menezes
  0 siblings, 1 reply; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-06-29 19:07 UTC (permalink / raw)
  To: pinskia, Dr. Philipp Tomsich
  Cc: James Greenhalgh, Benedikt Huber, gcc-patches, Marcus Shawcroft,
	Ramana Radhakrishnan, Richard Earnshaw

Hi,

> -----Original Message-----
> From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> Sent: Monday, June 29, 2015 10:23 PM
> To: Dr. Philipp Tomsich
> Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber; gcc-
> patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard
> Earnshaw
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> 
> 
> 
> 
> > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> <philipp.tomsich@theobroma-systems.com> wrote:
> >
> > James,
> >
> >> On 29 Jun 2015, at 13:36, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
> >>
> >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan
> wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: Dr. Philipp Tomsich
> >>>> [mailto:philipp.tomsich@theobroma-systems.com]
> >>>> Sent: Monday, June 29, 2015 2:17 PM
> >>>> To: Kumar, Venkataramanan
> >>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>> (rsqrt) estimation in -ffast-math
> >>>>
> >>>> Kumar,
> >>>>
> >>>> This does not come unexpected, as the initial estimation and each
> >>>> iteration will add an architecturally-defined number of bits of
> >>>> precision (ARMv8 guarantuees only a minimum number of bits
> provided
> >>>> per operation… the exact number is specific to each micro-arch,
> though).
> >>>> Depending on your architecture and on the required number of
> >>>> precise bits by any given benchmark, one may see miscompares.
> >>>
> >>> True.
> >>
> >> I would be very uncomfortable with this approach.
> >
> > Same here. The default must be safe. Always.
> > Unlike other architectures, we don’t have a problem with making the
> > proper defaults for “safety”, as the ARMv8 ISA guarantees a minimum
> > number of precise bits per iteration.
> >
> >> From Richard Biener's post in the thread Michael Matz linked earlier
> >> in the thread:
> >>
> >>   It would follow existing practice of things we allow in
> >>   -funsafe-math-optimizations.  Existing practice in that we
> >>   want to allow -ffast-math use with common benchmarks we care
> >>   about.
> >>
> >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> >>
> >> With the solution you seem to be converging on (2-steps for some
> >> microarchitectures, 3 for others), a binary generated for one
> >> micro-arch may drop below a minimum guarantee of precision when run
> >> on another. This seems to go against the spirit of the practice
> >> above. I would only support adding this optimization to -Ofast if we
> >> could keep to architectural guarantees of precision in the generated code
> (i.e. 3-steps everywhere).
> >>
> >> I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> >> which would be off by default, would enable the 2-step mode, and
> >> would need to be explicitly enabled (i.e. not implied by -mcpu=foo)
> >> but I don't see what this buys you beyond the Gromacs boost (and even
> >> there you would be creating an Invalid Run as optimization flags must
> >> be applied across all workloads).
> >
> > Any flag that reduces precision (and thus breaks IEEE floating-point
> > semantics) needs to be gated with an “unsafe” flag (i.e. one that is never
> on by default).
> > As a consequence, the “peak”-tuning for SPEC will turn this on… but
> > barely anyone else would.
> >
> >> For the 3-step optimization, it is clear to me that for "generic"
> >> tuning we don't want this to be enabled by default experimental
> >> results and advice in this thread argues against it for thunderx and cortex-
> a57 targets.
> >> However, enabling it based on the CPU tuning selected seems fine to me.
> >
> > I do not agree on this one, as I would like to see the safe form (i.e.
> > 3 and 5 iterations respectively) to become the default. Most
> > “server-type” chips should not see a performance regression, while it
> > will be easier to optimise for this in hardware than for a
> > (potentially microcoded) sqrt-instruction (and subsequent, dependent
> divide).
> >
> > I have not heard anyone claim a performance regression (either on
> > thunderx or on cortex-a57), but merely heard a “no speed-up”.
> 
> Actually it does regress performance on thunderX, I just assumed that when
> I said not going to be a win it was taken as a slow down. It regress gromacs by
> more than 10% on thunderX but I can't remember how much as i had
> someone else run it. The latency difference is also over 40%; for example
> single precision: 29 cycles with div (12) sqrt(17) directly vs 42 cycles with the
> rsqrte and 2 iterations of 2mul/rsqrts (double is 53 vs 60). That is huge
> difference right there.  ThunderX has a fast div and a fast sqrt for 32bit and a
> reasonable one for double.   So again this is not just not a win but rather a
> regression for thunderX. I suspect cortex-a57 is also true.
> 
> Thanks,
> Andrew
> 

Yes theoretically  should be  true for cortex-57 case as well.   But  I believe hardware pipelining with instruction scheduling in compiler helps a little for gromacs case  ~3% to 4% with the original patch.

I have not tested other FP benchmarks.   As James said a flag -mlow-precision-recip-sqrt if allowed can be used as a peak flag. 

> >
> > So I am strongly in favor of defaulting to the ‘safe’ number of
> > iterations, even when compiling for a generic target.
> >
> > Best,
> > Philipp.
> >

Regards,
Venkat.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29 11:56                     ` Dr. Philipp Tomsich
  2015-06-29 16:57                       ` pinskia
@ 2015-07-13 19:09                       ` Evandro Menezes
  1 sibling, 0 replies; 36+ messages in thread
From: Evandro Menezes @ 2015-07-13 19:09 UTC (permalink / raw)
  To: 'Dr. Philipp Tomsich', 'James Greenhalgh'
  Cc: 'Kumar, Venkataramanan',
	pinskia, 'Benedikt Huber',
	gcc-patches, 'Marcus Shawcroft',
	'Ramana Radhakrishnan', 'Richard Earnshaw'

FWIW, I was curious about the precision of the results using such instructions for the standard sqrt{,f} functions.  This is not a wide sample, but it does point to a floor of series iterations to 3 for DP and 2 for SP:

x               sqrt(x)         1 Step (ulps)           2 Steps (ulps)          3 Steps (ulps)
2.2251e-308     1.4917e-154     1.4917e-154 (999)       1.4917e-154 (999)       1.4917e-154 (000)
1.6022e-19      4.0027e-10      4.0027e-10 (999)        4.0027e-10 (999)        4.0027e-10 (000)
1.0000e+00      1.0000e+00      1.0000e+00 (001)        1.0000e+00 (001)        1.0000e+00 (001)
1.0000e+00      1.0000e+00      9.9999e-01 (999)        1.0000e+00 (999)        1.0000e+00 (000)
1.0000e+00      1.0000e+00      9.9999e-01 (999)        1.0000e+00 (999)        1.0000e+00 (000)
2.0000e+00      1.4142e+00      1.4142e+00 (999)        1.4142e+00 (999)        1.4142e+00 (000)
2.2500e+00      1.5000e+00      1.5000e+00 (999)        1.5000e+00 (999)        1.5000e+00 (000)
2.5600e+00      1.6000e+00      1.6000e+00 (000)        1.6000e+00 (000)        1.6000e+00 (000)
3.1416e+00      1.7725e+00      1.7725e+00 (999)        1.7725e+00 (999)        1.7725e+00 (000)
6.0221e+23      7.7602e+11      7.7602e+11 (999)        7.7602e+11 (999)        7.7602e+11 (000)
1.7977e+308     1.3408e+154     1.3408e+154 (000)       1.3408e+154 (000)       1.3408e+154 (000)

x               sqrtf(x)        1 Step (ulps)           2 Steps (ulps)          3 Steps (ulps)
1.1755e-38      1.0842e-19      1.0842e-19 (096)        1.0842e-19 (000)        1.0842e-19 (000)
1.6022e-19      4.0027e-10      4.0027e-10 (008)        4.0027e-10 (000)        4.0027e-10 (000)
1.0000e+00      1.0000e+00      1.0000e+00 (001)        1.0000e+00 (001)        1.0000e+00 (001)
1.0000e+00      1.0000e+00      9.9999e-01 (096)        1.0000e+00 (000)        1.0000e+00 (000)
1.0000e+00      1.0000e+00      9.9999e-01 (094)        1.0000e+00 (001)        1.0000e+00 (000)
2.0000e+00      1.4142e+00      1.4142e+00 (146)        1.4142e+00 (001)        1.4142e+00 (000)
2.2500e+00      1.5000e+00      1.5000e+00 (018)        1.5000e+00 (000)        1.5000e+00 (001)
2.5600e+00      1.6000e+00      1.6000e+00 (001)        1.6000e+00 (001)        1.6000e+00 (001)
3.1416e+00      1.7725e+00      1.7725e+00 (006)        1.7725e+00 (001)        1.7725e+00 (001)
6.0221e+23      7.7602e+11      7.7602e+11 (069)        7.7602e+11 (001)        7.7602e+11 (000)
3.4028e+38      1.8447e+19      1.8447e+19 (000)        1.8447e+19 (000)        1.8447e+19 (000)

The error in ULPs saturates at 999 above.

The result of having to use so many iterations to achieve accuracy would defeat using the Newton series, as it would likely be slower than the FSQRT instruction.

Unlike in x86, I have the impression that the initial estimate in AArch64 is meant to be used in applications that do not require precision, like graphics, etc.  Then, a single series iteration for SP would perhaps be good enough.

-- 
Evandro Menezes                              Austin, TX


> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Monday, June 29, 2015 6:45
> To: James Greenhalgh
> Cc: Kumar, Venkataramanan; pinskia@gmail.com; Benedikt Huber; gcc-
> patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> James,
> 
> On 29 Jun 2015, at 13:36, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> >
> > On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> >>
> >>> -----Original Message-----
> >>> From: Dr. Philipp Tomsich
> >>> [mailto:philipp.tomsich@theobroma-systems.com]
> >>> Sent: Monday, June 29, 2015 2:17 PM
> >>> To: Kumar, Venkataramanan
> >>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> >>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>> (rsqrt) estimation in -ffast-math
> >>>
> >>> Kumar,
> >>>
> >>> This does not come unexpected, as the initial estimation and each
> >>> iteration will add an architecturally-defined number of bits of
> >>> precision (ARMv8 guarantuees only a minimum number of bits provided
> >>> per operation… the exact number is specific to each micro-arch, though).
> >>> Depending on your architecture and on the required number of precise
> >>> bits by any given benchmark, one may see miscompares.
> >>
> >> True.
> >
> > I would be very uncomfortable with this approach.
> 
> Same here. The default must be safe. Always.
> Unlike other architectures, we don’t have a problem with making the proper
> defaults for “safety”, as the ARMv8 ISA guarantees a minimum number of
> precise bits per iteration.
> 
> > From Richard Biener's post in the thread Michael Matz linked earlier
> > in the thread:
> >
> >    It would follow existing practice of things we allow in
> >    -funsafe-math-optimizations.  Existing practice in that we
> >    want to allow -ffast-math use with common benchmarks we care
> >    about.
> >
> >    https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> >
> > With the solution you seem to be converging on (2-steps for some
> > microarchitectures, 3 for others), a binary generated for one
> > micro-arch may drop below a minimum guarantee of precision when run on
> > another. This seems to go against the spirit of the practice above. I
> > would only support adding this optimization to -Ofast if we could keep
> > to architectural guarantees of precision in the generated code (i.e. 3-
> steps everywhere).
> >
> > I don't object to adding a "-mlow-precision-recip-sqrt" style option,
> > which would be off by default, would enable the 2-step mode, and would
> > need to be explicitly enabled (i.e. not implied by -mcpu=foo) but I
> > don't see what this buys you beyond the Gromacs boost (and even there
> > you would be creating an Invalid Run as optimization flags must be
> > applied across all workloads).
> 
> Any flag that reduces precision (and thus breaks IEEE floating-point
> semantics) needs to be gated with an “unsafe” flag (i.e. one that is never on
> by default).
> As a consequence, the “peak”-tuning for SPEC will turn this on… but barely
> anyone else would.
> 
> > For the 3-step optimization, it is clear to me that for "generic"
> > tuning we don't want this to be enabled by default experimental
> > results and advice in this thread argues against it for thunderx and
> cortex-a57 targets.
> > However, enabling it based on the CPU tuning selected seems fine to me.
> 
> I do not agree on this one, as I would like to see the safe form (i.e. 3 and
> 5 iterations respectively) to become the default. Most “server-type” chips
> should not see a performance regression, while it will be easier to optimise
> for this in hardware than for a (potentially microcoded) sqrt-instruction
> (and subsequent, dependent divide).
> 
> I have not heard anyone claim a performance regression (either on thunderx or
> on cortex-a57), but merely heard a “no speed-up”.
> 
> So I am strongly in favor of defaulting to the ‘safe’ number of iterations,
> even when compiling for a generic target.
> 
> Best,
> Philipp.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29  9:07               ` Dr. Philipp Tomsich
  2015-06-29  9:22                 ` Kumar, Venkataramanan
@ 2015-07-14 22:20                 ` Evandro Menezes
  1 sibling, 0 replies; 36+ messages in thread
From: Evandro Menezes @ 2015-07-14 22:20 UTC (permalink / raw)
  To: 'Dr. Philipp Tomsich', 'Kumar, Venkataramanan'
  Cc: pinskia, 'Benedikt Huber', gcc-patches

For both FRECPE and FRSQRTE the ARMv8 ISA guide states in their pseudo-code that:

"Result is double-precision and a multiple of 1/256 in the range 1 to 511/256."

This suggests that the estimate is merely 8 bits long.

IIRC, x86 returns 12 bits for its equivalent insns, requiring then a single series iteration for both SP and DP to achieve a precise enough result.

-- 
Evandro Menezes                              Austin, TX


> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Monday, June 29, 2015 3:47
> To: Kumar, Venkataramanan
> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Kumar,
> 
> This does not come unexpected, as the initial estimation and each iteration
> will add an architecturally-defined number of bits of precision (ARMv8
> guarantuees only a minimum number of bits provided per operation… the exact
> number is specific to each micro-arch, though).
> Depending on your architecture and on the required number of precise bits by
> any given benchmark, one may see miscompares.
> 
> Do you know the exact number of bits that the initial estimate and the
> subsequent refinement steps add for your micro-arch?
> 
> Thanks,
> Philipp.
> 
> > On 29 Jun 2015, at 10:17, Kumar, Venkataramanan
> <Venkataramanan.Kumar@amd.com> wrote:
> >
> >
> > Hmm,  Reducing the iterations to "1 step for float" and "2 steps for
> double"
> >
> > I got VE (miscompares) on following benchmarks 416.gamess
> > 453.povray
> > 454.calculix
> > 459.GemsFDTD
> >
> > Benedikt , I have ICE for 444.namd with your patch,  not sure if something
> wrong in my local tree.
> >
> > Regards,
> > Venkat.
> >
> >> -----Original Message-----
> >> From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> >> Sent: Sunday, June 28, 2015 8:35 PM
> >> To: Kumar, Venkataramanan
> >> Cc: Dr. Philipp Tomsich; Benedikt Huber; gcc-patches@gcc.gnu.org
> >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >> (rsqrt) estimation in -ffast-math
> >>
> >>
> >>
> >>
> >>
> >>> On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan
> >> <Venkataramanan.Kumar@amd.com> wrote:
> >>>
> >>> I got around ~12% gain with -Ofast -mcpu=cortex-a57.
> >>
> >> I get around 11/12% on thunderX with the patch and the decreasing the
> >> iterations change (1/2) compared to without the patch.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>
> >>>
> >>> Regards,
> >>> Venkat.
> >>>
> >>>> -----Original Message-----
> >>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >>>> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich
> >>>> Sent: Thursday, June 25, 2015 9:13 PM
> >>>> To: Kumar, Venkataramanan
> >>>> Cc: Benedikt Huber; pinskia@gmail.com; gcc-patches@gcc.gnu.org
> >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>> (rsqrt) estimation in -ffast-math
> >>>>
> >>>> Kumar,
> >>>>
> >>>> what is the relative gain that you see on Cortex-A57?
> >>>>
> >>>> Thanks,
> >>>> Philipp.
> >>>>
> >>>>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
> >>>>> <Venkataramanan.Kumar@amd.com> wrote:
> >>>>>
> >>>>> Changing to  "1 step for float" and "2 steps for double" gives
> >>>>> better gains
> >>>> now for gromacs on cortex-a57.
> >>>>>
> >>>>> Regards,
> >>>>> Venkat.
> >>>>>> -----Original Message-----
> >>>>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >>>>>> owner@gcc.gnu.org] On Behalf Of Benedikt Huber
> >>>>>> Sent: Thursday, June 25, 2015 4:09 PM
> >>>>>> To: pinskia@gmail.com
> >>>>>> Cc: gcc-patches@gcc.gnu.org; philipp.tomsich@theobroma-
> >> systems.com
> >>>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>>>> (rsqrt) estimation in -ffast-math
> >>>>>>
> >>>>>> Andrew,
> >>>>>>
> >>>>>>> This is NOT a win on thunderX at least for single precision
> >>>>>>> because you have
> >>>>>> to do the divide and sqrt in the same time as it takes 5
> >>>>>> multiples (estimate and step are multiplies in the thunderX pipeline).
> >>>>>> Doubles is 10 multiplies which is just the same as what the patch
> >>>>>> does (but it is really slightly less than 10, I rounded up). So
> >>>>>> in the end this is NOT a win at all for thunderX unless we do one
> >>>>>> less step for both single
> >>>> and double.
> >>>>>>
> >>>>>> Yes, the expected benefit from rsqrt estimation is implementation
> >>>>>> specific. If one has a better initial rsqrte or an application
> >>>>>> that can trade precision for execution time, we could offer a
> >>>>>> command line option to do only 2 steps for doulbe and 1 step for
> >>>>>> float; similar to -
> >>>> mrecip-precision for PowerPC.
> >>>>>> What are your thoughts on that?
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Benedikt
> >>>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-06-29 19:07                         ` Kumar, Venkataramanan
@ 2015-07-14 22:26                           ` Evandro Menezes
  2015-07-20  9:46                             ` Kumar, Venkataramanan
  0 siblings, 1 reply; 36+ messages in thread
From: Evandro Menezes @ 2015-07-14 22:26 UTC (permalink / raw)
  To: 'Kumar, Venkataramanan', pinskia, 'Dr. Philipp Tomsich'
  Cc: 'James Greenhalgh', 'Benedikt Huber',
	gcc-patches, 'Marcus Shawcroft',
	'Ramana Radhakrishnan', 'Richard Earnshaw'

I ran a simple test on A57 rev. 0, looping a million times around sqrt{,f} and the respective series iterations with the values in the sequence 1..1000000 and got these results:

sqrt(x):        36593844/s      1/sqrt(x):      18283875/s
3 Steps:        47922557/s      3 Steps:        49005194/s

sqrtf(x):       143988480/s     1/sqrtf(x):     69516857/s
2 Steps:        78740157/s      2 Steps:        80385852/s

I'm a bit surprised that the 3-iteration series for DP is faster than sqrt(), but not that it's much faster for the reciprocal of sqrt().  As for SP, the 2-iteration series is faster only for the reciprocal for sqrtf().

There might still be some leg for this patch in real-world cases which I'd like to investigate.

-- 
Evandro Menezes                              Austin, TX


> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
> Behalf Of Kumar, Venkataramanan
> Sent: Monday, June 29, 2015 13:50
> To: pinskia@gmail.com; Dr. Philipp Tomsich
> Cc: James Greenhalgh; Benedikt Huber; gcc-patches@gcc.gnu.org; Marcus
> Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Hi,
> 
> > -----Original Message-----
> > From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> > Sent: Monday, June 29, 2015 10:23 PM
> > To: Dr. Philipp Tomsich
> > Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber; gcc-
> > patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan; Richard
> > Earnshaw
> > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > (rsqrt) estimation in -ffast-math
> >
> >
> >
> >
> >
> > > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> > <philipp.tomsich@theobroma-systems.com> wrote:
> > >
> > > James,
> > >
> > >> On 29 Jun 2015, at 13:36, James Greenhalgh
> > <james.greenhalgh@arm.com> wrote:
> > >>
> > >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan
> > wrote:
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Dr. Philipp Tomsich
> > >>>> [mailto:philipp.tomsich@theobroma-systems.com]
> > >>>> Sent: Monday, June 29, 2015 2:17 PM
> > >>>> To: Kumar, Venkataramanan
> > >>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > >>>> (rsqrt) estimation in -ffast-math
> > >>>>
> > >>>> Kumar,
> > >>>>
> > >>>> This does not come unexpected, as the initial estimation and each
> > >>>> iteration will add an architecturally-defined number of bits of
> > >>>> precision (ARMv8 guarantuees only a minimum number of bits
> > provided
> > >>>> per operation… the exact number is specific to each micro-arch,
> > though).
> > >>>> Depending on your architecture and on the required number of
> > >>>> precise bits by any given benchmark, one may see miscompares.
> > >>>
> > >>> True.
> > >>
> > >> I would be very uncomfortable with this approach.
> > >
> > > Same here. The default must be safe. Always.
> > > Unlike other architectures, we don’t have a problem with making the
> > > proper defaults for “safety”, as the ARMv8 ISA guarantees a minimum
> > > number of precise bits per iteration.
> > >
> > >> From Richard Biener's post in the thread Michael Matz linked
> > >> earlier in the thread:
> > >>
> > >>   It would follow existing practice of things we allow in
> > >>   -funsafe-math-optimizations.  Existing practice in that we
> > >>   want to allow -ffast-math use with common benchmarks we care
> > >>   about.
> > >>
> > >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> > >>
> > >> With the solution you seem to be converging on (2-steps for some
> > >> microarchitectures, 3 for others), a binary generated for one
> > >> micro-arch may drop below a minimum guarantee of precision when run
> > >> on another. This seems to go against the spirit of the practice
> > >> above. I would only support adding this optimization to -Ofast if
> > >> we could keep to architectural guarantees of precision in the
> > >> generated code
> > (i.e. 3-steps everywhere).
> > >>
> > >> I don't object to adding a "-mlow-precision-recip-sqrt" style
> > >> option, which would be off by default, would enable the 2-step
> > >> mode, and would need to be explicitly enabled (i.e. not implied by
> > >> -mcpu=foo) but I don't see what this buys you beyond the Gromacs
> > >> boost (and even there you would be creating an Invalid Run as
> > >> optimization flags must be applied across all workloads).
> > >
> > > Any flag that reduces precision (and thus breaks IEEE floating-point
> > > semantics) needs to be gated with an “unsafe” flag (i.e. one that is
> > > never
> > on by default).
> > > As a consequence, the “peak”-tuning for SPEC will turn this on… but
> > > barely anyone else would.
> > >
> > >> For the 3-step optimization, it is clear to me that for "generic"
> > >> tuning we don't want this to be enabled by default experimental
> > >> results and advice in this thread argues against it for thunderx
> > >> and cortex-
> > a57 targets.
> > >> However, enabling it based on the CPU tuning selected seems fine to me.
> > >
> > > I do not agree on this one, as I would like to see the safe form (i.e.
> > > 3 and 5 iterations respectively) to become the default. Most
> > > “server-type” chips should not see a performance regression, while
> > > it will be easier to optimise for this in hardware than for a
> > > (potentially microcoded) sqrt-instruction (and subsequent, dependent
> > divide).
> > >
> > > I have not heard anyone claim a performance regression (either on
> > > thunderx or on cortex-a57), but merely heard a “no speed-up”.
> >
> > Actually it does regress performance on thunderX, I just assumed that
> > when I said not going to be a win it was taken as a slow down. It
> > regress gromacs by more than 10% on thunderX but I can't remember how
> > much as i had someone else run it. The latency difference is also over
> > 40%; for example single precision: 29 cycles with div (12) sqrt(17)
> > directly vs 42 cycles with the rsqrte and 2 iterations of 2mul/rsqrts
> > (double is 53 vs 60). That is huge difference right there.  ThunderX has a
> fast div and a fast sqrt for 32bit and a
> > reasonable one for double.   So again this is not just not a win but rather
> a
> > regression for thunderX. I suspect cortex-a57 is also true.
> >
> > Thanks,
> > Andrew
> >
> 
> Yes theoretically  should be  true for cortex-57 case as well.   But  I
> believe hardware pipelining with instruction scheduling in compiler helps a
> little for gromacs case  ~3% to 4% with the original patch.
> 
> I have not tested other FP benchmarks.   As James said a flag -mlow-
> precision-recip-sqrt if allowed can be used as a peak flag.
> 
> > >
> > > So I am strongly in favor of defaulting to the ‘safe’ number of
> > > iterations, even when compiling for a generic target.
> > >
> > > Best,
> > > Philipp.
> > >
> 
> Regards,
> Venkat.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-07-14 22:26                           ` Evandro Menezes
@ 2015-07-20  9:46                             ` Kumar, Venkataramanan
  2015-07-20 15:58                               ` Evandro Menezes
  0 siblings, 1 reply; 36+ messages in thread
From: Kumar, Venkataramanan @ 2015-07-20  9:46 UTC (permalink / raw)
  To: Evandro Menezes, pinskia, 'Dr. Philipp Tomsich'
  Cc: 'James Greenhalgh', 'Benedikt Huber',
	gcc-patches, 'Marcus Shawcroft',
	'Ramana Radhakrishnan', 'Richard Earnshaw'

Hi, 

I missed your email and noticed it this week.

What does column 2  tests?  Are you trying to implement square roots  using reciprocal estimate and step? 

But reciprocal square root  using reciprocal estimate and (2 for fp 3 for dp) step seems  to be better that using fdiv and fsqrt in your case.   

Regards,
Venkat.

> -----Original Message-----
> From: Evandro Menezes [mailto:e.menezes@samsung.com]
> Sent: Wednesday, July 15, 2015 3:45 AM
> To: Kumar, Venkataramanan; pinskia@gmail.com; 'Dr. Philipp Tomsich'
> Cc: 'James Greenhalgh'; 'Benedikt Huber'; gcc-patches@gcc.gnu.org; 'Marcus
> Shawcroft'; 'Ramana Radhakrishnan'; 'Richard Earnshaw'
> Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> I ran a simple test on A57 rev. 0, looping a million times around sqrt{,f} and
> the respective series iterations with the values in the sequence 1..1000000
> and got these results:
> 
> sqrt(x):        36593844/s      1/sqrt(x):      18283875/s
> 3 Steps:        47922557/s      3 Steps:        49005194/s
> 
> sqrtf(x):       143988480/s     1/sqrtf(x):     69516857/s
> 2 Steps:        78740157/s      2 Steps:        80385852/s
> 
> I'm a bit surprised that the 3-iteration series for DP is faster than sqrt(), but
> not that it's much faster for the reciprocal of sqrt().  As for SP, the 2-iteration
> series is faster only for the reciprocal for sqrtf().
> 
> There might still be some leg for this patch in real-world cases which I'd like to
> investigate.
> 
> --
> Evandro Menezes                              Austin, TX
> 
> 
> > -----Original Message-----
> > From: gcc-patches-owner@gcc.gnu.org
> > [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Kumar,
> > Venkataramanan
> > Sent: Monday, June 29, 2015 13:50
> > To: pinskia@gmail.com; Dr. Philipp Tomsich
> > Cc: James Greenhalgh; Benedikt Huber; gcc-patches@gcc.gnu.org; Marcus
> > Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> > Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root
> > (rsqrt) estimation in -ffast-math
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> > > Sent: Monday, June 29, 2015 10:23 PM
> > > To: Dr. Philipp Tomsich
> > > Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber; gcc-
> > > patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan;
> Richard
> > > Earnshaw
> > > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > > (rsqrt) estimation in -ffast-math
> > >
> > >
> > >
> > >
> > >
> > > > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> > > <philipp.tomsich@theobroma-systems.com> wrote:
> > > >
> > > > James,
> > > >
> > > >> On 29 Jun 2015, at 13:36, James Greenhalgh
> > > <james.greenhalgh@arm.com> wrote:
> > > >>
> > > >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan
> > > wrote:
> > > >>>
> > > >>>> -----Original Message-----
> > > >>>> From: Dr. Philipp Tomsich
> > > >>>> [mailto:philipp.tomsich@theobroma-systems.com]
> > > >>>> Sent: Monday, June 29, 2015 2:17 PM
> > > >>>> To: Kumar, Venkataramanan
> > > >>>> Cc: pinskia@gmail.com; Benedikt Huber; gcc-patches@gcc.gnu.org
> > > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square
> > > >>>> root
> > > >>>> (rsqrt) estimation in -ffast-math
> > > >>>>
> > > >>>> Kumar,
> > > >>>>
> > > >>>> This does not come unexpected, as the initial estimation and
> > > >>>> each iteration will add an architecturally-defined number of
> > > >>>> bits of precision (ARMv8 guarantuees only a minimum number of
> > > >>>> bits
> > > provided
> > > >>>> per operation… the exact number is specific to each micro-arch,
> > > though).
> > > >>>> Depending on your architecture and on the required number of
> > > >>>> precise bits by any given benchmark, one may see miscompares.
> > > >>>
> > > >>> True.
> > > >>
> > > >> I would be very uncomfortable with this approach.
> > > >
> > > > Same here. The default must be safe. Always.
> > > > Unlike other architectures, we don’t have a problem with making
> > > > the proper defaults for “safety”, as the ARMv8 ISA guarantees a
> > > > minimum number of precise bits per iteration.
> > > >
> > > >> From Richard Biener's post in the thread Michael Matz linked
> > > >> earlier in the thread:
> > > >>
> > > >>   It would follow existing practice of things we allow in
> > > >>   -funsafe-math-optimizations.  Existing practice in that we
> > > >>   want to allow -ffast-math use with common benchmarks we care
> > > >>   about.
> > > >>
> > > >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> > > >>
> > > >> With the solution you seem to be converging on (2-steps for some
> > > >> microarchitectures, 3 for others), a binary generated for one
> > > >> micro-arch may drop below a minimum guarantee of precision when
> > > >> run on another. This seems to go against the spirit of the
> > > >> practice above. I would only support adding this optimization to
> > > >> -Ofast if we could keep to architectural guarantees of precision
> > > >> in the generated code
> > > (i.e. 3-steps everywhere).
> > > >>
> > > >> I don't object to adding a "-mlow-precision-recip-sqrt" style
> > > >> option, which would be off by default, would enable the 2-step
> > > >> mode, and would need to be explicitly enabled (i.e. not implied
> > > >> by
> > > >> -mcpu=foo) but I don't see what this buys you beyond the Gromacs
> > > >> boost (and even there you would be creating an Invalid Run as
> > > >> optimization flags must be applied across all workloads).
> > > >
> > > > Any flag that reduces precision (and thus breaks IEEE
> > > > floating-point
> > > > semantics) needs to be gated with an “unsafe” flag (i.e. one that
> > > > is never
> > > on by default).
> > > > As a consequence, the “peak”-tuning for SPEC will turn this on…
> > > > but barely anyone else would.
> > > >
> > > >> For the 3-step optimization, it is clear to me that for "generic"
> > > >> tuning we don't want this to be enabled by default experimental
> > > >> results and advice in this thread argues against it for thunderx
> > > >> and cortex-
> > > a57 targets.
> > > >> However, enabling it based on the CPU tuning selected seems fine to
> me.
> > > >
> > > > I do not agree on this one, as I would like to see the safe form (i.e.
> > > > 3 and 5 iterations respectively) to become the default. Most
> > > > “server-type” chips should not see a performance regression, while
> > > > it will be easier to optimise for this in hardware than for a
> > > > (potentially microcoded) sqrt-instruction (and subsequent,
> > > > dependent
> > > divide).
> > > >
> > > > I have not heard anyone claim a performance regression (either on
> > > > thunderx or on cortex-a57), but merely heard a “no speed-up”.
> > >
> > > Actually it does regress performance on thunderX, I just assumed
> > > that when I said not going to be a win it was taken as a slow down.
> > > It regress gromacs by more than 10% on thunderX but I can't remember
> > > how much as i had someone else run it. The latency difference is
> > > also over 40%; for example single precision: 29 cycles with div (12)
> > > sqrt(17) directly vs 42 cycles with the rsqrte and 2 iterations of
> > > 2mul/rsqrts (double is 53 vs 60). That is huge difference right
> > > there.  ThunderX has a
> > fast div and a fast sqrt for 32bit and a
> > > reasonable one for double.   So again this is not just not a win but rather
> > a
> > > regression for thunderX. I suspect cortex-a57 is also true.
> > >
> > > Thanks,
> > > Andrew
> > >
> >
> > Yes theoretically  should be  true for cortex-57 case as well.   But  I
> > believe hardware pipelining with instruction scheduling in compiler
> > helps a little for gromacs case  ~3% to 4% with the original patch.
> >
> > I have not tested other FP benchmarks.   As James said a flag -mlow-
> > precision-recip-sqrt if allowed can be used as a peak flag.
> >
> > > >
> > > > So I am strongly in favor of defaulting to the ‘safe’ number of
> > > > iterations, even when compiling for a generic target.
> > > >
> > > > Best,
> > > > Philipp.
> > > >
> >
> > Regards,
> > Venkat.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
  2015-07-20  9:46                             ` Kumar, Venkataramanan
@ 2015-07-20 15:58                               ` Evandro Menezes
  0 siblings, 0 replies; 36+ messages in thread
From: Evandro Menezes @ 2015-07-20 15:58 UTC (permalink / raw)
  To: 'Kumar, Venkataramanan', pinskia, 'Dr. Philipp Tomsich'
  Cc: 'James Greenhalgh', 'Benedikt Huber',
	gcc-patches, 'Marcus Shawcroft',
	'Ramana Radhakrishnan', 'Richard Earnshaw'

Hi, Venkat.

Since x^1/2 = x * x^-1/2, the Newton series can also be used for the regular square root with an extra multiplication, as it is done in x86.  That's what I was trying to estimate below.

Cheers,

-- 
Evandro Menezes                              Austin, TX

> -----Original Message-----
> From: Kumar, Venkataramanan [mailto:Venkataramanan.Kumar@amd.com]
> Sent: Monday, July 20, 2015 2:53
> To: Evandro Menezes; pinskia@gmail.com; 'Dr. Philipp Tomsich'
> Cc: 'James Greenhalgh'; 'Benedikt Huber'; gcc-patches@gcc.gnu.org; 'Marcus
> Shawcroft'; 'Ramana Radhakrishnan'; 'Richard Earnshaw'
> Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> Hi,
> 
> I missed your email and noticed it this week.
> 
> What does column 2  tests?  Are you trying to implement square roots  using
> reciprocal estimate and step?
> 
> But reciprocal square root  using reciprocal estimate and (2 for fp 3 for dp)
> step seems  to be better that using fdiv and fsqrt in your case.
> 
> Regards,
> Venkat.
> 
> > -----Original Message-----
> > From: Evandro Menezes [mailto:e.menezes@samsung.com]
> > Sent: Wednesday, July 15, 2015 3:45 AM
> > To: Kumar, Venkataramanan; pinskia@gmail.com; 'Dr. Philipp Tomsich'
> > Cc: 'James Greenhalgh'; 'Benedikt Huber'; gcc-patches@gcc.gnu.org;
> > 'Marcus Shawcroft'; 'Ramana Radhakrishnan'; 'Richard Earnshaw'
> > Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root
> > (rsqrt) estimation in -ffast-math
> >
> > I ran a simple test on A57 rev. 0, looping a million times around
> > sqrt{,f} and the respective series iterations with the values in the
> > sequence 1..1000000 and got these results:
> >
> > sqrt(x):        36593844/s      1/sqrt(x):      18283875/s
> > 3 Steps:        47922557/s      3 Steps:        49005194/s
> >
> > sqrtf(x):       143988480/s     1/sqrtf(x):     69516857/s
> > 2 Steps:        78740157/s      2 Steps:        80385852/s
> >
> > I'm a bit surprised that the 3-iteration series for DP is faster than
> > sqrt(), but not that it's much faster for the reciprocal of sqrt().
> > As for SP, the 2-iteration series is faster only for the reciprocal for
> sqrtf().
> >
> > There might still be some leg for this patch in real-world cases which
> > I'd like to investigate.
> >
> > --
> > Evandro Menezes                              Austin, TX
> >
> >
> > > -----Original Message-----
> > > From: gcc-patches-owner@gcc.gnu.org
> > > [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Kumar,
> > > Venkataramanan
> > > Sent: Monday, June 29, 2015 13:50
> > > To: pinskia@gmail.com; Dr. Philipp Tomsich
> > > Cc: James Greenhalgh; Benedikt Huber; gcc-patches@gcc.gnu.org;
> > > Marcus Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> > > Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root
> > > (rsqrt) estimation in -ffast-math
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: pinskia@gmail.com [mailto:pinskia@gmail.com]
> > > > Sent: Monday, June 29, 2015 10:23 PM
> > > > To: Dr. Philipp Tomsich
> > > > Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber; gcc-
> > > > patches@gcc.gnu.org; Marcus Shawcroft; Ramana Radhakrishnan;
> > Richard
> > > > Earnshaw
> > > > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > > > (rsqrt) estimation in -ffast-math
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich
> > > > <philipp.tomsich@theobroma-systems.com> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > >> On 29 Jun 2015, at 13:36, James Greenhalgh
> > > > <james.greenhalgh@arm.com> wrote:
> > > > >>
> > > > >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar,
> > > > >>> Venkataramanan
> > > > wrote:
> > > > >>>
> > > > >>>> -----Original Message-----
> > > > >>>> From: Dr. Philipp Tomsich
> > > > >>>> [mailto:philipp.tomsich@theobroma-systems.com]
> > > > >>>> Sent: Monday, June 29, 2015 2:17 PM
> > > > >>>> To: Kumar, Venkataramanan
> > > > >>>> Cc: pinskia@gmail.com; Benedikt Huber;
> > > > >>>> gcc-patches@gcc.gnu.org
> > > > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square
> > > > >>>> root
> > > > >>>> (rsqrt) estimation in -ffast-math
> > > > >>>>
> > > > >>>> Kumar,
> > > > >>>>
> > > > >>>> This does not come unexpected, as the initial estimation and
> > > > >>>> each iteration will add an architecturally-defined number of
> > > > >>>> bits of precision (ARMv8 guarantuees only a minimum number of
> > > > >>>> bits
> > > > provided
> > > > >>>> per operation… the exact number is specific to each
> > > > >>>> micro-arch,
> > > > though).
> > > > >>>> Depending on your architecture and on the required number of
> > > > >>>> precise bits by any given benchmark, one may see miscompares.
> > > > >>>
> > > > >>> True.
> > > > >>
> > > > >> I would be very uncomfortable with this approach.
> > > > >
> > > > > Same here. The default must be safe. Always.
> > > > > Unlike other architectures, we don’t have a problem with making
> > > > > the proper defaults for “safety”, as the ARMv8 ISA guarantees a
> > > > > minimum number of precise bits per iteration.
> > > > >
> > > > >> From Richard Biener's post in the thread Michael Matz linked
> > > > >> earlier in the thread:
> > > > >>
> > > > >>   It would follow existing practice of things we allow in
> > > > >>   -funsafe-math-optimizations.  Existing practice in that we
> > > > >>   want to allow -ffast-math use with common benchmarks we care
> > > > >>   about.
> > > > >>
> > > > >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> > > > >>
> > > > >> With the solution you seem to be converging on (2-steps for
> > > > >> some microarchitectures, 3 for others), a binary generated for
> > > > >> one micro-arch may drop below a minimum guarantee of precision
> > > > >> when run on another. This seems to go against the spirit of the
> > > > >> practice above. I would only support adding this optimization
> > > > >> to -Ofast if we could keep to architectural guarantees of
> > > > >> precision in the generated code
> > > > (i.e. 3-steps everywhere).
> > > > >>
> > > > >> I don't object to adding a "-mlow-precision-recip-sqrt" style
> > > > >> option, which would be off by default, would enable the 2-step
> > > > >> mode, and would need to be explicitly enabled (i.e. not implied
> > > > >> by
> > > > >> -mcpu=foo) but I don't see what this buys you beyond the
> > > > >> Gromacs boost (and even there you would be creating an Invalid
> > > > >> Run as optimization flags must be applied across all workloads).
> > > > >
> > > > > Any flag that reduces precision (and thus breaks IEEE
> > > > > floating-point
> > > > > semantics) needs to be gated with an “unsafe” flag (i.e. one
> > > > > that is never
> > > > on by default).
> > > > > As a consequence, the “peak”-tuning for SPEC will turn this on…
> > > > > but barely anyone else would.
> > > > >
> > > > >> For the 3-step optimization, it is clear to me that for "generic"
> > > > >> tuning we don't want this to be enabled by default experimental
> > > > >> results and advice in this thread argues against it for
> > > > >> thunderx and cortex-
> > > > a57 targets.
> > > > >> However, enabling it based on the CPU tuning selected seems
> > > > >> fine to
> > me.
> > > > >
> > > > > I do not agree on this one, as I would like to see the safe form
> (i.e.
> > > > > 3 and 5 iterations respectively) to become the default. Most
> > > > > “server-type” chips should not see a performance regression,
> > > > > while it will be easier to optimise for this in hardware than
> > > > > for a (potentially microcoded) sqrt-instruction (and subsequent,
> > > > > dependent
> > > > divide).
> > > > >
> > > > > I have not heard anyone claim a performance regression (either
> > > > > on thunderx or on cortex-a57), but merely heard a “no speed-up”.
> > > >
> > > > Actually it does regress performance on thunderX, I just assumed
> > > > that when I said not going to be a win it was taken as a slow down.
> > > > It regress gromacs by more than 10% on thunderX but I can't
> > > > remember how much as i had someone else run it. The latency
> > > > difference is also over 40%; for example single precision: 29
> > > > cycles with div (12)
> > > > sqrt(17) directly vs 42 cycles with the rsqrte and 2 iterations of
> > > > 2mul/rsqrts (double is 53 vs 60). That is huge difference right
> > > > there.  ThunderX has a
> > > fast div and a fast sqrt for 32bit and a
> > > > reasonable one for double.   So again this is not just not a win but
> rather
> > > a
> > > > regression for thunderX. I suspect cortex-a57 is also true.
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > >
> > > Yes theoretically  should be  true for cortex-57 case as well.   But  I
> > > believe hardware pipelining with instruction scheduling in compiler
> > > helps a little for gromacs case  ~3% to 4% with the original patch.
> > >
> > > I have not tested other FP benchmarks.   As James said a flag -mlow-
> > > precision-recip-sqrt if allowed can be used as a peak flag.
> > >
> > > > >
> > > > > So I am strongly in favor of defaulting to the ‘safe’ number of
> > > > > iterations, even when compiling for a generic target.
> > > > >
> > > > > Best,
> > > > > Philipp.
> > > > >
> > >
> > > Regards,
> > > Venkat.


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2015-07-20 15:26 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-18 11:57 [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math Benedikt Huber
2015-06-18 12:03 ` [PATCH] 2015-06-15 Benedikt Huber <benedikt.huber@theobroma-systems.com> Benedikt Huber
2015-06-27  8:12   ` Andrew Pinski
2015-06-18 12:36 ` [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math Kumar, Venkataramanan
2015-06-24 16:49 ` Evandro Menezes
2015-06-24 16:55   ` Dr. Philipp Tomsich
2015-06-24 17:16     ` Benedikt Huber
2015-06-24 18:37       ` Evandro Menezes
2015-06-24 20:11         ` Dr. Philipp Tomsich
2015-06-24 20:54           ` Evandro Menezes
2015-06-25 11:52             ` Benedikt Huber
2015-06-25  7:01     ` Kumar, Venkataramanan
2015-06-25  7:03 ` pinskia
2015-06-25  9:43   ` Ramana Radhakrishnan
2015-06-27  2:01     ` Andrew Pinski
2015-06-25 11:07   ` Benedikt Huber
2015-06-25 13:27     ` Michael Matz
2015-06-25 15:43     ` Kumar, Venkataramanan
2015-06-25 15:52       ` Dr. Philipp Tomsich
2015-06-25 16:47         ` Kumar, Venkataramanan
2015-06-28 15:13           ` pinskia
2015-06-29  8:30             ` Kumar, Venkataramanan
2015-06-29  9:07               ` Dr. Philipp Tomsich
2015-06-29  9:22                 ` Kumar, Venkataramanan
2015-06-29 11:44                   ` James Greenhalgh
2015-06-29 11:56                     ` Dr. Philipp Tomsich
2015-06-29 16:57                       ` pinskia
2015-06-29 19:07                         ` Kumar, Venkataramanan
2015-07-14 22:26                           ` Evandro Menezes
2015-07-20  9:46                             ` Kumar, Venkataramanan
2015-07-20 15:58                               ` Evandro Menezes
2015-07-13 19:09                       ` Evandro Menezes
2015-07-14 22:20                 ` Evandro Menezes
2015-06-29 14:20               ` Benedikt Huber
2015-06-29 17:35               ` Benedikt Huber
2015-06-29 17:44                 ` Kumar, Venkataramanan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).