Emit square root using the Newton series

public inbox for gcc-patches@gcc.gnu.org

* Emit square root using the Newton series
@ 2016-03-17 22:50 Evandro Menezes
  2016-03-24 20:30 ` [AArch64] " Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-03-17 22:50 UTC
  To: GCC Patches
  Cc: James Greenhalgh, Wilco Dijkstra, Andrew Pinski, philipp.tomsich,
	Benedikt Huber

[-- Attachment #1: Type: text/plain, Size: 1123 bytes --]

    2016-03-16  Evandro Menezes <e.menezes@samsung.com>
                 Wilco Dijkstra  <wilco.dijkstra@arm.com>

    gcc/
         * config/aarch64/aarch64-tuning-flags.def
         (AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
         * config/aarch64/aarch64-protos.h
         (aarch64_emit_approx_rsqrt): Replace with "aarch64_emit_approx_sqrt".
         (AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
         * config/aarch64/aarch64.c
         (exynosm1_tunings): Use the new macro.
         (aarch64_emit_approx_sqrt): Define new function.
         * config/aarch64/aarch64.md
         (rsqrt<mode>2): Use new function instead.
         (sqrt<mode>2): New expansion and insn definitions.
         * config/aarch64/aarch64-simd.md: Likewise.
         * config/aarch64/aarch64.opt
         (mlow-precision-recip-sqrt): Expand option description.
         * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.


This patch refactors the function to emit the reciprocal square root 
approximation to also emit the square root approximation.
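
For readers unfamiliar with the series, here is a minimal sketch in portable C of what the emitted sequence computes.  It is not the code in the patch: rsqrt_estimate() is a hypothetical stand-in for the FRSQRTE instruction (it uses the well-known integer bit trick instead), rsqrt_step() mimics the (3 - a * b) / 2 result of FRSQRTS, and all function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for FRSQRTE: a crude initial estimate of
   1/sqrt(d), obtained here with the classic integer bit trick.  */
static float
rsqrt_estimate (float d)
{
  uint32_t bits;
  float x;
  memcpy (&bits, &d, sizeof bits);
  bits = 0x5f3759dfu - (bits >> 1);
  memcpy (&x, &bits, sizeof x);
  return x;
}

/* Same arithmetic result as FRSQRTS: (3 - a * b) / 2.  */
static float
rsqrt_step (float a, float b)
{
  return (3.0f - a * b) * 0.5f;
}

/* Refine the estimate: x' = x * (3 - d * x * x) / 2.  */
float
approx_rsqrt (float d, int iterations)
{
  float x = rsqrt_estimate (d);
  while (iterations--)
    {
      float x2 = x * x;
      float s = rsqrt_step (d, x2);
      x = x * s;
    }
  return x;
}

/* sqrt (d) = d * (1 / sqrt (d)), with 0.0 handled separately.  */
float
approx_sqrt (float d, int iterations)
{
  float x = approx_rsqrt (d, iterations);
  /* For d == 0.0 the reciprocal estimate is unusable (infinite on
     hardware) and 0 * inf would be NaN, so force the factor to 0.  */
  if (d == 0.0f)
    x = 0.0f;
  return d * x;
}
```

With two iterations, approx_sqrt (4.0f, 2) lands very close to 2.0f, and the explicit check keeps the result for 0.0f at 0.0f rather than NaN.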

Feedback is welcome.

Thank you,

-- 
Evandro Menezes


[-- Attachment #2: 0001-Emit-square-root-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 11309 bytes --]

From 8d00622b90fa414df605011446ac058efe867cf6 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 17 Mar 2016 17:39:55 -0500
Subject: [PATCH] Emit square root using the Newton series

2016-03-17  Evandro Menezes  <e.menezes@samsung.com>
            Wilco Dijkstra  <wilco.dijkstra@arm.com>

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with "aarch64_emit_approx_sqrt".
	(AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
	* config/aarch64/aarch64.c
	(exynosm1_tunings): Use the new macro.
	(aarch64_emit_approx_sqrt): Define new function.
	* config/aarch64/aarch64.md
	(rsqrt<mode>2): Use new function instead.
	(sqrt<mode>2): New expansion and insn definitions.
	* config/aarch64/aarch64-simd.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-recip-sqrt): Expand option description.
	* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h         |  5 +-
 gcc/config/aarch64/aarch64-simd.md          | 27 +++++++-
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 +-
 gcc/config/aarch64/aarch64.c                | 97 +++++++++++++++++++++++------
 gcc/config/aarch64/aarch64.md               | 25 +++++++-
 gcc/config/aarch64/aarch64.opt              |  4 +-
 gcc/doc/invoke.texi                         |  9 +--
 7 files changed, 139 insertions(+), 31 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..3f3ae1c 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_SQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_SQRT_DF | AARCH64_EXTRA_TUNE_APPROX_SQRT_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
@@ -361,7 +364,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
-void aarch64_emit_approx_rsqrt (rtx, rtx);
+void aarch64_emit_approx_sqrt (rtx, rtx, bool);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..31191bb 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -405,7 +405,7 @@
 		     UNSPEC_RSQRT))]
   "TARGET_SIMD"
 {
-  aarch64_emit_approx_rsqrt (operands[0], operands[1]);
+  aarch64_emit_approx_sqrt (operands[0], operands[1], true);
   DONE;
 })
 
@@ -4307,7 +4307,30 @@
 
 ;; sqrt
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  machine_mode mode = GET_MODE_INNER (GET_MODE (operands[1]));
+
+  if (flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations
+      && !optimize_function_for_size_p (cfun)
+      && ((mode == SFmode
+           && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_APPROX_SQRT_SF))
+          || (mode == DFmode
+              && (aarch64_tune_params.extra_tuning_flags
+                  & AARCH64_EXTRA_TUNE_APPROX_SQRT_DF))))
+    {
+      aarch64_emit_approx_sqrt (operands[0], operands[1], false);
+      DONE;
+    }
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
         (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..725a79c 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,5 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("approx_sqrt", APPROX_SQRT_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_sqrtf", APPROX_SQRT_SF)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ed0daa5..04f5633 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-flags.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -7498,46 +7499,102 @@ get_rsqrts_type (machine_mode mode)
   }
 }
 
-/* Emit instruction sequence to compute the reciprocal square root using the
-   Newton-Raphson series.  Iterate over the series twice for SF
-   and thrice for DF.  */
+/* Emit instruction sequence to compute either the approximate square root
+   or its approximate reciprocal.  */
 
 void
-aarch64_emit_approx_rsqrt (rtx dst, rtx src)
+aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp)
 {
   machine_mode mode = GET_MODE (src);
-  gcc_assert (
-    mode == SFmode || mode == V2SFmode || mode == V4SFmode
-	|| mode == DFmode || mode == V2DFmode);
+  machine_mode mmsk;
+
+  gcc_assert (GET_MODE_INNER (mode) == SFmode
+              || GET_MODE_INNER (mode) == DFmode);
 
   rtx xsrc = gen_reg_rtx (mode);
   emit_move_insn (xsrc, src);
-  rtx x0 = gen_reg_rtx (mode);
 
-  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
+  rtx xcc, xne, xmsk;
+  bool scalar = !VECTOR_MODE_P (mode);
+  if (!recp)
+    {
+      if (scalar)
+	{
+	  /* Compare argument with 0.0 and set the CC.  */
+	  xcc = aarch64_gen_compare_reg (NE, xsrc, CONST0_RTX (mode));
+	  xne = gen_rtx_NE (VOIDmode, xcc, const0_rtx);
+	}
+      else
+	{
+	  /* Compare the argument with 0.0 and create a vector mask.  */
+	  mmsk = mode_for_vector (int_mode_for_mode (GET_MODE_INNER (mode)),
+				  GET_MODE_NUNITS (mode));
+	  xmsk = gen_reg_rtx (mmsk);
+	  switch (mode)
+	  {
+	    case V2SFmode:
+	      emit_insn (gen_aarch64_cmeqv2sf (xmsk, xsrc, CONST0_RTX (mode)));
+	      break;
 
-  bool double_mode = (mode == DFmode || mode == V2DFmode);
+	    case V4SFmode:
+	      emit_insn (gen_aarch64_cmeqv4sf (xmsk, xsrc, CONST0_RTX (mode)));
+	      break;
 
-  int iterations = double_mode ? 3 : 2;
+	    case V2DFmode:
+	      emit_insn (gen_aarch64_cmeqv2df (xmsk, xsrc, CONST0_RTX (mode)));
+	      break;
 
-  /* Optionally iterate over the series one less time than otherwise.  */
+	    default:
+	      gcc_unreachable ();
+	  }
+	}
+    }
+
+  /* Estimate the approximate reciprocal square root.  */
+  rtx xdst = gen_reg_rtx (mode);
+  emit_insn ((*get_rsqrte_type (mode)) (xdst, xsrc));
+
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
+
+  /* Optionally iterate over the series once less for faster performance
+     while sacrificing the accuracy.  */
   if (flag_mrecip_low_precision_sqrt)
     iterations--;
 
-  for (int i = 0; i < iterations; ++i)
+  /* Iterate over the series.  */
+  while (iterations--)
     {
-      rtx x1 = gen_reg_rtx (mode);
       rtx x2 = gen_reg_rtx (mode);
-      rtx x3 = gen_reg_rtx (mode);
-      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
+      emit_set_insn (x2, gen_rtx_MULT (mode, xdst, xdst));
+
+      rtx x1 = gen_reg_rtx (mode);
+      emit_insn ((*get_rsqrts_type (mode)) (x1, xsrc, x2));
 
-      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
+      emit_set_insn (xdst, gen_rtx_MULT (mode, x1, xdst));
+    }
+
+  if (!recp)
+    {
+      /* Qualify the final estimate for the approximate reciprocal square root
+	 when the argument is 0.0.  */
+      if (scalar)
+	/* Conditionally set the final estimate to 0.0.  */
+	emit_set_insn (xdst, gen_rtx_IF_THEN_ELSE (mode, xne, xdst, xsrc));
+      else
+	{
+	  /* Mask off any final vector element estimate to 0.0.  */
+	  rtx xtmp = gen_reg_rtx (mmsk);
+	  emit_set_insn (xtmp, gen_rtx_AND (mmsk, gen_rtx_NOT (mmsk, xmsk),
+					    gen_rtx_SUBREG (mmsk, xdst, 0)));
+	  emit_move_insn (xdst, gen_rtx_SUBREG (mode, xtmp, 0));
+	}
 
-      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
-      x0 = x1;
+      /* Calculate the approximate square root.  */
+      emit_set_insn (xdst, gen_rtx_MULT (mode, xsrc, xdst));
     }
 
-  emit_move_insn (dst, x0);
+  emit_move_insn (dst, xdst);
 }
 
 /* Return the number of instructions that can be issued per cycle.  */
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..71725e7 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4665,7 +4665,30 @@
   [(set_attr "type" "ffarith<s>")]
 )
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:GPF 0 "register_operand")
+        (sqrt:GPF (match_operand:GPF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  machine_mode mode = GET_MODE_INNER (GET_MODE (operands[1]));
+
+  if (flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations
+      && !optimize_function_for_size_p (cfun)
+      && ((mode == SFmode
+           && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_APPROX_SQRT_SF))
+          || (mode == DFmode
+              && (aarch64_tune_params.extra_tuning_flags
+                  & AARCH64_EXTRA_TUNE_APPROX_SQRT_DF))))
+    {
+      aarch64_emit_approx_sqrt (operands[0], operands[1], false);
+      DONE;
+    }
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:GPF 0 "register_operand" "=w")
         (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
   "TARGET_FLOAT"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c637ff4..c5e7fc9 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,5 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
+When calculating the approximate square root or its approximate reciprocal,
+use one less step than otherwise, thus reducing latency and precision.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 99ac11b..d48c29b 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12903,10 +12903,11 @@ corresponding flag to the linker.
 @item -mno-low-precision-recip-sqrt
 @opindex -mlow-precision-recip-sqrt
 @opindex -mno-low-precision-recip-sqrt
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
-This is only relevant if @option{-ffast-math} enables the reciprocal square root
-approximation, which in turn depends on the target processor.
+When calculating the approximate square root or its approximate reciprocal,
+use one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables
+the approximate square root or its approximate reciprocal,
+which in turn depends on the target processor.
 
 @item -march=@var{name}
 @opindex march
-- 
1.9.1



* Re: [AArch64] Emit square root using the Newton series
  2016-03-17 22:50 Emit square root using the Newton series Evandro Menezes
@ 2016-03-24 20:30 ` Evandro Menezes
  2016-04-01 22:45   ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-03-24 20:30 UTC
  To: GCC Patches
  Cc: James Greenhalgh, Wilco Dijkstra, Andrew Pinski, philipp.tomsich,
	Benedikt Huber

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

On 03/17/16 17:46, Evandro Menezes wrote:
> This patch refactors the function to emit the reciprocal square root 
> approximation to also emit the square root approximation.

    2016-03-23  Evandro Menezes <e.menezes@samsung.com>
                 Wilco Dijkstra  <wilco.dijkstra@arm.com>

    gcc/
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
             * config/aarch64/aarch64-protos.h
             (aarch64_emit_approx_rsqrt): Replace with "aarch64_emit_approx_sqrt".
             (AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
             * config/aarch64/aarch64.c
             (exynosm1_tunings): Use the new macro.
             (aarch64_emit_approx_sqrt): Define new function.
             (aarch64_override_options_after_change_1): Handle new option.
             * config/aarch64/aarch64.md
             (rsqrt<mode>2): Use new function instead.
             (sqrt<mode>2): New expansion and insn definitions.
             * config/aarch64/aarch64-simd.md: Likewise.
             * config/aarch64/aarch64.opt
             (mlow-precision-sqrt): Add new option description.
             * doc/invoke.texi (mlow-precision-sqrt): Likewise.

This version of the patch cleans up the changes to the MD files and 
fixes some bugs introduced in it since the first proposal.
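
As a reading aid, the decision logic in this revision can be restated as plain C predicates.  This is only a sketch under assumptions: struct approx_ctx and its field names are hypothetical stand-ins for the GCC flags and tuning bits that the attached patch tests, and the three functions are illustrative, not part of the patch.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the compiler state consulted by the patch.  */
struct approx_ctx
{
  bool finite_math_only;
  bool trapping_math;
  bool unsafe_math_optimizations;
  bool optimize_for_size;
  bool low_precision_sqrt;        /* -mlow-precision-sqrt */
  bool low_precision_recip_sqrt;  /* -mlow-precision-recip-sqrt */
  bool tune_approx_rsqrt;         /* target enables approximate rsqrt */
  bool tune_approx_sqrt_sf;       /* target enables approximate sqrtf */
  bool tune_approx_sqrt_df;       /* target enables approximate sqrt */
};

/* Requesting the low-precision sqrt also requests the low-precision
   reciprocal sqrt, since the latter is a step of the former.  */
void
apply_option_implication (struct approx_ctx *c)
{
  if (c->low_precision_sqrt)
    c->low_precision_recip_sqrt = true;
}

/* Should the Newton series replace the regular instruction?  */
bool
use_approx (const struct approx_ctx *c, bool recp, bool is_double)
{
  if (!c->finite_math_only || c->trapping_math
      || !c->unsafe_math_optimizations || c->optimize_for_size)
    return false;
  if (recp)
    return c->low_precision_recip_sqrt || c->tune_approx_rsqrt;
  return (c->low_precision_sqrt
          || (!is_double && c->tune_approx_sqrt_sf)
          || (is_double && c->tune_approx_sqrt_df));
}

/* Two steps for SF, three for DF, one less in low-precision mode.  */
int
newton_steps (const struct approx_ctx *c, bool recp, bool is_double)
{
  int steps = is_double ? 3 : 2;
  if ((recp && c->low_precision_recip_sqrt)
      || (!recp && c->low_precision_sqrt))
    steps--;
  return steps;
}
```

The point of returning a bool from the emit function is that the expander can fall through to the regular sqrt instruction whenever this predicate is false.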

Thanks for your feedback,

-- 
Evandro Menezes


[-- Attachment #2: 0001-AArch64-Emit-square-root-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 12645 bytes --]

From 712e330bf651393bb788e85ebe7b3d9a37f54ae7 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 17 Mar 2016 17:39:55 -0500
Subject: [PATCH] [AArch64] Emit square root using the Newton series

2016-03-23  Evandro Menezes  <e.menezes@samsung.com>
            Wilco Dijkstra  <wilco.dijkstra@arm.com>

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with "aarch64_emit_approx_sqrt".
	(AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
	* config/aarch64/aarch64.c
	(exynosm1_tunings): Use the new macro.
	(aarch64_emit_approx_sqrt): Define new function.
	(aarch64_override_options_after_change_1): Handle new option.
	* config/aarch64/aarch64.md
	(rsqrt<mode>2): Use new function instead.
	(sqrt<mode>2): New expansion and insn definitions.
	* config/aarch64/aarch64-simd.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-sqrt): Add new option description.
	* doc/invoke.texi (mlow-precision-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h         |   5 +-
 gcc/config/aarch64/aarch64-simd.md          |  13 ++-
 gcc/config/aarch64/aarch64-tuning-flags.def |   3 +-
 gcc/config/aarch64/aarch64.c                | 129 ++++++++++++++++++++++------
 gcc/config/aarch64/aarch64.md               |  11 ++-
 gcc/config/aarch64/aarch64.opt              |   9 +-
 gcc/doc/invoke.texi                         |  10 +++
 7 files changed, 147 insertions(+), 33 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..24c2125 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_SQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_SQRT_DF | AARCH64_EXTRA_TUNE_APPROX_SQRT_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
@@ -361,7 +364,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
-void aarch64_emit_approx_rsqrt (rtx, rtx);
+bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..47ccb18 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -405,7 +405,7 @@
 		     UNSPEC_RSQRT))]
   "TARGET_SIMD"
 {
-  aarch64_emit_approx_rsqrt (operands[0], operands[1]);
+  aarch64_emit_approx_sqrt (operands[0], operands[1], true);
   DONE;
 })
 
@@ -4307,7 +4307,16 @@
 
 ;; sqrt
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+    DONE;
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
         (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..725a79c 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,5 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("approx_sqrt", APPROX_SQRT_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_sqrtf", APPROX_SQRT_SF)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ed0daa5..b155b74 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-flags.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -538,7 +539,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT
+   | AARCH64_EXTRA_TUNE_APPROX_SQRT) /* tune_flags.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -7498,46 +7500,115 @@ get_rsqrts_type (machine_mode mode)
   }
 }
 
-/* Emit instruction sequence to compute the reciprocal square root using the
-   Newton-Raphson series.  Iterate over the series twice for SF
-   and thrice for DF.  */
+/* Emit instruction sequence to compute either the approximate square root
+   or its approximate reciprocal.  */
 
-void
-aarch64_emit_approx_rsqrt (rtx dst, rtx src)
+bool
+aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp)
 {
-  machine_mode mode = GET_MODE (src);
-  gcc_assert (
-    mode == SFmode || mode == V2SFmode || mode == V4SFmode
-	|| mode == DFmode || mode == V2DFmode);
+  machine_mode mode = GET_MODE (dst);
+  machine_mode mmsk = mode_for_vector (int_mode_for_mode (GET_MODE_INNER (mode)),
+				       GET_MODE_NUNITS (mode));
+  bool scalar = !VECTOR_MODE_P (mode);
+
+  if (!flag_finite_math_only
+      || flag_trapping_math
+      || !flag_unsafe_math_optimizations
+      || optimize_function_for_size_p (cfun)
+      || !((recp && (flag_mrecip_low_precision_sqrt
+		     || (aarch64_tune_params.extra_tuning_flags
+			 & AARCH64_EXTRA_TUNE_APPROX_RSQRT)))
+	   || (!recp && (flag_mlow_precision_sqrt
+			 || (GET_MODE_INNER (mode) == SFmode
+			     && (aarch64_tune_params.extra_tuning_flags
+				 & AARCH64_EXTRA_TUNE_APPROX_SQRT_SF))
+			 || (GET_MODE_INNER (mode) == DFmode
+			     && (aarch64_tune_params.extra_tuning_flags
+				 & AARCH64_EXTRA_TUNE_APPROX_SQRT_DF))))))
+    return false;
 
-  rtx xsrc = gen_reg_rtx (mode);
-  emit_move_insn (xsrc, src);
-  rtx x0 = gen_reg_rtx (mode);
+  rtx xne, xmsk;
+  if (!recp)
+    {
+      /* When calculating the approximate square root...  */
+      if (scalar)
+	{
+	  /* Compare argument with 0.0 and set the CC.  */
+	  rtx xcc = aarch64_gen_compare_reg (NE, src, CONST0_RTX (mode));
+	  xne = gen_rtx_NE (VOIDmode, xcc, const0_rtx);
+	}
+      else
+	{
+	  /* Compare the argument with 0.0 and set a vector mask.  */
+	  xmsk = gen_reg_rtx (mmsk);
+	  switch (mode)
+	  {
+	    case V2SFmode:
+	      emit_insn (gen_aarch64_cmeqv2sf (xmsk, src, CONST0_RTX (mode)));
+	      break;
+
+	    case V4SFmode:
+	      emit_insn (gen_aarch64_cmeqv4sf (xmsk, src, CONST0_RTX (mode)));
+	      break;
+
+	    case V2DFmode:
+	      emit_insn (gen_aarch64_cmeqv2df (xmsk, src, CONST0_RTX (mode)));
+	      break;
 
-  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
+	    default:
+	      gcc_unreachable ();
+	  }
+	}
+    }
 
-  bool double_mode = (mode == DFmode || mode == V2DFmode);
+  /* Estimate the approximate reciprocal square root.  */
+  rtx xdst = gen_reg_rtx (mode);
+  emit_insn ((*get_rsqrte_type (mode)) (xdst, src));
 
-  int iterations = double_mode ? 3 : 2;
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
 
-  /* Optionally iterate over the series one less time than otherwise.  */
-  if (flag_mrecip_low_precision_sqrt)
+  /* Optionally iterate over the series once less for faster performance
+     while sacrificing the accuracy.  */
+  if ((recp && flag_mrecip_low_precision_sqrt)
+      || (!recp && flag_mlow_precision_sqrt))
     iterations--;
 
-  for (int i = 0; i < iterations; ++i)
+  /* Iterate over the series to calculate the approximate reciprocal square root.  */
+  while (iterations--)
     {
-      rtx x1 = gen_reg_rtx (mode);
       rtx x2 = gen_reg_rtx (mode);
-      rtx x3 = gen_reg_rtx (mode);
-      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
+      emit_set_insn (x2, gen_rtx_MULT (mode, xdst, xdst));
 
-      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
+      rtx x1 = gen_reg_rtx (mode);
+      emit_insn ((*get_rsqrts_type (mode)) (x1, src, x2));
 
-      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
-      x0 = x1;
+      emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, x1));
     }
 
-  emit_move_insn (dst, x0);
+  if (!recp)
+    {
+      /* Qualify the approximate reciprocal square root when the argument is 0.0.  */
+      if (scalar)
+	/* Conditionally set the result to 0.0.  */
+	emit_set_insn (xdst, gen_rtx_IF_THEN_ELSE (mode, xne, xdst, src));
+      else
+	{
+	  /* Mask off any resulting vector element to 0.0.  */
+	  rtx xtmp = gen_reg_rtx (mmsk);
+	  emit_set_insn (xtmp, gen_rtx_AND (mmsk, gen_rtx_NOT (mmsk, xmsk),
+					    gen_rtx_SUBREG (mmsk, xdst, 0)));
+	  emit_move_insn (xdst, gen_rtx_SUBREG (mode, xtmp, 0));
+	}
+
+      /* Calculate the approximate square root.  */
+      emit_set_insn (dst, gen_rtx_MULT (mode, xdst, src));
+    }
+  else
+    /* Return the calculated approximate reciprocal square root.  */
+    emit_move_insn (dst, xdst);
+
+  return true;
 }
 
 /* Return the number of instructions that can be issued per cycle.  */
@@ -8144,6 +8215,12 @@ aarch64_override_options_after_change_1 (struct gcc_options *opts)
       && (aarch64_cmodel == AARCH64_CMODEL_TINY
 	  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC))
     aarch64_nopcrelative_literal_loads = false;
+
+  /* When enabling the lower precision Newton series for the square root, also
+     enable it for the reciprocal square root, since the latter is an
+     intermediary step for the former.  */
+  if (flag_mlow_precision_sqrt)
+    flag_mrecip_low_precision_sqrt = true;
 }
 
 /* 'Unpack' up the internal tuning structs and update the options
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..43fa318 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4665,7 +4665,16 @@
   [(set_attr "type" "ffarith<s>")]
 )
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:GPF 0 "register_operand")
+        (sqrt:GPF (match_operand:GPF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+    DONE;
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:GPF 0 "register_operand" "=w")
         (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
   "TARGET_FLOAT"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c637ff4..ffd5540 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,10 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
+When calculating the approximate reciprocal square root,
+use one less step than otherwise, thus reducing latency and precision.
+
+mlow-precision-sqrt
+Common Var(flag_mlow_precision_sqrt) Optimization
+When calculating the approximate square root,
+use one less step than otherwise, thus reducing latency and precision.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 99ac11b..433a9f2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -573,6 +573,7 @@ Objective-C and Objective-C++ Dialects}.
 -mfix-cortex-a53-835769  -mno-fix-cortex-a53-835769 @gol
 -mfix-cortex-a53-843419  -mno-fix-cortex-a53-843419 @gol
 -mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt@gol
+-mlow-precision-sqrt -mno-low-precision-sqrt@gol
 -march=@var{name}  -mcpu=@var{name}  -mtune=@var{name}}
 
 @emph{Adapteva Epiphany Options}
@@ -12908,6 +12909,15 @@ uses one less step than otherwise, thus reducing latency and precision.
 This is only relevant if @option{-ffast-math} enables the reciprocal square root
 approximation, which in turn depends on the target processor.
 
+@item -mlow-precision-sqrt
+@item -mno-low-precision-sqrt
+@opindex -mlow-precision-sqrt
+@opindex -mno-low-precision-sqrt
+When calculating the square root approximation,
+uses one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables the square root
+approximation, which in turn depends on the target processor.
+
 @item -march=@var{name}
 @opindex march
 Specify the name of the target architecture and, optionally, one or
-- 
1.9.1



* Re: [AArch64] Emit square root using the Newton series
  2016-03-24 20:30 ` [AArch64] " Evandro Menezes
@ 2016-04-01 22:45   ` Evandro Menezes
  2016-04-04 16:32     ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-04-01 22:45 UTC
  To: GCC Patches
  Cc: James Greenhalgh, Wilco Dijkstra, Andrew Pinski, philipp.tomsich,
	Benedikt Huber

[-- Attachment #1: Type: text/plain, Size: 2459 bytes --]

On 03/24/16 14:11, Evandro Menezes wrote:
> On 03/17/16 17:46, Evandro Menezes wrote:
>> This patch refactors the function to emit the reciprocal square root 
>> approximation to also emit the square root approximation.
> This version of the patch cleans up the changes to the MD files and 
> fixes some bugs introduced in it since the first proposal.

         [AArch64] Emit square root using the Newton series

         2016-03-30  Evandro Menezes  <e.menezes@samsung.com>
                     Wilco Dijkstra  <wilco.dijkstra@arm.com>

         gcc/
             * config/aarch64/aarch64-protos.h
             (aarch64_emit_approx_rsqrt): Replace with new function
             "aarch64_emit_approx_sqrt".
             (AARCH64_APPROX_MODE): New macro.
             (AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}): Likewise.
             (tune_params): New member "approx_sqrt_modes".
             * config/aarch64/aarch64.c
             (generic_tunings): New member "approx_rsqrt_modes".
             (cortexa35_tunings): Likewise.
             (cortexa53_tunings): Likewise.
             (cortexa57_tunings): Likewise.
             (cortexa72_tunings): Likewise.
             (exynosm1_tunings): Likewise.
             (thunderx_tunings): Likewise.
             (xgene1_tunings): Likewise.
             (aarch64_emit_approx_rsqrt): Replace with new function
             "aarch64_emit_approx_sqrt".
             (aarch64_override_options_after_change_1): Handle new option.
             * config/aarch64/aarch64-simd.md
             (rsqrt<mode>2): Use new function instead.
             (sqrt<mode>2): New expansion and insn definitions.
             * config/aarch64/aarch64.md: Likewise.
             * config/aarch64/aarch64.opt
             (mlow-precision-sqrt): Add new option description.
             * doc/invoke.texi (mlow-precision-sqrt): Likewise.

This version of the patch uses the finer-grained, per-target selection 
of the approximate sqrt() first proposed at 
https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00089.html
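
To illustrate the idea of a per-mode selection, here is a small sketch modelled loosely on the AARCH64_APPROX_* macros in the attached patch.  The enum fp_mode and the APPROX_* names are hypothetical simplifications: the real macro derives the bit position from GCC's machine mode numbering, and the definition of the VECTOR group below is my assumption, since it is only named in the ChangeLog.

```c
#include <stdbool.h>

/* Hypothetical, simplified list of the FP modes of interest.  */
enum fp_mode
{
  MODE_SF,
  MODE_DF,
  MODE_V2SF,
  MODE_V4SF,
  MODE_V2DF
};

/* One bit per mode.  */
#define APPROX_MODE(M) (1u << (M))

#define APPROX_NONE (0u)
#define APPROX_SP (APPROX_MODE (MODE_SF) \
                   | APPROX_MODE (MODE_V2SF) \
                   | APPROX_MODE (MODE_V4SF))
#define APPROX_DP (APPROX_MODE (MODE_DF) \
                   | APPROX_MODE (MODE_V2DF))
#define APPROX_SCALAR (APPROX_MODE (MODE_SF) \
                       | APPROX_MODE (MODE_DF))
/* Assumption: the vector group is every non-scalar mode.  */
#define APPROX_VECTOR (APPROX_MODE (MODE_V2SF) \
                       | APPROX_MODE (MODE_V4SF) \
                       | APPROX_MODE (MODE_V2DF))

/* Does a target's mask enable the approximation for this mode?  */
bool
approx_enabled (unsigned int mask, enum fp_mode mode)
{
  return (mask & APPROX_MODE (mode)) != 0;
}
```

A target that only benefits in single precision would then carry APPROX_SP in its tuning structure, so that approx_enabled (APPROX_SP, MODE_V4SF) holds while approx_enabled (APPROX_SP, MODE_DF) does not.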

Additionally, I changed the handling of the special case when the 
argument is 0.0 for scalars to be the same as for vectors.  The reason 
is that relying on the CC, a scarce resource, hindered parallelism.  
By using up an additional register to hold the mask for scalars as 
well, the code is more... scalable.

Hopefully this patch gets close to what everyone has in mind.

Thank you,

-- 
Evandro Menezes


[-- Attachment #2: 0001-AArch64-Emit-square-root-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 16180 bytes --]

From 6a508df89b9dde5506ec7c2fc40013850b1cd07c Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 17 Mar 2016 17:39:55 -0500
Subject: [PATCH] [AArch64] Emit square root using the Newton series

2016-03-30  Evandro Menezes  <e.menezes@samsung.com>
            Wilco Dijkstra  <wilco.dijkstra@arm.com>

gcc/
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(AARCH64_APPROX_MODE): New macro.
	(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}): Likewise.
	(tune_params): New member "approx_sqrt_modes".
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_rsqrt_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(aarch64_override_options_after_change_1): Handle new option.
	* config/aarch64/aarch64-simd.md
	(rsqrt<mode>2): Use new function instead.
	(sqrt<mode>2): New expansion and insn definitions.
	* config/aarch64/aarch64.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-sqrt): Add new option description.
	* doc/invoke.texi (mlow-precision-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h |  28 ++++++++-
 gcc/config/aarch64/aarch64-simd.md  |  13 +++-
 gcc/config/aarch64/aarch64.c        | 114 +++++++++++++++++++++++++-----------
 gcc/config/aarch64/aarch64.md       |  11 +++-
 gcc/config/aarch64/aarch64.opt      |   9 ++-
 gcc/config/aarch64/predicates.md    |   2 +-
 gcc/doc/invoke.texi                 |  10 ++++
 7 files changed, 146 insertions(+), 41 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..055ba7a 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,31 @@ struct cpu_branch_cost
   const int unpredictable;  /* Unpredictable branch or optimizing for speed.  */
 };
 
+/* Control approximate alternatives to certain FP operators.  */
+#define AARCH64_APPROX_MODE(MODE) \
+  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT + 1)) \
+     : (0))
+#define AARCH64_APPROX_NONE (0)
+#define AARCH64_APPROX_SP (AARCH64_APPROX_MODE (SFmode) \
+			   | AARCH64_APPROX_MODE (V2SFmode) \
+			   | AARCH64_APPROX_MODE (V4SFmode))
+#define AARCH64_APPROX_DP (AARCH64_APPROX_MODE (DFmode) \
+			   | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_DFORM (AARCH64_APPROX_MODE (SFmode) \
+			      | AARCH64_APPROX_MODE (DFmode) \
+			      | AARCH64_APPROX_MODE (V2SFmode))
+#define AARCH64_APPROX_QFORM (AARCH64_APPROX_MODE (V4SFmode) \
+			      | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_SCALAR (AARCH64_APPROX_MODE (SFmode) \
+			       | AARCH64_APPROX_MODE (DFmode))
+#define AARCH64_APPROX_VECTOR (AARCH64_APPROX_MODE (V2SFmode) \
+                               | AARCH64_APPROX_MODE (V4SFmode) \
+                               | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_ALL (-1)
+
 struct tune_params
 {
   const struct cpu_cost_table *insn_extra_cost;
@@ -218,6 +243,7 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+  unsigned int approx_sqrt_modes;
 };
 
 #define AARCH64_FUSION_PAIR(x, name) \
@@ -361,7 +387,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
-void aarch64_emit_approx_rsqrt (rtx, rtx);
+bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..47ccb18 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -405,7 +405,7 @@
 		     UNSPEC_RSQRT))]
   "TARGET_SIMD"
 {
-  aarch64_emit_approx_rsqrt (operands[0], operands[1]);
+  aarch64_emit_approx_sqrt (operands[0], operands[1], true);
   DONE;
 })
 
@@ -4307,7 +4307,16 @@
 
 ;; sqrt
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+    DONE;
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
         (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ed0daa5..308e70a 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-flags.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -414,7 +415,8 @@ static const struct tune_params generic_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params cortexa35_tunings =
@@ -439,7 +441,8 @@ static const struct tune_params cortexa35_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params cortexa53_tunings =
@@ -464,7 +467,8 @@ static const struct tune_params cortexa53_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params cortexa57_tunings =
@@ -489,7 +493,8 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params cortexa72_tunings =
@@ -514,7 +519,8 @@ static const struct tune_params cortexa72_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params exynosm1_tunings =
@@ -538,7 +544,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT), /* tune_flags.  */
+  (AARCH64_APPROX_ALL) /* approx_sqrt_modes.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -562,7 +569,8 @@ static const struct tune_params thunderx_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -586,7 +594,8 @@ static const struct tune_params xgene1_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -7498,46 +7507,77 @@ get_rsqrts_type (machine_mode mode)
   }
 }
 
-/* Emit instruction sequence to compute the reciprocal square root using the
-   Newton-Raphson series.  Iterate over the series twice for SF
-   and thrice for DF.  */
+/* Emit instruction sequence to compute either the approximate square root
+   or its approximate reciprocal.  */
 
-void
-aarch64_emit_approx_rsqrt (rtx dst, rtx src)
+bool
+aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp)
 {
-  machine_mode mode = GET_MODE (src);
-  gcc_assert (
-    mode == SFmode || mode == V2SFmode || mode == V4SFmode
-	|| mode == DFmode || mode == V2DFmode);
-
-  rtx xsrc = gen_reg_rtx (mode);
-  emit_move_insn (xsrc, src);
-  rtx x0 = gen_reg_rtx (mode);
+  machine_mode mode = GET_MODE (dst);
+  machine_mode mmsk = mode_for_vector (int_mode_for_mode (GET_MODE_INNER (mode)),
+				       GET_MODE_NUNITS (mode));
+
+  if (!flag_finite_math_only
+      || flag_trapping_math
+      || !flag_unsafe_math_optimizations
+      || optimize_function_for_size_p (cfun)
+      || !((recp && (flag_mrecip_low_precision_sqrt
+		     || (aarch64_tune_params.extra_tuning_flags
+			 & AARCH64_EXTRA_TUNE_APPROX_RSQRT)))
+	   || (!recp && (flag_mlow_precision_sqrt
+			 || (aarch64_tune_params.approx_sqrt_modes
+			     & AARCH64_APPROX_MODE (mode))))))
+    return false;
 
-  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
+  rtx xmsk = gen_reg_rtx (mmsk);
+  if (!recp)
+    /* When calculating the approximate square root, compare the argument with
+       0.0 and create a mask.  */
+    emit_insn (gen_rtx_SET (xmsk, gen_rtx_NEG (mmsk, gen_rtx_EQ (mmsk, src,
+							  CONST0_RTX (mode)))));
 
-  bool double_mode = (mode == DFmode || mode == V2DFmode);
+  /* Estimate the approximate reciprocal square root.  */
+  rtx xdst = gen_reg_rtx (mode);
+  emit_insn ((*get_rsqrte_type (mode)) (xdst, src));
 
-  int iterations = double_mode ? 3 : 2;
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
 
-  /* Optionally iterate over the series one less time than otherwise.  */
-  if (flag_mrecip_low_precision_sqrt)
+  /* Optionally iterate over the series once less for faster performance
+     while sacrificing the accuracy.  */
+  if ((recp && flag_mrecip_low_precision_sqrt)
+      || (!recp && flag_mlow_precision_sqrt))
     iterations--;
 
-  for (int i = 0; i < iterations; ++i)
+  /* Iterate over the series to calculate the approximate reciprocal square root.  */
+  while (iterations--)
     {
-      rtx x1 = gen_reg_rtx (mode);
       rtx x2 = gen_reg_rtx (mode);
-      rtx x3 = gen_reg_rtx (mode);
-      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
+      emit_set_insn (x2, gen_rtx_MULT (mode, xdst, xdst));
+
+      rtx x1 = gen_reg_rtx (mode);
+      emit_insn ((*get_rsqrts_type (mode)) (x1, src, x2));
+
+      emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, x1));
+    }
 
-      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
+  if (!recp)
+    {
+      /* Qualify the approximate reciprocal square root when the argument is
+	 0.0 by squashing the intermediary result to 0.0.  */
+      rtx xtmp = gen_reg_rtx (mmsk);
+      emit_set_insn (xtmp, gen_rtx_AND (mmsk, gen_rtx_NOT (mmsk, xmsk),
+					      gen_rtx_SUBREG (mmsk, xdst, 0)));
+      emit_move_insn (xdst, gen_rtx_SUBREG (mode, xtmp, 0));
 
-      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
-      x0 = x1;
+      /* Calculate the approximate square root.  */
+      emit_set_insn (dst, gen_rtx_MULT (mode, xdst, src));
     }
+  else
+    /* Return the approximate reciprocal square root.  */
+    emit_move_insn (dst, xdst);
 
-  emit_move_insn (dst, x0);
+  return true;
 }
 
 /* Return the number of instructions that can be issued per cycle.  */
@@ -8144,6 +8184,12 @@ aarch64_override_options_after_change_1 (struct gcc_options *opts)
       && (aarch64_cmodel == AARCH64_CMODEL_TINY
 	  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC))
     aarch64_nopcrelative_literal_loads = false;
+
+  /* When enabling the lower precision Newton series for the square root, also
+     enable it for the reciprocal square root, since the latter is an
+     intermediary step for the former.  */
+  if (flag_mlow_precision_sqrt)
+    flag_mrecip_low_precision_sqrt = true;
 }
 
 /* 'Unpack' up the internal tuning structs and update the options
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..43fa318 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4665,7 +4665,16 @@
   [(set_attr "type" "ffarith<s>")]
 )
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:GPF 0 "register_operand")
+        (sqrt:GPF (match_operand:GPF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+    DONE;
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:GPF 0 "register_operand" "=w")
         (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
   "TARGET_FLOAT"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c637ff4..ffd5540 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,10 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
+When calculating the approximate reciprocal square root,
+use one less step than otherwise, thus reducing latency and precision.
+
+mlow-precision-sqrt
+Common Var(flag_mlow_precision_sqrt) Optimization
+When calculating the approximate square root,
+use one less step than otherwise, thus reducing latency and precision.
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 1186827..8f2726d 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -302,7 +302,7 @@
 })
 
 (define_predicate "aarch64_simd_reg_or_zero"
-  (and (match_code "reg,subreg,const_int,const_vector")
+  (and (match_code "reg,subreg,const_int,const_double,const_vector")
        (ior (match_operand 0 "register_operand")
            (ior (match_test "op == const0_rtx")
                 (match_test "aarch64_simd_imm_zero_p (op, mode)")))))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 99ac11b..433a9f2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -573,6 +573,7 @@ Objective-C and Objective-C++ Dialects}.
 -mfix-cortex-a53-835769  -mno-fix-cortex-a53-835769 @gol
 -mfix-cortex-a53-843419  -mno-fix-cortex-a53-843419 @gol
 -mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt@gol
+-mlow-precision-sqrt -mno-low-precision-sqrt@gol
 -march=@var{name}  -mcpu=@var{name}  -mtune=@var{name}}
 
 @emph{Adapteva Epiphany Options}
@@ -12908,6 +12909,15 @@ uses one less step than otherwise, thus reducing latency and precision.
 This is only relevant if @option{-ffast-math} enables the reciprocal square root
 approximation, which in turn depends on the target processor.
 
+@item -mlow-precision-sqrt
+@item -mno-low-precision-sqrt
+@opindex -mlow-precision-sqrt
+@opindex -mno-low-precision-sqrt
+When calculating the square root approximation,
+uses one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables the square root
+approximation, which in turn depends on the target processor.
+
 @item -march=@var{name}
 @opindex march
 Specify the name of the target architecture and, optionally, one or
-- 
1.9.1



* Re: [AArch64] Emit square root using the Newton series
  2016-04-01 22:45   ` Evandro Menezes
@ 2016-04-04 16:32     ` Evandro Menezes
       [not found]       ` <DB3PR08MB008902F0F0AFA3B1F1C91511839E0@DB3PR08MB0089.eurprd08.prod.outlook.com>
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-04-04 16:32 UTC (permalink / raw)
  To: GCC Patches
  Cc: James Greenhalgh, Wilco Dijkstra, Andrew Pinski, philipp.tomsich,
	Benedikt Huber

[-- Attachment #1: Type: text/plain, Size: 2671 bytes --]

On 04/01/16 17:45, Evandro Menezes wrote:
> On 03/24/16 14:11, Evandro Menezes wrote:
>> On 03/17/16 17:46, Evandro Menezes wrote:
>>> This patch refactors the function to emit the reciprocal square root 
>>> approximation to also emit the square root approximation.
>> This version of the patch cleans up the changes to the MD files and 
>> fixes some bugs introduced in it since the first proposal.
>
> This version of the patch uses the finer-grained, per-target selection 
> of the approximate sqrt() first proposed at 
> https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00089.html
>
> Additionally, I changed the handling of the special case when the 
> argument is 0.0 for scalars to be the same as for vectors.  The scalar 
> code used to rely on the CC, a scarce resource, which hindered 
> parallelism.  By using an additional register to hold the mask for 
> scalars as well, the code is more... scalable.
>
> Hopefully this patch gets close to what all have in mind.
>

         [AArch64] Emit square root using the Newton series

         2016-04-04  Evandro Menezes  <e.menezes@samsung.com>
                     Wilco Dijkstra  <wilco.dijkstra@arm.com>

         gcc/
             * config/aarch64/aarch64-protos.h
             (aarch64_emit_approx_rsqrt): Replace with new function
             "aarch64_emit_approx_sqrt".
             (AARCH64_APPROX_MODE): New macro.
             (AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}): Likewise.
             (tune_params): New member "approx_sqrt_modes".
             * config/aarch64/aarch64.c
             (generic_tunings): New member "approx_sqrt_modes".
             (cortexa35_tunings): Likewise.
             (cortexa53_tunings): Likewise.
             (cortexa57_tunings): Likewise.
             (cortexa72_tunings): Likewise.
             (exynosm1_tunings): Likewise.
             (thunderx_tunings): Likewise.
             (xgene1_tunings): Likewise.
             (aarch64_emit_approx_rsqrt): Replace with new function
             "aarch64_emit_approx_sqrt".
             (aarch64_override_options_after_change_1): Handle new option.
             * config/aarch64/aarch64-simd.md
             (rsqrt<mode>2): Use new function instead.
             (sqrt<mode>2): New expansion and insn definitions.
             * config/aarch64/aarch64.md: Likewise.
             * config/aarch64/aarch64.opt
             (mlow-precision-sqrt): Add new option description.
             * doc/invoke.texi (mlow-precision-sqrt): Likewise.


This version of the patch refactors the algorithm to shorten the 
dependency chain at the last iteration of the series.
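
A minimal sketch of the idea, in scalar C and under stated assumptions: 
given the current reciprocal estimate e and the last step result s, the 
square root d * (e * s) can be reassociated as (e * d) * s.  The product 
e * d does not depend on s, so it can be computed while the step is still 
in flight, leaving a single multiply after it.  The function names are 
hypothetical, and rsqrt_step() only mirrors the (3 - a * b) / 2 arithmetic 
of the step instruction.

```c
/* Same arithmetic as the reciprocal square root step: (3 - a * b) / 2.  */
float
rsqrt_step (float a, float b)
{
  return (3.0f - a * b) / 2.0f;
}

/* Last iteration done as usual: both multiplies wait for the step
   result s, one after the other.  */
float
sqrt_tail_serial (float d, float e)
{
  float s = rsqrt_step (d, e * e);
  return (e * s) * d;
}

/* Reassociated form: e * d is independent of s, so only one multiply
   remains on the critical path after the step.  */
float
sqrt_tail_short (float d, float e)
{
  float s = rsqrt_step (d, e * e);
  float t = e * d;	/* Does not depend on s.  */
  return t * s;
}
```

Both forms compute the same value up to rounding; only the order of the 
multiplies, and hence the length of the dependency chain, differs.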

Thank you for your feedback.

-- 
Evandro Menezes


[-- Attachment #2: 0001-AArch64-Emit-square-root-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 15614 bytes --]

From 8e463d55c89233a623aad2412fb3055021fdd066 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Mon, 4 Apr 2016 11:23:29 -0500
Subject: [PATCH] [AArch64] Emit square root using the Newton series

2016-04-04  Evandro Menezes  <e.menezes@samsung.com>
            Wilco Dijkstra  <wilco.dijkstra@arm.com>

gcc/
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(AARCH64_APPROX_MODE): New macro.
	(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}): Likewise.
	(tune_params): New member "approx_sqrt_modes".
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_sqrt_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(aarch64_override_options_after_change_1): Handle new option.
	* config/aarch64/aarch64-simd.md
	(rsqrt<mode>2): Use new function instead.
	(sqrt<mode>2): New expansion and insn definitions.
	* config/aarch64/aarch64.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-sqrt): Add new option description.
	* doc/invoke.texi (mlow-precision-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h |  29 ++++++++-
 gcc/config/aarch64/aarch64-simd.md  |  13 +++-
 gcc/config/aarch64/aarch64.c        | 115 +++++++++++++++++++++++++-----------
 gcc/config/aarch64/aarch64.md       |  11 +++-
 gcc/config/aarch64/aarch64.opt      |   9 ++-
 gcc/doc/invoke.texi                 |  10 ++++
 6 files changed, 147 insertions(+), 40 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58c9d0d..365572d 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,32 @@ struct cpu_branch_cost
   const int unpredictable;  /* Unpredictable branch or optimizing for speed.  */
 };
 
+/* Control approximate alternatives to certain FP operators.  */
+#define AARCH64_APPROX_MODE(MODE) \
+  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT \
+	      + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
+     : (0))
+#define AARCH64_APPROX_NONE (0)
+#define AARCH64_APPROX_SP (AARCH64_APPROX_MODE (SFmode) \
+			   | AARCH64_APPROX_MODE (V2SFmode) \
+			   | AARCH64_APPROX_MODE (V4SFmode))
+#define AARCH64_APPROX_DP (AARCH64_APPROX_MODE (DFmode) \
+			   | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_DFORM (AARCH64_APPROX_MODE (SFmode) \
+			      | AARCH64_APPROX_MODE (DFmode) \
+			      | AARCH64_APPROX_MODE (V2SFmode))
+#define AARCH64_APPROX_QFORM (AARCH64_APPROX_MODE (V4SFmode) \
+			      | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_SCALAR (AARCH64_APPROX_MODE (SFmode) \
+			       | AARCH64_APPROX_MODE (DFmode))
+#define AARCH64_APPROX_VECTOR (AARCH64_APPROX_MODE (V2SFmode) \
+                               | AARCH64_APPROX_MODE (V4SFmode) \
+                               | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_ALL (-1)
+
 struct tune_params
 {
   const struct cpu_cost_table *insn_extra_cost;
@@ -218,6 +244,7 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+  unsigned int approx_sqrt_modes;
 };
 
 #define AARCH64_FUSION_PAIR(x, name) \
@@ -361,7 +388,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
-void aarch64_emit_approx_rsqrt (rtx, rtx);
+bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..47ccb18 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -405,7 +405,7 @@
 		     UNSPEC_RSQRT))]
   "TARGET_SIMD"
 {
-  aarch64_emit_approx_rsqrt (operands[0], operands[1]);
+  aarch64_emit_approx_sqrt (operands[0], operands[1], true);
   DONE;
 })
 
@@ -4307,7 +4307,16 @@
 
 ;; sqrt
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+    DONE;
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
         (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b7086dd..9bc9aa4 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-flags.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -414,7 +415,8 @@ static const struct tune_params generic_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params cortexa35_tunings =
@@ -439,7 +441,8 @@ static const struct tune_params cortexa35_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params cortexa53_tunings =
@@ -464,7 +467,8 @@ static const struct tune_params cortexa53_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params cortexa57_tunings =
@@ -489,7 +493,8 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params cortexa72_tunings =
@@ -514,7 +519,8 @@ static const struct tune_params cortexa72_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params exynosm1_tunings =
@@ -538,7 +544,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT), /* tune_flags.  */
+  (AARCH64_APPROX_ALL) /* approx_sqrt_modes.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -562,7 +569,8 @@ static const struct tune_params thunderx_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -586,7 +594,8 @@ static const struct tune_params xgene1_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_sqrt_modes.  */
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -7510,46 +7519,78 @@ get_rsqrts_type (machine_mode mode)
   }
 }
 
-/* Emit instruction sequence to compute the reciprocal square root using the
-   Newton-Raphson series.  Iterate over the series twice for SF
-   and thrice for DF.  */
+/* Emit instruction sequence to compute either the approximate square root
+   or its approximate reciprocal.  */
 
-void
-aarch64_emit_approx_rsqrt (rtx dst, rtx src)
+bool
+aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp)
 {
-  machine_mode mode = GET_MODE (src);
-  gcc_assert (
-    mode == SFmode || mode == V2SFmode || mode == V4SFmode
-	|| mode == DFmode || mode == V2DFmode);
-
-  rtx xsrc = gen_reg_rtx (mode);
-  emit_move_insn (xsrc, src);
-  rtx x0 = gen_reg_rtx (mode);
+  machine_mode mode = GET_MODE (dst);
+  machine_mode mmsk = mode_for_vector (int_mode_for_mode (GET_MODE_INNER (mode)),
+				       GET_MODE_NUNITS (mode));
+
+  if (!flag_finite_math_only
+      || flag_trapping_math
+      || !flag_unsafe_math_optimizations
+      || optimize_function_for_size_p (cfun)
+      || !((recp && (flag_mrecip_low_precision_sqrt
+		     || (aarch64_tune_params.extra_tuning_flags
+			 & AARCH64_EXTRA_TUNE_APPROX_RSQRT)))
+	   || (!recp && (flag_mlow_precision_sqrt
+			 || (aarch64_tune_params.approx_sqrt_modes
+			     & AARCH64_APPROX_MODE (mode))))))
+    return false;
 
-  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
+  rtx xmsk = gen_reg_rtx (mmsk);
+  if (!recp)
+    /* When calculating the approximate square root, compare the argument with
+       0.0 and create a mask.  */
+    emit_insn (gen_rtx_SET (xmsk, gen_rtx_NEG (mmsk, gen_rtx_EQ (mmsk, src,
+							  CONST0_RTX (mode)))));
 
-  bool double_mode = (mode == DFmode || mode == V2DFmode);
+  /* Estimate the approximate reciprocal square root.  */
+  rtx xdst = gen_reg_rtx (mode);
+  emit_insn ((*get_rsqrte_type (mode)) (xdst, src));
 
-  int iterations = double_mode ? 3 : 2;
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
 
-  /* Optionally iterate over the series one less time than otherwise.  */
-  if (flag_mrecip_low_precision_sqrt)
+  /* Optionally iterate over the series once less for faster performance
+     while sacrificing the accuracy.  */
+  if ((recp && flag_mrecip_low_precision_sqrt)
+      || (!recp && flag_mlow_precision_sqrt))
     iterations--;
 
-  for (int i = 0; i < iterations; ++i)
+  /* Iterate over the series to calculate the approximate reciprocal square root.  */
+  rtx x1 = gen_reg_rtx (mode);
+  while (iterations--)
     {
-      rtx x1 = gen_reg_rtx (mode);
       rtx x2 = gen_reg_rtx (mode);
-      rtx x3 = gen_reg_rtx (mode);
-      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
+      emit_set_insn (x2, gen_rtx_MULT (mode, xdst, xdst));
+
+      emit_insn ((*get_rsqrts_type (mode)) (x1, src, x2));
 
-      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
+      if (iterations > 0)
+	emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, x1));
+    }
+
+  if (!recp)
+    {
+      /* Qualify the approximate reciprocal square root when the argument is
+	 0.0 by squashing the intermediary result to 0.0.  */
+      rtx xtmp = gen_reg_rtx (mmsk);
+      emit_set_insn (xtmp, gen_rtx_AND (mmsk, gen_rtx_NOT (mmsk, xmsk),
+					      gen_rtx_SUBREG (mmsk, xdst, 0)));
+      emit_move_insn (xdst, gen_rtx_SUBREG (mode, xtmp, 0));
 
-      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
-      x0 = x1;
+      /* Calculate the approximate square root.  */
+      emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, src));
     }
 
-  emit_move_insn (dst, x0);
+  /* Return the approximation.  */
+  emit_set_insn (dst, gen_rtx_MULT (mode, xdst, x1));
+
+  return true;
 }
 
 /* Return the number of instructions that can be issued per cycle.  */
@@ -8156,6 +8197,12 @@ aarch64_override_options_after_change_1 (struct gcc_options *opts)
       && (aarch64_cmodel == AARCH64_CMODEL_TINY
 	  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC))
     aarch64_nopcrelative_literal_loads = false;
+
+  /* When enabling the lower precision Newton series for the square root, also
+     enable it for the reciprocal square root, since the latter is an
+     intermediary step for the former.  */
+  if (flag_mlow_precision_sqrt)
+    flag_mrecip_low_precision_sqrt = true;
 }
 
 /* 'Unpack' up the internal tuning structs and update the options
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..43fa318 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4665,7 +4665,16 @@
   [(set_attr "type" "ffarith<s>")]
 )
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:GPF 0 "register_operand")
+        (sqrt:GPF (match_operand:GPF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+    DONE;
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:GPF 0 "register_operand" "=w")
         (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
   "TARGET_FLOAT"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c637ff4..ffd5540 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,10 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
+When calculating the approximate reciprocal square root,
+use one less step than otherwise, thus reducing latency and precision.
+
+mlow-precision-sqrt
+Common Var(flag_mlow_precision_sqrt) Optimization
+When calculating the approximate square root,
+use one less step than otherwise, thus reducing latency and precision.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index e9763d4..cebb8cf 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -573,6 +573,7 @@ Objective-C and Objective-C++ Dialects}.
 -mfix-cortex-a53-835769  -mno-fix-cortex-a53-835769 @gol
 -mfix-cortex-a53-843419  -mno-fix-cortex-a53-843419 @gol
 -mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt@gol
+-mlow-precision-sqrt -mno-low-precision-sqrt@gol
 -march=@var{name}  -mcpu=@var{name}  -mtune=@var{name}}
 
 @emph{Adapteva Epiphany Options}
@@ -12921,6 +12922,15 @@ uses one less step than otherwise, thus reducing latency and precision.
 This is only relevant if @option{-ffast-math} enables the reciprocal square root
 approximation, which in turn depends on the target processor.
 
+@item -mlow-precision-sqrt
+@item -mno-low-precision-sqrt
+@opindex -mlow-precision-sqrt
+@opindex -mno-low-precision-sqrt
+When calculating the square root approximation,
+use one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables the square root
+approximation, which in turn depends on the target processor.
+
 @item -march=@var{name}
 @opindex march
 Specify the name of the target architecture and, optionally, one or
-- 
1.9.1



* Re: [AArch64] Emit square root using the Newton series
       [not found]       ` <DB3PR08MB008902F0F0AFA3B1F1C91511839E0@DB3PR08MB0089.eurprd08.prod.outlook.com>
@ 2016-04-05 22:30         ` Evandro Menezes
  2016-04-12 18:15           ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-04-05 22:30 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches
  Cc: James Greenhalgh, Andrew Pinski, philipp.tomsich, Benedikt Huber

[-- Attachment #1: Type: text/plain, Size: 546 bytes --]

On 04/05/16 13:37, Wilco Dijkstra wrote:
> I can't get any of these to work... Not only do I get a large number of collisions and duplicated
> code between these patches, when I try to resolve them, all I get is crashes whenever I try
> to use sqrt (even rsqrt stopped working). Do you have a patchset that applies cleanly so I can
> try all approximation routines?

Hi, Wilco.

The original patches should be independent of each other, so indeed they 
duplicate code.

This patch suite should be suitable for testing.

HTH

-- 
Evandro Menezes


[-- Attachment #2: 0003-Emit-division-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 11600 bytes --]

From cbc2b62f7df5c3e2fef2a24157b1bdd1a6de191b Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Mon, 4 Apr 2016 14:02:24 -0500
Subject: [PATCH 3/3] Emit division using the Newton series

2016-04-04  Evandro Menezes  <e.menezes@samsung.com>
            Wilco Dijkstra <Wilco.Dijkstra@arm.com>

gcc/
	* config/aarch64/aarch64-protos.h
	(tune_params): Add new member "approx_div_modes".
	(aarch64_emit_approx_div): Declare new function.
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_div_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(aarch64_emit_approx_div): Define new function.
	* config/aarch64/aarch64.md ("div<mode>3"): New expansion.
	* config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
	* config/aarch64/aarch64.opt (-mlow-precision-div): Add new option.
	* doc/invoke.texi (-mlow-precision-div): Describe new option.
---
 gcc/config/aarch64/aarch64-protos.h |  2 +
 gcc/config/aarch64/aarch64-simd.md  | 14 +++++-
 gcc/config/aarch64/aarch64.c        | 85 +++++++++++++++++++++++++++++++++++++
 gcc/config/aarch64/aarch64.md       | 19 +++++++--
 gcc/config/aarch64/aarch64.opt      |  5 +++
 gcc/doc/invoke.texi                 | 10 +++++
 6 files changed, 130 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 85ad796..649faf7 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -244,6 +244,7 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+  unsigned int approx_div_modes;
   unsigned int approx_sqrt_modes;
   unsigned int approx_rsqrt_modes;
 };
@@ -390,6 +391,7 @@ void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
 bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
+bool aarch64_emit_approx_div (rtx, rtx, rtx);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 47ccb18..7e99e16 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1509,7 +1509,19 @@
   [(set_attr "type" "neon_fp_mul_<Vetype><q>")]
 )
 
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:VDQF 0 "register_operand")
+       (div:VDQF (match_operand:VDQF 1 "general_operand")
+		 (match_operand:VDQF 2 "register_operand")))]
+ "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+    DONE;
+
+  operands[1] = force_reg (<MODE>mode, operands[1]);
+})
+
+(define_insn "*div<mode>3"
  [(set (match_operand:VDQF 0 "register_operand" "=w")
        (div:VDQF (match_operand:VDQF 1 "register_operand" "w")
 		 (match_operand:VDQF 2 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4af2175..74310e8 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -417,6 +417,7 @@ static const struct tune_params generic_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -444,6 +445,7 @@ static const struct tune_params cortexa35_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -471,6 +473,7 @@ static const struct tune_params cortexa53_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -498,6 +501,7 @@ static const struct tune_params cortexa57_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -525,6 +529,7 @@ static const struct tune_params cortexa72_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -551,6 +556,7 @@ static const struct tune_params exynosm1_tunings =
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
+  (AARCH64_APPROX_NONE), /* approx_div_modes.  */
   (AARCH64_APPROX_ALL), /* approx_sqrt_modes.  */
   (AARCH64_APPROX_ALL) /* approx_rsqrt_modes.  */
 };
@@ -577,6 +583,7 @@ static const struct tune_params thunderx_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -603,6 +610,7 @@ static const struct tune_params xgene1_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_ALL)	/* approx_rsqrt_modes.  */
 };
@@ -7604,6 +7612,83 @@ aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp)
   return true;
 }
 
+/* Emit the instruction sequence to compute the approximation for a division.  */
+
+bool
+aarch64_emit_approx_div (rtx quo, rtx num, rtx div)
+{
+  machine_mode mode = GET_MODE (quo);
+
+  if (!flag_finite_math_only
+      || flag_trapping_math
+      || !flag_unsafe_math_optimizations
+      || optimize_function_for_size_p (cfun)
+      || !(flag_mlow_precision_div
+	   || (aarch64_tune_params.approx_div_modes & AARCH64_APPROX_MODE (mode))))
+    return false;
+
+  /* Estimate the approximate reciprocal.  */
+  rtx xrcp = gen_reg_rtx (mode);
+  switch (mode)
+    {
+      case SFmode:
+	emit_insn (gen_aarch64_frecpesf (xrcp, div)); break;
+      case V2SFmode:
+	emit_insn (gen_aarch64_frecpev2sf (xrcp, div)); break;
+      case V4SFmode:
+	emit_insn (gen_aarch64_frecpev4sf (xrcp, div)); break;
+      case DFmode:
+	emit_insn (gen_aarch64_frecpedf (xrcp, div)); break;
+      case V2DFmode:
+	emit_insn (gen_aarch64_frecpev2df (xrcp, div)); break;
+      default:
+	gcc_unreachable ();
+    }
+
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
+
+  /* Optionally iterate over the series one fewer time for faster performance,
+     at the cost of accuracy.  */
+  if (flag_mlow_precision_div)
+    iterations--;
+
+  /* Iterate over the series to calculate the approximate reciprocal.  */
+  rtx xtmp = gen_reg_rtx (mode);
+  while (iterations--)
+    {
+      switch (mode)
+        {
+	  case SFmode:
+	    emit_insn (gen_aarch64_frecpssf (xtmp, xrcp, div)); break;
+	  case V2SFmode:
+	    emit_insn (gen_aarch64_frecpsv2sf (xtmp, xrcp, div)); break;
+	  case V4SFmode:
+	    emit_insn (gen_aarch64_frecpsv4sf (xtmp, xrcp, div)); break;
+	  case DFmode:
+	    emit_insn (gen_aarch64_frecpsdf (xtmp, xrcp, div)); break;
+	  case V2DFmode:
+	    emit_insn (gen_aarch64_frecpsv2df (xtmp, xrcp, div)); break;
+	  default:
+	    gcc_unreachable ();
+        }
+
+      if (iterations > 0)
+	emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xtmp));
+    }
+
+  if (num != CONST1_RTX (mode))
+    {
+      /* Calculate the approximate division.  */
+      rtx xnum = force_reg (mode, num);
+      emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xnum));
+    }
+
+  /* Return the approximation.  */
+  emit_set_insn (quo, gen_rtx_MULT (mode, xrcp, xtmp));
+  return true;
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 43fa318..b42ce1a 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4647,11 +4647,22 @@
   [(set_attr "type" "fmul<s>")]
 )
 
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:GPF 0 "register_operand")
+       (div:GPF (match_operand:GPF 1 "general_operand")
+		(match_operand:GPF 2 "register_operand")))]
+ "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+    DONE;
+
+  operands[1] = force_reg (<MODE>mode, operands[1]);
+})
+
+(define_insn "*div<mode>3"
   [(set (match_operand:GPF 0 "register_operand" "=w")
-        (div:GPF
-         (match_operand:GPF 1 "register_operand" "w")
-         (match_operand:GPF 2 "register_operand" "w")))]
+        (div:GPF (match_operand:GPF 1 "register_operand" "w")
+	         (match_operand:GPF 2 "register_operand" "w")))]
   "TARGET_FLOAT"
   "fdiv\\t%<s>0, %<s>1, %<s>2"
   [(set_attr "type" "fdiv<s>")]
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index ffd5540..760bd50 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -158,3 +158,8 @@ mlow-precision-sqrt
 Common Var(flag_mlow_precision_sqrt) Optimization
 When calculating the approximate square root,
 use one less step than otherwise, thus reducing latency and precision.
+
+mlow-precision-div
+Common Var(flag_mlow_precision_div) Optimization
+When calculating the approximate division,
+use one less step than otherwise, thus reducing latency and precision.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 01c3e87..8d33997 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -574,6 +574,7 @@ Objective-C and Objective-C++ Dialects}.
 -mfix-cortex-a53-843419  -mno-fix-cortex-a53-843419 @gol
 -mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt@gol
 -mlow-precision-sqrt -mno-low-precision-sqrt@gol
+-mlow-precision-div -mno-low-precision-div @gol
 -march=@var{name}  -mcpu=@var{name}  -mtune=@var{name}}
 
 @emph{Adapteva Epiphany Options}
@@ -12931,6 +12932,15 @@ uses one less step than otherwise, thus reducing latency and precision.
 This is only relevant if @option{-ffast-math} enables the square root
 approximation.
 
+@item -mlow-precision-div
+@item -mno-low-precision-div
+@opindex -mlow-precision-div
+@opindex -mno-low-precision-div
+When calculating the division approximation,
+use one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables the division
+approximation.
+
 @item -march=@var{name}
 @opindex march
 Specify the name of the target architecture and, optionally, one or
-- 
1.9.1
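
For readers following the thread, the sequence emitted by aarch64_emit_approx_div in the patch above can be sketched in plain C. This is only an illustration under stated assumptions: recip_estimate and recip_step are hypothetical stand-ins for the FRECPE and FRECPS instructions (the real estimate comes from a hardware table, here it is faked by truncating an exact reciprocal to 8 mantissa bits), and approx_div is a name made up for this note. As in the patch, finite non-zero operands are assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for FRECPE: a reciprocal good to about 8 bits.
   The real instruction uses a lookup table; this only mimics its accuracy.  */
static float recip_estimate(float d)
{
    float r = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;          /* keep sign, exponent, top 8 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Stand-in for FRECPS, which computes 2 - a * b.  */
static float recip_step(float a, float b)
{
    return 2.0f - a * b;
}

/* Mirrors the emitted sequence: the last correction factor is kept in
   xtmp and applied only in the final multiply, after the numerator.  */
float approx_div(float num, float div, int iterations)
{
    assert(iterations > 0);
    float xrcp = recip_estimate(div);
    float xtmp = 1.0f;
    while (iterations--) {
        xtmp = recip_step(xrcp, div);
        if (iterations > 0)
            xrcp *= xtmp;
    }
    xrcp *= num;                  /* the patch skips this when num is 1.0 */
    return xrcp * xtmp;
}
```

Calling approx_div (1.0f, 3.0f, 2) corresponds to the SF case with two steps, while passing 1 as the last argument corresponds to -mlow-precision-div. Each step roughly squares the relative error of the estimate, which is why one step fewer costs about half of the correct bits.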


[-- Attachment #3: 0002-AArch64-Emit-square-root-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 13251 bytes --]

From ea7079be1850290146096e2b69c537875713ef62 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Mon, 4 Apr 2016 11:23:29 -0500
Subject: [PATCH 2/3] [AArch64] Emit square root using the Newton series

2016-04-04  Evandro Menezes  <e.menezes@samsung.com>
            Wilco Dijkstra  <wilco.dijkstra@arm.com>

gcc/
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(tune_params): New member "approx_sqrt_modes".
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_sqrt_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(aarch64_override_options_after_change_1): Handle new option.
	* config/aarch64/aarch64-simd.md
	(rsqrt<mode>2): Use new function instead.
	(sqrt<mode>2): New expansion and insn definitions.
	* config/aarch64/aarch64.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-sqrt): Add new option description.
	* doc/invoke.texi (mlow-precision-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h |  3 +-
 gcc/config/aarch64/aarch64-simd.md  | 13 ++++-
 gcc/config/aarch64/aarch64.c        | 99 +++++++++++++++++++++++++++----------
 gcc/config/aarch64/aarch64.md       | 11 ++++-
 gcc/config/aarch64/aarch64.opt      |  9 +++-
 gcc/doc/invoke.texi                 | 10 ++++
 6 files changed, 113 insertions(+), 32 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index fe1746b..85ad796 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -244,6 +244,7 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+  unsigned int approx_sqrt_modes;
   unsigned int approx_rsqrt_modes;
 };
 
@@ -388,7 +389,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
-void aarch64_emit_approx_rsqrt (rtx, rtx);
+bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..47ccb18 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -405,7 +405,7 @@
 		     UNSPEC_RSQRT))]
   "TARGET_SIMD"
 {
-  aarch64_emit_approx_rsqrt (operands[0], operands[1]);
+  aarch64_emit_approx_sqrt (operands[0], operands[1], true);
   DONE;
 })
 
@@ -4307,7 +4307,16 @@
 
 ;; sqrt
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+    DONE;
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
         (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b0ee11e..4af2175 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-flags.h"
 #include "insn-modes.h"
 #include "alias.h"
 #include "fold-const.h"
@@ -416,6 +417,7 @@ static const struct tune_params generic_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
@@ -442,6 +444,7 @@ static const struct tune_params cortexa35_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
@@ -468,6 +471,7 @@ static const struct tune_params cortexa53_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
@@ -494,6 +498,7 @@ static const struct tune_params cortexa57_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
@@ -520,6 +525,7 @@ static const struct tune_params cortexa72_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
@@ -545,6 +551,7 @@ static const struct tune_params exynosm1_tunings =
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
+  (AARCH64_APPROX_ALL), /* approx_sqrt_modes.  */
   (AARCH64_APPROX_ALL) /* approx_rsqrt_modes.  */
 };
 
@@ -570,6 +577,7 @@ static const struct tune_params thunderx_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
@@ -595,6 +603,7 @@ static const struct tune_params xgene1_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_ALL)	/* approx_rsqrt_modes.  */
 };
 
@@ -7521,46 +7530,78 @@ get_rsqrts_type (machine_mode mode)
   }
 }
 
-/* Emit instruction sequence to compute the reciprocal square root using the
-   Newton-Raphson series.  Iterate over the series twice for SF
-   and thrice for DF.  */
+/* Emit instruction sequence to compute either the approximate square root
+   or its approximate reciprocal.  */
 
-void
-aarch64_emit_approx_rsqrt (rtx dst, rtx src)
+bool
+aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp)
 {
-  machine_mode mode = GET_MODE (src);
-  gcc_assert (
-    mode == SFmode || mode == V2SFmode || mode == V4SFmode
-	|| mode == DFmode || mode == V2DFmode);
-
-  rtx xsrc = gen_reg_rtx (mode);
-  emit_move_insn (xsrc, src);
-  rtx x0 = gen_reg_rtx (mode);
+  machine_mode mode = GET_MODE (dst);
+  machine_mode mmsk = mode_for_vector (int_mode_for_mode (GET_MODE_INNER (mode)),
+				       GET_MODE_NUNITS (mode));
+
+  if (!flag_finite_math_only
+      || flag_trapping_math
+      || !flag_unsafe_math_optimizations
+      || optimize_function_for_size_p (cfun)
+      || !((recp && (flag_mrecip_low_precision_sqrt
+		     || (aarch64_tune_params.approx_rsqrt_modes
+			 & AARCH64_APPROX_MODE (mode))))
+	   || (!recp && (flag_mlow_precision_sqrt
+			 || (aarch64_tune_params.approx_sqrt_modes
+			     & AARCH64_APPROX_MODE (mode))))))
+    return false;
 
-  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
+  rtx xmsk = gen_reg_rtx (mmsk);
+  if (!recp)
+    /* When calculating the approximate square root, compare the argument with
+       0.0 and create a mask.  */
+    emit_insn (gen_rtx_SET (xmsk, gen_rtx_NEG (mmsk, gen_rtx_EQ (mmsk, src,
+							  CONST0_RTX (mode)))));
 
-  bool double_mode = (mode == DFmode || mode == V2DFmode);
+  /* Estimate the approximate reciprocal square root.  */
+  rtx xdst = gen_reg_rtx (mode);
+  emit_insn ((*get_rsqrte_type (mode)) (xdst, src));
 
-  int iterations = double_mode ? 3 : 2;
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
 
-  /* Optionally iterate over the series one less time than otherwise.  */
-  if (flag_mrecip_low_precision_sqrt)
+  /* Optionally iterate over the series one fewer time for faster performance,
+     at the cost of accuracy.  */
+  if ((recp && flag_mrecip_low_precision_sqrt)
+      || (!recp && flag_mlow_precision_sqrt))
     iterations--;
 
-  for (int i = 0; i < iterations; ++i)
+  /* Iterate over the series to calculate the approximate reciprocal square root.  */
+  rtx x1 = gen_reg_rtx (mode);
+  while (iterations--)
     {
-      rtx x1 = gen_reg_rtx (mode);
       rtx x2 = gen_reg_rtx (mode);
-      rtx x3 = gen_reg_rtx (mode);
-      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
+      emit_set_insn (x2, gen_rtx_MULT (mode, xdst, xdst));
+
+      emit_insn ((*get_rsqrts_type (mode)) (x1, src, x2));
 
-      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
+      if (iterations > 0)
+	emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, x1));
+    }
+
+  if (!recp)
+    {
+      /* Qualify the approximate reciprocal square root when the argument is
+	 0.0 by squashing the intermediary result to 0.0.  */
+      rtx xtmp = gen_reg_rtx (mmsk);
+      emit_set_insn (xtmp, gen_rtx_AND (mmsk, gen_rtx_NOT (mmsk, xmsk),
+					      gen_rtx_SUBREG (mmsk, xdst, 0)));
+      emit_move_insn (xdst, gen_rtx_SUBREG (mode, xtmp, 0));
 
-      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
-      x0 = x1;
+      /* Calculate the approximate square root.  */
+      emit_set_insn (xdst, gen_rtx_MULT (mode, xdst, src));
     }
 
-  emit_move_insn (dst, x0);
+  /* Return the approximation.  */
+  emit_set_insn (dst, gen_rtx_MULT (mode, xdst, x1));
+
+  return true;
 }
 
 /* Return the number of instructions that can be issued per cycle.  */
@@ -8167,6 +8208,12 @@ aarch64_override_options_after_change_1 (struct gcc_options *opts)
       && (aarch64_cmodel == AARCH64_CMODEL_TINY
 	  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC))
     aarch64_nopcrelative_literal_loads = false;
+
+  /* When enabling the lower precision Newton series for the square root, also
+     enable it for the reciprocal square root, since the latter is an
+     intermediary step for the former.  */
+  if (flag_mlow_precision_sqrt)
+    flag_mrecip_low_precision_sqrt = true;
 }
 
 /* 'Unpack' up the internal tuning structs and update the options
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..43fa318 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4665,7 +4665,16 @@
   [(set_attr "type" "ffarith<s>")]
 )
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:GPF 0 "register_operand")
+        (sqrt:GPF (match_operand:GPF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+    DONE;
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:GPF 0 "register_operand" "=w")
         (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
   "TARGET_FLOAT"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c637ff4..ffd5540 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,10 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
+When calculating the approximate reciprocal square root,
+use one less step than otherwise, thus reducing latency and precision.
+
+mlow-precision-sqrt
+Common Var(flag_mlow_precision_sqrt) Optimization
+When calculating the approximate square root,
+use one less step than otherwise, thus reducing latency and precision.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 488c52c..01c3e87 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -573,6 +573,7 @@ Objective-C and Objective-C++ Dialects}.
 -mfix-cortex-a53-835769  -mno-fix-cortex-a53-835769 @gol
 -mfix-cortex-a53-843419  -mno-fix-cortex-a53-843419 @gol
 -mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt@gol
+-mlow-precision-sqrt -mno-low-precision-sqrt@gol
 -march=@var{name}  -mcpu=@var{name}  -mtune=@var{name}}
 
 @emph{Adapteva Epiphany Options}
@@ -12921,6 +12922,15 @@ uses one less step than otherwise, thus reducing latency and precision.
 This is only relevant if @option{-ffast-math} enables the reciprocal square root
 approximation.
 
+@item -mlow-precision-sqrt
+@item -mno-low-precision-sqrt
+@opindex -mlow-precision-sqrt
+@opindex -mno-low-precision-sqrt
+When calculating the square root approximation,
+use one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables the square root
+approximation.
+
 @item -march=@var{name}
 @opindex march
 Specify the name of the target architecture and, optionally, one or
-- 
1.9.1
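
The sequence emitted by aarch64_emit_approx_sqrt in the patch above, including the masking for a 0.0 argument, can likewise be sketched in plain C. Again this is only an illustration under stated assumptions: rsqrt_estimate and rsqrt_step are hypothetical stand-ins for FRSQRTE and FRSQRTS, approx_sqrt is a name made up for this note, and the estimate uses the well-known integer shift trick, which is coarser than the hardware estimate. The special case returning 1.5 for 0 * inf reflects the documented behaviour of FRSQRTS.

```c
#include <assert.h>
#include <float.h>
#include <math.h>     /* INFINITY only; no libm function is called */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for FRSQRTE: a coarse 1/sqrt(x) estimate
   (roughly 3% relative error, worse than the real instruction).  */
static float rsqrt_estimate(float x)
{
    if (x == 0.0f)
        return INFINITY;          /* the estimate of 1/sqrt(0) is +inf */
    uint32_t bits;
    float r;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Stand-in for FRSQRTS: (3 - a * b) / 2, with 0 * inf giving 1.5.  */
static float rsqrt_step(float a, float b)
{
    if (a == 0.0f && b > FLT_MAX)
        return 1.5f;
    return (3.0f - a * b) / 2.0f;
}

/* recp != 0 yields 1/sqrt(src); recp == 0 yields sqrt(src).  */
float approx_sqrt(float src, int iterations, int recp)
{
    assert(src >= 0.0f && iterations > 0);
    /* All-ones mask when the argument is 0.0, as in the emitted compare.  */
    uint32_t mask = (src == 0.0f) ? 0xFFFFFFFFu : 0u;
    float xdst = rsqrt_estimate(src);
    float x1 = 1.0f;
    while (iterations--) {
        float x2 = xdst * xdst;
        x1 = rsqrt_step(src, x2);
        if (iterations > 0)
            xdst *= x1;
    }
    if (!recp) {
        /* Squash the intermediate result to 0.0 for a 0.0 argument;
           otherwise inf * 0.0 below would produce a NaN.  */
        uint32_t bits;
        memcpy(&bits, &xdst, sizeof bits);
        bits &= ~mask;
        memcpy(&xdst, &bits, sizeof xdst);
        xdst *= src;              /* sqrt(x) = x * (1/sqrt(x)) */
    }
    return xdst * x1;
}
```

Without the mask, approx_sqrt (0.0f, 2, 0) would multiply an infinite reciprocal estimate by 0.0 and return a NaN instead of 0.0, which is the case the AND/NOT sequence in the patch guards against.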


[-- Attachment #4: 0001-AArch64-Add-more-choices-for-the-reciprocal-square-r.patch --]
[-- Type: text/x-patch, Size: 9267 bytes --]

From 428d21df1ae04ad263ddb9b0493cc40a3e566e04 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 3 Mar 2016 18:13:46 -0600
Subject: [PATCH 1/3] [AArch64] Add more choices for the reciprocal square root
 approximation

Allow a target to prefer such operation depending on the operation mode.

2016-03-03  Evandro Menezes  <e.menezes@samsung.com>

gcc/
	* config/aarch64/aarch64-protos.h
	(AARCH64_APPROX_MODE): New macro.
	(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}): Likewise.
	(tune_params): New member "approx_rsqrt_modes".
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT): Remove macro.
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_rsqrt_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(use_rsqrt_p): New argument for the mode and use new member from
	"tune_params".
	(aarch64_builtin_reciprocal): Devise mode from builtin.
	(aarch64_optab_supported_p): New argument for the mode.
	* doc/invoke.texi (-mlow-precision-recip-sqrt): Reword description.
---
 gcc/config/aarch64/aarch64-protos.h         | 27 ++++++++++++++++++++
 gcc/config/aarch64/aarch64-tuning-flags.def |  2 --
 gcc/config/aarch64/aarch64.c                | 39 ++++++++++++++++++-----------
 gcc/doc/invoke.texi                         |  2 +-
 4 files changed, 53 insertions(+), 17 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58c9d0d..fe1746b 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,32 @@ struct cpu_branch_cost
   const int unpredictable;  /* Unpredictable branch or optimizing for speed.  */
 };
 
+/* Control approximate alternatives to certain FP operators.  */
+#define AARCH64_APPROX_MODE(MODE) \
+  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT \
+	      + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
+     : (0))
+#define AARCH64_APPROX_NONE (0)
+#define AARCH64_APPROX_SP (AARCH64_APPROX_MODE (SFmode) \
+			   | AARCH64_APPROX_MODE (V2SFmode) \
+			   | AARCH64_APPROX_MODE (V4SFmode))
+#define AARCH64_APPROX_DP (AARCH64_APPROX_MODE (DFmode) \
+			   | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_DFORM (AARCH64_APPROX_MODE (SFmode) \
+			      | AARCH64_APPROX_MODE (DFmode) \
+			      | AARCH64_APPROX_MODE (V2SFmode))
+#define AARCH64_APPROX_QFORM (AARCH64_APPROX_MODE (V4SFmode) \
+			      | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_SCALAR (AARCH64_APPROX_MODE (SFmode) \
+			       | AARCH64_APPROX_MODE (DFmode))
+#define AARCH64_APPROX_VECTOR (AARCH64_APPROX_MODE (V2SFmode) \
+			       | AARCH64_APPROX_MODE (V4SFmode) \
+			       | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_ALL (-1)
+
 struct tune_params
 {
   const struct cpu_cost_table *insn_extra_cost;
@@ -218,6 +244,7 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+  unsigned int approx_rsqrt_modes;
 };
 
 #define AARCH64_FUSION_PAIR(x, name) \
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..048c2a3 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -29,5 +29,3 @@
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
-AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b7086dd..b0ee11e 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-modes.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -414,7 +415,8 @@ static const struct tune_params generic_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params cortexa35_tunings =
@@ -439,7 +441,8 @@ static const struct tune_params cortexa35_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params cortexa53_tunings =
@@ -464,7 +467,8 @@ static const struct tune_params cortexa53_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params cortexa57_tunings =
@@ -489,7 +493,8 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params cortexa72_tunings =
@@ -514,7 +519,8 @@ static const struct tune_params cortexa72_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params exynosm1_tunings =
@@ -538,7 +544,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
+  (AARCH64_APPROX_ALL) /* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -562,7 +569,8 @@ static const struct tune_params thunderx_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -586,7 +594,8 @@ static const struct tune_params xgene1_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_ALL)	/* approx_rsqrt_modes.  */
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -7452,12 +7461,12 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
    to optimize 1.0/sqrt.  */
 
 static bool
-use_rsqrt_p (void)
+use_rsqrt_p (machine_mode mode)
 {
   return (!flag_trapping_math
 	  && flag_unsafe_math_optimizations
-	  && ((aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_APPROX_RSQRT)
+	  && ((aarch64_tune_params.approx_rsqrt_modes
+	       & AARCH64_APPROX_MODE (mode))
 	      || flag_mrecip_low_precision_sqrt));
 }
 
@@ -7467,7 +7476,9 @@ use_rsqrt_p (void)
 static tree
 aarch64_builtin_reciprocal (tree fndecl)
 {
-  if (!use_rsqrt_p ())
+  machine_mode mode = TYPE_MODE (TREE_TYPE (fndecl));
+
+  if (!use_rsqrt_p (mode))
     return NULL_TREE;
   return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
 }
@@ -13964,13 +13975,13 @@ aarch64_promoted_type (const_tree t)
 /* Implement the TARGET_OPTAB_SUPPORTED_P hook.  */
 
 static bool
-aarch64_optab_supported_p (int op, machine_mode, machine_mode,
+aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 			   optimization_type opt_type)
 {
   switch (op)
     {
     case rsqrt_optab:
-      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p ();
+      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p (mode1);
 
     default:
       return true;
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index e9763d4..488c52c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12919,7 +12919,7 @@ corresponding flag to the linker.
 When calculating the reciprocal square root approximation,
 uses one less step than otherwise, thus reducing latency and precision.
 This is only relevant if @option{-ffast-math} enables the reciprocal square root
-approximation, which in turn depends on the target processor.
+approximation.
 
 @item -march=@var{name}
 @opindex march
-- 
1.9.1
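
For readers following the mode masks, here is a minimal sketch of how
AARCH64_APPROX_MODE in the patch above assigns one bit per FP mode, with the
scalar modes first and the vector modes after them. The enumeration below is
made up for illustration (GCC's real values come from insn-modes.h), and the
names approx_mode_bit and use_approx_p are hypothetical.

```c
/* Hypothetical stand-in for GCC's machine_mode enumeration.  */
enum mode
{
  MODE_SI,			/* Not an FP mode.  */
  MODE_SF, MODE_DF,		/* Scalar FP modes.  */
  MODE_V2SF, MODE_V4SF, MODE_V2DF	/* Vector FP modes.  */
};

#define MIN_FLOAT MODE_SF
#define MAX_FLOAT MODE_DF
#define MIN_VFLOAT MODE_V2SF
#define MAX_VFLOAT MODE_V2DF

/* Return the bit for mode M: scalar FP modes take the low bits, vector FP
   modes take the bits right after them, anything else gets no bit.  */
static unsigned int
approx_mode_bit (enum mode m)
{
  if (m >= MIN_FLOAT && m <= MAX_FLOAT)
    return 1u << (m - MIN_FLOAT);
  if (m >= MIN_VFLOAT && m <= MAX_VFLOAT)
    return 1u << (m - MIN_VFLOAT + MAX_FLOAT - MIN_FLOAT + 1);
  return 0;
}

/* Return nonzero if the tuning mask ENABLED selects mode M.  */
static int
use_approx_p (unsigned int enabled, enum mode m)
{
  return (enabled & approx_mode_bit (m)) != 0;
}
```

With this layout, a mask built from the SF, V2SF and V4SF bits plays the role
of AARCH64_APPROX_SP: it selects the single-precision modes and leaves DF and
V2DF on the regular instruction.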



* Re: [AArch64] Emit square root using the Newton series
  2016-04-05 22:30         ` Evandro Menezes
@ 2016-04-12 18:15           ` Evandro Menezes
  2016-04-21 18:44             ` Evandro Menezes
  2016-04-27 14:24             ` James Greenhalgh
  0 siblings, 2 replies; 38+ messages in thread
From: Evandro Menezes @ 2016-04-12 18:15 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches
  Cc: James Greenhalgh, Andrew Pinski, philipp.tomsich, Benedikt Huber

On 04/05/16 17:30, Evandro Menezes wrote:
> On 04/05/16 13:37, Wilco Dijkstra wrote:
>> I can't get any of these to work... Not only do I get a large number 
>> of collisions and duplicated
>> code between these patches, when I try to resolve them, all I get is 
>> crashes whenever I try
>> to use sqrt (even rsqrt stopped working). Do you have a patchset that 
>> applies cleanly so I can
>> try all approximation routines?
>
> The original patches should be independent of each other, so indeed 
> they duplicate code.
>
> This patch suite should be suitable for testing.

Ping^1

-- 
Evandro Menezes


* RE: [AArch64] Emit square root using the Newton series
  2016-04-12 18:15           ` Evandro Menezes
@ 2016-04-21 18:44             ` Evandro Menezes
  2016-04-27 14:24             ` James Greenhalgh
  1 sibling, 0 replies; 38+ messages in thread
From: Evandro Menezes @ 2016-04-21 18:44 UTC (permalink / raw)
  To: 'Wilco Dijkstra', 'GCC Patches'
  Cc: 'James Greenhalgh', 'Andrew Pinski',
	philipp.tomsich, 'Benedikt Huber'

> On 04/05/16 17:30, Evandro Menezes wrote:
> > On 04/05/16 13:37, Wilco Dijkstra wrote:
> >> I can't get any of these to work... Not only do I get a large number
> >> of collisions and duplicated code between these patches, when I try
> >> to resolve them, all I get is crashes whenever I try to use sqrt
> >> (even rsqrt stopped working). Do you have a patchset that applies
> >> cleanly so I can try all approximation routines?
> >
> > The original patches should be independent of each other, so indeed
> > they duplicate code.
> >
> > This patch suite should be suitable for testing.
> 
> Ping^1

Ping^2
 
--
Evandro Menezes



* Re: [AArch64] Emit square root using the Newton series
  2016-04-12 18:15           ` Evandro Menezes
  2016-04-21 18:44             ` Evandro Menezes
@ 2016-04-27 14:24             ` James Greenhalgh
  2016-04-27 15:45               ` Evandro Menezes
  1 sibling, 1 reply; 38+ messages in thread
From: James Greenhalgh @ 2016-04-27 14:24 UTC (permalink / raw)
  To: Evandro Menezes
  Cc: Wilco Dijkstra, GCC Patches, Andrew Pinski, philipp.tomsich,
	Benedikt Huber, nd

On Tue, Apr 12, 2016 at 01:14:51PM -0500, Evandro Menezes wrote:
> On 04/05/16 17:30, Evandro Menezes wrote:
> >On 04/05/16 13:37, Wilco Dijkstra wrote:
> >>I can't get any of these to work... Not only do I get a large
> >>number of collisions and duplicated
> >>code between these patches, when I try to resolve them, all I
> >>get is crashes whenever I try
> >>to use sqrt (even rsqrt stopped working). Do you have a patchset
> >>that applies cleanly so I can
> >>try all approximation routines?
> >
> >The original patches should be independent of each other, so
> >indeed they duplicate code.
> >
> >This patch suite should be suitable for testing.

Take a look at other patch sets posted to this list for examples of how
to make review easier.

Please send a series of emails tagged:

[Patch 0/3 AArch64] Add infrastructure for more approximate FP operations
[PATCH 1/3 AArch64] Add more choices for the reciprocal square root approximation
[PATCH 2/3 AArch64] Emit square root using the Newton series
[PATCH 3/3 AArch64] Emit division using the Newton series

One patch per email, with the dependencies explicit like this, is
infinitely easier to follow than the current structure of your patch set.

I'm not trying to be pedantic for the sake of it, I'm genuinely unsure where
the latest patch versions currently are and how I should apply them to a
clean tree for review.

Thanks,
James


* Re: [AArch64] Emit square root using the Newton series
  2016-04-27 14:24             ` James Greenhalgh
@ 2016-04-27 15:45               ` Evandro Menezes
  0 siblings, 0 replies; 38+ messages in thread
From: Evandro Menezes @ 2016-04-27 15:45 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Wilco Dijkstra, GCC Patches, Andrew Pinski, philipp.tomsich,
	Benedikt Huber, nd

On 04/27/16 09:23, James Greenhalgh wrote:
> On Tue, Apr 12, 2016 at 01:14:51PM -0500, Evandro Menezes wrote:
>> On 04/05/16 17:30, Evandro Menezes wrote:
>>> On 04/05/16 13:37, Wilco Dijkstra wrote:
>>>> I can't get any of these to work... Not only do I get a large
>>>> number of collisions and duplicated
>>>> code between these patches, when I try to resolve them, all I
>>>> get is crashes whenever I try
>>>> to use sqrt (even rsqrt stopped working). Do you have a patchset
>>>> that applies cleanly so I can
>>>> try all approximation routines?
>>> The original patches should be independent of each other, so
>>> indeed they duplicate code.
>>>
>>> This patch suite should be suitable for testing.
> Take a look at other patch sets posted to this list for examples of how
> to make review easier.
>
> Please send a series of emails tagged:
>
> [Patch 0/3 AArch64] Add infrastructure for more approximate FP operations
> [PATCH 1/3 AArch64] Add more choices for the reciprocal square root approximation
> [PATCH 2/3 AArch64] Emit square root using the Newton series
> [PATCH 3/3 AArch64] Emit division using the Newton series
>
> One patch per email, with the dependencies explicit like this, is
> infinitely easier to follow than the current structure of your patch set.
>
> I'm not trying to be pedantic for the sake of it, I'm genuinely unsure where
> the latest patch versions currently are and how I should apply them to a
> clean tree for review.

I can certainly create such a series, but the patch above should be 
suitable for testing.

Thank you,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-03-17 14:55         ` James Greenhalgh
@ 2016-03-17 16:25           ` Evandro Menezes
  0 siblings, 0 replies; 38+ messages in thread
From: Evandro Menezes @ 2016-03-17 16:25 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: GCC Patches, Marcus Shawcroft, Andrew Pinski, Benedikt Huber,
	philipp.tomsich, Kyrill Tkachov

On 03/17/16 09:55, James Greenhalgh wrote:
> On Wed, Mar 16, 2016 at 02:45:37PM -0500, Evandro Menezes wrote:
>> On 03/08/16 16:08, Evandro Menezes wrote:
>>> On 02/16/16 14:56, Evandro Menezes wrote:
>>>> On 12/08/15 15:35, Evandro Menezes wrote:
>>>>> Emit square root using the Newton series
>>>>>
>>>>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>>>>>
>>>>>    gcc/
>>>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>>>>    Declare new
>>>>>             function.
>>>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>>>    expansion and
>>>>>             insn definitions.
>>>>>             * config/aarch64/aarch64-tuning-flags.def
>>>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>>>>    new function.
>>>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>>>>    and insn
>>>>>             definitions.
>>>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>>>>    Expand option
>>>>>             description.
>>>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>>>>
>>>>> This patch extends the patch that added support for
>>>>> implementing x^-1/2 using the Newton series by adding support
>>>>> for x^1/2 as well.
>>>>>
>>>>> Is it OK at this point of stage 3?
>>>>>
>>>>> Thank you,
>>>>>
>>>> James,
>>>>
>>>> As I was saying, this patch results in some validation errors in
>>>> CPU2000 benchmarks using DF.  Although proving the algorithm to
>>>> be pretty solid with a vast set of random values, I'm confused
>>>> why some benchmarks fail to validate with this implementation of
>>>> the Newton series for square root too, when they pass with the
>>>> Newton series for reciprocal square root.
>>>>
>>>> Since I had no problems with the same algorithm on x86-64, I
>>>> wonder if the initial estimate on AArch64, which offers just 8
>>>> bits, whereas x86-64 offers 11 bits, has to do with it.  Then
>>>> again, the algorithm iterated 1 less time on x86-64 than on
>>>> AArch64.
>>>>
>>>> Since it seems that the initial estimate is sufficient for
>>>> CPU2000 to validate when using SF, I'm leaning towards
>>>> restricting the Newton series for square root only for SF.
>>>>
>>>> Your thoughts on the matter are appreciated,
>>>         Add choices for the reciprocal square root approximation
>>>
>>>         Allow a target to prefer such operation depending on the FP
>>>    precision.
>>>
>>>         gcc/
>>>             * config/aarch64/aarch64-protos.h
>>>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
>>>             * config/aarch64/aarch64-tuning-flags.def
>>>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
>>>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
>>>             * config/aarch64/aarch64.c
>>>             (use_rsqrt_p): New argument for the mode.
>>>             (aarch64_builtin_reciprocal): Devise mode from builtin.
>>>             (aarch64_optab_supported_p): New argument for the mode.
>>>
>>>
>>> Now that the patch is attached, feedback is appreciated.
>> Ping.
> Hi Evandro,
>
> I thought this was on hold while you looked into the underlying issue for
> the failures in the other thread? With that said, I'm struggling to keep
> up with where we are on this, so maybe it is time for a clean break - a new
> thread for patch set v2, proposed as an explicit patch series (just to keep
> the dependencies clear to me).
>
> I'm not convinced of the value of this split, nor why we would stop here
> if it was useful (vector modes vs. scalar modes would also seem an
> important distinction).
>
> If you no longer need the workaround this enables then I'm not sure I see a
> good reason for it to go in, maybe I'm missing a target for which this
> would be important?

Hi, James.

OK, I'll start a thread over.

Thank you,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-03-16 19:45       ` Evandro Menezes
@ 2016-03-17 14:55         ` James Greenhalgh
  2016-03-17 16:25           ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: James Greenhalgh @ 2016-03-17 14:55 UTC (permalink / raw)
  To: Evandro Menezes
  Cc: GCC Patches, Marcus Shawcroft, Andrew Pinski, Benedikt Huber,
	philipp.tomsich, Kyrill Tkachov

On Wed, Mar 16, 2016 at 02:45:37PM -0500, Evandro Menezes wrote:
> On 03/08/16 16:08, Evandro Menezes wrote:
> >On 02/16/16 14:56, Evandro Menezes wrote:
> >>On 12/08/15 15:35, Evandro Menezes wrote:
> >>>Emit square root using the Newton series
> >>>
> >>>   2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
> >>>
> >>>   gcc/
> >>>            * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
> >>>   Declare new
> >>>            function.
> >>>            * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
> >>>   expansion and
> >>>            insn definitions.
> >>>            * config/aarch64/aarch64-tuning-flags.def
> >>>            (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
> >>>            * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
> >>>   new function.
> >>>            * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
> >>>   and insn
> >>>            definitions.
> >>>            * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
> >>>   Expand option
> >>>            description.
> >>>            * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
> >>>
> >>>This patch extends the patch that added support for
> >>>implementing x^-1/2 using the Newton series by adding support
> >>>for x^1/2 as well.
> >>>
> >>>Is it OK at this point of stage 3?
> >>>
> >>>Thank you,
> >>>
> >>
> >>James,
> >>
> >>As I was saying, this patch results in some validation errors in
> >>CPU2000 benchmarks using DF.  Although proving the algorithm to
> >>be pretty solid with a vast set of random values, I'm confused
> >>why some benchmarks fail to validate with this implementation of
> >>the Newton series for square root too, when they pass with the
> >>Newton series for reciprocal square root.
> >>
> >>Since I had no problems with the same algorithm on x86-64, I
> >>wonder if the initial estimate on AArch64, which offers just 8
> >>bits, whereas x86-64 offers 11 bits, has to do with it.  Then
> >>again, the algorithm iterated 1 less time on x86-64 than on
> >>AArch64.
> >>
> >>Since it seems that the initial estimate is sufficient for
> >>CPU2000 to validate when using SF, I'm leaning towards
> >>restricting the Newton series for square root only for SF.
> >>
> >>Your thoughts on the matter are appreciated,
> >
> >        Add choices for the reciprocal square root approximation
> >
> >        Allow a target to prefer such operation depending on the FP
> >   precision.
> >
> >        gcc/
> >            * config/aarch64/aarch64-protos.h
> >            (AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
> >            * config/aarch64/aarch64-tuning-flags.def
> >            (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
> >            (AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
> >            * config/aarch64/aarch64.c
> >            (use_rsqrt_p): New argument for the mode.
> >            (aarch64_builtin_reciprocal): Devise mode from builtin.
> >            (aarch64_optab_supported_p): New argument for the mode.
> >
> >
> >Now that the patch is attached, feedback is appreciated.
> 
> Ping.

Hi Evandro,

I thought this was on hold while you looked into the underlying issue for
the failures in the other thread? With that said, I'm struggling to keep
up with where we are on this, so maybe it is time for a clean break - a new
thread for patch set v2, proposed as an explicit patch series (just to keep
the dependencies clear to me).

I'm not convinced of the value of this split, nor why we would stop here
if it was useful (vector modes vs. scalar modes would also seem an
important distinction).

If you no longer need the workaround this enables then I'm not sure I see a
good reason for it to go in, maybe I'm missing a target for which this
would be important?

Thanks,
James


* Re: [AArch64] Emit square root using the Newton series
  2016-03-11  1:06           ` Wilco Dijkstra
  2016-03-14 16:39             ` Evandro Menezes
@ 2016-03-16 21:44             ` Evandro Menezes
  1 sibling, 0 replies; 38+ messages in thread
From: Evandro Menezes @ 2016-03-16 21:44 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: gcc-patches, nd, Andrew Pinski, Benedikt Huber, philipp.tomsich,
	James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 2121 bytes --]

On 03/10/16 19:06, Wilco Dijkstra wrote:
> Evandro Menezes <e.menezes@samsung.com> wrote:
>> That's what I had in mind too, but around the approximation for x^-1/2
>> and using masks for vector cases thusly:
>>
>>         fcmne   v3.4s, v0.4s, #0.0
>>         frsqrte v1.4s, v0.4s
>>         fmul    v2.4s, v1.4s, v1.4s
>>         frsqrts v2.4s, v0.4s, v2.4s
>>         fmul    v1.4s, v1.4s, v2.4s
>>         fmul    v2.4s, v1.4s, v1.4s
>>         frsqrts v2.4s, v0.4s, v2.4s
>>         fmul    v1.4s, v1.4s, v2.4s
>>         and     v1.4s, v3.4s
>>         fmul    v0.4s, v1.4s, v0.4s
> That's possible but the overall latency is higher - according to exynos-1.md the
> above takes 44 cycles while my version would be 37.

         Emit square root using the Newton series

         2016-03-16  Evandro Menezes  <e.menezes@samsung.com>
                     Wilco Dijkstra  <wilco.dijkstra@arm.com>

         gcc/
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
             * config/aarch64/aarch64-protos.h
             (aarch64_emit_approx_rsqrt): Replace with
    "aarch64_emit_approx_sqrt".
             (AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
             * config/aarch64/aarch64.c
             (exynosm1_tunings): Use the new macro.
             (aarch64_emit_approx_sqrt): Define new function.
             * config/aarch64/aarch64.md
             (rsqrt<mode>2): Use new function instead.
             (sqrt<mode>2): New expansion and insn definitions.
             * config/aarch64/aarch64-simd.md: Likewise.
             * config/aarch64/aarch64.opt
             (mlow-precision-recip-sqrt): Expand option description.
             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.


This patch merges the function that emits the approximate reciprocal 
square root and the approximate square root, qualifying the latter for 
when the input argument is zero.
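
As a rough illustration only, here is a scalar C sketch of what the merged
routine computes; it is not the RTL that the patch emits. The helpers
rsqrt_estimate and rsqrt_step are hypothetical stand-ins for FRSQRTE and
FRSQRTS, and the bit-trick seed is an assumption chosen so that the sketch
runs anywhere (the real instruction uses a table lookup).

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for FRSQRTE: a crude initial estimate of 1/sqrt(x).  Like the
   instruction, return +Inf for a zero argument.  */
static float
rsqrt_estimate (float x)
{
  uint32_t bits;
  float e;

  if (x == 0.0f)
    return INFINITY;
  memcpy (&bits, &x, sizeof bits);
  bits = 0x5f3759dfu - (bits >> 1);	/* Assumed seed, not the HW table.  */
  memcpy (&e, &bits, sizeof e);
  return e;
}

/* Stand-in for FRSQRTS: (3 - a * b) / 2.  The zero-times-infinity case is
   treated as 1.5, which is how FRSQRTS is documented to behave.  */
static float
rsqrt_step (float a, float b)
{
  if ((a == 0.0f && isinf (b)) || (isinf (a) && b == 0.0f))
    return 1.5f;
  return (3.0f - a * b) / 2.0f;
}

/* Return the approximate 1/sqrt(x) if RECP is nonzero, otherwise the
   approximate sqrt(x), after ITERATIONS Newton steps.  */
float
approx_sqrt (float x, int recp, int iterations)
{
  float e = rsqrt_estimate (x);

  while (iterations--)
    e = e * rsqrt_step (x, e * e);

  if (recp)
    return e;

  /* For x == 0.0 the estimate is +Inf and 0 * Inf would be NaN, so replace
     the estimate with the argument itself before the final multiply.  */
  if (x == 0.0f)
    e = x;

  return x * e;
}
```

For example, approx_sqrt (4.0f, 0, 2) lands close to 2.0f, and
approx_sqrt (0.0f, 0, 2) yields 0.0f instead of NaN. Each Newton step roughly
doubles the number of correct bits, which is the reasoning behind two
iterations for SF and three for DF.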

It depends on the patch at 
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00534.html

I appreciate your feedback.

Thank you,

-- 
Evandro Menezes


[-- Attachment #2: 0001-Emit-square-root-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 11382 bytes --]

From a69a80da4c3feab691d3c1df28906ef195e5631d Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Wed, 16 Mar 2016 15:21:00 -0500
Subject: [PATCH] Emit square root using the Newton series

2016-03-16  Evandro Menezes  <e.menezes@samsung.com>
            Wilco Dijkstra  <wilco.dijkstra@arm.com>

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with "aarch64_emit_approx_sqrt".
	(AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
	* config/aarch64/aarch64.c
	(exynosm1_tunings): Use the new macro.
	(aarch64_emit_approx_sqrt): Define new function.
	* config/aarch64/aarch64.md
	(rsqrt<mode>2): Use new function instead.
	(sqrt<mode>2): New expansion and insn definitions.
	* config/aarch64/aarch64-simd.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-recip-sqrt): Expand option description.
	* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h         |  4 +-
 gcc/config/aarch64/aarch64-simd.md          | 27 +++++++-
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 +-
 gcc/config/aarch64/aarch64.c                | 97 +++++++++++++++++++++++------
 gcc/config/aarch64/aarch64.md               | 25 +++++++-
 gcc/config/aarch64/aarch64.opt              |  4 +-
 gcc/doc/invoke.texi                         |  9 +--
 7 files changed, 138 insertions(+), 31 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58e5d73..c9a5192 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -265,6 +265,8 @@ enum aarch64_extra_tuning_flags
 
 #define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
   (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+#define AARCH64_EXTRA_TUNE_APPROX_SQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_SQRT_DF | AARCH64_EXTRA_TUNE_APPROX_SQRT_SF)
 
 extern struct tune_params aarch64_tune_params;
 
@@ -364,7 +366,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
-void aarch64_emit_approx_rsqrt (rtx, rtx);
+void aarch64_emit_approx_sqrt (rtx, rtx, bool);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..31191bb 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -405,7 +405,7 @@
 		     UNSPEC_RSQRT))]
   "TARGET_SIMD"
 {
-  aarch64_emit_approx_rsqrt (operands[0], operands[1]);
+  aarch64_emit_approx_sqrt (operands[0], operands[1], true);
   DONE;
 })
 
@@ -4307,7 +4307,30 @@
 
 ;; sqrt
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  machine_mode mode = GET_MODE_INNER (GET_MODE (operands[1]));
+
+  if (flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations
+      && !optimize_function_for_size_p (cfun)
+      && ((mode == SFmode
+           && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_APPROX_SQRT_SF))
+          || (mode == DFmode
+              && (aarch64_tune_params.extra_tuning_flags
+                  & AARCH64_EXTRA_TUNE_APPROX_SQRT_DF))))
+    {
+      aarch64_emit_approx_sqrt (operands[0], operands[1], false);
+      DONE;
+    }
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
         (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 57d9588..b4421b1 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -31,4 +31,5 @@
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT_DF)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrtf", APPROX_RSQRT_SF)
-
+AARCH64_EXTRA_TUNING_OPTION ("approx_sqrt", APPROX_SQRT_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_sqrtf", APPROX_SQRT_SF)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 531a7e7..207c61a 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-flags.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -7529,46 +7530,102 @@ get_rsqrts_type (machine_mode mode)
   }
 }
 
-/* Emit instruction sequence to compute the reciprocal square root using the
-   Newton-Raphson series.  Iterate over the series twice for SF
-   and thrice for DF.  */
+/* Emit instruction sequence to compute either the approximate square root
+   or its approximate reciprocal.  */
 
 void
-aarch64_emit_approx_rsqrt (rtx dst, rtx src)
+aarch64_emit_approx_sqrt (rtx dst, rtx src, bool recp)
 {
   machine_mode mode = GET_MODE (src);
-  gcc_assert (
-    mode == SFmode || mode == V2SFmode || mode == V4SFmode
-	|| mode == DFmode || mode == V2DFmode);
+  machine_mode mmsk;
+
+  gcc_assert (GET_MODE_INNER (mode) == SFmode
+              || GET_MODE_INNER (mode) == DFmode);
 
   rtx xsrc = gen_reg_rtx (mode);
   emit_move_insn (xsrc, src);
-  rtx x0 = gen_reg_rtx (mode);
 
-  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
+  rtx xcc, xne, xmsk;
+  bool scalar = !VECTOR_MODE_P (mode);
+  if (!recp)
+    {
+      if (scalar)
+	{
+	  /* Compare argument with 0.0 and set the CC.  */
+	  xcc = aarch64_gen_compare_reg (NE, xsrc, CONST0_RTX (mode));
+	  xne = gen_rtx_NE (VOIDmode, xcc, const0_rtx);
+	}
+      else
+	{
+	  /* Compare the argument with 0.0 and create a vector mask.  */
+	  mmsk = mode_for_vector (int_mode_for_mode (GET_MODE_INNER (mode)),
+				  GET_MODE_NUNITS (mode));
+	  xmsk = gen_reg_rtx (mmsk);
+	  switch (mode)
+	  {
+	    case V2SFmode:
+	      emit_insn (gen_aarch64_cmeqv2sf (xmsk, xsrc, CONST0_RTX (mode)));
+	      break;
 
-  bool double_mode = (mode == DFmode || mode == V2DFmode);
+	    case V4SFmode:
+	      emit_insn (gen_aarch64_cmeqv4sf (xmsk, xsrc, CONST0_RTX (mode)));
+	      break;
 
-  int iterations = double_mode ? 3 : 2;
+	    case V2DFmode:
+	      emit_insn (gen_aarch64_cmeqv2df (xmsk, xsrc, CONST0_RTX (mode)));
+	      break;
 
-  /* Optionally iterate over the series one less time than otherwise.  */
+	    default:
+	      gcc_unreachable ();
+	  }
+	}
+    }
+
+  /* Estimate the approximate reciprocal square root.  */
+  rtx xdst = gen_reg_rtx (mode);
+  emit_insn ((*get_rsqrte_type (mode)) (xdst, xsrc));
+
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
+
+  /* Optionally iterate over the series once less for faster performance
+     while sacrificing the accuracy.  */
   if (flag_mrecip_low_precision_sqrt)
     iterations--;
 
-  for (int i = 0; i < iterations; ++i)
+  /* Iterate over the series.  */
+  while (iterations--)
     {
-      rtx x1 = gen_reg_rtx (mode);
       rtx x2 = gen_reg_rtx (mode);
-      rtx x3 = gen_reg_rtx (mode);
-      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
+      emit_set_insn (x2, gen_rtx_MULT (mode, xdst, xdst));
+
+      rtx x1 = gen_reg_rtx (mode);
+      emit_insn ((*get_rsqrts_type (mode)) (x1, xsrc, x2));
 
-      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
+      emit_set_insn (xdst, gen_rtx_MULT (mode, x1, xdst));
+    }
+
+  if (!recp)
+    {
+      /* Qualify the final estimate for the approximate reciprocal square root
+	 when the argument is 0.0.  */
+      if (scalar)
+	/* Conditionally set the final estimate to 0.0.  */
+	emit_set_insn (xdst, gen_rtx_IF_THEN_ELSE (mode, xne, xdst, xsrc));
+      else
+	{
+	  /* Mask off any final vector element estimate to 0.0.  */
+	  rtx xtmp = gen_reg_rtx (mmsk);
+	  emit_set_insn (xtmp, gen_rtx_AND (mmsk, gen_rtx_NOT (mmsk, xmsk),
+					    gen_rtx_SUBREG (mmsk, xdst, 0)));
+	  emit_move_insn (xdst, gen_rtx_SUBREG (mode, xtmp, 0));
+	}
 
-      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
-      x0 = x1;
+      /* Calculate the approximate square root.  */
+      emit_set_insn (xdst, gen_rtx_MULT (mode, xsrc, xdst));
     }
 
-  emit_move_insn (dst, x0);
+  emit_move_insn (dst, xdst);
 }
 
 /* Return the number of instructions that can be issued per cycle.  */
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..71725e7 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4665,7 +4665,30 @@
   [(set_attr "type" "ffarith<s>")]
 )
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:GPF 0 "register_operand")
+        (sqrt:GPF (match_operand:GPF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  machine_mode mode = GET_MODE_INNER (GET_MODE (operands[1]));
+
+  if (flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations
+      && !optimize_function_for_size_p (cfun)
+      && ((mode == SFmode
+           && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_APPROX_SQRT_SF))
+          || (mode == DFmode
+              && (aarch64_tune_params.extra_tuning_flags
+                  & AARCH64_EXTRA_TUNE_APPROX_SQRT_DF))))
+    {
+      aarch64_emit_approx_sqrt (operands[0], operands[1], false);
+      DONE;
+    }
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:GPF 0 "register_operand" "=w")
         (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
   "TARGET_FLOAT"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c637ff4..c5e7fc9 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,5 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
+When calculating the approximate square root or its approximate reciprocal,
+use one less step than otherwise, thus reducing latency and precision.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 99ac11b..d48c29b 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12903,10 +12903,11 @@ corresponding flag to the linker.
 @item -mno-low-precision-recip-sqrt
 @opindex -mlow-precision-recip-sqrt
 @opindex -mno-low-precision-recip-sqrt
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
-This is only relevant if @option{-ffast-math} enables the reciprocal square root
-approximation, which in turn depends on the target processor.
+When calculating the approximate square root or its approximate reciprocal,
+use one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables
+the approximate square root or its approximate reciprocal,
+which in turn depends on the target processor.
 
 @item -march=@var{name}
 @opindex march
-- 
1.9.1



* Re: [AArch64] Emit square root using the Newton series
  2016-03-08 22:08     ` Evandro Menezes
  2016-03-08 22:18       ` Evandro Menezes
@ 2016-03-16 19:45       ` Evandro Menezes
  2016-03-17 14:55         ` James Greenhalgh
  1 sibling, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-03-16 19:45 UTC (permalink / raw)
  To: GCC Patches, Marcus Shawcroft, James Greenhalgh, Andrew Pinski,
	Benedikt Huber, philipp.tomsich, Kyrill Tkachov

On 03/08/16 16:08, Evandro Menezes wrote:
> On 02/16/16 14:56, Evandro Menezes wrote:
>> On 12/08/15 15:35, Evandro Menezes wrote:
>>> Emit square root using the Newton series
>>>
>>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>>>
>>>    gcc/
>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>>    Declare new
>>>             function.
>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>    expansion and
>>>             insn definitions.
>>>             * config/aarch64/aarch64-tuning-flags.def
>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>>    new function.
>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>>    and insn
>>>             definitions.
>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>>    Expand option
>>>             description.
>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>>
>>> This patch extends the patch that added support for implementing 
>>> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>>
>>> Is it OK at this point of stage 3?
>>>
>>> Thank you,
>>>
>>
>> James,
>>
>> As I was saying, this patch results in some validation errors in
>> CPU2000 benchmarks using DF.  Although the algorithm proved to be
>> pretty solid with a vast set of random values, I'm confused as to why
>> some benchmarks fail to validate with this implementation of the
>> Newton series for square root, when they pass with the Newton series
>> for reciprocal square root.
>>
>> Since I had no problems with the same algorithm on x86-64, I wonder
>> if the initial estimate on AArch64, which offers just 8 bits, whereas
>> x86-64 offers 11 bits, has to do with it.  Then again, the algorithm
>> iterated one fewer time on x86-64 than on AArch64.
>>
>> Since it seems that the initial estimate is sufficient for CPU2000 to
>> validate when using SF, I'm leaning towards restricting the Newton
>> series for square root to SF only.
>>
>> Your thoughts on the matter are appreciated,
>
>         Add choices for the reciprocal square root approximation
>
>         Allow a target to prefer such operation depending on the FP
>    precision.
>
>         gcc/
>             * config/aarch64/aarch64-protos.h
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
>             * config/aarch64/aarch64-tuning-flags.def
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
>             * config/aarch64/aarch64.c
>             (use_rsqrt_p): New argument for the mode.
>             (aarch64_builtin_reciprocal): Devise mode from builtin.
>             (aarch64_optab_supported_p): New argument for the mode.
>
>
> Now that the patch is attached, feedback is appreciated.

Ping.

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-03-14 16:39             ` Evandro Menezes
@ 2016-03-14 19:13               ` Wilco Dijkstra
  0 siblings, 0 replies; 38+ messages in thread
From: Wilco Dijkstra @ 2016-03-14 19:13 UTC (permalink / raw)
  To: Evandro Menezes; +Cc: gcc-patches, nd

Evandro Menezes <e.menezes@samsung.com> wrote:
>
> I got the scalar version going, but I'm stuck with the vector version.
> As you can see above, I need to use the complement of the mask produced
> by FCMEQ to squelch the offending vector element. However, the way in
> which FCMEQ is defined in GCC, it produces an integer vector and the
> SIMD AND only takes integer vectors.  I'm stuck at how to pass an FP
> vector to AND and then its integer vector back to an FP insn.

You can use gen_rtx_SUBREG(mcmp, xsqrt, 0) to change the mode to an 
integer vector on the AND instruction and back to mode for the destination.
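
For illustration only, the effect of that SUBREG trick can be modelled in
plain C: the FP bits are reinterpreted as an integer, ANDed with the
complement of the compare mask, and reinterpreted back.  This is a lane-wise
sketch, not GCC RTL; the helper names (float_bits, bits_float,
cmeq_zero_mask, mask_zero_lanes) are hypothetical and do not exist in the
patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer (no value conversion),
   which is what a mode-changing SUBREG does for each vector lane.  */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a 32-bit integer as a float, the way back from the mask mode.  */
static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Lane-wise model of a compare-equal-to-zero: all ones where SRC == 0.0.  */
static uint32_t cmeq_zero_mask(float src)
{
    return src == 0.0f ? UINT32_MAX : 0u;
}

/* Zero every lane of EST whose source lane is 0.0; keep the other lanes.  */
static void mask_zero_lanes(float est[], const float src[], int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        uint32_t keep = ~cmeq_zero_mask(src[i]);   /* complement of the mask */
        est[i] = bits_float(float_bits(est[i]) & keep);
    }
}
```

ANDing with an all-zero lane mask yields +0.0 whatever the lane held before
(infinity or NaN included), which is why the mask has to be applied to the
reciprocal estimate before the final multiply by the source.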

Wilco


* Re: [AArch64] Emit square root using the Newton series
  2016-03-11  1:06           ` Wilco Dijkstra
@ 2016-03-14 16:39             ` Evandro Menezes
  2016-03-14 19:13               ` Wilco Dijkstra
  2016-03-16 21:44             ` Evandro Menezes
  1 sibling, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-03-14 16:39 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: gcc-patches, nd

On 03/10/16 19:06, Wilco Dijkstra wrote:
> Evandro Menezes <e.menezes@samsung.com> wrote:
>> That's what I had in mind too, but around the approximation for x^-1/2
>> and using masks for vector cases thusly:
>>
>>          fcmne   v3.4s, v0.4s, #0.0
>>          frsqrte v1.4s, v0.4s
>>          fmul    v2.4s, v1.4s, v1.4s
>>          frsqrts v2.4s, v0.4s, v2.4s
>>          fmul    v1.4s, v1.4s, v2.4s
>>          fmul    v2.4s, v1.4s, v1.4s
>>          frsqrts v2.4s, v0.4s, v2.4s
>>          fmul    v1.4s, v1.4s, v2.4s
>>          and     v1.4s, v3.4s
>>          fmul    v0.4s, v1.4s, v0.4s
> That's possible but the overall latency is higher - according to exynos-1.md the
> above takes 44 cycles while my version would be 37.

I'm currently working to get this prototyped without modifying the 
reciprocal square root.  Once I'm done, I'll merge both functions 
together to generate better code.

I got the scalar version going, but I'm stuck with the vector version.  
As you can see above, I need to use the complement of the mask produced 
by FCMEQ to squelch the offending vector element. However, the way in 
which FCMEQ is defined in GCC, it produces an integer vector and the 
SIMD AND only takes integer vectors.  I'm stuck at how to pass an FP 
vector to AND and then its integer vector back to an FP insn.

Here's how the function stands at the moment:

    void
    aarch64_emit_approx_sqrt (rtx dst, rtx src)
    {
       machine_mode mode = GET_MODE (src);
       gcc_assert (GET_MODE_INNER (mode) == SFmode
                   || GET_MODE_INNER (mode) == DFmode);

       bool scalar = !VECTOR_MODE_P (mode);
       bool narrow = (mode == V2SFmode);

       rtx xsrc = gen_reg_rtx (mode);
       emit_move_insn (xsrc, src);

       rtx xcc, xne, xmsk;
       if (scalar)
         {
           /* fcmp */
           xcc = aarch64_gen_compare_reg (NE, xsrc, CONST0_RTX (mode));
           xne = gen_rtx_NE (VOIDmode, xcc, const0_rtx);
         }
       else
         {
           machine_mode mcmp
             = mode_for_vector (int_mode_for_mode (GET_MODE_INNER (mode)),
                                GET_MODE_NUNITS (mode));
           /* fcmne */
           xmsk = gen_reg_rtx (mode);
           /* Just V4SF for now */
           emit_insn (gen_aarch64_cmeqv4sf (xmsk, xsrc, CONST0_RTX (mode)));
           /* TODO: must use the complement of this result.  */
         }

       /* Calculate the approximate reciprocal square root.  */
       rtx xrsqrt = gen_reg_rtx (mode);
       aarch64_emit_approx_rsqrt (xrsqrt, xsrc);

       /* Calculate the approximate square root.  */
       rtx xsqrt = gen_reg_rtx (mode);
       emit_set_insn (xsqrt, gen_rtx_MULT (mode, xrsqrt, xsrc));

       /* Qualify the result for when the input is zero.  */
       rtx xdst = gen_reg_rtx (mode);
       if (scalar)
         /* fcsel */
         emit_set_insn (xdst,
                        gen_rtx_IF_THEN_ELSE (mode, xne, xsqrt, xsrc));
       else
         /* and */
         emit_set_insn (xdst, gen_rtx_AND (mode, xsqrt, xmsk));

       emit_move_insn (dst, xdst);
    }

Any help is welcome.

Thank you,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-03-10 22:15         ` Evandro Menezes
@ 2016-03-11  1:06           ` Wilco Dijkstra
  2016-03-14 16:39             ` Evandro Menezes
  2016-03-16 21:44             ` Evandro Menezes
  0 siblings, 2 replies; 38+ messages in thread
From: Wilco Dijkstra @ 2016-03-11  1:06 UTC (permalink / raw)
  To: Evandro Menezes; +Cc: gcc-patches, nd

Evandro Menezes <e.menezes@samsung.com> wrote:
>
> That's what I had in mind too, but around the approximation for x^-1/2
> and using masks for vector cases thusly:
>
>         fcmne   v3.4s, v0.4s, #0.0
>         frsqrte v1.4s, v0.4s
>         fmul    v2.4s, v1.4s, v1.4s
>         frsqrts v2.4s, v0.4s, v2.4s
>         fmul    v1.4s, v1.4s, v2.4s
>         fmul    v2.4s, v1.4s, v1.4s
>         frsqrts v2.4s, v0.4s, v2.4s
>         fmul    v1.4s, v1.4s, v2.4s
>         and     v1.4s, v3.4s
>         fmul    v0.4s, v1.4s, v0.4s

That's possible but the overall latency is higher - according to exynos-1.md the
above takes 44 cycles while my version would be 37.

Wilco


* Re: [AArch64] Emit square root using the Newton series
  2016-03-10 19:10       ` Wilco Dijkstra
@ 2016-03-10 22:15         ` Evandro Menezes
  2016-03-11  1:06           ` Wilco Dijkstra
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-03-10 22:15 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: gcc-patches, nd

On 03/10/16 13:10, Wilco Dijkstra wrote:
>      frsqrte  s1, s0
>      fmul     s2, s1, s1
>      frsqrts  s2, s0, s2
>      fcmp     s0, 0.0
>      fmul     s1, s1, s2
>      fmul     s2, s1, s1
>      fmul     s1, s0, s1
>      frsqrts  s2, s0, s2
>      fcsel    s1, s0, s1, eq
>      fmul     s0, s1, s2

That's what I had in mind too, but around the approximation for x^-1/2 
and using masks for vector cases thusly:

         fcmne   v3.4s, v0.4s, #0.0
         frsqrte v1.4s, v0.4s
         fmul    v2.4s, v1.4s, v1.4s
         frsqrts v2.4s, v0.4s, v2.4s
         fmul    v1.4s, v1.4s, v2.4s
         fmul    v2.4s, v1.4s, v1.4s
         frsqrts v2.4s, v0.4s, v2.4s
         fmul    v1.4s, v1.4s, v2.4s
         and     v1.4s, v3.4s
         fmul    v0.4s, v1.4s, v0.4s


Thanks,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-03-10 16:58     ` Evandro Menezes
@ 2016-03-10 19:10       ` Wilco Dijkstra
  2016-03-10 22:15         ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Wilco Dijkstra @ 2016-03-10 19:10 UTC (permalink / raw)
  To: Evandro Menezes; +Cc: gcc-patches, nd


On 03/10/16 10:52, Wilco Dijkstra wrote:
> Hi Evandro,
>
>> I have however encountered precision issues with DF, namely some benchmarks in the SPECfp CPU2000 suite would fail to validate.
> Accuracy is not an issue, the computation is extremely accurate. The issue is that your patch doesn't support sqrt(0.0) - it returns NaN rather than zero, and that causes the miscompares you're seeing. So support for the zero case should be added.
>
> This would be a better expansion, supporting zero, and with lower latency than the current sequence:

Now that I think of it, frsqrts returns 1.5 for the zero case, so we only need
to fix up the estimated sqrt value before the final multiply. Since an
FCSEL/VAND can be hidden completely behind the latency of frsqrts, both the
scalar and the vector case could do this:

    frsqrte  s1, s0
    fmul     s2, s1, s1
    frsqrts  s2, s0, s2
    fcmp     s0, 0.0
    fmul     s1, s1, s2
    fmul     s2, s1, s1
    fmul     s1, s0, s1
    frsqrts  s2, s0, s2
    fcsel    s1, s0, s1, eq
    fmul     s0, s1, s2

Wilco




* Re: [AArch64] Emit square root using the Newton series
  2016-03-10 16:52   ` Wilco Dijkstra
@ 2016-03-10 16:58     ` Evandro Menezes
  2016-03-10 19:10       ` Wilco Dijkstra
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-03-10 16:58 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: gcc-patches, nd

On 03/10/16 10:52, Wilco Dijkstra wrote:
> Hi Evandro,
>
>> I have however encountered precision issues with DF, namely some benchmarks in the SPECfp CPU2000 suite would fail to validate.
> Accuracy is not an issue, the computation is extremely accurate. The issue is that your patch doesn't support sqrt(0.0) - it returns NaN rather than zero, and that causes the miscompares you're seeing. So support for the zero case should be added.
>
> This would be a better expansion, supporting zero, and with lower latency than the current sequence:
>
>      fcmp     s0, 0.0
>      beq      zero
>      frsqrte  s1, s0
>      fmul     s2, s1, s1
>      frsqrts  s2, s0, s2
>      fmul     s1, s1, s2
>      fmul     s2, s1, s1
>      fmul     s1, s0, s1
>      frsqrts  s2, s0, s2
>      fmul     s0, s1, s2
> zero:
>
> For the vector variant you can't avoid the extra latency of an AND, but it should not be slower than it is today.

Thanks for the pointer, Wilco.  I'll work it into the patch.

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
       [not found] ` <011d01d17a26$31b3ade0$951b09a0$@samsung.com>
@ 2016-03-10 16:52   ` Wilco Dijkstra
  2016-03-10 16:58     ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Wilco Dijkstra @ 2016-03-10 16:52 UTC (permalink / raw)
  To: Evandro Menezes; +Cc: gcc-patches, nd

Hi Evandro,

> I have however encountered precision issues with DF, namely some benchmarks in the SPECfp CPU2000 suite would fail to validate. 

Accuracy is not an issue, the computation is extremely accurate. The issue is that your patch doesn't support sqrt(0.0) - it returns NaN rather than zero, and that causes the miscompares you're seeing. So support for the zero case should be added.

This would be a better expansion, supporting zero, and with lower latency than the current sequence:

    fcmp     s0, 0.0
    beq      zero
    frsqrte  s1, s0
    fmul     s2, s1, s1
    frsqrts  s2, s0, s2
    fmul     s1, s1, s2
    fmul     s2, s1, s1
    fmul     s1, s0, s1
    frsqrts  s2, s0, s2
    fmul     s0, s1, s2
zero:

For the vector variant you can't avoid the extra latency of an AND, but it should not be slower than it is today.

Cheers,
Wilco


* Re: [AArch64] Emit square root using the Newton series
  2016-03-08 22:18       ` Evandro Menezes
@ 2016-03-08 22:20         ` Evandro Menezes
  0 siblings, 0 replies; 38+ messages in thread
From: Evandro Menezes @ 2016-03-08 22:20 UTC (permalink / raw)
  To: GCC Patches, Marcus Shawcroft, James Greenhalgh, Andrew Pinski,
	Benedikt Huber, philipp.tomsich, Kyrill Tkachov

[-- Attachment #1: Type: text/plain, Size: 4377 bytes --]

On 03/08/16 16:08, Evandro Menezes wrote:
> On 02/16/16 14:56, Evandro Menezes wrote:
>> On 12/08/15 15:35, Evandro Menezes wrote:
>>> Emit square root using the Newton series
>>>
>>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>>>
>>>    gcc/
>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>>    Declare new
>>>             function.
>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>    expansion and
>>>             insn definitions.
>>>             * config/aarch64/aarch64-tuning-flags.def
>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>>    new function.
>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>>    and insn
>>>             definitions.
>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>>    Expand option
>>>             description.
>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>>
>>> This patch extends the patch that added support for implementing 
>>> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>>
>>> Is it OK at this point of stage 3?
>>>
>>> Thank you,
>>>
>>
>> James,
>>
>> As I was saying, this patch results in some validation errors in
>> CPU2000 benchmarks using DF.  Although the algorithm proved to be
>> pretty solid with a vast set of random values, I'm confused as to why
>> some benchmarks fail to validate with this implementation of the
>> Newton series for square root, when they pass with the Newton series
>> for reciprocal square root.
>>
>> Since I had no problems with the same algorithm on x86-64, I wonder
>> if the initial estimate on AArch64, which offers just 8 bits, whereas
>> x86-64 offers 11 bits, has to do with it.  Then again, the algorithm
>> iterated one fewer time on x86-64 than on AArch64.
>>
>> Since it seems that the initial estimate is sufficient for CPU2000 to
>> validate when using SF, I'm leaning towards restricting the Newton
>> series for square root to SF only.
>>
>> Your thoughts on the matter are appreciated,
>
>         Add choices for the reciprocal square root approximation
>
>         Allow a target to prefer such operation depending on the FP
>    precision.
>
>         gcc/
>             * config/aarch64/aarch64-protos.h
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
>             * config/aarch64/aarch64-tuning-flags.def
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
>             * config/aarch64/aarch64.c
>             (use_rsqrt_p): New argument for the mode.
>             (aarch64_builtin_reciprocal): Devise mode from builtin.
>             (aarch64_optab_supported_p): New argument for the mode.

         Emit square root using the Newton series

         gcc/
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_SQRT_{DF,SF}): New tuning macros.
             * config/aarch64/aarch64-protos.h
             (aarch64_emit_approx_sqrt): Declare new function.
             * config/aarch64/aarch64.c
             (aarch64_emit_approx_sqrt): Define new function.
             * config/aarch64/aarch64.md
             (sqrt*2): New expansion and insn definitions.
             * config/aarch64/aarch64-simd.md (sqrt*2): Likewise.
             * config/aarch64/aarch64.opt
             (mlow-precision-recip-sqrt): Expand option description.
             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.


This patch, which depends on 
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00534.html, leverages the 
reciprocal square root approximation to emit a faster square root 
approximation.

I have, however, encountered precision issues with DF: some benchmarks
in the SPECfp CPU2000 suite fail to validate.  Perhaps the initial
estimate, with just 8 bits, is not good enough for the series to
converge given the workloads of such benchmarks; perhaps denormals,
known to occur in some of these benchmarks, result in errors.  This was
the motivation for splitting the tuning flags into one specific to DF
and another specific to SF in the previous related patch.

Again, now with the patch attached, your feedback is appreciated.

Thank you,

-- 
Evandro Menezes


[-- Attachment #2: 0001-Emit-square-root-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 8885 bytes --]

From 4f61f722f744339650a48aa034906dd685110ae2 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Tue, 8 Mar 2016 15:06:03 -0600
Subject: [PATCH] Emit square root using the Newton series

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_SQRT_{DF,SF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_sqrt): Declare new function.
	* config/aarch64/aarch64.c
	(aarch64_emit_approx_sqrt): Define new function.
	* config/aarch64/aarch64.md
	(sqrt*2): New expansion and insn definitions.
	* config/aarch64/aarch64-simd.md (sqrt*2): Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-recip-sqrt): Expand option description.
	* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h         |  3 +++
 gcc/config/aarch64/aarch64-simd.md          | 25 ++++++++++++++++++++-
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 ++-
 gcc/config/aarch64/aarch64.c                | 35 ++++++++++++++++++++++++-----
 gcc/config/aarch64/aarch64.md               | 25 ++++++++++++++++++++-
 gcc/config/aarch64/aarch64.opt              |  4 ++--
 gcc/doc/invoke.texi                         |  9 ++++----
 7 files changed, 89 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ee3505c..3f7e76b 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -265,6 +265,8 @@ enum aarch64_extra_tuning_flags
 
 #define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
   (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+#define AARCH64_EXTRA_TUNE_APPROX_SQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_SQRT_DF | AARCH64_EXTRA_TUNE_APPROX_SQRT_SF)
 
 extern struct tune_params aarch64_tune_params;
 
@@ -364,6 +366,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_emit_approx_rsqrt (rtx, rtx);
+void aarch64_emit_approx_sqrt (rtx, rtx);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..afeca5a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4307,7 +4307,30 @@
 
 ;; sqrt
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  machine_mode mode = GET_MODE_INNER (GET_MODE (operands[1]));
+
+  if (flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations
+      && !optimize_function_for_size_p (cfun)
+      && ((mode == SFmode
+           && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_APPROX_SQRT_SF))
+          || (mode == DFmode
+              && (aarch64_tune_params.extra_tuning_flags
+                  & AARCH64_EXTRA_TUNE_APPROX_SQRT_DF))))
+    {
+      aarch64_emit_approx_sqrt (operands[0], operands[1]);
+      DONE;
+    }
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
         (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 57d9588..b4421b1 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -31,4 +31,5 @@
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT_DF)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrtf", APPROX_RSQRT_SF)
-
+AARCH64_EXTRA_TUNING_OPTION ("approx_sqrt", APPROX_SQRT_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_sqrtf", APPROX_SQRT_SF)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 39a1a47..5e5dc5f 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -538,7 +538,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_SQRT_SF
+   | AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -7537,9 +7538,8 @@ void
 aarch64_emit_approx_rsqrt (rtx dst, rtx src)
 {
   machine_mode mode = GET_MODE (src);
-  gcc_assert (
-    mode == SFmode || mode == V2SFmode || mode == V4SFmode
-	|| mode == DFmode || mode == V2DFmode);
+  gcc_assert (GET_MODE_INNER (mode) == SFmode
+              || GET_MODE_INNER (mode) == DFmode);
 
   rtx xsrc = gen_reg_rtx (mode);
   emit_move_insn (xsrc, src);
@@ -7547,8 +7547,7 @@ aarch64_emit_approx_rsqrt (rtx dst, rtx src)
 
   emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
 
-  bool double_mode = (mode == DFmode || mode == V2DFmode);
-
+  bool double_mode = (GET_MODE_INNER (mode) == DFmode);
   int iterations = double_mode ? 3 : 2;
 
   /* Optionally iterate over the series one less time than otherwise.  */
@@ -7571,6 +7570,30 @@ aarch64_emit_approx_rsqrt (rtx dst, rtx src)
   emit_move_insn (dst, x0);
 }
 
+/* Emit instruction sequence to compute the approximate square root.  */
+
+void
+aarch64_emit_approx_sqrt (rtx dst, rtx src)
+{
+  machine_mode mode = GET_MODE (src);
+  gcc_assert (GET_MODE_INNER (mode) == SFmode
+              || GET_MODE_INNER (mode) == DFmode);
+
+  rtx xsrc = gen_reg_rtx (mode);
+  emit_move_insn (xsrc, src);
+
+  /* Calculate the approximate square root by multiplying the approximate
+     reciprocal square root...  */
+  rtx xrsqrt = gen_reg_rtx (mode);
+  aarch64_emit_approx_rsqrt (xrsqrt, xsrc);
+
+  /* ... by the original value.  */
+  rtx xsqrt = gen_reg_rtx (mode);
+  emit_set_insn (xsqrt, gen_rtx_MULT (mode, xrsqrt, xsrc));
+
+  emit_move_insn (dst, xsqrt);
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..bd9947a 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4665,7 +4665,30 @@
   [(set_attr "type" "ffarith<s>")]
 )
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:GPF 0 "register_operand")
+        (sqrt:GPF (match_operand:GPF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  machine_mode mode = GET_MODE_INNER (GET_MODE (operands[1]));
+
+  if (flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations
+      && !optimize_function_for_size_p (cfun)
+      && ((mode == SFmode
+           && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_APPROX_SQRT_SF))
+          || (mode == DFmode
+              && (aarch64_tune_params.extra_tuning_flags
+                  & AARCH64_EXTRA_TUNE_APPROX_SQRT_DF))))
+    {
+      aarch64_emit_approx_sqrt (operands[0], operands[1]);
+      DONE;
+    }
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:GPF 0 "register_operand" "=w")
         (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
   "TARGET_FLOAT"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 49ef0c6..8bb12d6 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,5 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
+When calculating the approximate square root or its approximate reciprocal,
+use one less step than otherwise, thus reducing latency and precision.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 62c70d5..24ad1f3 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12887,10 +12887,11 @@ corresponding flag to the linker.
 @item -mno-low-precision-recip-sqrt
 @opindex -mlow-precision-recip-sqrt
 @opindex -mno-low-precision-recip-sqrt
-When calculating the reciprocal square root approximation,
-uses one less step than otherwise, thus reducing latency and precision.
-This is only relevant if @option{-ffast-math} enables the reciprocal square root
-approximation, which in turn depends on the target processor.
+When calculating the approximate square root or its approximate reciprocal,
+use one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables
+the approximate square root or its approximate reciprocal,
+which in turn depends on the target processor.
 
 @item -march=@var{name}
 @opindex march
-- 
2.6.3



* Re: [AArch64] Emit square root using the Newton series
  2016-03-08 22:08     ` Evandro Menezes
@ 2016-03-08 22:18       ` Evandro Menezes
  2016-03-08 22:20         ` Evandro Menezes
  2016-03-16 19:45       ` Evandro Menezes
  1 sibling, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-03-08 22:18 UTC (permalink / raw)
  To: GCC Patches, Marcus Shawcroft, James Greenhalgh, Andrew Pinski,
	Benedikt Huber, philipp.tomsich, Kyrill Tkachov

On 03/08/16 16:08, Evandro Menezes wrote:
> On 02/16/16 14:56, Evandro Menezes wrote:
>> On 12/08/15 15:35, Evandro Menezes wrote:
>>> Emit square root using the Newton series
>>>
>>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>>>
>>>    gcc/
>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>>    Declare new
>>>             function.
>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>    expansion and
>>>             insn definitions.
>>>             * config/aarch64/aarch64-tuning-flags.def
>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>>    new function.
>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>>    and insn
>>>             definitions.
>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>>    Expand option
>>>             description.
>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>>
>>> This patch extends the patch that added support for implementing 
>>> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>>
>>> Is it OK at this point of stage 3?
>>>
>>> Thank you,
>>>
>>
>> James,
>>
>> As I was saying, this patch results in some validation errors in 
>> CPU2000 benchmarks using DF.  Although proving the algorithm to be 
>> pretty solid with a vast set of random values, I'm confused why some 
>> benchmarks fail to validate with this implementation of the Newton 
>> series for square root too, when they pass with the Newton series for 
>> reciprocal square root.
>>
>> Since I had no problems with the same algorithm on x86-64, I wonder 
>> if the initial estimate on AArch64, which offers just 8 bits, whereas 
>> x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
>> iterated 1 less time on x86-64 than on AArch64.
>>
>> Since it seems that the initial estimate is sufficient for CPU2000 to 
>> validate when using SF, I'm leaning towards restricting the Newton 
>> series for square root only for SF.
>>
>> Your thoughts on the matter are appreciated,
>
>         Add choices for the reciprocal square root approximation
>
>         Allow a target to prefer such operation depending on the FP precision.
>
>         gcc/
>             * config/aarch64/aarch64-protos.h
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
>             * config/aarch64/aarch64-tuning-flags.def
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
>             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
>             * config/aarch64/aarch64.c
>             (use_rsqrt_p): New argument for the mode.
>             (aarch64_builtin_reciprocal): Devise mode from builtin.
>             (aarch64_optab_supported_p): New argument for the mode.

         Emit square root using the Newton series

         gcc/
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_SQRT_{DF,SF}): New tuning macros.
             * config/aarch64/aarch64-protos.h
             (aarch64_emit_approx_sqrt): Declare new function.
             * config/aarch64/aarch64.c
             (aarch64_emit_approx_sqrt): Define new function.
             * config/aarch64/aarch64.md
             (sqrt*2): New expansion and insn definitions.
             * config/aarch64/aarch64-simd.md (sqrt*2): Likewise.
             * config/aarch64/aarch64.opt
             (mlow-precision-recip-sqrt): Expand option description.
             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.


This patch, which depends on 
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00534.html, leverages the 
reciprocal square root approximation to emit a faster square root 
approximation.

I have, however, encountered precision issues with DF: some 
benchmarks in the SPECfp CPU2000 suite fail to validate.  Perhaps 
the initial estimate, with just 8 bits, is not good enough for the 
series to converge given the workloads of such benchmarks; perhaps 
denormals, known to occur in some of these benchmarks, result in 
errors.  This was the motivation for splitting the tuning flag in the 
previous related patch into one flag for DF and another for SF.

Again, your feedback is appreciated.

Thank you,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-03-04  0:22   ` Evandro Menezes
@ 2016-03-08 22:08     ` Evandro Menezes
  2016-03-08 22:18       ` Evandro Menezes
  2016-03-16 19:45       ` Evandro Menezes
  0 siblings, 2 replies; 38+ messages in thread
From: Evandro Menezes @ 2016-03-08 22:08 UTC (permalink / raw)
  To: GCC Patches, Marcus Shawcroft, James Greenhalgh, Andrew Pinski,
	Benedikt Huber, philipp.tomsich, Kyrill Tkachov

[-- Attachment #1: Type: text/plain, Size: 2836 bytes --]

On 02/16/16 14:56, Evandro Menezes wrote:
> On 12/08/15 15:35, Evandro Menezes wrote:
>> Emit square root using the Newton series
>>
>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>>
>>    gcc/
>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>    Declare new
>>             function.
>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>    expansion and
>>             insn definitions.
>>             * config/aarch64/aarch64-tuning-flags.def
>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>    new function.
>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>    and insn
>>             definitions.
>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>    Expand option
>>             description.
>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>
>> This patch extends the patch that added support for implementing 
>> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>
>> Is it OK at this point of stage 3?
>>
>> Thank you,
>>
>
> James,
>
> As I was saying, this patch results in some validation errors in 
> CPU2000 benchmarks using DF.  Although proving the algorithm to be 
> pretty solid with a vast set of random values, I'm confused why some 
> benchmarks fail to validate with this implementation of the Newton 
> series for square root too, when they pass with the Newton series for 
> reciprocal square root.
>
> Since I had no problems with the same algorithm on x86-64, I wonder if 
> the initial estimate on AArch64, which offers just 8 bits, whereas 
> x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
> iterated 1 less time on x86-64 than on AArch64.
>
> Since it seems that the initial estimate is sufficient for CPU2000 to 
> validate when using SF, I'm leaning towards restricting the Newton 
> series for square root only for SF.
>
> Your thoughts on the matter are appreciated,

         Add choices for the reciprocal square root approximation

         Allow a target to prefer such operation depending on the FP precision.

         gcc/
             * config/aarch64/aarch64-protos.h
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
             * config/aarch64/aarch64.c
             (use_rsqrt_p): New argument for the mode.
             (aarch64_builtin_reciprocal): Devise mode from builtin.
             (aarch64_optab_supported_p): New argument for the mode.
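
To illustrate the per-precision choice, here is a self-contained sketch 
of the mask test.  The names below are simplified, hypothetical 
stand-ins for the GCC enumerators and for the machine mode, not the 
actual definitions, and the checks on trapping math and unsafe math 
optimizations that the patch also performs are left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the tuning masks.  */
enum
{
  TUNE_APPROX_RSQRT_DF = 1 << 0,
  TUNE_APPROX_RSQRT_SF = 1 << 1,
  TUNE_APPROX_RSQRT = TUNE_APPROX_RSQRT_DF | TUNE_APPROX_RSQRT_SF
};

/* Hypothetical stand-in for the scalar FP mode of the operation.  */
enum fp_mode { FP_SF, FP_DF };

/* Return true if the approximation is preferred for MODE, either because
   the target's tuning flags ask for it at that precision or because the
   low-precision option was given.  */
static bool
use_rsqrt_for_mode (unsigned tuning_flags, enum fp_mode mode,
		    bool low_precision_option)
{
  return ((mode == FP_SF && (tuning_flags & TUNE_APPROX_RSQRT_SF))
	  || (mode == FP_DF && (tuning_flags & TUNE_APPROX_RSQRT_DF))
	  || low_precision_option);
}
```

A target that sets only the SF mask in this sketch gets the 
approximation for single precision, while double precision keeps using 
the regular instruction.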


Now that the patch is attached, feedback is appreciated.

Thank you,


-- 
Evandro Menezes


[-- Attachment #2: 0001-Add-choices-for-the-reciprocal-square-root-approxima.patch --]
[-- Type: text/x-patch, Size: 3848 bytes --]

From 0bb413550e854c81cc5ab180a3afdd43cd4faf0b Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 3 Mar 2016 18:13:46 -0600
Subject: [PATCH] Add choices for the reciprocal square root approximation

Allow a target to prefer such operation depending on the FP precision.

gcc/
	* config/aarch64/aarch64-protos.h
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
	* config/aarch64/aarch64.c
	(use_rsqrt_p): New argument for the mode.
	(aarch64_builtin_reciprocal): Devise mode from builtin.
	(aarch64_optab_supported_p): New argument for the mode.
---
 gcc/config/aarch64/aarch64-protos.h         |  3 +++
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 ++-
 gcc/config/aarch64/aarch64.c                | 23 +++++++++++++++--------
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index acf2062..ee3505c 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..57d9588 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -29,5 +29,6 @@
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
-AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrtf", APPROX_RSQRT_SF)
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 801f95a..39a1a47 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7464,12 +7464,16 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
    to optimize 1.0/sqrt.  */
 
 static bool
-use_rsqrt_p (void)
+use_rsqrt_p (machine_mode mode)
 {
   return (!flag_trapping_math
 	  && flag_unsafe_math_optimizations
-	  && ((aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_APPROX_RSQRT)
+	  && ((GET_MODE_INNER (mode) == SFmode
+	       && (aarch64_tune_params.extra_tuning_flags
+		   & AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF))
+	      || (GET_MODE_INNER (mode) == DFmode
+		  && (aarch64_tune_params.extra_tuning_flags
+		      & AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF))
 	      || flag_mrecip_low_precision_sqrt));
 }
 
@@ -7479,9 +7483,12 @@ use_rsqrt_p (void)
 static tree
 aarch64_builtin_reciprocal (tree fndecl)
 {
-  if (!use_rsqrt_p ())
-    return NULL_TREE;
-  return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (fndecl));
+
+  if (use_rsqrt_p (mode))
+    return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
+
+  return NULL_TREE;
 }
 
 typedef rtx (*rsqrte_type) (rtx, rtx);
@@ -13960,13 +13967,13 @@ aarch64_promoted_type (const_tree t)
 /* Implement the TARGET_OPTAB_SUPPORTED_P hook.  */
 
 static bool
-aarch64_optab_supported_p (int op, machine_mode, machine_mode,
+aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 			   optimization_type opt_type)
 {
   switch (op)
     {
     case rsqrt_optab:
-      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p ();
+      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p (mode1);
 
     default:
       return true;
-- 
2.6.3



* Re: [AArch64] Emit square root using the Newton series
  2016-02-16 20:56 ` Evandro Menezes
@ 2016-03-04  0:22   ` Evandro Menezes
  2016-03-08 22:08     ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-03-04  0:22 UTC (permalink / raw)
  To: GCC Patches, Marcus Shawcroft, James Greenhalgh, Andrew Pinski,
	Benedikt Huber, philipp.tomsich

On 02/16/16 14:56, Evandro Menezes wrote:
> On 12/08/15 15:35, Evandro Menezes wrote:
>> Emit square root using the Newton series
>>
>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>>
>>    gcc/
>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>    Declare new
>>             function.
>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>    expansion and
>>             insn definitions.
>>             * config/aarch64/aarch64-tuning-flags.def
>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>    new function.
>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>    and insn
>>             definitions.
>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>    Expand option
>>             description.
>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>
>> This patch extends the patch that added support for implementing 
>> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>
>> Is it OK at this point of stage 3?
>>
>> Thank you,
>>
>
> James,
>
> As I was saying, this patch results in some validation errors in 
> CPU2000 benchmarks using DF.  Although proving the algorithm to be 
> pretty solid with a vast set of random values, I'm confused why some 
> benchmarks fail to validate with this implementation of the Newton 
> series for square root too, when they pass with the Newton series for 
> reciprocal square root.
>
> Since I had no problems with the same algorithm on x86-64, I wonder if 
> the initial estimate on AArch64, which offers just 8 bits, whereas 
> x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
> iterated 1 less time on x86-64 than on AArch64.
>
> Since it seems that the initial estimate is sufficient for CPU2000 to 
> validate when using SF, I'm leaning towards restricting the Newton 
> series for square root only for SF.
>
> Your thoughts on the matter are appreciated,

         Add choices for the reciprocal square root approximation

         Allow a target to prefer such operation depending on the FP precision.

         gcc/
             * config/aarch64/aarch64-protos.h
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
             * config/aarch64/aarch64.c
             (use_rsqrt_p): New argument for the mode.
             (aarch64_builtin_reciprocal): Devise mode from builtin.
             (aarch64_optab_supported_p): New argument for the mode.


Feedback appreciated.

Thank you,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-02-26 23:42                 ` Evandro Menezes
@ 2016-02-26 23:46                   ` Evandro Menezes
  0 siblings, 0 replies; 38+ messages in thread
From: Evandro Menezes @ 2016-02-26 23:46 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Kyrill Tkachov, GCC Patches, Marcus Shawcroft, Andrew Pinski,
	Benedikt Huber, philipp.tomsich

On 02/26/16 17:42, Evandro Menezes wrote:
> On 02/26/16 08:59, James Greenhalgh wrote:
>> On Mon, Feb 22, 2016 at 06:50:44PM -0600, Evandro Menezes wrote:
>>> In preparation for the patch adding the Newton series also for
>>> square root, I'd like to propose this patch changing the name of the
>>> existing tuning flag for the reciprocal square root.
>> This is fine, other names like sw_rsqrt, expand_rsqrt, nr_rsqrt would 
>> also
>> be OK. Pick your favourite!
>>
>> One comment on the replacement invoke.texi text below, otherwise this is
>> OK to apply now.
>>
>>> diff --git a/gcc/config/aarch64/aarch64.opt 
>>> b/gcc/config/aarch64/aarch64.opt
>>> index 5cbd4cd..155d2bd 100644
>>> --- a/gcc/config/aarch64/aarch64.opt
>>> +++ b/gcc/config/aarch64/aarch64.opt
>>> @@ -151,5 +151,5 @@ PC relative literal loads.
>>>     mlow-precision-recip-sqrt
>>>   Common Var(flag_mrecip_low_precision_sqrt) Optimization
>>> -When calculating a sqrt approximation, run fewer steps.
>>> +Calculate the reciprocal square-root approximation in fewer steps.
>>>   This reduces precision, but can result in faster computation.
>>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>>> index 490df93..eeff24d 100644
>>> --- a/gcc/doc/invoke.texi
>>> +++ b/gcc/doc/invoke.texi
>>> @@ -12879,12 +12879,10 @@ corresponding flag to the linker.
>>>   @item -mno-low-precision-recip-sqrt
>>>   @opindex -mlow-precision-recip-sqrt
>>>   @opindex -mno-low-precision-recip-sqrt
>>> -The square root estimate uses two steps instead of three for 
>>> double-precision,
>>> -and one step instead of two for single-precision.
>>> -Thus reducing latency and precision.
>>> -This is only relevant if @option{-ffast-math} activates
>>> -reciprocal square root estimate instructions.
>>> -Which in turn depends on the target processor.
>>> +The reciprocal square root approximation uses one step less than 
>>> otherwise,
>>> +thus reducing latency and precision.
>> When calculating the reciprocal square root approximation, use one less
>> step than otherwise, thus reducing latency and precision.
>>
>
> Checked in as r233772.

But not without some log hiccups, sorry...

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-02-26 15:00               ` James Greenhalgh
@ 2016-02-26 23:42                 ` Evandro Menezes
  2016-02-26 23:46                   ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-02-26 23:42 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Kyrill Tkachov, GCC Patches, Marcus Shawcroft, Andrew Pinski,
	Benedikt Huber, philipp.tomsich

On 02/26/16 08:59, James Greenhalgh wrote:
> On Mon, Feb 22, 2016 at 06:50:44PM -0600, Evandro Menezes wrote:
>> In preparation for the patch adding the Newton series also for
>> square root, I'd like to propose this patch changing the name of the
>> existing tuning flag for the reciprocal square root.
> This is fine, other names like sw_rsqrt, expand_rsqrt, nr_rsqrt would also
> be OK. Pick your favourite!
>
> One comment on the replacement invoke.texi text below, otherwise this is
> OK to apply now.
>
>> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
>> index 5cbd4cd..155d2bd 100644
>> --- a/gcc/config/aarch64/aarch64.opt
>> +++ b/gcc/config/aarch64/aarch64.opt
>> @@ -151,5 +151,5 @@ PC relative literal loads.
>>   
>>   mlow-precision-recip-sqrt
>>   Common Var(flag_mrecip_low_precision_sqrt) Optimization
>> -When calculating a sqrt approximation, run fewer steps.
>> +Calculate the reciprocal square-root approximation in fewer steps.
>>   This reduces precision, but can result in faster computation.
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index 490df93..eeff24d 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -12879,12 +12879,10 @@ corresponding flag to the linker.
>>   @item -mno-low-precision-recip-sqrt
>>   @opindex -mlow-precision-recip-sqrt
>>   @opindex -mno-low-precision-recip-sqrt
>> -The square root estimate uses two steps instead of three for double-precision,
>> -and one step instead of two for single-precision.
>> -Thus reducing latency and precision.
>> -This is only relevant if @option{-ffast-math} activates
>> -reciprocal square root estimate instructions.
>> -Which in turn depends on the target processor.
>> +The reciprocal square root approximation uses one step less than otherwise,
>> +thus reducing latency and precision.
> When calculating the reciprocal square root approximation, use one less
> step than otherwise, thus reducing latency and precision.
>

Checked in as r233772.

Thank you,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2016-02-23  0:50             ` Evandro Menezes
@ 2016-02-26 15:00               ` James Greenhalgh
  2016-02-26 23:42                 ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: James Greenhalgh @ 2016-02-26 15:00 UTC (permalink / raw)
  To: Evandro Menezes
  Cc: Kyrill Tkachov, GCC Patches, Marcus Shawcroft, Andrew Pinski,
	Benedikt Huber, philipp.tomsich

On Mon, Feb 22, 2016 at 06:50:44PM -0600, Evandro Menezes wrote:
> In preparation for the patch adding the Newton series also for
> square root, I'd like to propose this patch changing the name of the
> existing tuning flag for the reciprocal square root.

This is fine, other names like sw_rsqrt, expand_rsqrt, nr_rsqrt would also
be OK. Pick your favourite!

One comment on the replacement invoke.texi text below, otherwise this is
OK to apply now.

> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 5cbd4cd..155d2bd 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -151,5 +151,5 @@ PC relative literal loads.
>  
>  mlow-precision-recip-sqrt
>  Common Var(flag_mrecip_low_precision_sqrt) Optimization
> -When calculating a sqrt approximation, run fewer steps.
> +Calculate the reciprocal square-root approximation in fewer steps.
>  This reduces precision, but can result in faster computation.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 490df93..eeff24d 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -12879,12 +12879,10 @@ corresponding flag to the linker.
>  @item -mno-low-precision-recip-sqrt
>  @opindex -mlow-precision-recip-sqrt
>  @opindex -mno-low-precision-recip-sqrt
> -The square root estimate uses two steps instead of three for double-precision,
> -and one step instead of two for single-precision.
> -Thus reducing latency and precision.
> -This is only relevant if @option{-ffast-math} activates
> -reciprocal square root estimate instructions.
> -Which in turn depends on the target processor.
> +The reciprocal square root approximation uses one step less than otherwise,
> +thus reducing latency and precision.

When calculating the reciprocal square root approximation, use one less
step than otherwise, thus reducing latency and precision.
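
For reference, the step counts that the superseded text spelled out 
(three steps for double precision and two for single precision, each 
reduced by one under this option) can be sketched as below.  The 
function names are hypothetical, and the bit estimate only assumes that 
each Newton step roughly doubles the number of correct bits, ignoring 
rounding.

```c
#include <stdbool.h>

/* Number of Newton steps, per the counts quoted in the removed text.  */
static int
newton_steps (bool is_double, bool low_precision)
{
  int steps = is_double ? 3 : 2;

  return low_precision ? steps - 1 : steps;
}

/* Rough upper bound on the correct bits after STEPS steps, assuming the
   count doubles with every step.  */
static int
approx_correct_bits (int estimate_bits, int steps)
{
  return estimate_bits << steps;
}
```

With an 8-bit estimate this gives about 32 bits after two steps and 64 
after three, which is consistent with two steps sufficing for the 24-bit 
significand of SF but not for the 53-bit significand of DF.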

Thanks,
James


* Re: [AArch64] Emit square root using the Newton series
  2015-12-10 10:30           ` Kyrill Tkachov
@ 2016-02-23  0:50             ` Evandro Menezes
  2016-02-26 15:00               ` James Greenhalgh
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-02-23  0:50 UTC (permalink / raw)
  To: Kyrill Tkachov, GCC Patches, Marcus Shawcroft, James Greenhalgh,
	Andrew Pinski, Benedikt Huber, philipp.tomsich

[-- Attachment #1: Type: text/plain, Size: 5199 bytes --]

On 12/10/15 04:30, Kyrill Tkachov wrote:
>
> On 09/12/15 18:50, Evandro Menezes wrote:
>> On 12/09/2015 11:16 AM, Kyrill Tkachov wrote:
>>>
>>> On 09/12/15 17:02, Kyrill Tkachov wrote:
>>>>
>>>> On 09/12/15 16:59, Evandro Menezes wrote:
>>>>> On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:
>>>>>> Hi Evandro,
>>>>>>
>>>>>> On 08/12/15 21:35, Evandro Menezes wrote:
>>>>>>> Emit square root using the Newton series
>>>>>>>
>>>>>>>    2015-12-03  Evandro Menezes <e.menezes@samsung.com>
>>>>>>>
>>>>>>>    gcc/
>>>>>>>             * config/aarch64/aarch64-protos.h 
>>>>>>> (aarch64_emit_swsqrt):
>>>>>>>    Declare new
>>>>>>>             function.
>>>>>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>>>>>    expansion and
>>>>>>>             insn definitions.
>>>>>>>             * config/aarch64/aarch64-tuning-flags.def
>>>>>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>>>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): 
>>>>>>> Define
>>>>>>>    new function.
>>>>>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New 
>>>>>>> expansion
>>>>>>>    and insn
>>>>>>>             definitions.
>>>>>>>             * config/aarch64/aarch64.opt 
>>>>>>> (mlow-precision-recip-sqrt):
>>>>>>>    Expand option
>>>>>>>             description.
>>>>>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): 
>>>>>>> Likewise.
>>>>>>>
>>>>>>> This patch extends the patch that added support for implementing 
>>>>>>> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>>>>>>
>>>>>>> Is it OK at this point of stage 3?
>>>>>>>
>>>>>>
>>>>>> A comment on the patch itself from me...
>>>>>>
>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> index 6f7dbce..11c6c9a 100644
>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> @@ -30,4 +30,4 @@
>>>>>>
>>>>>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>>>>>>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
>>>>>> -
>>>>>> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)
>>>>>>
>>>>>> That seems like a misleading name to me.
>>>>>> If we're doing this, that means that the sqrt instruction is not 
>>>>>> faster
>>>>>> than doing the inverse sqrt estimation followed by a multiply.
>>>>>> I think a name like "synth_sqrt" or "estimate_sqrt" or something 
>>>>>> along those lines
>>>>>> is more appropriate.
>>>>>
>>>>> Unfortunately, this is the case on Exynos M1: the series is faster 
>>>>> than the instruction. :-(  So, for other targets where this is also 
>>>>> true, using the "fast_sqrt" option might make sense.
>>>>>
>>>>
>>>> Sure, but the way your patch is written, we will emit the series 
>>>> when "fast_sqrt" is set, rather
>>>> than the other way around, unless I'm misreading the logic in:
>>>>
>>>
>>> Sorry, what I meant to say is it would be clearer, IMO, to describe 
>>> the compiler action that is being taken
>>> (e.g. the rename_fma_regs tuning flag), in this case it's estimating 
>>> sqrt using a series.
>>>
>>> Kyrill
>>>
>>>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>>>> b/gcc/config/aarch64/aarch64-simd.md
>>>> index 030a101..f6d2da4 100644
>>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>>> @@ -4280,7 +4280,23 @@
>>>>
>>>>  ;; sqrt
>>>>
>>>> -(define_insn "sqrt<mode>2"
>>>> +(define_expand "sqrt<mode>2"
>>>> +  [(set (match_operand:VDQF 0 "register_operand")
>>>> +    (sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
>>>> +  "TARGET_SIMD"
>>>> +{
>>>> +  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & 
>>>> aarch64_tune_params.extra_tuning_flags)
>>>> +      && !optimize_function_for_size_p (cfun)
>>>> +      && flag_finite_math_only
>>>> +      && !flag_trapping_math
>>>> +      && flag_unsafe_math_optimizations)
>>>> +    {
>>>> +      aarch64_emit_swsqrt (operands[0], operands[1]);
>>>> +      DONE;
>>>> +    }
>>>> +})
>>>>
>>
>> Kyrill,
>>
>> How about "approx_sqrt" for, you guessed it, approximate square 
>> root?  The same adjective would perhaps describe "recip_sqrt" better 
>> too.
>>
>
> Sounds good to me.
> Sorry for the bikeshedding.

         Rename the reciprocal square root tuning flag

         Rename the tuning option to enable the Newton series for the
         reciprocal square root to reflect its approximative characteristic.

         2016-02-22  Evandro Menezes  <e.menezes@samsung.com>

         gcc/
             * config/aarch64/aarch64-tuning-flags.def: Rename tuning flag to
             AARCH64_EXTRA_TUNE_APPROX_RSQRT.
             * config/aarch64/aarch64.c (xgene1_tunings): Use new name.
             (use_rsqrt_p): Likewise.
             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt): Reword
             the text explaining this option.
             * doc/invoke.texi (-mlow-precision-recip-sqrt): Likewise.


In preparation for the patch adding the Newton series also for square 
root, I'd like to propose this patch changing the name of the existing 
tuning flag for the reciprocal square root.

Thank you,

-- 
Evandro Menezes


[-- Attachment #2: 0001-Rename-the-reciprocal-square-root-tuning-flag.patch --]
[-- Type: text/x-patch, Size: 3806 bytes --]

From 7043444f83c12de0ab50627a8b386e3070050591 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Mon, 22 Feb 2016 17:49:09 -0600
Subject: [PATCH] Rename the reciprocal square root tuning flag

Rename the tuning option to enable the Newton series for the reciprocal square
root to reflect its approximative characteristic.

2016-02-22  Evandro Menezes  <e.menezes@samsung.com>

gcc/
	* config/aarch64/aarch64-tuning-flags.def: Rename tuning flag to
	AARCH64_EXTRA_TUNE_APPROX_RSQRT.
	* config/aarch64/aarch64.c (xgene1_tunings): Use new name.
	(use_rsqrt_p): Likewise.
	* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt): Reword the
	text explaining this option.
	* doc/invoke.texi (-mlow-precision-recip-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-tuning-flags.def |  2 +-
 gcc/config/aarch64/aarch64.c                |  4 ++--
 gcc/config/aarch64/aarch64.opt              |  2 +-
 gcc/doc/invoke.texi                         | 10 ++++------
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 8036cfe..7e45a0c 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -29,5 +29,5 @@
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
-AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 923a4b3..ebf47da 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -586,7 +586,7 @@ static const struct tune_params xgene1_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -7469,7 +7469,7 @@ use_rsqrt_p (void)
   return (!flag_trapping_math
 	  && flag_unsafe_math_optimizations
 	  && ((aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_RECIP_SQRT)
+	       & AARCH64_EXTRA_TUNE_APPROX_RSQRT)
 	      || flag_mrecip_low_precision_sqrt));
 }
 
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 5cbd4cd..155d2bd 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,5 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating a sqrt approximation, run fewer steps.
+Calculate the reciprocal square-root approximation in fewer steps.
 This reduces precision, but can result in faster computation.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 490df93..eeff24d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12879,12 +12879,10 @@ corresponding flag to the linker.
 @item -mno-low-precision-recip-sqrt
 @opindex -mlow-precision-recip-sqrt
 @opindex -mno-low-precision-recip-sqrt
-The square root estimate uses two steps instead of three for double-precision,
-and one step instead of two for single-precision.
-Thus reducing latency and precision.
-This is only relevant if @option{-ffast-math} activates
-reciprocal square root estimate instructions.
-Which in turn depends on the target processor.
+The reciprocal square root approximation uses one step less than otherwise,
+thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables the reciprocal square root
+approximation, which in turn depends on the target processor.
 
 @item -march=@var{name}
 @opindex march
-- 
2.6.3



* Re: [AArch64] Emit square root using the Newton series
  2015-12-08 21:35 Evandro Menezes
  2015-12-09 14:05 ` Marcus Shawcroft
  2015-12-09 16:52 ` Kyrill Tkachov
@ 2016-02-16 20:56 ` Evandro Menezes
  2016-03-04  0:22   ` Evandro Menezes
  2 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2016-02-16 20:56 UTC (permalink / raw)
  To: GCC Patches, Marcus Shawcroft, James Greenhalgh, Andrew Pinski,
	Benedikt Huber, philipp.tomsich

On 12/08/15 15:35, Evandro Menezes wrote:
> Emit square root using the Newton series
>
>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>
>    gcc/
>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>    Declare new
>             function.
>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>    expansion and
>             insn definitions.
>             * config/aarch64/aarch64-tuning-flags.def
>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>    new function.
>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>    and insn
>             definitions.
>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>    Expand option
>             description.
>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>
> This patch extends the patch that added support for implementing 
> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>
> Is it OK at this point of stage 3?
>
> Thank you,
>

James,

As I was saying, this patch results in some validation errors in CPU2000 
benchmarks using DF.  Although the algorithm proved to be pretty solid 
with a vast set of random values, I'm confused as to why some benchmarks 
fail to validate with this implementation of the Newton series for 
square root, when they pass with the Newton series for reciprocal 
square root.

Since I had no problems with the same algorithm on x86-64, I wonder if 
the initial estimate has something to do with it: it offers just 8 bits 
on AArch64, whereas x86-64 offers 11 bits.  Then again, the algorithm 
iterated one fewer time on x86-64 than on AArch64.

Since it seems that the initial estimate is sufficient for CPU2000 to 
validate when using SF, I'm leaning towards restricting the Newton 
series for square root to SF only.
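
For illustration (this sketch is not part of the patch): the Newton step
for 1/sqrt(x) converges quadratically, so each step roughly doubles the
number of correct bits in the estimate.  Starting from about 8 bits, two
steps should be about enough for the 24-bit SF significand, while the
53-bit DF significand needs a third.  The helper names below are
hypothetical, and the caller-supplied y0 merely stands in for the
hardware estimate.

```c
#include <assert.h>

/* One Newton step for f(y) = 1/(y*y) - x, whose root is y = 1/sqrt(x):
   y' = y * (3 - x*y*y) / 2.  */
static double
rsqrt_step (double x, double y)
{
  return y * (3.0 - x * y * y) * 0.5;
}

/* Refine the initial estimate Y0 of 1/sqrt(X) with STEPS Newton steps.
   Y0 is an assumption standing in for the hardware estimate.  */
static double
rsqrt_refine (double x, double y0, int steps)
{
  assert (x > 0.0 && y0 > 0.0 && steps >= 0);
  double y = y0;
  for (int i = 0; i < steps; i++)
    y = rsqrt_step (x, y);
  return y;
}

/* Residual |x*y*y - 1|, which is roughly twice the relative error of Y
   as an approximation of 1/sqrt(X).  */
static double
rsqrt_residual (double x, double y)
{
  double r = x * y * y - 1.0;
  return r < 0.0 ? -r : r;
}
```

With x = 2.0 and y0 = 0.71 (an estimate good to roughly 8 bits), the
residual after two steps is on the order of 1e-9, which is below SF
precision but far from DF precision; a third step brings it down to the
DF rounding level.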

Your thoughts on the matter are appreciated,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2015-12-09 18:50         ` Evandro Menezes
@ 2015-12-10 10:30           ` Kyrill Tkachov
  2016-02-23  0:50             ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Kyrill Tkachov @ 2015-12-10 10:30 UTC (permalink / raw)
  To: Evandro Menezes, GCC Patches, Marcus Shawcroft, James Greenhalgh,
	Andrew Pinski, Benedikt Huber, philipp.tomsich


On 09/12/15 18:50, Evandro Menezes wrote:
> On 12/09/2015 11:16 AM, Kyrill Tkachov wrote:
>>
>> On 09/12/15 17:02, Kyrill Tkachov wrote:
>>>
>>> On 09/12/15 16:59, Evandro Menezes wrote:
>>>> On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:
>>>>> Hi Evandro,
>>>>>
>>>>> On 08/12/15 21:35, Evandro Menezes wrote:
>>>>>> Emit square root using the Newton series
>>>>>>
>>>>>>    2015-12-03  Evandro Menezes <e.menezes@samsung.com>
>>>>>>
>>>>>>    gcc/
>>>>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>>>>>    Declare new
>>>>>>             function.
>>>>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>>>>    expansion and
>>>>>>             insn definitions.
>>>>>>             * config/aarch64/aarch64-tuning-flags.def
>>>>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>>>>>    new function.
>>>>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>>>>>    and insn
>>>>>>             definitions.
>>>>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>>>>>    Expand option
>>>>>>             description.
>>>>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>>>>>
>>>>>> This patch extends the patch that added support for implementing x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>>>>>
>>>>>> Is it OK at this point of stage 3?
>>>>>>
>>>>>
>>>>> A comment on the patch itself from me...
>>>>>
>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> index 6f7dbce..11c6c9a 100644
>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> @@ -30,4 +30,4 @@
>>>>>
>>>>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>>>>>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
>>>>> -
>>>>> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)
>>>>>
>>>>> That seems like a misleading name to me.
>>>>> If we're doing this, that means that the sqrt instruction is not faster
>>>>> than doing the inverse sqrt estimation followed by a multiply.
>>>>> I think a name like "synth_sqrt" or "estimate_sqrt" or something along those lines
>>>>> is more appropriate.
>>>>
>>>> Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. :-(  So, for other targets where this is also true, using the "fast_sqrt" option might make sense.
>>>>
>>>
>>> Sure, but the way your patch is written, we will emit the series when "fast_sqrt" is set, rather
>>> than the other way around, unless I'm misreading the logic in:
>>>
>>
>> Sorry, what I meant to say is it would be clearer, IMO, to describe the compiler action that is being taken
>> (e.g. the rename_fma_regs tuning flag), in this case it's estimating sqrt using a series.
>>
>> Kyrill
>>
>>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>>> index 030a101..f6d2da4 100644
>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>> @@ -4280,7 +4280,23 @@
>>>
>>>  ;; sqrt
>>>
>>> -(define_insn "sqrt<mode>2"
>>> +(define_expand "sqrt<mode>2"
>>> +  [(set (match_operand:VDQF 0 "register_operand")
>>> +    (sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
>>> +  "TARGET_SIMD"
>>> +{
>>> +  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
>>> +      && !optimize_function_for_size_p (cfun)
>>> +      && flag_finite_math_only
>>> +      && !flag_trapping_math
>>> +      && flag_unsafe_math_optimizations)
>>> +    {
>>> +      aarch64_emit_swsqrt (operands[0], operands[1]);
>>> +      DONE;
>>> +    }
>>> +})
>>>
>
> Kyrill,
>
> How about "approx_sqrt" for, you guessed it, approximate square root?  The same adjective would perhaps describe "recip_sqrt" better too.
>

Sounds good to me.
Sorry for the bikeshedding.

Kyrill

> Thanks,
>


* Re: [AArch64] Emit square root using the Newton series
  2015-12-09 17:16       ` Kyrill Tkachov
@ 2015-12-09 18:50         ` Evandro Menezes
  2015-12-10 10:30           ` Kyrill Tkachov
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2015-12-09 18:50 UTC (permalink / raw)
  To: Kyrill Tkachov, GCC Patches, Marcus Shawcroft, James Greenhalgh,
	Andrew Pinski, Benedikt Huber, philipp.tomsich

On 12/09/2015 11:16 AM, Kyrill Tkachov wrote:
>
> On 09/12/15 17:02, Kyrill Tkachov wrote:
>>
>> On 09/12/15 16:59, Evandro Menezes wrote:
>>> On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:
>>>> Hi Evandro,
>>>>
>>>> On 08/12/15 21:35, Evandro Menezes wrote:
>>>>> Emit square root using the Newton series
>>>>>
>>>>>    2015-12-03  Evandro Menezes <e.menezes@samsung.com>
>>>>>
>>>>>    gcc/
>>>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>>>>    Declare new
>>>>>             function.
>>>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>>>    expansion and
>>>>>             insn definitions.
>>>>>             * config/aarch64/aarch64-tuning-flags.def
>>>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>>>>    new function.
>>>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>>>>    and insn
>>>>>             definitions.
>>>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>>>>    Expand option
>>>>>             description.
>>>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>>>>
>>>>> This patch extends the patch that added support for implementing 
>>>>> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>>>>
>>>>> Is it OK at this point of stage 3?
>>>>>
>>>>
>>>> A comment on the patch itself from me...
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> index 6f7dbce..11c6c9a 100644
>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> @@ -30,4 +30,4 @@
>>>>
>>>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>>>>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
>>>> -
>>>> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)
>>>>
>>>> That seems like a misleading name to me.
>>>> If we're doing this, that means that the sqrt instruction is not 
>>>> faster
>>>> than doing the inverse sqrt estimation followed by a multiply.
>>>> I think a name like "synth_sqrt" or "estimate_sqrt" or something 
>>>> along those lines
>>>> is more appropriate.
>>>
>>> Unfortunately, this is the case on Exynos M1: the series is faster 
>>> than the instruction. :-(  So, for other targets where this is also 
>>> true, using the "fast_sqrt" option might make sense.
>>>
>>
>> Sure, but the way your patch is written, we will emit the series when 
>> "fast_sqrt" is set, rather
>> than the other way around, unless I'm misreading the logic in:
>>
>
> Sorry, what I meant to say is it would be clearer, IMO, to describe 
> the compiler action that is being taken
> (e.g. the rename_fma_regs tuning flag), in this case it's estimating 
> sqrt using a series.
>
> Kyrill
>
>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>> b/gcc/config/aarch64/aarch64-simd.md
>> index 030a101..f6d2da4 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -4280,7 +4280,23 @@
>>
>>  ;; sqrt
>>
>> -(define_insn "sqrt<mode>2"
>> +(define_expand "sqrt<mode>2"
>> +  [(set (match_operand:VDQF 0 "register_operand")
>> +    (sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
>> +  "TARGET_SIMD"
>> +{
>> +  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & 
>> aarch64_tune_params.extra_tuning_flags)
>> +      && !optimize_function_for_size_p (cfun)
>> +      && flag_finite_math_only
>> +      && !flag_trapping_math
>> +      && flag_unsafe_math_optimizations)
>> +    {
>> +      aarch64_emit_swsqrt (operands[0], operands[1]);
>> +      DONE;
>> +    }
>> +})
>>

Kyrill,

How about "approx_sqrt" for, you guessed it, approximate square root?  
The same adjective would perhaps describe "recip_sqrt" better too.

Thanks,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2015-12-09 17:03     ` Kyrill Tkachov
@ 2015-12-09 17:16       ` Kyrill Tkachov
  2015-12-09 18:50         ` Evandro Menezes
  0 siblings, 1 reply; 38+ messages in thread
From: Kyrill Tkachov @ 2015-12-09 17:16 UTC (permalink / raw)
  To: Evandro Menezes, GCC Patches, Marcus Shawcroft, James Greenhalgh,
	Andrew Pinski, Benedikt Huber, philipp.tomsich


On 09/12/15 17:02, Kyrill Tkachov wrote:
>
> On 09/12/15 16:59, Evandro Menezes wrote:
>> On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:
>>> Hi Evandro,
>>>
>>> On 08/12/15 21:35, Evandro Menezes wrote:
>>>> Emit square root using the Newton series
>>>>
>>>>    2015-12-03  Evandro Menezes <e.menezes@samsung.com>
>>>>
>>>>    gcc/
>>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>>>    Declare new
>>>>             function.
>>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>>    expansion and
>>>>             insn definitions.
>>>>             * config/aarch64/aarch64-tuning-flags.def
>>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>>>    new function.
>>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>>>    and insn
>>>>             definitions.
>>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>>>    Expand option
>>>>             description.
>>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>>>
>>>> This patch extends the patch that added support for implementing x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>>>
>>>> Is it OK at this point of stage 3?
>>>>
>>>
>>> A comment on the patch itself from me...
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
>>> index 6f7dbce..11c6c9a 100644
>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>> @@ -30,4 +30,4 @@
>>>
>>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>>>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
>>> -
>>> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)
>>>
>>> That seems like a misleading name to me.
>>> If we're doing this, that means that the sqrt instruction is not faster
>>> than doing the inverse sqrt estimation followed by a multiply.
>>> I think a name like "synth_sqrt" or "estimate_sqrt" or something along those lines
>>> is more appropriate.
>>
>> Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. :-(  So, for other targets where this is also true, using the "fast_sqrt" option might make sense.
>>
>
> Sure, but the way your patch is written, we will emit the series when "fast_sqrt" is set, rather
> than the other way around, unless I'm misreading the logic in:
>

Sorry, what I meant to say is that it would be clearer, IMO, to describe the compiler action that is being taken
(as the rename_fma_regs tuning flag does); in this case, it's estimating sqrt using a series.

Kyrill

> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 030a101..f6d2da4 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4280,7 +4280,23 @@
>
>  ;; sqrt
>
> -(define_insn "sqrt<mode>2"
> +(define_expand "sqrt<mode>2"
> +  [(set (match_operand:VDQF 0 "register_operand")
> +    (sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
> +  "TARGET_SIMD"
> +{
> +  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
> +      && !optimize_function_for_size_p (cfun)
> +      && flag_finite_math_only
> +      && !flag_trapping_math
> +      && flag_unsafe_math_optimizations)
> +    {
> +      aarch64_emit_swsqrt (operands[0], operands[1]);
> +      DONE;
> +    }
> +})
>
>
> Thanks,
> Kyrill
>
>


* Re: [AArch64] Emit square root using the Newton series
  2015-12-09 16:59   ` Evandro Menezes
@ 2015-12-09 17:03     ` Kyrill Tkachov
  2015-12-09 17:16       ` Kyrill Tkachov
  0 siblings, 1 reply; 38+ messages in thread
From: Kyrill Tkachov @ 2015-12-09 17:03 UTC (permalink / raw)
  To: Evandro Menezes, GCC Patches, Marcus Shawcroft, James Greenhalgh,
	Andrew Pinski, Benedikt Huber, philipp.tomsich


On 09/12/15 16:59, Evandro Menezes wrote:
> On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:
>> Hi Evandro,
>>
>> On 08/12/15 21:35, Evandro Menezes wrote:
>>> Emit square root using the Newton series
>>>
>>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>>>
>>>    gcc/
>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>>    Declare new
>>>             function.
>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>>    expansion and
>>>             insn definitions.
>>>             * config/aarch64/aarch64-tuning-flags.def
>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>>    new function.
>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>>    and insn
>>>             definitions.
>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>>    Expand option
>>>             description.
>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>>
>>> This patch extends the patch that added support for implementing x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>>
>>> Is it OK at this point of stage 3?
>>>
>>
>> A comment on the patch itself from me...
>>
>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
>> index 6f7dbce..11c6c9a 100644
>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>> @@ -30,4 +30,4 @@
>>
>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
>> -
>> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)
>>
>> That seems like a misleading name to me.
>> If we're doing this, that means that the sqrt instruction is not faster
>> than doing the inverse sqrt estimation followed by a multiply.
>> I think a name like "synth_sqrt" or "estimate_sqrt" or something along those lines
>> is more appropriate.
>
> Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. :-(  So, for other targets where this is also true, using the "fast_sqrt" option might make sense.
>

Sure, but the way your patch is written, we will emit the series when "fast_sqrt" is set, rather
than the other way around, unless I'm misreading the logic in:

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 030a101..f6d2da4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4280,7 +4280,23 @@
  
  ;; sqrt
  
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
+      && !optimize_function_for_size_p (cfun)
+      && flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations)
+    {
+      aarch64_emit_swsqrt (operands[0], operands[1]);
+      DONE;
+    }
+})
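
Read as plain C, the guard in the excerpt above amounts to the predicate
sketched here.  This is only an illustration: the enum values and
struct math_flags are hypothetical stand-ins for the compiler state,
and only the shape of the condition is taken from the excerpt.  It
makes explicit that the series is emitted when the tuning bit is set and
the math flags allow it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; in GCC the real ones are derived from
   aarch64-tuning-flags.def.  */
enum
{
  TUNE_RENAME_FMA_REGS = 1 << 0,
  TUNE_RECIP_SQRT = 1 << 1,
  TUNE_FAST_SQRT = 1 << 2
};

/* Hypothetical stand-ins for the state tested in the expander.  */
struct math_flags
{
  bool optimize_for_size;		/* optimize_function_for_size_p (cfun) */
  bool finite_math_only;		/* flag_finite_math_only */
  bool trapping_math;			/* flag_trapping_math */
  bool unsafe_math_optimizations;	/* flag_unsafe_math_optimizations */
};

/* Return true when the guard would pick the Newton series over the
   sqrt instruction.  */
static bool
use_sqrt_series_p (unsigned int tuning_flags, const struct math_flags *f)
{
  assert (f);
  return (tuning_flags & TUNE_FAST_SQRT) != 0
	 && !f->optimize_for_size
	 && f->finite_math_only
	 && !f->trapping_math
	 && f->unsafe_math_optimizations;
}
```

So a tuning that sets the bit only gets the series when the function is
not optimized for size and the math flags in the guard all agree;
otherwise the expander falls through to the instruction.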


Thanks,
Kyrill



* Re: [AArch64] Emit square root using the Newton series
  2015-12-09 16:52 ` Kyrill Tkachov
@ 2015-12-09 16:59   ` Evandro Menezes
  2015-12-09 17:03     ` Kyrill Tkachov
  0 siblings, 1 reply; 38+ messages in thread
From: Evandro Menezes @ 2015-12-09 16:59 UTC (permalink / raw)
  To: Kyrill Tkachov, GCC Patches, Marcus Shawcroft, James Greenhalgh,
	Andrew Pinski, Benedikt Huber, philipp.tomsich

On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:
> Hi Evandro,
>
> On 08/12/15 21:35, Evandro Menezes wrote:
>> Emit square root using the Newton series
>>
>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>>
>>    gcc/
>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>>    Declare new
>>             function.
>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>>    expansion and
>>             insn definitions.
>>             * config/aarch64/aarch64-tuning-flags.def
>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>>    new function.
>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>>    and insn
>>             definitions.
>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>>    Expand option
>>             description.
>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>>
>> This patch extends the patch that added support for implementing 
>> x^-1/2 using the Newton series by adding support for x^1/2 as well.
>>
>> Is it OK at this point of stage 3?
>>
>
> A comment on the patch itself from me...
>
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index 6f7dbce..11c6c9a 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -30,4 +30,4 @@
>
>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
> -
> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)
>
> That seems like a misleading name to me.
> If we're doing this, that means that the sqrt instruction is not faster
> than doing the inverse sqrt estimation followed by a multiply.
> I think a name like "synth_sqrt" or "estimate_sqrt" or something along 
> those lines
> is more appropriate.

Unfortunately, this is the case on Exynos M1: the series is faster than 
the instruction. :-(  So, for other targets where this is also true, 
using the "fast_sqrt" option might make sense.

Thank you,

-- 
Evandro Menezes


* Re: [AArch64] Emit square root using the Newton series
  2015-12-08 21:35 Evandro Menezes
  2015-12-09 14:05 ` Marcus Shawcroft
@ 2015-12-09 16:52 ` Kyrill Tkachov
  2015-12-09 16:59   ` Evandro Menezes
  2016-02-16 20:56 ` Evandro Menezes
  2 siblings, 1 reply; 38+ messages in thread
From: Kyrill Tkachov @ 2015-12-09 16:52 UTC (permalink / raw)
  To: Evandro Menezes, GCC Patches, Marcus Shawcroft, James Greenhalgh,
	Andrew Pinski, Benedikt Huber, philipp.tomsich

Hi Evandro,

On 08/12/15 21:35, Evandro Menezes wrote:
> Emit square root using the Newton series
>
>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>
>    gcc/
>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>    Declare new
>             function.
>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>    expansion and
>             insn definitions.
>             * config/aarch64/aarch64-tuning-flags.def
>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>    new function.
>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>    and insn
>             definitions.
>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>    Expand option
>             description.
>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>
> This patch extends the patch that added support for implementing x^-1/2 using the Newton series by adding support for x^1/2 as well.
>
> Is it OK at this point of stage 3?
>

A comment on the patch itself from me...

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@
  
  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

That seems like a misleading name to me.
If we're doing this, that means that the sqrt instruction is not faster
than doing the inverse sqrt estimation followed by a multiply.
I think a name like "synth_sqrt" or "estimate_sqrt" or something along those lines
is more appropriate.
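
To make "inverse sqrt estimation followed by a multiply" concrete, here
is a minimal sketch in C.  It is not taken from the patch: the name
approx_sqrt is hypothetical, and the caller-supplied y0 is an assumption
standing in for the hardware estimate of 1/sqrt(x).

```c
#include <assert.h>

/* Hypothetical helper: approximate sqrt(x) as x * (1/sqrt(x)), refining
   the caller-supplied estimate Y0 of 1/sqrt(x) with STEPS Newton steps.  */
static double
approx_sqrt (double x, double y0, int steps)
{
  /* Zero is excluded: its reciprocal square root is infinite, and
     0 * inf would yield NaN.  */
  assert (x > 0.0 && y0 > 0.0 && steps >= 0);
  double y = y0;
  for (int i = 0; i < steps; i++)
    y = y * (3.0 - x * y * y) * 0.5;	/* Newton step for 1/sqrt(x).  */
  return x * y;				/* The extra multiply.  */
}
```

For example, approx_sqrt (2.0, 0.71, 3) agrees with sqrt(2) to about DF
rounding precision.  The sketch simply rejects x = 0; a real
implementation would have to deal with that input in some way.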

Thanks,
Kyrill


* RE: [AArch64] Emit square root using the Newton series
  2015-12-09 14:05 ` Marcus Shawcroft
@ 2015-12-09 16:31   ` Evandro Menezes
  0 siblings, 0 replies; 38+ messages in thread
From: Evandro Menezes @ 2015-12-09 16:31 UTC (permalink / raw)
  To: 'Marcus Shawcroft'
  Cc: 'GCC Patches', 'Marcus Shawcroft',
	'James Greenhalgh', 'Andrew Pinski',
	'Benedikt Huber',
	philipp.tomsich

Hi, Marcus.

I've run Geekbench, SPEC CPU2000 and synthetic benchmarks.

I can share these results from iterating over an array of values between 1 and 1000000 and taking their square root:

Million operations/s on Juno:

                     A53 @850MHz   A57 @1100MHz
X^½    DP  Canon              31             37
           Newton             13             39
           %Δ               -57%             6%
       SP  Canon              48            144
           Newton             18             62
           %Δ               -63%           -57%
X^-½   DP  Canon              17             16
           Newton             14             42
           %Δ               -17%           155%
       SP  Canon              28             70
           Newton             20             62
           %Δ               -30%           -11%

As you can see, it's a mixed result for A57 and a definite regression for A53.  In most benchmarks overall, this is not a good optimization for A57.  That's why I left it as a target-specific tuning.

Thank you,

-- 
Evandro Menezes                              Austin, TX


> -----Original Message-----
> From: Marcus Shawcroft [mailto:marcus.shawcroft@gmail.com]
> Sent: Wednesday, December 09, 2015 8:06
> To: Evandro Menezes
> Cc: GCC Patches; Marcus Shawcroft; James Greenhalgh; Andrew Pinski; Benedikt
> Huber; philipp.tomsich@theobroma-systems.com
> Subject: Re: [AArch64] Emit square root using the Newton series
> 
> On 8 December 2015 at 21:35, Evandro Menezes <e.menezes@samsung.com> wrote:
> >    Emit square root using the Newton series
> >
> >    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
> >
> >    gcc/
> >             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
> >    Declare new
> >             function.
> >             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
> >    expansion and
> >             insn definitions.
> >             * config/aarch64/aarch64-tuning-flags.def
> >             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
> >             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
> >    new function.
> >             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
> >    and insn
> >             definitions.
> >             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
> >    Expand option
> >             description.
> >             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
> >
> > This patch extends the patch that added support for implementing
> > x^-1/2 using the Newton series by adding support for x^1/2 as well.
> 
> Hi Evandro, What benchmarking have you done on this patch?
> /M


* Re: [AArch64] Emit square root using the Newton series
  2015-12-08 21:35 Evandro Menezes
@ 2015-12-09 14:05 ` Marcus Shawcroft
  2015-12-09 16:31   ` Evandro Menezes
  2015-12-09 16:52 ` Kyrill Tkachov
  2016-02-16 20:56 ` Evandro Menezes
  2 siblings, 1 reply; 38+ messages in thread
From: Marcus Shawcroft @ 2015-12-09 14:05 UTC (permalink / raw)
  To: Evandro Menezes
  Cc: GCC Patches, Marcus Shawcroft, James Greenhalgh, Andrew Pinski,
	Benedikt Huber, philipp.tomsich

On 8 December 2015 at 21:35, Evandro Menezes <e.menezes@samsung.com> wrote:
>    Emit square root using the Newton series
>
>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>
>
>    gcc/
>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>    Declare new
>             function.
>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
>    expansion and
>             insn definitions.
>             * config/aarch64/aarch64-tuning-flags.def
>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>    new function.
>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
>    and insn
>             definitions.
>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>    Expand option
>             description.
>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>
> This patch extends the patch that added support for implementing x^-1/2
> using the Newton series by adding support for x^1/2 as well.

Hi Evandro, What benchmarking have you done on this patch?
/M


* [AArch64] Emit square root using the Newton series
@ 2015-12-08 21:35 Evandro Menezes
  2015-12-09 14:05 ` Marcus Shawcroft
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Evandro Menezes @ 2015-12-08 21:35 UTC (permalink / raw)
  To: GCC Patches, Marcus Shawcroft, James Greenhalgh, Andrew Pinski,
	Benedikt Huber, philipp.tomsich

[-- Attachment #1: Type: text/plain, Size: 1041 bytes --]

    Emit square root using the Newton series

    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>

    gcc/
             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
    Declare new
             function.
             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New
    expansion and
             insn definitions.
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
    new function.
             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion
    and insn
             definitions.
             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
    Expand option
             description.
             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing x^-1/2 
using the Newton series by adding support for x^1/2 as well.

Is it OK at this point of stage 3?

Thank you,

-- 
Evandro Menezes


[-- Attachment #2: 0001-Emit-square-root-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 7350 bytes --]

From f173dace7b4137f8868a1a6ef9cdbbeefa92ffde Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 3 Dec 2015 15:25:07 -0600
Subject: [PATCH] Emit square root using the Newton series

2015-12-03  Evandro Menezes  <e.menezes@samsung.com>

gcc/
	* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt): Declare new
	function.
	* config/aarch64/aarch64-simd.md (sqrt<mode>2): New expansion and
	insn definitions.
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
	* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define new function.
	* config/aarch64/aarch64.md (sqrt<mode>2): New expansion and insn
	definitions.
	* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt): Expand option
	description.
	* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h         |  1 +
 gcc/config/aarch64/aarch64-simd.md          | 18 +++++++++++++++++-
 gcc/config/aarch64/aarch64-tuning-flags.def |  2 +-
 gcc/config/aarch64/aarch64.c                | 25 +++++++++++++++++++++++--
 gcc/config/aarch64/aarch64.md               | 18 +++++++++++++++++-
 gcc/config/aarch64/aarch64.opt              |  2 +-
 gcc/doc/invoke.texi                         | 13 ++++++-------
 7 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 1e0fb4e..7fe6074 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -356,6 +356,7 @@ void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 
 void aarch64_emit_swrsqrt (rtx, rtx);
+void aarch64_emit_swsqrt (rtx, rtx);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 030a101..f6d2da4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4280,7 +4280,23 @@
 
 ;; sqrt
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
+      && !optimize_function_for_size_p (cfun)
+      && flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations)
+    {
+      aarch64_emit_swsqrt (operands[0], operands[1]);
+      DONE;
+    }
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
         (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ae4cfb3..3b58c35 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -533,8 +533,9 @@ static const struct tune_params exynosm1_tunings =
   2,	/* min_div_recip_mul_df.  */
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
-  tune_params::AUTOPREFETCHER_OFF, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE) /* tune_flags.  */
+  tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
+  (AARCH64_EXTRA_TUNE_RECIP_SQRT
+   | AARCH64_EXTRA_TUNE_FAST_SQRT)	/* tune_flags.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -7515,6 +7516,26 @@ aarch64_emit_swrsqrt (rtx dst, rtx src)
   emit_move_insn (dst, x0);
 }
 
+/* Emit instruction sequence to compute the approximate square root.  */
+
+void
+aarch64_emit_swsqrt (rtx dst, rtx src)
+{
+  machine_mode mode = GET_MODE (src);
+  gcc_assert (mode == SFmode || mode == V2SFmode || mode == V4SFmode
+	      || mode == DFmode || mode == V2DFmode);
+
+  rtx xsrc = gen_reg_rtx (mode);
+  emit_move_insn (xsrc, src);
+
+  rtx xdst = gen_reg_rtx (mode);
+
+  /* Calculate the approximate square root by multiplying the original operand
+     by its approximate reciprocal square root.  */
+  aarch64_emit_swrsqrt (xdst, xsrc);
+  emit_set_insn (dst, gen_rtx_MULT (mode, xdst, src));
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index d9fe1ae..d5930b9 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4534,7 +4534,23 @@
   [(set_attr "type" "ffarith<s>")]
 )
 
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:GPF 0 "register_operand")
+	(sqrt:GPF (match_operand:GPF 1 "register_operand")))]
+  "TARGET_FLOAT"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
+      && !optimize_function_for_size_p (cfun)
+      && flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations)
+    {
+      aarch64_emit_swsqrt (operands[0], operands[1]);
+      DONE;
+    }
+})
+
+(define_insn "*sqrt<mode>2"
   [(set (match_operand:GPF 0 "register_operand" "=w")
         (sqrt:GPF (match_operand:GPF 1 "register_operand" "w")))]
   "TARGET_FLOAT"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index a0fbfd42..d02c5e8 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,5 @@ PC relative literal loads.
 
 mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating a sqrt approximation, run fewer steps.
+Calculate the square-root or its reciprocal approximation in fewer steps.
 This reduces precision, but can result in faster computation.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 5ab565c..f4a47a6 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -6141,7 +6141,7 @@ is usable even in freestanding environments.
 @opindex fsanitize-coverage=trace-pc
 Enable coverage-guided fuzzing code instrumentation.
 Inserts call to __sanitizer_cov_trace_pc into every basic block.
-
+-
 @item -fcheck-pointer-bounds
 @opindex fcheck-pointer-bounds
 @opindex fno-check-pointer-bounds
@@ -12561,12 +12561,11 @@ corresponding flag to the linker.
 @item -mno-low-precision-recip-sqrt
 @opindex -mlow-precision-recip-sqrt
 @opindex -mno-low-precision-recip-sqrt
-The square root estimate uses two steps instead of three for double-precision,
-and one step instead of two for single-precision.
-Thus reducing latency and precision.
-This is only relevant if @option{-ffast-math} activates
-reciprocal square root estimate instructions.
-Which in turn depends on the target processor.
+The square root and its reciprocal approximation use one step less than
+otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables
+the square root or its reciprocal approximation,
+which in turn depends on the target processor.
 
 @item -march=@var{name}
 @opindex march
-- 
1.9.1
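
For readers following the thread, the idea behind aarch64_emit_swsqrt can be
sketched in portable C: refine an estimate of 1/sqrt(x) with a few Newton
steps, then multiply by x.  This is only an illustration, not the code GCC
emits.  The names rsqrt_estimate, approx_rsqrt and approx_sqrt are
hypothetical, and the bit-level initial guess is an assumed stand-in for the
hardware estimate instruction.  Special inputs (zero, infinities, NaNs) are
not modelled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the hardware reciprocal square root estimate:
   a crude initial guess at 1/sqrt(x) from the well-known bit-level trick.
   Assumes X is a positive, normal, finite float.  */
static float
rsqrt_estimate (float x)
{
  uint32_t bits;
  float y;

  memcpy (&bits, &x, sizeof bits);
  bits = 0x5f3759dfu - (bits >> 1);
  memcpy (&y, &bits, sizeof y);
  return y;
}

/* Refine the estimate with STEPS Newton iterations for 1/sqrt(x):
   y' = y * (3 - x * y * y) / 2.  */
static float
approx_rsqrt (float x, int steps)
{
  float y = rsqrt_estimate (x);

  for (int i = 0; i < steps; i++)
    y = y * (1.5f - 0.5f * x * y * y);
  return y;
}

/* Approximate sqrt(x) as x * (1/sqrt(x)), the identity the patch relies on.  */
static float
approx_sqrt (float x, int steps)
{
  return x * approx_rsqrt (x, steps);
}
```

Calling approx_sqrt (x, 2) versus approx_sqrt (x, 1) mirrors, in spirit, the
trade-off that -mlow-precision-recip-sqrt describes: one fewer step is
cheaper but less accurate.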


