public inbox for gcc-patches@gcc.gnu.org
* [AArch64] Emit division using the Newton series
@ 2016-03-17 21:14 Evandro Menezes
  2016-03-23 16:24 ` Evandro Menezes
  2016-03-23 16:25 ` Evandro Menezes
  0 siblings, 2 replies; 16+ messages in thread
From: Evandro Menezes @ 2016-03-17 21:14 UTC (permalink / raw)
  To: GCC Patches; +Cc: James Greenhalgh, Wilco Dijkstra, Andrew Pinski

[-- Attachment #1: Type: text/plain, Size: 899 bytes --]

         Emit division using the Newton series

         2016-03-17  Evandro Menezes  <e.menezes@samsung.com>

         gcc/
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_DIV_{SF,DF}): New tuning macros.
             * config/aarch64/aarch64-protos.h
             (AARCH64_EXTRA_TUNE_APPROX_DIV): New macro.
             (aarch64_emit_approx_div): Declare new function.
             * config/aarch64/aarch64.c
             (aarch64_emit_approx_div): Define new function.
             * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
             * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.

This patch implements FP division by an approximation using the Newton 
series.
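
As a rough illustration of the technique, and not the code that this patch
emits, the Newton series for a reciprocal can be sketched in portable C.
The names recip_estimate, recip_step and approx_div are hypothetical.
recip_estimate is a crude software stand-in for the FRECPE estimate
instruction, and recip_step computes the same 2 - d * r correction that
FRECPS does.  Each step roughly squares the relative error, so this
stand-in, which starts from an error of about 1/17, needs four steps for
double precision, more than the patch needs with the hardware estimate.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for FRECPE: a linear estimate of 1/d for finite,
   nonzero d, with a relative error of at most 1/17.  */
static double
recip_estimate (double d)
{
  int e;
  double m = frexp (d, &e);	/* d = m * 2^e, with 0.5 <= |m| < 1.  */
  double r = 48.0 / 17.0 - (32.0 / 17.0) * fabs (m);
  return ldexp (copysign (r, m), -e);
}

/* Same arithmetic as FRECPS: the correction factor 2 - d * r, which
   approaches 1.0 as r approaches 1/d.  */
static double
recip_step (double r, double d)
{
  return 2.0 - d * r;
}

/* Approximate num / den with STEPS iterations of the Newton series.  */
double
approx_div (double num, double den, int steps)
{
  assert (steps >= 0);

  double r = recip_estimate (den);
  while (steps--)
    r = r * recip_step (r, den);

  return num * r;
}
```

Under these assumptions, approx_div (1.0, 3.0, 4) agrees with 1.0 / 3.0 to
within a few ulps, while approx_div (1.0, 3.0, 0) returns only the raw
estimate, about 0.353.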

With this patch, DF division is sped up by over 100%, while SF division
sees no speedup, on both A57 and M1.

Feedback welcome.

Thank you,

-- 
Evandro Menezes


[-- Attachment #2: 0001-Emit-division-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 7998 bytes --]

From 750bd4f64cea8787eb077b7537cc7d8dceafac57 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 17 Mar 2016 14:44:55 -0500
Subject: [PATCH] Emit division using the Newton series

2016-03-17  Evandro Menezes  <e.menezes@samsung.com>

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_DIV_{SF,DF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(AARCH64_EXTRA_TUNE_APPROX_DIV): New macro.
	(aarch64_emit_approx_div): Declare new function.
	* config/aarch64/aarch64.c
	(aarch64_emit_approx_div): Define new function.
	* config/aarch64/aarch64.md ("div<mode>3"): New expansion.
	* config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h         |  4 ++
 gcc/config/aarch64/aarch64-simd.md          | 26 ++++++++++-
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 +-
 gcc/config/aarch64/aarch64.c                | 67 ++++++++++++++++++++++++++++-
 gcc/config/aarch64/aarch64.md               | 31 +++++++++++--
 5 files changed, 124 insertions(+), 7 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..847a282 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_DIV \
+        (AARCH64_EXTRA_TUNE_APPROX_DIV_DF | AARCH64_EXTRA_TUNE_APPROX_DIV_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
@@ -362,6 +365,7 @@ void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
 void aarch64_emit_approx_rsqrt (rtx, rtx);
+void aarch64_emit_approx_div (rtx, rtx, rtx);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..f1e53be 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1509,7 +1509,31 @@
   [(set_attr "type" "neon_fp_mul_<Vetype><q>")]
 )
 
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:VDQF 0 "register_operand" "=w")
+       (div:VDQF (match_operand:VDQF 1 "register_operand" "w")
+		 (match_operand:VDQF 2 "register_operand" "w")))]
+ "TARGET_SIMD"
+{
+  machine_mode mode = GET_MODE_INNER (GET_MODE (operands[1]));
+
+  if (flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations
+      && !optimize_function_for_size_p (cfun)
+      && ((mode == SFmode
+           && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_APPROX_DIV_SF))
+          || (mode == DFmode
+              && (aarch64_tune_params.extra_tuning_flags
+                  & AARCH64_EXTRA_TUNE_APPROX_DIV_DF))))
+    {
+      aarch64_emit_approx_div (operands[0], operands[1], operands[2]);
+      DONE;
+    }
+})
+
+(define_insn "*div<mode>3"
  [(set (match_operand:VDQF 0 "register_operand" "=w")
        (div:VDQF (match_operand:VDQF 1 "register_operand" "w")
 		 (match_operand:VDQF 2 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..ececdc1 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,5 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("approx_div", APPROX_DIV_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_divf", APPROX_DIV_SF)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 12e498d..97af0c0 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -538,7 +538,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_DIV
+   | AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -7540,6 +7541,70 @@ aarch64_emit_approx_rsqrt (rtx dst, rtx src)
   emit_move_insn (dst, x0);
 }
 
+/* Emit the instruction sequence to compute the approximation for FP division.  */
+
+void
+aarch64_emit_approx_div (rtx quo, rtx num, rtx div)
+{
+  machine_mode mode = GET_MODE (quo);
+  gcc_assert (GET_MODE_INNER (mode) == SFmode
+              || GET_MODE_INNER (mode) == DFmode);
+
+  rtx xnum = gen_reg_rtx (mode);
+  emit_move_insn (xnum, num);
+
+  rtx xdiv = gen_reg_rtx (mode);
+  emit_move_insn (xdiv, div);
+
+  /* Estimate the approximate reciprocal.  */
+  rtx xrcp = gen_reg_rtx (mode);
+  switch (mode)
+    {
+    case SFmode:
+      emit_insn (gen_aarch64_frecpesf (xrcp, xdiv)); break;
+    case V2SFmode:
+      emit_insn (gen_aarch64_frecpev2sf (xrcp, xdiv)); break;
+    case V4SFmode:
+      emit_insn (gen_aarch64_frecpev4sf (xrcp, xdiv)); break;
+    case DFmode:
+      emit_insn (gen_aarch64_frecpedf (xrcp, xdiv)); break;
+    case V2DFmode:
+      emit_insn (gen_aarch64_frecpev2df (xrcp, xdiv)); break;
+    default:
+      gcc_unreachable ();
+    }
+
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
+  while (iterations--)
+    {
+      rtx xtmp = gen_reg_rtx (mode);
+
+      switch (mode)
+        {
+	    case SFmode:
+	      emit_insn (gen_aarch64_frecpssf (xtmp, xrcp, xdiv)); break;
+	    case V2SFmode:
+	      emit_insn (gen_aarch64_frecpsv2sf (xtmp, xrcp, xdiv)); break;
+	    case V4SFmode:
+	      emit_insn (gen_aarch64_frecpsv4sf (xtmp, xrcp, xdiv)); break;
+	    case DFmode:
+	      emit_insn (gen_aarch64_frecpsdf (xtmp, xrcp, xdiv)); break;
+	    case V2DFmode:
+	      emit_insn (gen_aarch64_frecpsv2df (xtmp, xrcp, xdiv)); break;
+        default:
+          gcc_unreachable ();
+        }
+
+      emit_set_insn (xrcp, gen_rtx_MULT (mode, xtmp, xrcp));
+    }
+
+  rtx xquo = gen_reg_rtx (mode);
+  emit_set_insn (xquo, gen_rtx_MULT (mode, xnum, xrcp));
+
+  emit_move_insn (quo, xquo);
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..b5d61db 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4647,11 +4647,34 @@
   [(set_attr "type" "fmul<s>")]
 )
 
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:GPF 0 "register_operand" "=w")
+       (div:GPF (match_operand:GPF 1 "register_operand" "w")
+		(match_operand:GPF 2 "register_operand" "w")))]
+ "TARGET_SIMD"
+{
+  machine_mode mode = GET_MODE_INNER (GET_MODE (operands[1]));
+
+  if (flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations
+      && !optimize_function_for_size_p (cfun)
+      && ((mode == SFmode
+           && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_APPROX_DIV_SF))
+          || (mode == DFmode
+              && (aarch64_tune_params.extra_tuning_flags
+                  & AARCH64_EXTRA_TUNE_APPROX_DIV_DF))))
+    {
+      aarch64_emit_approx_div (operands[0], operands[1], operands[2]);
+      DONE;
+    }
+})
+
+(define_insn "*div<mode>3"
   [(set (match_operand:GPF 0 "register_operand" "=w")
-        (div:GPF
-         (match_operand:GPF 1 "register_operand" "w")
-         (match_operand:GPF 2 "register_operand" "w")))]
+        (div:GPF (match_operand:GPF 1 "register_operand" "w")
+	         (match_operand:GPF 2 "register_operand" "w")))]
   "TARGET_FLOAT"
   "fdiv\\t%<s>0, %<s>1, %<s>2"
   [(set_attr "type" "fdiv<s>")]
-- 
1.9.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [AArch64] Emit division using the Newton series
  2016-03-17 21:14 [AArch64] Emit division using the Newton series Evandro Menezes
@ 2016-03-23 16:24 ` Evandro Menezes
  2016-03-23 16:25 ` Evandro Menezes
  1 sibling, 0 replies; 16+ messages in thread
From: Evandro Menezes @ 2016-03-23 16:24 UTC (permalink / raw)
  To: GCC Patches; +Cc: James Greenhalgh, Wilco Dijkstra, Andrew Pinski

On 03/17/16 15:09, Evandro Menezes wrote:
> This patch implements FP division by an approximation using the Newton 
> series.
>
> With this patch, DF division is sped up by over 100% and SF division, 
> zilch, both on A57 and on M1.

         gcc/
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_DIV_{SF,DF}): New tuning macros.
             * config/aarch64/aarch64-protos.h
             (AARCH64_EXTRA_TUNE_APPROX_DIV): New macro.
             (aarch64_emit_approx_div): Declare new function.
             * config/aarch64/aarch64.c
             (aarch64_emit_approx_div): Define new function.
             * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
             * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.


This version of the patch cleans up the changes to the MD files and 
optimizes the division when the numerator is 1.0.

Again, I look forward to your feedback.

Thank you,

-- 
Evandro Menezes

* Re: [AArch64] Emit division using the Newton series
  2016-03-17 21:14 [AArch64] Emit division using the Newton series Evandro Menezes
  2016-03-23 16:24 ` Evandro Menezes
@ 2016-03-23 16:25 ` Evandro Menezes
  2016-03-31 22:23   ` Evandro Menezes
  1 sibling, 1 reply; 16+ messages in thread
From: Evandro Menezes @ 2016-03-23 16:25 UTC (permalink / raw)
  To: GCC Patches; +Cc: James Greenhalgh, Wilco Dijkstra, Andrew Pinski

[-- Attachment #1: Type: text/plain, Size: 980 bytes --]

On 03/17/16 15:09, Evandro Menezes wrote:
> This patch implements FP division by an approximation using the Newton
> series.
>
> With this patch, DF division is sped up by over 100% and SF division,
> zilch, both on A57 and on M1.

         gcc/
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_DIV_{SF,DF}): New tuning macros.
             * config/aarch64/aarch64-protos.h
             (AARCH64_EXTRA_TUNE_APPROX_DIV): New macro.
             (aarch64_emit_approx_div): Declare new function.
             * config/aarch64/aarch64.c
             (aarch64_emit_approx_div): Define new function.
             * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
             * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.


This version of the patch cleans up the changes to the MD files and 
optimizes the division when the numerator is 1.0.

Again, I look forward to your feedback.

Thank you,

-- 
Evandro Menezes


[-- Attachment #2: 0001-AArch64-Emit-division-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 6934 bytes --]

From 5cd2a628086af3656b3242f0c4f41784646f52b1 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 17 Mar 2016 14:44:55 -0500
Subject: [PATCH] [AArch64] Emit division using the Newton series

2016-03-17  Evandro Menezes  <e.menezes@samsung.com>

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_DIV_{SF,DF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(AARCH64_EXTRA_TUNE_APPROX_DIV): New macro.
	(aarch64_emit_approx_div): Declare new function.
	* config/aarch64/aarch64.c
	(aarch64_emit_approx_div): Define new function.
	* config/aarch64/aarch64.md ("div<mode>3"): New expansion.
	* config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h         |  4 ++
 gcc/config/aarch64/aarch64-simd.md          | 14 +++++-
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 +-
 gcc/config/aarch64/aarch64.c                | 73 +++++++++++++++++++++++++++++
 gcc/config/aarch64/aarch64.md               | 19 ++++++--
 5 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..52c4838 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_DIV \
+        (AARCH64_EXTRA_TUNE_APPROX_DIV_DF | AARCH64_EXTRA_TUNE_APPROX_DIV_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
@@ -362,6 +365,7 @@ void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
 void aarch64_emit_approx_rsqrt (rtx, rtx);
+bool aarch64_emit_approx_div (rtx, rtx, rtx);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..99be92e 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1509,7 +1509,19 @@
   [(set_attr "type" "neon_fp_mul_<Vetype><q>")]
 )
 
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:VDQF 0 "register_operand")
+       (div:VDQF (match_operand:VDQF 1 "general_operand")
+		 (match_operand:VDQF 2 "register_operand")))]
+ "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+    DONE;
+
+  operands[1] = force_reg (<MODE>mode, operands[1]);
+})
+
+(define_insn "*div<mode>3"
  [(set (match_operand:VDQF 0 "register_operand" "=w")
        (div:VDQF (match_operand:VDQF 1 "register_operand" "w")
 		 (match_operand:VDQF 2 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..ececdc1 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,5 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("approx_div", APPROX_DIV_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_divf", APPROX_DIV_SF)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 12e498d..2c878ce 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7540,6 +7540,79 @@ aarch64_emit_approx_rsqrt (rtx dst, rtx src)
   emit_move_insn (dst, x0);
 }
 
+/* Emit the instruction sequence to compute the approximation for a reciprocal.  */
+
+bool
+aarch64_emit_approx_div (rtx quo, rtx num, rtx div)
+{
+  machine_mode mode = GET_MODE (quo);
+
+  if (!flag_finite_math_only
+      || flag_trapping_math
+      || !flag_unsafe_math_optimizations
+      || optimize_function_for_size_p (cfun)
+      || ((GET_MODE_INNER (mode) != SFmode
+           || !(aarch64_tune_params.extra_tuning_flags
+                & AARCH64_EXTRA_TUNE_APPROX_DIV_SF))
+          && (GET_MODE_INNER (mode) != DFmode
+              || !(aarch64_tune_params.extra_tuning_flags
+                   & AARCH64_EXTRA_TUNE_APPROX_DIV_DF))))
+    return false;
+
+  /* Estimate the approximate reciprocal.  */
+  rtx xrcp = gen_reg_rtx (mode);
+  switch (mode)
+    {
+      case SFmode:
+	emit_insn (gen_aarch64_frecpesf (xrcp, div)); break;
+      case V2SFmode:
+	emit_insn (gen_aarch64_frecpev2sf (xrcp, div)); break;
+      case V4SFmode:
+	emit_insn (gen_aarch64_frecpev4sf (xrcp, div)); break;
+      case DFmode:
+	emit_insn (gen_aarch64_frecpedf (xrcp, div)); break;
+      case V2DFmode:
+	emit_insn (gen_aarch64_frecpev2df (xrcp, div)); break;
+      default:
+	gcc_unreachable ();
+    }
+
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
+
+  rtx xtmp = gen_reg_rtx (mode);
+  while (iterations--)
+    {
+      switch (mode)
+        {
+	  case SFmode:
+	    emit_insn (gen_aarch64_frecpssf (xtmp, xrcp, div)); break;
+	  case V2SFmode:
+	    emit_insn (gen_aarch64_frecpsv2sf (xtmp, xrcp, div)); break;
+	  case V4SFmode:
+	    emit_insn (gen_aarch64_frecpsv4sf (xtmp, xrcp, div)); break;
+	  case DFmode:
+	    emit_insn (gen_aarch64_frecpsdf (xtmp, xrcp, div)); break;
+	  case V2DFmode:
+	    emit_insn (gen_aarch64_frecpsv2df (xtmp, xrcp, div)); break;
+	  default:
+	    gcc_unreachable ();
+        }
+
+      emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xtmp));
+    }
+
+  if (num != CONST1_RTX (mode))
+    {
+      rtx xnum = force_reg (mode, num);
+      emit_set_insn (quo, gen_rtx_MULT (mode, xnum, xrcp));
+    }
+  else
+    emit_move_insn (quo, xrcp);
+
+  return true;
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..985915e 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4647,11 +4647,22 @@
   [(set_attr "type" "fmul<s>")]
 )
 
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:GPF 0 "register_operand")
+       (div:GPF (match_operand:GPF 1 "general_operand")
+		(match_operand:GPF 2 "register_operand")))]
+ "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+    DONE;
+
+  operands[1] = force_reg (<MODE>mode, operands[1]);
+})
+
+(define_insn "*div<mode>3"
   [(set (match_operand:GPF 0 "register_operand" "=w")
-        (div:GPF
-         (match_operand:GPF 1 "register_operand" "w")
-         (match_operand:GPF 2 "register_operand" "w")))]
+        (div:GPF (match_operand:GPF 1 "register_operand" "w")
+	         (match_operand:GPF 2 "register_operand" "w")))]
   "TARGET_FLOAT"
   "fdiv\\t%<s>0, %<s>1, %<s>2"
   [(set_attr "type" "fdiv<s>")]
-- 
1.9.1


* Re: [AArch64] Emit division using the Newton series
  2016-03-23 16:25 ` Evandro Menezes
@ 2016-03-31 22:23   ` Evandro Menezes
  2016-04-01 13:58     ` Wilco Dijkstra
  0 siblings, 1 reply; 16+ messages in thread
From: Evandro Menezes @ 2016-03-31 22:23 UTC (permalink / raw)
  To: GCC Patches; +Cc: James Greenhalgh, Wilco Dijkstra, Andrew Pinski

On 03/23/16 11:24, Evandro Menezes wrote:
> On 03/17/16 15:09, Evandro Menezes wrote:
>> This patch implements FP division by an approximation using the Newton
>> series.
>>
>> With this patch, DF division is sped up by over 100% and SF division,
>> zilch, both on A57 and on M1.
>
>         gcc/
>             * config/aarch64/aarch64-tuning-flags.def
>             (AARCH64_EXTRA_TUNE_APPROX_DIV_{SF,DF}): New tuning macros.
>             * config/aarch64/aarch64-protos.h
>             (AARCH64_EXTRA_TUNE_APPROX_DIV): New macro.
>             (aarch64_emit_approx_div): Declare new function.
>             * config/aarch64/aarch64.c
>             (aarch64_emit_approx_div): Define new function.
>             * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
>             * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
>
>
> This version of the patch cleans up the changes to the MD files and 
> optimizes the division when the numerator is 1.0.

Ping^1

-- 
Evandro Menezes

* Re: [AArch64] Emit division using the Newton series
  2016-03-31 22:23   ` Evandro Menezes
@ 2016-04-01 13:58     ` Wilco Dijkstra
  2016-04-01 19:47       ` Evandro Menezes
  0 siblings, 1 reply; 16+ messages in thread
From: Wilco Dijkstra @ 2016-04-01 13:58 UTC (permalink / raw)
  To: Evandro Menezes, GCC Patches; +Cc: James Greenhalgh, Andrew Pinski, nd

Evandro Menezes wrote:
On 03/23/16 11:24, Evandro Menezes wrote:
> On 03/17/16 15:09, Evandro Menezes wrote:
>> This patch implements FP division by an approximation using the Newton
>> series.
>>
>> With this patch, DF division is sped up by over 100% and SF division,
>> zilch, both on A57 and on M1.

Mentioning throughput is not useful given that the vectorized single precision
case will give most of the speedup in actual code.

>         gcc/
>             * config/aarch64/aarch64-tuning-flags.def
>             (AARCH64_EXTRA_TUNE_APPROX_DIV_{SF,DF}): New tuning macros.
>             * config/aarch64/aarch64-protos.h
>             (AARCH64_EXTRA_TUNE_APPROX_DIV): New macro.
>             (aarch64_emit_approx_div): Declare new function.
>             * config/aarch64/aarch64.c
>             (aarch64_emit_approx_div): Define new function.
>             * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
>             * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
>
>
> This version of the patch cleans up the changes to the MD files and
> optimizes the division when the numerator is 1.0.

Adding support for plain recip is good. Having the enabling logic no longer in
the md file is an improvement, but I don't believe adding tuning flags for the inner
mode is correct - we need a more generic solution like I mentioned in my other mail.

The division variant should use the same latency reduction trick I mentioned for sqrt.

Wilco

* Re: [AArch64] Emit division using the Newton series
  2016-04-01 13:58     ` Wilco Dijkstra
@ 2016-04-01 19:47       ` Evandro Menezes
  2016-04-01 21:22         ` Wilco Dijkstra
  0 siblings, 1 reply; 16+ messages in thread
From: Evandro Menezes @ 2016-04-01 19:47 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: James Greenhalgh, Andrew Pinski, nd

On 04/01/16 08:58, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
> On 03/23/16 11:24, Evandro Menezes wrote:
>> On 03/17/16 15:09, Evandro Menezes wrote:
>>> This patch implements FP division by an approximation using the Newton
>>> series.
>>>
>>> With this patch, DF division is sped up by over 100% and SF division,
>>> zilch, both on A57 and on M1.
> Mentioning throughput is not useful given that the vectorized single precision
> case will give most of the speedup in actual code.
>
>>          gcc/
>>              * config/aarch64/aarch64-tuning-flags.def
>>              (AARCH64_EXTRA_TUNE_APPROX_DIV_{SF,DF}): New tuning macros.
>>              * config/aarch64/aarch64-protos.h
>>              (AARCH64_EXTRA_TUNE_APPROX_DIV): New macro.
>>              (aarch64_emit_approx_div): Declare new function.
>>              * config/aarch64/aarch64.c
>>              (aarch64_emit_approx_div): Define new function.
>>              * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
>>              * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
>>
>>
>> This version of the patch cleans up the changes to the MD files and
>> optimizes the division when the numerator is 1.0.
> Adding support for plain recip is good. Having the enabling logic no longer in
> the md file is an improvement, but I don't believe adding tuning flags for the inner
> mode is correct - we need a more generic solution like I mentioned in my other mail.
>
> The division variant should use the same latency reduction trick I mentioned for sqrt.

Wilco,

I don't think that it applies here, since it doesn't have to deal with 
special cases.

As for the finer grained flags, I'll wait for the feedback on 
https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00089.html

Thank you,

-- 
Evandro Menezes

* Re: [AArch64] Emit division using the Newton series
  2016-04-01 19:47       ` Evandro Menezes
@ 2016-04-01 21:22         ` Wilco Dijkstra
  2016-04-01 21:56           ` Evandro Menezes
  0 siblings, 1 reply; 16+ messages in thread
From: Wilco Dijkstra @ 2016-04-01 21:22 UTC (permalink / raw)
  To: Evandro Menezes, GCC Patches; +Cc: James Greenhalgh, Andrew Pinski, nd

Evandro Menezes wrote:
> > The division variant should use the same latency reduction trick I mentioned for sqrt.
>
> I don't think that it applies here, since it doesn't have to deal with
> special cases.

No, it applies, as it's exactly the same calculation: x * rsqrt(y) and x * recip(y). In both
cases you don't need the final result of rsqrt(y) or recip(y), avoiding a multiply. 
Given these sequences are high latency this saving is actually quite important.

Wilco

* Re: [AArch64] Emit division using the Newton series
  2016-04-01 21:22         ` Wilco Dijkstra
@ 2016-04-01 21:56           ` Evandro Menezes
  2016-04-01 22:46             ` Wilco Dijkstra
  0 siblings, 1 reply; 16+ messages in thread
From: Evandro Menezes @ 2016-04-01 21:56 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: James Greenhalgh, Andrew Pinski, nd

On 04/01/16 16:22, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>>> The division variant should use the same latency reduction trick I mentioned for sqrt.
>> I don't think that it applies here, since it doesn't have to deal with
>> special cases.
> No it applies as it's exactly the same calculation: x * rsqrt(y) and x * recip(y). In both
> cases you don't need the final result of rsqrt(y) or recip(y), avoiding a multiply.
> Given these sequences are high latency this saving is actually quite important.

Wilco,

In the case of sqrt(), when the argument is 0.0, special handling of the 
multiplication is necessary in order to guarantee correctness. Handling 
this special case hurts performance, which is where your suggestion helps.

However, I don't think that there's the need to handle any special case 
for division.  The only case when the approximation differs from 
division is when the numerator is infinity and the denominator, zero, 
when the approximation returns infinity and the division, NAN.  So I 
don't think that it's a special case that deserves being handled.  IOW, 
the result of the approximate reciprocal is always needed.

Or am I missing something?

Thank you,

-- 
Evandro Menezes

* Re: [AArch64] Emit division using the Newton series
  2016-04-01 21:56           ` Evandro Menezes
@ 2016-04-01 22:46             ` Wilco Dijkstra
  2016-04-01 22:52               ` Evandro Menezes
  0 siblings, 1 reply; 16+ messages in thread
From: Wilco Dijkstra @ 2016-04-01 22:46 UTC (permalink / raw)
  To: Evandro Menezes, GCC Patches; +Cc: James Greenhalgh, Andrew Pinski, nd

Evandro Menezes wrote:

> However, I don't think that there's the need to handle any special case
> for division.  The only case when the approximation differs from
> division is when the numerator is infinity and the denominator, zero,
> when the approximation returns infinity and the division, NAN.  So I
> don't think that it's a special case that deserves being handled.  IOW,
> the result of the approximate reciprocal is always needed.
 
No, the result of the approximate reciprocal is not needed. 

Basically a NR approximation produces a correction factor that is very close
to 1.0, and then multiplies that with the previous estimate to get a more
accurate estimate. The final calculation for x * recip(y) is:

result = (reciprocal_correction * reciprocal_estimate) * x

while what I am suggesting is a trivial reassociation:

result = reciprocal_correction * (reciprocal_estimate * x)

The computation of the final reciprocal_correction is on the critical latency
path, while reciprocal_estimate is computed earlier, so we can compute
(reciprocal_estimate * x) without increasing the overall latency. I.e., we save
a multiply.

In principle this could be done as a separate optimization pass that tries to 
reassociate to reduce latency. However I'm not too convinced this would be
easy to implement in GCC's scheduler, so it's best to do it explicitly.
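
The reassociation can be sketched in C as follows. This is an illustration
only: div_serial and div_reassoc are hypothetical names, r_est stands for the
reciprocal estimate that is available before the last step, and the correction
factor is written as 2 - y * r_est, which is what FRECPS computes.

```c
/* Textbook order: finish the reciprocal, then multiply by x.  Two
   multiplies depend on the correction factor, one after the other.  */
double
div_serial (double x, double y, double r_est)
{
  double corr = 2.0 - y * r_est;	/* reciprocal_correction  */
  return (corr * r_est) * x;
}

/* Reassociated order: r_est * x does not depend on the correction
   factor, so it can be computed while the correction is still in
   flight.  Only one multiply remains after the correction.  */
double
div_reassoc (double x, double y, double r_est)
{
  double xr = r_est * x;		/* off the critical path  */
  double corr = 2.0 - y * r_est;	/* reciprocal_correction  */
  return corr * xr;
}
```

Both functions compute the same product, up to rounding. With x = 10.0,
y = 3.0 and r_est = 0.3333, each returns a value within about 1e-7 of 10.0 / 3.0.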

Wilco

* Re: [AArch64] Emit division using the Newton series
  2016-04-01 22:46             ` Wilco Dijkstra
@ 2016-04-01 22:52               ` Evandro Menezes
  2016-04-04 19:06                 ` Evandro Menezes
  0 siblings, 1 reply; 16+ messages in thread
From: Evandro Menezes @ 2016-04-01 22:52 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: James Greenhalgh, Andrew Pinski, nd

On 04/01/16 17:45, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>
>> However, I don't think that there's the need to handle any special case
>> for division.  The only case when the approximation differs from
>> division is when the numerator is infinity and the denominator, zero,
>> when the approximation returns infinity and the division, NAN.  So I
>> don't think that it's a special case that deserves being handled.  IOW,
>> the result of the approximate reciprocal is always needed.
>   
> No, the result of the approximate reciprocal is not needed.
>
> Basically a NR approximation produces a correction factor that is very close
> to 1.0, and then multiplies that with the previous estimate to get a more
> accurate estimate. The final calculation for x * recip(y) is:
>
> result = (reciprocal_correction * reciprocal_estimate) * x
>
> while what I am suggesting is a trivial reassociation:
>
> result = reciprocal_correction * (reciprocal_estimate * x)
>
> The computation of the final reciprocal_correction is on the critical latency
> path, while reciprocal_estimate is computed earlier, so we can compute
> (reciprocal_estimate * x) without increasing the overall latency. Ie. we saved
> a multiply.
>
> In principle this could be done as a separate optimization pass that tries to
> reassociate to reduce latency. However I'm not too convinced this would be
> easy to implement in GCC's scheduler, so it's best to do it explicitly.

I think that I see what you mean.  I'll hack something tomorrow.

Thanks for your patience,

-- 
Evandro Menezes

* Re: [AArch64] Emit division using the Newton series
  2016-04-01 22:52               ` Evandro Menezes
@ 2016-04-04 19:06                 ` Evandro Menezes
  2016-04-12 18:15                   ` Evandro Menezes
  2016-04-27 14:16                   ` James Greenhalgh
  0 siblings, 2 replies; 16+ messages in thread
From: Evandro Menezes @ 2016-04-04 19:06 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: James Greenhalgh, Andrew Pinski, nd

[-- Attachment #1: Type: text/plain, Size: 3090 bytes --]

On 04/01/16 17:52, Evandro Menezes wrote:
> On 04/01/16 17:45, Wilco Dijkstra wrote:
>> Evandro Menezes wrote:
>>
>>> However, I don't think that there's the need to handle any special case
>>> for division.  The only case when the approximation differs from
>>> division is when the numerator is infinity and the denominator, zero,
>>> when the approximation returns infinity and the division, NAN.  So I
>>> don't think that it's a special case that deserves being handled.  IOW,
>>> the result of the approximate reciprocal is always needed.
>>   No, the result of the approximate reciprocal is not needed.
>>
>> Basically a NR approximation produces a correction factor that is 
>> very close
>> to 1.0, and then multiplies that with the previous estimate to get a 
>> more
>> accurate estimate. The final calculation for x * recip(y) is:
>>
>> result = (reciprocal_correction * reciprocal_estimate) * x
>>
>> while what I am suggesting is a trivial reassociation:
>>
>> result = reciprocal_correction * (reciprocal_estimate * x)
>>
>> The computation of the final reciprocal_correction is on the critical 
>> latency
>> path, while reciprocal_estimate is computed earlier, so we can compute
>> (reciprocal_estimate * x) without increasing the overall latency.
>> I.e., we take one multiply off the critical path.
>>
>> In principle this could be done as a separate optimization pass that 
>> tries to
>> reassociate to reduce latency. However I'm not too convinced this 
>> would be
>> easy to implement in GCC's scheduler, so it's best to do it explicitly.
>
> I think that I see what you mean.  I'll hack something tomorrow.

    [AArch64] Emit division using the Newton series

    2016-04-04  Evandro Menezes  <e.menezes@samsung.com>
                 Wilco Dijkstra <Wilco.Dijkstra@arm.com>

    gcc/
             * config/aarch64/aarch64-tuning-flags.def
             * config/aarch64/aarch64-protos.h
             (AARCH64_APPROX_MODE): New macro.
             (AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}):
             New tuning macros.
             (tune_params): Add new member "approx_div_modes".
             (aarch64_emit_approx_div): Declare new function.
             * config/aarch64/aarch64.c
             (generic_tunings): New member "approx_div_modes".
             (cortexa35_tunings): Likewise.
             (cortexa53_tunings): Likewise.
             (cortexa57_tunings): Likewise.
             (cortexa72_tunings): Likewise.
             (exynosm1_tunings): Likewise.
             (thunderx_tunings): Likewise.
             (xgene1_tunings): Likewise.
             (aarch64_emit_approx_div): Define new function.
             * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
             * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
             * config/aarch64/aarch64.opt (-mlow-precision-div): Add new
             option.
             * doc/invoke.texi (-mlow-precision-div): Describe new option.


This version of the patch has a shorter dependency chain at the last 
iteration of the series.
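
As a rough illustration of that shorter dependency chain, the sequence 
emitted by aarch64_emit_approx_div can be modelled in plain C.  This is 
only a sketch under stated assumptions: recip_estimate and recip_step are 
hypothetical stand-ins for the FRECPE and FRECPS instructions, and the 
stand-in estimate is cruder than the hardware one, so the accuracy 
reached per iteration differs from the real sequence.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for FRECPE: a crude reciprocal estimate with a
   relative error of at most 1/17.  The hardware estimate is more accurate.  */
static double
recip_estimate (double d)
{
  int e;
  double m = frexp (fabs (d), &e);		/* |d| = m * 2^e, 0.5 <= m < 1.  */
  double r = 48.0 / 17.0 - 32.0 / 17.0 * m;	/* Linear fit of 1/m.  */
  return copysign (ldexp (r, -e), d);
}

/* Stand-in for FRECPS: the Newton correction factor 2 - x * d, which is
   close to 1.0 once x is close to 1 / d.  */
static double
recip_step (double x, double d)
{
  return 2.0 - x * d;
}

/* Model of the emitted sequence: NUM / DEN using ITERATIONS Newton steps.  */
static double
approx_div (double num, double den, int iterations)
{
  assert (iterations >= 1);

  double xrcp = recip_estimate (den);
  double xtmp = 1.0;

  while (iterations--)
    {
      xtmp = recip_step (xrcp, den);
      /* Fold the correction into the estimate, except on the last step.  */
      if (iterations > 0)
	xrcp = xrcp * xtmp;
    }

  /* This multiply does not depend on the last correction factor, so it can
     overlap with the final recip_step instead of waiting for it.  */
  xrcp = xrcp * num;

  /* Only this multiply remains on the critical path after the last step.  */
  return xrcp * xtmp;
}
```

Each step roughly squares the relative error of the estimate, so with 
this stand-in estimate three steps bring approx_div (1.0, 3.0, 3) to 
within about 1e-9 of 1.0 / 3.0.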

Thank you for your feedback,

-- 
Evandro Menezes


[-- Attachment #2: 0001-AArch64-Emit-division-using-the-Newton-series.patch --]
[-- Type: text/x-patch, Size: 13839 bytes --]

From c8d94247e5b3c6120436051c8da11850937b7246 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Mon, 4 Apr 2016 14:02:24 -0500
Subject: [PATCH] [AArch64] Emit division using the Newton series

2016-04-04  Evandro Menezes  <e.menezes@samsung.com>
            Wilco Dijkstra <Wilco.Dijkstra@arm.com>

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	* config/aarch64/aarch64-protos.h
	(AARCH64_APPROX_MODE): New macro.
	(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}):
	New tuning macros.
	(tune_params): Add new member "approx_div_modes".
	(aarch64_emit_approx_div): Declare new function.
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_div_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(aarch64_emit_approx_div): Define new function.
	* config/aarch64/aarch64.md ("div<mode>3"): New expansion.
	* config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
	* config/aarch64/aarch64.opt (-mlow-precision-div): Add new option.
	* doc/invoke.texi (-mlow-precision-div): Describe new option.
---
 gcc/config/aarch64/aarch64-protos.h         | 28 +++++++++
 gcc/config/aarch64/aarch64-simd.md          | 14 ++++-
 gcc/config/aarch64/aarch64-tuning-flags.def |  1 -
 gcc/config/aarch64/aarch64.c                | 98 ++++++++++++++++++++++++++---
 gcc/config/aarch64/aarch64.md               | 19 ++++--
 gcc/config/aarch64/aarch64.opt              |  5 ++
 gcc/doc/invoke.texi                         | 10 +++
 7 files changed, 161 insertions(+), 14 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58c9d0d..25102d5 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,32 @@ struct cpu_branch_cost
   const int unpredictable;  /* Unpredictable branch or optimizing for speed.  */
 };
 
+/* Control approximate alternatives to certain FP operators.  */
+#define AARCH64_APPROX_MODE(MODE) \
+  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT \
+	      + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
+     : (0))
+#define AARCH64_APPROX_NONE (0)
+#define AARCH64_APPROX_SP (AARCH64_APPROX_MODE (SFmode) \
+			   | AARCH64_APPROX_MODE (V2SFmode) \
+			   | AARCH64_APPROX_MODE (V4SFmode))
+#define AARCH64_APPROX_DP (AARCH64_APPROX_MODE (DFmode) \
+			   | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_DFORM (AARCH64_APPROX_MODE (SFmode) \
+			      | AARCH64_APPROX_MODE (DFmode) \
+			      | AARCH64_APPROX_MODE (V2SFmode))
+#define AARCH64_APPROX_QFORM (AARCH64_APPROX_MODE (V4SFmode) \
+			      | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_SCALAR (AARCH64_APPROX_MODE (SFmode) \
+			       | AARCH64_APPROX_MODE (DFmode))
+#define AARCH64_APPROX_VECTOR (AARCH64_APPROX_MODE (V2SFmode) \
+			       | AARCH64_APPROX_MODE (V4SFmode) \
+			       | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_ALL (-1)
+
 struct tune_params
 {
   const struct cpu_cost_table *insn_extra_cost;
@@ -218,6 +244,7 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+  unsigned int approx_div_modes;
 };
 
 #define AARCH64_FUSION_PAIR(x, name) \
@@ -362,6 +389,7 @@ void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
 void aarch64_emit_approx_rsqrt (rtx, rtx);
+bool aarch64_emit_approx_div (rtx, rtx, rtx);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..99be92e 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1509,7 +1509,19 @@
   [(set_attr "type" "neon_fp_mul_<Vetype><q>")]
 )
 
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:VDQF 0 "register_operand")
+       (div:VDQF (match_operand:VDQF 1 "general_operand")
+		 (match_operand:VDQF 2 "register_operand")))]
+ "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+    DONE;
+
+  operands[1] = force_reg (<MODE>mode, operands[1]);
+})
+
+(define_insn "*div<mode>3"
  [(set (match_operand:VDQF 0 "register_operand" "=w")
        (div:VDQF (match_operand:VDQF 1 "register_operand" "w")
 		 (match_operand:VDQF 2 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..f25714c 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,3 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b7086dd..21af809 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -414,7 +414,8 @@ static const struct tune_params generic_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
 };
 
 static const struct tune_params cortexa35_tunings =
@@ -439,7 +440,8 @@ static const struct tune_params cortexa35_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
 };
 
 static const struct tune_params cortexa53_tunings =
@@ -464,7 +466,8 @@ static const struct tune_params cortexa53_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
 };
 
 static const struct tune_params cortexa57_tunings =
@@ -489,7 +492,8 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
 };
 
 static const struct tune_params cortexa72_tunings =
@@ -514,7 +518,8 @@ static const struct tune_params cortexa72_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
 };
 
 static const struct tune_params exynosm1_tunings =
@@ -538,7 +543,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT), /* tune_flags.  */
+  (AARCH64_APPROX_NONE) /* approx_div_modes.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -562,7 +568,8 @@ static const struct tune_params thunderx_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -586,7 +593,8 @@ static const struct tune_params xgene1_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -7552,6 +7560,80 @@ aarch64_emit_approx_rsqrt (rtx dst, rtx src)
   emit_move_insn (dst, x0);
 }
 
+/* Emit the instruction sequence to approximate the division NUM / DIV into QUO.  */
+
+bool
+aarch64_emit_approx_div (rtx quo, rtx num, rtx div)
+{
+  machine_mode mode = GET_MODE (quo);
+
+  if (!flag_finite_math_only
+      || flag_trapping_math
+      || !flag_unsafe_math_optimizations
+      || optimize_function_for_size_p (cfun)
+      || !(flag_mlow_precision_div
+	   || (aarch64_tune_params.approx_div_modes & AARCH64_APPROX_MODE (mode))))
+    return false;
+
+  /* Estimate the approximate reciprocal.  */
+  rtx xrcp = gen_reg_rtx (mode);
+  switch (mode)
+    {
+      case SFmode:
+	emit_insn (gen_aarch64_frecpesf (xrcp, div)); break;
+      case V2SFmode:
+	emit_insn (gen_aarch64_frecpev2sf (xrcp, div)); break;
+      case V4SFmode:
+	emit_insn (gen_aarch64_frecpev4sf (xrcp, div)); break;
+      case DFmode:
+	emit_insn (gen_aarch64_frecpedf (xrcp, div)); break;
+      case V2DFmode:
+	emit_insn (gen_aarch64_frecpev2df (xrcp, div)); break;
+      default:
+	gcc_unreachable ();
+    }
+
+  /* Iterate over the series twice for SF and thrice for DF.  */
+  int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
+
+  /* Optionally iterate over the series once less for faster performance,
+     while sacrificing the accuracy.  */
+  if (flag_mlow_precision_div)
+    iterations--;
+
+  rtx xtmp = gen_reg_rtx (mode);
+  while (iterations--)
+    {
+      switch (mode)
+        {
+	  case SFmode:
+	    emit_insn (gen_aarch64_frecpssf (xtmp, xrcp, div)); break;
+	  case V2SFmode:
+	    emit_insn (gen_aarch64_frecpsv2sf (xtmp, xrcp, div)); break;
+	  case V4SFmode:
+	    emit_insn (gen_aarch64_frecpsv4sf (xtmp, xrcp, div)); break;
+	  case DFmode:
+	    emit_insn (gen_aarch64_frecpsdf (xtmp, xrcp, div)); break;
+	  case V2DFmode:
+	    emit_insn (gen_aarch64_frecpsv2df (xtmp, xrcp, div)); break;
+	  default:
+	    gcc_unreachable ();
+        }
+
+      if (iterations > 0)
+	emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xtmp));
+    }
+
+  if (num != CONST1_RTX (mode))
+    {
+      rtx xnum = force_reg (mode, num);
+      emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xnum));
+    }
+
+  emit_set_insn (quo, gen_rtx_MULT (mode, xrcp, xtmp));
+  return true;
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..985915e 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4647,11 +4647,22 @@
   [(set_attr "type" "fmul<s>")]
 )
 
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:GPF 0 "register_operand")
+       (div:GPF (match_operand:GPF 1 "general_operand")
+		(match_operand:GPF 2 "register_operand")))]
+ "TARGET_FLOAT"
+{
+  if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+    DONE;
+
+  operands[1] = force_reg (<MODE>mode, operands[1]);
+})
+
+(define_insn "*div<mode>3"
   [(set (match_operand:GPF 0 "register_operand" "=w")
-        (div:GPF
-         (match_operand:GPF 1 "register_operand" "w")
-         (match_operand:GPF 2 "register_operand" "w")))]
+        (div:GPF (match_operand:GPF 1 "register_operand" "w")
+	         (match_operand:GPF 2 "register_operand" "w")))]
   "TARGET_FLOAT"
   "fdiv\\t%<s>0, %<s>1, %<s>2"
   [(set_attr "type" "fdiv<s>")]
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c637ff4..672f08c 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -153,3 +153,8 @@ mlow-precision-recip-sqrt
 Common Var(flag_mrecip_low_precision_sqrt) Optimization
 When calculating the reciprocal square root approximation,
 uses one less step than otherwise, thus reducing latency and precision.
+
+mlow-precision-div
+Common Var(flag_mlow_precision_div) Optimization
+When calculating the approximate division,
+use one less step than otherwise, thus reducing latency and precision.
\ No newline at end of file
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index e9763d4..297f9aa 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -572,6 +572,7 @@ Objective-C and Objective-C++ Dialects}.
 -mtls-size=@var{size} @gol
 -mfix-cortex-a53-835769  -mno-fix-cortex-a53-835769 @gol
 -mfix-cortex-a53-843419  -mno-fix-cortex-a53-843419 @gol
+-mlow-precision-div -mno-low-precision-div @gol
 -mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt@gol
 -march=@var{name}  -mcpu=@var{name}  -mtune=@var{name}}
 
@@ -12921,6 +12922,15 @@ uses one less step than otherwise, thus reducing latency and precision.
 This is only relevant if @option{-ffast-math} enables the reciprocal square root
 approximation, which in turn depends on the target processor.
 
+@item -mlow-precision-div
+@itemx -mno-low-precision-div
+@opindex mlow-precision-div
+@opindex mno-low-precision-div
+When calculating the division approximation,
+uses one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables the division
+approximation.
+
 @item -march=@var{name}
 @opindex march
 Specify the name of the target architecture and, optionally, one or
-- 
1.9.1


* Re: [AArch64] Emit division using the Newton series
  2016-04-04 19:06                 ` Evandro Menezes
@ 2016-04-12 18:15                   ` Evandro Menezes
  2016-04-21 18:44                     ` Evandro Menezes
  2016-04-27 14:16                   ` James Greenhalgh
  1 sibling, 1 reply; 16+ messages in thread
From: Evandro Menezes @ 2016-04-12 18:15 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: James Greenhalgh, Andrew Pinski, nd

On 04/04/16 14:06, Evandro Menezes wrote:
> On 04/01/16 17:52, Evandro Menezes wrote:
>> On 04/01/16 17:45, Wilco Dijkstra wrote:
>>> Evandro Menezes wrote:
>>>
>>>> However, I don't think that there's the need to handle any special 
>>>> case
>>>> for division.  The only case when the approximation differs from
>>>> division is when the numerator is infinity and the denominator, zero,
>>>> when the approximation returns infinity and the division, NAN.  So I
>>>> don't think that it's a special case that deserves being handled.  
>>>> IOW,
>>>> the result of the approximate reciprocal is always needed.
>>>   No, the result of the approximate reciprocal is not needed.
>>>
>>> Basically a NR approximation produces a correction factor that is 
>>> very close
>>> to 1.0, and then multiplies that with the previous estimate to get a 
>>> more
>>> accurate estimate. The final calculation for x * recip(y) is:
>>>
>>> result = (reciprocal_correction * reciprocal_estimate) * x
>>>
>>> while what I am suggesting is a trivial reassociation:
>>>
>>> result = reciprocal_correction * (reciprocal_estimate * x)
>>>
>>> The computation of the final reciprocal_correction is on the 
>>> critical latency
>>> path, while reciprocal_estimate is computed earlier, so we can compute
>>> (reciprocal_estimate * x) without increasing the overall latency.
>>> I.e., we take one multiply off the critical path.
>>>
>>> In principle this could be done as a separate optimization pass that 
>>> tries to
>>> reassociate to reduce latency. However I'm not too convinced this 
>>> would be
>>> easy to implement in GCC's scheduler, so it's best to do it explicitly.
>>
>> I think that I see what you mean.  I'll hack something tomorrow.
>
>    [AArch64] Emit division using the Newton series
>
>    2016-04-04  Evandro Menezes  <e.menezes@samsung.com>
>                 Wilco Dijkstra <Wilco.Dijkstra@arm.com>
>
>    gcc/
>             * config/aarch64/aarch64-tuning-flags.def
>             * config/aarch64/aarch64-protos.h
>             (AARCH64_APPROX_MODE): New macro.
> (AARCH64_EXTRA_TUNE_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}:
>             New tuning macros.
>             (tune_params): Add new member "approx_div_modes".
>             (aarch64_emit_approx_div): Declare new function.
>             * config/aarch64/aarch64.c
>             (generic_tunings): New member "approx_div_modes".
>             (cortexa35_tunings): Likewise.
>             (cortexa53_tunings): Likewise.
>             (cortexa57_tunings): Likewise.
>             (cortexa72_tunings): Likewise.
>             (exynosm1_tunings): Likewise.
>             (thunderx_tunings): Likewise.
>             (xgene1_tunings): Likewise.
>             (aarch64_emit_approx_div): Define new function.
>             * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
>             * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
>             * config/aarch64/aarch64.opt (-mlow-precision-div): Add new
>    option.
>             * doc/invoke.texi (-mlow-precision-div): Describe new option.
>
>
> This version of the patch has a shorter dependency chain at the last 
> iteration of the series.

Ping^1

-- 
Evandro Menezes

* RE: [AArch64] Emit division using the Newton series
  2016-04-12 18:15                   ` Evandro Menezes
@ 2016-04-21 18:44                     ` Evandro Menezes
  0 siblings, 0 replies; 16+ messages in thread
From: Evandro Menezes @ 2016-04-21 18:44 UTC (permalink / raw)
  To: 'Wilco Dijkstra', 'GCC Patches'
  Cc: 'James Greenhalgh', 'Andrew Pinski', 'nd'

> On 04/04/16 14:06, Evandro Menezes wrote:
> > On 04/01/16 17:52, Evandro Menezes wrote:
> >> On 04/01/16 17:45, Wilco Dijkstra wrote:
> >>> Evandro Menezes wrote:
> >>>
> >>>> However, I don't think that there's the need to handle any special
> >>>> case for division.  The only case when the approximation differs
> >>>> from division is when the numerator is infinity and the
> >>>> denominator, zero, when the approximation returns infinity and the
> >>>> division, NAN.  So I don't think that it's a special case that
> >>>> deserves being handled.
> >>>> IOW,
> >>>> the result of the approximate reciprocal is always needed.
> >>>   No, the result of the approximate reciprocal is not needed.
> >>>
> >>> Basically a NR approximation produces a correction factor that is
> >>> very close to 1.0, and then multiplies that with the previous
> >>> estimate to get a more accurate estimate. The final calculation for
> >>> x * recip(y) is:
> >>>
> >>> result = (reciprocal_correction * reciprocal_estimate) * x
> >>>
> >>> while what I am suggesting is a trivial reassociation:
> >>>
> >>> result = reciprocal_correction * (reciprocal_estimate * x)
> >>>
> >>> The computation of the final reciprocal_correction is on the
> >>> critical latency path, while reciprocal_estimate is computed
> >>> earlier, so we can compute (reciprocal_estimate * x) without
> >>> increasing the overall latency.
> >>> I.e., we take one multiply off the critical path.
> >>>
> >>> In principle this could be done as a separate optimization pass that
> >>> tries to reassociate to reduce latency. However I'm not too
> >>> convinced this would be easy to implement in GCC's scheduler, so
> >>> it's best to do it explicitly.
> >>
> >> I think that I see what you mean.  I'll hack something tomorrow.
> >
> >    [AArch64] Emit division using the Newton series
> >
> >    2016-04-04  Evandro Menezes  <e.menezes@samsung.com>
> >                 Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> >
> >    gcc/
> >             * config/aarch64/aarch64-tuning-flags.def
> >             * config/aarch64/aarch64-protos.h
> >             (AARCH64_APPROX_MODE): New macro.
> > (AARCH64_EXTRA_TUNE_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}:
> >             New tuning macros.
> >             (tune_params): Add new member "approx_div_modes".
> >             (aarch64_emit_approx_div): Declare new function.
> >             * config/aarch64/aarch64.c
> >             (generic_tunings): New member "approx_div_modes".
> >             (cortexa35_tunings): Likewise.
> >             (cortexa53_tunings): Likewise.
> >             (cortexa57_tunings): Likewise.
> >             (cortexa72_tunings): Likewise.
> >             (exynosm1_tunings): Likewise.
> >             (thunderx_tunings): Likewise.
> >             (xgene1_tunings): Likewise.
> >             (aarch64_emit_approx_div): Define new function.
> >             * config/aarch64/aarch64.md ("div<mode>3"): New expansion.
> >             * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
> >             * config/aarch64/aarch64.opt (-mlow-precision-div): Add new
> >    option.
> >             * doc/invoke.texi (-mlow-precision-div): Describe new option.
> >
> >
> > This version of the patch has a shorter dependency chain at the last
> > iteration of the series.
> 
> Ping^1

Ping^2

-- 
Evandro Menezes                              Austin, TX

* Re: [AArch64] Emit division using the Newton series
  2016-04-04 19:06                 ` Evandro Menezes
  2016-04-12 18:15                   ` Evandro Menezes
@ 2016-04-27 14:16                   ` James Greenhalgh
  2016-04-27 14:44                     ` Wilco Dijkstra
  2016-04-27 15:43                     ` Evandro Menezes
  1 sibling, 2 replies; 16+ messages in thread
From: James Greenhalgh @ 2016-04-27 14:16 UTC (permalink / raw)
  To: Evandro Menezes; +Cc: Wilco Dijkstra, GCC Patches, Andrew Pinski, nd

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index b7086dd..21af809 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -414,7 +414,8 @@ static const struct tune_params generic_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> +  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
>  };
>  
>  static const struct tune_params cortexa35_tunings =
> @@ -439,7 +440,8 @@ static const struct tune_params cortexa35_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> +  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
>  };
>  
>  static const struct tune_params cortexa53_tunings =
> @@ -464,7 +466,8 @@ static const struct tune_params cortexa53_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> +  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
>  };
>  
>  static const struct tune_params cortexa57_tunings =
> @@ -489,7 +492,8 @@ static const struct tune_params cortexa57_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
> +  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
>  };
>  
>  static const struct tune_params cortexa72_tunings =
> @@ -514,7 +518,8 @@ static const struct tune_params cortexa72_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> +  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
>  };
>  
>  static const struct tune_params exynosm1_tunings =
> @@ -538,7 +543,8 @@ static const struct tune_params exynosm1_tunings =
>    48,	/* max_case_values.  */
>    64,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_APPROX_RSQRT), /* tune_flags.  */
> +  (AARCH64_APPROX_NONE) /* approx_div_modes.  */
>  };
>  
>  static const struct tune_params thunderx_tunings =
> @@ -562,7 +568,8 @@ static const struct tune_params thunderx_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> +  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
>  };
>  
>  static const struct tune_params xgene1_tunings =
> @@ -586,7 +593,8 @@ static const struct tune_params xgene1_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_APPROX_RSQRT),	/* tune_flags.  */
> +  (AARCH64_APPROX_NONE)	/* approx_div_modes.  */
>  };

So this is off for all cores currently supported by GCC?

I'm not sure I understand why we should take this if it will immediately
be dead code?

Thanks,
James

* Re: [AArch64] Emit division using the Newton series
  2016-04-27 14:16                   ` James Greenhalgh
@ 2016-04-27 14:44                     ` Wilco Dijkstra
  2016-04-27 15:43                     ` Evandro Menezes
  1 sibling, 0 replies; 16+ messages in thread
From: Wilco Dijkstra @ 2016-04-27 14:44 UTC (permalink / raw)
  To: James Greenhalgh, Evandro Menezes; +Cc: GCC Patches, Andrew Pinski, nd

James Greenhalgh wrote:
> So this is off for all cores currently supported by GCC?
> 
> I'm not sure I understand why we should take this if it will immediately
> be dead code?

I presume it was meant to have the vector variants enabled with -mcpu=exynos-m1
as that is where you can get a good gain if you only have a single divide+sqrt unit.
The same applies to the sqrt case too, and I guess -mcpu=xgene-1.

Wilco

* Re: [AArch64] Emit division using the Newton series
  2016-04-27 14:16                   ` James Greenhalgh
  2016-04-27 14:44                     ` Wilco Dijkstra
@ 2016-04-27 15:43                     ` Evandro Menezes
  1 sibling, 0 replies; 16+ messages in thread
From: Evandro Menezes @ 2016-04-27 15:43 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Wilco Dijkstra, GCC Patches, Andrew Pinski, nd

On 04/27/16 09:15, James Greenhalgh wrote:
> So this is off for all cores currently supported by GCC? I'm not sure 
> I understand why we should take this if it will immediately be dead code? 

Excuse me?  Other target maintainers are free to evaluate whether this 
code is useful to them, and users can already enable it through the 
command-line option -mlow-precision-div.

-- 
Evandro Menezes

Thread overview: 16+ messages
2016-03-17 21:14 [AArch64] Emit division using the Newton series Evandro Menezes
2016-03-23 16:24 ` Evandro Menezes
2016-03-23 16:25 ` Evandro Menezes
2016-03-31 22:23   ` Evandro Menezes
2016-04-01 13:58     ` Wilco Dijkstra
2016-04-01 19:47       ` Evandro Menezes
2016-04-01 21:22         ` Wilco Dijkstra
2016-04-01 21:56           ` Evandro Menezes
2016-04-01 22:46             ` Wilco Dijkstra
2016-04-01 22:52               ` Evandro Menezes
2016-04-04 19:06                 ` Evandro Menezes
2016-04-12 18:15                   ` Evandro Menezes
2016-04-21 18:44                     ` Evandro Menezes
2016-04-27 14:16                   ` James Greenhalgh
2016-04-27 14:44                     ` Wilco Dijkstra
2016-04-27 15:43                     ` Evandro Menezes
