[ARM] Implement division using vrecpe, vrecps

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [ARM] Implement division using vrecpe, vrecps
@ 2018-10-26  6:36 Prathamesh Kulkarni
  2018-11-02  9:39 ` Prathamesh Kulkarni
  2018-11-05 13:52 ` Ramana Radhakrishnan
  0 siblings, 2 replies; 7+ messages in thread
From: Prathamesh Kulkarni @ 2018-10-26  6:36 UTC (permalink / raw)
  To: gcc Patches, Kyrill Tkachov, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 1031 bytes --]

Hi,
This is a rebased version of patch that adds a pattern to neon.md for
implementing division with multiplication by reciprocal using
vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
The newly added test-cases are not vectorized on armeb target with
-O2. I posted the analysis for that here:
https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

Briefly, the difference between little and big-endian vectorizer is in
arm_builtin_support_vector_misalignment() which calls
default_builtin_support_vector_misalignment() for big-endian case, and
that returns false because
movmisalign_optab does not exist for V2SF mode. This isn't observed
with -O3 because loop peeling for alignment gets enabled.

It seems that the test cases in patch appear unsupported on armeb,
after r221677 thus this patch requires no changes to
target-supports.exp to adjust for armeb (unlike last time which
stalled the patch).

Bootstrap+tested on arm-linux-gnueabihf.
Cross-tested on arm*-*-* variants.
OK for trunk ?

Thanks,
Prathamesh

[-- Attachment #2: tcwg-319-3.txt --]
[-- Type: text/plain, Size: 3349 bytes --]

2018-10-26  Prathamesh Kulkarni  <prathamesh.kulkarni@linaro.org>

	* config/arm/neon.md (div<mode>3): New pattern.

testsuite/
	* gcc.target/arm/neon-vect-div-1.c: New test.
	* gcc.target/arm/neon-vect-div-2.c: Likewise.

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 5aeee4b08c1..25ed45d381a 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -620,6 +620,38 @@
                     (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+/* Perform division using multiply-by-reciprocal.
+   Reciprocal is calculated using Newton-Raphson method.
+   Enabled with -funsafe-math-optimizations -freciprocal-math
+   and disabled for -Os since it increases code size .  */
+
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+		  (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && !optimize_size
+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate.  */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of newton-raphson method.  */
+    for (int i = 0; i < 2; i++)
+      {
+	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
new file mode 100644
index 00000000000..50d04b4175b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-require-effective-target vect_hw_misalign } */
+/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
new file mode 100644
index 00000000000..606f54b4e0e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-require-effective-target vect_hw_misalign } */
+/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ARM] Implement division using vrecpe, vrecps
  2018-10-26  6:36 [ARM] Implement division using vrecpe, vrecps Prathamesh Kulkarni
@ 2018-11-02  9:39 ` Prathamesh Kulkarni
  2018-11-05 13:52 ` Ramana Radhakrishnan
  1 sibling, 0 replies; 7+ messages in thread
From: Prathamesh Kulkarni @ 2018-11-02  9:39 UTC (permalink / raw)
  To: gcc Patches, Kyrill Tkachov, Ramana Radhakrishnan

On Fri, 26 Oct 2018 at 10:34, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> Hi,
> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> The newly added test-cases are not vectorized on armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
>
> Briefly, the difference between little and big-endian vectorizer is in
> arm_builtin_support_vector_misalignment() which calls
> default_builtin_support_vector_misalignment() for big-endian case, and
> that returns false because
> movmisalign_optab does not exist for V2SF mode. This isn't observed
> with -O3 because loop peeling for alignment gets enabled.
>
> It seems that the test cases in patch appear unsupported on armeb,
> after r221677 thus this patch requires no changes to
> target-supports.exp to adjust for armeb (unlike last time which
> stalled the patch).
>
> Bootstrap+tested on arm-linux-gnueabihf.
> Cross-tested on arm*-*-* variants.
> OK for trunk ?
ping: https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01645.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ARM] Implement division using vrecpe, vrecps
  2018-10-26  6:36 [ARM] Implement division using vrecpe, vrecps Prathamesh Kulkarni
  2018-11-02  9:39 ` Prathamesh Kulkarni
@ 2018-11-05 13:52 ` Ramana Radhakrishnan
  2018-11-09  6:44   ` Prathamesh Kulkarni
  1 sibling, 1 reply; 7+ messages in thread
From: Ramana Radhakrishnan @ 2018-11-05 13:52 UTC (permalink / raw)
  To: Prathamesh Kulkarni, gcc Patches, Kyrylo Tkachov; +Cc: nd

On 26/10/2018 06:04, Prathamesh Kulkarni wrote:
> Hi,
> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> The newly added test-cases are not vectorized on armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
> 
> Briefly, the difference between little and big-endian vectorizer is in
> arm_builtin_support_vector_misalignment() which calls
> default_builtin_support_vector_misalignment() for big-endian case, and
> that returns false because
> movmisalign_optab does not exist for V2SF mode. This isn't observed
> with -O3 because loop peeling for alignment gets enabled.
> 
> It seems that the test cases in patch appear unsupported on armeb,
> after r221677 thus this patch requires no changes to
> target-supports.exp to adjust for armeb (unlike last time which
> stalled the patch).
> 
> Bootstrap+tested on arm-linux-gnueabihf.
> Cross-tested on arm*-*-* variants.
> OK for trunk ?
> 
> Thanks,
> Prathamesh
> 
> 
> tcwg-319-3.txt
> 
> 2018-10-26  Prathamesh Kulkarni<prathamesh.kulkarni@linaro.org>
> 
> 	* config/arm/neon.md (div<mode>3): New pattern.
> 
> testsuite/
> 	* gcc.target/arm/neon-vect-div-1.c: New test.
> 	* gcc.target/arm/neon-vect-div-2.c: Likewise.
> 
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index 5aeee4b08c1..25ed45d381a 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -620,6 +620,38 @@
>                       (const_string "neon_mul_<V_elem_ch><q>")))]
>   )
>   
> +/* Perform division using multiply-by-reciprocal.
> +   Reciprocal is calculated using Newton-Raphson method.
> +   Enabled with -funsafe-math-optimizations -freciprocal-math
> +   and disabled for -Os since it increases code size .  */ > +
> +(define_expand "div<mode>3"
> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
> +		  (match_operand:VCVTF 2 "s_register_operand" "w")))]
> +  "TARGET_NEON && !optimize_size
> +   && flag_unsafe_math_optimizations && flag_reciprocal_math"

I would prefer this to be more granular than 
flag_unsafe_math_optimization && flag_reciprocal_math which really is 
flag_reciprocal_math as it is turned on by default with 
funsafe-math-optimizations.

I think this should really be just flag_reciprocal_math.


Otherwise ok.

regards
Ramana




> +  {
> +    rtx rec = gen_reg_rtx (<MODE>mode);
> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
> +
> +    /* Reciprocal estimate.  */
> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
> +
> +    /* Perform 2 iterations of newton-raphson method.  */
> +    for (int i = 0; i < 2; i++)
> +      {
> +	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
> +	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
> +      }
> +
> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
> +    DONE;
> +  }
> +)
> +
> +
>   (define_insn "mul<mode>3add<mode>_neon"
>     [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>           (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
> diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> new file mode 100644
> index 00000000000..50d04b4175b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> @@ -0,0 +1,16 @@
> +/* Test pattern div<mode>3.  */
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-require-effective-target vect_hw_misalign } */
> +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
> +/* { dg-add-options arm_neon } */
> +
> +void
> +foo (int len, float * __restrict p, float *__restrict x)
> +{
> +  len = len & ~31;
> +  for (int i = 0; i < len; i++)
> +    p[i] = p[i] / x[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> new file mode 100644
> index 00000000000..606f54b4e0e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> @@ -0,0 +1,16 @@
> +/* Test pattern div<mode>3.  */
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-require-effective-target vect_hw_misalign } */
> +/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */
> +/* { dg-add-options arm_neon } */
> +
> +void
> +foo (int len, float * __restrict p, float *__restrict x)
> +{
> +  len = len & ~31;
> +  for (int i = 0; i < len; i++)
> +    p[i] = p[i] / x[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ARM] Implement division using vrecpe, vrecps
  2018-11-05 13:52 ` Ramana Radhakrishnan
@ 2018-11-09  6:44   ` Prathamesh Kulkarni
  0 siblings, 0 replies; 7+ messages in thread
From: Prathamesh Kulkarni @ 2018-11-09  6:44 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: gcc Patches, Kyrill Tkachov, nd

On Mon, 5 Nov 2018 at 19:22, Ramana Radhakrishnan
<Ramana.Radhakrishnan@arm.com> wrote:
>
> On 26/10/2018 06:04, Prathamesh Kulkarni wrote:
> > Hi,
> > This is a rebased version of patch that adds a pattern to neon.md for
> > implementing division with multiplication by reciprocal using
> > vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> > The newly added test-cases are not vectorized on armeb target with
> > -O2. I posted the analysis for that here:
> > https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
> >
> > Briefly, the difference between little and big-endian vectorizer is in
> > arm_builtin_support_vector_misalignment() which calls
> > default_builtin_support_vector_misalignment() for big-endian case, and
> > that returns false because
> > movmisalign_optab does not exist for V2SF mode. This isn't observed
> > with -O3 because loop peeling for alignment gets enabled.
> >
> > It seems that the test cases in patch appear unsupported on armeb,
> > after r221677 thus this patch requires no changes to
> > target-supports.exp to adjust for armeb (unlike last time which
> > stalled the patch).
> >
> > Bootstrap+tested on arm-linux-gnueabihf.
> > Cross-tested on arm*-*-* variants.
> > OK for trunk ?
> >
> > Thanks,
> > Prathamesh
> >
> >
> > tcwg-319-3.txt
> >
> > 2018-10-26  Prathamesh Kulkarni<prathamesh.kulkarni@linaro.org>
> >
> >       * config/arm/neon.md (div<mode>3): New pattern.
> >
> > testsuite/
> >       * gcc.target/arm/neon-vect-div-1.c: New test.
> >       * gcc.target/arm/neon-vect-div-2.c: Likewise.
> >
> > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> > index 5aeee4b08c1..25ed45d381a 100644
> > --- a/gcc/config/arm/neon.md
> > +++ b/gcc/config/arm/neon.md
> > @@ -620,6 +620,38 @@
> >                       (const_string "neon_mul_<V_elem_ch><q>")))]
> >   )
> >
> > +/* Perform division using multiply-by-reciprocal.
> > +   Reciprocal is calculated using Newton-Raphson method.
> > +   Enabled with -funsafe-math-optimizations -freciprocal-math
> > +   and disabled for -Os since it increases code size .  */ > +
> > +(define_expand "div<mode>3"
> > +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
> > +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
> > +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
> > +  "TARGET_NEON && !optimize_size
> > +   && flag_unsafe_math_optimizations && flag_reciprocal_math"
>
> I would prefer this to be more granular than
> flag_unsafe_math_optimization && flag_reciprocal_math which really is
> flag_reciprocal_math as it is turned on by default with
> funsafe-math-optimizations.
>
> I think this should really be just flag_reciprocal_math.
>
>
> Otherwise ok.
Thanks, committed it in r265948 after removing check for
flag_unsafe_math_optimizations.

Regards,
Prathamesh
>
> regards
> Ramana
>
>
>
>
> > +  {
> > +    rtx rec = gen_reg_rtx (<MODE>mode);
> > +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
> > +
> > +    /* Reciprocal estimate.  */
> > +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
> > +
> > +    /* Perform 2 iterations of newton-raphson method.  */
> > +    for (int i = 0; i < 2; i++)
> > +      {
> > +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
> > +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
> > +      }
> > +
> > +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
> > +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
> > +    DONE;
> > +  }
> > +)
> > +
> > +
> >   (define_insn "mul<mode>3add<mode>_neon"
> >     [(set (match_operand:VDQW 0 "s_register_operand" "=w")
> >           (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
> > diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> > new file mode 100644
> > index 00000000000..50d04b4175b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> > @@ -0,0 +1,16 @@
> > +/* Test pattern div<mode>3.  */
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_neon_ok } */
> > +/* { dg-require-effective-target vect_hw_misalign } */
> > +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
> > +/* { dg-add-options arm_neon } */
> > +
> > +void
> > +foo (int len, float * __restrict p, float *__restrict x)
> > +{
> > +  len = len & ~31;
> > +  for (int i = 0; i < len; i++)
> > +    p[i] = p[i] / x[i];
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> > new file mode 100644
> > index 00000000000..606f54b4e0e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> > @@ -0,0 +1,16 @@
> > +/* Test pattern div<mode>3.  */
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_neon_ok } */
> > +/* { dg-require-effective-target vect_hw_misalign } */
> > +/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */
> > +/* { dg-add-options arm_neon } */
> > +
> > +void
> > +foo (int len, float * __restrict p, float *__restrict x)
> > +{
> > +  len = len & ~31;
> > +  for (int i = 0; i < len; i++)
> > +    p[i] = p[i] / x[i];
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */
> >
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ARM] Implement division using vrecpe, vrecps
  2018-11-05  4:55 ` Prathamesh Kulkarni
@ 2018-11-05 13:35   ` Wilco Dijkstra
  0 siblings, 0 replies; 7+ messages in thread
From: Wilco Dijkstra @ 2018-11-05 13:35 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: nd, GCC Patches, Kyrylo Tkachov, Ramana Radhakrishnan

Hi Prathamesh,

Prathamesh Kulkarni wrote:
> Thanks for the suggestions. The last time I benchmarked the patch
> (around Jan 2016)
> I got following results with the patch for SPEC2006:
>
> a15: +0.64% overall, 481.wrf: +6.46%
> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
> a57: +0.35% overall, 481.wrf: +3.84%
> (https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01209.html)
>
> Do these numbers look acceptable ?
> I am benchmarking the patch on ToT, and will report if there are any
> performance improvements found with the patch.

Yes those results are quite good - in fact they seemed too good to be true at first.
However looking at arm/neon.md there isn't a division pattern. So I think it's worth
mentioning in the description that your patch actually adds vectorization of
division. Disassembling the AArch64 wrf binary shows several hundred vector
division instructions - so the speedup makes sense now since many more loops
are being vectorized.

It's a shame this pattern wasn't added many years ago... It's a good idea to add a
vectorized (r)sqrt too as this will improve wrf even further.

Wilco

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ARM] Implement division using vrecpe, vrecps
  2018-11-02 13:38 Wilco Dijkstra
@ 2018-11-05  4:55 ` Prathamesh Kulkarni
  2018-11-05 13:35   ` Wilco Dijkstra
  0 siblings, 1 reply; 7+ messages in thread
From: Prathamesh Kulkarni @ 2018-11-05  4:55 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: nd, gcc Patches, Kyrill Tkachov, Ramana Radhakrishnan

On Fri, 2 Nov 2018 at 19:08, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Prathamesh Kulkarni wrote:
>
> > This is a rebased version of patch that adds a pattern to neon.md for
> > implementing division with multiplication by reciprocal using
> > vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> > The newly added test-cases are not vectorized on armeb target with
> > -O2. I posted the analysis for that here:
> > https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
>
> I don't think doing this unconditionally for any CPU is a good idea. On AArch64
> we don't enable this for any core since it's not really faster (newer CPUs have
> significantly improved division and the reciprocal instructions reduce throughput
> of other FMAs). On wrf doing reciprocal square root is far better than reciprocal
> division, but it's only faster on some specific CPUs, so it's not enabled by default.
Hi Wilco,
Thanks for the suggestions. The last time I benchmarked the patch
(around Jan 2016)
I got following results with the patch for SPEC2006:

a15: +0.64% overall, 481.wrf: +6.46%
a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
a57: +0.35% overall, 481.wrf: +3.84%
(https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01209.html)

Do these numbers look acceptable ?
I am benchmarking the patch on ToT, and will report if there are any
performance improvements found with the patch.

Thanks,
Prathamesh
>
> Wilco

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ARM] Implement division using vrecpe, vrecps
@ 2018-11-02 13:38 Wilco Dijkstra
  2018-11-05  4:55 ` Prathamesh Kulkarni
  0 siblings, 1 reply; 7+ messages in thread
From: Wilco Dijkstra @ 2018-11-02 13:38 UTC (permalink / raw)
  To: prathamesh.kulkarni; +Cc: nd, GCC Patches, Kyrylo Tkachov, Ramana Radhakrishnan

Prathamesh Kulkarni wrote:

> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> The newly added test-cases are not vectorized on armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

I don't think doing this unconditionally for any CPU is a good idea. On AArch64
we don't enable this for any core since it's not really faster (newer CPUs have
significantly improved division and the reciprocal instructions reduce throughput
of other FMAs). On wrf doing reciprocal square root is far better than reciprocal
division, but it's only faster on some specific CPUs, so it's not enabled by default.

Wilco

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-11-09  6:44 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-26  6:36 [ARM] Implement division using vrecpe, vrecps Prathamesh Kulkarni
2018-11-02  9:39 ` Prathamesh Kulkarni
2018-11-05 13:52 ` Ramana Radhakrishnan
2018-11-09  6:44   ` Prathamesh Kulkarni
2018-11-02 13:38 Wilco Dijkstra
2018-11-05  4:55 ` Prathamesh Kulkarni
2018-11-05 13:35   ` Wilco Dijkstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).