Re: [PATCH] aarch64: Improve SVE sin polynomial

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* Re: [PATCH] aarch64: Improve SVE sin polynomial
@ 2023-08-03 14:21 Wilco Dijkstra
  2023-08-03 15:22 ` Szabolcs Nagy
  0 siblings, 1 reply; 7+ messages in thread
From: Wilco Dijkstra @ 2023-08-03 14:21 UTC (permalink / raw)
  To: paul zimmermann; +Cc: 'GNU C Library', Joe Ramsay, Szabolcs Nagy

Hi Paul,

> you can still improve the polynomial with the following one generated by
> Sollya (https://www.sollya.org/) with relative error < 2^-52.454 on
> [-pi/2, pi/2], instead of 2^-51.765 for the one you propose:

So the goal is to reduce the ULP error. Due to the way floating point works,
errors are at their worst when the result is just above a power of 2. For the
wide sin(x) polynomial it happens when sin(x) is close to but larger than 0.5.
The poly is extremely accurate for small x. It's only problematic when x is large
and we cross an exponent boundary.

Szabolcs can explain it better but what we ended up doing is to scale the
sollya function so that errors at the end of the range are and reduced more.

Note also this is not the final polynomial. The ULP is a bit high at 3.3. I had a
more accurate one close to 2ULP but it could give results >1.0. So we need to
teach sollya to estimate the end from below, and not allow it to overshoot.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] aarch64: Improve SVE sin polynomial
  2023-08-03 14:21 [PATCH] aarch64: Improve SVE sin polynomial Wilco Dijkstra
@ 2023-08-03 15:22 ` Szabolcs Nagy
       [not found]   ` <p9u0msz79sga.fsf@coriandre.loria.fr>
  0 siblings, 1 reply; 7+ messages in thread
From: Szabolcs Nagy @ 2023-08-03 15:22 UTC (permalink / raw)
  To: Wilco Dijkstra, paul zimmermann; +Cc: 'GNU C Library', Joe Ramsay

The 08/03/2023 15:21, Wilco Dijkstra wrote:
> Hi Paul,
> 
> > you can still improve the polynomial with the following one generated by
> > Sollya (https://www.sollya.org/) with relative error < 2^-52.454 on
> > [-pi/2, pi/2], instead of 2^-51.765 for the one you propose:
> 
> So the goal is to reduce the ULP error. Due to the way floating point works,
> errors are at their worst when the result is just above a power of 2. For the
> wide sin(x) polynomial it happens when sin(x) is close to but larger than 0.5.
> The poly is extremely accurate for small x. It's only problematic when x is large
> and we cross an exponent boundary.
> 
> Szabolcs can explain it better but what we ended up doing is to scale the
> sollya function so that errors at the end of the range are and reduced more.

i'd say we want to minimize the maximum of

 (poly(x) - sin(x))/ulp(sin(x))

instead of

 (poly(x) - sin(x))/sin(x)

since ulp(x) is flat between 0.5 and 1 this can make a big
difference when designing a sin poly for -pi/2,pi/2.

e.g. you want to look at the abs error on [pi/6,pi/2], not
relative error when comparing the polynomials.

this is if you only consider approximation errors, but
there is an independent effect of large rounding errors
where operations in x + c*x*x*x cross powof2 boundaries
which is also near x = pi/2.

so for weighting func i used a polynomial that roughly
approximates ulp(sin(x)) but forces smaller approx error
where i expect big rounding error (this part was a bit
of black magic but i think the approx error part is
sound and covers most of the quality gain).

> 
> Note also this is not the final polynomial. The ULP is a bit high at 3.3. I had a
> more accurate one close to 2ULP but it could give results >1.0. So we need to
> teach sollya to estimate the end from below, and not allow it to overshoot.

we originally planned to use a poly with one more term,
but that made errors go in the wrong direction at x=pi/2,
and it could return 1.0 + 0x1p-52 which is not good for
sin(x). i could not fix this with weighting, but reducing
the degree the error goes into the other direction so
even though the error is bigger then it's probably better
in practice.

hope this helps.

^ permalink raw reply	[flat|nested] 7+ messages in thread

[parent not found: <p9u0msz79sga.fsf@coriandre.loria.fr>]

* Re: [PATCH] aarch64: Improve SVE sin polynomial
       [not found]   ` <p9u0msz79sga.fsf@coriandre.loria.fr>
@ 2023-08-04  8:35     ` Szabolcs Nagy
  2023-08-04 14:51     ` Vincent Lefevre
  1 sibling, 0 replies; 7+ messages in thread
From: Szabolcs Nagy @ 2023-08-04  8:35 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: Wilco.Dijkstra, libc-alpha, Joe.Ramsay

The 08/04/2023 09:28, Paul Zimmermann wrote:
>        Hi Szabolcs,
> 
> > i'd say we want to minimize the maximum of
> > 
> >  (poly(x) - sin(x))/ulp(sin(x))
> > 
> > instead of
> > 
> >  (poly(x) - sin(x))/sin(x)
> 
> interesting question! I have asked the Sollya developers if this is possible.
> 
> Anyway, unless I'm wrong I find the largest ulp error with the polynomial
> from the patch is about 1.72 at x=0.252552902718105, and for the polynomial
> I obtained with Sollya it is about 1.43 at the same value of x.
> 
> See attached graph (blue = polynomial from the patch, red = polynomial
> I obtained with Sollya), which for 0 <= x <= pi/2 prints the absolute
> value of the ulp error as you define above.

neither the red nor the blue line has flat maximums, so they are not
optimal for |poly(x)-sin(x)|/ulp(sin(x)). but that only considers
approx error, not rounding.

as you can see the blue error goes down near pi/2, while it's high
around small values like 0.25. this is deliberate as at small values
there is tiny rounding error, while in [1,pi/2] there is cancellation
with huge errors (the order of x + c*x*x*x is smaller than x so the
terms are big an cancel each other after relatively big rounding
error).

so the actual worst case error after rounding is much bigger than
what sollya reports (approx error) and it is in [1,pi/2]. e.g.

_ZGVnN2v_sin(0x1.7a83373b951abp+0) got 0x1.fdd2e6d528af4p-1 want 0x1.fdd2e6d528af7p-1 - 0.234485 ulp

error: -2.7655 ulp

which shows that there is a point where rounding error takes over
and becomes dominant over approx error. the weighting function can
take this into account by roughly following ulp(sin(x)) for a while
and then in [1,pi/2] making it smaller (this is the weighting i
used, but it was put together manually, not computed precisely:
ideally we would compute rounding_error_bound(x) function based
on the poly eval and incorporate that into the weighting. for most
functions i don't expect this to help, but it matters for sin(x).)

i would expect simply minimizing the approx error (using the
ulp(sin(x)) weighting) would be better than your polynomial
(sin(x) weighting) after rounding but using our polynomial
(manually adjusted weighting) is even better (i verified it
experimentally that measured error near the approx error peaks
are roughly the same so the error is evenly spread out, but
i'm sure slightly better polynomial is possible to find with
a bit more research into the optimal design).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] aarch64: Improve SVE sin polynomial
       [not found]   ` <p9u0msz79sga.fsf@coriandre.loria.fr>
  2023-08-04  8:35     ` Szabolcs Nagy
@ 2023-08-04 14:51     ` Vincent Lefevre
  1 sibling, 0 replies; 7+ messages in thread
From: Vincent Lefevre @ 2023-08-04 14:51 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: Szabolcs Nagy, Wilco.Dijkstra, libc-alpha, Joe.Ramsay

On 2023-08-04 09:28:53 +0200, Paul Zimmermann via Libc-alpha wrote:
>        Hi Szabolcs,
> 
> > i'd say we want to minimize the maximum of
> > 
> >  (poly(x) - sin(x))/ulp(sin(x))
> > 
> > instead of
> > 
> >  (poly(x) - sin(x))/sin(x)
> 
> interesting question! I have asked the Sollya developers if this is
> possible.

I would say that a large change of the weight function (e.g.
a discontinuity point) is likely to generate an extrema for
the error function, which could affect the quality of the
minimax approximation. That's from the theory. In practice,
it is still possible that this works by chance, but I suppose
that this depends on the considered function and domain.

Concerning Sollya, doesn't it uses the derivative of the
error function (making the direct use of ulp impossible)?

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] aarch64: Improve SVE sin polynomial
@ 2023-08-03 11:54 Joe Ramsay
  2023-08-03 12:20 ` Paul Zimmermann
  2023-10-03 10:48 ` Szabolcs Nagy
  0 siblings, 2 replies; 7+ messages in thread
From: Joe Ramsay @ 2023-08-03 11:54 UTC (permalink / raw)
  To: libc-alpha; +Cc: Joe Ramsay

The polynomial used in the AdvSIMD variant was found to be more
efficient than FTRIG intrinsics, with acceptable loss of precision.
---
Thanks,
Joe
 sysdeps/aarch64/fpu/sin_sve.c | 99 ++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 49 deletions(-)

diff --git a/sysdeps/aarch64/fpu/sin_sve.c b/sysdeps/aarch64/fpu/sin_sve.c
index c3f450d0ea..4d93e761af 100644
--- a/sysdeps/aarch64/fpu/sin_sve.c
+++ b/sysdeps/aarch64/fpu/sin_sve.c
@@ -21,20 +21,23 @@
 
 static const struct data
 {
-  double inv_pi, half_pi, inv_pi_over_2, pi_over_2_1, pi_over_2_2, pi_over_2_3,
-      shift;
+  double inv_pi, pi_1, pi_2, pi_3, shift, range_val;
+  double poly[7];
 } data = {
-  /* Polynomial coefficients are hard-wired in the FTMAD instruction.  */
+  /* Worst-case error is 2.8+0.5 ulp in [-pi/2, pi/2].  */
+  .poly = { -0x1.555555555547bp-3, 0x1.1111111108a4dp-7, -0x1.a01a019936f27p-13,
+            0x1.71de37a97d93ep-19, -0x1.ae633919987c6p-26,
+            0x1.60e277ae07cecp-33, -0x1.9e9540300a1p-41, },
+
   .inv_pi = 0x1.45f306dc9c883p-2,
-  .half_pi = 0x1.921fb54442d18p+0,
-  .inv_pi_over_2 = 0x1.45f306dc9c882p-1,
-  .pi_over_2_1 = 0x1.921fb50000000p+0,
-  .pi_over_2_2 = 0x1.110b460000000p-26,
-  .pi_over_2_3 = 0x1.1a62633145c07p-54,
-  .shift = 0x1.8p52
+  .pi_1 = 0x1.921fb54442d18p+1,
+  .pi_2 = 0x1.1a62633145c06p-53,
+  .pi_3 = 0x1.c1cd129024e09p-106,
+  .shift = 0x1.8p52,
+  .range_val = 0x1p23,
 };
 
-#define RangeVal 0x4160000000000000 /* asuint64 (0x1p23).  */
+#define C(i) sv_f64 (d->poly[i])
 
 static svfloat64_t NOINLINE
 special_case (svfloat64_t x, svfloat64_t y, svbool_t cmp)
@@ -42,55 +45,53 @@ special_case (svfloat64_t x, svfloat64_t y, svbool_t cmp)
   return sv_call_f64 (sin, x, y, cmp);
 }
 
-/* A fast SVE implementation of sin based on trigonometric
-   instructions (FTMAD, FTSSEL, FTSMUL).
-   Maximum observed error in 2.52 ULP:
-   SV_NAME_D1 (sin)(0x1.2d2b00df69661p+19) got 0x1.10ace8f3e786bp-40
-					  want 0x1.10ace8f3e7868p-40.  */
+/* A fast SVE implementation of sin.
+   Maximum observed error in 3.22 ULP:
+   _ZGVsMxv_sin (0x1.d70eef40f39b1p+12) got -0x1.ffe9537d5dbb7p-3
+				       want -0x1.ffe9537d5dbb4p-3.  */
 svfloat64_t SV_NAME_D1 (sin) (svfloat64_t x, const svbool_t pg)
 {
   const struct data *d = ptr_barrier (&data);
 
-  svfloat64_t r = svabs_f64_x (pg, x);
-  svuint64_t sign
-      = sveor_u64_x (pg, svreinterpret_u64_f64 (x), svreinterpret_u64_f64 (r));
-  svbool_t cmp = svcmpge_n_u64 (pg, svreinterpret_u64_f64 (r), RangeVal);
-
-  /* Load first two pio2-related constants to one vector.  */
-  svfloat64_t invpio2_and_pio2_1
-      = svld1rq_f64 (svptrue_b64 (), &d->inv_pi_over_2);
-
-  /* n = rint(|x|/(pi/2)).  */
-  svfloat64_t q = svmla_lane_f64 (sv_f64 (d->shift), r, invpio2_and_pio2_1, 0);
-  svfloat64_t n = svsub_n_f64_x (pg, q, d->shift);
+  /* Load some values in quad-word chunks to minimise memory access.  */
+  const svbool_t ptrue = svptrue_b64 ();
+  svfloat64_t shift = sv_f64 (d->shift);
+  svfloat64_t inv_pi_and_pi1 = svld1rq (ptrue, &d->inv_pi);
+  svfloat64_t pi2_and_pi3 = svld1rq (ptrue, &d->pi_2);
+
+  /* n = rint(|x|/pi).  */
+  svfloat64_t n = svmla_lane (shift, x, inv_pi_and_pi1, 0);
+  svuint64_t odd = svlsl_x (pg, svreinterpret_u64 (n), 63);
+  odd = sveor_m (
+      svcmpeq (pg, svreinterpret_u64 (x), svreinterpret_u64 (sv_f64 (-0.0))),
+      odd, 0x8000000000000000);
+  n = svsub_x (pg, n, shift);
+
+  /* r = |x| - n*(pi/2)  (range reduction into -pi/2 .. pi/2).  */
+  svfloat64_t r = x;
+  r = svmls_lane (r, n, inv_pi_and_pi1, 1);
+  r = svmls_lane (r, n, pi2_and_pi3, 0);
+  r = svmls_lane (r, n, pi2_and_pi3, 1);
 
-  /* r = |x| - n*(pi/2)  (range reduction into -pi/4 .. pi/4).  */
-  r = svmls_lane_f64 (r, n, invpio2_and_pio2_1, 1);
-  r = svmls_n_f64_x (pg, r, n, d->pi_over_2_2);
-  r = svmls_n_f64_x (pg, r, n, d->pi_over_2_3);
+  /* sin(r) poly approx.  */
+  svfloat64_t r2 = svmul_x (pg, r, r);
+  svfloat64_t r3 = svmul_x (pg, r2, r);
+  svfloat64_t r4 = svmul_x (pg, r2, r2);
 
-  /* Final multiplicative factor: 1.0 or x depending on bit #0 of q.  */
-  svfloat64_t f = svtssel_f64 (r, svreinterpret_u64_f64 (q));
+  svfloat64_t t1 = svmla_x (pg, C (4), C (5), r2);
+  svfloat64_t t2 = svmla_x (pg, C (2), C (3), r2);
+  svfloat64_t t3 = svmla_x (pg, C (0), C (1), r2);
 
-  /* sin(r) poly approx.  */
-  svfloat64_t r2 = svtsmul_f64 (r, svreinterpret_u64_f64 (q));
-  svfloat64_t y = sv_f64 (0.0);
-  y = svtmad_f64 (y, r2, 7);
-  y = svtmad_f64 (y, r2, 6);
-  y = svtmad_f64 (y, r2, 5);
-  y = svtmad_f64 (y, r2, 4);
-  y = svtmad_f64 (y, r2, 3);
-  y = svtmad_f64 (y, r2, 2);
-  y = svtmad_f64 (y, r2, 1);
-  y = svtmad_f64 (y, r2, 0);
-
-  /* Apply factor.  */
-  y = svmul_f64_x (pg, f, y);
+  svfloat64_t y = svmla_x (pg, t1, C (6), r4);
+  y = svmla_x (pg, t2, y, r4);
+  y = svmla_x (pg, t3, y, r4);
+  y = svmla_x (pg, r, y, r3);
 
   /* sign = y^sign.  */
-  y = svreinterpret_f64_u64 (
-      sveor_u64_x (pg, svreinterpret_u64_f64 (y), sign));
+  y = svreinterpret_f64 (sveor_z (pg, svreinterpret_u64 (y), odd));
 
+  svbool_t cmp = svacle (pg, x, d->range_val);
+  cmp = svnot_z (pg, cmp);
   if (__glibc_unlikely (svptest_any (pg, cmp)))
     return special_case (x, y, cmp);
   return y;
-- 
2.27.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] aarch64: Improve SVE sin polynomial
  2023-08-03 11:54 Joe Ramsay
@ 2023-08-03 12:20 ` Paul Zimmermann
  2023-10-03 10:48 ` Szabolcs Nagy
  1 sibling, 0 replies; 7+ messages in thread
From: Paul Zimmermann @ 2023-08-03 12:20 UTC (permalink / raw)
  To: Joe Ramsay; +Cc: libc-alpha, Joe.Ramsay

       Hi Joe,

you can still improve the polynomial with the following one generated by
Sollya (https://www.sollya.org/) with relative error < 2^-52.454 on
[-pi/2, pi/2], instead of 2^-51.765 for the one you propose:

$ cat sin.sollya
// polynomial proposed by Joe Ramsay
p3=-0x1.555555555547bp-3;
p5=0x1.1111111108a4dp-7;
p7=-0x1.a01a019936f27p-13;
p9=0x1.71de37a97d93ep-19;
p11=-0x1.ae633919987c6p-26;
p13=0x1.60e277ae07cecp-33;
p15=-0x1.9e9540300a1p-41;

p=x+p3*x^3+p5*x^5+p7*x^7+p9*x^9+p11*x^11+p13*x^13+p15*x^15;

pretty = proc(u) {
  return ~(floor(u*1000)/1000);
};

d = [0,pi/2];
f = 1;
w = 1/sin(x);
err_p = -log2(dirtyinfnorm(p*w-f, d));
display = hexadecimal;
print (p);
display = decimal;
print (pretty(err_p));

// polynomial generated by Sollya
n = [|1,3,5,7,9,11,13,15|];
p =  remez(f, n, d, w);
pf = fpminimax(sin(x), n, [|1,53...|], d, relative, floating, 0, p);
err_p = -log2(dirtyinfnorm(pf*w-f, d));
display = hexadecimal;
print (pf);
display = decimal;
print (pretty(err_p));

$ sollya sin.sollya
Display mode is hexadecimal numbers.
x + -0x1.555555555547bp-3 * x^0x1.8p1 + 0x1.1111111108a4dp-7 * x^0x1.4p2 + -0x1.a01a019936f27p-13 * x^0x1.cp2 + 0x1.71de37a97d93ep-19 * x^0x1.2p3 + -0x1.ae633919987c6p-26 * x^0x1.6p3 + 0x1.60e277ae07cecp-33 * x^0x1.ap3 + -0x1.9e9540300a1p-41 * x^0x1.ep3
Display mode is decimal numbers.
51.765
Display mode is hexadecimal numbers.
-0x1.9f0ef1c39804ap-41 * x^0x1.ep3 + 0x1.60e6871043ae8p-33 * x^0x1.ap3 + -0x1.ae635418ed56dp-26 * x^0x1.6p3 + 0x1.71de3801036c7p-19 * x^0x1.2p3 + -0x1.a01a019a50542p-13 * x^0x1.cp2 + 0x1.111111110a30bp-7 * x^0x1.4p2 + -0x1.55555555554a2p-3 * x^0x1.8p1 + x
Display mode is decimal numbers.
52.454

Paul

> CC: Joe Ramsay <Joe.Ramsay@arm.com>
> Date: Thu, 3 Aug 2023 12:54:53 +0100
> NoDisclaimer: true
> From: Joe Ramsay via Libc-alpha <libc-alpha@sourceware.org>
> 
> The polynomial used in the AdvSIMD variant was found to be more
> efficient than FTRIG intrinsics, with acceptable loss of precision.
> ---
> Thanks,
> Joe
>  sysdeps/aarch64/fpu/sin_sve.c | 99 ++++++++++++++++++-----------------
>  1 file changed, 50 insertions(+), 49 deletions(-)
> 
> diff --git a/sysdeps/aarch64/fpu/sin_sve.c b/sysdeps/aarch64/fpu/sin_sve.c
> index c3f450d0ea..4d93e761af 100644
> --- a/sysdeps/aarch64/fpu/sin_sve.c
> +++ b/sysdeps/aarch64/fpu/sin_sve.c
> @@ -21,20 +21,23 @@
>  
>  static const struct data
>  {
> -  double inv_pi, half_pi, inv_pi_over_2, pi_over_2_1, pi_over_2_2, pi_over_2_3,
> -      shift;
> +  double inv_pi, pi_1, pi_2, pi_3, shift, range_val;
> +  double poly[7];
>  } data = {
> -  /* Polynomial coefficients are hard-wired in the FTMAD instruction.  */
> +  /* Worst-case error is 2.8+0.5 ulp in [-pi/2, pi/2].  */
> +  .poly = { -0x1.555555555547bp-3, 0x1.1111111108a4dp-7, -0x1.a01a019936f27p-13,
> +            0x1.71de37a97d93ep-19, -0x1.ae633919987c6p-26,
> +            0x1.60e277ae07cecp-33, -0x1.9e9540300a1p-41, },
> +
>    .inv_pi = 0x1.45f306dc9c883p-2,
> -  .half_pi = 0x1.921fb54442d18p+0,
> -  .inv_pi_over_2 = 0x1.45f306dc9c882p-1,
> -  .pi_over_2_1 = 0x1.921fb50000000p+0,
> -  .pi_over_2_2 = 0x1.110b460000000p-26,
> -  .pi_over_2_3 = 0x1.1a62633145c07p-54,
> -  .shift = 0x1.8p52
> +  .pi_1 = 0x1.921fb54442d18p+1,
> +  .pi_2 = 0x1.1a62633145c06p-53,
> +  .pi_3 = 0x1.c1cd129024e09p-106,
> +  .shift = 0x1.8p52,
> +  .range_val = 0x1p23,
>  };
>  
> -#define RangeVal 0x4160000000000000 /* asuint64 (0x1p23).  */
> +#define C(i) sv_f64 (d->poly[i])
>  
>  static svfloat64_t NOINLINE
>  special_case (svfloat64_t x, svfloat64_t y, svbool_t cmp)
> @@ -42,55 +45,53 @@ special_case (svfloat64_t x, svfloat64_t y, svbool_t cmp)
>    return sv_call_f64 (sin, x, y, cmp);
>  }
>  
> -/* A fast SVE implementation of sin based on trigonometric
> -   instructions (FTMAD, FTSSEL, FTSMUL).
> -   Maximum observed error in 2.52 ULP:
> -   SV_NAME_D1 (sin)(0x1.2d2b00df69661p+19) got 0x1.10ace8f3e786bp-40
> -					  want 0x1.10ace8f3e7868p-40.  */
> +/* A fast SVE implementation of sin.
> +   Maximum observed error in 3.22 ULP:
> +   _ZGVsMxv_sin (0x1.d70eef40f39b1p+12) got -0x1.ffe9537d5dbb7p-3
> +				       want -0x1.ffe9537d5dbb4p-3.  */
>  svfloat64_t SV_NAME_D1 (sin) (svfloat64_t x, const svbool_t pg)
>  {
>    const struct data *d = ptr_barrier (&data);
>  
> -  svfloat64_t r = svabs_f64_x (pg, x);
> -  svuint64_t sign
> -      = sveor_u64_x (pg, svreinterpret_u64_f64 (x), svreinterpret_u64_f64 (r));
> -  svbool_t cmp = svcmpge_n_u64 (pg, svreinterpret_u64_f64 (r), RangeVal);
> -
> -  /* Load first two pio2-related constants to one vector.  */
> -  svfloat64_t invpio2_and_pio2_1
> -      = svld1rq_f64 (svptrue_b64 (), &d->inv_pi_over_2);
> -
> -  /* n = rint(|x|/(pi/2)).  */
> -  svfloat64_t q = svmla_lane_f64 (sv_f64 (d->shift), r, invpio2_and_pio2_1, 0);
> -  svfloat64_t n = svsub_n_f64_x (pg, q, d->shift);
> +  /* Load some values in quad-word chunks to minimise memory access.  */
> +  const svbool_t ptrue = svptrue_b64 ();
> +  svfloat64_t shift = sv_f64 (d->shift);
> +  svfloat64_t inv_pi_and_pi1 = svld1rq (ptrue, &d->inv_pi);
> +  svfloat64_t pi2_and_pi3 = svld1rq (ptrue, &d->pi_2);
> +
> +  /* n = rint(|x|/pi).  */
> +  svfloat64_t n = svmla_lane (shift, x, inv_pi_and_pi1, 0);
> +  svuint64_t odd = svlsl_x (pg, svreinterpret_u64 (n), 63);
> +  odd = sveor_m (
> +      svcmpeq (pg, svreinterpret_u64 (x), svreinterpret_u64 (sv_f64 (-0.0))),
> +      odd, 0x8000000000000000);
> +  n = svsub_x (pg, n, shift);
> +
> +  /* r = |x| - n*(pi/2)  (range reduction into -pi/2 .. pi/2).  */
> +  svfloat64_t r = x;
> +  r = svmls_lane (r, n, inv_pi_and_pi1, 1);
> +  r = svmls_lane (r, n, pi2_and_pi3, 0);
> +  r = svmls_lane (r, n, pi2_and_pi3, 1);
>  
> -  /* r = |x| - n*(pi/2)  (range reduction into -pi/4 .. pi/4).  */
> -  r = svmls_lane_f64 (r, n, invpio2_and_pio2_1, 1);
> -  r = svmls_n_f64_x (pg, r, n, d->pi_over_2_2);
> -  r = svmls_n_f64_x (pg, r, n, d->pi_over_2_3);
> +  /* sin(r) poly approx.  */
> +  svfloat64_t r2 = svmul_x (pg, r, r);
> +  svfloat64_t r3 = svmul_x (pg, r2, r);
> +  svfloat64_t r4 = svmul_x (pg, r2, r2);
>  
> -  /* Final multiplicative factor: 1.0 or x depending on bit #0 of q.  */
> -  svfloat64_t f = svtssel_f64 (r, svreinterpret_u64_f64 (q));
> +  svfloat64_t t1 = svmla_x (pg, C (4), C (5), r2);
> +  svfloat64_t t2 = svmla_x (pg, C (2), C (3), r2);
> +  svfloat64_t t3 = svmla_x (pg, C (0), C (1), r2);
>  
> -  /* sin(r) poly approx.  */
> -  svfloat64_t r2 = svtsmul_f64 (r, svreinterpret_u64_f64 (q));
> -  svfloat64_t y = sv_f64 (0.0);
> -  y = svtmad_f64 (y, r2, 7);
> -  y = svtmad_f64 (y, r2, 6);
> -  y = svtmad_f64 (y, r2, 5);
> -  y = svtmad_f64 (y, r2, 4);
> -  y = svtmad_f64 (y, r2, 3);
> -  y = svtmad_f64 (y, r2, 2);
> -  y = svtmad_f64 (y, r2, 1);
> -  y = svtmad_f64 (y, r2, 0);
> -
> -  /* Apply factor.  */
> -  y = svmul_f64_x (pg, f, y);
> +  svfloat64_t y = svmla_x (pg, t1, C (6), r4);
> +  y = svmla_x (pg, t2, y, r4);
> +  y = svmla_x (pg, t3, y, r4);
> +  y = svmla_x (pg, r, y, r3);
>  
>    /* sign = y^sign.  */
> -  y = svreinterpret_f64_u64 (
> -      sveor_u64_x (pg, svreinterpret_u64_f64 (y), sign));
> +  y = svreinterpret_f64 (sveor_z (pg, svreinterpret_u64 (y), odd));
>  
> +  svbool_t cmp = svacle (pg, x, d->range_val);
> +  cmp = svnot_z (pg, cmp);
>    if (__glibc_unlikely (svptest_any (pg, cmp)))
>      return special_case (x, y, cmp);
>    return y;
> -- 
> 2.27.0

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] aarch64: Improve SVE sin polynomial
  2023-08-03 11:54 Joe Ramsay
  2023-08-03 12:20 ` Paul Zimmermann
@ 2023-10-03 10:48 ` Szabolcs Nagy
  1 sibling, 0 replies; 7+ messages in thread
From: Szabolcs Nagy @ 2023-10-03 10:48 UTC (permalink / raw)
  To: Joe Ramsay, libc-alpha

The 08/03/2023 12:54, Joe Ramsay via Libc-alpha wrote:
> The polynomial used in the AdvSIMD variant was found to be more
> efficient than FTRIG intrinsics, with acceptable loss of precision.

i think we want this change, and if somebody finds a
better polynomial we can update both the sve and advsimd
implementations.

> -  /* Polynomial coefficients are hard-wired in the FTMAD instruction.  */
> +  /* Worst-case error is 2.8+0.5 ulp in [-pi/2, pi/2].  */

known worst-case is 2.87 ulp now

> +  svuint64_t odd = svlsl_x (pg, svreinterpret_u64 (n), 63);
> +  odd = sveor_m (
> +      svcmpeq (pg, svreinterpret_u64 (x), svreinterpret_u64 (sv_f64 (-0.0))),

we don't need to special case -0.0 any more.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-10-03 10:48 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-03 14:21 [PATCH] aarch64: Improve SVE sin polynomial Wilco Dijkstra
2023-08-03 15:22 ` Szabolcs Nagy
     [not found]   ` <p9u0msz79sga.fsf@coriandre.loria.fr>
2023-08-04  8:35     ` Szabolcs Nagy
2023-08-04 14:51     ` Vincent Lefevre
  -- strict thread matches above, loose matches on Subject: below --
2023-08-03 11:54 Joe Ramsay
2023-08-03 12:20 ` Paul Zimmermann
2023-10-03 10:48 ` Szabolcs Nagy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).