* Re: [PATCH] aarch64: Improve SVE sin polynomial
@ 2023-08-03 14:21 Wilco Dijkstra
2023-08-03 15:22 ` Szabolcs Nagy
0 siblings, 1 reply; 7+ messages in thread
From: Wilco Dijkstra @ 2023-08-03 14:21 UTC (permalink / raw)
To: paul zimmermann; +Cc: 'GNU C Library', Joe Ramsay, Szabolcs Nagy
Hi Paul,
> you can still improve the polynomial with the following one generated by
> Sollya (https://www.sollya.org/) with relative error < 2^-52.454 on
> [-pi/2, pi/2], instead of 2^-51.765 for the one you propose:
So the goal is to reduce the ULP error. Due to the way floating point works,
errors are at their worst when the result is just above a power of 2. For the
wide sin(x) polynomial it happens when sin(x) is close to but larger than 0.5.
The poly is extremely accurate for small x. It's only problematic when x is large
and we cross an exponent boundary.
Szabolcs can explain it better but what we ended up doing is to scale the
sollya function so that errors at the end of the range are and reduced more.
Note also this is not the final polynomial. The ULP is a bit high at 3.3. I had a
more accurate one close to 2ULP but it could give results >1.0. So we need to
teach sollya to estimate the end from below, and not allow it to overshoot.
Cheers,
Wilco
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] aarch64: Improve SVE sin polynomial 2023-08-03 14:21 [PATCH] aarch64: Improve SVE sin polynomial Wilco Dijkstra @ 2023-08-03 15:22 ` Szabolcs Nagy [not found] ` <p9u0msz79sga.fsf@coriandre.loria.fr> 0 siblings, 1 reply; 7+ messages in thread From: Szabolcs Nagy @ 2023-08-03 15:22 UTC (permalink / raw) To: Wilco Dijkstra, paul zimmermann; +Cc: 'GNU C Library', Joe Ramsay The 08/03/2023 15:21, Wilco Dijkstra wrote: > Hi Paul, > > > you can still improve the polynomial with the following one generated by > > Sollya (https://www.sollya.org/) with relative error < 2^-52.454 on > > [-pi/2, pi/2], instead of 2^-51.765 for the one you propose: > > So the goal is to reduce the ULP error. Due to the way floating point works, > errors are at their worst when the result is just above a power of 2. For the > wide sin(x) polynomial it happens when sin(x) is close to but larger than 0.5. > The poly is extremely accurate for small x. It's only problematic when x is large > and we cross an exponent boundary. > > Szabolcs can explain it better but what we ended up doing is to scale the > sollya function so that errors at the end of the range are and reduced more. i'd say we want to minimize the maximum of (poly(x) - sin(x))/ulp(sin(x)) instead of (poly(x) - sin(x))/sin(x) since ulp(x) is flat between 0.5 and 1 this can make a big difference when designing a sin poly for -pi/2,pi/2. e.g. you want to look at the abs error on [pi/6,pi/2], not relative error when comparing the polynomials. this is if you only consider approximation errors, but there is an independent effect of large rounding errors where operations in x + c*x*x*x cross powof2 boundaries which is also near x = pi/2. so for weighting func i used a polynomial that roughly approximates ulp(sin(x)) but forces smaller approx error where i expect big rounding error (this part was a bit of black magic but i think the approx error part is sound and covers most of the quality gain). > > Note also this is not the final polynomial. The ULP is a bit high at 3.3. I had a > more accurate one close to 2ULP but it could give results >1.0. So we need to > teach sollya to estimate the end from below, and not allow it to overshoot. we originally planned to use a poly with one more term, but that made errors go in the wrong direction at x=pi/2, and it could return 1.0 + 0x1p-52 which is not good for sin(x). i could not fix this with weighting, but reducing the degree the error goes into the other direction so even though the error is bigger then it's probably better in practice. hope this helps. ^ permalink raw reply [flat|nested] 7+ messages in thread
[parent not found: <p9u0msz79sga.fsf@coriandre.loria.fr>]
* Re: [PATCH] aarch64: Improve SVE sin polynomial [not found] ` <p9u0msz79sga.fsf@coriandre.loria.fr> @ 2023-08-04 8:35 ` Szabolcs Nagy 2023-08-04 14:51 ` Vincent Lefevre 1 sibling, 0 replies; 7+ messages in thread From: Szabolcs Nagy @ 2023-08-04 8:35 UTC (permalink / raw) To: Paul Zimmermann; +Cc: Wilco.Dijkstra, libc-alpha, Joe.Ramsay The 08/04/2023 09:28, Paul Zimmermann wrote: > Hi Szabolcs, > > > i'd say we want to minimize the maximum of > > > > (poly(x) - sin(x))/ulp(sin(x)) > > > > instead of > > > > (poly(x) - sin(x))/sin(x) > > interesting question! I have asked the Sollya developers if this is possible. > > Anyway, unless I'm wrong I find the largest ulp error with the polynomial > from the patch is about 1.72 at x=0.252552902718105, and for the polynomial > I obtained with Sollya it is about 1.43 at the same value of x. > > See attached graph (blue = polynomial from the patch, red = polynomial > I obtained with Sollya), which for 0 <= x <= pi/2 prints the absolute > value of the ulp error as you define above. neither the red nor the blue line has flat maximums, so they are not optimal for |poly(x)-sin(x)|/ulp(sin(x)). but that only considers approx error, not rounding. as you can see the blue error goes down near pi/2, while it's high around small values like 0.25. this is deliberate as at small values there is tiny rounding error, while in [1,pi/2] there is cancellation with huge errors (the order of x + c*x*x*x is smaller than x so the terms are big an cancel each other after relatively big rounding error). so the actual worst case error after rounding is much bigger than what sollya reports (approx error) and it is in [1,pi/2]. e.g. _ZGVnN2v_sin(0x1.7a83373b951abp+0) got 0x1.fdd2e6d528af4p-1 want 0x1.fdd2e6d528af7p-1 - 0.234485 ulp error: -2.7655 ulp which shows that there is a point where rounding error takes over and becomes dominant over approx error. the weighting function can take this into account by roughly following ulp(sin(x)) for a while and then in [1,pi/2] making it smaller (this is the weighting i used, but it was put together manually, not computed precisely: ideally we would compute rounding_error_bound(x) function based on the poly eval and incorporate that into the weighting. for most functions i don't expect this to help, but it matters for sin(x).) i would expect simply minimizing the approx error (using the ulp(sin(x)) weighting) would be better than your polynomial (sin(x) weighting) after rounding but using our polynomial (manually adjusted weighting) is even better (i verified it experimentally that measured error near the approx error peaks are roughly the same so the error is evenly spread out, but i'm sure slightly better polynomial is possible to find with a bit more research into the optimal design). ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] aarch64: Improve SVE sin polynomial [not found] ` <p9u0msz79sga.fsf@coriandre.loria.fr> 2023-08-04 8:35 ` Szabolcs Nagy @ 2023-08-04 14:51 ` Vincent Lefevre 1 sibling, 0 replies; 7+ messages in thread From: Vincent Lefevre @ 2023-08-04 14:51 UTC (permalink / raw) To: Paul Zimmermann; +Cc: Szabolcs Nagy, Wilco.Dijkstra, libc-alpha, Joe.Ramsay On 2023-08-04 09:28:53 +0200, Paul Zimmermann via Libc-alpha wrote: > Hi Szabolcs, > > > i'd say we want to minimize the maximum of > > > > (poly(x) - sin(x))/ulp(sin(x)) > > > > instead of > > > > (poly(x) - sin(x))/sin(x) > > interesting question! I have asked the Sollya developers if this is > possible. I would say that a large change of the weight function (e.g. a discontinuity point) is likely to generate an extrema for the error function, which could affect the quality of the minimax approximation. That's from the theory. In practice, it is still possible that this works by chance, but I suppose that this depends on the considered function and domain. Concerning Sollya, doesn't it uses the derivative of the error function (making the direct use of ulp impossible)? -- Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/> 100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon) ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH] aarch64: Improve SVE sin polynomial @ 2023-08-03 11:54 Joe Ramsay 2023-08-03 12:20 ` Paul Zimmermann 2023-10-03 10:48 ` Szabolcs Nagy 0 siblings, 2 replies; 7+ messages in thread From: Joe Ramsay @ 2023-08-03 11:54 UTC (permalink / raw) To: libc-alpha; +Cc: Joe Ramsay The polynomial used in the AdvSIMD variant was found to be more efficient than FTRIG intrinsics, with acceptable loss of precision. --- Thanks, Joe sysdeps/aarch64/fpu/sin_sve.c | 99 ++++++++++++++++++----------------- 1 file changed, 50 insertions(+), 49 deletions(-) diff --git a/sysdeps/aarch64/fpu/sin_sve.c b/sysdeps/aarch64/fpu/sin_sve.c index c3f450d0ea..4d93e761af 100644 --- a/sysdeps/aarch64/fpu/sin_sve.c +++ b/sysdeps/aarch64/fpu/sin_sve.c @@ -21,20 +21,23 @@ static const struct data { - double inv_pi, half_pi, inv_pi_over_2, pi_over_2_1, pi_over_2_2, pi_over_2_3, - shift; + double inv_pi, pi_1, pi_2, pi_3, shift, range_val; + double poly[7]; } data = { - /* Polynomial coefficients are hard-wired in the FTMAD instruction. */ + /* Worst-case error is 2.8+0.5 ulp in [-pi/2, pi/2]. */ + .poly = { -0x1.555555555547bp-3, 0x1.1111111108a4dp-7, -0x1.a01a019936f27p-13, + 0x1.71de37a97d93ep-19, -0x1.ae633919987c6p-26, + 0x1.60e277ae07cecp-33, -0x1.9e9540300a1p-41, }, + .inv_pi = 0x1.45f306dc9c883p-2, - .half_pi = 0x1.921fb54442d18p+0, - .inv_pi_over_2 = 0x1.45f306dc9c882p-1, - .pi_over_2_1 = 0x1.921fb50000000p+0, - .pi_over_2_2 = 0x1.110b460000000p-26, - .pi_over_2_3 = 0x1.1a62633145c07p-54, - .shift = 0x1.8p52 + .pi_1 = 0x1.921fb54442d18p+1, + .pi_2 = 0x1.1a62633145c06p-53, + .pi_3 = 0x1.c1cd129024e09p-106, + .shift = 0x1.8p52, + .range_val = 0x1p23, }; -#define RangeVal 0x4160000000000000 /* asuint64 (0x1p23). */ +#define C(i) sv_f64 (d->poly[i]) static svfloat64_t NOINLINE special_case (svfloat64_t x, svfloat64_t y, svbool_t cmp) @@ -42,55 +45,53 @@ special_case (svfloat64_t x, svfloat64_t y, svbool_t cmp) return sv_call_f64 (sin, x, y, cmp); } -/* A fast SVE implementation of sin based on trigonometric - instructions (FTMAD, FTSSEL, FTSMUL). - Maximum observed error in 2.52 ULP: - SV_NAME_D1 (sin)(0x1.2d2b00df69661p+19) got 0x1.10ace8f3e786bp-40 - want 0x1.10ace8f3e7868p-40. */ +/* A fast SVE implementation of sin. + Maximum observed error in 3.22 ULP: + _ZGVsMxv_sin (0x1.d70eef40f39b1p+12) got -0x1.ffe9537d5dbb7p-3 + want -0x1.ffe9537d5dbb4p-3. */ svfloat64_t SV_NAME_D1 (sin) (svfloat64_t x, const svbool_t pg) { const struct data *d = ptr_barrier (&data); - svfloat64_t r = svabs_f64_x (pg, x); - svuint64_t sign - = sveor_u64_x (pg, svreinterpret_u64_f64 (x), svreinterpret_u64_f64 (r)); - svbool_t cmp = svcmpge_n_u64 (pg, svreinterpret_u64_f64 (r), RangeVal); - - /* Load first two pio2-related constants to one vector. */ - svfloat64_t invpio2_and_pio2_1 - = svld1rq_f64 (svptrue_b64 (), &d->inv_pi_over_2); - - /* n = rint(|x|/(pi/2)). */ - svfloat64_t q = svmla_lane_f64 (sv_f64 (d->shift), r, invpio2_and_pio2_1, 0); - svfloat64_t n = svsub_n_f64_x (pg, q, d->shift); + /* Load some values in quad-word chunks to minimise memory access. */ + const svbool_t ptrue = svptrue_b64 (); + svfloat64_t shift = sv_f64 (d->shift); + svfloat64_t inv_pi_and_pi1 = svld1rq (ptrue, &d->inv_pi); + svfloat64_t pi2_and_pi3 = svld1rq (ptrue, &d->pi_2); + + /* n = rint(|x|/pi). */ + svfloat64_t n = svmla_lane (shift, x, inv_pi_and_pi1, 0); + svuint64_t odd = svlsl_x (pg, svreinterpret_u64 (n), 63); + odd = sveor_m ( + svcmpeq (pg, svreinterpret_u64 (x), svreinterpret_u64 (sv_f64 (-0.0))), + odd, 0x8000000000000000); + n = svsub_x (pg, n, shift); + + /* r = |x| - n*(pi/2) (range reduction into -pi/2 .. pi/2). */ + svfloat64_t r = x; + r = svmls_lane (r, n, inv_pi_and_pi1, 1); + r = svmls_lane (r, n, pi2_and_pi3, 0); + r = svmls_lane (r, n, pi2_and_pi3, 1); - /* r = |x| - n*(pi/2) (range reduction into -pi/4 .. pi/4). */ - r = svmls_lane_f64 (r, n, invpio2_and_pio2_1, 1); - r = svmls_n_f64_x (pg, r, n, d->pi_over_2_2); - r = svmls_n_f64_x (pg, r, n, d->pi_over_2_3); + /* sin(r) poly approx. */ + svfloat64_t r2 = svmul_x (pg, r, r); + svfloat64_t r3 = svmul_x (pg, r2, r); + svfloat64_t r4 = svmul_x (pg, r2, r2); - /* Final multiplicative factor: 1.0 or x depending on bit #0 of q. */ - svfloat64_t f = svtssel_f64 (r, svreinterpret_u64_f64 (q)); + svfloat64_t t1 = svmla_x (pg, C (4), C (5), r2); + svfloat64_t t2 = svmla_x (pg, C (2), C (3), r2); + svfloat64_t t3 = svmla_x (pg, C (0), C (1), r2); - /* sin(r) poly approx. */ - svfloat64_t r2 = svtsmul_f64 (r, svreinterpret_u64_f64 (q)); - svfloat64_t y = sv_f64 (0.0); - y = svtmad_f64 (y, r2, 7); - y = svtmad_f64 (y, r2, 6); - y = svtmad_f64 (y, r2, 5); - y = svtmad_f64 (y, r2, 4); - y = svtmad_f64 (y, r2, 3); - y = svtmad_f64 (y, r2, 2); - y = svtmad_f64 (y, r2, 1); - y = svtmad_f64 (y, r2, 0); - - /* Apply factor. */ - y = svmul_f64_x (pg, f, y); + svfloat64_t y = svmla_x (pg, t1, C (6), r4); + y = svmla_x (pg, t2, y, r4); + y = svmla_x (pg, t3, y, r4); + y = svmla_x (pg, r, y, r3); /* sign = y^sign. */ - y = svreinterpret_f64_u64 ( - sveor_u64_x (pg, svreinterpret_u64_f64 (y), sign)); + y = svreinterpret_f64 (sveor_z (pg, svreinterpret_u64 (y), odd)); + svbool_t cmp = svacle (pg, x, d->range_val); + cmp = svnot_z (pg, cmp); if (__glibc_unlikely (svptest_any (pg, cmp))) return special_case (x, y, cmp); return y; -- 2.27.0 ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] aarch64: Improve SVE sin polynomial 2023-08-03 11:54 Joe Ramsay @ 2023-08-03 12:20 ` Paul Zimmermann 2023-10-03 10:48 ` Szabolcs Nagy 1 sibling, 0 replies; 7+ messages in thread From: Paul Zimmermann @ 2023-08-03 12:20 UTC (permalink / raw) To: Joe Ramsay; +Cc: libc-alpha, Joe.Ramsay Hi Joe, you can still improve the polynomial with the following one generated by Sollya (https://www.sollya.org/) with relative error < 2^-52.454 on [-pi/2, pi/2], instead of 2^-51.765 for the one you propose: $ cat sin.sollya // polynomial proposed by Joe Ramsay p3=-0x1.555555555547bp-3; p5=0x1.1111111108a4dp-7; p7=-0x1.a01a019936f27p-13; p9=0x1.71de37a97d93ep-19; p11=-0x1.ae633919987c6p-26; p13=0x1.60e277ae07cecp-33; p15=-0x1.9e9540300a1p-41; p=x+p3*x^3+p5*x^5+p7*x^7+p9*x^9+p11*x^11+p13*x^13+p15*x^15; pretty = proc(u) { return ~(floor(u*1000)/1000); }; d = [0,pi/2]; f = 1; w = 1/sin(x); err_p = -log2(dirtyinfnorm(p*w-f, d)); display = hexadecimal; print (p); display = decimal; print (pretty(err_p)); // polynomial generated by Sollya n = [|1,3,5,7,9,11,13,15|]; p = remez(f, n, d, w); pf = fpminimax(sin(x), n, [|1,53...|], d, relative, floating, 0, p); err_p = -log2(dirtyinfnorm(pf*w-f, d)); display = hexadecimal; print (pf); display = decimal; print (pretty(err_p)); $ sollya sin.sollya Display mode is hexadecimal numbers. x + -0x1.555555555547bp-3 * x^0x1.8p1 + 0x1.1111111108a4dp-7 * x^0x1.4p2 + -0x1.a01a019936f27p-13 * x^0x1.cp2 + 0x1.71de37a97d93ep-19 * x^0x1.2p3 + -0x1.ae633919987c6p-26 * x^0x1.6p3 + 0x1.60e277ae07cecp-33 * x^0x1.ap3 + -0x1.9e9540300a1p-41 * x^0x1.ep3 Display mode is decimal numbers. 51.765 Display mode is hexadecimal numbers. -0x1.9f0ef1c39804ap-41 * x^0x1.ep3 + 0x1.60e6871043ae8p-33 * x^0x1.ap3 + -0x1.ae635418ed56dp-26 * x^0x1.6p3 + 0x1.71de3801036c7p-19 * x^0x1.2p3 + -0x1.a01a019a50542p-13 * x^0x1.cp2 + 0x1.111111110a30bp-7 * x^0x1.4p2 + -0x1.55555555554a2p-3 * x^0x1.8p1 + x Display mode is decimal numbers. 52.454 Paul > CC: Joe Ramsay <Joe.Ramsay@arm.com> > Date: Thu, 3 Aug 2023 12:54:53 +0100 > NoDisclaimer: true > From: Joe Ramsay via Libc-alpha <libc-alpha@sourceware.org> > > The polynomial used in the AdvSIMD variant was found to be more > efficient than FTRIG intrinsics, with acceptable loss of precision. > --- > Thanks, > Joe > sysdeps/aarch64/fpu/sin_sve.c | 99 ++++++++++++++++++----------------- > 1 file changed, 50 insertions(+), 49 deletions(-) > > diff --git a/sysdeps/aarch64/fpu/sin_sve.c b/sysdeps/aarch64/fpu/sin_sve.c > index c3f450d0ea..4d93e761af 100644 > --- a/sysdeps/aarch64/fpu/sin_sve.c > +++ b/sysdeps/aarch64/fpu/sin_sve.c > @@ -21,20 +21,23 @@ > > static const struct data > { > - double inv_pi, half_pi, inv_pi_over_2, pi_over_2_1, pi_over_2_2, pi_over_2_3, > - shift; > + double inv_pi, pi_1, pi_2, pi_3, shift, range_val; > + double poly[7]; > } data = { > - /* Polynomial coefficients are hard-wired in the FTMAD instruction. */ > + /* Worst-case error is 2.8+0.5 ulp in [-pi/2, pi/2]. */ > + .poly = { -0x1.555555555547bp-3, 0x1.1111111108a4dp-7, -0x1.a01a019936f27p-13, > + 0x1.71de37a97d93ep-19, -0x1.ae633919987c6p-26, > + 0x1.60e277ae07cecp-33, -0x1.9e9540300a1p-41, }, > + > .inv_pi = 0x1.45f306dc9c883p-2, > - .half_pi = 0x1.921fb54442d18p+0, > - .inv_pi_over_2 = 0x1.45f306dc9c882p-1, > - .pi_over_2_1 = 0x1.921fb50000000p+0, > - .pi_over_2_2 = 0x1.110b460000000p-26, > - .pi_over_2_3 = 0x1.1a62633145c07p-54, > - .shift = 0x1.8p52 > + .pi_1 = 0x1.921fb54442d18p+1, > + .pi_2 = 0x1.1a62633145c06p-53, > + .pi_3 = 0x1.c1cd129024e09p-106, > + .shift = 0x1.8p52, > + .range_val = 0x1p23, > }; > > -#define RangeVal 0x4160000000000000 /* asuint64 (0x1p23). */ > +#define C(i) sv_f64 (d->poly[i]) > > static svfloat64_t NOINLINE > special_case (svfloat64_t x, svfloat64_t y, svbool_t cmp) > @@ -42,55 +45,53 @@ special_case (svfloat64_t x, svfloat64_t y, svbool_t cmp) > return sv_call_f64 (sin, x, y, cmp); > } > > -/* A fast SVE implementation of sin based on trigonometric > - instructions (FTMAD, FTSSEL, FTSMUL). > - Maximum observed error in 2.52 ULP: > - SV_NAME_D1 (sin)(0x1.2d2b00df69661p+19) got 0x1.10ace8f3e786bp-40 > - want 0x1.10ace8f3e7868p-40. */ > +/* A fast SVE implementation of sin. > + Maximum observed error in 3.22 ULP: > + _ZGVsMxv_sin (0x1.d70eef40f39b1p+12) got -0x1.ffe9537d5dbb7p-3 > + want -0x1.ffe9537d5dbb4p-3. */ > svfloat64_t SV_NAME_D1 (sin) (svfloat64_t x, const svbool_t pg) > { > const struct data *d = ptr_barrier (&data); > > - svfloat64_t r = svabs_f64_x (pg, x); > - svuint64_t sign > - = sveor_u64_x (pg, svreinterpret_u64_f64 (x), svreinterpret_u64_f64 (r)); > - svbool_t cmp = svcmpge_n_u64 (pg, svreinterpret_u64_f64 (r), RangeVal); > - > - /* Load first two pio2-related constants to one vector. */ > - svfloat64_t invpio2_and_pio2_1 > - = svld1rq_f64 (svptrue_b64 (), &d->inv_pi_over_2); > - > - /* n = rint(|x|/(pi/2)). */ > - svfloat64_t q = svmla_lane_f64 (sv_f64 (d->shift), r, invpio2_and_pio2_1, 0); > - svfloat64_t n = svsub_n_f64_x (pg, q, d->shift); > + /* Load some values in quad-word chunks to minimise memory access. */ > + const svbool_t ptrue = svptrue_b64 (); > + svfloat64_t shift = sv_f64 (d->shift); > + svfloat64_t inv_pi_and_pi1 = svld1rq (ptrue, &d->inv_pi); > + svfloat64_t pi2_and_pi3 = svld1rq (ptrue, &d->pi_2); > + > + /* n = rint(|x|/pi). */ > + svfloat64_t n = svmla_lane (shift, x, inv_pi_and_pi1, 0); > + svuint64_t odd = svlsl_x (pg, svreinterpret_u64 (n), 63); > + odd = sveor_m ( > + svcmpeq (pg, svreinterpret_u64 (x), svreinterpret_u64 (sv_f64 (-0.0))), > + odd, 0x8000000000000000); > + n = svsub_x (pg, n, shift); > + > + /* r = |x| - n*(pi/2) (range reduction into -pi/2 .. pi/2). */ > + svfloat64_t r = x; > + r = svmls_lane (r, n, inv_pi_and_pi1, 1); > + r = svmls_lane (r, n, pi2_and_pi3, 0); > + r = svmls_lane (r, n, pi2_and_pi3, 1); > > - /* r = |x| - n*(pi/2) (range reduction into -pi/4 .. pi/4). */ > - r = svmls_lane_f64 (r, n, invpio2_and_pio2_1, 1); > - r = svmls_n_f64_x (pg, r, n, d->pi_over_2_2); > - r = svmls_n_f64_x (pg, r, n, d->pi_over_2_3); > + /* sin(r) poly approx. */ > + svfloat64_t r2 = svmul_x (pg, r, r); > + svfloat64_t r3 = svmul_x (pg, r2, r); > + svfloat64_t r4 = svmul_x (pg, r2, r2); > > - /* Final multiplicative factor: 1.0 or x depending on bit #0 of q. */ > - svfloat64_t f = svtssel_f64 (r, svreinterpret_u64_f64 (q)); > + svfloat64_t t1 = svmla_x (pg, C (4), C (5), r2); > + svfloat64_t t2 = svmla_x (pg, C (2), C (3), r2); > + svfloat64_t t3 = svmla_x (pg, C (0), C (1), r2); > > - /* sin(r) poly approx. */ > - svfloat64_t r2 = svtsmul_f64 (r, svreinterpret_u64_f64 (q)); > - svfloat64_t y = sv_f64 (0.0); > - y = svtmad_f64 (y, r2, 7); > - y = svtmad_f64 (y, r2, 6); > - y = svtmad_f64 (y, r2, 5); > - y = svtmad_f64 (y, r2, 4); > - y = svtmad_f64 (y, r2, 3); > - y = svtmad_f64 (y, r2, 2); > - y = svtmad_f64 (y, r2, 1); > - y = svtmad_f64 (y, r2, 0); > - > - /* Apply factor. */ > - y = svmul_f64_x (pg, f, y); > + svfloat64_t y = svmla_x (pg, t1, C (6), r4); > + y = svmla_x (pg, t2, y, r4); > + y = svmla_x (pg, t3, y, r4); > + y = svmla_x (pg, r, y, r3); > > /* sign = y^sign. */ > - y = svreinterpret_f64_u64 ( > - sveor_u64_x (pg, svreinterpret_u64_f64 (y), sign)); > + y = svreinterpret_f64 (sveor_z (pg, svreinterpret_u64 (y), odd)); > > + svbool_t cmp = svacle (pg, x, d->range_val); > + cmp = svnot_z (pg, cmp); > if (__glibc_unlikely (svptest_any (pg, cmp))) > return special_case (x, y, cmp); > return y; > -- > 2.27.0 ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] aarch64: Improve SVE sin polynomial 2023-08-03 11:54 Joe Ramsay 2023-08-03 12:20 ` Paul Zimmermann @ 2023-10-03 10:48 ` Szabolcs Nagy 1 sibling, 0 replies; 7+ messages in thread From: Szabolcs Nagy @ 2023-10-03 10:48 UTC (permalink / raw) To: Joe Ramsay, libc-alpha The 08/03/2023 12:54, Joe Ramsay via Libc-alpha wrote: > The polynomial used in the AdvSIMD variant was found to be more > efficient than FTRIG intrinsics, with acceptable loss of precision. i think we want this change, and if somebody finds a better polynomial we can update both the sve and advsimd implementations. > - /* Polynomial coefficients are hard-wired in the FTMAD instruction. */ > + /* Worst-case error is 2.8+0.5 ulp in [-pi/2, pi/2]. */ known worst-case is 2.87 ulp now > + svuint64_t odd = svlsl_x (pg, svreinterpret_u64 (n), 63); > + odd = sveor_m ( > + svcmpeq (pg, svreinterpret_u64 (x), svreinterpret_u64 (sv_f64 (-0.0))), we don't need to special case -0.0 any more. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-10-03 10:48 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-08-03 14:21 [PATCH] aarch64: Improve SVE sin polynomial Wilco Dijkstra 2023-08-03 15:22 ` Szabolcs Nagy [not found] ` <p9u0msz79sga.fsf@coriandre.loria.fr> 2023-08-04 8:35 ` Szabolcs Nagy 2023-08-04 14:51 ` Vincent Lefevre -- strict thread matches above, loose matches on Subject: below -- 2023-08-03 11:54 Joe Ramsay 2023-08-03 12:20 ` Paul Zimmermann 2023-10-03 10:48 ` Szabolcs Nagy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).