* RE: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
@ 2017-06-13 10:57 Wilco Dijkstra
2017-06-13 11:06 ` Siddhesh Poyarekar
0 siblings, 1 reply; 36+ messages in thread
From: Wilco Dijkstra @ 2017-06-13 10:57 UTC (permalink / raw)
To: Ashwin.Sekhar, libc-alpha; +Cc: nd, Siddhesh Poyarekar
Ashwin wrote:
> Please find the microbenchmark code at
> https://github.com/ashwinyes/glibc_microbenchmarks/blob/master/sinf/sinf_benchmark.c
That is fine for benchmarking the individual code paths of a specific implementation.
However a good benchmark would run through a representative subset of calls from
actual code rather than repeating the same input many times. This avoids focusing too
much on special cases that never occur in the real world or failing to take the cost of
branch mispredictions into account due to varying inputs.
Wilco
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 10:57 [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf Wilco Dijkstra
@ 2017-06-13 11:06 ` Siddhesh Poyarekar
0 siblings, 0 replies; 36+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-13 11:06 UTC (permalink / raw)
To: Wilco Dijkstra, Ashwin.Sekhar, libc-alpha; +Cc: nd
On Tuesday 13 June 2017 04:27 PM, Wilco Dijkstra wrote:
> That is fine for benchmarking the individual code paths of a specific implementation.
>
> However a good benchmark would run through a representative subset of calls from
> actual code rather than repeating the same input many times. This avoids focusing too
> much on special cases that never occur in the real world or failing to take the cost of
> branch mispredictions into account due to varying inputs.
The problem with libm benchmarks is that real world workloads are not
easy to find. For sincos, tonto from SPEC2006 may provide some input,
but otherwise there isn't much to talk about. Other more popular
benchmarks (like UnixBench) are again synthetic.
All of the math microbenchmarks in glibc are currently synthetic and aim
to test all code paths. We had once talked about a system benchmarking
project (in 2014 or 2015 I think) that could come up with more relevant
inputs for benchtests but not much has happened since then.
Siddhesh
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-23 10:49 ` Sekhar, Ashwin
@ 2017-06-23 10:52 ` Szabolcs Nagy
0 siblings, 0 replies; 36+ messages in thread
From: Szabolcs Nagy @ 2017-06-23 10:52 UTC (permalink / raw)
To: Sekhar, Ashwin, libc-alpha; +Cc: nd
On 23/06/17 11:49, Sekhar, Ashwin wrote:
> On Tue, 2017-06-13 at 12:07 +0100, Szabolcs Nagy wrote:
>>
>> - document the worst case ulp error and number of misrounded
>> cases: for single argument scalar functions you can easily test
>> all possible inputs in all rounding modes and that information
>> helps to decide if the algorithm is good enough.
>>
> I have a question on this. In order to calculate the ulp error for all
> possible inputs, I need to compare my implementation against another
> standard implementation.
>
> Is it enough that I compare against the existing sinf implementation in
> glibc or is there any other standard implementation which I can use to
> compare against?
>
no, don't compare it against sinf but against sin.
double precision has enough extra precision to be
a useful oracle for correct results.
(and you may want to compute ulp error in double
precision so you get more precision than just an
integer, but that can be a bit involved because
of special cases around overflow, nan, inf,..)
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 11:07 ` Szabolcs Nagy
2017-06-13 11:55 ` Siddhesh Poyarekar
2017-06-13 12:56 ` Sekhar, Ashwin
@ 2017-06-23 10:49 ` Sekhar, Ashwin
2017-06-23 10:52 ` Szabolcs Nagy
2 siblings, 1 reply; 36+ messages in thread
From: Sekhar, Ashwin @ 2017-06-23 10:49 UTC (permalink / raw)
To: libc-alpha, szabolcs.nagy; +Cc: nd
On Tue, 2017-06-13 at 12:07 +0100, Szabolcs Nagy wrote:
>
> - document the worst case ulp error and number of misrounded
> cases: for single argument scalar functions you can easily test
> all possible inputs in all rounding modes and that information
> helps to decide if the algorithm is good enough.
>
I have a question on this. In order to calculate the ulp error for all
possible inputs, I need to compare my implementation against another
standard implementation.
Is it enough that I compare against the existing sinf implementation in
glibc or is there any other standard implementation which I can use to
compare against?
Ashwin
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-23 9:50 ` Sekhar, Ashwin
@ 2017-06-23 10:48 ` Siddhesh Poyarekar
0 siblings, 0 replies; 36+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-23 10:48 UTC (permalink / raw)
To: Sekhar, Ashwin, libc-alpha, szabolcs.nagy; +Cc: nd
On Friday 23 June 2017 03:20 PM, Sekhar, Ashwin wrote:
> Were the weights for different input ranges based on some benchmark?
No, they were not based on any benchmarks, the lower weighted ones were
basically all of the fast return paths. All other branches (i.e. the
different slow paths) are equally weighted.
> And do the chosen values within an input range have some significance,
> or were they random values within that range?
Random values, generated using systemtap probes within parts of code IIRC.
> For sinf/cosf, I am planning to use the same weights as sin/cos and
> random values within different input ranges. That's why the question.
Yeah the guideline should probably be the same, i.e. don't put much
weight on the faster paths (i.e. the ones that return 0 or NaN or
similar) for the general set of inputs. In addition to that, if the
functions figure in a known benchmark (like SPEC2006) then add those in
a separate ##name section like Wilco did for powf; going forward those
will get higher weight when comparing performance over the general
inputs since they will be more relevant.
Siddhesh
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-23 9:43 ` Siddhesh Poyarekar
@ 2017-06-23 9:50 ` Sekhar, Ashwin
2017-06-23 10:48 ` Siddhesh Poyarekar
0 siblings, 1 reply; 36+ messages in thread
From: Sekhar, Ashwin @ 2017-06-23 9:50 UTC (permalink / raw)
To: libc-alpha, Sekhar, Ashwin, siddhesh, szabolcs.nagy; +Cc: nd
On Fri, 2017-06-23 at 15:13 +0530, Siddhesh Poyarekar wrote:
> On Friday 23 June 2017 02:26 PM, Sekhar, Ashwin wrote:
> >
> > I am going forward with the C implementation of the sinf/cosf.
> > Along with it I am also working on the above changes you suggested
> > for the benchtest framework. Also I am trying to add sinf and cosf
> > inputs in the benchtest.
> >
> > I need your inputs on how you arrived at the inputs for the sin and
> > cos benchmarks (benchtests/cos-inputs and benchtests/sin-inputs).
> > Neither the commit message nor the mail archive provides much
> > information on this.
> Those inputs basically just exercise all branches, with the obvious
> fast inputs (0, NaN, etc.) getting less weight since we shouldn't
> really care about optimizing them.
>
> Siddhesh
Were the weights for different input ranges based on some benchmark?
And do the chosen values within an input range have some significance,
or were they random values within that range?
For sinf/cosf, I am planning to use the same weights as sin/cos and
random values within different input ranges. That's why the question.
Thanks
Ashwin
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-23 8:56 ` Sekhar, Ashwin
@ 2017-06-23 9:43 ` Siddhesh Poyarekar
2017-06-23 9:50 ` Sekhar, Ashwin
0 siblings, 1 reply; 36+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-23 9:43 UTC (permalink / raw)
To: Sekhar, Ashwin, libc-alpha, szabolcs.nagy; +Cc: nd
On Friday 23 June 2017 02:26 PM, Sekhar, Ashwin wrote:
> I am going forward with the C implementation of the sinf/cosf. Along
> with it I am also working on the above changes you suggested for
> benchtest framework. Also I am trying to add sinf and cosf inputs in
> the benchtest.
>
> I need your inputs on how you arrived at the inputs for the sin and cos
> benchmarks (benchtests/cos-inputs and benchtests/sin-inputs). Neither
> the commit message nor the mail archive provides much information on
> this.
Those inputs basically just exercise all branches, with the obvious fast
inputs (0, NaN, etc.) getting less weight since we shouldn't really care
about optimizing them.
Siddhesh
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 14:19 ` Siddhesh Poyarekar
2017-06-13 16:53 ` Adhemerval Zanella
@ 2017-06-23 8:56 ` Sekhar, Ashwin
2017-06-23 9:43 ` Siddhesh Poyarekar
1 sibling, 1 reply; 36+ messages in thread
From: Sekhar, Ashwin @ 2017-06-23 8:56 UTC (permalink / raw)
To: libc-alpha, siddhesh, szabolcs.nagy; +Cc: nd
On Tue, 2017-06-13 at 19:49 +0530, Siddhesh Poyarekar wrote:
> Currently the microbenchmark framework tests the same input in a loop
> a specific number of times to get a large enough number that a single
> iteration gives a stable mean and then tests inputs in a loop - I
> agree that this is cheating a bit since it eliminates cache effects as
> well as branches. It will need a pretty straightforward fix to run
> only once for a single input and it will do what you want, i.e.
> measure the effect of branches.
>
> Maybe Ashwin could patch the framework as well when he posts his
> patch.
Hi Siddhesh,
I am going forward with the C implementation of the sinf/cosf. Along
with it I am also working on the above changes you suggested for
benchtest framework. Also I am trying to add sinf and cosf inputs in
the benchtest.
I need your inputs on how you arrived at the inputs for the sin and cos
benchmarks (benchtests/cos-inputs and benchtests/sin-inputs). Neither
the commit message nor the mail archive provides much information on
this.
Ashwin
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-19 11:46 ` Wilco Dijkstra
2017-06-19 14:08 ` Ramana Radhakrishnan
@ 2017-06-19 15:26 ` Joseph Myers
1 sibling, 0 replies; 36+ messages in thread
From: Joseph Myers @ 2017-06-19 15:26 UTC (permalink / raw)
To: Wilco Dijkstra
Cc: Sekhar, Ashwin, siddhesh, adhemerval.zanella, libc-alpha, nd,
Szabolcs Nagy
On Mon, 19 Jun 2017, Wilco Dijkstra wrote:
> And yes, your patches have licensing issues - since you have signed your
> copyright away, neither of your patches can ever be used in other
Please see the actual assignment text rather than spreading FUD. There is
no requirement or expectation for code contributed to glibc to be usable
in non-copyleft libraries, and the person or company assigning rights is
in any case granted rights to use the code they contributed under other
terms if they wish to do so. See e.g.
<https://www.fsf.org/bulletin/2014/spring/copyright-assignment-at-the-fsf>:
"Thus, we grant back to contributors a license to use their work as they
see fit. This means they are free to modify, share, and sublicense their
own work under terms of their choice. This enables contributors to
redistribute their work under another free software license. While this
technically also permits distributing their work under a proprietary
license, we hope they won't.".
--
Joseph S. Myers
joseph@codesourcery.com
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-19 11:46 ` Wilco Dijkstra
@ 2017-06-19 14:08 ` Ramana Radhakrishnan
2017-06-19 15:26 ` Joseph Myers
1 sibling, 0 replies; 36+ messages in thread
From: Ramana Radhakrishnan @ 2017-06-19 14:08 UTC (permalink / raw)
To: Wilco Dijkstra
Cc: Sekhar, Ashwin, siddhesh, adhemerval.zanella, libc-alpha, nd,
Szabolcs Nagy
Wilco,
On Mon, Jun 19, 2017 at 12:46 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Sekhar, Ashwin wrote:
>>
>> In my work, I only used algorithms that are already in libm in other
>> architectures' sinf/cosf implementations. So I guess the issue
>> that Szabolcs raised about math code licensing doesn't really apply to
>> my patch??
>
> Firstly the ARM and AArch64 optimized assembly routines are also in cortex-strings,
> bionic and newlib, pretty much identically. Such patches have to be committed first
> to cortex-strings or newlib to avoid licensing issues.
Please read this thread again especially Joseph's email
https://sourceware.org/ml/libc-alpha/2017-06/msg00546.html. There is
no requirement on contributors to the glibc project to do this.
While our team in ARM has a practice of contributing routines to
cortex-strings and newlib and glibc and keeping these projects in
sync, there is no legal requirement to the glibc project to do so.
Thus on this point (about patches being committed first to
cortex-strings or newlib) there is no licensing issue.
Thanks
Ramana
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-16 14:39 ` Sekhar, Ashwin
@ 2017-06-19 11:46 ` Wilco Dijkstra
2017-06-19 14:08 ` Ramana Radhakrishnan
2017-06-19 15:26 ` Joseph Myers
0 siblings, 2 replies; 36+ messages in thread
From: Wilco Dijkstra @ 2017-06-19 11:46 UTC (permalink / raw)
To: Sekhar, Ashwin, siddhesh, adhemerval.zanella
Cc: libc-alpha, nd, Szabolcs Nagy
Sekhar, Ashwin wrote:
>
> In my work, I only used algorithms that are already in libm in other
> architectures' sinf/cosf implementations. So I guess the issue
> that Szabolcs raised about math code licensing doesn't really apply to
> my patch??
Firstly the ARM and AArch64 optimized assembly routines are also in cortex-strings,
bionic and newlib, pretty much identically. Such patches have to be committed first
to cortex-strings or newlib to avoid licensing issues.
Similarly https://github.com/ARM-software/optimized-routines contains various
math functions which are faster than the ones in GLIBC.
And yes, your patches have licensing issues - since you have signed your copyright
away, neither of your patches can ever be used in other libraries. That makes it less
useful to add them to GLIBC as we'd have to create different versions for the other
libraries.
Wilco
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 18:49 ` Adhemerval Zanella
@ 2017-06-16 14:39 ` Sekhar, Ashwin
2017-06-19 11:46 ` Wilco Dijkstra
0 siblings, 1 reply; 36+ messages in thread
From: Sekhar, Ashwin @ 2017-06-16 14:39 UTC (permalink / raw)
To: siddhesh, Sekhar, Ashwin, adhemerval.zanella, Wilco.Dijkstra
Cc: libc-alpha, nd, Szabolcs.Nagy
[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]
On Tue, 2017-06-13 at 15:49 -0300, Adhemerval Zanella wrote:
>
> On 13/06/2017 14:49, Wilco Dijkstra wrote:
> > Adhemerval Zanella wrote:
> > > I think a good starting point would be if Ashwin could provide
> > > us with a C skeleton of the same implementation done in
> > > assembly.
> > I don't see the point of asking him to do that. It would be a
> > significant amount of work which would be wasted once Szabolcs
> > posts his implementation.
> It was not a demand, but rather a suggestion if it were the case he
> has a starting implementation based on C (which I know might not be
> the case).
After your suggestion, I tried implementing a C skeleton (attached to
this mail) of sinf using the same algorithm. I am able to see the same
kind of speedup with the C version.
But I am not sure whether to spend time addressing the other technical
comments on the current patch and cleaning up this C version for
submission, as there is likely another patch coming from Szabolcs that
supposedly can supersede my work.
Ashwin
>
> Also, it would be helpful to know which algorithmic strategies
> Szabolcs is planning that differ from Ashwin's and the current
> implementation, since his work is intended to supersede both
> (something like the short algorithm description sent by Ashwin).
I would also like to know about the same.
In my work, I only used algorithms that are already in libm in other
architectures' sinf/cosf implementations. So I guess the issue
that Szabolcs raised about math code licensing doesn't really apply to
my patch??
Ashwin
>
> >
> >
[-- Attachment #2: s_sinf.c --]
[-- Type: text/x-csrc; name="s_sinf.c", Size: 6906 bytes --]
/* Optimized ASIMD version of sinf
Copyright (C) 2017 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */
#include <errno.h>
#include <math.h>
#include <math_private.h>
/* Short algorithm description:
*
* 1) if |x| == 0: return x.
* 2) if |x| < 2^-27: return x-x*DP_SMALL, raise underflow only when needed.
* 3) if |x| < 2^-5 : return x+x^3*DP_SIN2_0+x^5*DP_SIN2_1.
* 4) if |x| < Pi/4: return x+x^3*(S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4)))).
* 5) if |x| < 9*Pi/4:
*    5.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0x0e, n=k+1,
*         t=|x|-j*Pi/4.
*    5.2) Reconstruction:
*         s = sign(x) * (-1.0)^((n>>2)&1)
*         if((n&2) != 0) {
*           using cos(t) polynomial for |t|<Pi/4, result is
*           s * (1.0+t^2*(C0+t^2*(C1+t^2*(C2+t^2*(C3+t^2*C4))))).
*         } else {
*           using sin(t) polynomial for |t|<Pi/4, result is
*           s * t * (1.0+t^2*(S0+t^2*(S1+t^2*(S2+t^2*(S3+t^2*S4))))).
*         }
* 6) if |x| < 2^23, large args:
*    6.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+1,
*         t=|x|-j*Pi/4.
*    6.2) Reconstruction same as (5.2).
* 7) if |x| >= 2^23, very large args:
*    7.1) Range reduction: k=trunc(|x|/(Pi/4)), j=(k+1)&0xfffffffe, n=k+1,
*         t=|x|-j*Pi/4.
*    7.2) Reconstruction same as (5.2).
* 8) if x is Inf, return x-x, and set errno=EDOM.
* 9) if x is NaN, return x-x.
*
* Special cases:
* sin(+-0) = +-0 not raising inexact/underflow,
* sin(subnormal) raises inexact/underflow,
* sin(min_normalized) raises inexact/underflow,
* sin(normalized) raises inexact,
* sin(Inf) = NaN, raises invalid, sets errno to EDOM,
* sin(NaN) = NaN.
*/
static double invpio4_table[25] = {
0.00000000000000000000000000000000e+00,
1.27323953807353973388671875000000e+00,
6.66162294771233121082332218065858e-09,
4.55202009562027200395410188316081e-18,
5.05365203780056430379174094616698e-26,
3.33657353140390501092355175630980e-34,
2.31774265657771014271759915567725e-43,
1.41079183488085906188916890478151e-51,
1.78201357714620429447917285584916e-59,
6.45440934111020426845713721490652e-68,
2.96289605657163538186678664280422e-77,
2.34290278673081796193027756956770e-85,
6.89165747744598758512920848666612e-94,
2.61827895738527799147711877424548e-102,
5.22516501694879285329083609946870e-111,
2.31723558129677581012126324525784e-119,
5.36762980505877895920512518100769e-128,
2.49914273179058265651805723374739e-135,
3.32028340088181692034775714274009e-144,
1.98261407071607720791943444409774e-152,
1.34423330943468258263688817309715e-162,
2.61360711509865349546753597258351e-169,
2.26640747502086603170125306310953e-178,
2.39096791372273354372455558796085e-186,
2.14336400443697767564443045577643e-194
};
float __sinf (float x)
{
  uint64_t ix, n;
  double k0, k1, k2, k3, k4, w, r, y, z, t;
  GET_FLOAT_WORD(ix, x);
  ix &= 0x7fffffff;
  if (ix < 0x3d000000)
    goto small_args;
  if (ix >= 0x3f490fdb)
    goto large_args;
  /* Here if 2^-5<=|x|<Pi/4 */
  /* Sin Polynomial Coefficients */
  k0 = -1.66666666666265311791406134034332e-01;
  k1 = 8.33333332439092043519845987020744e-03;
  k2 = -1.98412633515623692105969699817081e-04;
  k3 = 2.75552591873811586009688362475245e-06;
  k4 = -2.47545996176987174320511533257699e-08;
  y = x * x;
  z = y * y;
  r = x * (1.0 + y * (k0 + z * (k2 + z * k4)) + z * (k1 + z * k3));
  return r;
large_args:
  if (ix < 0x40e231d6) {
    /* Here if Pi/4<=|x|<9*Pi/4 */
    double pio4, invpio4, j;
    pio4 = 7.85398163397448278999490867136046e-01;
    invpio4 = 1.27323954473516276486577680771006e+00;
    t = fabs (x);
    n = t * invpio4 + 1.0;
    j = n & 0x0e;
    t = t - j * pio4;
  } else if (ix < 0x4b000000) {
    /* Here if 9*Pi/4<=|x|<2^23 */
    double pio4hi, pio4lo, invpio4, j;
    pio4hi = -7.85398162901401519775390625000000e-01;
    pio4lo = -4.96046789840270212596747252887163e-10;
    invpio4 = 1.27323954473516276486577680771006e+00;
    t = fabs (x);
    n = t * invpio4 + 1.0;
    j = n & 0xfffffffe;
    t = t + j * pio4hi + j * pio4lo;
  } else if (ix < 0x7f800000) {
    /* Here if 2^23<=|x|<Inf */
    uint64_t bitpos, j;
    double pio4, tmp0, tmp1, tmp2, tmp3, tmp4;
    pio4 = 7.85398163397448278999490867136046e-01;
    t = fabs (x);
    bitpos = (ix >> 23) - 0x7f + 59;
    j = (bitpos * ((0x100000000 / 28) + 1)) >> 32;
    tmp0 = invpio4_table[j - 2] * t;
    tmp1 = invpio4_table[j - 1] * t;
    tmp2 = invpio4_table[j] * t;
    tmp3 = invpio4_table[j + 1] * t;
    tmp0 = tmp0 - ((uint64_t)tmp0 & ~0x7);
    tmp4 = tmp0 + tmp1;
    n = tmp4;
    tmp0 = tmp0 - n;
    t = tmp0 - (n & 0x1) + tmp1 + tmp2 + tmp3;
    if (t > 1.0) {
      n = n + 1;
      t -= 2.0;
    }
    n = n + 1;
    t = t * pio4;
  } else {
    /* Here if x is Inf or NaN */
    if (ix == 0x7f800000)
      __set_errno (EDOM);
    return x - x;
  }
  switch (n & 0x2) {
  case 2:
    /* Use Cos Polynomial */
    k0 = -4.99999999994893751242841517523630e-01;
    k1 = 4.16666665534264832326805105822132e-02;
    k2 = -1.38888806593809050610177635576292e-03;
    k3 = 2.47989607241011055654977823792251e-05;
    k4 = -2.71747891329266278019784327732444e-07;
    w = 1.0;
    break;
  default:
    /* Use Sin Polynomial */
    k0 = -1.66666666666265311791406134034332e-01;
    k1 = 8.33333332439092043519845987020744e-03;
    k2 = -1.98412633515623692105969699817081e-04;
    k3 = 2.75552591873811586009688362475245e-06;
    k4 = -2.47545996176987174320511533257699e-08;
    w = t;
  }
  GET_FLOAT_WORD(ix, x);
  if (((ix >> 31) ^ (n >> 2)) & 0x1)
    w *= -1.0;
  y = t * t;
  z = y * y;
  r = w * (1.0 + y * (k0 + z * (k2 + z * k4)) + z * (k1 + z * k3));
  return r;
small_args:
  if (ix >= 0x32000000) {
    /* Here if 2^-27<=|x|<2^-5 */
    k0 = -1.66666666634829235826842364076583e-01;
    k1 = 8.33312019844746100505350483445000e-03;
    y = x * x;
    z = y * y;
    r = x + x * y * k0 + x * z * k1;
    return r;
  } else if (ix != 0) {
    /* Here if 0<|x|<2^-27 */
    double small;
    small = 8.88178419700125232338905334472656e-16;
    r = x - x * small;
    return r;
  }
  /* Here if |x| == 0 */
  return x;
}
weak_alias (__sinf, sinf);
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 17:49 Wilco Dijkstra
2017-06-13 18:14 ` Siddhesh Poyarekar
@ 2017-06-13 18:49 ` Adhemerval Zanella
2017-06-16 14:39 ` Sekhar, Ashwin
1 sibling, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2017-06-13 18:49 UTC (permalink / raw)
To: Wilco Dijkstra, Ashwin.Sekhar, Siddhesh Poyarekar
Cc: nd, libc-alpha, Szabolcs Nagy
On 13/06/2017 14:49, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
>
>> I think a good starting point would be if Ashwin could provide us with a C skeleton of the same implementation done in assembly.
>
> I don't see the point of asking him to do that. It would be a significant amount of
> work which would be wasted once Szabolcs posts his implementation.
It was not a demand, but rather a suggestion if it were the case he has a
starting implementation based on C (which I know might not be the case).
Also, it would be helpful to know which algorithmic strategies Szabolcs
is planning that differ from Ashwin's and the current implementation,
since his work is intended to supersede both (something like the short
algorithm description sent by Ashwin).
>
> What I'd like to ask Ashwin is which benchmarks show a speedup due to
> his patch and whether it is essential they get a speedup this release
> (rather than a larger speedup later). I don't recall seeing sinf/cosf in profiles
> in popular benchmarks, so I don't understand the urgency of doing this now.
>
> What I think would be useful is to start collecting real traces of actual applications
> or large benchmarks like SPEC and creating representative microbenchmarks
> using that data.
>
> Wilco
>
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 17:49 Wilco Dijkstra
@ 2017-06-13 18:14 ` Siddhesh Poyarekar
2017-06-13 18:49 ` Adhemerval Zanella
1 sibling, 0 replies; 36+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-13 18:14 UTC (permalink / raw)
To: Wilco Dijkstra, adhemerval.zanella, Ashwin.Sekhar
Cc: nd, libc-alpha, Szabolcs Nagy
On Tuesday 13 June 2017 11:19 PM, Wilco Dijkstra wrote:
> I don't see the point of asking him to do that. It would be a significant amount of
> work which would be wasted once Szabolcs posts his implementation.
Here's what Szabolcs said about his sinf/cosf work:
>> the plan is the next release cycle (i plan to post powf
>> first, then work on sinf/cosf, possibly sin/cos too, then
>> look at vector versions once the vector abi is in gcc).
which seems to indicate that he has not even started. I guess it is
Ashwin's call now as to whether he finds it worthwhile spending time to
work on it and then have it (probably) superseded later.
> What I'd like to ask Ashwin is which benchmarks show a speedup due to
> his patch and whether it is essential they get a speedup this release
> (rather than a larger speedup later). I don't recall seeing sinf/cosf in profiles
> in popular benchmarks, so I don't understand the urgency of doing this now.
I don't understand the point of the question. Why is an intermediate
speedup undesirable? At the very least it will help Ashwin get his
first patch into glibc, which is a great thing too since that means we
have another contributor in our fold.
> What I think would be useful is to start collecting real traces of actual applications
> or large benchmarks like SPEC and creating representative microbenchmarks
> using that data.
We had a benchmarking BoF back in 2015 and I had made notes then to help
track the long term goals we desire from such an exercise:
https://www.sourceware.org/ml/libc-alpha/2015-08/msg00726.html
Siddhesh
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
@ 2017-06-13 17:49 Wilco Dijkstra
2017-06-13 18:14 ` Siddhesh Poyarekar
2017-06-13 18:49 ` Adhemerval Zanella
0 siblings, 2 replies; 36+ messages in thread
From: Wilco Dijkstra @ 2017-06-13 17:49 UTC (permalink / raw)
To: adhemerval.zanella, Ashwin.Sekhar, Siddhesh Poyarekar
Cc: nd, libc-alpha, Szabolcs Nagy
Adhemerval Zanella wrote:
> I think a good starting point would be if Ashwin could provide us with a C skeleton of the same implementation done in assembly.
I don't see the point of asking him to do that. It would be a significant amount of
work which would be wasted once Szabolcs posts his implementation.
What I'd like to ask Ashwin is which benchmarks show a speedup due to
his patch and whether it is essential they get a speedup this release
(rather than a larger speedup later). I don't recall seeing sinf/cosf in profiles
in popular benchmarks, so I don't understand the urgency of doing this now.
What I think would be useful is to start collecting real traces of actual applications
or large benchmarks like SPEC and creating representative microbenchmarks
using that data.
Wilco
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 16:53 ` Adhemerval Zanella
@ 2017-06-13 17:49 ` Joseph Myers
0 siblings, 0 replies; 36+ messages in thread
From: Joseph Myers @ 2017-06-13 17:49 UTC (permalink / raw)
To: Adhemerval Zanella
Cc: Siddhesh Poyarekar, Szabolcs Nagy, libc-alpha, Ashwin Sekhar T K, nd
On Tue, 13 Jun 2017, Adhemerval Zanella wrote:
> I think a good starting point would be if Ashwin could provide us
> with a C skeleton of the same implementation done in assembly.
Generic remark for anyone speeding up sinf (architecture-independently):
sinf slowness (in certain cases, anyway) is bug 5997, so if you make it at
least as fast as sin in those cases your patch should include [BZ #5997]
in its ChangeLog entry and that bug should be resolved as FIXED with
milestone set once such a speedup is in.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 14:19 ` Siddhesh Poyarekar
@ 2017-06-13 16:53 ` Adhemerval Zanella
2017-06-13 17:49 ` Joseph Myers
2017-06-23 8:56 ` Sekhar, Ashwin
1 sibling, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2017-06-13 16:53 UTC (permalink / raw)
To: Siddhesh Poyarekar; +Cc: Szabolcs Nagy, libc-alpha, Ashwin Sekhar T K, nd
> On 13 Jun 2017, at 11:19, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
>
>> On Tuesday 13 June 2017 06:58 PM, Szabolcs Nagy wrote:
>> i didnt say i rejected his code, but that duplicated
>> effort is not good.
>
> OK, but that wasn't clear from the context of the message. I agree that
> duplication is wasteful, but it's not really that bad if it brings out
> different implementations that can be weighed and improved upon.
>
>> asm is not acceptable even if it's slightly faster.
>> (fix the compiler in that case)
>>
>> asm code maintenance is a huge problem in glibc,
>> in the long term generic code is better in a lot
>> of domains, the sinf/cosf code is such a case,
>> there is no special instruction that helps them
>> that the compiler cannot easily generate.
>
> I'm going to disagree with this even though I agree with the general
> premise that C > assembly for maintenance. If someone comes up with an
> assembly implementation that is significantly (the definition of
> 'significant' may vary from function to function) faster that cannot be
> implemented in the compiler in the current release, it makes sense to
> carry that implementation in glibc until the compiler can catch up
> provided that all other criteria (accuracy, readability, etc.) are met.
>
> Additionally, the source of the implementation is important. Now if
> this patch came from a university student who does not intend to follow
> up and maintain her patch then I would be (slightly, again it depends on
> the magnitude of the improvement) inclined to agree with you since it
> puts the maintenance overhead on us, but in this case the source is
> reliable, so that is an added advantage.
>
>> i didn't say it's a glibc requirement, you have to use
>> common sense here: there are algorithms that are so
>> useful outside of glibc and so generic that it is just
>> unnecessary complication to develop them within glibc
>> (obviously it's not a complication for glibc, but for
>> everybody else, and i cant impose this procedure on
>> others, but i still think this is the better for the
>> larger community).
>
> Yes, I did not disagree on the merit of the requirement, I am arguing
> about our ability to gate that effectively. It might be useful to come
> up with a wiki doc (or enhancing the contribution checklist) to specify
> this.
>
> But then, as a project we are also ideologically bound to LGPL, so again
> I wonder if doing this conflicts with that ideology. I personally am
> more liberal about this, but I don't know if that is the general opinion
> of the community.
>
>> if one tests the same input in a loop that does
>> not measure the effect of branches and thus we end
>> up breaking up the input space into many special
>> ranges, however in practice that's not optimal.
>
> Currently the microbenchmark framework tests the same input in a loop a
> specific number of times to get a large enough number that a single
> iteration gives a stable mean and then tests inputs in a loop - I agree
> that this is cheating a bit since it eliminates cache effects as well as
> branches. It will need a pretty straightforward fix to run only once
> for a single input and it will do what you want, i.e. measure the effect
> of branches.
>
> Maybe Ashwin could patch the framework as well when he posts his patch.
>
> Siddhesh
I think a good starting point would be if Ashwin could provide us with a C skeleton of the same implementation that was done in assembly.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 15:25 Wilco Dijkstra
@ 2017-06-13 15:44 ` Joseph Myers
0 siblings, 0 replies; 36+ messages in thread
From: Joseph Myers @ 2017-06-13 15:44 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: libc-alpha, Szabolcs Nagy, nd, Ashwin.Sekhar
On Tue, 13 Jun 2017, Wilco Dijkstra wrote:
> Agreed. I recently committed an AArch64 fix for int<->FP moves
> (https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01282.html), so the
> particular issue Szabolcs mentions is fixed. I've got several other
> improvements inspired by non-optimal sequences in math functions
> in review or development. It's fairly trivial to get GCC to generate the
> expected code, so there is no excuse for using assembly code.
And this discussion reminded me to file a bug -
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81085> - for inefficient
code I'd previously observed on 32-bit x86 when accessing long double
function arguments.
--
Joseph S. Myers
joseph@codesourcery.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
@ 2017-06-13 15:25 Wilco Dijkstra
2017-06-13 15:44 ` Joseph Myers
0 siblings, 1 reply; 36+ messages in thread
From: Wilco Dijkstra @ 2017-06-13 15:25 UTC (permalink / raw)
To: Joseph Myers, libc-alpha, Szabolcs Nagy; +Cc: nd, Ashwin.Sekhar
Joseph Myers wrote:
> On Tue, 13 Jun 2017, Szabolcs Nagy wrote:
>
> > the c implementation is generic
> > (sometimes the instruction scheduling is suboptimal and
> > i found that union based bithacks don't always give good
> > code but those are issues we can work on the gcc side)
>
> Indeed, I've told powerpc people trying to add powerpc-specific versions
> of those union-based macros to do the optimization on the compiler side.
> Exactly the same applies to AArch64 - there are lots of copies of
> fdlibm-based code and similar union-based code in different projects,
> making compilers optimize better will help more than just glibc.
Agreed. I recently committed an AArch64 fix for int<->FP moves
(https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01282.html), so the
particular issue Szabolcs mentions is fixed. I've got several other
improvements inspired by non-optimal sequences in math functions
in review or development. It's fairly trivial to get GCC to generate the
expected code, so there is no excuse for using assembly code.
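To make the "union based bithacks" concrete, here is a minimal sketch of the two usual ways C code reads the bits of a float, plus a magnitude check of the kind a sinf range test might use. The names (float_shape, asuint_union, asuint_memcpy, abs_below_pio4) are made up for illustration and are not taken from glibc or from the patch under discussion.

```c
#include <stdint.h>
#include <string.h>

/* fdlibm-style type punning through a union (illustrative type name). */
typedef union { float f; uint32_t u; } float_shape;

/* Read the IEEE-754 bit pattern of x via the union. */
static inline uint32_t asuint_union(float x)
{
    float_shape s;
    s.f = x;
    return s.u;
}

/* Same result via memcpy; compilers normally turn this into one move. */
static inline uint32_t asuint_memcpy(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Integer compare on the magnitude bits: nonzero if |x| < pi/4.
   0x3f490fdb is pi/4 rounded to single precision. */
static inline int abs_below_pio4(float x)
{
    return (asuint_memcpy(x) & 0x7fffffffu) < 0x3f490fdbu;
}
```

Both accessors should compile to a single FP-to-integer register move; when they do not, that is the kind of code generation issue that is better fixed in the compiler than worked around with assembly.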
Wilco
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 14:20 ` Adhemerval Zanella
@ 2017-06-13 14:46 ` Joseph Myers
0 siblings, 0 replies; 36+ messages in thread
From: Joseph Myers @ 2017-06-13 14:46 UTC (permalink / raw)
To: Adhemerval Zanella; +Cc: libc-alpha
On Tue, 13 Jun 2017, Adhemerval Zanella wrote:
> x86_64 does this trick using ifunc (sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c
> for instance).
That's only relevant where you have architecture variants with and without
fma (then you can have a generic source file, built with different options
and using IFUNC). For 32-bit ARM conditional fma availability is relevant
(bug 15503 notes that we ought to use VFMA in the fma / fmaf functions in
glibc, with IFUNCs where necessary), for AArch64 fma is always present.
--
Joseph S. Myers
joseph@codesourcery.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 13:24 ` Szabolcs Nagy
2017-06-13 14:20 ` Adhemerval Zanella
@ 2017-06-13 14:40 ` Joseph Myers
1 sibling, 0 replies; 36+ messages in thread
From: Joseph Myers @ 2017-06-13 14:40 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: Sekhar, Ashwin, libc-alpha, nd
On Tue, 13 Jun 2017, Szabolcs Nagy wrote:
> the c implementation is generic
> (sometimes the instruction scheduling is suboptimal and
> i found that union based bithacks don't always give good
> code but those are issues we can work on the gcc side)
Indeed, I've told powerpc people trying to add powerpc-specific versions
of those union-based macros to do the optimization on the compiler side.
Exactly the same applies to AArch64 - there are lots of copies of
fdlibm-based code and similar union-based code in different projects,
making compilers optimize better will help more than just glibc.
> one issue is fma vs non-fma code, i haven't solved that
> yet, but it will probably work either way (since we use
> double prec), if it makes a difference i will add ifdef
> code path for the two cases (might affect the fast arg
> reduction)
We already have __FP_FAST_FMA conditionals in glibc; having more such
conditionals (where relevant) is fine.
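As a rough illustration of such a conditional, the sketch below picks a fused multiply-add only when the compiler predefines __FP_FAST_FMA, and falls back to a separate multiply and add otherwise. The helper names mul_add and poly3 are hypothetical, not existing glibc functions.

```c
#include <math.h>

/* Compute a*b + c, using fma() only where the compiler reports that
   it is fast on the target (illustrative helper). */
static inline double mul_add(double a, double b, double c)
{
#ifdef __FP_FAST_FMA
    return fma(a, b, c);      /* single rounding */
#else
    return a * b + c;         /* two roundings, unless the compiler contracts it */
#endif
}

/* Horner evaluation of c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3. */
static inline double poly3(double x, const double c[4])
{
    double r = c[3];
    r = mul_add(r, x, c[2]);
    r = mul_add(r, x, c[1]);
    r = mul_add(r, x, c[0]);
    return r;
}
```

The two paths can differ in the last bit of the result, which is why the choice may matter for the fast argument reduction mentioned above.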
--
Joseph S. Myers
joseph@codesourcery.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 11:55 ` Siddhesh Poyarekar
2017-06-13 13:28 ` Szabolcs Nagy
@ 2017-06-13 14:37 ` Joseph Myers
1 sibling, 0 replies; 36+ messages in thread
From: Joseph Myers @ 2017-06-13 14:37 UTC (permalink / raw)
To: Siddhesh Poyarekar; +Cc: Szabolcs Nagy, libc-alpha, Ashwin Sekhar T K, nd
On Tue, 13 Jun 2017, Siddhesh Poyarekar wrote:
> > - math code should not be fsf assigned lgpl code, but universally
> > available, post it under non-restricted license first, then assign
> > it to fsf so it can be used everywhere without legal issues.
>
> This is not a glibc requirement. I don't know if we can even make that
> a requirement for arm/aarch64 code under the scope of the glibc project
> (i.e., it seems like a technical limitation - how do we reject arm
> patches in libc-alpha and redirect devs to cortex-strings or whatever
> else?), but that is something that Joseph or Carlos may be able to answer.
There should be no such requirement in glibc. Contributors are of course
free to contribute their implementations to other projects if they wish,
and the standard FSF assignments generally allow you to use under other
licenses code you contributed (the particular assignments may vary on
whether prior notice to the FSF is needed before doing so).
I fully expect to continue to license new libm functions I contribute, or
major pieces of new code (new source files) for existing functions, under
the usual glibc LGPLv2.1+ license.
> As I mentioned earlier, realistic workloads are more or less a myth
> currently for math, so unless someone comes up with some, synthetic is
> all you'll get.
The synthetic benchmarks can at least have inputs biased to what we expect
the common case is (for the trig functions I'd expect that to be absolute
values below a few pi, with not so many large inputs).
--
Joseph S. Myers
joseph@codesourcery.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 13:24 ` Szabolcs Nagy
@ 2017-06-13 14:20 ` Adhemerval Zanella
2017-06-13 14:46 ` Joseph Myers
2017-06-13 14:40 ` Joseph Myers
1 sibling, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2017-06-13 14:20 UTC (permalink / raw)
To: libc-alpha
On 13/06/2017 10:23, Szabolcs Nagy wrote:
> On 13/06/17 13:56, Sekhar, Ashwin wrote:
>>>> SINF
>>>> ---------------------------------------------------------
>>>> Input        ThunderX88   ThunderX99   CortexA57
>>>> ---------------------------------------------------------
>>>> 0.0          1.88x        1.18x        1.17x
>>>> 2.0^-28      1.33x        1.12x        1.03x
>>>> 2.0^-6       1.48x        1.28x        1.27x
>>>> 0.6*Pi/4     0.94x        1.14x        1.21x
>>>> 13*Pi/8      1.41x        2.00x        2.16x
>>>> 17*Pi/8      1.45x        1.93x        2.23x
>>> based on these numbers my current c implementation is faster,
>>> but it will take time to polish that for submission.
>>
>> Are these going to be aarch64 specific C implementations or changes in
>> generic code?
>>
>> And Could you please inform when you are going to submit your patches.
>>
>> I also dont agree to having duplicated efforts. But if you dont plan to
>> submit your changes in the near future, I guess I will go ahead
>> addressing the other comments and work on submitting a v2 patch.
>>
>
> the plan is the next release cycle (i plan to post powf
> first, then work on sinf/cosf, possibly sin/cos too, then
> look at vector versions once the vector abi is in gcc).
>
> the c implementation is generic
> (sometimes the instruction scheduling is suboptimal and
> i found that union based bithacks don't always give good
> code but those are issues we can work on the gcc side)
>
> one issue is fma vs non-fma code, i haven't solved that
> yet, but it will probably work either way (since we use
> double prec), if it makes a difference i will add ifdef
> code path for the two cases (might affect the fast arg
> reduction)
x86_64 does this trick using ifunc (sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c
for instance).
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 13:28 ` Szabolcs Nagy
2017-06-13 14:15 ` Adhemerval Zanella
@ 2017-06-13 14:19 ` Siddhesh Poyarekar
2017-06-13 16:53 ` Adhemerval Zanella
2017-06-23 8:56 ` Sekhar, Ashwin
1 sibling, 2 replies; 36+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-13 14:19 UTC (permalink / raw)
To: Szabolcs Nagy, libc-alpha, Ashwin Sekhar T K; +Cc: nd
On Tuesday 13 June 2017 06:58 PM, Szabolcs Nagy wrote:
> i didnt say i rejected his code, but that duplicated
> effort is not good.
OK, but that wasn't clear from the context of the message. I agree that
duplication is wasteful, but it's not really that bad if it brings out
different implementations that can be weighed and improved upon.
> asm is not acceptable even if it's slightly faster.
> (fix the compiler in that case)
>
> asm code maintenance is a huge problem in glibc,
> in the long term generic code is better in a lot
> of domains, the sinf/cosf code is such a case,
> there is no special instruction that helps them
> that the compiler cannot easily generate.
I'm going to disagree with this even though I agree with the general
premise that C > assembly for maintenance. If someone comes up with an
assembly implementation that is significantly (the definition of
'significant' may vary from function to function) faster that cannot be
implemented in the compiler in the current release, it makes sense to
carry that implementation in glibc until the compiler can catch up
provided that all other criteria (accuracy, readability, etc.) are met.
Additionally, the source of the implementation is important. Now if
this patch came from a university student who does not intend to follow
up and maintain her patch then I would be (slightly, again it depends on
the magnitude of the improvement) inclined to agree with you since it
puts the maintenance overhead on us, but in this case the source is
reliable, so that is an added advantage.
> i didn't say it's a glibc requirement, you have to use
> common sense here: there are algorithms that are so
> useful outside of glibc and so generic that it is just
> unnecessary complication to develop them within glibc
> (obviously it's not a complication for glibc, but for
> everybody else, and i cant impose this procedure on
> others, but i still think this is the better for the
> larger community).
Yes, I did not disagree on the merit of the requirement, I am arguing
about our ability to gate that effectively. It might be useful to come
up with a wiki doc (or enhancing the contribution checklist) to specify
this.
But then, as a project we are also ideologically bound to LGPL, so again
I wonder if doing this conflicts with that ideology. I personally am
more liberal about this, but I don't know if that is the general opinion
of the community.
> if one tests the same input in a loop that does
> not measure the effect of branches and thus we end
> up breaking up the input space into many special
> ranges, however in practice that's not optimal.
Currently the microbenchmark framework tests the same input in a loop a
specific number of times to get a large enough number that a single
iteration gives a stable mean and then tests inputs in a loop - I agree
that this is cheating a bit since it eliminates cache effects as well as
branches. It will need a pretty straightforward fix to run only once
for a single input and it will do what you want, i.e. measure the effect
of branches.
Maybe Ashwin could patch the framework as well when he posts his patch.
Siddhesh
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 13:28 ` Szabolcs Nagy
@ 2017-06-13 14:15 ` Adhemerval Zanella
2017-06-13 14:19 ` Siddhesh Poyarekar
1 sibling, 0 replies; 36+ messages in thread
From: Adhemerval Zanella @ 2017-06-13 14:15 UTC (permalink / raw)
To: libc-alpha
On 13/06/2017 10:28, Szabolcs Nagy wrote:
> On 13/06/17 12:54, Siddhesh Poyarekar wrote:
>> On Tuesday 13 June 2017 04:37 PM, Szabolcs Nagy wrote:
>>> i thought it was a vector version because of ASIMD, but it's
>>> just scalar sinf/cosf.
>>>
>>> there are many issues with this patch, but most importantly it
>>> duplicates work as i also happen to work on single precision
>>> math functions (sorry).
>>
>> I don't know if this reason even makes sense for rejecting a patch -
>> you're basically saying that we should reject code that is already
>> posted because you have been working on something that is going to come
>> out in the future.
>>
>
> i didnt say i rejected his code, but that duplicated
> effort is not good.
>
>> Ashwin has come out with his code first, so please stick to only the
>> technical points for review.
>>
>>> issues:
>>>
>>> - asm code wont be accepted: generic c code can be just as fast.
>>
>> To be specific, ASM code won't be accepted until it is proven to be
>> faster than existing C code.
>>
>
> asm is not acceptable even if it's slightly faster.
> (fix the compiler in that case)
>
> asm code maintenance is a huge problem in glibc,
> in the long term generic code is better in a lot
> of domains, the sinf/cosf code is such a case,
> there is no special instruction that helps them
> that the compiler cannot easily generate.
>
I tend to agree with you, and generic code can be useful for more than one
specific CPU. However, in this special case I think the coordination must
first come from you, since you are the one asking Ashwin to
hold/abandon the patch for a future submission. Maybe sharing your current
work with him, even if it is still WIP, can speed up development and give
hints for future developments.
>>> - ifunc wont be accepted: all instructions are available on all cpus.
>>
>> Agreed.
>>
>>> - math code should not be fsf assigned lgpl code, but universally
>>> available, post it under non-restricted license first, then assign
>>> it to fsf so it can be used everywhere without legal issues.
>>
>> This is not a glibc requirement. I don't know if we can even make that
>> a requirement for arm/aarch64 code under the scope of the glibc project
>> (i.e., it seems like a technical limitation - how do we reject arm
>> patches in libc-alpha and redirect devs to cortex-strings or whatever
>> else?), but that is something that Joseph or Carlos may be able to answer.
>>
>> Perhaps a prominent note in the wiki should be a start.
>>
>
> i didn't say it's a glibc requirement, you have to use
> common sense here: there are algorithms that are so
> useful outside of glibc and so generic that it is just
> unnecessary complication to develop them within glibc
> (obviously it's not a complication for glibc, but for
> everybody else, and i cant impose this procedure on
> others, but i still think this is the better for the
> larger community).
Would it be possible to multi-license the code under LGPL and a
less restrictive license?
>
>>> - document the worst case ulp error and number of misrounded
>>> cases: for single argument scalar functions you can easily test
>>> all possible inputs in all rounding modes and that information
>>> helps to decide if the algorithm is good enough.
>>
>> Agreed.
>>
>>> - benchmark measurements ideally provide a latency and a
>>> throughput numbers as well for the various ranges or use a
>>> realistic workload, in this case there are many branches
>>> for the various input ranges so it is useful to have a
>>> benchmark that can show the effect of that.
>>
>> As I mentioned earlier, realistic workloads are more or less a myth
>> currently for math, so unless someone comes up with some, synthetic is
>> all you'll get.
>
> if one tests the same input in a loop that does
> not measure the effect of branches and thus we end
> up breaking up the input space into many special
> ranges, however in practice that's not optimal.
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 11:55 ` Siddhesh Poyarekar
@ 2017-06-13 13:28 ` Szabolcs Nagy
2017-06-13 14:15 ` Adhemerval Zanella
2017-06-13 14:19 ` Siddhesh Poyarekar
2017-06-13 14:37 ` Joseph Myers
1 sibling, 2 replies; 36+ messages in thread
From: Szabolcs Nagy @ 2017-06-13 13:28 UTC (permalink / raw)
To: Siddhesh Poyarekar, libc-alpha, Ashwin Sekhar T K; +Cc: nd
On 13/06/17 12:54, Siddhesh Poyarekar wrote:
> On Tuesday 13 June 2017 04:37 PM, Szabolcs Nagy wrote:
>> i thought it was a vector version because of ASIMD, but it's
>> just scalar sinf/cosf.
>>
>> there are many issues with this patch, but most importantly it
>> duplicates work as i also happen to work on single precision
>> math functions (sorry).
>
> I don't know if this reason even makes sense for rejecting a patch -
> you're basically saying that we should reject code that is already
> posted because you have been working on something that is going to come
> out in the future.
>
i didnt say i rejected his code, but that duplicated
effort is not good.
> Ashwin has come out with his code first, so please stick to only the
> technical points for review.
>
>> issues:
>>
>> - asm code wont be accepted: generic c code can be just as fast.
>
> To be specific, ASM code won't be accepted until it is proven to be
> faster than existing C code.
>
asm is not acceptable even if it's slightly faster.
(fix the compiler in that case)
asm code maintenance is a huge problem in glibc,
in the long term generic code is better in a lot
of domains, the sinf/cosf code is such a case,
there is no special instruction that helps them
that the compiler cannot easily generate.
>> - ifunc wont be accepted: all instructions are available on all cpus.
>
> Agreed.
>
>> - math code should not be fsf assigned lgpl code, but universally
>> available, post it under non-restricted license first, then assign
>> it to fsf so it can be used everywhere without legal issues.
>
> This is not a glibc requirement. I don't know if we can even make that
> a requirement for arm/aarch64 code under the scope of the glibc project
> (i.e., it seems like a technical limitation - how do we reject arm
> patches in libc-alpha and redirect devs to cortex-strings or whatever
> else?), but that is something that Joseph or Carlos may be able to answer.
>
> Perhaps a prominent note in the wiki should be a start.
>
i didn't say it's a glibc requirement, you have to use
common sense here: there are algorithms that are so
useful outside of glibc and so generic that it is just
unnecessary complication to develop them within glibc
(obviously it's not a complication for glibc, but for
everybody else, and i cant impose this procedure on
others, but i still think this is the better for the
larger community).
>> - document the worst case ulp error and number of misrounded
>> cases: for single argument scalar functions you can easily test
>> all possible inputs in all rounding modes and that information
>> helps to decide if the algorithm is good enough.
>
> Agreed.
>
>> - benchmark measurements ideally provide a latency and a
>> throughput numbers as well for the various ranges or use a
>> realistic workload, in this case there are many branches
>> for the various input ranges so it is useful to have a
>> benchmark that can show the effect of that.
>
> As I mentioned earlier, realistic workloads are more or less a myth
> currently for math, so unless someone comes up with some, synthetic is
> all you'll get.
if one tests the same input in a loop that does
not measure the effect of branches and thus we end
up breaking up the input space into many special
ranges, however in practice that's not optimal.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 12:56 ` Sekhar, Ashwin
@ 2017-06-13 13:24 ` Szabolcs Nagy
2017-06-13 14:20 ` Adhemerval Zanella
2017-06-13 14:40 ` Joseph Myers
0 siblings, 2 replies; 36+ messages in thread
From: Szabolcs Nagy @ 2017-06-13 13:24 UTC (permalink / raw)
To: Sekhar, Ashwin, libc-alpha; +Cc: nd
On 13/06/17 13:56, Sekhar, Ashwin wrote:
>>> SINF
>>> ---------------------------------------------------------
>>> Input        ThunderX88   ThunderX99   CortexA57
>>> ---------------------------------------------------------
>>> 0.0          1.88x        1.18x        1.17x
>>> 2.0^-28      1.33x        1.12x        1.03x
>>> 2.0^-6       1.48x        1.28x        1.27x
>>> 0.6*Pi/4     0.94x        1.14x        1.21x
>>> 13*Pi/8      1.41x        2.00x        2.16x
>>> 17*Pi/8      1.45x        1.93x        2.23x
>> based on these numbers my current c implementation is faster,
>> but it will take time to polish that for submission.
>
> Are these going to be aarch64 specific C implementations or changes in
> generic code?
>
> And Could you please inform when you are going to submit your patches.
>
> I also dont agree to having duplicated efforts. But if you dont plan to
> submit your changes in the near future, I guess I will go ahead
> addressing the other comments and work on submitting a v2 patch.
>
the plan is the next release cycle (i plan to post powf
first, then work on sinf/cosf, possibly sin/cos too, then
look at vector versions once the vector abi is in gcc).
the c implementation is generic
(sometimes the instruction scheduling is suboptimal and
i found that union based bithacks don't always give good
code but those are issues we can work on the gcc side)
one issue is fma vs non-fma code, i haven't solved that
yet, but it will probably work either way (since we use
double prec), if it makes a difference i will add ifdef
code path for the two cases (might affect the fast arg
reduction)
> Thanks
> Ashwin
>
>>
>>>
>>> 1000*Pi/4    19.68x       37.46x       27.99x
>>> 2.0^51       12.00x       13.58x       13.49x
>> this is a bug in the current generic code that it falls back
>> to slow argument reduction even though single precision arg
>> reduction can be done in a few cycles over the entire range,
>>
>> i think the x86_64 sse code could still be simpler and faster
>> (not that it matters much as these are rare cases).
>>
>>>
>>> Inf          1.04x        1.05x        1.12x
>>> Nan          0.95x        0.87x        0.82x
>>> ---------------------------------------------------------
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 11:07 ` Szabolcs Nagy
2017-06-13 11:55 ` Siddhesh Poyarekar
@ 2017-06-13 12:56 ` Sekhar, Ashwin
2017-06-13 13:24 ` Szabolcs Nagy
2017-06-23 10:49 ` Sekhar, Ashwin
2 siblings, 1 reply; 36+ messages in thread
From: Sekhar, Ashwin @ 2017-06-13 12:56 UTC (permalink / raw)
To: libc-alpha, szabolcs.nagy; +Cc: nd
> > SINF
> > ---------------------------------------------------------
> > Input        ThunderX88   ThunderX99   CortexA57
> > ---------------------------------------------------------
> > 0.0          1.88x        1.18x        1.17x
> > 2.0^-28      1.33x        1.12x        1.03x
> > 2.0^-6       1.48x        1.28x        1.27x
> > 0.6*Pi/4     0.94x        1.14x        1.21x
> > 13*Pi/8      1.41x        2.00x        2.16x
> > 17*Pi/8      1.45x        1.93x        2.23x
> based on these numbers my current c implementation is faster,
> but it will take time to polish that for submission.
Are these going to be aarch64 specific C implementations or changes in
generic code?
And Could you please inform when you are going to submit your patches.
I also dont agree to having duplicated efforts. But if you dont plan to
submit your changes in the near future, I guess I will go ahead
addressing the other comments and work on submitting a v2 patch.
Thanks
Ashwin
>
> >
> > 1000*Pi/4    19.68x       37.46x       27.99x
> > 2.0^51       12.00x       13.58x       13.49x
> this is a bug in the current generic code that it falls back
> to slow argument reduction even though single precision arg
> reduction can be done in a few cycles over the entire range,
>
> i think the x86_64 sse code could still be simpler and faster
> (not that it matters much as these are rare cases).
>
> >
> > Inf          1.04x        1.05x        1.12x
> > Nan          0.95x        0.87x        0.82x
> > ---------------------------------------------------------
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 11:07 ` Szabolcs Nagy
@ 2017-06-13 11:55 ` Siddhesh Poyarekar
2017-06-13 13:28 ` Szabolcs Nagy
2017-06-13 14:37 ` Joseph Myers
2017-06-13 12:56 ` Sekhar, Ashwin
2017-06-23 10:49 ` Sekhar, Ashwin
2 siblings, 2 replies; 36+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-13 11:55 UTC (permalink / raw)
To: Szabolcs Nagy, libc-alpha, Ashwin Sekhar T K; +Cc: nd
On Tuesday 13 June 2017 04:37 PM, Szabolcs Nagy wrote:
> i thought it was a vector version because of ASIMD, but it's
> just scalar sinf/cosf.
>
> there are many issues with this patch, but most importantly it
> duplicates work as i also happen to work on single precision
> math functions (sorry).
I don't know if this reason even makes sense for rejecting a patch -
you're basically saying that we should reject code that is already
posted because you have been working on something that is going to come
out in the future.
Ashwin has come out with his code first, so please stick to only the
technical points for review.
> issues:
>
> - asm code wont be accepted: generic c code can be just as fast.
To be specific, ASM code won't be accepted until it is proven to be
faster than existing C code.
> - ifunc wont be accepted: all instructions are available on all cpus.
Agreed.
> - math code should not be fsf assigned lgpl code, but universally
> available, post it under non-restricted license first, then assign
> it to fsf so it can be used everywhere without legal issues.
This is not a glibc requirement. I don't know if we can even make that
a requirement for arm/aarch64 code under the scope of the glibc project
(i.e., it seems like a technical limitation - how do we reject arm
patches in libc-alpha and redirect devs to cortex-strings or whatever
else?), but that is something that Joseph or Carlos may be able to answer.
Perhaps a prominent note in the wiki should be a start.
> - document the worst case ulp error and number of misrounded
> cases: for single argument scalar functions you can easily test
> all possible inputs in all rounding modes and that information
> helps to decide if the algorithm is good enough.
Agreed.
> - benchmark measurements ideally provide a latency and a
> throughput numbers as well for the various ranges or use a
> realistic workload, in this case there are many branches
> for the various input ranges so it is useful to have a
> benchmark that can show the effect of that.
As I mentioned earlier, realistic workloads are more or less a myth
currently for math, so unless someone comes up with some, synthetic is
all you'll get.
Siddhesh
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 7:17 Ashwin Sekhar T K
2017-06-13 8:30 ` Siddhesh Poyarekar
@ 2017-06-13 11:07 ` Szabolcs Nagy
2017-06-13 11:55 ` Siddhesh Poyarekar
` (2 more replies)
1 sibling, 3 replies; 36+ messages in thread
From: Szabolcs Nagy @ 2017-06-13 11:07 UTC (permalink / raw)
To: libc-alpha, Ashwin Sekhar T K; +Cc: nd
On 13/06/17 08:17, Ashwin Sekhar T K wrote:
> This patchset adds the optimized ASIMD version of sinf/cosf
> for Aarch64. The algorithm and code flow is based on the SSE versions
> of the same in sysdeps/x86_64/fpu.
>
> The ASIMD versions are used only if the cpu supports asimd feature.
> It uses ifuncs and HWCAP to identify the ASIMD capability.
>
i thought it was a vector version because of ASIMD, but it's
just scalar sinf/cosf.
there are many issues with this patch, but most importantly it
duplicates work as i also happen to work on single precision
math functions (sorry).
i plan to work on vector math functions and double precision
math functions too, before anybody jumps on that, please
coordinate to avoid wasted effort like this.
issues:
- asm code wont be accepted: generic c code can be just as fast.
- ifunc wont be accepted: all instructions are available on all cpus.
- math code should not be fsf assigned lgpl code, but universally
available, post it under non-restricted license first, then assign
it to fsf so it can be used everywhere without legal issues.
- document the worst case ulp error and number of misrounded
cases: for single argument scalar functions you can easily test
all possible inputs in all rounding modes and that information
helps to decide if the algorithm is good enough.
- benchmark measurements ideally provide a latency and a
throughput numbers as well for the various ranges or use a
realistic workload, in this case there are many branches
for the various input ranges so it is useful to have a
benchmark that can show the effect of that.
> The patchset was tested using "make check" for the math sub-directory.
> The tests were run on linux 4.4.0-45-generic on ThunderX88 platform.
>
> The following are the approximate speedups observed over the
> existing implementation on different Aarch64 platforms for
> different input values.
>
> SINF
> ---------------------------------------------------------
> Input        ThunderX88   ThunderX99   CortexA57
> ---------------------------------------------------------
> 0.0          1.88x        1.18x        1.17x
> 2.0^-28      1.33x        1.12x        1.03x
> 2.0^-6       1.48x        1.28x        1.27x
> 0.6*Pi/4     0.94x        1.14x        1.21x
> 13*Pi/8      1.41x        2.00x        2.16x
> 17*Pi/8      1.45x        1.93x        2.23x
based on these numbers my current c implementation is faster,
but it will take time to polish that for submission.
> 1000*Pi/4    19.68x       37.46x       27.99x
> 2.0^51       12.00x       13.58x       13.49x
this is a bug in the current generic code that it falls back
to slow argument reduction even though single precision arg
reduction can be done in a few cycles over the entire range,
i think the x86_64 sse code could still be simpler and faster
(not that it matters much as these are rare cases).
> Inf          1.04x        1.05x        1.12x
> Nan          0.95x        0.87x        0.82x
> ---------------------------------------------------------
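For the remark above about single precision argument reduction being cheap, a minimal sketch of the idea is to do the reduction in double precision: multiply by 2/pi, round to the nearest integer n, and subtract n*pi/2. The function and type names (reduce_pio2, reduced_arg) are hypothetical, and this simple form is only an assumption-level illustration for moderate |x|; covering the entire float range would need more bits of pi/2 than one double constant provides.

```c
#include <stdint.h>

typedef struct {
    double r;        /* reduced argument, roughly in [-pi/4, pi/4] */
    int quadrant;    /* n mod 4, selects the sin/cos polynomial and sign */
} reduced_arg;

/* Reduce x to r = x - n*(pi/2), with n = nearest integer to x*(2/pi). */
static reduced_arg reduce_pio2(float x)
{
    static const double inv_pio2 = 0.63661977236758134308; /* 2/pi */
    static const double pio2 = 1.57079632679489661923;     /* pi/2 */
    double xd = (double)x;
    double t = xd * inv_pio2;
    /* Round to nearest by adding +-0.5 and truncating. */
    int64_t n = (int64_t)(t + (t < 0.0 ? -0.5 : 0.5));
    reduced_arg out;
    out.r = xd - (double)n * pio2;
    out.quadrant = (int)(n & 3);
    return out;
}
```

Because the input has only 24 significant bits, the double precision subtraction leaves plenty of spare accuracy for moderate inputs, so no slow multi-word reduction is needed on this path.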
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 8:39 ` Sekhar, Ashwin
2017-06-13 10:33 ` Sekhar, Ashwin
@ 2017-06-13 10:48 ` Siddhesh Poyarekar
1 sibling, 0 replies; 36+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-13 10:48 UTC (permalink / raw)
To: Sekhar, Ashwin, libc-alpha
On Tuesday 13 June 2017 02:09 PM, Sekhar, Ashwin wrote:
>> 2. Write a microbenchmark test for glibc
> Sure. Will share this via github.
glibc has its own set of microbenchmarks in benchtests/*. You'll have
to create new benchmark input files for sinf and cosf; take a look at
the benchmark inputs for sin and cos for example and review
benchtests/README for instructions. That way, anybody can do a simple
'make bench' to verify your results on their hardware.
Siddhesh
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 10:33 ` Sekhar, Ashwin
@ 2017-06-13 10:39 ` Sekhar, Ashwin
0 siblings, 0 replies; 36+ messages in thread
From: Sekhar, Ashwin @ 2017-06-13 10:39 UTC (permalink / raw)
To: libc-alpha, Sekhar, Ashwin, siddhesh
On Tue, 2017-06-13 at 10:33 +0000, Sekhar, Ashwin wrote:
> Hi Siddhesh,
> > > Thank you for doing this. Please:
> > >
> > > 1. Explain why you chose these input values as optimization targets and
> > The algorithm splits the inputs into different intervals and uses
> > different code paths for these different intervals. The input values I
> > chose covers all these code paths.
> > > 2. Write a microbenchmark test for glibc
> > Sure. Will share this via github.
> Please find the microbenchmark code at
> https://github.com/ashwinyes/glibc_microbenchmarks/blob/master/sinf/sinf_benchmark.c
Forgot to mention: the benchmark code uses the PMU cycle counter to
measure the number of cycles. Please follow section "38.2.2.
High-resolution cycle counter" in the DPDK guide at
http://dpdk.org/doc/guides/prog_guide/profile_app.html
to enable access to the PMU cycle counter from userspace.
> Thanks
> Ashwin
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 8:39 ` Sekhar, Ashwin
@ 2017-06-13 10:33 ` Sekhar, Ashwin
2017-06-13 10:39 ` Sekhar, Ashwin
2017-06-13 10:48 ` Siddhesh Poyarekar
1 sibling, 1 reply; 36+ messages in thread
From: Sekhar, Ashwin @ 2017-06-13 10:33 UTC (permalink / raw)
To: libc-alpha, Sekhar, Ashwin, siddhesh
Hi Siddhesh,
> > Thank you for doing this. Please:
> >
> > 1. Explain why you chose these input values as optimization targets and
> The algorithm splits the inputs into different intervals and uses
> different code paths for these different intervals. The input values I
> chose covers all these code paths.
> > 2. Write a microbenchmark test for glibc
> Sure. Will share this via github.
Please find the microbenchmark code at
https://github.com/ashwinyes/glibc_microbenchmarks/blob/master/sinf/sinf_benchmark.c
Thanks
Ashwin
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 8:30 ` Siddhesh Poyarekar
@ 2017-06-13 8:39 ` Sekhar, Ashwin
2017-06-13 10:33 ` Sekhar, Ashwin
2017-06-13 10:48 ` Siddhesh Poyarekar
0 siblings, 2 replies; 36+ messages in thread
From: Sekhar, Ashwin @ 2017-06-13 8:39 UTC (permalink / raw)
To: libc-alpha, siddhesh
On Tue, 2017-06-13 at 14:00 +0530, Siddhesh Poyarekar wrote:
> On Tuesday 13 June 2017 12:47 PM, Ashwin Sekhar T K wrote:
> >
> > The following are the approximate speedups observed over the
> > existing implementation on different Aarch64 platforms for
> > different input values.
> >
> > SINF
> > ---------------------------------------------------------
> > Input          ThunderX88     ThunderX99     CortexA57
> > ---------------------------------------------------------
> > 0.0            1.88x          1.18x          1.17x
> > 2.0^-28        1.33x          1.12x          1.03x
> > 2.0^-6         1.48x          1.28x          1.27x
> > 0.6*Pi/4       0.94x          1.14x          1.21x
> > 13*Pi/8        1.41x          2.00x          2.16x
> > 17*Pi/8        1.45x          1.93x          2.23x
> > 1000*Pi/4      19.68x         37.46x         27.99x
> > 2.0^51         12.00x         13.58x         13.49x
> > Inf            1.04x          1.05x          1.12x
> > Nan            0.95x          0.87x          0.82x
> > ---------------------------------------------------------
> >
> > COSF
> > ---------------------------------------------------------
> > Input          ThunderX88     ThunderX99     CortexA57
> > ---------------------------------------------------------
> > 0.0            1.25x          1.14x          1.17x
> > 2.0^-28        1.24x          1.14x          1.13x
> > 2.0^-6         1.38x          1.38x          1.85x
> > 0.6*Pi/4       1.15x          1.38x          1.69x
> > 13*Pi/8        1.65x          1.94x          2.18x
> > 17*Pi/8        1.49x          2.05x          2.09x
> > 1000*Pi/4      18.98x         38.39x         27.52x
> > 2.0^51         11.35x         13.74x         13.47x
> > Inf            0.99x          1.02x          1.16x
> > Nan            0.88x          0.86x          0.87x
> > ---------------------------------------------------------
> Thank you for doing this. Please:
>
> 1. Explain why you chose these input values as optimization targets and
The algorithm splits the inputs into different intervals and uses
a different code path for each interval. The input values I
chose cover all these code paths.
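As an illustration of that split, the sketch below maps an input to one of several code paths. The boundaries are guesses that are consistent with the benchmark inputs listed above (2^-27, 2^-5, Pi/4, 9*Pi/4, 2^23, Inf/NaN); they are not copied from the patch, and the enum labels and the classify function are hypothetical names.

```c
#include <math.h>

/* Hypothetical labels for the code paths; boundaries are assumptions. */
enum sinf_path {
    PATH_TINY,          /* |x| < 2^-27            */
    PATH_SMALL,         /* 2^-27 <= |x| < 2^-5    */
    PATH_POLY,          /* 2^-5  <= |x| < Pi/4    */
    PATH_REDUCE_NEAR,   /* Pi/4  <= |x| < 9*Pi/4  */
    PATH_REDUCE_MID,    /* 9*Pi/4 <= |x| < 2^23   */
    PATH_REDUCE_LARGE,  /* finite |x| >= 2^23     */
    PATH_INF_NAN        /* Inf or NaN             */
};

/* Pick the code path for x by comparing |x| against each boundary. */
static enum sinf_path classify(float x)
{
    const float pio4 = 0.78539816339744830962f;
    float ax = fabsf(x);

    if (!isfinite(ax))      return PATH_INF_NAN;
    if (ax < 0x1p-27f)      return PATH_TINY;
    if (ax < 0x1p-5f)       return PATH_SMALL;
    if (ax < pio4)          return PATH_POLY;
    if (ax < 9.0f * pio4)   return PATH_REDUCE_NEAR;
    if (ax < 0x1p23f)       return PATH_REDUCE_MID;
    return PATH_REDUCE_LARGE;
}
```

Under these assumed boundaries, 13*Pi/8 and 17*Pi/8 fall in the same interval, and every other benchmark input lands in a different one.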
> 2. Write a microbenchmark test for glibc
Sure. Will share this via github.
Thanks
Ashwin
>
> Thanks,
> Siddhesh
>
> >
> >
> > Ashwin Sekhar T K (2):
> > aarch64: Add optimized ASIMD version of sinf
> > aarch64: Add optimized ASIMD version of cosf
> >
> > sysdeps/aarch64/fpu/multiarch/Makefile | 3 +
> > sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S | 367 +++++++++++++++++++++++++
> > sysdeps/aarch64/fpu/multiarch/s_cosf.c | 31 +++
> > sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S | 382 +++++++++++++++++++++++++++
> > sysdeps/aarch64/fpu/multiarch/s_sinf.c | 31 +++
> > 5 files changed, 814 insertions(+)
> > create mode 100644 sysdeps/aarch64/fpu/multiarch/Makefile
> > create mode 100644 sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S
> > create mode 100644 sysdeps/aarch64/fpu/multiarch/s_cosf.c
> > create mode 100644 sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S
> > create mode 100644 sysdeps/aarch64/fpu/multiarch/s_sinf.c
> >
* Re: [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
2017-06-13 7:17 Ashwin Sekhar T K
@ 2017-06-13 8:30 ` Siddhesh Poyarekar
2017-06-13 8:39 ` Sekhar, Ashwin
2017-06-13 11:07 ` Szabolcs Nagy
1 sibling, 1 reply; 36+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-13 8:30 UTC (permalink / raw)
To: Ashwin Sekhar T K, libc-alpha
On Tuesday 13 June 2017 12:47 PM, Ashwin Sekhar T K wrote:
> The following are the approximate speedups observed over the
> existing implementation on different Aarch64 platforms for
> different input values.
>
> SINF
> ---------------------------------------------------------
> Input          ThunderX88     ThunderX99     CortexA57
> ---------------------------------------------------------
> 0.0            1.88x          1.18x          1.17x
> 2.0^-28        1.33x          1.12x          1.03x
> 2.0^-6         1.48x          1.28x          1.27x
> 0.6*Pi/4       0.94x          1.14x          1.21x
> 13*Pi/8        1.41x          2.00x          2.16x
> 17*Pi/8        1.45x          1.93x          2.23x
> 1000*Pi/4      19.68x         37.46x         27.99x
> 2.0^51         12.00x         13.58x         13.49x
> Inf            1.04x          1.05x          1.12x
> Nan            0.95x          0.87x          0.82x
> ---------------------------------------------------------
>
> COSF
> ---------------------------------------------------------
> Input          ThunderX88     ThunderX99     CortexA57
> ---------------------------------------------------------
> 0.0            1.25x          1.14x          1.17x
> 2.0^-28        1.24x          1.14x          1.13x
> 2.0^-6         1.38x          1.38x          1.85x
> 0.6*Pi/4       1.15x          1.38x          1.69x
> 13*Pi/8        1.65x          1.94x          2.18x
> 17*Pi/8        1.49x          2.05x          2.09x
> 1000*Pi/4      18.98x         38.39x         27.52x
> 2.0^51         11.35x         13.74x         13.47x
> Inf            0.99x          1.02x          1.16x
> Nan            0.88x          0.86x          0.87x
> ---------------------------------------------------------
Thank you for doing this. Please:
1. Explain why you chose these input values as optimization targets and
2. Write a microbenchmark test for glibc
Thanks,
Siddhesh
>
> Ashwin Sekhar T K (2):
> aarch64: Add optimized ASIMD version of sinf
> aarch64: Add optimized ASIMD version of cosf
>
> sysdeps/aarch64/fpu/multiarch/Makefile | 3 +
> sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S | 367 +++++++++++++++++++++++++
> sysdeps/aarch64/fpu/multiarch/s_cosf.c | 31 +++
> sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S | 382 +++++++++++++++++++++++++++
> sysdeps/aarch64/fpu/multiarch/s_sinf.c | 31 +++
> 5 files changed, 814 insertions(+)
> create mode 100644 sysdeps/aarch64/fpu/multiarch/Makefile
> create mode 100644 sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S
> create mode 100644 sysdeps/aarch64/fpu/multiarch/s_cosf.c
> create mode 100644 sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S
> create mode 100644 sysdeps/aarch64/fpu/multiarch/s_sinf.c
>
* [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf
@ 2017-06-13 7:17 Ashwin Sekhar T K
2017-06-13 8:30 ` Siddhesh Poyarekar
2017-06-13 11:07 ` Szabolcs Nagy
0 siblings, 2 replies; 36+ messages in thread
From: Ashwin Sekhar T K @ 2017-06-13 7:17 UTC (permalink / raw)
To: libc-alpha; +Cc: Ashwin Sekhar T K
This patchset adds optimized ASIMD versions of sinf/cosf
for AArch64. The algorithm and code flow are based on the SSE versions
of the same functions in sysdeps/x86_64/fpu.
The ASIMD versions are used only if the CPU supports the ASIMD feature.
The selection uses ifuncs and HWCAP to identify the ASIMD capability.
The patchset was tested using "make check" for the math sub-directory.
The tests were run on Linux 4.4.0-45-generic on a ThunderX88 platform.
The following are the approximate speedups observed over the
existing implementation on different Aarch64 platforms for
different input values.
SINF
---------------------------------------------------------
Input          ThunderX88     ThunderX99     CortexA57
---------------------------------------------------------
0.0            1.88x          1.18x          1.17x
2.0^-28        1.33x          1.12x          1.03x
2.0^-6         1.48x          1.28x          1.27x
0.6*Pi/4       0.94x          1.14x          1.21x
13*Pi/8        1.41x          2.00x          2.16x
17*Pi/8        1.45x          1.93x          2.23x
1000*Pi/4      19.68x         37.46x         27.99x
2.0^51         12.00x         13.58x         13.49x
Inf            1.04x          1.05x          1.12x
Nan            0.95x          0.87x          0.82x
---------------------------------------------------------
COSF
---------------------------------------------------------
Input          ThunderX88     ThunderX99     CortexA57
---------------------------------------------------------
0.0            1.25x          1.14x          1.17x
2.0^-28        1.24x          1.14x          1.13x
2.0^-6         1.38x          1.38x          1.85x
0.6*Pi/4       1.15x          1.38x          1.69x
13*Pi/8        1.65x          1.94x          2.18x
17*Pi/8        1.49x          2.05x          2.09x
1000*Pi/4      18.98x         38.39x         27.52x
2.0^51         11.35x         13.74x         13.47x
Inf            0.99x          1.02x          1.16x
Nan            0.88x          0.86x          0.87x
---------------------------------------------------------
Ashwin Sekhar T K (2):
aarch64: Add optimized ASIMD version of sinf
aarch64: Add optimized ASIMD version of cosf
sysdeps/aarch64/fpu/multiarch/Makefile | 3 +
sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S | 367 +++++++++++++++++++++++++
sysdeps/aarch64/fpu/multiarch/s_cosf.c | 31 +++
sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S | 382 +++++++++++++++++++++++++++
sysdeps/aarch64/fpu/multiarch/s_sinf.c | 31 +++
5 files changed, 814 insertions(+)
create mode 100644 sysdeps/aarch64/fpu/multiarch/Makefile
create mode 100644 sysdeps/aarch64/fpu/multiarch/s_cosf-asimd.S
create mode 100644 sysdeps/aarch64/fpu/multiarch/s_cosf.c
create mode 100644 sysdeps/aarch64/fpu/multiarch/s_sinf-asimd.S
create mode 100644 sysdeps/aarch64/fpu/multiarch/s_sinf.c
--
2.12.2
2017-06-13 10:57 [RFC][PATCH 0/2] aarch64: Add optimized ASIMD versions of sinf/cosf Wilco Dijkstra
2017-06-13 11:06 ` Siddhesh Poyarekar
-- strict thread matches above, loose matches on Subject: below --
2017-06-13 17:49 Wilco Dijkstra
2017-06-13 18:14 ` Siddhesh Poyarekar
2017-06-13 18:49 ` Adhemerval Zanella
2017-06-16 14:39 ` Sekhar, Ashwin
2017-06-19 11:46 ` Wilco Dijkstra
2017-06-19 14:08 ` Ramana Radhakrishnan
2017-06-19 15:26 ` Joseph Myers
2017-06-13 15:25 Wilco Dijkstra
2017-06-13 15:44 ` Joseph Myers
2017-06-13 7:17 Ashwin Sekhar T K
2017-06-13 8:30 ` Siddhesh Poyarekar
2017-06-13 8:39 ` Sekhar, Ashwin
2017-06-13 10:33 ` Sekhar, Ashwin
2017-06-13 10:39 ` Sekhar, Ashwin
2017-06-13 10:48 ` Siddhesh Poyarekar
2017-06-13 11:07 ` Szabolcs Nagy
2017-06-13 11:55 ` Siddhesh Poyarekar
2017-06-13 13:28 ` Szabolcs Nagy
2017-06-13 14:15 ` Adhemerval Zanella
2017-06-13 14:19 ` Siddhesh Poyarekar
2017-06-13 16:53 ` Adhemerval Zanella
2017-06-13 17:49 ` Joseph Myers
2017-06-23 8:56 ` Sekhar, Ashwin
2017-06-23 9:43 ` Siddhesh Poyarekar
2017-06-23 9:50 ` Sekhar, Ashwin
2017-06-23 10:48 ` Siddhesh Poyarekar
2017-06-13 14:37 ` Joseph Myers
2017-06-13 12:56 ` Sekhar, Ashwin
2017-06-13 13:24 ` Szabolcs Nagy
2017-06-13 14:20 ` Adhemerval Zanella
2017-06-13 14:46 ` Joseph Myers
2017-06-13 14:40 ` Joseph Myers
2017-06-23 10:49 ` Sekhar, Ashwin
2017-06-23 10:52 ` Szabolcs Nagy