* nextafter() about an order of magnitude slower than trivial implementation
@ 2021-08-16 16:03 Stefan Kanthak
2021-08-18 12:51 ` Szabolcs Nagy
0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-16 16:03 UTC (permalink / raw)
To: libc-help
Hi,
to test the accuracy of (elementary) floating-point functions
against a higher precision reference I use the following code:
double input = 1.0, output = exp(input), reference = M_E;
if (output == reference)
printf("correctly rounded\n");
else if (nextafter(output, reference) == reference)
printf("faithfully rounded\n");
else
printf("...");
Testing and benchmarking an exponential function that consumes
about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
itself is DEAD SLOW: it consumes about 8.5ns/call!
The following (quick&dirty) implementation consumes 0.7ns/call,
i.e. is about an order of magnitude faster:
--- after.c ---
double nextafter(double from, double to)
{
if (from == to)
return to;
if (to != to)
return to;
if (from != from)
return from;
if (from == 0.0)
return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;
unsigned long long ull = *(unsigned long long *) &from;
if ((from < to) == (from < 0.0))
ull--;
else
ull++;
return 0.0 + *(double *) &ull;
}
--- EOF ---
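The pointer casts in the quick&dirty version above violate strict aliasing; a sketch of an aliasing-safe equivalent moves the bits through memcpy instead (which compilers lower to a plain register move). The name `nextafter_bits` is chosen here to avoid clashing with the libc symbol:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Aliasing-safe sketch of the same algorithm: same special cases,
   but the double<->uint64_t moves go through memcpy. */
double nextafter_bits(double from, double to)
{
    if (from == to)
        return to;
    if (to != to)           /* to is NaN */
        return to;
    if (from != from)       /* from is NaN */
        return from;
    if (from == 0.0)
        return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;

    uint64_t bits;
    memcpy(&bits, &from, sizeof bits);
    /* On the raw IEEE-754 bit pattern, stepping toward zero means
       decrementing the payload; stepping away means incrementing.
       The condition below folds the sign and direction tests. */
    if ((from < to) == (from < 0.0))
        bits--;
    else
        bits++;
    memcpy(&from, &bits, sizeof from);
    return 0.0 + from;      /* the add is kept from the original */
}
```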
JFTR: the code generated by GCC for this function is FAR from
optimal; the assembly implementation shown below is more
than 2x faster and consumes <0.3ns/call, i.e. is about
30x faster than glibc's nextafter()!
The data from 'perf stat' show that glibc's nextafter() executes
about 30 instructions more than the above trivial implementation,
and that it is NOT well suited for modern super-scalar processors.
Stefan
PS: test program and logs
[stefan@rome ~]$ cat bench.c
// Copyright (C) 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
__attribute__((noinline))
double nop(double foo, double bar)
{
return foo + bar;
}
inline static
double lfsr64(void)
{
// 64-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
// initialised with 2**64 / golden ratio
static uint64_t lfsr = 0x9E3779B97F4A7C15;
const uint64_t sign = (int64_t) lfsr >> 63;
lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
return *(double *) &lfsr;
}
inline static
double random64(void)
{
static uint64_t seed = 0x0123456789ABCDEF;
seed = seed * 6364136223846793005 + 1442695040888963407;
return *(double *) &seed;
}
int main(void)
{
clock_t t0, t1, t2, tt;
uint32_t n;
volatile double result;
t0 = clock();
for (n = 500000000u; n > 0u; n--) {
result = nop(lfsr64(), 0.0);
result = nop(random64(), 1.0 / 0.0);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--) {
result = nextafter(lfsr64(), 0.0);
result = nextafter(random64(), 1.0 / 0.0);
}
t2 = clock();
tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;
printf("\n"
"nop() %4lu.%06lu 0\n"
"nextafter() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
[stefan@rome ~]$ gcc -O3 -lm bench.c
[stefan@rome ~]$ perf stat ./a.out
nop() 1.480000 0
nextafter() 10.060000 8.580000
11.540000 nano-seconds
Performance counter stats for './a.out':
11,548.78 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
145 page-faults:u # 0.013 K/sec
38,917,213,536 cycles:u # 3.370 GHz (83.33%)
~~~~~~~~~~~~~~
15,647,656,615 stalled-cycles-frontend:u # 40.21% frontend cycles idle (83.33%)
10,746,044,422 stalled-cycles-backend:u # 27.61% backend cycles idle (83.33%)
69,739,403,870 instructions:u # 1.79 insn per cycle
~~~~~~~~~~~~~~ ~~~~
# 0.22 stalled cycles per insn (83.33%)
16,495,748,110 branches:u # 1428.354 M/sec (83.33%)
500,701,246 branch-misses:u # 3.04% of all branches (83.33%)
11.550414029 seconds time elapsed
11.548265000 seconds user
0.000999000 seconds sys
[stefan@rome ~]$ gcc -O3 bench.c after.c
[stefan@rome ~]$ perf stat ./a.out
nop() 1.490000 0
nextafter() 2.210000 0.720000
3.700000 nano-seconds
Performance counter stats for './a.out':
3,702.89 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
122 page-faults:u # 0.033 K/sec
12,407,345,183 cycles:u # 3.351 GHz (83.32%)
~~~~~~~~~~~~~~
135,817 stalled-cycles-frontend:u # 0.00% frontend cycles idle (83.34%)
5,498,895,906 stalled-cycles-backend:u # 44.32% backend cycles idle (83.34%)
38,002,430,460 instructions:u # 3.06 insn per cycle
~~~~~~~~~~~~~~ ~~~~
# 0.14 stalled cycles per insn (83.34%)
7,497,381,393 branches:u # 2024.735 M/sec (83.34%)
497,462 branch-misses:u # 0.01% of all branches (83.32%)
3.703648875 seconds time elapsed
3.703294000 seconds user
0.000000000 seconds sys
[stefan@rome ~]$ cat after.s
# Copyright © 2004-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
.arch generic64
.code64
.intel_syntax noprefix
.text
# xmm0 = from
# xmm1 = to
nextafter:
xorpd xmm2, xmm2 # xmm2 = 0.0
ucomisd xmm1, xmm2 # CF = (to < 0.0)
jp .Lto # to = NAN?
sbb rax, rax # rax = (to < 0.0) ? -1 : 0
ucomisd xmm0, xmm1 # CF = (from < to)
jp .Lfrom # from = NAN?
je .Lto # from = to?
.Lnotequal:
sbb rcx, rcx # rcx = (from < to) ? -1 : 0
ucomisd xmm0, xmm2 # CF = (from < 0.0)
jz .Lzero # from = 0.0?
.Lnext:
movq rdx, xmm0 # rdx = from
sbb rax, rax # rax = (from < 0.0) ? -1 : 0
xor rax, rcx # rax = (from < 0.0) = (from < to) ? 0 : -1
or rax, 1 # rax = (from < 0.0) = (from < to) ? 1 : -1
sub rdx, rax
movq xmm0, rdx
addsd xmm0, xmm2
ret
.Lzero:
shl rax, 63 # rax = (to < -0.0) ? 0x8000000000000000 : 0
or rax, 1 # rax = (to < -0.0) ? 0x8000000000000001 : 1
movq xmm0, rax # xmm0 = (to < -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
addsd xmm0, xmm2
ret
.Lto:
movsd xmm0, xmm1 # xmm0 = to
.Lfrom:
ret
.size nextafter, .-nextafter
.type nextafter, @function
.global nextafter
.end
[stefan@rome ~]$ perf stat ./a.out
nop() 1.630000 0
nextafter() 1.910000 0.280000
3.540000 nano-seconds
Performance counter stats for './a.out':
3,547.12 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
123 page-faults:u # 0.035 K/sec
11,949,840,797 cycles:u # 3.369 GHz (83.32%)
~~~~~~~~~~~~~~
129,627 stalled-cycles-frontend:u # 0.00% frontend cycles idle (83.34%)
3,998,960,716 stalled-cycles-backend:u # 33.46% backend cycles idle (83.34%)
37,493,523,285 instructions:u # 3.14 insn per cycle
~~~~~~~~~~~~~~ ~~~~
# 0.11 stalled cycles per insn (83.34%)
7,998,559,192 branches:u # 2254.945 M/sec (83.34%)
497,565 branch-misses:u # 0.01% of all branches (83.32%)
3.547999008 seconds time elapsed
3.546671000 seconds user
0.000999000 seconds sys
[stefan@rome ~]$

* Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-16 16:03 nextafter() about an order of magnitude slower than trivial implementation Stefan Kanthak
@ 2021-08-18 12:51 ` Szabolcs Nagy
2021-08-18 17:11 ` Stefan Kanthak
0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2021-08-18 12:51 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: libc-help
The 08/16/2021 18:03, Stefan Kanthak wrote:
> Testing and benchmarking an exponential function that consumes
> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
> itself is DEAD SLOW: it consumes about 8.5ns/call!
>
> The following (quick&dirty) implementation consumes 0.7ns/call,
> i.e. is about an order of magnitude faster:
correctly measuring latency on a modern x86_64 core:
musl: 3.16 ns
glibc: 5.68 ns
your: 5.72 ns
i.e. your code is slower.
(this is the correctly predicted hot path latency
measuring for(i=0;i<n;i++) r = nextafter(r,0); )
it is more difficult to measure the case when latency
is not the bottleneck, since then it depends on what
code is being executed alongside nextafter. your code
likely wins in that case, as it has fewer instructions.
what you see in your benchmark is the effect of
branch mispredicts in the nextafter(x, INFINITY)
case: your code avoids branching on the sign of x,
while glibc branches and misses half the time with
random x inputs. (it may be possible to fix glibc to
avoid that branch, and then it would be fast in your
measurement too on a big ooo core)
(your code also does not handle fenv correctly and
has aliasing violations, but these can be fixed)
* Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-18 12:51 ` Szabolcs Nagy
@ 2021-08-18 17:11 ` Stefan Kanthak
2021-08-19 11:20 ` Adhemerval Zanella
0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-18 17:11 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-help
Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> The 08/16/2021 18:03, Stefan Kanthak wrote:
>> Testing and benchmarking an exponential function that consumes
>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>
>> The following (quick&dirty) implementation consumes 0.7ns/call,
>> i.e. is about an order of magnitude faster:
>
> correctly measuring latency on a modern x86_64 core:
>
> musl: 3.16 ns
> glibc: 5.68 ns
> your: 5.72 ns
My benchmark measures (reciprocal) throughput, NOT latency.
As long as a routine like nextafter() is NOT called back-to-back,
but alongside other code, latency doesn't really matter.
> i.e. your code is slower.
The overall execution time and the numbers from 'perf stat' show
the contrary: the program executes overall
- 70 billion instructions in 38.9 billion cycles with glibc's
implementation,
- 38 billion instructions in 12.4 billion cycles with my
(quick&dirty) C implementation,
- 37.5 billion instructions in 12 billion cycles with my
assembly implementation,
- 39.8 billion instructions in 22 billion cycles with musl's
original implementation,
- 40.5 billion instructions in 11.7 billion cycles with my
patch applied to musl's implementation.
The OVERALL cycle counts (in billions) are
38.9 (glibc) > 22 (musl) > 12.4 (C) > 12 (assembly) > 11.7 (musl patched)
Subtract the (constant but unknown, not measured) number of
cycles for the program itself: I estimate the overhead to be
about 4..6 billion cycles.
glibc's implementation consumes about 2.5x the number of cycles
of musl's implementation, and about 5x the number of cycles of
a proper implementation.
> (this is the correctly predicted hot path latency
> measuring for(i=0;i<n;i++) r = nextafter(r,0); )
See above: I'm interested in throughput, NOT latency.
> it is more difficult to measure the case when latency
> is not the bottleneck, since then it depends on what
> code is being executed along nextafter. your code
> likely wins in that case as it has less instructions.
Correct. And that's what counts in practice, as shown by
'perf stat' and the number of consumed cycles!
> what you see in your benchmark is the effect of
> branch mispredicts in the nextafter(x, INFINITY)
> case: your code avoids branching on the sign of x,
> while glibc branches and misses half the time with
> random x inputs. (it maybe possible to fix glibc to
> avoid that branch and then it would be fast in your
> measurement too on a big ooo core)
It's not just mispredicted branches: the program executes
- 16.5 billion branches (0.5 B(!)illion mispredicted) with
glibc's implementation,
- 7.5 billion branches (0.5 M(!)illion mispredicted) with my
C implementation,
- 8 billion branches (0.5 M(!)illion mispredicted) with my
assembly implementation,
- 9.7 billion branches (0.25 B(!)illion mispredicted) with
musl's original implementation,
- 8.5 billion branches (0.5 M(!)illion mispredicted) with my
patch applied to musl's implementation.
2 billion branches are executed for the loops and should
be subtracted to get the net numbers per call of nextafter():
14.5 (glibc) > 9.7 (musl) > 8.5 (musl patched) > 8 (assembly) > 7.5 (C)
glibc's implementation executes almost twice the number of
branches of the other implementations.
The mispredicted branches:
0.5 (glibc) > 0.25 (musl) > 0.0001 (the proper implementations)
> (your code also does not handle fenv correctly and
> has aliasing violations, but these can be fixed)
I explicitly labelled the C implementation QUICK&DIRTY.
The assembly code is correct in both respects.
Stefan
* Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-18 17:11 ` Stefan Kanthak
@ 2021-08-19 11:20 ` Adhemerval Zanella
2021-08-19 17:57 ` [PATCH] " Stefan Kanthak
2021-08-20 9:52 ` [PATCH/2nd version] " Stefan Kanthak
0 siblings, 2 replies; 10+ messages in thread
From: Adhemerval Zanella @ 2021-08-19 11:20 UTC (permalink / raw)
To: Stefan Kanthak, Szabolcs Nagy; +Cc: libc-help
On 18/08/2021 14:11, Stefan Kanthak wrote:
> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>
>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>> Testing and benchmarking an exponential function that consumes
>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>
>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>> i.e. is about an order of magnitude faster:
>>
>> correctly measuring latency on a modern x86_64 core:
>>
>> musl: 3.16 ns
>> glibc: 5.68 ns
>> your: 5.72 ns
Thanks for bringing this up; if you want to contribute a patch, please
follow the Contribution checklist [1]. We recently dropped the
requirement of FSF copyright assignment, so you can use a
DCO-like certification on the patches.
To change the current implementation, I suggest you also provide
a benchmark using the glibc benchmark framework. Some math functions
provide both latency and reciprocal-throughput information, and
with both numbers we can evaluate whether the new implementation is
indeed better on different machines.
I would just like to ask you to keep the tone respectful and be open
to suggestions and criticism, so you do not repeat the same derailed
thread as on the musl mailing list [2]. Szabolcs and Wilco did an
excellent job on some of the newer math function implementations; you
might read the old thread to see what their approach was.
[1] https://sourceware.org/glibc/wiki/Contribution%20checklist
[2] https://www.openwall.com/lists/musl/2021/08/15/18
* [PATCH] Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-19 11:20 ` Adhemerval Zanella
@ 2021-08-19 17:57 ` Stefan Kanthak
2021-08-20 9:52 ` [PATCH/2nd version] " Stefan Kanthak
1 sibling, 0 replies; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-19 17:57 UTC (permalink / raw)
To: Szabolcs Nagy, Adhemerval Zanella; +Cc: libc-help, libc-alpha
[-- Attachment #1: Type: text/plain, Size: 3645 bytes --]
Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> On 18/08/2021 14:11, Stefan Kanthak wrote:
>> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>>
>>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>>> Testing and benchmarking an exponential function that consumes
>>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>>
>>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>>> i.e. is about an order of magnitude faster:
>>>
>>> correctly measuring latency on a modern x86_64 core:
>>>
>>> musl: 3.16 ns
>>> glibc: 5.68 ns
>>> your: 5.72 ns
>
> Thanks for bringing this up, if you want to contribute a patch
See attachment.
For completeness' sake and to ease initial review, I post the full
source here too: without ALTERNATE defined, it uses integer arithmetic
to the extent possible, like the original.
Choose whatever you like best or gives better throughput/latency.
Stefan
JFTR: .../sysdeps/ieee754/flt-32/s_nextafterf.c as well as
.../sysdeps/i386/fpu/s_nextafterl.c will need the same changes.
--- .../math/s_nextafter.c ---
/* IEEE functions
* nextafter(x,y)
* return the next machine floating-point number of x in the
* direction toward y.
* Special cases:
*/
/* Ugly hack so that the aliasing works. */
#define __nexttoward __internal___nexttoward
#define nexttoward __internal_nexttoward
#include <errno.h>
#include <math.h>
#include <math-barriers.h>
#include <math_private.h>
#include <float.h>
#include <libm-alias-double.h>
double __nextafter(double x, double y)
{
union {
double d;
int64_t s;
uint64_t u;
} xx = {x}, yy = {y};
uint64_t ax = xx.u + xx.u; /* |x|<<1 */
uint64_t ay = yy.u + yy.u; /* |y|<<1 */
if(ax > 0x7ffull << 53) /* x is nan */
return x;
if(ay > 0x7ffull << 53) /* y is nan */
return y;
if(x == y) return y; /* x=y, return y */
if(ax == 0ull) { /* x == 0 */
xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
xx.d = math_opt_barrier (xx.d);
xx.d += 0.0;
math_force_eval (xx.d); /* raise underflow flag */
return xx.d; /* return +-minsubnormal */
}
#ifdef ALTERNATE
if((x < y) == (x < 0.0))
--xx.u;
else
++xx.u;
#else
if(xx.s < 0) { /* x < 0 */
if(yy.s < 0 && xx.s < yy.s)
++xx.s; /* x < 0, x > y, x += ulp */
else
--xx.s; /* x < 0, x < y, x -= ulp */
} else { /* x > 0 */
if(xx.s > yy.s)
--xx.s; /* x > 0, x > y, x -= ulp */
else
++xx.s; /* x > 0, x < y, x += ulp */
}
#endif
ax = xx.u + xx.u; /* |x|<<1 */
if((ax >= 0x7ffull << 53) /* overflow */
|| (ax < 0x001ull << 53)) { /* underflow */
xx.d = math_opt_barrier (xx.d);
xx.d += 0.0;
math_force_eval (xx.d); /* raise overflow/underflow flag */
__set_errno (ERANGE);
}
return xx.d;
}
libm_alias_double (__nextafter, nextafter)
#ifdef NO_LONG_DOUBLE
strong_alias (__nextafter, __nexttowardl)
weak_alias (__nexttowardl, nexttowardl)
#undef __nexttoward
strong_alias (__nextafter, __nexttoward)
#undef nexttoward
weak_alias (__nextafter, nexttoward)
#endif
--- EOF ---
[-- Attachment #2: glibc.patch --]
[-- Type: application/octet-stream, Size: 3533 bytes --]
--- -/math/s_nextafter.c
+++ +/math/s_nextafter.c
@@ -1,19 +1,3 @@
-/* @(#)s_nextafter.c 5.1 93/09/24 */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
- */
-
-#if defined(LIBM_SCCS) && !defined(lint)
-static char rcsid[] = "$NetBSD: s_nextafter.c,v 1.8 1995/05/10 20:47:58 jtc Exp $";
-#endif
-
/* IEEE functions
* nextafter(x,y)
* return the next machine floating-point number of x in the
@@ -34,55 +18,53 @@
double __nextafter(double x, double y)
{
- int32_t hx,hy,ix,iy;
- uint32_t lx,ly;
+ union {
+ double d;
+ int64_t s;
+ uint64_t u;
+ } xx = {x}, yy = {y};
- EXTRACT_WORDS(hx,lx,x);
- EXTRACT_WORDS(hy,ly,y);
- ix = hx&0x7fffffff; /* |x| */
- iy = hy&0x7fffffff; /* |y| */
+ uint64_t ax = xx.u + xx.u; /* |x|<<1 */
+ uint64_t ay = yy.u + yy.u; /* |y|<<1 */
- if(((ix>=0x7ff00000)&&((ix-0x7ff00000)|lx)!=0) || /* x is nan */
- ((iy>=0x7ff00000)&&((iy-0x7ff00000)|ly)!=0)) /* y is nan */
- return x+y;
- if(x==y) return y; /* x=y, return y */
- if((ix|lx)==0) { /* x == 0 */
- double u;
- INSERT_WORDS(x,hy&0x80000000,1); /* return +-minsubnormal */
- u = math_opt_barrier (x);
- u = u*u;
- math_force_eval (u); /* raise underflow flag */
+ if(ax > 0x7ffull << 53) /* x is nan */
return x;
+ if(ay > 0x7ffull << 53) /* y is nan */
+ return y;
+ if(x == y) return y; /* x=y, return y */
+ if(ax == 0ull) { /* x == 0 */
+ xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
+ xx.d = math_opt_barrier (xx.d);
+ xx.d += 0.0;
+ math_force_eval (xx.d); /* raise underflow flag */
+ return xx.d; /* return +-minsubnormal */
}
- if(hx>=0) { /* x > 0 */
- if(hx>hy||((hx==hy)&&(lx>ly))) { /* x > y, x -= ulp */
- if(lx==0) hx -= 1;
- lx -= 1;
- } else { /* x < y, x += ulp */
- lx += 1;
- if(lx==0) hx += 1;
- }
- } else { /* x < 0 */
- if(hy>=0||hx>hy||((hx==hy)&&(lx>ly))){/* x < y, x -= ulp */
- if(lx==0) hx -= 1;
- lx -= 1;
- } else { /* x > y, x += ulp */
- lx += 1;
- if(lx==0) hx += 1;
- }
+#ifdef ALTERNATE
+ if ((x < y) == (x < 0.0))
+ --xx.u;
+ else
+ ++xx.u;
+#else
+ if(xx.s < 0) { /* x < 0 */
+ if(yy.s < 0 && xx.s < yy.s)
+ ++xx.s; /* x < 0, x > y, x += ulp */
+ else
+ --xx.s; /* x < 0, x < y, x -= ulp */
+ } else { /* x > 0 */
+ if(xx.s > yy.s)
+ --xx.s; /* x > 0, x > y, x -= ulp */
+ else
+ ++xx.s; /* x > 0, x < y, x += ulp */
}
- hy = hx&0x7ff00000;
- if(hy>=0x7ff00000) {
- double u = x+x; /* overflow */
- math_force_eval (u);
+#endif
+ ax = xx.u + xx.u; /* |x|<<1 */
+ if((ax >= 0x7ffull << 53) /* overflow */
+ || (ax < 0x001ull << 53)) { /* underflow */
+ xx.d = math_opt_barrier (xx.d);
+ xx.d += 0.0;
+ math_force_eval (xx.d); /* raise overflow/underflow flag */
__set_errno (ERANGE);
}
- if(hy<0x00100000) {
- double u = x*x; /* underflow */
- math_force_eval (u); /* raise underflow flag */
- __set_errno (ERANGE);
- }
- INSERT_WORDS(x,hx,lx);
return xx.d;
}
libm_alias_double (__nextafter, nextafter)
* [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-19 11:20 ` Adhemerval Zanella
2021-08-19 17:57 ` [PATCH] " Stefan Kanthak
@ 2021-08-20 9:52 ` Stefan Kanthak
2021-08-20 16:55 ` Joseph Myers
1 sibling, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-20 9:52 UTC (permalink / raw)
To: Szabolcs Nagy, Adhemerval Zanella; +Cc: libc-help, libc-alpha
[-- Attachment #1: Type: text/plain, Size: 3682 bytes --]
Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> On 18/08/2021 14:11, Stefan Kanthak wrote:
>> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>>
>>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>>> Testing and benchmarking an exponential function that consumes
>>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>>
>>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>>> i.e. is about an order of magnitude faster:
>>>
>>> correctly measuring latency on a modern x86_64 core:
>>>
>>> musl: 3.16 ns
>>> glibc: 5.68 ns
>>> your: 5.72 ns
>
> Thanks for bringing this up, if you want to contribute a patch
See attachment.
For completeness' sake, and to ease initial review, I post the full
source here too: without ALTERNATE defined, it uses integer arithmetic
to the extent possible, like the original.
Choose whichever you like best or whichever gives better throughput/latency.
Changes from the 1st version of the patch:
- leftover FP comparison substituted with integer operations;
- increment/decrement logic simplified/unified, avoiding a
conditional branch.
I expect the compiler to detect the common clause
"&& ((xx.s < 0) == (yy.s < 0)))" introduced with these changes.
Stefan
JFTR: .../sysdeps/ieee754/flt-32/s_nextafterf.c as well as
.../sysdeps/i386/fpu/s_nextafterl.c will need the same changes.
--- .../math/s_nextafter.c ---
/* IEEE functions
 *	nextafter(x,y)
 *	return the next machine floating-point number of x in the
 *	direction toward y.
 *	Special cases:
 */
/* Ugly hack so that the aliasing works. */
#define __nexttoward __internal___nexttoward
#define nexttoward __internal_nexttoward
#include <errno.h>
#include <math.h>
#include <math-barriers.h>
#include <math_private.h>
#include <float.h>
#include <libm-alias-double.h>
double __nextafter(double x, double y)
{
    union {
        double d;
        int64_t s;
        uint64_t u;
    } xx = {x}, yy = {y};

    uint64_t ax = xx.u + xx.u;    /* |x|<<1 */
    uint64_t ay = yy.u + yy.u;    /* |y|<<1 */

    if(ax > 0x7ffull << 53)       /* x is nan */
        return x;
    if(ay > 0x7ffull << 53)       /* y is nan */
        return y;
#ifdef ALTERNATE
    if(x == y) return y;          /* x=y, return y */
#else
    if((ax == ay)
       && ((xx.s < 0) == (yy.s < 0)))
        return y;
#endif
    if(ax == 0) {                 /* x == 0 */
        xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
        xx.d = math_opt_barrier (xx.d);
        xx.d += 0.0;
        math_force_eval (xx.d);   /* raise underflow flag */
        return xx.d;              /* return +-minsubnormal */
    }
#ifdef ALTERNATE
    if((x < y) == (x < 0.0))
        --xx.u;
    else
        ++xx.u;
#else
    if((ax < ay)
       && ((xx.s < 0) == (yy.s < 0)))
        ++xx.u;
    else
        --xx.u;
#endif
    ax = xx.u + xx.u;             /* |x|<<1 */
    if((ax >= 0x7ffull << 53)      /* overflow */
       || (ax < 0x001ull << 53)) { /* underflow */
        xx.d = math_opt_barrier (xx.d);
        xx.d += 0.0;
        math_force_eval (xx.d);   /* raise overflow/underflow flag */
        __set_errno (ERANGE);
    }
    return xx.d;
}
libm_alias_double (__nextafter, nextafter)
#ifdef NO_LONG_DOUBLE
strong_alias (__nextafter, __nexttowardl)
weak_alias (__nexttowardl, nexttowardl)
#undef __nexttoward
strong_alias (__nextafter, __nexttoward)
#undef nexttoward
weak_alias (__nextafter, nexttoward)
#endif
--- EOF ---
[-- Attachment #2: glibc.patch --]
[-- Type: application/octet-stream, Size: 3447 bytes --]
--- -/math/s_nextafter.c
+++ +/math/s_nextafter.c
@@ -1,19 +1,3 @@
-/* @(#)s_nextafter.c 5.1 93/09/24 */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
- */
-
-#if defined(LIBM_SCCS) && !defined(lint)
-static char rcsid[] = "$NetBSD: s_nextafter.c,v 1.8 1995/05/10 20:47:58 jtc Exp $";
-#endif
-
/* IEEE functions
* nextafter(x,y)
* return the next machine floating-point number of x in the
@@ -34,56 +18,55 @@
double __nextafter(double x, double y)
{
- int32_t hx,hy,ix,iy;
- uint32_t lx,ly;
+ union {
+ double d;
+ int64_t s;
+ uint64_t u;
+ } xx = {x}, yy = {y};
- EXTRACT_WORDS(hx,lx,x);
- EXTRACT_WORDS(hy,ly,y);
- ix = hx&0x7fffffff; /* |x| */
- iy = hy&0x7fffffff; /* |y| */
+ uint64_t ax = xx.u + xx.u; /* |x|<<1 */
+ uint64_t ay = yy.u + yy.u; /* |y|<<1 */
- if(((ix>=0x7ff00000)&&((ix-0x7ff00000)|lx)!=0) || /* x is nan */
- ((iy>=0x7ff00000)&&((iy-0x7ff00000)|ly)!=0)) /* y is nan */
- return x+y;
- if(x==y) return y; /* x=y, return y */
- if((ix|lx)==0) { /* x == 0 */
- double u;
- INSERT_WORDS(x,hy&0x80000000,1); /* return +-minsubnormal */
- u = math_opt_barrier (x);
- u = u*u;
- math_force_eval (u); /* raise underflow flag */
+ if(ax > 0x7ffull << 53) /* x is nan */
return x;
+ if(ay > 0x7ffull << 53) /* y is nan */
+ return y;
+#ifdef ALTERNATE
+ if(x == y) return y; /* x=y, return y */
+#else
+ if((ax == ay)
+ && ((xx.s < 0) == (yy.s < 0)))
+ return y;
+#endif
+ if(ax == 0) { /* x == 0 */
+ xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
+ xx.d = math_opt_barrier (xx.d);
+ xx.d += 0.0;
+ math_force_eval (xx.d); /* raise underflow flag */
+ return xx.d; /* return +-minsubnormal */
}
- if(hx>=0) { /* x > 0 */
- if(hx>hy||((hx==hy)&&(lx>ly))) { /* x > y, x -= ulp */
- if(lx==0) hx -= 1;
- lx -= 1;
- } else { /* x < y, x += ulp */
- lx += 1;
- if(lx==0) hx += 1;
- }
- } else { /* x < 0 */
- if(hy>=0||hx>hy||((hx==hy)&&(lx>ly))){/* x < y, x -= ulp */
- if(lx==0) hx -= 1;
- lx -= 1;
- } else { /* x > y, x += ulp */
- lx += 1;
- if(lx==0) hx += 1;
- }
+#ifdef ALTERNATE
+ if((x < y) == (x < 0.0))
+ --xx.u;
+ else
+ ++xx.u;
}
- hy = hx&0x7ff00000;
- if(hy>=0x7ff00000) {
- double u = x+x; /* overflow */
- math_force_eval (u);
- __set_errno (ERANGE);
+#else
+ if((ax < ay)
+ && ((xx.s < 0) == (yy.s < 0)))
+ ++xx.u;
+ else
+ --xx.u;
+#endif
+ ax = xx.u + xx.u; /* |x|<<1 */
+ if((ax >= 0x7ffull << 53) /* overflow */
+ || (ax < 0x001ull << 53)) { /* underflow */
+ xx.d = math_opt_barrier (xx.d);
+ xx.d += 0.0;
+ math_force_eval (xx.d); /* raise overflow/underflow flag */
+ __set_errno (ERANGE);
}
- if(hy<0x00100000) {
- double u = x*x; /* underflow */
- math_force_eval (u); /* raise underflow flag */
- __set_errno (ERANGE);
- }
- INSERT_WORDS(x,hx,lx);
- return x;
+ return xx.d;
}
libm_alias_double (__nextafter, nextafter)
#ifdef NO_LONG_DOUBLE
* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-20 9:52 ` [PATCH/2nd version] " Stefan Kanthak
@ 2021-08-20 16:55 ` Joseph Myers
2021-08-20 20:19 ` Stefan Kanthak
0 siblings, 1 reply; 10+ messages in thread
From: Joseph Myers @ 2021-08-20 16:55 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Szabolcs Nagy, Adhemerval Zanella, libc-help, libc-alpha
On Fri, 20 Aug 2021, Stefan Kanthak wrote:
> if(ax > 0x7ffull << 53) /* x is nan */
> return x;
> if(ay > 0x7ffull << 53) /* y is nan */
> return y;
How has this been tested? I'd have expected it to fail the nextafter
tests in the glibc testsuite (libm-test-nextafter.inc), because they
verify that sNaN arguments produce a qNaN result with the "invalid"
exception raised, and this looks like it would just return an sNaN
argument unchanged.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-20 16:55 ` Joseph Myers
@ 2021-08-20 20:19 ` Stefan Kanthak
2021-08-20 21:03 ` Joseph Myers
2021-08-23 12:50 ` Adhemerval Zanella
0 siblings, 2 replies; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-20 20:19 UTC (permalink / raw)
To: Joseph Myers; +Cc: Szabolcs Nagy, Adhemerval Zanella, libc-help, libc-alpha
"Joseph Myers" <joseph@codesourcery.com> wrote:
> On Fri, 20 Aug 2021, Stefan Kanthak wrote:
>
>> if(ax > 0x7ffull << 53) /* x is nan */
>> return x;
>> if(ay > 0x7ffull << 53) /* y is nan */
>> return y;
>
> How has this been tested? I'd have expected it to fail the nextafter
> tests in the glibc testsuite (libm-test-nextafter.inc), because they
> verify that sNaN arguments produce a qNaN result with the "invalid"
> exception raised, and this looks like it would just return an sNaN
> argument unchanged.
It doesn't just look like it would; it REALLY does!
As I explicitly wrote, my changes avoid FP operations where possible.
<https://pubs.opengroup.org/onlinepubs/9699919799/functions/nextafter.html>
| If x or y is NaN, a NaN shall be returned.
I chose to return the argument NaN, which both POSIX and ISO C allow.
If my memory serves me well, one of the former editions even stated
"If an argument is NaN, this argument shall be returned". This may
well be influenced/spoiled by
<https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/nextafter-functions>
| If either x or y is a NAN, then the return value is one of the input NANs.
If that bothers you, change these 4 lines to either
if(ax > 0x7ffull << 53) /* x is nan */
return 0.0 + x;
if(ay > 0x7ffull << 53) /* y is nan */
return 0.0 + y;
(the addition may eventually need to be guarded against optimisation) or
if(ax > 0x7ffull << 53 /* x is nan */
|| ay > 0x7ffull << 53) /* y is nan */
return x + y;
whatever you like best.
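The guarding alluded to above can be done with a barrier. glibc uses math_opt_barrier internally; a portable stand-in (illustrative only, not glibc's implementation) is a volatile round-trip:

```c
#include <math.h>

/* Illustrative stand-in for glibc's math_opt_barrier: the volatile
   round-trip makes the value opaque to the optimiser, so the
   subsequent addition is actually performed at run time (quieting an
   sNaN and raising FE_INVALID) instead of being folded away. */
static inline double opt_barrier(double x)
{
    volatile double v = x;
    return v;
}

/* Guarded form of the "return 0.0 + x" alternative from the mail above. */
double quiet_ret(double x)
{
    return opt_barrier(x) + 0.0;
}
```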
Stefan
* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-20 20:19 ` Stefan Kanthak
@ 2021-08-20 21:03 ` Joseph Myers
2021-08-23 12:50 ` Adhemerval Zanella
1 sibling, 0 replies; 10+ messages in thread
From: Joseph Myers @ 2021-08-20 21:03 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Szabolcs Nagy, libc-help, libc-alpha
On Fri, 20 Aug 2021, Stefan Kanthak wrote:
> I chose to return the argument NaN, which both POSIX and ISO C allow.
glibc follows C23 Annex F FE_SNANS_ALWAYS_SIGNAL rules for signaling NaN
arguments.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
2021-08-20 20:19 ` Stefan Kanthak
2021-08-20 21:03 ` Joseph Myers
@ 2021-08-23 12:50 ` Adhemerval Zanella
1 sibling, 0 replies; 10+ messages in thread
From: Adhemerval Zanella @ 2021-08-23 12:50 UTC (permalink / raw)
To: Stefan Kanthak, Joseph Myers; +Cc: Szabolcs Nagy, libc-help, libc-alpha
On 20/08/2021 17:19, Stefan Kanthak wrote:
> "Joseph Myers" <joseph@codesourcery.com> wrote:
>
>> On Fri, 20 Aug 2021, Stefan Kanthak wrote:
>>
>>> if(ax > 0x7ffull << 53) /* x is nan */
>>> return x;
>>> if(ay > 0x7ffull << 53) /* y is nan */
>>> return y;
>>
>> How has this been tested? I'd have expected it to fail the nextafter
>> tests in the glibc testsuite (libm-test-nextafter.inc), because they
>> verify that sNaN arguments produce a qNaN result with the "invalid"
>> exception raised, and this looks like it would just return an sNaN
>> argument unchanged.
>
> It doesn't just look like it would; it REALLY does!
> As I explicitly wrote, my changes avoid FP operations where possible.
>
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/nextafter.html>
>
> | If x or y is NaN, a NaN shall be returned.
>
> I chose to return the argument NaN, which both POSIX and ISO C allow.
> If my memory serves me well, one of the former editions even stated
> "If an argument is NaN, this argument shall be returned". This may
> well be influenced/spoiled by
> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/nextafter-functions>
>
> | If either x or y is a NAN, then the return value is one of the input NANs.
>
>
> If that bothers you, change these 4 lines to either
>
> if(ax > 0x7ffull << 53) /* x is nan */
> return 0.0 + x;
> if(ay > 0x7ffull << 53) /* y is nan */
> return 0.0 + y;
>
> (the addition may eventually need to be guarded against optimisation) or
>
> if(ax > 0x7ffull << 53 /* x is nan */
> || ay > 0x7ffull << 53) /* y is nan */
> return x + y;
Sorry, but sending half-baked patches, where one still needs to
evaluate and decide which strategy is best, is not the way to get
this change accepted.
Please do follow the contributor checklist [1] and the suggestions I
gave you when you asked for input on libc-help:
* Check for regressions by testing your patch against the glibc
testsuite (make check). As Joseph has pointed out, your patch
triggers a lot of regressions:
$ grep ^FAIL math/subdir-tests.sum
FAIL: math/bug-nextafter
FAIL: math/test-double-nextafter
FAIL: math/test-float32x-nextafter
FAIL: math/test-float64-nextafter
FAIL: math/test-misc
* You removed the copyright since it is a new implementation, but
any code still requires a copyright notice (even if it is not
assigned to the FSF). Check item 3.4 of the contributor checklist.
* Also, since it is a new implementation, follow the usual style and
code guidelines instead of keeping the current ones.
* It would be good to add a benchmark to evaluate this change if
the motivation for your change is performance. Follow what the
other math benchmarks do in benchtests (exp2 or exp10, for
instance).
* Sending a patch with a define switch and stating "Choose whatever
you like best" does not help, and puts the burden of evaluating
your change on the maintainers.
[1] https://sourceware.org/glibc/wiki/Contribution%20checklist
end of thread, other threads:[~2021-08-23 12:50 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-16 16:03 nextafter() about an order of magnitude slower than trivial implementation Stefan Kanthak
2021-08-18 12:51 ` Szabolcs Nagy
2021-08-18 17:11 ` Stefan Kanthak
2021-08-19 11:20 ` Adhemerval Zanella
2021-08-19 17:57 ` [PATCH] " Stefan Kanthak
2021-08-20 9:52 ` [PATCH/2nd version] " Stefan Kanthak
2021-08-20 16:55 ` Joseph Myers
2021-08-20 20:19 ` Stefan Kanthak
2021-08-20 21:03 ` Joseph Myers
2021-08-23 12:50 ` Adhemerval Zanella