* nextafter() about an order of magnitude slower than trivial implementation
@ 2021-08-16 16:03 Stefan Kanthak
  2021-08-18 12:51 ` Szabolcs Nagy
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-16 16:03 UTC (permalink / raw)
To: libc-help

Hi,

to test the accuracy of (elementary) floating-point functions against
a higher-precision reference I use the following code:

    double input = 1.0, output = exp(input), reference = M_E;

    if (output == reference)
        printf("correctly rounded\n");
    else if (nextafter(output, reference) == reference)
        printf("faithfully rounded\n");
    else
        printf("...");

Testing and benchmarking an exponential function that consumes
about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
itself is DEAD SLOW: it consumes about 8.5ns/call!

The following (quick&dirty) implementation consumes 0.7ns/call,
i.e. is about an order of magnitude faster:

--- after.c ---
double nextafter(double from, double to)
{
    if (from == to)
        return to;

    if (to != to)
        return to;

    if (from != from)
        return from;

    if (from == 0.0)
        return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;

    unsigned long long ull = *(unsigned long long *) &from;

    if ((from < to) == (from < 0.0))
        ull--;
    else
        ull++;

    return 0.0 + *(double *) &ull;
}
--- EOF ---

JFTR: the code generated by GCC for this function is FAR from
      optimal; the assembly implementation shown below is more
      than 2x faster and consumes <0.3ns/call, i.e. is about
      30x faster than glibc's nextafter()!

The data from 'perf stat' show that glibc's nextafter() executes
about 30 instructions more than the above trivial implementation,
and that it is NOT well suited for modern super-scalar processors.
Stefan

PS: test program and logs

[stefan@rome ~]$ cat bench.c
// Copyright (C) 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

__attribute__((noinline))
double nop(double foo, double bar)
{
    return foo + bar;
}

inline static
double lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    // primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    // initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const uint64_t sign = (int64_t) lfsr >> 63;

    lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);

    return *(double *) &lfsr;
}

inline static
double random64(void)
{
    static uint64_t seed = 0x0123456789ABCDEF;

    seed = seed * 6364136223846793005 + 1442695040888963407;

    return *(double *) &seed;
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t n;
    volatile double result;

    t0 = clock();

    for (n = 500000000u; n > 0u; n--) {
        result = nop(lfsr64(), 0.0);
        result = nop(random64(), 1.0 / 0.0);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--) {
        result = nextafter(lfsr64(), 0.0);
        result = nextafter(random64(), 1.0 / 0.0);
    }

    t2 = clock();
    tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;

    printf("\n"
           "nop()       %4lu.%06lu 0\n"
           "nextafter() %4lu.%06lu %4lu.%06lu\n"
           "            %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
[stefan@rome ~]$ gcc -O3 -lm bench.c
[stefan@rome ~]$ perf stat ./a.out

nop()          1.480000 0
nextafter()   10.060000   8.580000
              11.540000 nano-seconds

 Performance counter stats for './a.out':

         11,548.78 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               145      page-faults:u             #    0.013 K/sec
    38,917,213,536      cycles:u                  #    3.370 GHz                     (83.33%)
    ~~~~~~~~~~~~~~
    15,647,656,615      stalled-cycles-frontend:u #   40.21% frontend cycles idle    (83.33%)
    10,746,044,422      stalled-cycles-backend:u  #   27.61% backend cycles idle     (83.33%)
    69,739,403,870      instructions:u            #    1.79  insn per cycle
    ~~~~~~~~~~~~~~                                     ~~~~
                                                  #    0.22  stalled cycles per insn (83.33%)
    16,495,748,110      branches:u                # 1428.354 M/sec                   (83.33%)
       500,701,246      branch-misses:u           #    3.04% of all branches         (83.33%)

      11.550414029 seconds time elapsed

      11.548265000 seconds user
       0.000999000 seconds sys


[stefan@rome ~]$ gcc -O3 bench.c after.c
[stefan@rome ~]$ perf stat ./a.out

nop()          1.490000 0
nextafter()    2.210000   0.720000
               3.700000 nano-seconds

 Performance counter stats for './a.out':

          3,702.89 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               122      page-faults:u             #    0.033 K/sec
    12,407,345,183      cycles:u                  #    3.351 GHz                     (83.32%)
    ~~~~~~~~~~~~~~
           135,817      stalled-cycles-frontend:u #    0.00% frontend cycles idle    (83.34%)
     5,498,895,906      stalled-cycles-backend:u  #   44.32% backend cycles idle     (83.34%)
    38,002,430,460      instructions:u            #    3.06  insn per cycle
    ~~~~~~~~~~~~~~                                     ~~~~
                                                  #    0.14  stalled cycles per insn (83.34%)
     7,497,381,393      branches:u                # 2024.735 M/sec                   (83.34%)
           497,462      branch-misses:u           #    0.01% of all branches         (83.32%)

       3.703648875 seconds time elapsed

       3.703294000 seconds user
       0.000000000 seconds sys


[stefan@rome ~]$ cat after.s
# Copyright © 2004-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>

        .arch   generic64
        .code64
        .intel_syntax noprefix
        .text
                                # xmm0 = from
                                # xmm1 = to
nextafter:
        xorpd   xmm2, xmm2      # xmm2 = 0.0
        ucomisd xmm1, xmm2      # CF = (to < 0.0)
        jp      .Lto            # to = NAN?
        sbb     rax, rax        # rax = (to < 0.0) ? -1 : 0
        ucomisd xmm0, xmm1      # CF = (from < to)
        jp      .Lfrom          # from = NAN?
        je      .Lto            # from = to?
.Lnotequal:
        sbb     rcx, rcx        # rcx = (from < to) ? -1 : 0
        ucomisd xmm0, xmm2      # CF = (from < 0.0)
        jz      .Lzero          # from = 0.0?
.Lnext:
        movq    rdx, xmm0       # rdx = from
        sbb     rax, rax        # rax = (from < 0.0) ? -1 : 0
        xor     rax, rcx        # rax = (from < 0.0) = (from < to) ? 0 : -1
        or      rax, 1          # rax = (from < 0.0) = (from < to) ? 1 : -1
        sub     rdx, rax
        movq    xmm0, rdx
        addsd   xmm0, xmm2
        ret
.Lzero:
        shl     rax, 63         # rax = (to < -0.0) ? 0x8000000000000000 : 0
        or      rax, 1          # rax = (to < -0.0) ? 0x8000000000000001 : 1
        movq    xmm0, rax       # xmm0 = (to < -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
        addsd   xmm0, xmm2
        ret
.Lto:
        movsd   xmm0, xmm1      # xmm0 = to
.Lfrom:
        ret

        .size   nextafter, .-nextafter
        .type   nextafter, @function
        .global nextafter
        .end
[stefan@rome ~]$ perf stat ./a.out

nop()          1.630000 0
nextafter()    1.910000   0.280000
               3.540000 nano-seconds

 Performance counter stats for './a.out':

          3,547.12 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               123      page-faults:u             #    0.035 K/sec
    11,949,840,797      cycles:u                  #    3.369 GHz                     (83.32%)
    ~~~~~~~~~~~~~~
           129,627      stalled-cycles-frontend:u #    0.00% frontend cycles idle    (83.34%)
     3,998,960,716      stalled-cycles-backend:u  #   33.46% backend cycles idle     (83.34%)
    37,493,523,285      instructions:u            #    3.14  insn per cycle
    ~~~~~~~~~~~~~~                                     ~~~~
                                                  #    0.11  stalled cycles per insn (83.34%)
     7,998,559,192      branches:u                # 2254.945 M/sec                   (83.34%)
           497,565      branch-misses:u           #    0.01% of all branches         (83.32%)

       3.547999008 seconds time elapsed

       3.546671000 seconds user
       0.000999000 seconds sys

[stefan@rome ~]$

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-16 16:03 nextafter() about an order of magnitude slower than trivial implementation Stefan Kanthak
@ 2021-08-18 12:51 ` Szabolcs Nagy
  2021-08-18 17:11   ` Stefan Kanthak
  0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2021-08-18 12:51 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: libc-help

The 08/16/2021 18:03, Stefan Kanthak wrote:
> Testing and benchmarking an exponential function that consumes
> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
> itself is DEAD SLOW: it consumes about 8.5ns/call!
>
> The following (quick&dirty) implementation consumes 0.7ns/call,
> i.e. is about an order of magnitude faster:

correctly measuring latency on a modern x86_64 core:

musl:  3.16 ns
glibc: 5.68 ns
your:  5.72 ns

i.e. your code is slower.

(this is the correctly predicted hot path latency,
measuring for(i=0;i<n;i++) r = nextafter(r,0); )

it is more difficult to measure the case when latency
is not the bottleneck, since then it depends on what
code is being executed along nextafter. your code
likely wins in that case as it has fewer instructions.

what you see in your benchmark is the effect of
branch mispredicts in the nextafter(x, INFINITY)
case: your code avoids branching on the sign of x,
while glibc branches and misses half the time with
random x inputs. (it may be possible to fix glibc to
avoid that branch and then it would be fast in your
measurement too on a big ooo core)

(your code also does not handle fenv correctly and
has aliasing violations, but these can be fixed)

> [...]
* Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-18 12:51 ` Szabolcs Nagy
@ 2021-08-18 17:11   ` Stefan Kanthak
  2021-08-19 11:20     ` Adhemerval Zanella
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-18 17:11 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-help

Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:

> The 08/16/2021 18:03, Stefan Kanthak wrote:
>> Testing and benchmarking an exponential function that consumes
>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>
>> The following (quick&dirty) implementation consumes 0.7ns/call,
>> i.e. is about an order of magnitude faster:
>
> correctly measuring latency on a modern x86_64 core:
>
> musl:  3.16 ns
> glibc: 5.68 ns
> your:  5.72 ns

My benchmark measures (reciprocal) throughput, NOT latency.
As long as a routine like nextafter() is NOT called back-to-back,
but along other code, latency doesn't really matter.

> i.e. your code is slower.

The overall execution time and the numbers from 'perf stat' show
the contrary: the program executes overall

- 70 billion instructions in 38.9 billion cycles with glibc's
  implementation,
- 38 billion instructions in 12.4 billion cycles with my
  (quick&dirty) C implementation,
- 37.5 billion instructions in 12 billion cycles with my assembly
  implementation,
- 39.8 billion instructions in 22 billion cycles with musl's
  original implementation,
- 40.5 billion instructions in 11.7 billion cycles with my patch
  applied to musl's implementation.

The OVERALL numbers of cycles per call are

38.9 (glibc) > 22 (musl) > 12.4 (C) > 12 (assembly) > 11.7 (musl patched)

Subtract the (constant but unknown, not measured) number of cycles
for the program itself: I estimate the overhead to be about 4..6
billion cycles. glibc's implementation consumes about 2.5x the
number of cycles of musl's implementation, and about 5x the number
of cycles of a proper implementation.

> (this is the correctly predicted hot path latency
> measuring for(i=0;i<n;i++) r = nextafter(r,0); )

See above: I'm interested in throughput, NOT latency.

> it is more difficult to measure the case when latency
> is not the bottleneck, since then it depends on what
> code is being executed along nextafter. your code
> likely wins in that case as it has fewer instructions.

Correct. And that's what counts in practice, as shown by
'perf stat' with the number of consumed cycles!

> what you see in your benchmark is the effect of
> branch mispredicts in the nextafter(x, INFINITY)
> case: your code avoids branching on the sign of x,
> while glibc branches and misses half the time with
> random x inputs. (it may be possible to fix glibc to
> avoid that branch and then it would be fast in your
> measurement too on a big ooo core)

It's not just mispredicted branches: the program executes

- 16.5 billion branches (0.5 B(!)illion mispredicted) with glibc's
  implementation,
- 7.5 billion branches (0.5 M(!)illion mispredicted) with my C
  implementation,
- 8 billion branches (0.5 M(!)illion mispredicted) with my assembly
  implementation,
- 9.7 billion branches (0.25 B(!)illion mispredicted) with musl's
  original implementation,
- 8.5 billion branches (0.5 M(!)illion mispredicted) with my patch
  applied to musl's implementation.

2 billion branches are executed for the loops and should be
subtracted to get the net numbers per call of nextafter():

14.5 (glibc) > 9.7 (musl) > 8.5 (musl patched) > 8 (assembly) > 7.5 (C)

glibc's implementation executes almost twice the number of branches
of the other implementations. The mispredicted branches:

0.5 (glibc) > 0.25 (musl) > 0.0001 (the proper implementations)

> (your code also does not handle fenv correctly and
> has aliasing violations, but these can be fixed)

I explicitly called it a QUICK&DIRTY (C) implementation.
The assembly code is correct in both cases.

Stefan

> [...]
* Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-18 17:11   ` Stefan Kanthak
@ 2021-08-19 11:20     ` Adhemerval Zanella
  2021-08-19 17:57       ` [PATCH] " Stefan Kanthak
  2021-08-20  9:52       ` [PATCH/2nd version] " Stefan Kanthak
  0 siblings, 2 replies; 10+ messages in thread
From: Adhemerval Zanella @ 2021-08-19 11:20 UTC (permalink / raw)
To: Stefan Kanthak, Szabolcs Nagy; +Cc: libc-help

On 18/08/2021 14:11, Stefan Kanthak wrote:
> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>
>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>> Testing and benchmarking an exponential function that consumes
>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>
>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>> i.e. is about an order of magnitude faster:
>>
>> correctly measuring latency on a modern x86_64 core:
>>
>> musl:  3.16 ns
>> glibc: 5.68 ns
>> your:  5.72 ns

Thanks for bringing this up; if you want to contribute a patch,
please follow the Contribution checklist [1]. We recently dropped
the requirement of the FSF copyright assignment, so you can use a
DCO-like certification on the patches.

To change the current implementation I suggest you also provide a
benchmark using the glibc benchmark framework. Some math functions
provide both latency and reciprocal-throughput information, and
with both numbers we can evaluate whether the new implementation
is indeed better on different machines.

I would just like to ask you to keep the tone respectful and be
open to suggestions and criticism, so you do not repeat the same
derailed thread as on the musl mailing list [2].

Szabolcs and Wilco did an excellent job on some newer math function
implementations; you might read the old threads to see their
approach.

[1] https://sourceware.org/glibc/wiki/Contribution%20checklist
[2] https://www.openwall.com/lists/musl/2021/08/15/18
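[Editorial sketch, not part of the original mail: a glibc microbenchmark of the kind Adhemerval suggests is driven by an input file under benchtests/. The directives below are written from memory and should be checked against benchtests/README; the input values are arbitrary illustrations.]

```text
## args: double:double
## ret: double
## includes: math.h
0x1p0, 0x1p1
0x1.23456789abcdep4, 0x0p0
-0x1p-1022, 0x1p1023
0x1.fffffffffffffp1023, -0x1p0
```

The new input file also has to be wired into the benchtests Makefile (the list of benchmarked math functions) so that `make bench` picks it up.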
* [PATCH] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-19 11:20     ` Adhemerval Zanella
@ 2021-08-19 17:57       ` Stefan Kanthak
  2021-08-20  9:52       ` [PATCH/2nd version] " Stefan Kanthak
  1 sibling, 0 replies; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-19 17:57 UTC (permalink / raw)
To: Szabolcs Nagy, Adhemerval Zanella; +Cc: libc-help, libc-alpha

[-- Attachment #1: Type: text/plain, Size: 3645 bytes --]

Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> On 18/08/2021 14:11, Stefan Kanthak wrote:
>> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>>
>>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>>> Testing and benchmarking an exponential function that consumes
>>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>>
>>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>>> i.e. is about an order of magnitude faster:
>>>
>>> correctly measuring latency on a modern x86_64 core:
>>>
>>> musl:  3.16 ns
>>> glibc: 5.68 ns
>>> your:  5.72 ns
>
> Thanks for bringing this up, if you want to contribute a patch

See attachment. For completeness' sake and to ease an initial
review, I post the full source here too: without ALTERNATE defined,
it uses integer arithmetic to the extent possible, like the
original. Choose whatever you like best or gives better
throughput/latency.

Stefan

JFTR: .../sysdeps/ieee754/flt-32/s_nextafterf.c as well as
      .../sysdeps/i386/fpu/s_nextafterl.c will need the same changes.

--- .../math/s_nextafter.c ---
/* IEEE functions
 *	nextafter(x,y)
 *	return the next machine floating-point number of x in the
 *	direction toward y.
 *   Special cases:
 */

/* Ugly hack so that the aliasing works.  */
#define __nexttoward __internal___nexttoward
#define nexttoward __internal_nexttoward

#include <errno.h>
#include <math.h>
#include <math-barriers.h>
#include <math_private.h>
#include <float.h>
#include <libm-alias-double.h>

double __nextafter(double x, double y)
{
	union {
	    double d;
	    int64_t s;
	    uint64_t u;
	} xx = {x}, yy = {y};

	uint64_t ax = xx.u + xx.u;	/* |x|<<1 */
	uint64_t ay = yy.u + yy.u;	/* |y|<<1 */

	if(ax > 0x7ffull << 53)		/* x is nan */
	    return x;
	if(ay > 0x7ffull << 53)		/* y is nan */
	    return y;
	if(x == y) return y;		/* x=y, return y */
	if(ax == 0ull) {		/* x == 0 */
	    xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
	    xx.d = math_opt_barrier (xx.d);
	    xx.d += 0.0;
	    math_force_eval (xx.d);	/* raise underflow flag */
	    return xx.d;		/* return +-minsubnormal */
	}
#ifdef ALTERNATE
	if((x < y) == (x < 0.0))
	    --xx.u;
	else
	    ++xx.u;
#else
	if(xx.s < 0) {			/* x < 0 */
	    if(yy.s < 0 && xx.s < yy.s)
		++xx.s;			/* x < 0, x > y, x += ulp */
	    else
		--xx.s;			/* x < 0, x < y, x -= ulp */
	} else {			/* x > 0 */
	    if(xx.s > yy.s)
		--xx.s;			/* x > 0, x > y, x -= ulp */
	    else
		++xx.s;			/* x > 0, x < y, x += ulp */
	}
#endif
	ax = xx.u + xx.u;		/* |x|<<1 */
	if((ax >= 0x7ffull << 53)	/* overflow */
	 || (ax < 0x001ull << 53)) {	/* underflow */
	    xx.d = math_opt_barrier (xx.d);
	    xx.d += 0.0;
	    math_force_eval (xx.d);	/* raise overflow/underflow flag */
	    __set_errno (ERANGE);
	}
	return xx.d;
}
libm_alias_double (__nextafter, nextafter)
#ifdef NO_LONG_DOUBLE
strong_alias (__nextafter, __nexttowardl)
weak_alias (__nexttowardl, nexttowardl)
#undef __nexttoward
strong_alias (__nextafter, __nexttoward)
#undef nexttoward
weak_alias (__nextafter, nexttoward)
#endif
--- EOF ---

[-- Attachment #2: glibc.patch --]
[-- Type: application/octet-stream, Size: 3533 bytes --]

--- -/math/s_nextafter.c
+++ +/math/s_nextafter.c
@@ -1,19 +1,3 @@
-/* @(#)s_nextafter.c 5.1 93/09/24 */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
- */
-
-#if defined(LIBM_SCCS) && !defined(lint)
-static char rcsid[] = "$NetBSD: s_nextafter.c,v 1.8 1995/05/10 20:47:58 jtc Exp $";
-#endif
-
 /* IEEE functions
  *	nextafter(x,y)
  *	return the next machine floating-point number of x in the
@@ -34,55 +18,53 @@
 double __nextafter(double x, double y)
 {
-	int32_t hx,hy,ix,iy;
-	uint32_t lx,ly;
+	union {
+	    double d;
+	    int64_t s;
+	    uint64_t u;
+	} xx = {x}, yy = {y};
 
-	EXTRACT_WORDS(hx,lx,x);
-	EXTRACT_WORDS(hy,ly,y);
-	ix = hx&0x7fffffff;		/* |x| */
-	iy = hy&0x7fffffff;		/* |y| */
+	uint64_t ax = xx.u + xx.u;	/* |x|<<1 */
+	uint64_t ay = yy.u + yy.u;	/* |y|<<1 */
 
-	if(((ix>=0x7ff00000)&&((ix-0x7ff00000)|lx)!=0) ||   /* x is nan */
-	   ((iy>=0x7ff00000)&&((iy-0x7ff00000)|ly)!=0))     /* y is nan */
-	   return x+y;
-	if(x==y) return y;		/* x=y, return y */
-	if((ix|lx)==0) {		/* x == 0 */
-	    double u;
-	    INSERT_WORDS(x,hy&0x80000000,1);	/* return +-minsubnormal */
-	    u = math_opt_barrier (x);
-	    u = u*u;
-	    math_force_eval (u);	/* raise underflow flag */
+	if(ax > 0x7ffull << 53)		/* x is nan */
 	    return x;
+	if(ay > 0x7ffull << 53)		/* y is nan */
+	    return y;
+	if(x == y) return y;		/* x=y, return y */
+	if(ax == 0ull) {		/* x == 0 */
+	    xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
+	    xx.d = math_opt_barrier (xx.d);
+	    xx.d += 0.0;
+	    math_force_eval (xx.d);	/* raise underflow flag */
+	    return xx.d;		/* return +-minsubnormal */
 	}
-	if(hx>=0) {			/* x > 0 */
-	    if(hx>hy||((hx==hy)&&(lx>ly))) {	/* x > y, x -= ulp */
-		if(lx==0) hx -= 1;
-		lx -= 1;
-	    } else {			/* x < y, x += ulp */
-		lx += 1;
-		if(lx==0) hx += 1;
-	    }
-	} else {			/* x < 0 */
-	    if(hy>=0||hx>hy||((hx==hy)&&(lx>ly))){/* x < y, x -= ulp */
-		if(lx==0) hx -= 1;
-		lx -= 1;
-	    } else {			/* x > y, x += ulp */
-		lx += 1;
-		if(lx==0) hx += 1;
-	    }
+#ifdef ALTERNATE
+	if ((x < y) == (x < 0.0))
+	    --xx.u;
+	else
+	    ++xx.u;
+#else
+	if(xx.s < 0) {			/* x < 0 */
+	    if(yy.s < 0 && xx.s < yy.s)
+		++xx.s;			/* x < 0, x > y, x += ulp */
+	    else
+		--xx.s;			/* x < 0, x < y, x -= ulp */
+	} else {			/* x > 0 */
+	    if(xx.s > yy.s)
+		--xx.s;			/* x > 0, x > y, x -= ulp */
+	    else
+		++xx.s;			/* x > 0, x < y, x += ulp */
 	}
-	hy = hx&0x7ff00000;
-	if(hy>=0x7ff00000) {
-	    double u = x+x;		/* overflow */
-	    math_force_eval (u);
+#endif
+	ax = xx.u + xx.u;		/* |x|<<1 */
+	if((ax >= 0x7ffull << 53)	/* overflow */
+	 || (ax < 0x001ull << 53)) {	/* underflow */
+	    xx.d = math_opt_barrier (xx.d);
+	    xx.d += 0.0;
+	    math_force_eval (xx.d);	/* raise overflow/underflow flag */
 	    __set_errno (ERANGE);
 	}
-	if(hy<0x00100000) {
-	    double u = x*x;		/* underflow */
-	    math_force_eval (u);	/* raise underflow flag */
-	    __set_errno (ERANGE);
-	}
-	INSERT_WORDS(x,hx,lx);
-	return x;
+	return xx.d;
 }
 libm_alias_double (__nextafter, nextafter)
* [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-19 11:20 ` Adhemerval Zanella
  2021-08-19 17:57 ` [PATCH] " Stefan Kanthak
@ 2021-08-20  9:52 ` Stefan Kanthak
  2021-08-20 16:55 ` Joseph Myers
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-20 9:52 UTC (permalink / raw)
To: Szabolcs Nagy, Adhemerval Zanella; +Cc: libc-help, libc-alpha

[-- Attachment #1: Type: text/plain, Size: 3682 bytes --]

Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> On 18/08/2021 14:11, Stefan Kanthak wrote:
>> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>>
>>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>>> Testing and benchmarking an exponential function that consumes
>>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>>
>>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>>> i.e. is about an order of magnitude faster:
>>>
>>> correctly measuring latency on a modern x86_64 core:
>>>
>>> musl:  3.16 ns
>>> glibc: 5.68 ns
>>> your:  5.72 ns
>
> Thanks for bring this up, if you want to contribute a patch

See attachment. For completeness' sake and to ease an initial review,
I post the full source here too: without ALTERNATE defined, it uses
integer arithmetic to the extent possible, like the original.
Choose whatever you like best or gives better throughput/latency.

Changes from the 1st version of the patch:

- leftover FP comparison substituted with integer operations;
- increment/decrement logic simplified/unified, avoiding a
  conditional branch.

I expect the compiler to detect the common clause
"&& ((xx.s < 0) == (yy.s < 0))" introduced with these changes.

Stefan

JFTR: .../sysdeps/ieee754/flt-32/s_nextafterf.c as well as
      .../sysdeps/i386/fpu/s_nextafterl.c will need the same changes.
--- .../math/s_nextafter.c ---
/* IEEE functions
 *	nextafter(x,y)
 *	return the next machine floating-point number of x in the
 *	direction toward y.
 *	Special cases:
 */

/* Ugly hack so that the aliasing works.  */
#define __nexttoward __internal___nexttoward
#define nexttoward __internal_nexttoward

#include <errno.h>
#include <math.h>
#include <math-barriers.h>
#include <math_private.h>
#include <float.h>
#include <libm-alias-double.h>

double __nextafter(double x, double y)
{
    union {
        double   d;
        int64_t  s;
        uint64_t u;
    } xx = {x}, yy = {y};

    uint64_t ax = xx.u + xx.u;      /* |x|<<1 */
    uint64_t ay = yy.u + yy.u;      /* |y|<<1 */

    if(ax > 0x7ffull << 53)         /* x is nan */
        return x;
    if(ay > 0x7ffull << 53)         /* y is nan */
        return y;
#ifdef ALTERNATE
    if(x == y) return y;            /* x=y, return y */
#else
    if((ax == ay)
    && ((xx.s < 0) == (yy.s < 0)))
        return y;
#endif
    if(ax == 0) {                   /* x == 0 */
        xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
        xx.d = math_opt_barrier (xx.d);
        xx.d += 0.0;
        math_force_eval (xx.d);     /* raise underflow flag */
        return xx.d;                /* return +-minsubnormal */
    }
#ifdef ALTERNATE
    if((x < y) == (x < 0.0))
        --xx.u;
    else
        ++xx.u;
#else
    if((ax < ay)
    && ((xx.s < 0) == (yy.s < 0)))
        ++xx.u;
    else
        --xx.u;
#endif
    ax = xx.u + xx.u;               /* |x|<<1 */
    if((ax >= 0x7ffull << 53)       /* overflow */
    || (ax < 0x001ull << 53)) {     /* underflow */
        xx.d = math_opt_barrier (xx.d);
        xx.d += 0.0;
        math_force_eval (xx.d);     /* raise overflow/underflow flag */
        __set_errno (ERANGE);
    }
    return xx.d;
}
libm_alias_double (__nextafter, nextafter)

#ifdef NO_LONG_DOUBLE
strong_alias (__nextafter, __nexttowardl)
weak_alias (__nexttowardl, nexttowardl)
#undef __nexttoward
strong_alias (__nextafter, __nexttoward)
#undef nexttoward
weak_alias (__nextafter, nexttoward)
#endif
--- EOF ---

[-- Attachment #2: glibc.patch --]
[-- Type: application/octet-stream, Size: 3447 bytes --]

--- -/math/s_nextafter.c
+++ +/math/s_nextafter.c
@@ -1,19 +1,3 @@
-/* @(#)s_nextafter.c 5.1 93/09/24 */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
- */
-
-#if defined(LIBM_SCCS) && !defined(lint)
-static char rcsid[] = "$NetBSD: s_nextafter.c,v 1.8 1995/05/10 20:47:58 jtc Exp $";
-#endif
-
 /* IEEE functions
  *	nextafter(x,y)
  *	return the next machine floating-point number of x in the
@@ -34,56 +18,55 @@
 double __nextafter(double x, double y)
 {
-	int32_t hx,hy,ix,iy;
-	uint32_t lx,ly;
+	union {
+	    double   d;
+	    int64_t  s;
+	    uint64_t u;
+	} xx = {x}, yy = {y};
 
-	EXTRACT_WORDS(hx,lx,x);
-	EXTRACT_WORDS(hy,ly,y);
-	ix = hx&0x7fffffff;		/* |x| */
-	iy = hy&0x7fffffff;		/* |y| */
+	uint64_t ax = xx.u + xx.u;	/* |x|<<1 */
+	uint64_t ay = yy.u + yy.u;	/* |y|<<1 */
 
-	if(((ix>=0x7ff00000)&&((ix-0x7ff00000)|lx)!=0) ||   /* x is nan */
-	   ((iy>=0x7ff00000)&&((iy-0x7ff00000)|ly)!=0))     /* y is nan */
-	   return x+y;
-	if(x==y) return y;		/* x=y, return y */
-	if((ix|lx)==0) {		/* x == 0 */
-	    double u;
-	    INSERT_WORDS(x,hy&0x80000000,1);	/* return +-minsubnormal */
-	    u = math_opt_barrier (x);
-	    u = u*u;
-	    math_force_eval (u);	/* raise underflow flag */
+	if(ax > 0x7ffull << 53)		/* x is nan */
 	    return x;
+	if(ay > 0x7ffull << 53)		/* y is nan */
+	    return y;
+#ifdef ALTERNATE
+	if(x == y) return y;		/* x=y, return y */
+#else
+	if((ax == ay)
+	&& ((xx.s < 0) == (yy.s < 0)))
+	    return y;
+#endif
+	if(ax == 0) {			/* x == 0 */
+	    xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
+	    xx.d = math_opt_barrier (xx.d);
+	    xx.d += 0.0;
+	    math_force_eval (xx.d);	/* raise underflow flag */
+	    return xx.d;		/* return +-minsubnormal */
 	}
-	if(hx>=0) {			/* x > 0 */
-	    if(hx>hy||((hx==hy)&&(lx>ly))) {	/* x > y, x -= ulp */
-		if(lx==0) hx -= 1;
-		lx -= 1;
-	    } else {			/* x < y, x += ulp */
-		lx += 1;
-		if(lx==0) hx += 1;
-	    }
-	} else {			/* x < 0 */
-	    if(hy>=0||hx>hy||((hx==hy)&&(lx>ly))){/* x < y, x -= ulp */
-		if(lx==0) hx -= 1;
-		lx -= 1;
-	    } else {			/* x > y, x += ulp */
-		lx += 1;
-		if(lx==0) hx += 1;
-	    }
+#ifdef ALTERNATE
+	if((x < y) == (x < 0.0))
+	    --xx.u;
+	else
+	    ++xx.u;
 	}
-	hy = hx&0x7ff00000;
-	if(hy>=0x7ff00000) {
-	    double u = x+x;		/* overflow */
-	    math_force_eval (u);
-	    __set_errno (ERANGE);
+#else
+	if((ax < ay)
+	&& ((xx.s < 0) == (yy.s < 0)))
+	    ++xx.u;
+	else
+	    --xx.u;
+#endif
+	ax = xx.u + xx.u;		/* |x|<<1 */
+	if((ax >= 0x7ffull << 53)	/* overflow */
+	|| (ax < 0x001ull << 53)) {	/* underflow */
+	    xx.d = math_opt_barrier (xx.d);
+	    xx.d += 0.0;
+	    math_force_eval (xx.d);	/* raise overflow/underflow flag */
+	    __set_errno (ERANGE);
 	}
-	if(hy<0x00100000) {
-	    double u = x*x;		/* underflow */
-	    math_force_eval (u);	/* raise underflow flag */
-	    __set_errno (ERANGE);
-	}
-	INSERT_WORDS(x,hx,lx);
-	return x;
+	return xx.d;
 }
 libm_alias_double (__nextafter, nextafter)
 #ifdef NO_LONG_DOUBLE

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-20  9:52 ` [PATCH/2nd version] " Stefan Kanthak
@ 2021-08-20 16:55 ` Joseph Myers
  2021-08-20 20:19 ` Stefan Kanthak
  0 siblings, 1 reply; 10+ messages in thread
From: Joseph Myers @ 2021-08-20 16:55 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Szabolcs Nagy, Adhemerval Zanella, libc-help, libc-alpha

On Fri, 20 Aug 2021, Stefan Kanthak wrote:

>     if(ax > 0x7ffull << 53)         /* x is nan */
>         return x;
>     if(ay > 0x7ffull << 53)         /* y is nan */
>         return y;

How has this been tested?  I'd have expected it to fail the nextafter
tests in the glibc testsuite (libm-test-nextafter.inc), because they
verify that sNaN arguments produce a qNaN result with the "invalid"
exception raised, and this looks like it would just return an sNaN
argument unchanged.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-20 16:55 ` Joseph Myers
@ 2021-08-20 20:19 ` Stefan Kanthak
  2021-08-20 21:03 ` Joseph Myers
  2021-08-23 12:50 ` Adhemerval Zanella
  0 siblings, 2 replies; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-20 20:19 UTC (permalink / raw)
To: Joseph Myers; +Cc: Szabolcs Nagy, Adhemerval Zanella, libc-help, libc-alpha

"Joseph Myers" <joseph@codesourcery.com> wrote:

> On Fri, 20 Aug 2021, Stefan Kanthak wrote:
>
>>     if(ax > 0x7ffull << 53)         /* x is nan */
>>         return x;
>>     if(ay > 0x7ffull << 53)         /* y is nan */
>>         return y;
>
> How has this been tested?  I'd have expected it to fail the nextafter
> tests in the glibc testsuite (libm-test-nextafter.inc), because they
> verify that sNaN arguments produce a qNaN result with the "invalid"
> exception raised, and this looks like it would just return an sNaN
> argument unchanged.

It doesn't just look like it would, it REALLY does!
As I explicitly wrote, my changes avoid FP operations where possible.

<https://pubs.opengroup.org/onlinepubs/9699919799/functions/nextafter.html>

| If x or y is NaN, a NaN shall be returned.

I chose to return the argument NaN, which both POSIX and ISO C allow.
If my memory serves me well, one of the former editions even stated
"If an argument is NaN, this argument shall be returned".  This may,
however, be influenced/spoiled by
<https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/nextafter-functions>

| If either x or y is a NAN, then the return value is one of the input NANs.

If that bothers you, change these 4 lines to either

    if(ax > 0x7ffull << 53)         /* x is nan */
        return 0.0 + x;
    if(ay > 0x7ffull << 53)         /* y is nan */
        return 0.0 + y;

(the addition may need to be guarded against optimisation) or

    if(ax > 0x7ffull << 53          /* x is nan */
    || ay > 0x7ffull << 53)         /* y is nan */
        return x + y;

whatever you like best.

Stefan

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-20 20:19 ` Stefan Kanthak
@ 2021-08-20 21:03 ` Joseph Myers
  0 siblings, 0 replies; 10+ messages in thread
From: Joseph Myers @ 2021-08-20 21:03 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Szabolcs Nagy, libc-help, libc-alpha

On Fri, 20 Aug 2021, Stefan Kanthak wrote:

> I chose to return the argument NaN, which both POSIX and ISO C allow.

glibc follows the C23 Annex F FE_SNANS_ALWAYS_SIGNAL rules for
signaling NaN arguments.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-20 20:19 ` Stefan Kanthak
  2021-08-20 21:03 ` Joseph Myers
@ 2021-08-23 12:50 ` Adhemerval Zanella
  1 sibling, 0 replies; 10+ messages in thread
From: Adhemerval Zanella @ 2021-08-23 12:50 UTC (permalink / raw)
To: Stefan Kanthak, Joseph Myers; +Cc: Szabolcs Nagy, libc-help, libc-alpha

On 20/08/2021 17:19, Stefan Kanthak wrote:
> "Joseph Myers" <joseph@codesourcery.com> wrote:
>
>> On Fri, 20 Aug 2021, Stefan Kanthak wrote:
>>
>>>     if(ax > 0x7ffull << 53)         /* x is nan */
>>>         return x;
>>>     if(ay > 0x7ffull << 53)         /* y is nan */
>>>         return y;
>>
>> How has this been tested?  I'd have expected it to fail the nextafter
>> tests in the glibc testsuite (libm-test-nextafter.inc), because they
>> verify that sNaN arguments produce a qNaN result with the "invalid"
>> exception raised, and this looks like it would just return an sNaN
>> argument unchanged.
>
> It doesn't just look like it would, it REALLY does!
> As I explicitly wrote, my changes avoid FP operations where possible.
>
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/nextafter.html>
>
> | If x or y is NaN, a NaN shall be returned.
>
> I chose to return the argument NaN, which both POSIX and ISO C allow.
> If my memory serves me well, one of the former editions even stated
> "If an argument is NaN, this argument shall be returned".  This may,
> however, be influenced/spoiled by
> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/nextafter-functions>
>
> | If either x or y is a NAN, then the return value is one of the input NANs.
>
> If that bothers you, change these 4 lines to either
>
>     if(ax > 0x7ffull << 53)         /* x is nan */
>         return 0.0 + x;
>     if(ay > 0x7ffull << 53)         /* y is nan */
>         return 0.0 + y;
>
> (the addition may need to be guarded against optimisation) or
>
>     if(ax > 0x7ffull << 53          /* x is nan */
>     || ay > 0x7ffull << 53)         /* y is nan */
>         return x + y;

Sorry, but sending half-baked patches where one will need to evaluate
and decide which is the best strategy is not the way forward to get
this change accepted.  Please do follow the contributor checklist [1]
and the suggestions I gave you when you asked for input on libc-help:

  * Check for regressions by testing your patch against the glibc
    testsuite (make check).  As Joseph has pointed out, your patch
    triggers a lot of regressions:

      $ grep ^FAIL math/subdir-tests.sum
      FAIL: math/bug-nextafter
      FAIL: math/test-double-nextafter
      FAIL: math/test-float32x-nextafter
      FAIL: math/test-float64-nextafter
      FAIL: math/test-misc

  * You removed the copyright since it is a new implementation, but any
    code still requires a copyright (even if it is not assigned to the
    FSF).  Check item 3.4 of the contributor checklist.

  * Also, since it is a new implementation, follow the usual style and
    code guidelines instead of the style of the current file.

  * It would be good to add a benchmark to evaluate this change if the
    motivation for your change is performance.  Follow what the other
    math benchmarks do in benchtests (exp2 or exp10 for instance).

  * Sending a patch with a define switch and stating "Choose whatever
    you like best" does not help, and puts the burden of evaluating
    your change on the maintainers.

[1] https://sourceware.org/glibc/wiki/Contribution%20checklist

^ permalink raw reply	[flat|nested] 10+ messages in thread
end of thread, other threads:[~2021-08-23 12:50 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-16 16:03 nextafter() about an order of magnitude slower than trivial implementation Stefan Kanthak
2021-08-18 12:51 ` Szabolcs Nagy
2021-08-18 17:11   ` Stefan Kanthak
2021-08-19 11:20     ` Adhemerval Zanella
2021-08-19 17:57       ` [PATCH] " Stefan Kanthak
2021-08-20  9:52         ` [PATCH/2nd version] " Stefan Kanthak
2021-08-20 16:55           ` Joseph Myers
2021-08-20 20:19             ` Stefan Kanthak
2021-08-20 21:03               ` Joseph Myers
2021-08-23 12:50             ` Adhemerval Zanella