From: Szabolcs Nagy <szabolcs.nagy@arm.com>
To: Stefan Kanthak <stefan.kanthak@nexgo.de>
Cc: libc-help@sourceware.org
Subject: Re: nextafter() about an order of magnitude slower than trivial implementation
Date: Wed, 18 Aug 2021 13:51:19 +0100 [thread overview]
Message-ID: <20210818125119.GH25257@arm.com> (raw)
In-Reply-To: <D7077C5619F74049B8890508155291B2@H270>
The 08/16/2021 18:03, Stefan Kanthak wrote:
> Testing and benchmarking an exponential function that consumes
> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
> itself is DEAD SLOW: it consumes about 8.5ns/call!
>
> The following (quick&dirty) implementation consumes 0.7ns/call,
> i.e. is about an order of magnitude faster:
correctly measuring latency on a modern x86_64 core:
musl:  3.16 ns
glibc: 5.68 ns
yours: 5.72 ns
i.e. your code is slower.
(this is the correctly predicted hot-path latency,
measured with for (i=0; i<n; i++) r = nextafter(r,0); )
it is more difficult to measure the case when latency
is not the bottleneck, since then it depends on what
code is executed alongside nextafter. your code
likely wins in that case as it has fewer instructions.
what you see in your benchmark is the effect of
branch mispredicts in the nextafter(x, INFINITY)
case: your code avoids branching on the sign of x,
while glibc branches and mispredicts half the time with
random x inputs. (it may be possible to fix glibc to
avoid that branch, and then it would be fast in your
measurement too on a big out-of-order core)
(your code also does not handle the floating-point
environment (fenv) correctly and has aliasing
violations, but both can be fixed)
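the aliasing violations can be sketched away with memcpy instead of
pointer punning (a rough sketch, not tested against the glibc
testsuite; the name nextafter_sketch is made up, and full fenv
correctness, i.e. raising underflow/inexact on subnormal results,
would still need a final arithmetic op like your 0.0 + trick):

```c
#include <stdint.h>
#include <string.h>

/* aliasing-safe variant of the quoted after.c: memcpy is the
   defined way to move bits between double and uint64_t */
static double nextafter_sketch(double from, double to)
{
    if (from == to)   return to;      /* also covers +0.0 == -0.0 */
    if (to != to)     return to;      /* to is NaN   */
    if (from != from) return from;    /* from is NaN */

    if (from == 0.0)                  /* first step off zero */
        return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;

    uint64_t bits;
    memcpy(&bits, &from, sizeof bits);

    /* in the IEEE-754 sign-magnitude encoding, +-1 on the bit
       pattern steps the magnitude toward or away from zero */
    if ((from < to) == (from < 0.0))
        bits--;                       /* step toward zero   */
    else
        bits++;                       /* step away from zero */

    double result;
    memcpy(&result, &bits, sizeof result);
    return result;                    /* fp exceptions not raised here */
}
```
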
>
> --- after.c ---
> double nextafter(double from, double to)
> {
>     if (from == to)
>         return to;
>
>     if (to != to)
>         return to;
>
>     if (from != from)
>         return from;
>
>     if (from == 0.0)
>         return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;
>
>     unsigned long long ull = *(unsigned long long *) &from;
>
>     if ((from < to) == (from < 0.0))
>         ull--;
>     else
>         ull++;
>
>     return 0.0 + *(double *) &ull;
> }
> --- EOF ---
>
> JFTR: the code generated by GCC for this function is FAR from
> optimal; the assembly implementation shown below is more
> than 2x faster and consumes <0.3ns/call, i.e. is about
> 30x faster than glibc's nextafter()!
>
> The data from 'perf stat' show that glibc's nextafter() executes
> about 30 instructions more than the above trivial implementation,
> and that it is NOT well suited for modern super-scalar processors.
>
> Stefan
>
> PS: test program and logs
>
> [stefan@rome ~]$ cat bench.c
> // Copyright (C) 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
>
> #include <math.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <time.h>
>
> __attribute__((noinline))
> double nop(double foo, double bar)
> {
>     return foo + bar;
> }
>
> inline static
> double lfsr64(void)
> {
>     // 64-bit linear feedback shift register (Galois form) using
>     // primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
>     // initialised with 2**64 / golden ratio
>
>     static uint64_t lfsr = 0x9E3779B97F4A7C15;
>     const uint64_t sign = (int64_t) lfsr >> 63;
>
>     lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
>
>     return *(double *) &lfsr;
> }
>
> inline static
> double random64(void)
> {
>     static uint64_t seed = 0x0123456789ABCDEF;
>
>     seed = seed * 6364136223846793005 + 1442695040888963407;
>
>     return *(double *) &seed;
> }
>
> int main(void)
> {
>     clock_t t0, t1, t2, tt;
>     uint32_t n;
>     volatile double result;
>
>     t0 = clock();
>
>     for (n = 500000000u; n > 0u; n--) {
>         result = nop(lfsr64(), 0.0);
>         result = nop(random64(), 1.0 / 0.0);
>     }
>
>     t1 = clock();
>
>     for (n = 500000000u; n > 0u; n--) {
>         result = nextafter(lfsr64(), 0.0);
>         result = nextafter(random64(), 1.0 / 0.0);
>     }
>
>     t2 = clock();
>     tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;
>
>     printf("\n"
>            "nop() %4lu.%06lu 0\n"
>            "nextafter() %4lu.%06lu %4lu.%06lu\n"
>            " %4lu.%06lu nano-seconds\n",
>            t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
> }
> [stefan@rome ~]$ gcc -O3 -lm bench.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop() 1.480000 0
> nextafter() 10.060000 8.580000
> 11.540000 nano-seconds
>
> Performance counter stats for './a.out':
>
> 11,548.78 msec task-clock:u # 1.000 CPUs utilized
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 145 page-faults:u # 0.013 K/sec
> 38,917,213,536 cycles:u # 3.370 GHz (83.33%)
> ~~~~~~~~~~~~~~
> 15,647,656,615 stalled-cycles-frontend:u # 40.21% frontend cycles idle (83.33%)
> 10,746,044,422 stalled-cycles-backend:u # 27.61% backend cycles idle (83.33%)
> 69,739,403,870 instructions:u # 1.79 insn per cycle
> ~~~~~~~~~~~~~~ ~~~~
> # 0.22 stalled cycles per insn (83.33%)
> 16,495,748,110 branches:u # 1428.354 M/sec (83.33%)
> 500,701,246 branch-misses:u # 3.04% of all branches (83.33%)
>
> 11.550414029 seconds time elapsed
>
> 11.548265000 seconds user
> 0.000999000 seconds sys
>
>
> [stefan@rome ~]$ gcc -O3 bench.c after.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop() 1.490000 0
> nextafter() 2.210000 0.720000
> 3.700000 nano-seconds
>
> Performance counter stats for './a.out':
>
> 3,702.89 msec task-clock:u # 1.000 CPUs utilized
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 122 page-faults:u # 0.033 K/sec
> 12,407,345,183 cycles:u # 3.351 GHz (83.32%)
> ~~~~~~~~~~~~~~
> 135,817 stalled-cycles-frontend:u # 0.00% frontend cycles idle (83.34%)
> 5,498,895,906 stalled-cycles-backend:u # 44.32% backend cycles idle (83.34%)
> 38,002,430,460 instructions:u # 3.06 insn per cycle
> ~~~~~~~~~~~~~~ ~~~~
> # 0.14 stalled cycles per insn (83.34%)
> 7,497,381,393 branches:u # 2024.735 M/sec (83.34%)
> 497,462 branch-misses:u # 0.01% of all branches (83.32%)
>
> 3.703648875 seconds time elapsed
>
> 3.703294000 seconds user
> 0.000000000 seconds sys
>
>
> [stefan@rome ~]$ cat after.s
> # Copyright © 2004-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
>
>         .arch generic64
>         .code64
>         .intel_syntax noprefix
>         .text
>                                     # xmm0 = from
>                                     # xmm1 = to
> nextafter:
>         xorpd   xmm2, xmm2          # xmm2 = 0.0
>         ucomisd xmm1, xmm2          # CF = (to < 0.0)
>         jp      .Lto                # to = NAN?
>         sbb     rax, rax            # rax = (to < 0.0) ? -1 : 0
>         ucomisd xmm0, xmm1          # CF = (from < to)
>         jp      .Lfrom              # from = NAN?
>         je      .Lto                # from = to?
> .Lnotequal:
>         sbb     rcx, rcx            # rcx = (from < to) ? -1 : 0
>         ucomisd xmm0, xmm2          # CF = (from < 0.0)
>         jz      .Lzero              # from = 0.0?
> .Lnext:
>         movq    rdx, xmm0           # rdx = from
>         sbb     rax, rax            # rax = (from < 0.0) ? -1 : 0
>         xor     rax, rcx            # rax = (from < 0.0) == (from < to) ? 0 : -1
>         or      rax, 1              # rax = (from < 0.0) == (from < to) ? 1 : -1
>         sub     rdx, rax
>         movq    xmm0, rdx
>         addsd   xmm0, xmm2
>         ret
> .Lzero:
>         shl     rax, 63             # rax = (to < -0.0) ? 0x8000000000000000 : 0
>         or      rax, 1              # rax = (to < -0.0) ? 0x8000000000000001 : 1
>         movq    xmm0, rax           # xmm0 = (to < -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
>         addsd   xmm0, xmm2
>         ret
> .Lto:
>         movsd   xmm0, xmm1          # xmm0 = to
> .Lfrom:
>         ret
>
>         .size   nextafter, .-nextafter
>         .type   nextafter, @function
>         .global nextafter
>         .end
> [stefan@rome ~]$ perf stat ./a.out
>
> nop() 1.630000 0
> nextafter() 1.910000 0.280000
> 3.540000 nano-seconds
>
> Performance counter stats for './a.out':
>
> 3,547.12 msec task-clock:u # 1.000 CPUs utilized
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 123 page-faults:u # 0.035 K/sec
> 11,949,840,797 cycles:u # 3.369 GHz (83.32%)
> ~~~~~~~~~~~~~~
> 129,627 stalled-cycles-frontend:u # 0.00% frontend cycles idle (83.34%)
> 3,998,960,716 stalled-cycles-backend:u # 33.46% backend cycles idle (83.34%)
> 37,493,523,285 instructions:u # 3.14 insn per cycle
> ~~~~~~~~~~~~~~ ~~~~
> # 0.11 stalled cycles per insn (83.34%)
> 7,998,559,192 branches:u # 2254.945 M/sec (83.34%)
> 497,565 branch-misses:u # 0.01% of all branches (83.32%)
>
> 3.547999008 seconds time elapsed
>
> 3.546671000 seconds user
> 0.000999000 seconds sys
>
>
> [stefan@rome ~]$
>
--
Thread overview: 10+ messages
2021-08-16 16:03 Stefan Kanthak
2021-08-18 12:51 ` Szabolcs Nagy [this message]
2021-08-18 17:11 ` Stefan Kanthak
2021-08-19 11:20 ` Adhemerval Zanella
2021-08-19 17:57 ` [PATCH] " Stefan Kanthak
2021-08-20 9:52 ` [PATCH/2nd version] " Stefan Kanthak
2021-08-20 16:55 ` Joseph Myers
2021-08-20 20:19 ` Stefan Kanthak
2021-08-20 21:03 ` Joseph Myers
2021-08-23 12:50 ` Adhemerval Zanella