From: "Stefan Kanthak" <stefan.kanthak@nexgo.de>
To: "Szabolcs Nagy" <szabolcs.nagy@arm.com>
Cc: <libc-help@sourceware.org>
Subject: Re: nextafter() about an order of magnitude slower than trivial implementation
Date: Wed, 18 Aug 2021 19:11:02 +0200
Message-ID: <BEF317B3F18B447F8BAE9B550BE5C237@H270>
In-Reply-To: <20210818125119.GH25257@arm.com>
Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> The 08/16/2021 18:03, Stefan Kanthak wrote:
>> Testing and benchmarking an exponential function that consumes
>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>
>> The following (quick&dirty) implementation consumes 0.7ns/call,
>> i.e. is about an order of magnitude faster:
>
> correctly measuring latency on a modern x86_64 core:
>
> musl: 3.16 ns
> glibc: 5.68 ns
> your: 5.72 ns
My benchmark measures (reciprocal) throughput, NOT latency.
As long as a routine like nextafter() is NOT called back-to-back,
but alongside other code, latency doesn't really matter.
> i.e. your code is slower.
The overall execution time and the numbers from 'perf stat' show
the contrary: in total, the program executes
- 70 billion instructions in 38.9 billion cycles with glibc's
implementation,
- 38 billion instructions in 12.4 billion cycles with my
(quick&dirty) C implementation,
- 37.5 billion instructions in 12 billion cycles with my
assembly implementation,
- 39.8 billion instructions in 22 billion cycles with musl's
original implementation,
- 40.5 billion instructions in 11.7 billion cycles with my
patch applied to musl's implementation.
The OVERALL numbers of cycles (in billions) are
38.9 (glibc) > 22 (musl) > 12.4 (C) > 12 (Assembly) > 11.7 (musl patched)
Subtract the (constant, but not separately measured) number of
cycles consumed by the benchmark program itself: I estimate this
overhead at about 4..6 billion cycles.
After that subtraction, glibc's implementation consumes about 2x
the number of cycles of musl's implementation, and about 5x the
number of cycles of a proper implementation.
> (this is the correctly predicted hot path latency
> measuring for(i=0;i<n;i++) r = nextafter(r,0); )
See above: I'm interested in throughput, NOT latency.
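
For illustration, the two styles measure different things; a minimal
sketch (NOT my actual harness; N and random_double() are placeholders
of my own choosing):

#include <math.h>

#define N 500000000u

extern double random_double(void);   /* placeholder input generator */

/* latency: each call depends on the previous result, so the core
   must finish one call before it can start the next */
double bench_latency(double x0)
{
    double r = x0;
    for (unsigned i = 0; i < N; i++)
        r = nextafter(r, 0.0);
    return r;
}

/* throughput: inputs are independent, so an out-of-order core can
   keep several calls in flight at once */
double bench_throughput(void)
{
    volatile double result = 0.0;
    for (unsigned i = 0; i < N; i++)
        result = nextafter(random_double(), 0.0);
    return result;
}
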
> it is more difficult to measure the case when latency
> is not the bottleneck, since then it depends on what
> code is being executed along nextafter. your code
> likely wins in that case as it has less instructions.
Correct. And that's what counts in practice, as the cycle counts
from 'perf stat' show!
> what you see in your benchmark is the effect of
> branch mispredicts in the nextafter(x, INFINITY)
> case: your code avoids branching on the sign of x,
> while glibc branches and misses half the time with
> random x inputs. (it maybe possible to fix glibc to
> avoid that branch and then it would be fast in your
> measurement too on a big ooo core)
It's not just mispredicted branches: the program executes
- 16.5 billion branches (0.5 B(!)illion mispredicted) with
glibc's implementation,
- 7.5 billion branches (0.5 M(!)illion mispredicted) with my
C implementation,
- 8 billion branches (0.5 M(!)illion mispredicted) with my
assembly implementation,
- 9.7 billion branches (0.25 B(!)illion mispredicted) with
musl's original implementation,
- 8.5 billion branches (0.5 M(!)illion mispredicted) with my
  patch applied to musl's implementation.
2 billion branches are executed by the benchmark loops themselves
and should be subtracted from each figure to get the net numbers
(in billions, i.e. per call of nextafter()):
14.5 (glibc) > 7.7 (musl) > 6.5 (musl patched) > 6 (assembly) > 5.5 (C)
glibc's implementation executes about twice as many branches as
the other implementations.
The mispredicted branches (in billions):
0.5 (glibc) > 0.25 (musl) > ~0.0005 (the proper implementations)
> (your code also does not handle fenv correctly and
> has aliasing violations, but these can be fixed)
I explicitly labelled the (C) implementation QUICK&DIRTY.
The assembly code is correct in both respects.
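
JFTR: the aliasing violation in the quick&dirty C version could be
fixed by punning through memcpy, which compilers typically reduce to
a plain register move; a minimal sketch, with helper names of my own
choosing:

#include <string.h>

static inline unsigned long long bits_of(double d)
{
    unsigned long long u;
    memcpy(&u, &d, sizeof u);   /* well-defined type punning */
    return u;
}

static inline double double_of(unsigned long long u)
{
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}

This addresses only the strict-aliasing point; the fenv behaviour is
a separate issue.
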
Stefan
> --- after.c ---
> double nextafter(double from, double to)
> {
> if (from == to)
> return to;
>
> if (to != to)
> return to;
>
> if (from != from)
> return from;
>
> if (from == 0.0)
> return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;
>
> unsigned long long ull = *(unsigned long long *) &from;
>
> if ((from < to) == (from < 0.0))
> ull--;
> else
> ull++;
>
> return 0.0 + *(double *) &ull;
> }
> --- EOF ---
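
JFTR: the trick behind the branch-free step above is that IEEE-754
doubles are ordered like sign-magnitude integers, so the neighbour
toward zero is always the current bit pattern minus 1, and the
neighbour away from zero is the bit pattern plus 1, for either sign;
(from < to) == (from < 0.0) simply tests whether the step goes
toward zero. A small stand-alone demonstration (a sketch, assuming
IEEE-754 binary64, and using memcpy for the punning):

#include <math.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    double x = 1.0, up, down;
    unsigned long long u;

    memcpy(&u, &x, sizeof u);         /* 0x3FF0000000000000 */

    u += 1;                           /* step away from zero */
    memcpy(&up, &u, sizeof up);       /* 0x1.0000000000001p+0 */

    u -= 2;                           /* step toward zero */
    memcpy(&down, &u, sizeof down);   /* 0x1.fffffffffffffp-1 */

    printf("%d %d\n", up   == nextafter(1.0, 2.0),
                      down == nextafter(1.0, 0.0));   /* prints: 1 1 */
    return 0;
}
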
>
> JFTR: the code generated by GCC for this function is FAR from
> optimal; the assembly implementation shown below is more
> than 2x faster and consumes <0.3ns/call, i.e. is about
> 30x faster than glibc's nextafter()!
>
> The data from 'perf stat' show that glibc's nextafter() executes
> about 30 instructions more than the above trivial implementation,
> and that it is NOT well suited for modern super-scalar processors.
>
> Stefan
>
> PS: test program and logs
>
> [stefan@rome ~]$ cat bench.c
> // Copyright (C) 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
>
> #include <math.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <time.h>
>
> __attribute__((noinline))
> double nop(double foo, double bar)
> {
> return foo + bar;
> }
>
> inline static
> double lfsr64(void)
> {
> // 64-bit linear feedback shift register (Galois form) using
> // primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
> // initialised with 2**64 / golden ratio
>
> static uint64_t lfsr = 0x9E3779B97F4A7C15;
> const uint64_t sign = (int64_t) lfsr >> 63;
>
> lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
>
> return *(double *) &lfsr;
> }
>
> inline static
> double random64(void)
> {
> static uint64_t seed = 0x0123456789ABCDEF;
>
> seed = seed * 6364136223846793005 + 1442695040888963407;
>
> return *(double *) &seed;
> }
>
> int main(void)
> {
> clock_t t0, t1, t2, tt;
> uint32_t n;
> volatile double result;
>
> t0 = clock();
>
> for (n = 500000000u; n > 0u; n--) {
> result = nop(lfsr64(), 0.0);
> result = nop(random64(), 1.0 / 0.0);
> }
>
> t1 = clock();
>
> for (n = 500000000u; n > 0u; n--) {
> result = nextafter(lfsr64(), 0.0);
> result = nextafter(random64(), 1.0 / 0.0);
> }
>
> t2 = clock();
> tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;
>
> printf("\n"
> "nop() %4lu.%06lu 0\n"
> "nextafter() %4lu.%06lu %4lu.%06lu\n"
> " %4lu.%06lu nano-seconds\n",
> t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
> t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
> t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
> tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
> }
> [stefan@rome ~]$ gcc -O3 -lm bench.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop() 1.480000 0
> nextafter() 10.060000 8.580000
> 11.540000 nano-seconds
>
> Performance counter stats for './a.out':
>
> 11,548.78 msec task-clock:u # 1.000 CPUs utilized
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 145 page-faults:u # 0.013 K/sec
> 38,917,213,536 cycles:u # 3.370 GHz (83.33%)
> ~~~~~~~~~~~~~~
> 15,647,656,615 stalled-cycles-frontend:u # 40.21% frontend cycles idle (83.33%)
> 10,746,044,422 stalled-cycles-backend:u # 27.61% backend cycles idle (83.33%)
> 69,739,403,870 instructions:u # 1.79 insn per cycle
> ~~~~~~~~~~~~~~ ~~~~
> # 0.22 stalled cycles per insn (83.33%)
> 16,495,748,110 branches:u # 1428.354 M/sec (83.33%)
> 500,701,246 branch-misses:u # 3.04% of all branches (83.33%)
>
> 11.550414029 seconds time elapsed
>
> 11.548265000 seconds user
> 0.000999000 seconds sys
>
>
> [stefan@rome ~]$ gcc -O3 bench.c after.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop() 1.490000 0
> nextafter() 2.210000 0.720000
> 3.700000 nano-seconds
>
> Performance counter stats for './a.out':
>
> 3,702.89 msec task-clock:u # 1.000 CPUs utilized
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 122 page-faults:u # 0.033 K/sec
> 12,407,345,183 cycles:u # 3.351 GHz (83.32%)
> ~~~~~~~~~~~~~~
> 135,817 stalled-cycles-frontend:u # 0.00% frontend cycles idle (83.34%)
> 5,498,895,906 stalled-cycles-backend:u # 44.32% backend cycles idle (83.34%)
> 38,002,430,460 instructions:u # 3.06 insn per cycle
> ~~~~~~~~~~~~~~ ~~~~
> # 0.14 stalled cycles per insn (83.34%)
> 7,497,381,393 branches:u # 2024.735 M/sec (83.34%)
> 497,462 branch-misses:u # 0.01% of all branches (83.32%)
>
> 3.703648875 seconds time elapsed
>
> 3.703294000 seconds user
> 0.000000000 seconds sys
>
>
> [stefan@rome ~]$ cat after.s
> # Copyright © 2004-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
>
> .arch generic64
> .code64
> .intel_syntax noprefix
> .text
> # xmm0 = from
> # xmm1 = to
> nextafter:
> xorpd xmm2, xmm2 # xmm2 = 0.0
> ucomisd xmm1, xmm2 # CF = (to < 0.0)
> jp .Lto # to = NAN?
> sbb rax, rax # rax = (to < 0.0) ? -1 : 0
> ucomisd xmm0, xmm1 # CF = (from < to)
> jp .Lfrom # from = NAN?
> je .Lto # from = to?
> .Lnotequal:
> sbb rcx, rcx # rcx = (from < to) ? -1 : 0
> ucomisd xmm0, xmm2 # CF = (from < 0.0)
> jz .Lzero # from = 0.0?
> .Lnext:
> movq rdx, xmm0 # rdx = from
> sbb rax, rax # rax = (from < 0.0) ? -1 : 0
> xor rax, rcx # rax = (from < 0.0) = (from < to) ? 0 : -1
> or rax, 1 # rax = (from < 0.0) = (from < to) ? 1 : -1
> sub rdx, rax
> movq xmm0, rdx
> addsd xmm0, xmm2
> ret
> .Lzero:
> shl rax, 63 # rax = (to < -0.0) ? 0x8000000000000000 : 0
> or rax, 1 # rax = (to < -0.0) ? 0x8000000000000001 : 1
> movq xmm0, rax # xmm0 = (to < -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
> addsd xmm0, xmm2
> ret
> .Lto:
> movsd xmm0, xmm1 # xmm0 = to
> .Lfrom:
> ret
>
> .size nextafter, .-nextafter
> .type nextafter, @function
> .global nextafter
> .end
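
JFTR: the branch-free sign handling above works because ucomisd sets
CF exactly for "below" (and for unordered, which the jp catches
first), and sbb rax, rax computes rax - rax - CF, i.e. turns CF into
0 or -1 without a branch; xor-ing the two masks and or-ing in 1 then
yields the +1/-1 step directly, with no data-dependent jump left to
mispredict.
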
> [stefan@rome ~]$ perf stat ./a.out
>
> nop() 1.630000 0
> nextafter() 1.910000 0.280000
> 3.540000 nano-seconds
>
> Performance counter stats for './a.out':
>
> 3,547.12 msec task-clock:u # 1.000 CPUs utilized
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 123 page-faults:u # 0.035 K/sec
> 11,949,840,797 cycles:u # 3.369 GHz (83.32%)
> ~~~~~~~~~~~~~~
> 129,627 stalled-cycles-frontend:u # 0.00% frontend cycles idle (83.34%)
> 3,998,960,716 stalled-cycles-backend:u # 33.46% backend cycles idle (83.34%)
> 37,493,523,285 instructions:u # 3.14 insn per cycle
> ~~~~~~~~~~~~~~ ~~~~
> # 0.11 stalled cycles per insn (83.34%)
> 7,998,559,192 branches:u # 2254.945 M/sec (83.34%)
> 497,565 branch-misses:u # 0.01% of all branches (83.32%)
>
> 3.547999008 seconds time elapsed
>
> 3.546671000 seconds user
> 0.000999000 seconds sys
>
>
> [stefan@rome ~]$
>