From: "Stefan Kanthak" <stefan.kanthak@nexgo.de>
To: "Szabolcs Nagy" <szabolcs.nagy@arm.com>
Cc: <libc-help@sourceware.org>
Subject: Re: nextafter() about an order of magnitude slower than trivial implementation
Date: Wed, 18 Aug 2021 19:11:02 +0200
Message-ID: <BEF317B3F18B447F8BAE9B550BE5C237@H270>
In-Reply-To: <20210818125119.GH25257@arm.com>

Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:

> The 08/16/2021 18:03, Stefan Kanthak wrote:
>> Testing and benchmarking an exponential function that consumes
>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>
>> The following (quick&dirty) implementation consumes 0.7ns/call,
>> i.e. is about an order of magnitude faster:
>
> correctly measuring latency on a modern x86_64 core:
>
> musl: 3.16 ns
> glibc: 5.68 ns
> your: 5.72 ns

My benchmark measures (reciprocal) throughput, NOT latency.
As long as a routine like nextafter() is NOT called back-to-back,
but alongside other code, latency doesn't really matter.
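
To illustrate the difference, a minimal sketch (hypothetical helper
functions, not taken from the benchmark below): a latency measurement
chains each result into the next call, while a throughput measurement
feeds every call an independent input that an out-of-order core can
overlap.

--- latency-vs-throughput.c ---
#include <math.h>
#include <stddef.h>

// Latency: each call consumes the previous result, so the calls
// form a dependency chain and cannot overlap.
double chained(double r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r = nextafter(r, 0.0);
    return r;
}

// (Reciprocal) throughput: every call gets an independent input,
// so an out-of-order core keeps several calls in flight at once.
double independent(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += nextafter(x[i], 0.0);
    return sum;
}
--- EOF ---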

> i.e. your code is slower.

The overall execution time and the numbers from 'perf stat' show
the opposite: overall, the program executes
- 70 billion instructions in 38.9 billion cycles with glibc's
  implementation,
- 38 billion instructions in 12.4 billion cycles with my
  (quick&dirty) C implementation,
- 37.5 billion instructions in 12 billion cycles with my
  assembly implementation,
- 39.8 billion instructions in 22 billion cycles with musl's
  original implementation,
- 40.5 billion instructions in 11.7 billion cycles with my
  patch applied to musl's implementation.

The OVERALL numbers of cycles per call (in billions) are
38.9 (glibc) > 22 (musl) > 12.4 (C) > 12 (Assembly) > 11.7 (musl patched)

Subtract the (constant, but not measured) number of cycles the
test program needs for itself: I estimate this overhead at about
4..6 billion cycles.
After that subtraction, glibc's implementation consumes about
twice the number of cycles of musl's implementation, and about
5 times the cycles of a proper implementation.
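For example, assuming 5 billion cycles of overhead (my estimate, not
measured): (38.9 - 5) / (22 - 5) = 2.0 for glibc vs. musl, and
(38.9 - 5) / (12 - 5) = 4.8 for glibc vs. the assembly implementation.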

> (this is the correctly predicted hot path latency
> measuring for(i=0;i<n;i++) r = nextafter(r,0); )

See above: I'm interested in throughput, NOT latency.

> it is more difficult to measure the case when latency
> is not the bottleneck, since then it depends on what
> code is being executed alongside nextafter. your code
> likely wins in that case as it has fewer instructions.

Correct. And that's what counts in practice, as shown by the
cycle counts from 'perf stat'!

> what you see in your benchmark is the effect of
> branch mispredicts in the nextafter(x, INFINITY)
> case: your code avoids branching on the sign of x,
> while glibc branches and misses half the time with
> random x inputs. (it may be possible to fix glibc to
> avoid that branch and then it would be fast in your
> measurement too on a big ooo core)

It's not just mispredicted branches: the program executes
- 16.5 billion branches (0.5 B(!)illion mispredicted) with
  glibc's implementation,
- 7.5 billion branches (0.5 M(!)illion mispredicted) with my
  C implementation,
- 8 billion branches (0.5 M(!)illion mispredicted) with my
  assembly implementation,
- 9.7 billion branches (0.25 B(!)illion mispredicted) with
  musl's original implementation,
- 8.5 billion branches (0.5 M(!)illion mispredicted) with my
  patch applied to musl's implementation.

2 billion branches are executed by the benchmark loops themselves
and must be subtracted to get the net numbers per call of nextafter():
14.5 (glibc) > 7.7 (musl) > 6.5 (musl patched) > 6 (assembly) > 5.5 (C)

glibc's implementation executes almost twice the number of
branches of the other implementations.

The mispredicted branches (in billions):
0.5 (glibc) > 0.25 (musl) > 0.0005 (the proper implementations)
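Assuming a branch-miss penalty of roughly 16 cycles on this core (a
typical figure for Zen 2, not measured here), glibc's 0.5 mispredicts
per call alone cost about 0.5 * 16 = 8 cycles per call, a sizeable
part of its deficit.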

> (your code also does not handle fenv correctly and
> has aliasing violations, but these can be fixed)

I explicitly called it a QUICK&DIRTY (C) implementation.
The assembly code is correct in both respects.
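
For the record, a sketch of the same algorithm with the aliasing
violations removed via memcpy-based type punning (the fenv behaviour
is still NOT addressed here):

--- after-fixed.c ---
#include <string.h>

double nextafter(double from, double to)
{
    if (from == to)
        return to;

    if (to != to)        // to = NAN?
        return to;

    if (from != from)    // from = NAN?
        return from;

    if (from == 0.0)
        return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;

    unsigned long long ull;
    memcpy(&ull, &from, sizeof ull);   // well-defined type pun

    // Step the bit pattern towards 'to': decrement the magnitude
    // when moving towards zero, increment it when moving away.
    if ((from < to) == (from < 0.0))
        ull--;
    else
        ull++;

    memcpy(&from, &ull, sizeof from);
    return 0.0 + from;
}
--- EOF ---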

Stefan

> --- after.c ---
> double nextafter(double from, double to)
> {
>     if (from == to)
>         return to;
>
>     if (to != to)
>         return to;
>
>     if (from != from)
>         return from;
>
>     if (from == 0.0)
>         return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;
>
>     unsigned long long ull = *(unsigned long long *) &from;
>
>     if ((from < to) == (from < 0.0))
>         ull--;
>     else
>         ull++;
>
>     return 0.0 + *(double *) &ull;
> }
> --- EOF ---
>
> JFTR: the code generated by GCC for this function is FAR from
>       optimal; the assembly implementation shown below is more
>       than 2x faster and consumes <0.3ns/call, i.e. is about
>       30x faster than glibc's nextafter()!
>
> The data from 'perf stat' show that glibc's nextafter() executes
> about 30 instructions more than the above trivial implementation,
> and that it is NOT well suited for modern super-scalar processors.
>
> Stefan
>
> PS: test program and logs
>
> [stefan@rome ~]$ cat bench.c
> // Copyright (C) 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
>
> #include <math.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <time.h>
>
> __attribute__((noinline))
> double nop(double foo, double bar)
> {
>     return foo + bar;
> }
>
> inline static
> double lfsr64(void)
> {
>     // 64-bit linear feedback shift register (Galois form) using
>     //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
>     //   initialised with 2**64 / golden ratio
>
>     static uint64_t lfsr = 0x9E3779B97F4A7C15;
>     const  uint64_t sign = (int64_t) lfsr >> 63;
>
>     lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
>
>     return *(double *) &lfsr;
> }
>
> inline static
> double random64(void)
> {
>     static uint64_t seed = 0x0123456789ABCDEF;
>
>     seed = seed * 6364136223846793005 + 1442695040888963407;
>
>     return *(double *) &seed;
> }
>
> int main(void)
> {
>     clock_t t0, t1, t2, tt;
>     uint32_t n;
>     volatile double result;
>
>     t0 = clock();
>
>     for (n = 500000000u; n > 0u; n--) {
>         result = nop(lfsr64(), 0.0);
>         result = nop(random64(), 1.0 / 0.0);
>     }
>
>     t1 = clock();
>
>     for (n = 500000000u; n > 0u; n--) {
>         result = nextafter(lfsr64(), 0.0);
>         result = nextafter(random64(), 1.0 / 0.0);
>     }
>
>     t2 = clock();
>     tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;
>
>     printf("\n"
>            "nop()         %4lu.%06lu       0\n"
>            "nextafter()   %4lu.%06lu    %4lu.%06lu\n"
>            "              %4lu.%06lu nano-seconds\n",
>            t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
> }
> [stefan@rome ~]$ gcc -O3 -lm bench.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()            1.480000       0
> nextafter()     10.060000       8.580000
>                 11.540000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>          11,548.78 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                145      page-faults:u             #    0.013 K/sec
>     38,917,213,536      cycles:u                  #    3.370 GHz                      (83.33%)
>     ~~~~~~~~~~~~~~
>     15,647,656,615      stalled-cycles-frontend:u #   40.21% frontend cycles idle     (83.33%)
>     10,746,044,422      stalled-cycles-backend:u  #   27.61% backend cycles idle      (83.33%)
>     69,739,403,870      instructions:u            #    1.79  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.22  stalled cycles per insn  (83.33%)
>     16,495,748,110      branches:u                # 1428.354 M/sec                    (83.33%)
>        500,701,246      branch-misses:u           #    3.04% of all branches          (83.33%)
>
>       11.550414029 seconds time elapsed
>
>       11.548265000 seconds user
>        0.000999000 seconds sys
>
>
> [stefan@rome ~]$ gcc -O3 bench.c after.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()            1.490000       0
> nextafter()      2.210000       0.720000
>                  3.700000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>           3,702.89 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                122      page-faults:u             #    0.033 K/sec
>     12,407,345,183      cycles:u                  #    3.351 GHz                      (83.32%)
>     ~~~~~~~~~~~~~~
>            135,817      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      5,498,895,906      stalled-cycles-backend:u  #   44.32% backend cycles idle      (83.34%)
>     38,002,430,460      instructions:u            #    3.06  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.14  stalled cycles per insn  (83.34%)
>      7,497,381,393      branches:u                # 2024.735 M/sec                    (83.34%)
>            497,462      branch-misses:u           #    0.01% of all branches          (83.32%)
>
>        3.703648875 seconds time elapsed
>
>        3.703294000 seconds user
>        0.000000000 seconds sys
>
>
> [stefan@rome ~]$ cat after.s
> # Copyright © 2004-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
>
> .arch   generic64
> .code64
> .intel_syntax noprefix
> .text
>                                 # xmm0 = from
>                                 # xmm1 = to
> nextafter:
>         xorpd   xmm2, xmm2      # xmm2 = 0.0
>         ucomisd xmm1, xmm2      # CF = (to < 0.0)
>         jp      .Lto            # to = NAN?
>         sbb     rax, rax        # rax = (to < 0.0) ? -1 : 0
>         ucomisd xmm0, xmm1      # CF = (from < to)
>         jp      .Lfrom          # from = NAN?
>         je      .Lto            # from = to?
> .Lnotequal:
>         sbb     rcx, rcx        # rcx = (from < to) ? -1 : 0
>         ucomisd xmm0, xmm2      # CF = (from < 0.0)
>         jz      .Lzero          # from = 0.0?
> .Lnext:
>         movq    rdx, xmm0       # rdx = from
>         sbb     rax, rax        # rax = (from < 0.0) ? -1 : 0
>         xor     rax, rcx        # rax = (from < 0.0) = (from < to) ? 0 : -1
>         or      rax, 1          # rax = (from < 0.0) = (from < to) ? 1 : -1
>         sub     rdx, rax
>         movq    xmm0, rdx
>         addsd   xmm0, xmm2
>         ret
> .Lzero:
>         shl     rax, 63         # rax = (to < -0.0) ? 0x8000000000000000 : 0
>         or      rax, 1          # rax = (to < -0.0) ? 0x8000000000000001 : 1
>         movq    xmm0, rax       # xmm0 = (to < -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
>         addsd   xmm0, xmm2
>         ret
> .Lto:
>         movsd   xmm0, xmm1      # xmm0 = to
> .Lfrom:
>         ret
>
> .size   nextafter, .-nextafter
> .type   nextafter, @function
> .global nextafter
> .end
> [stefan@rome ~]$ gcc -O3 bench.c after.s
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()            1.630000       0
> nextafter()      1.910000       0.280000
>                  3.540000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>           3,547.12 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                123      page-faults:u             #    0.035 K/sec
>     11,949,840,797      cycles:u                  #    3.369 GHz                      (83.32%)
>     ~~~~~~~~~~~~~~
>            129,627      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      3,998,960,716      stalled-cycles-backend:u  #   33.46% backend cycles idle      (83.34%)
>     37,493,523,285      instructions:u            #    3.14  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.11  stalled cycles per insn  (83.34%)
>      7,998,559,192      branches:u                # 2254.945 M/sec                    (83.34%)
>            497,565      branch-misses:u           #    0.01% of all branches          (83.32%)
>
>        3.547999008 seconds time elapsed
>
>        3.546671000 seconds user
>        0.000999000 seconds sys
>
>
> [stefan@rome ~]$
>
