From: "Stefan Kanthak"
To: "Szabolcs Nagy"
Cc: libc-help@sourceware.org
Subject: Re: nextafter() about an order of magnitude slower than trivial implementation
Date: Wed, 18 Aug 2021 19:11:02 +0200
In-Reply-To: <20210818125119.GH25257@arm.com>

Szabolcs Nagy wrote:

> The 08/16/2021 18:03, Stefan Kanthak wrote:
>> Testing and benchmarking an exponential function that consumes
>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>
>> The following (quick&dirty) implementation consumes 0.7ns/call,
>> i.e. is about an order of magnitude faster:
>
> correctly measuring latency on a modern x86_64 core:
>
> musl:  3.16 ns
> glibc: 5.68 ns
> your:  5.72 ns

My benchmark measures (reciprocal) throughput, NOT latency. As long as
a routine like nextafter() is NOT called back-to-back, but alongside
other code, latency doesn't really matter.

> i.e. your code is slower.

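(To make the throughput/latency distinction concrete, here is a minimal
sketch; the two loops below are only an illustration, not the code of
either benchmark:

#include <math.h>
#include <stddef.h>

/* latency: every call depends on the result of the previous one,
   so call i+1 cannot start before call i has retired */
double chained(double x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x = nextafter(x, 1.0 / 0.0);
    return x;
}

/* (reciprocal) throughput: the inputs are independent, so an
   out-of-order core can overlap many calls in flight */
void streamed(const double *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = nextafter(src[i], 1.0 / 0.0);
}

The first loop corresponds to a latency measurement like the one quoted
above, the second to a throughput measurement like bench.c below.)
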
The overall execution time and the numbers from 'perf stat' show the
contrary: the program executes overall

- 70   billion instructions in 38.9 billion cycles with glibc's implementation,
- 38   billion instructions in 12.4 billion cycles with my (quick&dirty) C implementation,
- 37.5 billion instructions in 12   billion cycles with my assembly implementation,
- 39.8 billion instructions in 22   billion cycles with musl's original implementation,
- 40.5 billion instructions in 11.7 billion cycles with my patch applied to musl's implementation.

The OVERALL numbers of cycles per call are

38.9 (glibc) > 22 (musl) > 12.4 (C) > 12 (Assembly) > 11.7 (musl patched)

Subtract the (constant but unknown, not measured) number of cycles for
the program itself: I estimate the overhead to be about 4..6 billion
cycles. glibc's implementation then consumes about 2.5x the number of
cycles of musl's implementation, and about 5x the number of cycles of
a proper implementation.

> (this is the correctly predicted hot path latency
> measuring for(i=0;i
>
> it is more difficult to measure the case when latency
> is not the bottleneck, since then it depends on what
> code is being executed along nextafter. your code
> likely wins in that case as it has less instructions.

Correct. And that's what counts in practice, as shown by 'perf stat'
and the number of consumed cycles!

> what you see in your benchmark is the effect of
> branch mispredicts in the nextafter(x, INFINITY)
> case: your code avoids branching on the sign of x,
> while glibc branches and misses half the time with
> random x inputs. (it maybe possible to fix glibc to
> avoid that branch and then it would be fast in your
> measurement too on a big ooo core)

It's not just mispredicted branches: the program executes

- 16.5 billion branches (0.5 B(!)illion mispredicted) with glibc's implementation,
-  7.5 billion branches (0.5 M(!)illion mispredicted) with my C implementation,
-  8   billion branches (0.5 M(!)illion mispredicted) with my assembly implementation,
-  9.7 billion branches (0.25 B(!)illion mispredicted) with musl's original implementation,
-  8.5 billion branches (0.5 M(!)illion mispredicted) with my patch applied to musl's implementation.

2 billion branches are executed for the loops and should be subtracted
to get the net numbers per call of nextafter():

14.5 (glibc) > 9.7 (musl) > 8.5 (musl patched) > 8 (assembly) > 7.5 (C)

glibc's implementation executes almost twice the number of branches of
the other implementations. The mispredicted branches:

0.5 (glibc) > 0.25 (musl) > 0.0001 (the proper implementations)

> (your code also does not handle fenv correctly and
> has aliasing violations, but these can be fixed)

I explicitly wrote that it is a QUICK&DIRTY (C) implementation.
The assembly code is correct in both cases.

Stefan

> --- after.c ---
> double nextafter(double from, double to)
> {
>     if (from == to)
>         return to;
>
>     if (to != to)
>         return to;
>
>     if (from != from)
>         return from;
>
>     if (from == 0.0)
>         return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;
>
>     unsigned long long ull = *(unsigned long long *) &from;
>
>     if ((from < to) == (from < 0.0))
>         ull--;
>     else
>         ull++;
>
>     return 0.0 + *(double *) &ull;
> }
> --- EOF ---
>
> JFTR: the code generated by GCC for this function is FAR from
>       optimal; the assembly implementation shown below is more
>       than 2x faster and consumes <0.3ns/call, i.e. is about
>       30x faster than glibc's nextafter()!

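JFTR: the aliasing violation mentioned above can be removed without
changing the logic, e.g. by doing the type pun with memcpy(). This is
only a sketch, NOT the code that was benchmarked, and the fenv
behaviour mentioned above is still not addressed:

#include <stdint.h>
#include <string.h>

double nextafter(double from, double to)
{
    if (from == to)
        return to;

    if (to != to)                       /* to is a NAN? */
        return to;

    if (from != from)                   /* from is a NAN? */
        return from;

    if (from == 0.0)
        return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;

    uint64_t bits;
    memcpy(&bits, &from, sizeof bits);  /* well-defined type pun */

    if ((from < to) == (from < 0.0))
        bits--;
    else
        bits++;

    memcpy(&from, &bits, sizeof from);
    return 0.0 + from;
}

GCC and clang typically compile an 8-byte memcpy() like this into a
plain register move, so it should not cost extra instructions.
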
>
> The data from 'perf stat' show that glibc's nextafter() executes
> about 30 instructions more than the above trivial implementation,
> and that it is NOT well suited for modern super-scalar processors.
>
> Stefan
>
> PS: test program and logs
>
> [stefan@rome ~]$ cat bench.c
> // Copyright (C) 2005-2021, Stefan Kanthak
>
> #include <math.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <time.h>
>
> __attribute__((noinline))
> double nop(double foo, double bar)
> {
>     return foo + bar;
> }
>
> inline static
> double lfsr64(void)
> {
>     // 64-bit linear feedback shift register (Galois form) using
>     // primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
>     // initialised with 2**64 / golden ratio
>
>     static uint64_t lfsr = 0x9E3779B97F4A7C15;
>     const uint64_t sign = (int64_t) lfsr >> 63;
>
>     lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
>
>     return *(double *) &lfsr;
> }
>
> inline static
> double random64(void)
> {
>     static uint64_t seed = 0x0123456789ABCDEF;
>
>     seed = seed * 6364136223846793005 + 1442695040888963407;
>
>     return *(double *) &seed;
> }
>
> int main(void)
> {
>     clock_t t0, t1, t2, tt;
>     uint32_t n;
>     volatile double result;
>
>     t0 = clock();
>
>     for (n = 500000000u; n > 0u; n--) {
>         result = nop(lfsr64(), 0.0);
>         result = nop(random64(), 1.0 / 0.0);
>     }
>
>     t1 = clock();
>
>     for (n = 500000000u; n > 0u; n--) {
>         result = nextafter(lfsr64(), 0.0);
>         result = nextafter(random64(), 1.0 / 0.0);
>     }
>
>     t2 = clock();
>     tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;
>
>     printf("\n"
>            "nop()       %4lu.%06lu          0\n"
>            "nextafter() %4lu.%06lu %4lu.%06lu\n"
>            "            %4lu.%06lu nano-seconds\n",
>            t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
> }
> [stefan@rome ~]$ gcc -O3 -lm bench.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()          1.480000          0
> nextafter()   10.060000   8.580000
>               11.540000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>          11,548.78 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                145      page-faults:u             #    0.013 K/sec
>     38,917,213,536      cycles:u                  #    3.370 GHz                      (83.33%)
>     ~~~~~~~~~~~~~~
>     15,647,656,615      stalled-cycles-frontend:u #   40.21% frontend cycles idle     (83.33%)
>     10,746,044,422      stalled-cycles-backend:u  #   27.61% backend cycles idle      (83.33%)
>     69,739,403,870      instructions:u            #    1.79  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.22  stalled cycles per insn  (83.33%)
>     16,495,748,110      branches:u                # 1428.354 M/sec                    (83.33%)
>        500,701,246      branch-misses:u           #    3.04% of all branches          (83.33%)
>
>       11.550414029 seconds time elapsed
>
>       11.548265000 seconds user
>        0.000999000 seconds sys
>
>
> [stefan@rome ~]$ gcc -O3 bench.c after.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()          1.490000          0
> nextafter()    2.210000   0.720000
>                3.700000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>           3,702.89 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                122      page-faults:u             #    0.033 K/sec
>     12,407,345,183      cycles:u                  #    3.351 GHz                      (83.32%)
>     ~~~~~~~~~~~~~~
>            135,817      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      5,498,895,906      stalled-cycles-backend:u  #   44.32% backend cycles idle      (83.34%)
>     38,002,430,460      instructions:u            #    3.06  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.14  stalled cycles per insn  (83.34%)
>      7,497,381,393      branches:u                # 2024.735 M/sec                    (83.34%)
>            497,462      branch-misses:u           #    0.01% of all branches          (83.32%)
>
>        3.703648875 seconds time elapsed
>
>        3.703294000 seconds user
>        0.000000000 seconds sys
>
>
> [stefan@rome ~]$ cat after.s
> # Copyright © 2004-2021, Stefan Kanthak
>
>         .arch           generic64
>         .code64
>         .intel_syntax   noprefix
>         .text
>                                         # xmm0 = from
>                                         # xmm1 = to
> nextafter:
>         xorpd   xmm2, xmm2              # xmm2 = 0.0
>         ucomisd xmm1, xmm2              # CF = (to < 0.0)
>         jp      .Lto                    # to = NAN?
>         sbb     rax, rax                # rax = (to < 0.0) ? -1 : 0
>         ucomisd xmm0, xmm1              # CF = (from < to)
>         jp      .Lfrom                  # from = NAN?
>         je      .Lto                    # from = to?
> .Lnotequal:
>         sbb     rcx, rcx                # rcx = (from < to) ? -1 : 0
>         ucomisd xmm0, xmm2              # CF = (from < 0.0)
>         jz      .Lzero                  # from = 0.0?
> .Lnext:
>         movq    rdx, xmm0               # rdx = from
>         sbb     rax, rax                # rax = (from < 0.0) ? -1 : 0
>         xor     rax, rcx                # rax = (from < 0.0) = (from < to) ? 0 : -1
>         or      rax, 1                  # rax = (from < 0.0) = (from < to) ? 1 : -1
>         sub     rdx, rax
>         movq    xmm0, rdx
>         addsd   xmm0, xmm2
>         ret
> .Lzero:
>         shl     rax, 63                 # rax = (to < -0.0) ? 0x8000000000000000 : 0
>         or      rax, 1                  # rax = (to < -0.0) ? 0x8000000000000001 : 1
>         movq    xmm0, rax               # xmm0 = (to < -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
>         addsd   xmm0, xmm2
>         ret
> .Lto:
>         movsd   xmm0, xmm1              # xmm0 = to
> .Lfrom:
>         ret
>
>         .size   nextafter, .-nextafter
>         .type   nextafter, @function
>         .global nextafter
>         .end
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()          1.630000          0
> nextafter()    1.910000   0.280000
>                3.540000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>           3,547.12 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                123      page-faults:u             #    0.035 K/sec
>     11,949,840,797      cycles:u                  #    3.369 GHz                      (83.32%)
>     ~~~~~~~~~~~~~~
>            129,627      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      3,998,960,716      stalled-cycles-backend:u  #   33.46% backend cycles idle      (83.34%)
>     37,493,523,285      instructions:u            #    3.14  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.11  stalled cycles per insn  (83.34%)
>      7,998,559,192      branches:u                # 2254.945 M/sec                    (83.34%)
>            497,565      branch-misses:u           #    0.01% of all branches          (83.32%)
>
>        3.547999008 seconds time elapsed
>
>        3.546671000 seconds user
>        0.000999000 seconds sys
>
>
> [stefan@rome ~]$
> --
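
JFTR: each timed loop in bench.c performs 2 calls per iteration and
500,000,000 iterations, i.e. 10^9 nextafter() calls, so the totals
printed above translate into per-call times directly:

    glibc:    10.06 s / 10^9 calls = 10.06 ns/call  (8.58 ns/call net of the nop()+PRNG baseline)
    after.c:   2.21 s / 10^9 calls =  2.21 ns/call  (0.72 ns/call net)
    after.s:   1.91 s / 10^9 calls =  1.91 ns/call  (0.28 ns/call net)

Dividing the 'perf stat' cycle counts by the same 10^9 calls gives the
overall cycles per call quoted at the top of this mail.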