public inbox for libc-help@sourceware.org
* nextafter() about an order of magnitude slower than trivial implementation
@ 2021-08-16 16:03 Stefan Kanthak
  2021-08-18 12:51 ` Szabolcs Nagy
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-16 16:03 UTC (permalink / raw)
  To: libc-help

Hi,

to test the accuracy of (elementary) floating-point functions
against a higher precision reference I use the following code:

    double input = 1.0, output = exp(input), reference = M_E;

    if (output == reference)
        printf("correctly rounded\n");
    else if (nextafter(output, reference) == reference)
        printf("faithfully rounded\n");
    else
        printf("...");

Testing and benchmarking an exponential function that consumes
about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
itself is DEAD SLOW: it consumes about 8.5ns/call!

The following (quick&dirty) implementation consumes 0.7ns/call,
i.e. is about an order of magnitude faster:

--- after.c ---
double nextafter(double from, double to)
{
    if (from == to)
        return to;

    if (to != to)
        return to;

    if (from != from)
        return from;

    if (from == 0.0)
        return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;

    unsigned long long ull = *(unsigned long long *) &from;

    if ((from < to) == (from < 0.0))
        ull--;
    else
        ull++;

    return 0.0 + *(double *) &ull;
}
--- EOF ---
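
For reference, a minimal sanity check of the above (hypothetical, not
part of the measurements below; DBL_TRUE_MIN is C11's name for
0x1.0p-1074, and -fno-builtin keeps GCC from folding the calls):

--- check.c ---
#include <assert.h>
#include <float.h>
#include <math.h>

/* build: gcc -fno-builtin check.c after.c */
int main(void)
{
    assert(nextafter(1.0, 2.0) == 1.0 + DBL_EPSILON);     /* one ulp up              */
    assert(nextafter(1.0, 0.0) == 1.0 - DBL_EPSILON / 2); /* one ulp down, across    */
                                                          /*  the binade boundary    */
    assert(nextafter(0.0, -1.0) == -DBL_TRUE_MIN);        /* smallest subnormal,     */
                                                          /*  with the sign of 'to'  */
    assert(nextafter(5.0, 5.0) == 5.0);                   /* from == to returns to   */
    return 0;
}
--- EOF ---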

JFTR: the code generated by GCC for this function is FAR from
      optimal; the assembly implementation shown below is more
      than 2x faster and consumes <0.3ns/call, i.e. is about
      30x faster than glibc's nextafter()!

The data from 'perf stat' show that glibc's nextafter() executes
about 30 instructions more per call than the above trivial
implementation (69.7 vs. 38.0 billion instructions for the same
1 billion calls), and that it is NOT well suited for modern
super-scalar processors.

Stefan

PS: test program and logs

[stefan@rome ~]$ cat bench.c
// Copyright (C) 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

__attribute__((noinline))
double nop(double foo, double bar)
{
    return foo + bar;
}

inline static
double lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    //   initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const  uint64_t sign = (int64_t) lfsr >> 63;

    lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);

    return *(double *) &lfsr;
}

inline static
double random64(void)
{
    static uint64_t seed = 0x0123456789ABCDEF;

    seed = seed * 6364136223846793005 + 1442695040888963407;

    return *(double *) &seed;
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t n;
    volatile double result;

    t0 = clock();

    for (n = 500000000u; n > 0u; n--) {
        result = nop(lfsr64(), 0.0);
        result = nop(random64(), 1.0 / 0.0);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--) {
        result = nextafter(lfsr64(), 0.0);
        result = nextafter(random64(), 1.0 / 0.0);
    }

    t2 = clock();
    tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;

    printf("\n"
           "nop()         %4lu.%06lu       0\n"
           "nextafter()   %4lu.%06lu    %4lu.%06lu\n"
           "              %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
[stefan@rome ~]$ gcc -O3 -lm bench.c
[stefan@rome ~]$ perf stat ./a.out

nop()            1.480000       0
nextafter()     10.060000       8.580000
                11.540000 nano-seconds

 Performance counter stats for './a.out':

         11,548.78 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               145      page-faults:u             #    0.013 K/sec
    38,917,213,536      cycles:u                  #    3.370 GHz                      (83.33%)
    ~~~~~~~~~~~~~~
    15,647,656,615      stalled-cycles-frontend:u #   40.21% frontend cycles idle     (83.33%)
    10,746,044,422      stalled-cycles-backend:u  #   27.61% backend cycles idle      (83.33%)
    69,739,403,870      instructions:u            #    1.79  insn per cycle
    ~~~~~~~~~~~~~~                                     ~~~~
                                                  #    0.22  stalled cycles per insn  (83.33%)
    16,495,748,110      branches:u                # 1428.354 M/sec                    (83.33%)
       500,701,246      branch-misses:u           #    3.04% of all branches          (83.33%)

      11.550414029 seconds time elapsed

      11.548265000 seconds user
       0.000999000 seconds sys


[stefan@rome ~]$ gcc -O3 bench.c after.c
[stefan@rome ~]$ perf stat ./a.out

nop()            1.490000       0
nextafter()      2.210000       0.720000
                 3.700000 nano-seconds

 Performance counter stats for './a.out':

          3,702.89 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               122      page-faults:u             #    0.033 K/sec
    12,407,345,183      cycles:u                  #    3.351 GHz                      (83.32%)
    ~~~~~~~~~~~~~~
           135,817      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
     5,498,895,906      stalled-cycles-backend:u  #   44.32% backend cycles idle      (83.34%)
    38,002,430,460      instructions:u            #    3.06  insn per cycle
    ~~~~~~~~~~~~~~                                     ~~~~
                                                  #    0.14  stalled cycles per insn  (83.34%)
     7,497,381,393      branches:u                # 2024.735 M/sec                    (83.34%)
           497,462      branch-misses:u           #    0.01% of all branches          (83.32%)

       3.703648875 seconds time elapsed

       3.703294000 seconds user
       0.000000000 seconds sys


[stefan@rome ~]$ cat after.s
# Copyright © 2004-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>

.arch   generic64
.code64
.intel_syntax noprefix
.text
                                # xmm0 = from
                                # xmm1 = to
nextafter:
        xorpd   xmm2, xmm2      # xmm2 = 0.0
        ucomisd xmm1, xmm2      # CF = (to < 0.0)
        jp      .Lto            # to = NAN?
        sbb     rax, rax        # rax = (to < 0.0) ? -1 : 0
        ucomisd xmm0, xmm1      # CF = (from < to)
        jp      .Lfrom          # from = NAN?
        je      .Lto            # from = to?
.Lnotequal:
        sbb     rcx, rcx        # rcx = (from < to) ? -1 : 0
        ucomisd xmm0, xmm2      # CF = (from < 0.0)
        jz      .Lzero          # from = 0.0?
.Lnext:
        movq    rdx, xmm0       # rdx = from
        sbb     rax, rax        # rax = (from < 0.0) ? -1 : 0
        xor     rax, rcx        # rax = (from < 0.0) = (from < to) ? 0 : -1
        or      rax, 1          # rax = (from < 0.0) = (from < to) ? 1 : -1
        sub     rdx, rax
        movq    xmm0, rdx
        addsd   xmm0, xmm2
        ret
.Lzero:
        shl     rax, 63         # rax = (to < -0.0) ? 0x8000000000000000 : 0
        or      rax, 1          # rax = (to < -0.0) ? 0x8000000000000001 : 1
        movq    xmm0, rax       # xmm0 = (to < -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
        addsd   xmm0, xmm2
        ret
.Lto:
        movsd   xmm0, xmm1      # xmm0 = to
.Lfrom:
        ret

.size   nextafter, .-nextafter
.type   nextafter, @function
.global nextafter
.end
[stefan@rome ~]$ perf stat ./a.out

nop()            1.630000       0
nextafter()      1.910000       0.280000
                 3.540000 nano-seconds

 Performance counter stats for './a.out':

          3,547.12 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               123      page-faults:u             #    0.035 K/sec
    11,949,840,797      cycles:u                  #    3.369 GHz                      (83.32%)
    ~~~~~~~~~~~~~~
           129,627      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
     3,998,960,716      stalled-cycles-backend:u  #   33.46% backend cycles idle      (83.34%)
    37,493,523,285      instructions:u            #    3.14  insn per cycle
    ~~~~~~~~~~~~~~                                     ~~~~
                                                  #    0.11  stalled cycles per insn  (83.34%)
     7,998,559,192      branches:u                # 2254.945 M/sec                    (83.34%)
           497,565      branch-misses:u           #    0.01% of all branches          (83.32%)

       3.547999008 seconds time elapsed

       3.546671000 seconds user
       0.000999000 seconds sys


[stefan@rome ~]$



* Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-16 16:03 nextafter() about an order of magnitude slower than trivial implementation Stefan Kanthak
@ 2021-08-18 12:51 ` Szabolcs Nagy
  2021-08-18 17:11   ` Stefan Kanthak
  0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2021-08-18 12:51 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: libc-help

The 08/16/2021 18:03, Stefan Kanthak wrote:
> Testing and benchmarking an exponential function that consumes
> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
> itself is DEAD SLOW: it consumes about 8.5ns/call!
> 
> The following (quick&dirty) implementation consumes 0.7ns/call,
> i.e. is about an order of magnitude faster:

correctly measuring latency on a modern x86_64 core:

musl: 3.16 ns
glibc: 5.68 ns
your: 5.72 ns

i.e. your code is slower.
(this is the correctly predicted hot path latency
measuring for(i=0;i<n;i++) r = nextafter(r,0); )

it is more difficult to measure the case when latency
is not the bottleneck, since then it depends on what
code is being executed alongside nextafter. your code
likely wins in that case as it has fewer instructions.
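
for illustration, the two measurement patterns look roughly
like this (a sketch only, not the exact harness behind the
numbers above):

    #include <math.h>

    /* latency: each call depends on the previous result,
       so the dependency chain is what gets timed */
    double latency_chain (int n)
    {
        double r = 1.0;
        for (int i = 0; i < n; i++)
            r = nextafter (r, 0.0);
        return r;
    }

    /* throughput: calls are independent, the core can overlap
       them, so mispredicts and instruction count dominate */
    double throughput_chain (const double *in, int n)
    {
        volatile double sink = 0.0;
        for (int i = 0; i < n; i++)
            sink = nextafter (in[i], 0.0);
        return sink;
    }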

what you see in your benchmark is the effect of
branch mispredicts in the nextafter(x, INFINITY)
case: your code avoids branching on the sign of x,
while glibc branches and misses half the time with
random x inputs. (it may be possible to fix glibc to
avoid that branch and then it would be fast in your
measurement too on a big ooo core)

(your code also does not handle fenv correctly and
has aliasing violations, but these can be fixed)
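
for example, the aliasing violation can be fixed with
memcpy-based bit copies (a sketch of the usual approach;
the fenv behaviour needs separate care):

    #include <stdint.h>
    #include <string.h>

    /* well-defined ways to reinterpret the bits of a double */
    static inline uint64_t asuint64 (double x)
    {
        uint64_t u;
        memcpy (&u, &x, sizeof u);   /* memcpy does not violate aliasing */
        return u;
    }

    static inline double asdouble (uint64_t u)
    {
        double x;
        memcpy (&x, &u, sizeof x);
        return x;
    }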



* Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-18 12:51 ` Szabolcs Nagy
@ 2021-08-18 17:11   ` Stefan Kanthak
  2021-08-19 11:20     ` Adhemerval Zanella
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-18 17:11 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: libc-help

Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:

> The 08/16/2021 18:03, Stefan Kanthak wrote:
>> Testing and benchmarking an exponential function that consumes
>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>
>> The following (quick&dirty) implementation consumes 0.7ns/call,
>> i.e. is about an order of magnitude faster:
>
> correctly measuring latency on a modern x86_64 core:
>
> musl: 3.16 ns
> glibc: 5.68 ns
> your: 5.72 ns

My benchmark measures (reciprocal) throughput, NOT latency.
As long as a routine like nextafter() is NOT called back-to-back,
but alongside other code, latency doesn't really matter.

> i.e. your code is slower.

The overall execution time and the numbers from 'perf stat' show
the contrary: the program executes overall
- 70 billion instructions in 38.9 billion cycles with glibc's
  implementation,
- 38 billion instructions in 12.4 billion cycles with my
  (quick&dirty) C implementation,
- 37.5 billion instructions in 12 billion cycles with my
  assembly implementation,
- 39.8 billion instructions in 22 billion cycles with musl's
  original implementation,
- 40.5 billion instructions in 11.7 billion cycles with my
  patch applied to musl's implementation.

The OVERALL cycles per call (total cycles divided by the 1 billion
calls, loop overhead included) are:
38.9 (glibc) > 22 (musl) > 12.4 (C) > 12 (Assembly) > 11.7 (musl patched)

Subtract the (constant but unknown, not measured) number of
cycles for the program itself: I estimate the overhead to be
about 4..6 billion cycles.
glibc's implementation consumes about 2.5x the number of cycles
of musl's implementation, and about 5x the number of cycles of
a proper implementation.
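
A rough per-call breakdown under that assumption (taking 5 billion
cycles of overhead, the midpoint of the estimate above, and 1 billion
calls, i.e. 2 x 500 million iterations):

    glibc          (38.9 - 5) / 1 ~= 34   cycles/call
    musl           (22.0 - 5) / 1 ~= 17   cycles/call
    C              (12.4 - 5) / 1 ~=  7.4 cycles/call
    assembly       (12.0 - 5) / 1 ~=  7.0 cycles/call
    musl patched   (11.7 - 5) / 1 ~=  6.7 cycles/call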

> (this is the correctly predicted hot path latency
> measuring for(i=0;i<n;i++) r = nextafter(r,0); )

See above: I'm interested in throughput, NOT latency.

> it is more difficult to measure the case when latency
> is not the bottleneck, since then it depends on what
> code is being executed alongside nextafter. your code
> likely wins in that case as it has fewer instructions.

Correct. And that's what counts in practice, as shown by the
cycle counts from 'perf stat'!

> what you see in your benchmark is the effect of
> branch mispredicts in the nextafter(x, INFINITY)
> case: your code avoids branching on the sign of x,
> while glibc branches and misses half the time with
> random x inputs. (it may be possible to fix glibc to
> avoid that branch and then it would be fast in your
> measurement too on a big ooo core)

It's not just mispredicted branches: the program executes
- 16.5 billion branches (0.5 B(!)illion mispredicted) with
  glibc's implementation,
- 7.5 billion branches (0.5 M(!)illion mispredicted) with my
  C implementation,
- 8 billion branches (0.5 M(!)illion mispredicted) with my
  assembly implementation,
- 9.7 billion branches (0.25 B(!)illion mispredicted) with
  musl's original implementation,
- 8.5 billion branches (0.5 M(!)illion mispredicted) with my
  patch applied to musl's implementation.

2 billion branches are executed for the loops themselves and should
be subtracted to get the net numbers per call of nextafter():
14.5 (glibc) > 7.7 (musl) > 6.5 (musl patched) > 6 (assembly) > 5.5 (C)

glibc's implementation executes almost twice the number of
branches of the other implementations.

The mispredicted branches (in billions):
0.5 (glibc) > 0.25 (musl) > ~0.0005 (the proper implementations)

> (your code also does not handle fenv correctly and
> has aliasing violations, but these can be fixed)

I explicitly called it a QUICK&DIRTY (C) implementation.
The assembly code is correct in both cases.

Stefan


* Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-18 17:11   ` Stefan Kanthak
@ 2021-08-19 11:20     ` Adhemerval Zanella
  2021-08-19 17:57       ` [PATCH] " Stefan Kanthak
  2021-08-20  9:52       ` [PATCH/2nd version] " Stefan Kanthak
  0 siblings, 2 replies; 10+ messages in thread
From: Adhemerval Zanella @ 2021-08-19 11:20 UTC (permalink / raw)
  To: Stefan Kanthak, Szabolcs Nagy; +Cc: libc-help



On 18/08/2021 14:11, Stefan Kanthak wrote:
> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> 
>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>> Testing and benchmarking an exponential function that consumes
>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>
>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>> i.e. is about an order of magnitude faster:
>>
>> correctly measuring latency on a modern x86_64 core:
>>
>> musl: 3.16 ns
>> glibc: 5.68 ns
>> your: 5.72 ns

Thanks for bringing this up; if you want to contribute a patch, please
follow the Contribution checklist [1].  We recently dropped the
requirement of FSF copyright assignment, so you can use a DCO-like
certification on the patches.

To change the current implementation I suggest you also provide a
benchmark using the glibc benchmark framework.  Some math functions
provide both latency and reciprocal-throughput numbers, and with both
we can evaluate whether the new implementation is indeed better on
different machines.
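
For a two-argument function like nextafter the input file would look
roughly like the sketch below (hypothetical; please check the existing
*-inputs files and benchtests/README for the exact directives and the
workload naming used for reciprocal-throughput measurements):

  ## args: double:double
  ## ret: double
  ## includes: math.h
  # random finite inputs, one call per line, arguments comma-separated
  0x1.23456789abcdep+0, 0x0p+0
  -0x1.8p+1, 0x1.8p+1
  0x1.fedcba9876543p-3, 0x1p+10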

I would just like to ask you to keep the tone respectful and be open
to suggestions and criticism, so you do not repeat the same derailed
thread as on the musl mailing list [2].  Szabolcs and Wilco did an
excellent job on some of the newer math function implementations; you
might read those old threads to see their approach.

[1] https://sourceware.org/glibc/wiki/Contribution%20checklist
[2] https://www.openwall.com/lists/musl/2021/08/15/18


* [PATCH] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-19 11:20     ` Adhemerval Zanella
@ 2021-08-19 17:57       ` Stefan Kanthak
  2021-08-20  9:52       ` [PATCH/2nd version] " Stefan Kanthak
  1 sibling, 0 replies; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-19 17:57 UTC (permalink / raw)
  To: Szabolcs Nagy, Adhemerval Zanella; +Cc: libc-help, libc-alpha

[-- Attachment #1: Type: text/plain, Size: 3645 bytes --]

Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> On 18/08/2021 14:11, Stefan Kanthak wrote:
>> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>> 
>>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>>> Testing and benchmarking an exponential function that consumes
>>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>>
>>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>>> i.e. is about an order of magnitude faster:
>>>
>>> correctly measuring latency on a modern x86_64 core:
>>>
>>> musl: 3.16 ns
>>> glibc: 5.68 ns
>>> your: 5.72 ns
> 
> Thanks for bringing this up, if you want to contribute a patch

See attachment.
For completeness' sake and to ease an initial review, I post the full
source here too: without ALTERNATE defined, it uses integer arithmetic
to the extent possible, like the original.
Choose whichever you like best or whichever gives better
throughput/latency.

Stefan

JFTR: .../sysdeps/ieee754/flt-32/s_nextafterf.c as well as
      .../sysdeps/i386/fpu/s_nextafterl.c will need the same changes.

--- .../math/s_nextafter.c ---
/* IEEE functions
 *        nextafter(x,y)
 *        return the next machine floating-point number of x in the
 *        direction toward y.
 *   Special cases:
 */

/* Ugly hack so that the aliasing works.  */
#define __nexttoward __internal___nexttoward
#define nexttoward __internal_nexttoward

#include <errno.h>
#include <math.h>
#include <math-barriers.h>
#include <math_private.h>
#include <float.h>
#include <libm-alias-double.h>

double __nextafter(double x, double y)
{
        union {
            double d;
            int64_t s;
            uint64_t u;
        } xx = {x}, yy = {y};

        uint64_t ax = xx.u + xx.u;      /* |x|<<1 */
        uint64_t ay = yy.u + yy.u;      /* |y|<<1 */

        if(ax > 0x7ffull << 53)         /* x is nan */
            return x;
        if(ay > 0x7ffull << 53)         /* y is nan */
            return y;
        if(x == y) return y;            /* x=y, return y */
        if(ax == 0ull) {                /* x == 0 */
            xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
            xx.d = math_opt_barrier (xx.d);
            xx.d += 0.0;
            math_force_eval (xx.d);     /* raise underflow flag */
            return xx.d;                /* return +-minsubnormal */
        }
#ifdef ALTERNATE
        if((x < y) == (x < 0.0))
            --xx.u;
        else
            ++xx.u;
#else
        if(xx.s < 0) {                  /* x < 0 */
            if(yy.s < 0 && xx.s < yy.s)
                ++xx.s;                 /* x < 0, x > y, x += ulp */
            else
                --xx.s;                 /* x < 0, x < y, x -= ulp */
        } else {                        /* x > 0 */
            if(xx.s > yy.s)
                --xx.s;                 /* x > 0, x > y, x -= ulp */
            else
                ++xx.s;                 /* x > 0, x < y, x += ulp */
        }
#endif
        ax = xx.u + xx.u;               /* |x|<<1 */
        if((ax >= 0x7ffull << 53)       /* overflow  */
        || (ax < 0x001ull << 53)) {     /* underflow */
          xx.d = math_opt_barrier (xx.d);
          xx.d += 0.0;
          math_force_eval (xx.d);       /* raise overflow/underflow flag */
          __set_errno (ERANGE);
        }
        return xx.d;
}
libm_alias_double (__nextafter, nextafter)
#ifdef NO_LONG_DOUBLE
strong_alias (__nextafter, __nexttowardl)
weak_alias (__nexttowardl, nexttowardl)
#undef __nexttoward
strong_alias (__nextafter, __nexttoward)
#undef nexttoward
weak_alias (__nextafter, nexttoward)
#endif
--- EOF ---

[-- Attachment #2: glibc.patch --]
[-- Type: application/octet-stream, Size: 3533 bytes --]

--- -/math/s_nextafter.c
+++ +/math/s_nextafter.c
@@ -1,19 +1,3 @@
-/* @(#)s_nextafter.c 5.1 93/09/24 */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
- */
-
-#if defined(LIBM_SCCS) && !defined(lint)
-static char rcsid[] = "$NetBSD: s_nextafter.c,v 1.8 1995/05/10 20:47:58 jtc Exp $";
-#endif
-
 /* IEEE functions
  *	nextafter(x,y)
  *	return the next machine floating-point number of x in the
@@ -34,55 +18,53 @@
 
 double __nextafter(double x, double y)
 {
-	int32_t hx,hy,ix,iy;
-	uint32_t lx,ly;
+	union {
+	    double d;
+	    int64_t s;
+	    uint64_t u;
+	} xx = {x}, yy = {y};
 
-	EXTRACT_WORDS(hx,lx,x);
-	EXTRACT_WORDS(hy,ly,y);
-	ix = hx&0x7fffffff;		/* |x| */
-	iy = hy&0x7fffffff;		/* |y| */
+	uint64_t ax = xx.u + xx.u;	/* |x|<<1 */
+	uint64_t ay = yy.u + yy.u;	/* |y|<<1 */
 
-	if(((ix>=0x7ff00000)&&((ix-0x7ff00000)|lx)!=0) ||   /* x is nan */
-	   ((iy>=0x7ff00000)&&((iy-0x7ff00000)|ly)!=0))     /* y is nan */
-	   return x+y;
-	if(x==y) return y;		/* x=y, return y */
-	if((ix|lx)==0) {			/* x == 0 */
-	    double u;
-	    INSERT_WORDS(x,hy&0x80000000,1);	/* return +-minsubnormal */
-	    u = math_opt_barrier (x);
-	    u = u*u;
-	    math_force_eval (u);		/* raise underflow flag */
+	if(ax > 0x7ffull << 53)		/* x is nan */
 	    return x;
+	if(ay > 0x7ffull << 53)		/* y is nan */
+	    return y;
+	if(x == y) return y;		/* x=y, return y */
+	if(ax == 0ull) {		/* x == 0 */
+	    xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
+	    xx.d = math_opt_barrier (xx.d);
+	    xx.d += 0.0;
+	    math_force_eval (xx.d);	/* raise underflow flag */
+	    return xx.d;		/* return +-minsubnormal */
 	}
-	if(hx>=0) {				/* x > 0 */
-	    if(hx>hy||((hx==hy)&&(lx>ly))) {	/* x > y, x -= ulp */
-		if(lx==0) hx -= 1;
-		lx -= 1;
-	    } else {				/* x < y, x += ulp */
-		lx += 1;
-		if(lx==0) hx += 1;
-	    }
-	} else {				/* x < 0 */
-	    if(hy>=0||hx>hy||((hx==hy)&&(lx>ly))){/* x < y, x -= ulp */
-		if(lx==0) hx -= 1;
-		lx -= 1;
-	    } else {				/* x > y, x += ulp */
-		lx += 1;
-		if(lx==0) hx += 1;
-	    }
+#ifdef ALTERNATE
+	if ((x < y) == (x < 0.0))
+	    --xx.u;
+	else
+	    ++xx.u;
+#else
+	if(xx.s < 0) {			/* x < 0 */
+	    if(yy.s < 0 && xx.s < yy.s)
+	        ++xx.s;			/* x < 0, x > y, x += ulp */
+	    else
+	        --xx.s;			/* x < 0, x < y, x -= ulp */
+	} else {			/* x > 0 */
+	    if(xx.s > yy.s)
+	        --xx.s;			/* x > 0, x > y, x -= ulp */
+	    else
+	        ++xx.s;			/* x > 0, x < y, x += ulp */
 	}
-	hy = hx&0x7ff00000;
-	if(hy>=0x7ff00000) {
-	  double u = x+x;	/* overflow  */
-	  math_force_eval (u);
+#endif
+	ax = xx.u + xx.u;		/* |x|<<1 */
+	if((ax >= 0x7ffull << 53)	/* overflow  */
+	|| (ax < 0x001ull << 53)) {	/* underflow */
+	  xx.d = math_opt_barrier (xx.d);
+	  xx.d += 0.0;
+	  math_force_eval (xx.d);	/* raise overflow/underflow flag */
 	  __set_errno (ERANGE);
 	}
-	if(hy<0x00100000) {
-	    double u = x*x;			/* underflow */
-	    math_force_eval (u);		/* raise underflow flag */
-	  __set_errno (ERANGE);
-	}
-	INSERT_WORDS(x,hx,lx);
 	return xx.d;
 }
 libm_alias_double (__nextafter, nextafter)


* [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-19 11:20     ` Adhemerval Zanella
  2021-08-19 17:57       ` [PATCH] " Stefan Kanthak
@ 2021-08-20  9:52       ` Stefan Kanthak
  2021-08-20 16:55         ` Joseph Myers
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-20  9:52 UTC (permalink / raw)
  To: Szabolcs Nagy, Adhemerval Zanella; +Cc: libc-help, libc-alpha

[-- Attachment #1: Type: text/plain, Size: 3682 bytes --]

Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> On 18/08/2021 14:11, Stefan Kanthak wrote:
>> Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>> 
>>> The 08/16/2021 18:03, Stefan Kanthak wrote:
>>>> Testing and benchmarking an exponential function that consumes
>>>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>>>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>>>
>>>> The following (quick&dirty) implementation consumes 0.7ns/call,
>>>> i.e. is about an order of magnitude faster:
>>>
>>> correctly measuring latency on a modern x86_64 core:
>>>
>>> musl: 3.16 ns
>>> glibc: 5.68 ns
>>> your: 5.72 ns
> 
> Thanks for bringing this up, if you want to contribute a patch

See attachment.
For completeness' sake and to ease an initial review, I post the full
source here too: without ALTERNATE defined, it uses integer arithmetic
to the extent possible, like the original.
Choose whichever you like best or whichever gives better
throughput/latency.

Changes from the 1st version of the patch:
- leftover FP comparison substituted with integer operations;
- increment/decrement logic simplified/unified, avoiding a
  conditional branch.
I expect the compiler to detect the common clause
 "&& ((xx.s < 0) == (yy.s < 0)))" introduced with these changes.

Stefan

JFTR: .../sysdeps/ieee754/flt-32/s_nextafterf.c as well as
      .../sysdeps/i386/fpu/s_nextafterl.c will need the same changes.

--- .../math/s_nextafter.c ---
/* IEEE functions
 *      nextafter(x,y)
 *      return the next machine floating-point number of x in the
 *      direction toward y.
 *   Special cases:
 */

/* Ugly hack so that the aliasing works.  */
#define __nexttoward __internal___nexttoward
#define nexttoward __internal_nexttoward

#include <errno.h>
#include <math.h>
#include <math-barriers.h>
#include <math_private.h>
#include <float.h>
#include <libm-alias-double.h>

double __nextafter(double x, double y)
{
        union {
            double d;
            int64_t s;
            uint64_t u;
        } xx = {x}, yy = {y};

        uint64_t ax = xx.u + xx.u;      /* |x|<<1 */
        uint64_t ay = yy.u + yy.u;      /* |y|<<1 */

        if(ax > 0x7ffull << 53)         /* x is nan */
            return x;
        if(ay > 0x7ffull << 53)         /* y is nan */
            return y;
#ifdef ALTERNATE
        if(x == y) return y;            /* x=y, return y */
#else
        if((ax == ay)
        && ((xx.s < 0) == (yy.s < 0)))
            return y;
#endif
        if(ax == 0) {                   /* x == 0 */
            xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
            xx.d = math_opt_barrier (xx.d);
            xx.d += 0.0;
            math_force_eval (xx.d);     /* raise underflow flag */
            return xx.d;                /* return +-minsubnormal */
        }
#ifdef ALTERNATE
        if((x < y) == (x < 0.0))
            --xx.u;
        else
            ++xx.u;
#else
        if((ax < ay)
        && ((xx.s < 0) == (yy.s < 0)))
            ++xx.u;
        else
            --xx.u;
#endif
        ax = xx.u + xx.u;               /* |x|<<1 */
        if((ax >= 0x7ffull << 53)       /* overflow  */
        || (ax < 0x001ull << 53)) {     /* underflow */
            xx.d = math_opt_barrier (xx.d);
            xx.d += 0.0;
            math_force_eval (xx.d);     /* raise overflow/underflow flag */
            __set_errno (ERANGE);
        }
        return xx.d;
}
libm_alias_double (__nextafter, nextafter)
#ifdef NO_LONG_DOUBLE
strong_alias (__nextafter, __nexttowardl)
weak_alias (__nexttowardl, nexttowardl)
#undef __nexttoward
strong_alias (__nextafter, __nexttoward)
#undef nexttoward
weak_alias (__nextafter, nexttoward)
#endif
--- EOF ---

[-- Attachment #2: glibc.patch --]
[-- Type: application/octet-stream, Size: 3447 bytes --]

--- -/math/s_nextafter.c
+++ +/math/s_nextafter.c
@@ -1,19 +1,3 @@
-/* @(#)s_nextafter.c 5.1 93/09/24 */
-/*
- * ====================================================
- * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
- *
- * Developed at SunPro, a Sun Microsystems, Inc. business.
- * Permission to use, copy, modify, and distribute this
- * software is freely granted, provided that this notice
- * is preserved.
- * ====================================================
- */
-
-#if defined(LIBM_SCCS) && !defined(lint)
-static char rcsid[] = "$NetBSD: s_nextafter.c,v 1.8 1995/05/10 20:47:58 jtc Exp $";
-#endif
-
 /* IEEE functions
  *	nextafter(x,y)
  *	return the next machine floating-point number of x in the
@@ -34,56 +18,55 @@
 
 double __nextafter(double x, double y)
 {
-	int32_t hx,hy,ix,iy;
-	uint32_t lx,ly;
+	union {
+	    double d;
+	    int64_t s;
+	    uint64_t u;
+	} xx = {x}, yy = {y};
 
-	EXTRACT_WORDS(hx,lx,x);
-	EXTRACT_WORDS(hy,ly,y);
-	ix = hx&0x7fffffff;		/* |x| */
-	iy = hy&0x7fffffff;		/* |y| */
+	uint64_t ax = xx.u + xx.u;	/* |x|<<1 */
+	uint64_t ay = yy.u + yy.u;	/* |y|<<1 */
 
-	if(((ix>=0x7ff00000)&&((ix-0x7ff00000)|lx)!=0) ||   /* x is nan */
-	   ((iy>=0x7ff00000)&&((iy-0x7ff00000)|ly)!=0))     /* y is nan */
-	   return x+y;
-	if(x==y) return y;		/* x=y, return y */
-	if((ix|lx)==0) {			/* x == 0 */
-	    double u;
-	    INSERT_WORDS(x,hy&0x80000000,1);	/* return +-minsubnormal */
-	    u = math_opt_barrier (x);
-	    u = u*u;
-	    math_force_eval (u);		/* raise underflow flag */
+	if(ax > 0x7ffull << 53)		/* x is nan */
 	    return x;
+	if(ay > 0x7ffull << 53)		/* y is nan */
+	    return y;
+#ifdef ALTERNATE
+	if(x == y) return y;		/* x=y, return y */
+#else
+	if((ax == ay)
+	&& ((xx.s < 0) == (yy.s < 0)))
+	    return y;
+#endif
+	if(ax == 0) {			/* x == 0 */
+	    xx.u = yy.s < 0 ? 0x8000000000000001ull : 1ull;
+	    xx.d = math_opt_barrier (xx.d);
+	    xx.d += 0.0;
+	    math_force_eval (xx.d);	/* raise underflow flag */
+	    return xx.d;		/* return +-minsubnormal */
 	}
-	if(hx>=0) {				/* x > 0 */
-	    if(hx>hy||((hx==hy)&&(lx>ly))) {	/* x > y, x -= ulp */
-		if(lx==0) hx -= 1;
-		lx -= 1;
-	    } else {				/* x < y, x += ulp */
-		lx += 1;
-		if(lx==0) hx += 1;
-	    }
-	} else {				/* x < 0 */
-	    if(hy>=0||hx>hy||((hx==hy)&&(lx>ly))){/* x < y, x -= ulp */
-		if(lx==0) hx -= 1;
-		lx -= 1;
-	    } else {				/* x > y, x += ulp */
-		lx += 1;
-		if(lx==0) hx += 1;
-	    }
+#ifdef ALTERNATE
+	if((x < y) == (x < 0.0))
+	    --xx.u;
+	else
+	    ++xx.u;
 	}
-	hy = hx&0x7ff00000;
-	if(hy>=0x7ff00000) {
-	  double u = x+x;	/* overflow  */
-	  math_force_eval (u);
-	  __set_errno (ERANGE);
+#else
+	if((ax < ay)
+	&& ((xx.s < 0) == (yy.s < 0)))
+	    ++xx.u;
+	else
+	    --xx.u;
+#endif
+	ax = xx.u + xx.u;		/* |x|<<1 */
+	if((ax >= 0x7ffull << 53)	/* overflow  */
+	|| (ax < 0x001ull << 53)) {	/* underflow */
+	  xx.d = math_opt_barrier (xx.d);
+	  xx.d += 0.0;
+	  math_force_eval (xx.d);	/* raise overflow/underflow flag */
+	  __set_errno (ERANGE);
 	}
-	if(hy<0x00100000) {
-	    double u = x*x;			/* underflow */
-	    math_force_eval (u);		/* raise underflow flag */
-	  __set_errno (ERANGE);
-	}
-	INSERT_WORDS(x,hx,lx);
-	return x;
+ 	return xx.d;
 }
 libm_alias_double (__nextafter, nextafter)
 #ifdef NO_LONG_DOUBLE


* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-20  9:52       ` [PATCH/2nd version] " Stefan Kanthak
@ 2021-08-20 16:55         ` Joseph Myers
  2021-08-20 20:19           ` Stefan Kanthak
  0 siblings, 1 reply; 10+ messages in thread
From: Joseph Myers @ 2021-08-20 16:55 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, Adhemerval Zanella, libc-help, libc-alpha

On Fri, 20 Aug 2021, Stefan Kanthak wrote:

>         if(ax > 0x7ffull << 53)         /* x is nan */
>             return x;
>         if(ay > 0x7ffull << 53)         /* y is nan */
>             return y;

How has this been tested?  I'd have expected it to fail the nextafter 
tests in the glibc testsuite (libm-test-nextafter.inc), because they 
verify that sNaN arguments produce a qNaN result with the "invalid" 
exception raised, and this looks like it would just return an sNaN 
argument unchanged.
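
A reduced standalone version of that check looks roughly like this
(a sketch only, using GCC's __builtin_nans() to construct an sNaN;
the authoritative tests are the ones in libm-test-nextafter.inc):

    #define _GNU_SOURCE                     /* for issignaling() in glibc */
    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    /* build: gcc -O0 test-snan.c -lm; both values must print as 1 */
    int main (void)
    {
        volatile double snan = __builtin_nans ("");
        feclearexcept (FE_ALL_EXCEPT);
        volatile double r = nextafter (snan, 0.0);
        printf ("quiet NaN result: %d, FE_INVALID raised: %d\n",
                isnan (r) && !issignaling (r),
                fetestexcept (FE_INVALID) != 0);
        return 0;
    }

With the patch as posted, the sNaN comes back unchanged and no
exception is raised, which is exactly what the testsuite rejects.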

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-20 16:55         ` Joseph Myers
@ 2021-08-20 20:19           ` Stefan Kanthak
  2021-08-20 21:03             ` Joseph Myers
  2021-08-23 12:50             ` Adhemerval Zanella
  0 siblings, 2 replies; 10+ messages in thread
From: Stefan Kanthak @ 2021-08-20 20:19 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Szabolcs Nagy, Adhemerval Zanella, libc-help, libc-alpha

"Joseph Myers" <joseph@codesourcery.com> wrote:

> On Fri, 20 Aug 2021, Stefan Kanthak wrote:
> 
>>         if(ax > 0x7ffull << 53)         /* x is nan */
>>             return x;
>>         if(ay > 0x7ffull << 53)         /* y is nan */
>>             return y;
> 
> How has this been tested?  I'd have expected it to fail the nextafter 
> tests in the glibc testsuite (libm-test-nextafter.inc), because they 
> verify that sNaN arguments produce a qNaN result with the "invalid" 
> exception raised, and this looks like it would just return an sNaN 
> argument unchanged.

It doesn't just look like it would, it REALLY does!
As I explicitly wrote, my changes avoid FP operations if possible.

<https://pubs.opengroup.org/onlinepubs/9699919799/functions/nextafter.html>

| If x or y is NaN, a NaN shall be returned.

I chose to return the NaN argument, which both POSIX and ISO C allow.
If my memory serves me well, one of the former editions even stated
"If an argument is NaN, this argument shall be returned". This may,
however, be influenced/spoiled by
<https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/nextafter-functions>

| If either x or y is a NAN, then the return value is one of the input NANs.


If that bothers you, change these 4 lines to either

         if(ax > 0x7ffull << 53)         /* x is nan */
             return 0.0 + x;
         if(ay > 0x7ffull << 53)         /* y is nan */
             return 0.0 + y;

(the addition eventually has to be guarded against optimisation) or

         if(ax > 0x7ffull << 53          /* x is nan */
         || ay > 0x7ffull << 53)         /* y is nan */
             return x + y;

whatever you like best.

Stefan


* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-20 20:19           ` Stefan Kanthak
@ 2021-08-20 21:03             ` Joseph Myers
  2021-08-23 12:50             ` Adhemerval Zanella
  1 sibling, 0 replies; 10+ messages in thread
From: Joseph Myers @ 2021-08-20 21:03 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, libc-help, libc-alpha

On Fri, 20 Aug 2021, Stefan Kanthak wrote:

> I choose to return the argument NaN, what both POSIX and ISO C allow.

glibc follows C23 Annex F FE_SNANS_ALWAYS_SIGNAL rules for signaling NaN 
arguments.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH/2nd version] Re: nextafter() about an order of magnitude slower than trivial implementation
  2021-08-20 20:19           ` Stefan Kanthak
  2021-08-20 21:03             ` Joseph Myers
@ 2021-08-23 12:50             ` Adhemerval Zanella
  1 sibling, 0 replies; 10+ messages in thread
From: Adhemerval Zanella @ 2021-08-23 12:50 UTC (permalink / raw)
  To: Stefan Kanthak, Joseph Myers; +Cc: Szabolcs Nagy, libc-help, libc-alpha



On 20/08/2021 17:19, Stefan Kanthak wrote:
> "Joseph Myers" <joseph@codesourcery.com> wrote:
> 
>> On Fri, 20 Aug 2021, Stefan Kanthak wrote:
>>
>>>         if(ax > 0x7ffull << 53)         /* x is nan */
>>>             return x;
>>>         if(ay > 0x7ffull << 53)         /* y is nan */
>>>             return y;
>>
>> How has this been tested?  I'd have expected it to fail the nextafter 
>> tests in the glibc testsuite (libm-test-nextafter.inc), because they 
>> verify that sNaN arguments produce a qNaN result with the "invalid" 
>> exception raised, and this looks like it would just return an sNaN 
>> argument unchanged.
> 
> It doesn't just look like it would, it REALLY does!
> As I explicitly wrote, my changes avoid FP operations if possible.
> 
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/nextafter.html>
> 
> | If x or y is NaN, a NaN shall be returned.
> 
> I chose to return the NaN argument, which both POSIX and ISO C allow.
> If my memory serves me well, one of the former editions even stated
> "If an argument is NaN, this argument shall be returned". This may,
> however, be influenced/spoiled by
> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/nextafter-functions>
> 
> | If either x or y is a NAN, then the return value is one of the input NANs.
> 
> 
> If that bothers you, change these 4 lines to either
> 
>          if(ax > 0x7ffull << 53)         /* x is nan */
>              return 0.0 + x;
>          if(ay > 0x7ffull << 53)         /* y is nan */
>              return 0.0 + y;
> 
> (the addition eventually has to be guarded against optimisation) or
> 
>          if(ax > 0x7ffull << 53          /* x is nan */
>          || ay > 0x7ffull << 53)         /* y is nan */
>              return x + y;

Sorry, but sending half-baked patches where someone else will need to
evaluate and decide which is the best strategy is not the way to get
this change accepted.

Please do follow the contributor checklist [1] and the suggestions I
gave you when you asked for input on libc-help:

  * Check for regressions by testing your patch against the glibc
    testsuite (make check).  As Joseph has pointed out, your patch
    triggers a lot of regressions:

    $ grep ^FAIL math/subdir-tests.sum
    FAIL: math/bug-nextafter
    FAIL: math/test-double-nextafter
    FAIL: math/test-float32x-nextafter
    FAIL: math/test-float64-nextafter
    FAIL: math/test-misc

  * You removed the copyright since it is a new implementation, but
    any code still requires a copyright notice (even if it is not
    assigned to the FSF).  Check item 3.4 of the Contributor checklist.

  * Also, since it is a new implementation, follow the usual style and
    coding guidelines instead of the style of the current code.

  * It would be good to add a benchmark to evaluate this change if
    the motivation for your change is performance.  Follow what the
    other math benchmarks do in benchtests (exp2 or exp10, for
    instance).

  * Sending a patch with a define switch and stating "Choose whatever
    you like best" does not help and puts the burden of evaluating
    your change on the maintainers.
 
[1] https://sourceware.org/glibc/wiki/Contribution%20checklist


end of thread

Thread overview: 10+ messages
2021-08-16 16:03 nextafter() about an order of magnitude slower than trivial implementation Stefan Kanthak
2021-08-18 12:51 ` Szabolcs Nagy
2021-08-18 17:11   ` Stefan Kanthak
2021-08-19 11:20     ` Adhemerval Zanella
2021-08-19 17:57       ` [PATCH] " Stefan Kanthak
2021-08-20  9:52       ` [PATCH/2nd version] " Stefan Kanthak
2021-08-20 16:55         ` Joseph Myers
2021-08-20 20:19           ` Stefan Kanthak
2021-08-20 21:03             ` Joseph Myers
2021-08-23 12:50             ` Adhemerval Zanella
