From: "Stefan Kanthak"
To: "Szabolcs Nagy"
Cc: libc-help@sourceware.org
Subject: Re: nextafter() about an order of magnitude slower than trivial implementation
Date: Wed, 18 Aug 2021 19:11:02 +0200
In-Reply-To: <20210818125119.GH25257@arm.com>

Szabolcs Nagy wrote:

> The 08/16/2021 18:03, Stefan Kanthak wrote:
>> Testing and benchmarking an exponential function that consumes
>> about 10ns/call on an AMD EPYC 7262, I noticed that nextafter()
>> itself is DEAD SLOW: it consumes about 8.5ns/call!
>>
>> The following (quick&dirty) implementation consumes 0.7ns/call,
>> i.e. is about an order of magnitude faster:
>
> correctly measuring latency on a modern x86_64 core:
>
> musl:  3.16 ns
> glibc: 5.68 ns
> your:  5.72 ns

My benchmark measures (reciprocal) throughput, NOT latency. As long as
a routine like nextafter() is NOT called back-to-back, but alongside
other code, latency doesn't really matter.

> i.e. your code is slower.

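(To make the throughput/latency distinction concrete, here is a minimal
sketch; the two loops below are only an illustration, not the code of
either benchmark:

#include <math.h>
#include <stddef.h>

/* latency: every call depends on the result of the previous one,
   so call i+1 cannot start before call i has retired */
double chained(double x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x = nextafter(x, 1.0 / 0.0);
    return x;
}

/* (reciprocal) throughput: the inputs are independent, so an
   out-of-order core can overlap many calls in flight */
void streamed(const double *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = nextafter(src[i], 1.0 / 0.0);
}

The first loop corresponds to a latency measurement like the one quoted
above, the second to a throughput measurement like bench.c below.)
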
The overall execution time and the numbers from 'perf stat' show the
contrary: the program executes overall

- 70   billion instructions in 38.9 billion cycles with glibc's implementation,
- 38   billion instructions in 12.4 billion cycles with my (quick&dirty) C implementation,
- 37.5 billion instructions in 12   billion cycles with my assembly implementation,
- 39.8 billion instructions in 22   billion cycles with musl's original implementation,
- 40.5 billion instructions in 11.7 billion cycles with my patch applied to musl's implementation.

The OVERALL numbers of cycles per call are

38.9 (glibc) > 22 (musl) > 12.4 (C) > 12 (Assembly) > 11.7 (musl patched)

Subtract the (constant but unknown, not measured) number of cycles for
the program itself: I estimate the overhead to be about 4..6 billion
cycles. glibc's implementation then consumes about 2.5x the number of
cycles of musl's implementation, and about 5x the number of cycles of
a proper implementation.

> (this is the correctly predicted hot path latency
> measuring for(i=0;i
>
> it is more difficult to measure the case when latency
> is not the bottleneck, since then it depends on what
> code is being executed along nextafter. your code
> likely wins in that case as it has less instructions.

Correct. And that's what counts in practice, as shown by 'perf stat'
and the number of consumed cycles!

> what you see in your benchmark is the effect of
> branch mispredicts in the nextafter(x, INFINITY)
> case: your code avoids branching on the sign of x,
> while glibc branches and misses half the time with
> random x inputs. (it maybe possible to fix glibc to
> avoid that branch and then it would be fast in your
> measurement too on a big ooo core)

It's not just mispredicted branches: the program executes

- 16.5 billion branches (0.5 B(!)illion mispredicted) with glibc's implementation,
-  7.5 billion branches (0.5 M(!)illion mispredicted) with my C implementation,
-  8   billion branches (0.5 M(!)illion mispredicted) with my assembly implementation,
-  9.7 billion branches (0.25 B(!)illion mispredicted) with musl's original implementation,
-  8.5 billion branches (0.5 M(!)illion mispredicted) with my patch applied to musl's implementation.

2 billion branches are executed for the loops and should be subtracted
to get the net numbers per call of nextafter():

14.5 (glibc) > 9.7 (musl) > 8.5 (musl patched) > 8 (assembly) > 7.5 (C)

glibc's implementation executes almost twice the number of branches of
the other implementations. The mispredicted branches:

0.5 (glibc) > 0.25 (musl) > 0.0001 (the proper implementations)

> (your code also does not handle fenv correctly and
> has aliasing violations, but these can be fixed)

I explicitly wrote that it is a QUICK&DIRTY (C) implementation.
The assembly code is correct in both cases.

Stefan

> --- after.c ---
> double nextafter(double from, double to)
> {
>     if (from == to)
>         return to;
>
>     if (to != to)
>         return to;
>
>     if (from != from)
>         return from;
>
>     if (from == 0.0)
>         return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;
>
>     unsigned long long ull = *(unsigned long long *) &from;
>
>     if ((from < to) == (from < 0.0))
>         ull--;
>     else
>         ull++;
>
>     return 0.0 + *(double *) &ull;
> }
> --- EOF ---
>
> JFTR: the code generated by GCC for this function is FAR from
>       optimal; the assembly implementation shown below is more
>       than 2x faster and consumes <0.3ns/call, i.e. is about
>       30x faster than glibc's nextafter()!

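JFTR: the aliasing violation mentioned above can be removed without
changing the logic, e.g. by doing the type pun with memcpy(). This is
only a sketch, NOT the code that was benchmarked, and the fenv
behaviour mentioned above is still not addressed:

#include <stdint.h>
#include <string.h>

double nextafter(double from, double to)
{
    if (from == to)
        return to;

    if (to != to)                       /* to is a NAN? */
        return to;

    if (from != from)                   /* from is a NAN? */
        return from;

    if (from == 0.0)
        return to < 0.0 ? -0x1.0p-1074 : 0x1.0p-1074;

    uint64_t bits;
    memcpy(&bits, &from, sizeof bits);  /* well-defined type pun */

    if ((from < to) == (from < 0.0))
        bits--;
    else
        bits++;

    memcpy(&from, &bits, sizeof from);
    return 0.0 + from;
}

GCC and clang typically compile an 8-byte memcpy() like this into a
plain register move, so it should not cost extra instructions.
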
>
> The data from 'perf stat' show that glibc's nextafter() executes
> about 30 instructions more than the above trivial implementation,
> and that it is NOT well suited for modern super-scalar processors.
>
> Stefan
>
> PS: test program and logs
>
> [stefan@rome ~]$ cat bench.c
> // Copyright (C) 2005-2021, Stefan Kanthak
>
> #include <math.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <time.h>
>
> __attribute__((noinline))
> double nop(double foo, double bar)
> {
>     return foo + bar;
> }
>
> inline static
> double lfsr64(void)
> {
>     // 64-bit linear feedback shift register (Galois form) using
>     // primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
>     // initialised with 2**64 / golden ratio
>
>     static uint64_t lfsr = 0x9E3779B97F4A7C15;
>     const uint64_t sign = (int64_t) lfsr >> 63;
>
>     lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
>
>     return *(double *) &lfsr;
> }
>
> inline static
> double random64(void)
> {
>     static uint64_t seed = 0x0123456789ABCDEF;
>
>     seed = seed * 6364136223846793005 + 1442695040888963407;
>
>     return *(double *) &seed;
> }
>
> int main(void)
> {
>     clock_t t0, t1, t2, tt;
>     uint32_t n;
>     volatile double result;
>
>     t0 = clock();
>
>     for (n = 500000000u; n > 0u; n--) {
>         result = nop(lfsr64(), 0.0);
>         result = nop(random64(), 1.0 / 0.0);
>     }
>
>     t1 = clock();
>
>     for (n = 500000000u; n > 0u; n--) {
>         result = nextafter(lfsr64(), 0.0);
>         result = nextafter(random64(), 1.0 / 0.0);
>     }
>
>     t2 = clock();
>     tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;
>
>     printf("\n"
>            "nop()       %4lu.%06lu          0\n"
>            "nextafter() %4lu.%06lu %4lu.%06lu\n"
>            "            %4lu.%06lu nano-seconds\n",
>            t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
> }
> [stefan@rome ~]$ gcc -O3 -lm bench.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()          1.480000          0
> nextafter()   10.060000   8.580000
>               11.540000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>          11,548.78 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                145      page-faults:u             #    0.013 K/sec
>     38,917,213,536      cycles:u                  #    3.370 GHz                      (83.33%)
>     ~~~~~~~~~~~~~~
>     15,647,656,615      stalled-cycles-frontend:u #   40.21% frontend cycles idle     (83.33%)
>     10,746,044,422      stalled-cycles-backend:u  #   27.61% backend cycles idle      (83.33%)
>     69,739,403,870      instructions:u            #    1.79  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.22  stalled cycles per insn  (83.33%)
>     16,495,748,110      branches:u                # 1428.354 M/sec                    (83.33%)
>        500,701,246      branch-misses:u           #    3.04% of all branches          (83.33%)
>
>       11.550414029 seconds time elapsed
>
>       11.548265000 seconds user
>        0.000999000 seconds sys
>
>
> [stefan@rome ~]$ gcc -O3 bench.c after.c
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()          1.490000          0
> nextafter()    2.210000   0.720000
>                3.700000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>           3,702.89 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                122      page-faults:u             #    0.033 K/sec
>     12,407,345,183      cycles:u                  #    3.351 GHz                      (83.32%)
>     ~~~~~~~~~~~~~~
>            135,817      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      5,498,895,906      stalled-cycles-backend:u  #   44.32% backend cycles idle      (83.34%)
>     38,002,430,460      instructions:u            #    3.06  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.14  stalled cycles per insn  (83.34%)
>      7,497,381,393      branches:u                # 2024.735 M/sec                    (83.34%)
>            497,462      branch-misses:u           #    0.01% of all branches          (83.32%)
>
>        3.703648875 seconds time elapsed
>
>        3.703294000 seconds user
>        0.000000000 seconds sys
>
>
> [stefan@rome ~]$ cat after.s
> # Copyright © 2004-2021, Stefan Kanthak
>
>         .arch           generic64
>         .code64
>         .intel_syntax   noprefix
>         .text
>                                         # xmm0 = from
>                                         # xmm1 = to
> nextafter:
>         xorpd   xmm2, xmm2              # xmm2 = 0.0
>         ucomisd xmm1, xmm2              # CF = (to < 0.0)
>         jp      .Lto                    # to = NAN?
>         sbb     rax, rax                # rax = (to < 0.0) ? -1 : 0
>         ucomisd xmm0, xmm1              # CF = (from < to)
>         jp      .Lfrom                  # from = NAN?
>         je      .Lto                    # from = to?
> .Lnotequal:
>         sbb     rcx, rcx                # rcx = (from < to) ? -1 : 0
>         ucomisd xmm0, xmm2              # CF = (from < 0.0)
>         jz      .Lzero                  # from = 0.0?
> .Lnext:
>         movq    rdx, xmm0               # rdx = from
>         sbb     rax, rax                # rax = (from < 0.0) ? -1 : 0
>         xor     rax, rcx                # rax = (from < 0.0) = (from < to) ? 0 : -1
>         or      rax, 1                  # rax = (from < 0.0) = (from < to) ? 1 : -1
>         sub     rdx, rax
>         movq    xmm0, rdx
>         addsd   xmm0, xmm2
>         ret
> .Lzero:
>         shl     rax, 63                 # rax = (to < -0.0) ? 0x8000000000000000 : 0
>         or      rax, 1                  # rax = (to < -0.0) ? 0x8000000000000001 : 1
>         movq    xmm0, rax               # xmm0 = (to < -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
>         addsd   xmm0, xmm2
>         ret
> .Lto:
>         movsd   xmm0, xmm1              # xmm0 = to
> .Lfrom:
>         ret
>
>         .size   nextafter, .-nextafter
>         .type   nextafter, @function
>         .global nextafter
>         .end
> [stefan@rome ~]$ perf stat ./a.out
>
> nop()          1.630000          0
> nextafter()    1.910000   0.280000
>                3.540000 nano-seconds
>
>  Performance counter stats for './a.out':
>
>           3,547.12 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                123      page-faults:u             #    0.035 K/sec
>     11,949,840,797      cycles:u                  #    3.369 GHz                      (83.32%)
>     ~~~~~~~~~~~~~~
>            129,627      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      3,998,960,716      stalled-cycles-backend:u  #   33.46% backend cycles idle      (83.34%)
>     37,493,523,285      instructions:u            #    3.14  insn per cycle
>     ~~~~~~~~~~~~~~                                     ~~~~
>                                                   #    0.11  stalled cycles per insn  (83.34%)
>      7,998,559,192      branches:u                # 2254.945 M/sec                    (83.34%)
>            497,565      branch-misses:u           #    0.01% of all branches          (83.32%)
>
>        3.547999008 seconds time elapsed
>
>        3.546671000 seconds user
>        0.000999000 seconds sys
>
>
> [stefan@rome ~]$
> --
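
JFTR: each timed loop in bench.c performs 2 calls per iteration and
500,000,000 iterations, i.e. 10^9 nextafter() calls, so the totals
printed above translate into per-call times directly:

    glibc:    10.06 s / 10^9 calls = 10.06 ns/call  (8.58 ns/call net of the nop()+PRNG baseline)
    after.c:   2.21 s / 10^9 calls =  2.21 ns/call  (0.72 ns/call net)
    after.s:   1.91 s / 10^9 calls =  1.91 ns/call  (0.28 ns/call net)

Dividing the 'perf stat' cycle counts by the same 10^9 calls gives the
overall cycles per call quoted at the top of this mail.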