From: "mkretz at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/77776] C++17 std::hypot implementation is poor
Date: Mon, 04 Mar 2024 17:14:38 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libstdc++
X-Bugzilla-Version: 7.0
X-Bugzilla-Severity: normal
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: emsr at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #17 from Matthias Kretz (Vir) ---
hypotf(a, b) is implemented using double precision and hypot(a, b) uses 80-bit
long double on i386 and x86_64 hypot does what you describe, right?

std::experimental::simd benchmarks of hypot(a, b), where simd_abi::scalar uses
the implementation (i.e.
glibc):

-march=skylake-avx512 -ffast-math -O3 -lmvec:

TYPE                                     Latency       Speedup    Throughput       Speedup
                                   [cycles/call]   [per value] [cycles/call]   [per value]
float, simd_abi::scalar                     37.5             1          11.5             1
float,                                      37.6         0.999          10.2          1.13
float, simd_abi::__sse                        34          4.42          6.46          7.15
float, simd_abi::__avx                      34.1          8.79          6.56          14.1
float, simd_abi::_Avx512<32>                34.3          8.76          6.01          15.4
float, simd_abi::_Avx512<64>                44.1          13.6            12          15.4
float, [[gnu::vector_size(16)]]             58.3          2.57          47.5         0.974
float, [[gnu::vector_size(32)]]              132          2.27           104         0.892
float, [[gnu::vector_size(64)]]              240           2.5           222         0.832
------------------------------------------------------------------------------------------
TYPE                                     Latency       Speedup    Throughput       Speedup
                                   [cycles/call]   [per value] [cycles/call]   [per value]
double, simd_abi::scalar                      81             1          21.5             1
double,                                     80.1          1.01          21.3          1.01
double, simd_abi::__sse                     39.9          4.06          6.47          6.64
double, simd_abi::__avx                     40.2          8.05            12          7.14
double, simd_abi::_Avx512<32>               40.3          8.04            12          7.14
double, simd_abi::_Avx512<64>               56.2          11.5            24          7.14
double, [[gnu::vector_size(16)]]            89.3          1.81          42.5          1.01
double, [[gnu::vector_size(32)]]             150          2.16           110         0.777
double, [[gnu::vector_size(64)]]             297          2.18           242          0.71
------------------------------------------------------------------------------------------

-march=skylake-avx512 -O3 -lmvec:

TYPE                                     Latency       Speedup    Throughput       Speedup
                                   [cycles/call]   [per value] [cycles/call]   [per value]
float, simd_abi::scalar                     37.6             1          10.4             1
float,                                      37.7         0.998          10.2          1.02
float, simd_abi::__sse                      37.6             4          8.83          4.71
float, simd_abi::__avx                      37.5          8.01          9.42          8.82
float, simd_abi::_Avx512<64>                47.8          12.6            12          13.8
float, [[gnu::vector_size(16)]]             98.7          1.52          57.2         0.727
float, [[gnu::vector_size(32)]]              151             2           114         0.728
float, [[gnu::vector_size(64)]]              260          2.31           230         0.722
------------------------------------------------------------------------------------------
TYPE                                     Latency       Speedup    Throughput       Speedup
                                   [cycles/call]   [per value] [cycles/call]   [per value]
double, simd_abi::scalar                    79.7             1          21.7             1
double,                                     80.1         0.995          21.6             1
double, simd_abi::__sse                     44.2           3.6          9.99          4.33
double, simd_abi::__avx                     43.6          7.32            12          7.21
double, simd_abi::_Avx512<64>               59.9          10.6            24          7.21
double, [[gnu::vector_size(16)]]            88.3           1.8          44.2          0.98
double, [[gnu::vector_size(32)]]             163          1.96           115          0.75
double, [[gnu::vector_size(64)]]             302          2.11           233         0.742
------------------------------------------------------------------------------------------

I have never ported my SIMD implementation back to scalar and benchmarked it
against glibc.
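
For readers unfamiliar with the "hypotf is implemented using double precision" remark above, here is a minimal sketch of that general idea. It is not glibc's actual code, and it is not the SIMD implementation that was benchmarked. The names hypotf_via_double and hypotf_naive are hypothetical and exist only for illustration. The sketch assumes IEEE 754 binary32 float and binary64 double, where the square of any finite float fits in a double without overflow.

```cpp
#include <cmath>

// Hypothetical helper, not glibc's code: widen both inputs to double so that
// the intermediate x*x + y*y cannot overflow for finite float inputs
// (assumes IEEE 754 binary32 float and binary64 double).
float hypotf_via_double(float a, float b)
{
    const double x = a;
    const double y = b;
    // The final narrowing can still yield infinity if the true result
    // exceeds the float range.
    return static_cast<float>(std::sqrt(x * x + y * y));
}

// Naive single-precision variant for comparison: a*a or b*b can overflow to
// infinity (or underflow to zero) even when the true result is representable.
float hypotf_naive(float a, float b)
{
    return std::sqrt(a * a + b * b);
}
```

Calling both functions with large inputs such as 3e30f and 4e30f is one way to see the difference: the widened version stays finite, whereas the naive version may not, depending on how the platform evaluates float expressions.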