From mboxrd@z Thu Jan  1 00:00:00 1970
From: "mkretz at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/77776] C++17 std::hypot implementation is poor
Date: Wed, 06 Mar 2024 09:43:24 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #20 from Matthias Kretz (Vir) ---
Thanks, I'd be very happy if such a relatively clear implementation could make it!

> branchfree code is always better.

Don't say it like that. Smart branching, making use of how static branch prediction works, can speed up code significantly. You don't want to compute everything when 99.9% of the inputs need only a fraction of the work.
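To illustrate the point, here is a minimal sketch of such a branchy three-argument hypot (this is not the patch under discussion; the thresholds and the function name hypot3 are made up for illustration). The common case takes the cheap branch-free path; only inputs at risk of over-/underflow pay for scaling:

```cpp
#include <cmath>

// Hypothetical sketch: compute sqrt(x*x + y*y + z*z) directly on the
// fast path, and fall back to a scaled computation only when the naive
// sum of squares could overflow or lose precision to underflow.
double hypot3(double x, double y, double z)
{
    x = std::fabs(x);
    y = std::fabs(y);
    z = std::fabs(z);
    const double hi = std::fmax(std::fmax(x, y), z);
    // Illustrative thresholds; real code would derive them from
    // std::numeric_limits to guarantee no over-/underflow.
    // __builtin_expect hints static branch prediction toward the
    // common case (GCC extension).
    if (__builtin_expect(hi > 1e-150 && hi < 1e150, 1))
        return std::sqrt(x * x + y * y + z * z);  // common case
    if (hi == 0.0)
        return 0.0;
    // Rare path: scale by the largest magnitude, then undo the scaling.
    const double s = 1.0 / hi;
    const double xs = x * s, ys = y * s, zs = z * s;
    return hi * std::sqrt(xs * xs + ys * ys + zs * zs);
}
```

With well-predicted branches, the scaling code is effectively free for the 99.9% of inputs that never take it.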
TYPE                      Latency        Speedup      Throughput     Speedup
                          [cycles/call]  [per value]  [cycles/call]  [per value]
float, simd_abi::scalar   48.1           1            17             1
float, std::hypot         43.3           1.11         12.3           1.39
float, hypot3_scale       31.7           1.52         22.3           0.764
float, hypot3_exp         83.9           0.574        84.5           0.201
--------------------------------------------------------------------------------
TYPE                      Latency        Speedup      Throughput     Speedup
                          [cycles/call]  [per value]  [cycles/call]  [per value]
double, simd_abi::scalar  54.7           1            15             1
double, std::hypot        53.8           1.02         19             0.79
double, hypot3_scale      44             1.24         24             0.625
double, hypot3_exp        91.3           0.599        91             0.165

and with -ffast-math:

TYPE                      Latency        Speedup      Throughput     Speedup
                          [cycles/call]  [per value]  [cycles/call]  [per value]
float, simd_abi::scalar   48.9           1            9.15           1
float, std::hypot         53.2           0.918        8.31           1.1
float, hypot3_scale       31.3           1.56         14             0.652
float, hypot3_exp         55.9           0.874        21.5           0.425
--------------------------------------------------------------------------------
TYPE                      Latency        Speedup      Throughput     Speedup
                          [cycles/call]  [per value]  [cycles/call]  [per value]
double, simd_abi::scalar  54.8           1            9.07           1
double, std::hypot        61.5           0.891        11.3           0.805
double, hypot3_scale      40.8           1.34         12.1           0.753
double, hypot3_exp        64.2           0.853        23.3           0.39

I have not tested correctness or precision yet. Also, the benchmark only uses inputs that do not require anything else than √x²+y²+z² (which, I believe, should be the common input and thus optimized for).