[Bug libstdc++/77776] C++17 std::hypot implementation is poor

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
@ 2021-05-04 12:31 ` rguenth at gcc dot gnu.org
  2024-02-29 18:23 ` g.peterhoff@t-online.de
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-05-04 12:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
  2021-05-04 12:31 ` [Bug libstdc++/77776] C++17 std::hypot implementation is poor rguenth at gcc dot gnu.org
@ 2024-02-29 18:23 ` g.peterhoff@t-online.de
  2024-03-01 12:27 ` redi at gcc dot gnu.org
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: g.peterhoff@t-online.de @ 2024-02-29 18:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

g.peterhoff@t-online.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |g.peterhoff@t-online.de

--- Comment #11 from g.peterhoff@t-online.de ---
Would this be a good implementation for hypot3 in cmath?

#define GCC_UNLIKELY(x) __builtin_expect(x, 0)
#define GCC_LIKELY(x) __builtin_expect(x, 1)

namespace __detail
{
        template <typename _Tp>
        inline _GLIBCXX_CONSTEXPR typename
enable_if<is_floating_point<_Tp>::value, bool>::type __isinf3(const _Tp __x,
const _Tp __y, const _Tp __z)   noexcept
        {
                return bool(int(std::isinf(__x)) | int(std::isinf(__y)) |
int(std::isinf(__z)));
        }

        template <typename _Tp>
        inline _GLIBCXX_CONSTEXPR typename
enable_if<is_floating_point<_Tp>::value, _Tp>::type  __hypot3(_Tp __x, _Tp __y,
_Tp __z)     noexcept
        {
                __x = std::fabs(__x);
                __y = std::fabs(__y);
                __z = std::fabs(__z);

                const _Tp
                        __max = std::fmax(std::fmax(__x, __y), __z);

                if (GCC_UNLIKELY(__max == _Tp{}))
                {
                        return __max;
                }
                else
                {
                        __x /= __max;
                        __y /= __max;
                        __z /= __max;
                        return std::sqrt(__x*__x + __y*__y + __z*__z) * __max;
                }
        }
}       //      __detail


        template <typename _Tp>
        inline _GLIBCXX_CONSTEXPR typename
enable_if<is_floating_point<_Tp>::value, _Tp>::type  __hypot3(const _Tp __x,
const _Tp __y, const _Tp __z)   noexcept
        {
                return (GCC_UNLIKELY(__detail::__isinf3(__x, __y, __z))) ?
numeric_limits<_Tp>::infinity() : __detail::__hypot3(__x, __y, __z);
        }

#undef GCC_UNLIKELY
#undef GCC_LIKELY

How does it work?
* Basically, I first pull out the special case INFINITY (see
https://en.cppreference.com/w/cpp/numeric/math/hypot).
* As an additional safety measure (to prevent misuse) the functions are defined
by enable_if.

constexpr
* The hypot3 functions can thus be defined as _GLIBCXX_CONSTEXPR.

Questions
* To get a better runtime behavior I define GCC_(UN)LIKELY. Are there already
such macros (which I have overlooked)?
* The functions are noexcept. Does that make sense? If yes: why are the math
functions not noexcept?

thx
Gero

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
  2021-05-04 12:31 ` [Bug libstdc++/77776] C++17 std::hypot implementation is poor rguenth at gcc dot gnu.org
  2024-02-29 18:23 ` g.peterhoff@t-online.de
@ 2024-03-01 12:27 ` redi at gcc dot gnu.org
  2024-03-02 15:52 ` g.peterhoff@t-online.de
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: redi at gcc dot gnu.org @ 2024-03-01 12:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #12 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to g.peterhoff from comment #11)
> Would this be a good implementation for hypot3 in cmath?

Thanks for the suggestion!

> #define GCC_UNLIKELY(x) __builtin_expect(x, 0)
> #define GCC_LIKELY(x) __builtin_expect(x, 1)
> 
> namespace __detail
> {
> 	template <typename _Tp>
> 	inline _GLIBCXX_CONSTEXPR typename enable_if<is_floating_point<_Tp>::value,
> bool>::type	__isinf3(const _Tp __x, const _Tp __y, const _Tp __z)	noexcept
> 	{
> 		return bool(int(std::isinf(__x)) | int(std::isinf(__y)) |
> int(std::isinf(__z)));

The casts are redundant and just make it harder to read IMHO:

  return std::isinf(__x) | std::isinf(__y) | std::isinf(__z);



> 	}
> 
> 	template <typename _Tp>
> 	inline _GLIBCXX_CONSTEXPR typename enable_if<is_floating_point<_Tp>::value,
> _Tp>::type	__hypot3(_Tp __x, _Tp __y, _Tp __z)	noexcept
> 	{
> 		__x = std::fabs(__x);
> 		__y = std::fabs(__y);
> 		__z = std::fabs(__z);
> 
> 		const _Tp
> 			__max = std::fmax(std::fmax(__x, __y), __z);
> 
> 		if (GCC_UNLIKELY(__max == _Tp{}))
> 		{
> 			return __max;
> 		}
> 		else
> 		{
> 			__x /= __max;
> 			__y /= __max;
> 			__z /= __max;
> 			return std::sqrt(__x*__x + __y*__y + __z*__z) * __max;
> 		}
> 	}
> }	//	__detail
> 
> 
> 	template <typename _Tp>
> 	inline _GLIBCXX_CONSTEXPR typename enable_if<is_floating_point<_Tp>::value,
> _Tp>::type	__hypot3(const _Tp __x, const _Tp __y, const _Tp __z)	noexcept

This is a C++17 function, you can use enable_if_t<is_floating_point_v<_Tp>>,
but see below.

> 	{
> 		return (GCC_UNLIKELY(__detail::__isinf3(__x, __y, __z))) ?
> numeric_limits<_Tp>::infinity() : __detail::__hypot3(__x, __y, __z);
> 	}
> 
> #undef GCC_UNLIKELY
> #undef GCC_LIKELY
> 
> How does it work?
> * Basically, I first pull out the special case INFINITY (see
> https://en.cppreference.com/w/cpp/numeric/math/hypot).
> * As an additional safety measure (to prevent misuse) the functions are
> defined by enable_if.

I don't think we want to slow down compilation like that. If users decide to
misuse std::__detail::__isinf3 then they get what they deserve.


> constexpr
> * The hypot3 functions can thus be defined as _GLIBCXX_CONSTEXPR.

Just use 'constexpr' because this function isn't compiled as C++98. Then you
don't need the 'inline'. Although the standard doesn't allow std::hypot3 to be
constexpr.

> Questions
> * To get a better runtime behavior I define GCC_(UN)LIKELY. Are there
> already such macros (which I have overlooked)?

No, but you can do:


  if (__isinf3(__x, __y, __x)) [[__unlikely__]]
    ...


> * The functions are noexcept. Does that make sense? If yes: why are the math
> functions not noexcept?

I think it's just because nobody bothered to add them, and I doubt much code
ever needs to check whether they are noexcept. The compiler should already know
the standard libm functions don't throw. For this function (which isn't in libm
and the compiler doesn't know about) it seems worth adding 'noexcept'.


Does splitting it into three functions matter? It seems simpler as a single
function:

  template<typename _Tp>
    constexpr _Tp
    __hypot3(_Tp __x, _Tp __y, _Tp __z) noexcept
    {
      if (std::isinf(__x) | std::isinf(__y) | std::isinf(__z)) [[__unlikely__]]
        return numeric_limits<_Tp>::infinity();
      __x = std::fabs(__x);
      __y = std::fabs(__y);
      __z = std::fabs(__z);
      const _Tp __max = std::fmax(std::fmax(__x, __y), __z);
      if (__max == _Tp{}) [[__unlikely__]]
        return __max;
      else
        {
          __x /= __max;
          __y /= __max;
          __z /= __max;
          return std::sqrt(__x*__x + __y*__y + __z*__z) * __max;
        }
    }

This would add a dependency on <limits> to <cmath>, which isn't currently
there. Maybe we could just return (_Tp)__builtin_huge_vall().

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2024-03-01 12:27 ` redi at gcc dot gnu.org
@ 2024-03-02 15:52 ` g.peterhoff@t-online.de
  2024-03-04  3:49 ` de34 at live dot cn
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: g.peterhoff@t-online.de @ 2024-03-02 15:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #13 from g.peterhoff@t-online.de ---
Thanks for the suggestions:
template <typename _Tp>
constexpr _Tp __hypot3(_Tp __x, _Tp __y, _Tp __z)       noexcept
{
        if (std::isinf(__x) | std::isinf(__y) | std::isinf(__z))
[[__unlikely__]]
                return _Tp(INFINITY);
        __x = std::fabs(__x);
        __y = std::fabs(__y);
        __z = std::fabs(__z);
        const _Tp __max = std::fmax(std::fmax(__x, __y), __z);
        if (__max == _Tp{}) [[__unlikely__]]
                return __max;
        __x /= __max;
        __y /= __max;
        __z /= __max;
        return std::sqrt(__x*__x + __y*__y + __z*__z) * __max;
}

The functions are then set to constexpr/noexcept.

regards
Gero

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2024-03-02 15:52 ` g.peterhoff@t-online.de
@ 2024-03-04  3:49 ` de34 at live dot cn
  2024-03-04 10:45 ` mkretz at gcc dot gnu.org
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: de34 at live dot cn @ 2024-03-04  3:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

Jiang An <de34 at live dot cn> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |de34 at live dot cn

--- Comment #14 from Jiang An <de34 at live dot cn> ---
(In reply to g.peterhoff from comment #11)

> constexpr
> * The hypot3 functions can thus be defined as _GLIBCXX_CONSTEXPR.

I think it should be marked _GLIBCXX26_CONSTEXPR.
(and we may also need to change bits/c++config)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2024-03-04  3:49 ` de34 at live dot cn
@ 2024-03-04 10:45 ` mkretz at gcc dot gnu.org
  2024-03-04 11:12 ` jakub at gcc dot gnu.org
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: mkretz at gcc dot gnu.org @ 2024-03-04 10:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #15 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
Your implementation still needs to solve:

1. Loss of precision because of division & subsequent scaling by max. Users
comparing std::hypot(x, y, z) against a simple std::sqrt(x * x + y * y + z * z)
might wonder why they want to use std::hypot if it's less precise.

2. Relatively high cost (in latency and throughput) because of the three
divisions. You could replace it with scale = 1/max; x *= scale; ... But that
can reduce precision even further.

3. Summation of the x, y, and z squares isn't associative if you care about
precision. A high quality implementation needs to add the two lowest values
first.

Here's a precise but inefficient implementation:
(https://compiler-explorer.com/z/ocGPnsYE3)

template <typename T>
[[gnu::optimize("-fno-unsafe-math-optimizations")]]
T
hypot3(T x, T y, T z)
{
  x = std::abs(x);
  y = std::abs(y);
  z = std::abs(z);
  if (std::isinf(x) || std::isinf(y) || std::isinf(z))
    return std::__infinity_v<T>;
  else if (std::isnan(x) || std::isnan(y) || std::isnan(z))
    return std::__quiet_NaN_v<T>;
  else if (x == y && y == z)
    return x * std::sqrt(T(3));
  else if (z == 0 && y == 0)
    return x;
  else if (x == 0 && z == 0)
    return y;
  else if (x == 0 && y == 0)
    return z;
  else
    {
      T hi = std::max(std::max(x, y), z);
      T lo0 = std::min(std::max(x, y), z);
      T lo1 = std::min(x, y);
      int e = 0;
      hi = std::frexp(hi, &e);
      lo0 = std::ldexp(lo0, -e);
      lo1 = std::ldexp(lo1, -e);
      T lo = lo0 * lo0 + lo1 * lo1;
      return std::ldexp(std::sqrt(hi * hi + lo), e);
    }
}

AFAIK
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/experimental/bits/simd_math.h;h=06e7b4496f9917f886f66fbd7629700dd17e55f9;hb=HEAD#l1168
is a precise and efficient implementation. It also avoids division altogether
unless an input is subnormal.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2024-03-04 10:45 ` mkretz at gcc dot gnu.org
@ 2024-03-04 11:12 ` jakub at gcc dot gnu.org
  2024-03-04 17:14 ` mkretz at gcc dot gnu.org
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-03-04 11:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Matthias Kretz (Vir) from comment #15)
> Your implementation still needs to solve:
> 
> 1. Loss of precision because of division & subsequent scaling by max. Users
> comparing std::hypot(x, y, z) against a simple std::sqrt(x * x + y * y + z *
> z) might wonder why they want to use std::hypot if it's less precise.
> 
> 2. Relatively high cost (in latency and throughput) because of the three
> divisions. You could replace it with scale = 1/max; x *= scale; ... But that
> can reduce precision even further.
> 
> 3. Summation of the x, y, and z squares isn't associative if you care about
> precision. A high quality implementation needs to add the two lowest values
> first.
> 
> Here's a precise but inefficient implementation:
> (https://compiler-explorer.com/z/ocGPnsYE3)
> 
> template <typename T>
> [[gnu::optimize("-fno-unsafe-math-optimizations")]]
> T
> hypot3(T x, T y, T z)
> {
>   x = std::abs(x);
>   y = std::abs(y);
>   z = std::abs(z);
>   if (std::isinf(x) || std::isinf(y) || std::isinf(z))
>     return std::__infinity_v<T>;
>   else if (std::isnan(x) || std::isnan(y) || std::isnan(z))
>     return std::__quiet_NaN_v<T>;
>   else if (x == y && y == z)
>     return x * std::sqrt(T(3));
>   else if (z == 0 && y == 0)
>     return x;
>   else if (x == 0 && z == 0)
>     return y;
>   else if (x == 0 && y == 0)
>     return z;
>   else
>     {
>       T hi = std::max(std::max(x, y), z);
>       T lo0 = std::min(std::max(x, y), z);
>       T lo1 = std::min(x, y);
>       int e = 0;
>       hi = std::frexp(hi, &e);
>       lo0 = std::ldexp(lo0, -e);
>       lo1 = std::ldexp(lo1, -e);
>       T lo = lo0 * lo0 + lo1 * lo1;
>       return std::ldexp(std::sqrt(hi * hi + lo), e);
>     }
> }
> 
> AFAIK
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/
> experimental/bits/simd_math.h;h=06e7b4496f9917f886f66fbd7629700dd17e55f9;
> hb=HEAD#l1168 is a precise and efficient implementation. It also avoids
> division altogether unless an input is subnormal.

What glibc does there for the 2 argument hypot is after handling the non-finite
cases finds the minimum and maximum and uses just normal multiplication,
addition + sqrt for the common case where maximum isn't too large and minimum
isn't too small.
So, no need to use frexp/ldexp, just comparisons of hi above against sqrt of
(max finite / 3), in that case scale by multiplying all 3 args by some
appropriate scale constant, and similarly otherwise if lo1 is too small by some
large scale.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2024-03-04 11:12 ` jakub at gcc dot gnu.org
@ 2024-03-04 17:14 ` mkretz at gcc dot gnu.org
  2024-03-04 20:21 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: mkretz at gcc dot gnu.org @ 2024-03-04 17:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #17 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
hypotf(a, b) is implemented using double precision and hypot(a, b) uses 80-bit
long double on i386 and x86_64 hypot does what you describe, right?

std::experimental::simd benchmarks of hypot(a, b), where simd_abi::scalar uses
the <cmath> implementation (i.e. glibc):


-march=skylake-avx512 -ffast-math -O3 -lmvec:
              TYPE                      Latency     Speedup     Throughput    
Speedup
                                  [cycles/call] [per value]  [cycles/call] [per
value]
 float, simd_abi::scalar                   37.5           1           11.5     
     1
 float,                                    37.6       0.999           10.2     
  1.13
 float, simd_abi::__sse                      34        4.42           6.46     
  7.15
 float, simd_abi::__avx                    34.1        8.79           6.56     
  14.1
 float, simd_abi::_Avx512<32>              34.3        8.76           6.01     
  15.4
 float, simd_abi::_Avx512<64>              44.1        13.6             12     
  15.4
 float, [[gnu::vector_size(16)]]           58.3        2.57           47.5     
 0.974
 float, [[gnu::vector_size(32)]]            132        2.27            104     
 0.892
 float, [[gnu::vector_size(64)]]            240         2.5            222     
 0.832
--------------------------------------------------------------------------------------
              TYPE                      Latency     Speedup     Throughput    
Speedup
                                  [cycles/call] [per value]  [cycles/call] [per
value]
double, simd_abi::scalar                     81           1           21.5     
     1
double,                                    80.1        1.01           21.3     
  1.01
double, simd_abi::__sse                    39.9        4.06           6.47     
  6.64
double, simd_abi::__avx                    40.2        8.05             12     
  7.14
double, simd_abi::_Avx512<32>              40.3        8.04             12     
  7.14
double, simd_abi::_Avx512<64>              56.2        11.5             24     
  7.14
double, [[gnu::vector_size(16)]]           89.3        1.81           42.5     
  1.01
double, [[gnu::vector_size(32)]]            150        2.16            110     
 0.777
double, [[gnu::vector_size(64)]]            297        2.18            242     
  0.71
--------------------------------------------------------------------------------------

-march=skylake-avx512 -O3 -lmvec:
              TYPE                      Latency     Speedup     Throughput    
Speedup
                                  [cycles/call] [per value]  [cycles/call] [per
value]
 float, simd_abi::scalar                   37.6           1           10.4     
     1
 float,                                    37.7       0.998           10.2     
  1.02                                                                          
 float, simd_abi::__sse                    37.6           4           8.83     
  4.71                                                                          
 float, simd_abi::__avx                    37.5        8.01           9.42     
  8.82
 float, simd_abi::_Avx512<64>              47.8        12.6             12     
  13.8
 float, [[gnu::vector_size(16)]]           98.7        1.52           57.2     
 0.727
 float, [[gnu::vector_size(32)]]            151           2            114     
 0.728
 float, [[gnu::vector_size(64)]]            260        2.31            230     
 0.722
--------------------------------------------------------------------------------------
              TYPE                      Latency     Speedup     Throughput    
Speedup
                                  [cycles/call] [per value]  [cycles/call] [per
value]
double, simd_abi::scalar                   79.7           1           21.7     
     1
double,                                    80.1       0.995           21.6     
     1
double, simd_abi::__sse                    44.2         3.6           9.99     
  4.33
double, simd_abi::__avx                    43.6        7.32             12     
  7.21
double, simd_abi::_Avx512<64>              59.9        10.6             24     
  7.21
double, [[gnu::vector_size(16)]]           88.3         1.8           44.2     
  0.98
double, [[gnu::vector_size(32)]]            163        1.96            115     
  0.75
double, [[gnu::vector_size(64)]]            302        2.11            233     
 0.742
--------------------------------------------------------------------------------------

I have never ported my SIMD implementation back to scalar and benchmarked it
against glibc.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2024-03-04 17:14 ` mkretz at gcc dot gnu.org
@ 2024-03-04 20:21 ` jakub at gcc dot gnu.org
  2024-03-06  2:47 ` g.peterhoff@t-online.de
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-03-04 20:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I was looking at the sysdeps/ieee754/ldbl-128/ version, i.e. what is used for
hypotf128.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2024-03-04 20:21 ` jakub at gcc dot gnu.org
@ 2024-03-06  2:47 ` g.peterhoff@t-online.de
  2024-03-06  9:43 ` mkretz at gcc dot gnu.org
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: g.peterhoff@t-online.de @ 2024-03-06  2:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #19 from g.peterhoff@t-online.de ---
> So, no need to use frexp/ldexp, just comparisons of hi above against sqrt of
> (max finite / 3), in that case scale by multiplying all 3 args by some
> appropriate scale constant, and similarly otherwise if lo1 is too small by
> some large scale.

I don't really know. With frexp/ldexp you probably get the highest accuracy
(even if it is probably slower) instead of doing it manually. The problem is to
determine suitable scaling factors and to adjust the (return)values
accordingly. I have implemented both cases.

Error
* In the case (x==y && y==z), x*std::sqrt(T(3)) must not simply be returned, as
this can lead to an overflow (inf).

Generally
* Instead of using fmin/fmax to determine the values hi,lo1,lo0, it is better
to sort x,y,z. This is faster and clearer and no additional variables need to
be introduced.
* It also makes sense to consider the case (x==0 && y==0 && z==0).

Optimizations
* You were probably wondering why I wrote "if (std::isinf(x) | std::isinf(y) |
std::isinf(z))", for example. This is intentional. The problem is that gcc
almost always produces branch code for logical operations, so *a lot* of
conditional jumps. By using arithmetic operations, so instead of || && just |
&, I can get it to generate only actually necessary conditional jumps or
cmoves. branchfree code is always better.


template <typename T>
constexpr T     hypot3_exp(T x, T y, T z) noexcept
{
        using limits = std::numeric_limits<T>;

        constexpr T
                zero = 0;

        x = std::abs(x);
        y = std::abs(y);
        z = std::abs(z);

        if (std::isinf(x) | std::isinf(y) | std::isinf(z))  [[unlikely]]
                return limits::infinity();
        if (std::isnan(x) | std::isnan(y) | std::isnan(z))      [[unlikely]]
                return limits::quiet_NaN();
        if ((x==zero) & (y==zero) & (z==zero))  [[unlikely]]
                return zero;
        if ((y==zero) & (z==zero))      [[unlikely]]
                return x;
        if ((x==zero) & (z==zero))      [[unlikely]]
                return y;
        if ((x==zero) & (y==zero))      [[unlikely]]
                return z;

        auto sort = [](T& a, T& b, T& c)        constexpr noexcept -> void
        {
                if (a > b) std::swap(a, b);
                if (b > c) std::swap(b, c);
                if (a > b) std::swap(a, b);
        };

        sort(x, y, z);  //      x <= y <= z

        int
                exp = 0;

        z = std::frexp(z, &exp);
        y = std::ldexp(y, -exp);
        x = std::ldexp(x, -exp);

        T
                sum = x*x + y*y;

        sum += z*z;
        return std::ldexp(std::sqrt(sum), exp);
}

template <typename T>
constexpr T     hypot3_scale(T x, T y, T z) noexcept
{
        using limits = std::numeric_limits<T>;

        auto prev_power2 = [](const T value)    constexpr noexcept -> T
        {
                return std::exp2(std::floor(std::log2(value)));
        };

        constexpr T
                sqrtmax         = std::sqrt(limits::max()),
                scale_up        = prev_power2(sqrtmax),
                scale_down      = T(1) / scale_up,
                zero            = 0;

        x = std::abs(x);
        y = std::abs(y);
        z = std::abs(z);

        if (std::isinf(x) | std::isinf(y) | std::isinf(z))  [[unlikely]]
                return limits::infinity();
        if (std::isnan(x) | std::isnan(y) | std::isnan(z))      [[unlikely]]
                return limits::quiet_NaN();
        if ((x==zero) & (y==zero) & (z==zero))  [[unlikely]]
                return zero;
        if ((y==zero) & (z==zero))      [[unlikely]]
                return x;
        if ((x==zero) & (z==zero))      [[unlikely]]
                return y;
        if ((x==zero) & (y==zero))      [[unlikely]]
                return z;

        auto sort = [](T& a, T& b, T& c)        constexpr noexcept -> void
        {
                if (a > b) std::swap(a, b);
                if (b > c) std::swap(b, c);
                if (a > b) std::swap(a, b);
        };

        sort(x, y, z);  //      x <= y <= z

        const T
                scale = (z > sqrtmax) ? scale_down : (z < 1) ? scale_up : 1;

        x *= scale;
        y *= scale;
        z *= scale;

        T
                sum = x*x + y*y;

        sum += z*z;
        return std::sqrt(sum) / scale;
}


regards
Gero

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2024-03-06  2:47 ` g.peterhoff@t-online.de
@ 2024-03-06  9:43 ` mkretz at gcc dot gnu.org
  2024-03-06 12:35 ` redi at gcc dot gnu.org
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: mkretz at gcc dot gnu.org @ 2024-03-06  9:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #20 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
Thanks, I'd be very happy if such a relatively clear implementation could make
it!

> branchfree code is always better.

Don't say it like that. Smart branching, making use of how static
branch-prediction works, can speed up code significantly. You don't want to
compute everything when 99.9% of the inputs need only a fraction of the work.

              TYPE                      Latency     Speedup     Throughput    
Speedup
                                  [cycles/call] [per value]  [cycles/call] [per
value]
 float, simd_abi::scalar                   48.1           1             17     
     1
 float, std::hypot                         43.3        1.11           12.3     
  1.39
 float, hypot3_scale                       31.7        1.52           22.3     
 0.764
 float, hypot3_exp                         83.9       0.574           84.5     
 0.201
--------------------------------------------------------------------------------------
              TYPE                      Latency     Speedup     Throughput    
Speedup
                                  [cycles/call] [per value]  [cycles/call] [per
value]
double, simd_abi::scalar                   54.7           1             15     
     1
double, std::hypot                         53.8        1.02             19     
  0.79
double, hypot3_scale                         44        1.24             24     
 0.625
double, hypot3_exp                         91.3       0.599             91     
 0.165

and with -ffast-math:

              TYPE                      Latency     Speedup     Throughput    
Speedup
                                  [cycles/call] [per value]  [cycles/call] [per
value]
 float, simd_abi::scalar                   48.9           1           9.15     
     1
 float, std::hypot                         53.2       0.918           8.31     
   1.1
 float, hypot3_scale                       31.3        1.56             14     
 0.652
 float, hypot3_exp                         55.9       0.874           21.5     
 0.425
--------------------------------------------------------------------------------------
              TYPE                      Latency     Speedup     Throughput    
Speedup
                                  [cycles/call] [per value]  [cycles/call] [per
value]
double, simd_abi::scalar                   54.8           1           9.07     
     1
double, std::hypot                         61.5       0.891           11.3     
 0.805
double, hypot3_scale                       40.8        1.34           12.1     
 0.753
double, hypot3_exp                         64.2       0.853           23.3     
  0.39


I have not tested correctness or precision yet. Also, the benchmark only uses
inputs that do not require anything else than √x²+y²+z² (which, I believe,
should be the common input and thus optimized for).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2024-03-06  9:43 ` mkretz at gcc dot gnu.org
@ 2024-03-06 12:35 ` redi at gcc dot gnu.org
  2024-03-12 11:31 ` mkretz at gcc dot gnu.org
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: redi at gcc dot gnu.org @ 2024-03-06 12:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #21 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to g.peterhoff from comment #19)
> * You were probably wondering why I wrote "if (std::isinf(x) | std::isinf(y)
> | std::isinf(z))", for example. This is intentional. The problem is that gcc
> almost always produces branch code for logical operations, so *a lot* of
> conditional jumps. By using arithmetic operations, so instead of || && just
> | &, I can get it to generate only actually necessary conditional jumps or
> cmoves. branchfree code is always better.

In general the code is not equivalent if any of the subexpressions has side
effects, such as setting errno, raising an exception for a signalling nan etc.
but I don't think we care about that here.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2024-03-06 12:35 ` redi at gcc dot gnu.org
@ 2024-03-12 11:31 ` mkretz at gcc dot gnu.org
  2024-03-19  0:09 ` g.peterhoff@t-online.de
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: mkretz at gcc dot gnu.org @ 2024-03-12 11:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #22 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
I took your hypot3_scale and reduced latency and throughput. I don't think the
sqrtmax/sqrtmin limits are correct (sqrtmax² * 3 -> infinity).

              TYPE                      Latency     Speedup     Throughput    
Speedup                                                                         
                                  [cycles/call] [per value]  [cycles/call] [per
value]                                                                          
 float,                                    46.5           1           12.7     
     1                                                                          
 float, hypot3_scale                       35.5        1.31             27     
  0.47                                                                          
 float, hypot3_mkretz                      30.2        1.54             12     
  1.06                                                                          
-------------------------------------------------------------------------------------- 
              TYPE                      Latency     Speedup     Throughput    
Speedup                                                                         
                                  [cycles/call] [per value]  [cycles/call] [per
value]                                                                          
double,                                    59.6           1             60     
     1                                                                          
double, hypot3_scale                       51.2        1.16           48.8     
  1.23                                                                          
double, hypot3_mkretz                      40.1        1.49             35     
  1.71


template <typename T>
  constexpr T 
  hypot(T x, T y, T z)
  { 
    using limits = std::numeric_limits<T>;

    auto prev_power2 = [](const T value) constexpr noexcept -> T 
    { 
      return std::exp2(std::floor(std::log2(value)));
    };

    constexpr T sqrtmax = std::sqrt(limits::max());
    constexpr T sqrtmin = std::sqrt(limits::min());
    constexpr T scale_up = prev_power2(sqrtmax);
    constexpr T scale_down = T(1) / scale_up;
    constexpr T zero = 0;

    if (not (std::isnormal(x) && std::isnormal(y) && std::isnormal(z)))
[[unlikely]]
      { 
        if (std::isinf(x) | std::isinf(y) | std::isinf(z))
          return limits::infinity();
        else if (std::isnan(x) | std::isnan(y) | std::isnan(z))
          return limits::quiet_NaN();
        const bool xz = x == zero;
        const bool yz = y == zero;
        const bool zz = z == zero;
        if (xz)
          { 
            if (yz)
              return zz ? zero : z;
            else if (zz)
              return y;
          } 
        else if (yz && zz)
          return x;
      } 

    x = std::abs(x);
    y = std::abs(y);
    z = std::abs(z);

    T a = std::max(std::max(x, y), z);
    T b = std::min(std::max(x, y), z);
    T c = std::min(x, y);

    if (a >= sqrtmin && a <= sqrtmax) [[likely]]
      return std::sqrt(__builtin_assoc_barrier(c * c + b * b) + a * a);

    const T scale = a >= sqrtmin ? scale_down : scale_up;

    a *= scale;
    b *= scale;
    c *= scale;

    return std::sqrt(__builtin_assoc_barrier(c * c + b * b) + a * a) / scale;
  }

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2024-03-12 11:31 ` mkretz at gcc dot gnu.org
@ 2024-03-19  0:09 ` g.peterhoff@t-online.de
  2024-03-25 10:06 ` mkretz at gcc dot gnu.org
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: g.peterhoff@t-online.de @ 2024-03-19  0:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #23 from g.peterhoff@t-online.de ---
Hello Matthias,
you've given me new ideas. I think we agree on implementing hypot3 using a
scaling factor. But the correct value is not yet implemented here either; do
you have a suggestion?
A version here: https://godbolt.org/z/Gd53cG9YG
I've intentionally broken hypot_gp into small pieces so that you can play
around with it. This is of course unnecessary for a final version.

General
* The function must of course work efficiently with all FP types.

Questions
* Sorting: It is theoretically sufficient to sort the values x,y,z only to the
extent that the condition x,y <= z is fulfilled (HYPOT_SORT_FULL).
* Accuracy: This is better with fma (HYPOT_FMA).
* How do you create the benchmarks? I could do this myself without getting on
your nerves.

thx
Gero

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2024-03-19  0:09 ` g.peterhoff@t-online.de
@ 2024-03-25 10:06 ` mkretz at gcc dot gnu.org
  2024-04-05  1:08 ` g.peterhoff@t-online.de
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: mkretz at gcc dot gnu.org @ 2024-03-25 10:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #24 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
(In reply to g.peterhoff from comment #23)
> * How do you create the benchmarks?

https://github.com/mattkretz/simd-benchmarks
Look at hypot3.cpp :)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2024-03-25 10:06 ` mkretz at gcc dot gnu.org
@ 2024-04-05  1:08 ` g.peterhoff@t-online.de
  2024-04-05  1:23 ` g.peterhoff@t-online.de
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: g.peterhoff@t-online.de @ 2024-04-05  1:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #25 from g.peterhoff@t-online.de ---
Hi Matthias,
to get good results on average (all FP-types: (B)FP16..FP128,
scalar/vectorized(SIMD)/parallel/...) this algorithm seems to me (so far) to be
suitable:

template <typename Type>
inline constexpr Type   hypot_gp(Type x, Type y, Type z)        noexcept
{
        using limits = std::numeric_limits<Type>;

        constexpr Type
                sqrt_min        = std::sqrt(limits::min()),
                sqrt_max        = std::sqrt(limits::max()),
                scale_up        = std::exp2( Type(limits::max_exponent/2)),
                scale_down      = std::exp2(-Type(limits::max_exponent/2)),
                zero            = 0;

        x = std::abs(x);
        y = std::abs(y);
        z = std::abs(z);

        if (!(std::isnormal(x) && std::isnormal(y) && std::isnormal(z)))
[[unlikely]]
        {
                if (std::isinf(x) | std::isinf(y) | std::isinf(z))      return
limits::infinity();
                else if (std::isnan(x) | std::isnan(y) | std::isnan(z)) return
limits::quiet_NaN();
                else
                {
                        const bool
                                xz{x == zero},
                                yz{y == zero},
                                zz{z == zero};

                        if (xz)
                        {
                                if (yz)         return zz ? zero : z;
                                else if (zz)    return y;
                        }
                        else if (yz && zz)      return x;
                }
        }

        if (x > z) std::swap(x, z);
        if (y > z) std::swap(y, z);

        if ((z >= sqrt_min) && (z <= sqrt_max)) [[likely]]
        {
                //      no scale
                return std::sqrt(__builtin_assoc_barrier(x*x + y*y) + z*z);
        }
        else
        {
                const Type
                        scale = (z >= sqrt_min) ? scale_down : scale_up;

                x *= scale;
                y *= scale;
                z *= scale;
                return std::sqrt(__builtin_assoc_barrier(x*x + y*y) + z*z);
        }
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2024-04-05  1:08 ` g.peterhoff@t-online.de
@ 2024-04-05  1:23 ` g.peterhoff@t-online.de
  2024-04-10 16:41 ` g.peterhoff@t-online.de
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: g.peterhoff@t-online.de @ 2024-04-05  1:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #26 from g.peterhoff@t-online.de ---
must of course be "... / scale".
How can I still edit posts?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (16 preceding siblings ...)
  2024-04-05  1:23 ` g.peterhoff@t-online.de
@ 2024-04-10 16:41 ` g.peterhoff@t-online.de
  2024-04-10 16:48 ` jakub at gcc dot gnu.org
  2024-04-10 18:08 ` g.peterhoff@t-online.de
  19 siblings, 0 replies; 20+ messages in thread
From: g.peterhoff@t-online.de @ 2024-04-10 16:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #27 from g.peterhoff@t-online.de ---
Hi Matthias,
thanks for your benchmark. I still have 2 questions:

1) Accuracy
The frexp/ldexp variant seems to be the most accurate; is that correct? Then
other constants would have to be used in hypot_gp:
scale_up   = std::exp2(Type(limits::max_exponent-1))
scale_down = std::exp2(Type(limits::min_exponent-1))

2) Speed
Your benchmark outputs several columns (Δ)Latency/(Δ)Throughput/Speedup. What
exactly do the values stand for; what should be optimized for?

thx
Gero

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (17 preceding siblings ...)
  2024-04-10 16:41 ` g.peterhoff@t-online.de
@ 2024-04-10 16:48 ` jakub at gcc dot gnu.org
  2024-04-10 18:08 ` g.peterhoff@t-online.de
  19 siblings, 0 replies; 20+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-04-10 16:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #28 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
As long as the scale is a power of two or 1.0 / power of two, I don't see why
any version wouldn't be inaccurate.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug libstdc++/77776] C++17 std::hypot implementation is poor
       [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
                   ` (18 preceding siblings ...)
  2024-04-10 16:48 ` jakub at gcc dot gnu.org
@ 2024-04-10 18:08 ` g.peterhoff@t-online.de
  19 siblings, 0 replies; 20+ messages in thread
From: g.peterhoff@t-online.de @ 2024-04-10 18:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77776

--- Comment #29 from g.peterhoff@t-online.de ---
(In reply to Jakub Jelinek from comment #28)
> As long as the scale is a power of two or 1.0 / power of two, I don't see
> why any version wouldn't be inaccurate.

Yes, but the constant scale_up is incorrectly selected.
scale_up = std::exp2(Type(limits::max_exponent-1)) --> ok
scale_up = std::exp2(Type(limits::max_exponent/2)) --> error
scale_up = prev_power2(sqrt_max) --> error

scale_down = std::exp2(Type(limits::min_exponent-1))
also seems to me to be more favorable.

PS:
There seems to be a problem with random numbers and std::float16_t, which is
why I use std::uniform_real_distribution<std::float128_t>. I have not yet found
out exactly where the error lies.

thx
Gero


template <std::floating_point Type>
inline constexpr Type   hypot_exp(Type x, Type y, Type z)       noexcept
{
        using limits = std::numeric_limits<Type>;

        constexpr Type
                zero = 0;

        x = std::abs(x);
        y = std::abs(y);
        z = std::abs(z);

        if (!(std::isnormal(x) && std::isnormal(y) && std::isnormal(z)))
[[unlikely]]
        {
                if              (std::isinf(x) | std::isinf(y) | std::isinf(z))
return limits::infinity();
                else if (std::isnan(x) | std::isnan(y) | std::isnan(z)) return
limits::quiet_NaN();
                else
                {
                        const bool
                                xz{x == zero},
                                yz{y == zero},
                                zz{z == zero};

                        if (xz)
                        {
                                if (yz)                 return zz ? zero : z;
                                else if (zz)    return y;
                        }
                        else if (yz && zz)      return x;
                }
        }

        if (x > z) std::swap(x, z);
        if (y > z) std::swap(y, z);

        int
                exp;

        z = std::frexp(z, &exp);
        y = std::ldexp(y, -exp);
        x = std::ldexp(x, -exp);
        return std::ldexp(std::sqrt(__builtin_assoc_barrier(x*x + y*y) + z*z),
exp);
}

template <std::floating_point Type>
inline constexpr Type   hypot_gp(Type x, Type y, Type z)        noexcept
{
        using limits = std::numeric_limits<Type>;

        constexpr Type
                sqrt_min        = std::sqrt(limits::min()),
                sqrt_max        = std::sqrt(limits::max()),
                scale_up        = std::exp2(Type(limits::max_exponent-1)),
                scale_down      = std::exp2(Type(limits::min_exponent-1)),
                zero            = 0;

        x = std::abs(x);
        y = std::abs(y);
        z = std::abs(z);

        if (!(std::isnormal(x) && std::isnormal(y) && std::isnormal(z)))
[[unlikely]]
        {
                if              (std::isinf(x) | std::isinf(y) | std::isinf(z))
return limits::infinity();
                else if (std::isnan(x) | std::isnan(y) | std::isnan(z)) return
limits::quiet_NaN();
                else
                {
                        const bool
                                xz{x == zero},
                                yz{y == zero},
                                zz{z == zero};

                        if (xz)
                        {
                                if (yz)                 return zz ? zero : z;
                                else if (zz)    return y;
                        }
                        else if (yz && zz)      return x;
                }
        }

        if (x > z) std::swap(x, z);
        if (y > z) std::swap(y, z);

        if (const bool b{z>=sqrt_min}; b && z<=sqrt_max) [[likely]]
        {
                //      no scale
                return std::sqrt(__builtin_assoc_barrier(x*x + y*y) + z*z);
        }
        else
        {
                const Type
                        scale = b ? scale_down : scale_up;

                x *= scale;
                y *= scale;
                z *= scale;
                return std::sqrt(__builtin_assoc_barrier(x*x + y*y) + z*z) /
scale;
        }
}

template <std::floating_point Type>
void    test(const size_t count, const Type min, const Type max, const Type
factor)
{
        std::random_device rd{};
        std::mt19937 gen{rd()};
        std::uniform_real_distribution<std::float128_t> dis{min, max};

        auto rnd = [&]() noexcept -> Type { return Type(dis(gen) * factor); };

        for (size_t i=0; i<count; ++i)
        {
                const Type
                        x = rnd(),
                        y = rnd(),
                        z = rnd(),
                        r0= hypot_exp(x, y, z),
                        r1= hypot_gp(x, y, z);

                if (r0 != r1) [[unlikely]]
                {
                        std::cout << "error\n";
                        return;
                }
        }
        std::cout << "ok\n";
}

int main()
{
        using value_type = std::float64_t;
        using limits = std::numeric_limits<value_type>;

        test<value_type>(1024*1024, 0.5, 1, limits::max());
        test<value_type>(1024*1024, 0, 1, limits::min());

        return EXIT_SUCCESS;
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-04-10 18:08 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-77776-4@http.gcc.gnu.org/bugzilla/>
2021-05-04 12:31 ` [Bug libstdc++/77776] C++17 std::hypot implementation is poor rguenth at gcc dot gnu.org
2024-02-29 18:23 ` g.peterhoff@t-online.de
2024-03-01 12:27 ` redi at gcc dot gnu.org
2024-03-02 15:52 ` g.peterhoff@t-online.de
2024-03-04  3:49 ` de34 at live dot cn
2024-03-04 10:45 ` mkretz at gcc dot gnu.org
2024-03-04 11:12 ` jakub at gcc dot gnu.org
2024-03-04 17:14 ` mkretz at gcc dot gnu.org
2024-03-04 20:21 ` jakub at gcc dot gnu.org
2024-03-06  2:47 ` g.peterhoff@t-online.de
2024-03-06  9:43 ` mkretz at gcc dot gnu.org
2024-03-06 12:35 ` redi at gcc dot gnu.org
2024-03-12 11:31 ` mkretz at gcc dot gnu.org
2024-03-19  0:09 ` g.peterhoff@t-online.de
2024-03-25 10:06 ` mkretz at gcc dot gnu.org
2024-04-05  1:08 ` g.peterhoff@t-online.de
2024-04-05  1:23 ` g.peterhoff@t-online.de
2024-04-10 16:41 ` g.peterhoff@t-online.de
2024-04-10 16:48 ` jakub at gcc dot gnu.org
2024-04-10 18:08 ` g.peterhoff@t-online.de

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).