From: Andrew Pinski
Date: Thu, 4 Jan 2024 01:10:12 -0800
Subject: Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
To: Srinivas Yadav
Cc: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org, richard.sandiford@arm.com, m.kretz@gsi.de
Content-Type: text/plain; charset="UTF-8"
On Thu, Jan 4, 2024 at 1:03 AM Srinivas Yadav wrote:
>
> Hi,
>
> Thanks a lot for the review. Sorry for the very late reply.
>
> The following are my comments on the feedback.
>
>> The main thing that worries me is:
>>
>>   #if _GLIBCXX_SIMD_HAVE_SVE
>>   constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
>>   #else
>>   constexpr inline int __sve_vectorized_size_bytes = 0;
>>   #endif
>>
>> Although -msve-vector-bits is currently a per-TU setting, that isn't
>> necessarily going to be the case in future.  Ideally it would be
>> possible to define different implementations of a function with
>> different (fixed) vector lengths within the same TU.  The value at
>> the point that the header file is included is then somewhat arbitrary.
>>
>> So rather than have:
>>
>> > using __neon128 = _Neon<16>;
>> > using __neon64 = _Neon<8>;
>> > +using __sve = _Sve<>;
>>
>> would it be possible instead to have:
>>
>>   using __sve128 = _Sve<128>;
>>   using __sve256 = _Sve<256>;
>>   ...etc...
>>
>> ?  Code specialised for 128-bit vectors could then use __sve128 and
>> code specialised for 256-bit vectors could use __sve256.
>
> I think it is very hard to maintain a specialization for each vector
> length, as SVE lengths can vary from 128 to 2048 bits in 128-bit
> increments, so there would be many specializations. Also, we only embed
> the vector length information in the type specialization. Hence one
> generic __sve ABI type is sufficient; it holds the maximum vector
> length (in bytes), derived from __sve_vectorized_size_bytes (which is
> in turn derived from __ARM_FEATURE_SVE_BITS in simd.h).

I really doubt this would work in the end, because hardware that is
128-bit only can't support -msve-vector-bits=2048. I am thinking
std::experimental::simd is not the right way of supporting this. Really,
the route the standard should be heading towards is vectors whose size
is not constant at compile time; the constant-sized ones could then be
used to emulate the variable-length ones.

I think we should not depend on __ARM_FEATURE_SVE_BITS being set here,
or on it being meaningful in any way.

Thanks,
Andrew Pinski

>> Perhaps that's not possible as things stand, but it'd be interesting
>> to know the exact failure mode if so.  Either way, it would be good to
>> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
>> and instead rely on information derived from template parameters.
>
> I am currently using __ARM_FEATURE_SVE_BITS directly in simd_sve.h to
> define specific vector types for the basic arithmetic types (int,
> float, double, etc.). The macro is used in the context
> arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS), which requires an integer
> constant expression, so it is not quite usable with information derived
> from template parameters. May I please know what the exact issue is if
> __ARM_FEATURE_SVE_BITS is used directly?
>
> Another possible solution is to define a local macro in simd.h as
>   #define _GLIBCXX_ARM_FEATURE_SVE_BITS __ARM_FEATURE_SVE_BITS
> and use _GLIBCXX_ARM_FEATURE_SVE_BITS at all occurrences of
> __ARM_FEATURE_SVE_BITS.
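> For reference, a minimal sketch of the kind of fixed-length typedef
> this constraint comes from (the typedef names here are placeholders,
> not the ones simd_sve.h actually defines):
>
>   #include <arm_sve.h>
>   // Only valid when compiling with -msve-vector-bits=<N>; the compiler
>   // then defines __ARM_FEATURE_SVE_BITS to N, an integer constant, and
>   // the attribute turns the sizeless SVE types into fixed-size types.
>   typedef svfloat32_t __fixed_float32_t
>     __attribute__((arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS)));
>   typedef svbool_t __fixed_bool_t
>     __attribute__((arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS)));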
>> It should be possible to use SVE to optimise some of the __neon*
>> implementations, which has the advantage of working even for VLA SVE.
>> That's obviously a separate patch, though.  Just saying for the record.
>
> Yes, that's an interesting thought. Noted. Thanks!
>
>> On:
>>
>>   inline static __sve_bool_type
>>   __sve_active_mask()
>>   { return svwhilelt_b8(int8_t(0), int8_t(_Np)); };
>>
>> I think it'd be more canonical to use svptrue_b8().  (The inputs to
>> svwhilelt aren't naturally tied to the size of the output elements.
>> This particular call will use WHILELE with 32-bit registers.  So if
>> you do continue to use svwhile*, it'd be good to remove the casts on
>> the inputs.)
>>
>> It'd also be good to use svptrue_b8() for all element sizes where
>> possible, and only use element-specific masks where absolutely
>> necessary.  This helps to promote reuse of predicates.
>>
>> Or does the API require that every mask value is internally
>> "canonical", i.e. every element must be 0 or 1, regardless of the
>> element size?  If so, that could introduce some inefficiency on SVE,
>> where upper bits are (by design) usually ignored.
>
> I am not sure svptrue_b8() would satisfy the requirement here, as we
> currently use svwhilelt_b8(0, _Np) to make a mask in which only the
> first _Np elements are active (1) and the remaining elements are
> inactive (0). I would prefer to continue to use svwhile* and remove the
> casts on the inputs.
>
>> On:
>>
>>   template <size_t _Np>
>>   struct __sve_vector_type<char, _Np>
>>   : __sve_vector_type<__get_sve_value_type_t<char>, _Np>
>>   {};
>>
>>   template <size_t _Np>
>>   struct __sve_vector_type<wchar_t, _Np>
>>   : __sve_vector_type<__get_sve_value_type_t<wchar_t>, _Np>
>>   {};
>>
>>   template <size_t _Np>
>>   struct __sve_vector_type<char16_t, _Np>
>>   : __sve_vector_type<__get_sve_value_type_t<char16_t>, _Np>
>>   {};
>>
>>   template <size_t _Np>
>>   struct __sve_vector_type<char32_t, _Np>
>>   : __sve_vector_type<__get_sve_value_type_t<char32_t>, _Np>
>>   {};
>>
>>   template <size_t _Np>
>>   struct __sve_vector_type<long long int, _Np>
>>   : __sve_vector_type<__get_sve_value_type_t<long long int>, _Np>
>>   {};
>>
>>   template <size_t _Np>
>>   struct __sve_vector_type<unsigned long long int, _Np>
>>   : __sve_vector_type<__get_sve_value_type_t<unsigned long long int>, _Np>
>>   {};
>>
>> Does this ensure complete coverage without duplicate definitions?  It
>> seems to be hard-coding a specific choice of stdint.h type (e.g. by
>> including char but not unsigned char, and long long int but not long
>> int).  Unfortunately the stdint.h definitions vary between newlib and
>> glibc, for example.
>
> I have overloaded all the types that are present in the simd testsuite.
> I am sure it covers all the basic types without any duplicate
> definitions (a duplicate would trigger a redefinition error).
>
>>   inline static type
>>   __sve_mask_first_true()
>>   { return svptrue_pat_b8(SV_VL1); }
>>
>> SV_VL1 produces the same predicate result for all element widths,
>> so I don't think this needs to be parameterised.
>
> Got it!
>
>>   template <>
>>   struct __sve_mask_type<8>
>>   {
>>     ...
>>
>>     inline static void
>>     called()
>>     {}
>>
>> What's the called() function for?
>
> Sorry, I missed this during cleanup. I have removed it now.
>
>>   template <typename _To, typename _From>
>>   _GLIBCXX_SIMD_INTRINSIC constexpr auto
>>   __sve_reinterpret_cast(_From __v)
>>   {
>>     if constexpr (std::is_same_v<_To, int32_t>)
>>       return svreinterpret_s32(__v);
>>     else if constexpr (std::is_same_v<_To, int64_t>)
>>       return svreinterpret_s64(__v);
>>     else if constexpr (std::is_same_v<_To, float32_t>)
>>       return svreinterpret_f32(__v);
>>     else if constexpr (std::is_same_v<_To, float64_t>)
>>       return svreinterpret_f64(__v);
>>   }
>>
>> Why does this only need to handle these 4 types?  Think it'd be
>> worth a comment.
>
> These are only used in the simd math functions (simd_math.h), so only
> conversions between int32/64 and float32/64 are needed.
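> As a standalone illustration of that use (sketch only, calling the
> ACLE intrinsics directly rather than the patch's wrapper): the math
> code only ever needs to reinterpret between a floating-point vector
> and the equally sized integer vector, e.g. to inspect IEEE-754 bit
> patterns.
>
>   #include <arm_sve.h>
>   // Bit-reinterpret float <-> int without changing any lane's bits.
>   svint32_t __float_bits(svfloat32_t __v) { return svreinterpret_s32(__v); }
>   svfloat32_t __bits_float(svint32_t __v) { return svreinterpret_f32(__v); }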
>>   template <size_t _Bits, size_t _Np>
>>   struct _SveMaskWrapper
>>   {
>>     ...
>>
>>     _GLIBCXX_SIMD_INTRINSIC constexpr value_type
>>     operator[](size_t __i) const
>>     {
>>       return _BuiltinSveMaskType::__sve_mask_active_count(
>>                _BuiltinSveVectorType::__sve_active_mask(),
>>                svand_z(_BuiltinSveVectorType::__sve_active_mask(),
>>                        svcmpeq(_BuiltinSveVectorType::__sve_active_mask(),
>>                                _BuiltinSveMaskType::__index0123,
>>                                typename _BuiltinSveMaskType::__sve_mask_uint_type(__i)),
>>                        _M_data));
>>     }
>>
>> A simpler way to do this might be to use svdup_u_z (_M_data, 1, 0)
>> and then svorv on the result.
>
> Thanks! The solution you have provided, svdup_u_z(_M_data, 1)[__i], is
> really simple and efficient. I will make the necessary changes.
>
>>   struct _CommonImplSve
>>   {
>>     ...
>>
>>     template <typename _Tp, typename _Up, size_t _Np>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr __sve_vector_type_t<_Tp, _Np>
>>     _S_load(const _Up* __p, _SveMaskWrapper<sizeof(_Tp), _Np> __k)
>>     {
>>       using _STp = __get_sve_value_type_t<_Tp>;
>>       using _SUp = __get_sve_value_type_t<_Up>;
>>       using _V = __sve_vector_type_t<_Tp, _Np>;
>>       const _SUp* __up = reinterpret_cast<const _SUp*>(__p);
>>
>>       if constexpr (std::is_same_v<_Tp, _Up>)
>>         return _V(svld1(__k._M_data, __up));
>>
>> What about cases where _Tp and _Up differ in signedness?  Is that
>> allowed?  It seems from the code quoted below that it's OK to load
>> int8_ts into a uint16_t vector (for example), even though that could
>> change values.  Are uint16_t->int16_t and int16_t->uint16_t similarly OK?
>
> Yes, it is OK to load int8_t into a uint16_t vector; such cases are
> supported in the other simd backends (AVX and NEON) too. Here I have
> just added specializations using the intrinsics supported by ACLE; all
> the cases which are not supported fall back to scalar loads with casts.
>
>>       if constexpr (std::is_integral_v<_Tp> && std::is_integral_v<_Up>
>>                       && (sizeof(_Tp) > sizeof(_Up)))
>>         {
>>           if constexpr (std::is_same_v<_SUp, int8_t>)
>>             {
>>               if constexpr (std::is_same_v<_STp, int16_t>)
>>                 return _V(svld1sb_s16(__k._M_data, __up));
>>               if constexpr (std::is_same_v<_STp, uint16_t>)
>>                 return _V(svld1sb_u16(__k._M_data, __up));
>>               if constexpr (std::is_same_v<_STp, int32_t>)
>>                 return _V(svld1sb_s32(__k._M_data, __up));
>>               if constexpr (std::is_same_v<_STp, uint32_t>)
>>                 return _V(svld1sb_u32(__k._M_data, __up));
>>               if constexpr (std::is_same_v<_STp, int64_t>)
>>                 return _V(svld1sb_s64(__k._M_data, __up));
>>
>> Will this cope correctly with both long int and long long int
>> (for LP64)?
>
> Yes, it will work correctly, as both long int and long long int fall
> back to s64.
>
>>     template <typename _Tp, typename _Up, size_t _Np>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr void
>>     _S_store(_Up* __p, _SveSimdWrapper<_Tp, _Np> __x, _SveMaskWrapper<sizeof(_Tp), _Np> __k)
>>     {
>>       using _SUp = __get_sve_value_type_t<_Up>;
>>       _SUp* __up = reinterpret_cast<_SUp*>(__p);
>>
>>       if constexpr (std::is_same_v<_Tp, _Up>)
>>         return svst1(__k._M_data, __up, __x);
>>       __execute_n_times<_Np>([&](auto __i) _GLIBCXX_SIMD_ALWAYS_INLINE_LAMBDA {
>>         if (__k[__i])
>>           __p[__i] = static_cast<_Up>(__x[__i]);
>>       });
>>     }
>>
>> Would this benefit from using truncating stores?  It seemed odd that
>> there was support for conversions on the load side but not the store
>> side.
>
> That's right. I have this implemented already in my local fork, but
> could not add it to the patch due to time constraints. I have now added
> support for truncating stores using the intrinsics.
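> For illustration, the load/store pairing now looks conceptually like
> this (standalone sketch with the raw intrinsics, not the patch's
> if-constexpr dispatch; the function names are mine):
>
>   #include <arm_sve.h>
>   // Widening load: sign-extend int8_t memory elements into 16-bit lanes.
>   svint16_t __load_widen(svbool_t __pg, const int8_t* __p)
>   { return svld1sb_s16(__pg, __p); }
>   // Truncating store: write back only the low 8 bits of each 16-bit lane.
>   void __store_narrow(svbool_t __pg, int8_t* __p, svint16_t __v)
>   { svst1b(__pg, __p, __v); }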
>>   template <typename _Abi>
>>   struct _SimdImplSve
>>   {
>>     ...
>>     template <typename _Tp, size_t _Np>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp>
>>     _S_negate(_SveSimdWrapper<_Tp, _Np> __x) noexcept
>>     {
>>       return svcmpeq(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __x._M_data,
>>                      __sve_vector_type<_Tp, _Np>::__sve_broadcast(_Tp{}));
>>     }
>>
>> What operation is this performing?  The implementation doesn't seem
>> to match the intuitive meaning of the name.
>
> It performs logical negation by comparing the input vector register
> (__x) against a zero vector register. The returned mask holds true at
> the positions where the input vector (__x) contains zeros, and false at
> the remaining positions.
>
>>     template <typename _Tp, typename _BinaryOperation>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr _Tp
>>     _S_reduce(simd<_Tp, _Abi> __x, _BinaryOperation&& __binary_op)
>>     {
>>       auto __x_data = __x._M_data;
>>       constexpr size_t _Np = simd_size_v<_Tp, _Abi>;
>>       using __sve_vec_t = __sve_vector_type_t<_Tp, _Np>;
>>       std::size_t __i = __x.size();
>>       for (; (__i % 2) != 1; __i /= 2)
>>         {
>>           __x_data = __binary_op(simd<_Tp, _Abi>(
>>                                    __private_init, _SveSimdWrapper<_Tp, _Np>(
>>                                      __sve_vec_t(svuzp1(__x_data, __x_data)))),
>>                                  simd<_Tp, _Abi>(
>>                                    __private_init, _SveSimdWrapper<_Tp, _Np>(
>>                                      __sve_vec_t(svuzp2(__x_data, __x_data))))
>>                                 )._M_data;
>>         }
>>       _Tp __res = __x_data[0];
>>       for (size_t __ri = 1; __ri != __i; __ri++)
>>         __res = __binary_op(__x_data[__ri], __res);
>>       return __res;
>>     }
>>
>> Will this make use of the reduction instructions where available, or
>> is it the intention that the compiler will do that?
>
> This implementation does not use any reduction instruction directly:
> the binary operation passed in is generic, so the information about
> which fundamental arithmetic operation it performs is lost. Hence I
> have used shuffles to perform the reduction. The loop halves the
> vector in each step until the remaining size is an odd number, then a
> scalar reduction is performed over the remaining elements. This takes
> log(n) steps when n is a power of 2.
>
>>   template <typename _Abi>
>>   struct _SimdImplSve
>>   {
>>     ...
>>     template <typename _Tp, size_t _Np>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr _SveSimdWrapper<_Tp, _Np>
>>     _S_unary_minus(_SveSimdWrapper<_Tp, _Np> __x) noexcept
>>     {
>>       return svmul_x(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __x._M_data,
>>                      static_cast<_Tp>(-1));
>>     }
>>
>> Why svmul_x by -1 rather than svneg_x?
>
> I had missed that svneg_x existed in the ACLE documentation. I have
> updated it now. Thanks a lot!
>
>>     template <typename _Tp, size_t _Np>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr __sve_vector_type_t<_Tp, _Np>
>>     _S_multiplies(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y)
>>     {
>>       return svmul_x(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __x._M_data, __y._M_data);
>>     }
>>
>>     template <typename _Tp, size_t _Np>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr __sve_vector_type_t<_Tp, _Np>
>>     _S_divides(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y)
>>     { return __x._M_data / __y._M_data; }
>>
>> If we're using infix operators for division, couldn't we also use them
>> for the other primitive operations?
>
> That's right; it is just as easy to use infix operators for +, - and *
> too. I will make the necessary changes.
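> Coming back to _S_reduce above, for comparison: when the operation is
> known, SVE does have single-instruction reductions that the generic
> shuffle loop cannot take advantage of. A standalone sketch (floats and
> addition; the names are mine, not the patch's):
>
>   #include <arm_sve.h>
>   // Known operation: a single FADDV reduces the whole vector.
>   float __reduce_add(svbool_t __pg, svfloat32_t __v)
>   { return svaddv(__pg, __v); }
>   // Generic operation: one halving step of the patch's loop, combining
>   // even-indexed (svuzp1) with odd-indexed (svuzp2) elements.
>   svfloat32_t __halving_step(svfloat32_t __v)
>   { return svadd_x(svptrue_b32(), svuzp1(__v, __v), svuzp2(__v, __v)); }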
>>   _GLIBCXX_SIMD_MATH_FALLBACK(fmax)
>>   _GLIBCXX_SIMD_MATH_FALLBACK(fmin)
>>
>> Couldn't we use FMAXNM and FMINNM for these?  I think it'd be worth a
>> comment if not.
>
> Yes, that's totally possible. I had it in my local fork, but could not
> add it to the patch due to time constraints. I have now added it to the
> patch (a sketch of the mapping follows at the end of this reply).
>
>>     template <typename _Tp, size_t _Np, typename _Op>
>>     static constexpr _MaskMember<_Tp>
>>     __fp_cmp(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y, _Op __op)
>>     {
>>       using _Ip = __get_sve_value_type_t<__int_for_sizeof_t<_Tp>>;
>>       using _VI = __sve_vector_type_t<_Ip, _Np>;
>>       using _WI = _SveSimdWrapper<_Ip, _Np>;
>>       const _WI __fmv = __sve_vector_type<_Ip, _Np>::__sve_broadcast(__finite_max_v<_Ip>);
>>       const _WI __zerov = __sve_vector_type<_Ip, _Np>::__sve_broadcast(0);
>>       const _WI __xn = _VI(__sve_reinterpret_cast<_Ip>(__x));
>>       const _WI __yn = _VI(__sve_reinterpret_cast<_Ip>(__y));
>>
>>       const _WI __xp
>>         = svsel(_S_less(__xn, __zerov), _S_unary_minus(_WI(_S_bit_and(__xn, __fmv))), __xn);
>>       const _WI __yp
>>         = svsel(_S_less(__yn, __zerov), _S_unary_minus(_WI(_S_bit_and(__yn, __fmv))), __yn);
>>       return svbic_z(__sve_vector_type<_Ip, _Np>::__sve_active_mask(), __op(__xp, __yp)._M_data,
>>                      _SuperImpl::_S_isunordered(__x, __y)._M_data);
>>     }
>>
>>     template <typename _Tp, size_t _Np>
>>     static constexpr _MaskMember<_Tp>
>>     _S_isgreater(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
>>     { return __fp_cmp(__x, __y, [](auto __xp, auto __yp) { return _S_less(__yp, __xp); }); }
>>
>>     template <typename _Tp, size_t _Np>
>>     static constexpr _MaskMember<_Tp>
>>     _S_isgreaterequal(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
>>     { return __fp_cmp(__x, __y, [](auto __xp, auto __yp) { return _S_less_equal(__yp, __xp); }); }
>>
>>     template <typename _Tp, size_t _Np>
>>     static constexpr _MaskMember<_Tp>
>>     _S_isless(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
>>     { return __fp_cmp(__x, __y, [](auto __xp, auto __yp) { return _S_less(__xp, __yp); }); }
>>
>>     template <typename _Tp, size_t _Np>
>>     static constexpr _MaskMember<_Tp>
>>     _S_islessequal(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
>>     { return __fp_cmp(__x, __y, [](auto __xp, auto __yp) { return _S_less_equal(__xp, __yp); }); }
>>
>>     template <typename _Tp, size_t _Np>
>>     static constexpr _MaskMember<_Tp>
>>     _S_islessgreater(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
>>     {
>>       return svbic_z(__sve_vector_type<_Tp, _Np>::__sve_active_mask(),
>>                      _SuperImpl::_S_not_equal_to(__x, __y)._M_data,
>>                      _SuperImpl::_S_isunordered(__x, __y)._M_data);
>>     }
>>
>> Could you go into more detail about the implementation of these?  They
>> seem to generate more code than I'd expect.
>
> I am not sure whether that is because of the helper __fp_cmp function I
> have used. The __fp_cmp implementation closely follows the one in
> simd_builtin.h. I will try different combinations and optimize it
> further.
>
>>     template <typename _Tp, size_t _Np>
>>     _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp>
>>     _S_isnan([[maybe_unused]] _SveSimdWrapper<_Tp, _Np> __x)
>>     {
>> #if __FINITE_MATH_ONLY__
>>       return {}; // false
>> #elif !defined __SUPPORT_SNAN__
>>       return svnot_z(__sve_vector_type<_Tp, _Np>::__sve_active_mask(),
>>                      _S_equal_to(__x, __x)._M_data);
>> #elif defined __STDC_IEC_559__
>>       return svcmpuo(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __x._M_data, __x._M_data);
>> #else
>> #error "Not implemented: how to support SNaN but non-IEC559 floating-point?"
>> #endif
>>     }
>>
>> I think this is just my lack of libstdc++-v3 knowledge, but why can't
>> we use svcmpuo for the !defined __SUPPORT_SNAN__ case?
>
> Sorry, I am not sure exactly why either. I did the SVE implementation
> using simd_builtin.h as the base implementation. I think @m.kretz@gsi.de
> knows the reason better.
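> Standalone sketches of the two points above, i.e. the FMAXNM mapping
> and the svcmpuo form of the NaN test (names mine, not the patch's):
>
>   #include <arm_sve.h>
>   // fmax semantics: if exactly one operand is a (quiet) NaN, return
>   // the other operand.  FMAXNM/FMINNM implement this directly, via
>   // svmaxnm_x/svminnm_x.
>   svfloat32_t __simd_fmax(svbool_t __pg, svfloat32_t __x, svfloat32_t __y)
>   { return svmaxnm_x(__pg, __x, __y); }
>   // isnan per lane: a lane compares unordered with itself iff it is NaN.
>   svbool_t __isnan_mask(svbool_t __pg, svfloat32_t __x)
>   { return svcmpuo(__pg, __x, __x); }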
>>     template <typename _Tp>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp>
>>     _S_broadcast(bool __x)
>>     {
>>       constexpr size_t _Np = simd_size_v<_Tp, _Abi>;
>>       __sve_bool_type __tr = __sve_vector_type<_Tp, _Np>::__sve_active_mask();
>>       __sve_bool_type __fl = svnot_z(__tr, __tr);
>>
>> This can just be svpfalse_b();
>
> Got it! Thanks!
>
>>   template <typename _Abi>
>>   struct _MaskImplSve
>>   {
>>     ...
>>     template <typename _Tp>
>>     _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp>
>>     _S_load(const bool* __mem)
>>     {
>>       _SveMaskWrapper<sizeof(_Tp), simd_size_v<_Tp, _Abi>> __r;
>>
>>       __execute_n_times<simd_size_v<_Tp, _Abi>>(
>>         [&](auto __i) _GLIBCXX_SIMD_ALWAYS_INLINE_LAMBDA { __r._M_set(__i, __mem[__i]); });
>>
>>       return __r;
>>     }
>>
>>     template <size_t _Bits, size_t _Np>
>>     static inline _SveMaskWrapper<_Bits, _Np>
>>     _S_masked_load(_SveMaskWrapper<_Bits, _Np> __merge, _SveMaskWrapper<_Bits, _Np> __mask,
>>                    const bool* __mem) noexcept
>>     {
>>       _SveMaskWrapper<_Bits, _Np> __r;
>>
>>       __execute_n_times<_Np>([&](auto __i) _GLIBCXX_SIMD_ALWAYS_INLINE_LAMBDA {
>>         if (__mask[__i])
>>           __r._M_set(__i, __mem[__i]);
>>         else
>>           __r._M_set(__i, __merge[__i]);
>>       });
>>
>>       return __r;
>>     }
>>
>> If these are loading unpacked booleans, couldn't we just use svld1
>> followed by a comparison?  Similarly the stores could use svdup_u8_z
>> to load a vector of 1s and 0s and then use svst1 to store it.
>
> Do you mean reinterpret-casting the input pointer (bool*) to (uint8_t*)
> and performing a comparison?
>
>>     template <typename _Tp, typename _Abi>
>>     _GLIBCXX_SIMD_INTRINSIC static bool
>>     _S_all_of(simd_mask<_Tp, _Abi> __k)
>>     { return _S_popcount(__k) == simd_size_v<_Tp, _Abi>; }
>>
>> In principle, this should be better as !svptest_any(..., svnot_z (..., __k)),
>> since we should then be able to use a single flag-setting predicate
>> logic instruction.
>>
>> Incidentally, __k seems like a bit of an AVX-centric name :)
>>
>>     template <typename _Tp, typename _Abi>
>>     _GLIBCXX_SIMD_INTRINSIC static bool
>>     _S_any_of(simd_mask<_Tp, _Abi> __k)
>>     { return _S_popcount(__k) > 0; }
>>
>>     template <typename _Tp, typename _Abi>
>>     _GLIBCXX_SIMD_INTRINSIC static bool
>>     _S_none_of(simd_mask<_Tp, _Abi> __k)
>>     { return _S_popcount(__k) == 0; }
>>
>> These should map directly to svptest_any and !svptest_any respectively.
>
> Got it! I will update with these changes.
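> For reference, the suggested mappings as a standalone sketch (__active
> stands for whatever predicate marks the _Np valid lanes):
>
>   #include <arm_sve.h>
>   // any_of: some active lane of the mask is set.
>   bool __any_of(svbool_t __active, svbool_t __k)
>   { return svptest_any(__active, __k); }
>   // none_of: no active lane of the mask is set.
>   bool __none_of(svbool_t __active, svbool_t __k)
>   { return !svptest_any(__active, __k); }
>   // all_of: no active lane of the mask is clear.
>   bool __all_of(svbool_t __active, svbool_t __k)
>   { return !svptest_any(__active, svnot_z(__active, __k)); }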
>>     template <typename _Tp, typename _Abi>
>>     _GLIBCXX_SIMD_INTRINSIC static int
>>     _S_find_first_set(simd_mask<_Tp, _Abi> __k)
>>     {
>>       constexpr size_t _Np = simd_size_v<_Tp, _Abi>;
>>
>>       auto __first_index = __sve_mask_type<sizeof(_Tp)>::__sve_mask_first_true();
>>       for (int __idx = 0; __idx < _Np; __idx++)
>>         {
>>           if (__sve_mask_type<sizeof(_Tp)>::__sve_mask_active_count(
>>                 __sve_vector_type<_Tp, _Np>::__sve_active_mask(),
>>                 svand_z(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __k._M_data,
>>                         __first_index)))
>>             return __idx;
>>           __first_index = __sve_mask_type<sizeof(_Tp)>::__sve_mask_next_true(
>>             __sve_vector_type<_Tp, _Np>::__sve_active_mask(), __first_index);
>>         }
>>       return -1;
>>     }
>>
>>     template <typename _Tp, typename _Abi>
>>     _GLIBCXX_SIMD_INTRINSIC static int
>>     _S_find_last_set(simd_mask<_Tp, _Abi> __k)
>>     {
>>       constexpr size_t _Np = simd_size_v<_Tp, _Abi>;
>>
>>       int __ret = -1;
>>       auto __first_index = __sve_mask_type<sizeof(_Tp)>::__sve_mask_first_true();
>>       for (int __idx = 0; __idx < _Np; __idx++)
>>         {
>>           if (__sve_mask_type<sizeof(_Tp)>::__sve_mask_active_count(
>>                 __sve_vector_type<_Tp, _Np>::__sve_active_mask(),
>>                 svand_z(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __k._M_data,
>>                         __first_index)))
>>             __ret = __idx;
>>           __first_index = __sve_mask_type<sizeof(_Tp)>::__sve_mask_next_true(
>>             __sve_vector_type<_Tp, _Np>::__sve_active_mask(), __first_index);
>>         }
>>       return __ret;
>>     }
>>
>> _S_find_last_set should be able to use svclasta and an iota vector.
>> _S_find_first_set could do the same with a leading svpfirst.
>
> Thanks. The solution for find_last_set should significantly improve the
> performance. Can you please elaborate on the solution for
> find_first_set? Another efficient solution for find_first_set I have in
> mind is to use svrev_b* and then perform a find_last_set.
>
> Thank you,
> Srinivas Yadav Singanaboina
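> P.S. My reading of the iota suggestion, as a standalone sketch for
> 32-bit element masks (I am using svclastb, the "extract last active
> element" form, and svbrkb for the first-set case; these specific
> adjustments are my assumption, not necessarily what was suggested):
>
>   #include <arm_sve.h>
>   // find_last_set: take the last active index from an iota vector,
>   // falling back to -1 when no lane is set.
>   int __find_last_set(svbool_t __k)
>   { return svclastb(__k, -1, svindex_s32(0, 1)); }
>   // find_first_set: count the lanes strictly before the first set
>   // lane; svbrkb turns "first set lane" into a prefix predicate.
>   int __find_first_set(svbool_t __k, int __n)
>   {
>     int __i = (int) svcntp_b32(svptrue_b32(), svbrkb_z(svptrue_b32(), __k));
>     return __i < __n ? __i : -1; // an all-clear mask counts every lane
>   }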