From: Richard Sandiford <richard.sandiford@arm.com>
To: Srinivas Yadav
Cc: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Date: Sun, 10 Dec 2023 13:29:45 +0000
In-Reply-To: (Srinivas Yadav's message of "Fri, 24 Nov 2023 09:59:16 -0600")

Thanks for the patch and sorry for the slow review.  I can only comment
on the usage of SVE, rather than on the scaffolding around it.
Hopefully Jonathan or others can comment on the rest.

The main thing that worries me is:

  #if _GLIBCXX_SIMD_HAVE_SVE
  constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
  #else
  constexpr inline int __sve_vectorized_size_bytes = 0;
  #endif

Although -msve-vector-bits is currently a per-TU setting, that isn't
necessarily going to be the case in future.  Ideally it would be possible
to define different implementations of a function with different (fixed)
vector lengths within the same TU.  The value at the point that the header
file is included is then somewhat arbitrary.

So rather than have:

> using __neon128 = _Neon<16>;
> using __neon64 = _Neon<8>;
> +using __sve = _Sve<>;

would it be possible instead to have:

  using __sve128 = _Sve<128>;
  using __sve256 = _Sve<256>;
  ...etc...

?  Code specialised for 128-bit vectors could then use __sve128 and
code specialised for 256-bit vectors could use __sve256.
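To make that concrete, something along these lines is what I have in mind
(sketch only; _SveFixed and its members are invented names rather than
anything in the patch):

  // Sketch only: an ABI tag whose (fixed) vector length comes from the
  // template argument, so that element counts are derived from template
  // parameters rather than from __ARM_FEATURE_SVE_BITS at the point the
  // header is included.
  template <int _UsedBits>
    struct _SveFixed
    {
      static constexpr int _S_size_bytes = _UsedBits / 8;

      // Number of lanes of type _Tp in one vector.
      template <typename _Tp>
	static constexpr int _S_lane_count = _S_size_bytes / sizeof(_Tp);
    };

  using __sve128 = _SveFixed<128>;   // code specialised for 128-bit vectors
  using __sve256 = _SveFixed<256>;   // code specialised for 256-bit vectors

  static_assert(__sve128::_S_lane_count<float> == 4, "");
  static_assert(__sve256::_S_lane_count<double> == 4, "");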
Perhaps that's not possible as things stand, but it'd be interesting to
know the exact failure mode if so.  Either way, it would be good to remove
direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible, and
instead rely on information derived from template parameters.

It should be possible to use SVE to optimise some of the __neon*
implementations, which has the advantage of working even for VLA SVE.
That's obviously a separate patch, though.  Just saying for the record.

On:

  inline static __sve_bool_type
  __sve_active_mask()
  { return svwhilelt_b8(int8_t(0), int8_t(_Np)); };

I think it'd be more canonical to use svptrue_b8().

(The inputs to svwhilelt aren't naturally tied to the size of the output
elements.  This particular call will use WHILELE with 32-bit registers.
So if you do continue to use svwhile*, it'd be good to remove the casts
on the inputs.)

It'd also be good to use svptrue_b8() for all element sizes where
possible, and only use element-specific masks where absolutely necessary.
This helps to promote reuse of predicates.

Or does the API require that every mask value is internally "canonical",
i.e. every element must be 0 or 1, regardless of the element size?
If so, that could introduce some inefficiency on SVE, where upper bits
are (by design) usually ignored.

On:

  template <size_t _Np>
    struct __sve_vector_type<char, _Np>
    : __sve_vector_type<__get_sve_value_type_t<char>, _Np>
    {};

  ...

  template <size_t _Np>
    struct __sve_vector_type<long long int, _Np>
    : __sve_vector_type<__get_sve_value_type_t<long long int>, _Np>
    {};

  ...

Does this ensure complete coverage without duplicate definitions?
It seems to be hard-coding a specific choice of stdint.h type
(e.g. by including char but not unsigned char, and long long int but
not long int).  Unfortunately the stdint.h definitions vary between
newlib and glibc, for example.

  inline static type
  __sve_mask_first_true()
  { return svptrue_pat_b8(SV_VL1); }

SV_VL1 produces the same predicate result for all element widths,
so I don't think this needs to be parameterised.

  template <>
    struct __sve_mask_type<8>
    {
      ...

      inline static void called() {}

What's the called() function for?

  template <typename _To, typename _From>
    _GLIBCXX_SIMD_INTRINSIC constexpr auto
    __sve_reinterpret_cast(_From __v)
    {
      if constexpr (std::is_same_v<_To, int32_t>)
	return svreinterpret_s32(__v);
      else if constexpr (std::is_same_v<_To, int64_t>)
	return svreinterpret_s64(__v);
      else if constexpr (std::is_same_v<_To, float32_t>)
	return svreinterpret_f32(__v);
      else if constexpr (std::is_same_v<_To, float64_t>)
	return svreinterpret_f64(__v);
    }

Why does this only need to handle these 4 types?  Think it'd be worth
a comment.

  template <size_t _Bits, size_t _Np>
    struct _SveMaskWrapper
    {
      ...

      _GLIBCXX_SIMD_INTRINSIC constexpr value_type
      operator[](size_t __i) const
      {
	return _BuiltinSveMaskType::__sve_mask_active_count(
		 _BuiltinSveVectorType::__sve_active_mask(),
		 svand_z(_BuiltinSveVectorType::__sve_active_mask(),
			 svcmpeq(_BuiltinSveVectorType::__sve_active_mask(),
				 _BuiltinSveMaskType::__index0123,
				 typename _BuiltinSveMaskType::__sve_mask_uint_type(__i)),
			 _M_data));
      }

A simpler way to do this might be to use svdup_u_z (_M_data, 1, 0)
and then svorrv on the result.
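Concretely, for 32-bit elements, I'm thinking of something like this
(sketch only, with invented helper names rather than the patch's
wrappers):

  #include <arm_sve.h>

  // Sketch only: test mask element __i by expanding the predicate to a
  // vector of 1s and 0s and then OR-reducing just the selected lane.
  static inline bool
  __sve_mask_element_u32(svbool_t __mask, uint64_t __i)
  {
    // Predicate with only lane __i active.
    svbool_t __lane = svcmpeq(svptrue_b32(), svindex_u32(0, 1), (uint32_t) __i);
    // 1 in the active lanes of __mask, 0 elsewhere.
    svuint32_t __ones = svdup_u32_z(__mask, 1);
    // Nonzero iff lane __i of __mask was set.
    return svorv(__lane, __ones) != 0;
  }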
  struct _CommonImplSve
  {
    ...

    template <typename _Tp, size_t _Np, typename _Up>
      _GLIBCXX_SIMD_INTRINSIC static constexpr __sve_vector_type_t<_Tp, _Np>
      _S_load(const _Up* __p, _SveMaskWrapper<sizeof(_Tp), _Np> __k)
      {
	using _STp = __get_sve_value_type_t<_Tp>;
	using _SUp = __get_sve_value_type_t<_Up>;
	using _V = __sve_vector_type_t<_Tp, _Np>;
	const _SUp* __up = reinterpret_cast<const _SUp*>(__p);

	if constexpr (std::is_same_v<_Tp, _Up>)
	  return _V(svld1(__k._M_data, __up));

What about cases where _Tp and _Up differ in signedness?  Is that
allowed?  It seems from the code quoted below that it's OK to load
int8_ts into a uint16_t vector (for example), even though that could
change values.  Are uint16_t->int16_t and int16_t->uint16_t
similarly OK?

	if constexpr (std::is_integral_v<_Tp> && std::is_integral_v<_Up>
			&& (sizeof(_Tp) > sizeof(_Up)))
	  {
	    if constexpr (std::is_same_v<_SUp, int8_t>)
	      {
		if constexpr (std::is_same_v<_STp, int16_t>)
		  return _V(svld1sb_s16(__k._M_data, __up));
		if constexpr (std::is_same_v<_STp, uint16_t>)
		  return _V(svld1sb_u16(__k._M_data, __up));
		if constexpr (std::is_same_v<_STp, int32_t>)
		  return _V(svld1sb_s32(__k._M_data, __up));
		if constexpr (std::is_same_v<_STp, uint32_t>)
		  return _V(svld1sb_u32(__k._M_data, __up));
		if constexpr (std::is_same_v<_STp, int64_t>)
		  return _V(svld1sb_s64(__k._M_data, __up));

Will this cope correctly with both long int and long long int (for LP64)?

    template <typename _Tp, size_t _Np, typename _Up>
      _GLIBCXX_SIMD_INTRINSIC static constexpr void
      _S_store(_Up* __p, _SveSimdWrapper<_Tp, _Np> __x, _SveMaskWrapper<sizeof(_Tp), _Np> __k)
      {
	using _SUp = __get_sve_value_type_t<_Up>;
	_SUp* __up = reinterpret_cast<_SUp*>(__p);

	if constexpr (std::is_same_v<_Tp, _Up>)
	  return svst1(__k._M_data, __up, __x);
	__execute_n_times<_Np>([&](auto __i) _GLIBCXX_SIMD_ALWAYS_INLINE_LAMBDA {
	  if (__k[__i])
	    __p[__i] = static_cast<_Up>(__x[__i]);
	});
      }

Would this benefit from using truncating stores?  It seemed odd that
there was support for conversions on the load side but not the store
side.
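For example, the 32-bit-to-8-bit case could look something like this
(sketch only; the real code would need the same kind of type dispatch
as the load side):

  #include <arm_sve.h>

  // Sketch only: masked truncating store of 32-bit elements to 8-bit
  // memory, the store-side counterpart of svld1sb_s32 above.
  static inline void
  __sve_store_s32_to_s8(svbool_t __k, int8_t* __p, svint32_t __x)
  {
    svst1b(__k, __p, __x);   // stores the low 8 bits of each element
  }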
  template <typename _Abi>
    struct _SimdImplSve
    {
      ...

      template <typename _Tp, size_t _Np>
	_GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp>
	_S_negate(_SveSimdWrapper<_Tp, _Np> __x) noexcept
	{
	  return svcmpeq(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __x._M_data,
			 __sve_vector_type<_Tp, _Np>::__sve_broadcast(_Tp{}));
	}

What operation is this performing?  The implementation doesn't seem
to match the intuitive meaning of the name.

      template <typename _Tp, typename _BinaryOperation>
	_GLIBCXX_SIMD_INTRINSIC static constexpr _Tp
	_S_reduce(simd<_Tp, _Abi> __x, _BinaryOperation&& __binary_op)
	{
	  auto __x_data = __x._M_data;
	  constexpr size_t _Np = simd_size_v<_Tp, _Abi>;
	  using __sve_vec_t = __sve_vector_type_t<_Tp, _Np>;
	  std::size_t __i = __x.size();
	  for (; (__i % 2) != 1; __i /= 2)
	    {
	      __x_data = __binary_op(simd<_Tp, _Abi>(
				       __private_init, _SveSimdWrapper<_Tp, _Np>(
							 __sve_vec_t(svuzp1(__x_data, __x_data)))),
				     simd<_Tp, _Abi>(
				       __private_init, _SveSimdWrapper<_Tp, _Np>(
							 __sve_vec_t(svuzp2(__x_data, __x_data))))
				     )._M_data;
	    }
	  _Tp __res = __x_data[0];
	  for (size_t __ri = 1; __ri != __i; __ri++)
	    __res = __binary_op(__x_data[__ri], __res);
	  return __res;
	}

Will this make use of the reduction instructions where available, or
is it the intention that the compiler will do that?

  template <typename _Abi>
    struct _SimdImplSve
    {
      ...

      template <typename _Tp, size_t _Np>
	_GLIBCXX_SIMD_INTRINSIC static constexpr _SveSimdWrapper<_Tp, _Np>
	_S_unary_minus(_SveSimdWrapper<_Tp, _Np> __x) noexcept
	{
	  return svmul_x(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __x._M_data,
			 static_cast<_Tp>(-1));
	}

Why svmul_x by -1 rather than svneg_x?

      template <typename _Tp, size_t _Np>
	_GLIBCXX_SIMD_INTRINSIC static constexpr __sve_vector_type_t<_Tp, _Np>
	_S_multiplies(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y)
	{
	  return svmul_x(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __x._M_data,
			 __y._M_data);
	}

      template <typename _Tp, size_t _Np>
	_GLIBCXX_SIMD_INTRINSIC static constexpr __sve_vector_type_t<_Tp, _Np>
	_S_divides(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y)
	{ return __x._M_data / __y._M_data; }

If we're using infix operators for division, couldn't we also use
them for the other primitive operations?

  _GLIBCXX_SIMD_MATH_FALLBACK(fmax)
  _GLIBCXX_SIMD_MATH_FALLBACK(fmin)

Couldn't we use FMAXNM and FMINNM for these?  I think it'd be worth
a comment if not.
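That is, something like the following (sketch only, float case,
invented helper names):

  #include <arm_sve.h>

  // Sketch only: fmax/fmin via FMAXNM/FMINNM, which return the numeric
  // operand when exactly one operand is a quiet NaN (matching C's
  // fmax/fmin semantics).
  static inline svfloat32_t
  __sve_fmax_f32(svbool_t __pg, svfloat32_t __x, svfloat32_t __y)
  { return svmaxnm_x(__pg, __x, __y); }   // FMAXNM

  static inline svfloat32_t
  __sve_fmin_f32(svbool_t __pg, svfloat32_t __x, svfloat32_t __y)
  { return svminnm_x(__pg, __x, __y); }   // FMINNM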
  template <typename _Tp, size_t _Np, typename _Op>
    static constexpr _MaskMember<_Tp>
    __fp_cmp(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y, _Op __op)
    {
      using _Ip = __get_sve_value_type_t<__int_for_sizeof_t<_Tp>>;
      using _VI = __sve_vector_type_t<_Ip, _Np>;
      using _WI = _SveSimdWrapper<_Ip, _Np>;
      const _WI __fmv = __sve_vector_type<_Ip, _Np>::__sve_broadcast(__finite_max_v<_Ip>);
      const _WI __zerov = __sve_vector_type<_Ip, _Np>::__sve_broadcast(0);
      const _WI __xn = _VI(__sve_reinterpret_cast<_Ip>(__x));
      const _WI __yn = _VI(__sve_reinterpret_cast<_Ip>(__y));

      const _WI __xp
	= svsel(_S_less(__xn, __zerov), _S_unary_minus(_WI(_S_bit_and(__xn, __fmv))), __xn);
      const _WI __yp
	= svsel(_S_less(__yn, __zerov), _S_unary_minus(_WI(_S_bit_and(__yn, __fmv))), __yn);
      return svbic_z(__sve_vector_type<_Ip, _Np>::__sve_active_mask(), __op(__xp, __yp)._M_data,
		     _SuperImpl::_S_isunordered(__x, __y)._M_data);
    }

  template <typename _Tp, size_t _Np>
    static constexpr _MaskMember<_Tp>
    _S_isgreater(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
    { return __fp_cmp(__x, __y, [](auto __xp, auto __yp) { return _S_less(__yp, __xp); }); }

  template <typename _Tp, size_t _Np>
    static constexpr _MaskMember<_Tp>
    _S_isgreaterequal(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
    { return __fp_cmp(__x, __y, [](auto __xp, auto __yp) { return _S_less_equal(__yp, __xp); }); }

  template <typename _Tp, size_t _Np>
    static constexpr _MaskMember<_Tp>
    _S_isless(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
    { return __fp_cmp(__x, __y, [](auto __xp, auto __yp) { return _S_less(__xp, __yp); }); }

  template <typename _Tp, size_t _Np>
    static constexpr _MaskMember<_Tp>
    _S_islessequal(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
    { return __fp_cmp(__x, __y, [](auto __xp, auto __yp) { return _S_less_equal(__xp, __yp); }); }

  template <typename _Tp, size_t _Np>
    static constexpr _MaskMember<_Tp>
    _S_islessgreater(_SveSimdWrapper<_Tp, _Np> __x, _SveSimdWrapper<_Tp, _Np> __y) noexcept
    {
      return svbic_z(__sve_vector_type<_Tp, _Np>::__sve_active_mask(),
		     _SuperImpl::_S_not_equal_to(__x, __y)._M_data,
		     _SuperImpl::_S_isunordered(__x, __y)._M_data);
    }

Could you go into more detail about the implementation of these?
They seem to generate more code than I'd expect.

  template <typename _Tp, size_t _Np>
    _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp>
    _S_isnan([[maybe_unused]] _SveSimdWrapper<_Tp, _Np> __x)
    {
#if __FINITE_MATH_ONLY__
      return {}; // false
#elif !defined __SUPPORT_SNAN__
      return svnot_z(__sve_vector_type<_Tp, _Np>::__sve_active_mask(),
		     _S_equal_to(__x, __x)._M_data);
#elif defined __STDC_IEC_559__
      return svcmpuo(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __x._M_data, __x._M_data);
#else
#error "Not implemented: how to support SNaN but non-IEC559 floating-point?"
#endif
    }

I think this is just my lack of libstdc++-v3 knowledge, but why can't
we use svcmpuo for the !defined __SUPPORT_SNAN__ case?

  template <typename _Tp>
    _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp>
    _S_broadcast(bool __x)
    {
      constexpr size_t _Np = simd_size_v<_Tp, _Abi>;
      __sve_bool_type __tr = __sve_vector_type<_Tp, _Np>::__sve_active_mask();
      __sve_bool_type __fl = svnot_z(__tr, __tr);

This can just be svpfalse_b();

  template <typename _Abi>
    struct _MaskImplSve
    {
      ...

      template <typename _Tp>
	_GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp>
	_S_load(const bool* __mem)
	{
	  _SveMaskWrapper<sizeof(_Tp), simd_size_v<_Tp, _Abi>> __r;
	  __execute_n_times<simd_size_v<_Tp, _Abi>>(
	    [&](auto __i) _GLIBCXX_SIMD_ALWAYS_INLINE_LAMBDA { __r._M_set(__i, __mem[__i]); });
	  return __r;
	}

      template <size_t _Bits, size_t _Np>
	static inline _SveMaskWrapper<_Bits, _Np>
	_S_masked_load(_SveMaskWrapper<_Bits, _Np> __merge, _SveMaskWrapper<_Bits, _Np> __mask,
		       const bool* __mem) noexcept
	{
	  _SveMaskWrapper<_Bits, _Np> __r;

	  __execute_n_times<_Np>([&](auto __i) _GLIBCXX_SIMD_ALWAYS_INLINE_LAMBDA {
	    if (__mask[__i])
	      __r._M_set(__i, __mem[__i]);
	    else
	      __r._M_set(__i, __merge[__i]);
	  });

	  return __r;
	}

If these are loading unpacked booleans, couldn't we just use svld1
followed by a comparison?  Similarly the stores could use svdup_u8_z
to load a vector of 1s and 0s and then use svst1 to store it.

      template <typename _Tp>
	_GLIBCXX_SIMD_INTRINSIC static bool
	_S_all_of(simd_mask<_Tp, _Abi> __k)
	{ return _S_popcount(__k) == simd_size_v<_Tp, _Abi>; }

In principle, this should be better as !svptest_any(..., svnot_z (..., __k)),
since we should then be able to use a single flag-setting predicate
logic instruction.

Incidentally, __k seems like a bit of an AVX-centric name :)

      template <typename _Tp>
	_GLIBCXX_SIMD_INTRINSIC static bool
	_S_any_of(simd_mask<_Tp, _Abi> __k)
	{ return _S_popcount(__k) > 0; }

      template <typename _Tp>
	_GLIBCXX_SIMD_INTRINSIC static bool
	_S_none_of(simd_mask<_Tp, _Abi> __k)
	{ return _S_popcount(__k) == 0; }

These should map directly to svptest_any and !svptest_any respectively.

      template <typename _Tp>
	_GLIBCXX_SIMD_INTRINSIC static int
	_S_find_first_set(simd_mask<_Tp, _Abi> __k)
	{
	  constexpr size_t _Np = simd_size_v<_Tp, _Abi>;

	  auto __first_index = __sve_mask_type<sizeof(_Tp)>::__sve_mask_first_true();
	  for (int __idx = 0; __idx < _Np; __idx++)
	    {
	      if (__sve_mask_type<sizeof(_Tp)>::__sve_mask_active_count(
		    __sve_vector_type<_Tp, _Np>::__sve_active_mask(),
		    svand_z(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __k._M_data,
			    __first_index)))
		return __idx;
	      __first_index = __sve_mask_type<sizeof(_Tp)>::__sve_mask_next_true(
		__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __first_index);
	    }
	  return -1;
	}

      template <typename _Tp>
	_GLIBCXX_SIMD_INTRINSIC static int
	_S_find_last_set(simd_mask<_Tp, _Abi> __k)
	{
	  constexpr size_t _Np = simd_size_v<_Tp, _Abi>;

	  int __ret = -1;
	  auto __first_index = __sve_mask_type<sizeof(_Tp)>::__sve_mask_first_true();
	  for (int __idx = 0; __idx < _Np; __idx++)
	    {
	      if (__sve_mask_type<sizeof(_Tp)>::__sve_mask_active_count(
		    __sve_vector_type<_Tp, _Np>::__sve_active_mask(),
		    svand_z(__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __k._M_data,
			    __first_index)))
		__ret = __idx;
	      __first_index = __sve_mask_type<sizeof(_Tp)>::__sve_mask_next_true(
		__sve_vector_type<_Tp, _Np>::__sve_active_mask(), __first_index);
	    }
	  return __ret;
	}

_S_find_last_set should be able to use svclasta and an iota vector.
_S_find_first_set could do the same with a leading svpfirst.

Thanks,
Richard
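P.S. For the record, the kind of thing I mean for the predicate queries,
sketched for 32-bit elements and assuming the mask covers a full vector
(using svclastb with an explicit -1 fallback rather than svclasta;
invented helper names, not tested):

  #include <arm_sve.h>

  // Sketch only: all_of/any_of as single predicate tests.
  static inline bool
  __sve_all_of_u32(svbool_t __k)
  { return !svptest_any(svptrue_b32(), svnot_z(svptrue_b32(), __k)); }

  static inline bool
  __sve_any_of_u32(svbool_t __k)
  { return svptest_any(svptrue_b32(), __k); }

  // Sketch only: index of the last set mask element, or -1 if none.
  static inline int
  __sve_find_last_set_u32(svbool_t __k)
  {
    // Lane numbers 0, 1, 2, ...; CLASTB extracts the value in the last
    // active lane, or returns the -1 fallback if no lane is active.
    return svclastb(__k, -1, svindex_s32(0, 1));
  }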