From: Jonathan Wakely <redi@gcc.gnu.org>
To: gcc-cvs@gcc.gnu.org, libstdc++-cvs@gcc.gnu.org
Subject: [gcc r11-6935] libstdc++: Add std::experimental::simd from the Parallelism TS 2
Date: Wed, 27 Jan 2021 16:39:18 +0000 (GMT)
Message-ID: <20210127163918.E607F3846405@sourceware.org>

https://gcc.gnu.org/g:2bcceb6fc59fcdaf51006d4fcfc71c2d26761396

commit r11-6935-g2bcceb6fc59fcdaf51006d4fcfc71c2d26761396
Author: Matthias Kretz <kretz@kde.org>
Date:   Thu Jan 21 11:45:15 2021 +0000

    libstdc++: Add std::experimental::simd from the Parallelism TS 2

    Adds <experimental/simd>. This implements the simd and simd_mask class
    templates via [[gnu::vector_size(N)]] data members. It implements
    overloads of all <cmath> functions for simd; explicit vectorization of
    the <cmath> functions is not finished.

    The majority of functions are marked as [[gnu::always_inline]] to
    enable quasi-ODR-conforming linking of TUs compiled with different -m
    flags. Performance optimization was done for x86_64. ARM, AArch64, and
    POWER rely on the compiler to recognize reduction, conversion, and
    shuffle patterns. Besides verification using many different machine
    flags, the code was also verified with different fast-math flags.

    libstdc++-v3/ChangeLog:

	* doc/xml/manual/status_cxx2017.xml: Add implementation status
	of the Parallelism TS 2. Document implementation-defined types
	and behavior.
	* include/Makefile.am: Add new headers.
	* include/Makefile.in: Regenerate.
	* include/experimental/simd: New file. New header for
	Parallelism TS 2.
	* include/experimental/bits/numeric_traits.h: New file.
	Implementation of P1841R1 using internal naming. Addition of
	missing IEC559 functionality query.
	* include/experimental/bits/simd.h: New file. Definition of the
	public simd interfaces and general implementation helpers.
	* include/experimental/bits/simd_builtin.h: New file.
	Implementation of the _VecBuiltin simd_abi.
	* include/experimental/bits/simd_converter.h: New file. Generic
	simd conversions.
	* include/experimental/bits/simd_detail.h: New file. Internal
	macros for the simd implementation.
	* include/experimental/bits/simd_fixed_size.h: New file. Simd
	fixed_size ABI specific implementations.
	* include/experimental/bits/simd_math.h: New file. Math
	overloads for simd.
	* include/experimental/bits/simd_neon.h: New file. Simd NEON
	specific implementations.
	* include/experimental/bits/simd_ppc.h: New file. Implement bit
	shifts to avoid invalid results for integral types smaller than
	int.
	* include/experimental/bits/simd_scalar.h: New file. Simd
	scalar ABI specific implementations.
	* include/experimental/bits/simd_x86.h: New file. Simd x86
	specific implementations.
	* include/experimental/bits/simd_x86_conversions.h: New file.
	x86 specific conversion optimizations. The conversion patterns
	work around missing conversion patterns in the compiler and
	should be removed as soon as PR85048 is resolved.
	* testsuite/experimental/simd/standard_abi_usable.cc: New file.
	Test that all (not all fixed_size<N>, though) standard simd and
	simd_mask types are usable.
	* testsuite/experimental/simd/standard_abi_usable_2.cc: New
	file. As above but with -ffast-math.
	* testsuite/libstdc++-dg/conformance.exp: Don't build simd
	tests from the standard test loop. Instead use
	check_vect_support_and_set_flags to build simd tests with the
	relevant machine flags.
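Not part of the patch: a minimal sketch of the user-facing interface this
commit provides, using only names specified by the Parallelism TS 2
(native_simd, the generator constructor, reduce):

    // Element-wise arithmetic on a native-width simd vector.
    #include <experimental/simd>
    #include <iostream>

    namespace stdx = std::experimental;

    int main()
    {
      using V = stdx::native_simd<float>;   // e.g. 8 floats with -mavx2
      V a([](int i) { return float(i); });  // generator ctor: 0, 1, 2, ...
      V b = 2.f;                            // broadcast constructor
      V c = a * b + 1.f;                    // element-wise operators
      std::cout << "width " << V::size()
                << ", sum " << stdx::reduce(c) << '\n'; // horizontal reduction
    }

Compiled with, e.g., g++ -std=c++17 -march=native; the vector width (and
hence V::size()) follows from the -m machine flags, as documented below.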
Diff:
---
 libstdc++-v3/doc/xml/manual/status_cxx2017.xml     |  216 +
 libstdc++-v3/include/Makefile.am                   |   13 +
 libstdc++-v3/include/Makefile.in                   |   13 +
 .../include/experimental/bits/numeric_traits.h     |  567 +++
 libstdc++-v3/include/experimental/bits/simd.h      | 5051 +++++++++++++++++++
 .../include/experimental/bits/simd_builtin.h       | 2949 +++++++++++
 .../include/experimental/bits/simd_converter.h     |  354 ++
 .../include/experimental/bits/simd_detail.h        |  306 ++
 .../include/experimental/bits/simd_fixed_size.h    | 2066 ++++++++
 libstdc++-v3/include/experimental/bits/simd_math.h | 1500 ++++++
 libstdc++-v3/include/experimental/bits/simd_neon.h |  519 ++
 libstdc++-v3/include/experimental/bits/simd_ppc.h  |  123 +
 .../include/experimental/bits/simd_scalar.h        |  772 +++
 libstdc++-v3/include/experimental/bits/simd_x86.h  | 5169 ++++++++++++++++++++
 .../experimental/bits/simd_x86_conversions.h       | 2029 ++++++++
 libstdc++-v3/include/experimental/simd             |   70 +
 .../experimental/simd/standard_abi_usable.cc       |   64 +
 .../experimental/simd/standard_abi_usable_2.cc     |    4 +
 .../testsuite/libstdc++-dg/conformance.exp         |   18 +-
 19 files changed, 21802 insertions(+), 1 deletion(-)

diff --git a/libstdc++-v3/doc/xml/manual/status_cxx2017.xml b/libstdc++-v3/doc/xml/manual/status_cxx2017.xml
index e6834b3607a..bc740f8e1ba 100644
--- a/libstdc++-v3/doc/xml/manual/status_cxx2017.xml
+++ b/libstdc++-v3/doc/xml/manual/status_cxx2017.xml
@@ -2869,6 +2869,17 @@ since C++14 and the implementation is complete.
       <entry>Library Fundamentals 2 TS</entry>
     </row>

+    <row>
+      <entry>
+        <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf">
+        P0214R9
+        </link>
+      </entry>
+      <entry>Data-Parallel Types</entry>
+      <entry>Y</entry>
+      <entry>Parallelism 2 TS</entry>
+    </row>
+
   </tbody>
 </tgroup>
</table>
@@ -3014,6 +3025,211 @@ since C++14 and the implementation is complete.
   If <code>!is_regular_file(p)</code>, an error is reported.
 </para>

+<section xml:id="iso.2017.par2ts" xreflabel="Implementation Specific Behavior of the Parallelism 2 TS"><info><title>Parallelism 2 TS</title></info>
+
+<para>
+  <emphasis>9.3 [parallel.simd.abi]</emphasis>
+  <code>max_fixed_size<T></code> is 32, except when targeting
+  AVX512BW and <code>sizeof(T)</code> is 1.
+</para>
+
+<para>
+  When targeting 32-bit x86,
+  <classname>simd_abi::compatible<T></classname> is an alias for
+  <classname>simd_abi::scalar</classname>.
+  When targeting 64-bit x86 (including x32) or AArch64,
+  <classname>simd_abi::compatible<T></classname> is an alias for
+  <classname>simd_abi::_VecBuiltin<16></classname>,
+  unless <code>T</code> is <code>long double</code>, in which case it is
+  an alias for <classname>simd_abi::scalar</classname>.
+  When targeting ARM (but not AArch64) with NEON support,
+  <classname>simd_abi::compatible<T></classname> is an alias for
+  <classname>simd_abi::_VecBuiltin<16></classname>,
+  unless <code>sizeof(T) > 4</code>, in which case it is
+  an alias for <classname>simd_abi::scalar</classname>. Additionally,
+  <classname>simd_abi::compatible<float></classname> is an alias for
+  <classname>simd_abi::scalar</classname> unless compiling with
+  -ffast-math.
+</para>
+
+<para>
+  When targeting x86 (both 32-bit and 64-bit),
+  <classname>simd_abi::native<T></classname> is an alias for one of
+  <classname>simd_abi::scalar</classname>,
+  <classname>simd_abi::_VecBuiltin<16></classname>,
+  <classname>simd_abi::_VecBuiltin<32></classname>, or
+  <classname>simd_abi::_VecBltnBtmsk<64></classname>, depending on
+  <code>T</code> and the machine options the compiler was invoked with.
+</para>
+
+<para>
+  When targeting ARM/AArch64 or POWER,
+  <classname>simd_abi::native<T></classname> is an alias for
+  <classname>simd_abi::scalar</classname> or
+  <classname>simd_abi::_VecBuiltin<16></classname>, depending on
+  <code>T</code> and the machine options the compiler was invoked with.
+</para>
+
+<para>
+  For any other targeted machine,
+  <classname>simd_abi::compatible<T></classname> and
+  <classname>simd_abi::native<T></classname> are aliases for
+  <classname>simd_abi::scalar</classname>. (subject to change)
+</para>
+
+<para>
+  The extended ABI tag types defined in the
+  <code>std::experimental::parallelism_v2::simd_abi</code> namespace are
+  <classname>simd_abi::_VecBuiltin<Bytes></classname> and
+  <classname>simd_abi::_VecBltnBtmsk<Bytes></classname>.
+</para>
+
+<para>
+  <classname>simd_abi::deduce<T, N, Abis...>::type</classname>,
+  with <code>N > 1</code>, is an alias for an extended ABI tag if a
+  supported extended ABI tag exists. Otherwise it is an alias for
+  <classname>simd_abi::fixed_size<N></classname>. The
+  <classname>simd_abi::_VecBltnBtmsk</classname> ABI tag is preferred
+  over <classname>simd_abi::_VecBuiltin</classname>.
+</para>
+
+<para>
+  <emphasis>9.4 [parallel.simd.traits]</emphasis>
+  <classname>memory_alignment<T, U>::value</classname> is
+  <code>sizeof(U) * T::size()</code> rounded up to the next power-of-two
+  value.
+</para>
+
+<para>
+  <emphasis>9.6.1 [parallel.simd.overview]</emphasis>
+  On ARM, <classname>simd<T, _VecBuiltin<Bytes>></classname>
+  is supported if <code>__ARM_NEON</code> is defined and
+  <code>sizeof(T) <= 4</code>. Additionally,
+  <code>sizeof(T) == 8</code> with integral <code>T</code> is supported
+  if <code>__ARM_ARCH >= 8</code>, and <code>double</code> is
+  supported if <code>__aarch64__</code> is defined.
+
+  On POWER, <classname>simd<T, _VecBuiltin<Bytes>></classname>
+  is supported if <code>__ALTIVEC__</code> is defined and
+  <code>sizeof(T) < 8</code>. Additionally, <code>double</code> is
+  supported if <code>__VSX__</code> is defined, and any <code>T</code>
+  with <code>sizeof(T) ≤ 8</code> is supported if
+  <code>__POWER8_VECTOR__</code> is defined.
+
+  On x86, given an extended ABI tag <code>Abi</code>,
+  <classname>simd<T, Abi></classname> is supported according to the
+  following table:
+  <table frame="all" xml:id="table.par2ts_simd_support">
+    <title>Support for Extended ABI Tags</title>
+
+    <tgroup cols="4" align="left" colsep="0" rowsep="1">
+      <colspec colname="c1"/>
+      <colspec colname="c2"/>
+      <colspec colname="c3"/>
+      <colspec colname="c4"/>
+      <thead>
+        <row>
+          <entry>ABI tag <code>Abi</code></entry>
+          <entry>value type <code>T</code></entry>
+          <entry>values for <code>Bytes</code></entry>
+          <entry>required machine option</entry>
+        </row>
+      </thead>
+
+      <tbody>
+        <row>
+          <entry morerows="5">
+            <classname>_VecBuiltin<Bytes></classname>
+          </entry>
+          <entry morerows="1"><code>float</code></entry>
+          <entry>8, 12, 16</entry>
+          <entry>"-msse"</entry>
+        </row>
+
+        <row>
+          <entry>20, 24, 28, 32</entry>
+          <entry>"-mavx"</entry>
+        </row>
+
+        <row>
+          <entry morerows="1"><code>double</code></entry>
+          <entry>16</entry>
+          <entry>"-msse2"</entry>
+        </row>
+
+        <row>
+          <entry>24, 32</entry>
+          <entry>"-mavx"</entry>
+        </row>
+
+        <row>
+          <entry morerows="1">
+            integral types other than <code>bool</code>
+          </entry>
+          <entry>
+            <code>Bytes</code> ≤ 16 and <code>Bytes</code> divisible by
+            <code>sizeof(T)</code>
+          </entry>
+          <entry>"-msse2"</entry>
+        </row>
+
+        <row>
+          <entry>
+            16 < <code>Bytes</code> ≤ 32 and <code>Bytes</code>
+            divisible by <code>sizeof(T)</code>
+          </entry>
+          <entry>"-mavx2"</entry>
+        </row>
+
+        <row>
+          <entry morerows="1">
+            <classname>_VecBuiltin<Bytes></classname> and
+            <classname>_VecBltnBtmsk<Bytes></classname>
+          </entry>
+          <entry>
+            vectorizable types with <code>sizeof(T)</code> ≥ 4
+          </entry>
+          <entry morerows="1">
+            32 < <code>Bytes</code> ≤ 64 and <code>Bytes</code>
+            divisible by <code>sizeof(T)</code>
+          </entry>
+          <entry>"-mavx512f"</entry>
+        </row>
+
+        <row>
+          <entry>
+            vectorizable types with <code>sizeof(T)</code> < 4
+          </entry>
+          <entry>"-mavx512bw"</entry>
+        </row>
+
+        <row>
+          <entry morerows="1">
+            <classname>_VecBltnBtmsk<Bytes></classname>
+          </entry>
+          <entry>
+            vectorizable types with <code>sizeof(T)</code> ≥ 4
+          </entry>
+          <entry morerows="1">
+            <code>Bytes</code> ≤ 32 and <code>Bytes</code> divisible by
+            <code>sizeof(T)</code>
+          </entry>
+          <entry>"-mavx512vl"</entry>
+        </row>
+
+        <row>
+          <entry>
+            vectorizable types with <code>sizeof(T)</code> < 4
+          </entry>
+          <entry>"-mavx512bw" and "-mavx512vl"</entry>
+        </row>
+
+      </tbody>
+    </tgroup>
+  </table>
+</para>
+
+</section>
 </section>

diff --git a/libstdc++-v3/include/Makefile.am b/libstdc++-v3/include/Makefile.am
index 90508a8fe83..f24a5489e8e 100644
--- a/libstdc++-v3/include/Makefile.am
+++ b/libstdc++-v3/include/Makefile.am
@@ -747,6 +747,7 @@ experimental_headers = \
 	${experimental_srcdir}/ratio \
 	${experimental_srcdir}/regex \
 	${experimental_srcdir}/set \
+	${experimental_srcdir}/simd \
 	${experimental_srcdir}/socket \
 	${experimental_srcdir}/source_location \
 	${experimental_srcdir}/string \
@@ -766,7 +767,19 @@ experimental_bits_builddir = ./experimental/bits
 experimental_bits_headers = \
 	${experimental_bits_srcdir}/lfts_config.h \
 	${experimental_bits_srcdir}/net.h \
+	${experimental_bits_srcdir}/numeric_traits.h \
 	${experimental_bits_srcdir}/shared_ptr.h \
+	${experimental_bits_srcdir}/simd.h \
+	${experimental_bits_srcdir}/simd_builtin.h \
+	${experimental_bits_srcdir}/simd_converter.h \
+	${experimental_bits_srcdir}/simd_detail.h \
+	${experimental_bits_srcdir}/simd_fixed_size.h \
+	${experimental_bits_srcdir}/simd_math.h \
+	${experimental_bits_srcdir}/simd_neon.h \
+	${experimental_bits_srcdir}/simd_ppc.h \
+	${experimental_bits_srcdir}/simd_scalar.h \
+	${experimental_bits_srcdir}/simd_x86.h \
+	${experimental_bits_srcdir}/simd_x86_conversions.h \
 	${experimental_bits_srcdir}/string_view.tcc \
 	${experimental_bits_filesystem_headers}

diff --git a/libstdc++-v3/include/Makefile.in b/libstdc++-v3/include/Makefile.in
index 922ba440df0..12c63400706 100644
--- a/libstdc++-v3/include/Makefile.in
+++ b/libstdc++-v3/include/Makefile.in
@@ -1097,6 +1097,7 @@ experimental_headers = \
 	${experimental_srcdir}/ratio \
 	${experimental_srcdir}/regex \
 	${experimental_srcdir}/set \
+	${experimental_srcdir}/simd \
 	${experimental_srcdir}/socket \
 	${experimental_srcdir}/source_location \
 	${experimental_srcdir}/string \
@@ -1116,7 +1117,19 @@ experimental_bits_builddir = ./experimental/bits
 experimental_bits_headers = \
 	${experimental_bits_srcdir}/lfts_config.h \
 	${experimental_bits_srcdir}/net.h \
+	${experimental_bits_srcdir}/numeric_traits.h \
 	${experimental_bits_srcdir}/shared_ptr.h \
+	${experimental_bits_srcdir}/simd.h \
+	${experimental_bits_srcdir}/simd_builtin.h \
+	${experimental_bits_srcdir}/simd_converter.h \
+	${experimental_bits_srcdir}/simd_detail.h \
+	${experimental_bits_srcdir}/simd_fixed_size.h \
+	${experimental_bits_srcdir}/simd_math.h \
+	${experimental_bits_srcdir}/simd_neon.h \
+	${experimental_bits_srcdir}/simd_ppc.h \
+	${experimental_bits_srcdir}/simd_scalar.h \
+	${experimental_bits_srcdir}/simd_x86.h \
+	${experimental_bits_srcdir}/simd_x86_conversions.h \
 	${experimental_bits_srcdir}/string_view.tcc \
 	${experimental_bits_filesystem_headers}

diff --git a/libstdc++-v3/include/experimental/bits/numeric_traits.h b/libstdc++-v3/include/experimental/bits/numeric_traits.h
new file mode 100644
index 00000000000..1b60874b788
--- /dev/null
+++ b/libstdc++-v3/include/experimental/bits/numeric_traits.h
@@ -0,0 +1,567 @@
+// Definition of numeric_limits replacement traits P1841R1 -*- C++ -*-

+// Copyright (C) 2020 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.

+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.

+// Under Section 7 of GPL version 3, you are granted additional
+// permissions described in the GCC Runtime Library Exception, version
+// 3.1, as published by the Free Software Foundation.

+// You should have received a copy of the GNU General Public License and
+// a copy of the GCC Runtime Library Exception along with this program;
+// see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+// <http://www.gnu.org/licenses/>.
+ +#include <type_traits> + +namespace std { + +template <template <typename> class _Trait, typename _Tp, typename = void> + struct __value_exists_impl : false_type {}; + +template <template <typename> class _Trait, typename _Tp> + struct __value_exists_impl<_Trait, _Tp, void_t<decltype(_Trait<_Tp>::value)>> + : true_type {}; + +template <typename _Tp, bool = is_arithmetic_v<_Tp>> + struct __digits_impl {}; + +template <typename _Tp> + struct __digits_impl<_Tp, true> + { + static inline constexpr int value + = sizeof(_Tp) * __CHAR_BIT__ - is_signed_v<_Tp>; + }; + +template <> + struct __digits_impl<float, true> + { static inline constexpr int value = __FLT_MANT_DIG__; }; + +template <> + struct __digits_impl<double, true> + { static inline constexpr int value = __DBL_MANT_DIG__; }; + +template <> + struct __digits_impl<long double, true> + { static inline constexpr int value = __LDBL_MANT_DIG__; }; + +template <typename _Tp, bool = is_arithmetic_v<_Tp>> + struct __digits10_impl {}; + +template <typename _Tp> + struct __digits10_impl<_Tp, true> + { + // The fraction 643/2136 approximates log10(2) to 7 significant digits. + static inline constexpr int value = __digits_impl<_Tp>::value * 643L / 2136; + }; + +template <> + struct __digits10_impl<float, true> + { static inline constexpr int value = __FLT_DIG__; }; + +template <> + struct __digits10_impl<double, true> + { static inline constexpr int value = __DBL_DIG__; }; + +template <> + struct __digits10_impl<long double, true> + { static inline constexpr int value = __LDBL_DIG__; }; + +template <typename _Tp, bool = is_arithmetic_v<_Tp>> + struct __max_digits10_impl {}; + +template <typename _Tp> + struct __max_digits10_impl<_Tp, true> + { + static inline constexpr int value + = is_floating_point_v<_Tp> ? 
2 + __digits_impl<_Tp>::value * 643L / 2136 + : __digits10_impl<_Tp>::value + 1; + }; + +template <typename _Tp> + struct __max_exponent_impl {}; + +template <> + struct __max_exponent_impl<float> + { static inline constexpr int value = __FLT_MAX_EXP__; }; + +template <> + struct __max_exponent_impl<double> + { static inline constexpr int value = __DBL_MAX_EXP__; }; + +template <> + struct __max_exponent_impl<long double> + { static inline constexpr int value = __LDBL_MAX_EXP__; }; + +template <typename _Tp> + struct __max_exponent10_impl {}; + +template <> + struct __max_exponent10_impl<float> + { static inline constexpr int value = __FLT_MAX_10_EXP__; }; + +template <> + struct __max_exponent10_impl<double> + { static inline constexpr int value = __DBL_MAX_10_EXP__; }; + +template <> + struct __max_exponent10_impl<long double> + { static inline constexpr int value = __LDBL_MAX_10_EXP__; }; + +template <typename _Tp> + struct __min_exponent_impl {}; + +template <> + struct __min_exponent_impl<float> + { static inline constexpr int value = __FLT_MIN_EXP__; }; + +template <> + struct __min_exponent_impl<double> + { static inline constexpr int value = __DBL_MIN_EXP__; }; + +template <> + struct __min_exponent_impl<long double> + { static inline constexpr int value = __LDBL_MIN_EXP__; }; + +template <typename _Tp> + struct __min_exponent10_impl {}; + +template <> + struct __min_exponent10_impl<float> + { static inline constexpr int value = __FLT_MIN_10_EXP__; }; + +template <> + struct __min_exponent10_impl<double> + { static inline constexpr int value = __DBL_MIN_10_EXP__; }; + +template <> + struct __min_exponent10_impl<long double> + { static inline constexpr int value = __LDBL_MIN_10_EXP__; }; + +template <typename _Tp, bool = is_arithmetic_v<_Tp>> + struct __radix_impl {}; + +template <typename _Tp> + struct __radix_impl<_Tp, true> + { + static inline constexpr int value + = is_floating_point_v<_Tp> ? 
__FLT_RADIX__ : 2; + }; + +// [num.traits.util], numeric utility traits +template <template <typename> class _Trait, typename _Tp> + struct __value_exists : __value_exists_impl<_Trait, _Tp> {}; + +template <template <typename> class _Trait, typename _Tp> + inline constexpr bool __value_exists_v = __value_exists<_Trait, _Tp>::value; + +template <template <typename> class _Trait, typename _Tp, typename _Up = _Tp> + inline constexpr _Up + __value_or(_Up __def = _Up()) noexcept + { + if constexpr (__value_exists_v<_Trait, _Tp>) + return static_cast<_Up>(_Trait<_Tp>::value); + else + return __def; + } + +template <typename _Tp, bool = is_arithmetic_v<_Tp>> + struct __norm_min_impl {}; + +template <typename _Tp> + struct __norm_min_impl<_Tp, true> + { static inline constexpr _Tp value = 1; }; + +template <> + struct __norm_min_impl<float, true> + { static inline constexpr float value = __FLT_MIN__; }; + +template <> + struct __norm_min_impl<double, true> + { static inline constexpr double value = __DBL_MIN__; }; + +template <> + struct __norm_min_impl<long double, true> + { static inline constexpr long double value = __LDBL_MIN__; }; + +template <typename _Tp> + struct __denorm_min_impl : __norm_min_impl<_Tp> {}; + +#if __FLT_HAS_DENORM__ +template <> + struct __denorm_min_impl<float> + { static inline constexpr float value = __FLT_DENORM_MIN__; }; +#endif + +#if __DBL_HAS_DENORM__ +template <> + struct __denorm_min_impl<double> + { static inline constexpr double value = __DBL_DENORM_MIN__; }; +#endif + +#if __LDBL_HAS_DENORM__ +template <> + struct __denorm_min_impl<long double> + { static inline constexpr long double value = __LDBL_DENORM_MIN__; }; +#endif + +template <typename _Tp> + struct __epsilon_impl {}; + +template <> + struct __epsilon_impl<float> + { static inline constexpr float value = __FLT_EPSILON__; }; + +template <> + struct __epsilon_impl<double> + { static inline constexpr double value = __DBL_EPSILON__; }; + +template <> + struct __epsilon_impl<long double> + { static inline constexpr long double value = __LDBL_EPSILON__; }; + +template <typename _Tp, bool = is_arithmetic_v<_Tp>> + struct __finite_min_impl {}; + +template <typename _Tp> + struct __finite_min_impl<_Tp, true> + { + static inline constexpr _Tp value + = is_unsigned_v<_Tp> ? 
_Tp() + : -2 * (_Tp(1) << __digits_impl<_Tp>::value - 1); + }; + +template <> + struct __finite_min_impl<float, true> + { static inline constexpr float value = -__FLT_MAX__; }; + +template <> + struct __finite_min_impl<double, true> + { static inline constexpr double value = -__DBL_MAX__; }; + +template <> + struct __finite_min_impl<long double, true> + { static inline constexpr long double value = -__LDBL_MAX__; }; + +template <typename _Tp, bool = is_arithmetic_v<_Tp>> + struct __finite_max_impl {}; + +template <typename _Tp> + struct __finite_max_impl<_Tp, true> + { static inline constexpr _Tp value = ~__finite_min_impl<_Tp>::value; }; + +template <> + struct __finite_max_impl<float, true> + { static inline constexpr float value = __FLT_MAX__; }; + +template <> + struct __finite_max_impl<double, true> + { static inline constexpr double value = __DBL_MAX__; }; + +template <> + struct __finite_max_impl<long double, true> + { static inline constexpr long double value = __LDBL_MAX__; }; + +template <typename _Tp> + struct __infinity_impl {}; + +#if __FLT_HAS_INFINITY__ +template <> + struct __infinity_impl<float> + { static inline constexpr float value = __builtin_inff(); }; +#endif + +#if __DBL_HAS_INFINITY__ +template <> + struct __infinity_impl<double> + { static inline constexpr double value = __builtin_inf(); }; +#endif + +#if __LDBL_HAS_INFINITY__ +template <> + struct __infinity_impl<long double> + { static inline constexpr long double value = __builtin_infl(); }; +#endif + +template <typename _Tp> + struct __quiet_NaN_impl {}; + +#if __FLT_HAS_QUIET_NAN__ +template <> + struct __quiet_NaN_impl<float> + { static inline constexpr float value = __builtin_nanf(""); }; +#endif + +#if __DBL_HAS_QUIET_NAN__ +template <> + struct __quiet_NaN_impl<double> + { static inline constexpr double value = __builtin_nan(""); }; +#endif + +#if __LDBL_HAS_QUIET_NAN__ +template <> + struct __quiet_NaN_impl<long double> + { static inline constexpr long double value = __builtin_nanl(""); }; +#endif + +template <typename _Tp, bool = is_floating_point_v<_Tp>> + struct __reciprocal_overflow_threshold_impl {}; + +template <typename _Tp> + struct __reciprocal_overflow_threshold_impl<_Tp, true> + { + // This typically yields a subnormal value. Is this incorrect for + // flush-to-zero configurations? + static constexpr _Tp _S_search(_Tp __ok, _Tp __overflows) + { + const _Tp __mid = (__ok + __overflows) / 2; + // 1/__mid without -ffast-math is not a constant expression if it + // overflows. Therefore divide 1 by the radix before division. + // Consequently finite_max (the threshold) must be scaled by the + // same value. 
+ if (__mid == __ok || __mid == __overflows) + return __ok; + else if (_Tp(1) / (__radix_impl<_Tp>::value * __mid) + <= __finite_max_impl<_Tp>::value / __radix_impl<_Tp>::value) + return _S_search(__mid, __overflows); + else + return _S_search(__ok, __mid); + } + + static inline constexpr _Tp value + = _S_search(_Tp(1.01) / __finite_max_impl<_Tp>::value, + _Tp(0.99) / __finite_max_impl<_Tp>::value); + }; + +template <typename _Tp, bool = is_floating_point_v<_Tp>> + struct __round_error_impl {}; + +template <typename _Tp> + struct __round_error_impl<_Tp, true> + { static inline constexpr _Tp value = 0.5; }; + +template <typename _Tp> + struct __signaling_NaN_impl {}; + +#if __FLT_HAS_QUIET_NAN__ +template <> + struct __signaling_NaN_impl<float> + { static inline constexpr float value = __builtin_nansf(""); }; +#endif + +#if __DBL_HAS_QUIET_NAN__ +template <> + struct __signaling_NaN_impl<double> + { static inline constexpr double value = __builtin_nans(""); }; +#endif + +#if __LDBL_HAS_QUIET_NAN__ +template <> + struct __signaling_NaN_impl<long double> + { static inline constexpr long double value = __builtin_nansl(""); }; +#endif + +// [num.traits.val], numeric distinguished value traits +template <typename _Tp> + struct __denorm_min : __denorm_min_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __epsilon : __epsilon_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __finite_max : __finite_max_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __finite_min : __finite_min_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __infinity : __infinity_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __norm_min : __norm_min_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __quiet_NaN : __quiet_NaN_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __reciprocal_overflow_threshold + : __reciprocal_overflow_threshold_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __round_error : __round_error_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __signaling_NaN : __signaling_NaN_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + inline constexpr auto __denorm_min_v = __denorm_min<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __epsilon_v = __epsilon<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __finite_max_v = __finite_max<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __finite_min_v = __finite_min<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __infinity_v = __infinity<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __norm_min_v = __norm_min<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __quiet_NaN_v = __quiet_NaN<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __reciprocal_overflow_threshold_v + = __reciprocal_overflow_threshold<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __round_error_v = __round_error<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __signaling_NaN_v = __signaling_NaN<_Tp>::value; + +// [num.traits.char], numeric characteristics traits +template <typename _Tp> + struct __digits : __digits_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __digits10 : __digits10_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __max_digits10 : __max_digits10_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __max_exponent : __max_exponent_impl<remove_cv_t<_Tp>> {}; + +template <typename 
_Tp> + struct __max_exponent10 : __max_exponent10_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __min_exponent : __min_exponent_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __min_exponent10 : __min_exponent10_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + struct __radix : __radix_impl<remove_cv_t<_Tp>> {}; + +template <typename _Tp> + inline constexpr auto __digits_v = __digits<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __digits10_v = __digits10<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __max_digits10_v = __max_digits10<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __max_exponent_v = __max_exponent<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __max_exponent10_v = __max_exponent10<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __min_exponent_v = __min_exponent<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __min_exponent10_v = __min_exponent10<_Tp>::value; + +template <typename _Tp> + inline constexpr auto __radix_v = __radix<_Tp>::value; + +// mkretz's extensions +// TODO: does GCC tell me? __GCC_IEC_559 >= 2 is not the right answer +template <typename _Tp> + struct __has_iec559_storage_format : true_type {}; + +template <typename _Tp> + inline constexpr bool __has_iec559_storage_format_v + = __has_iec559_storage_format<_Tp>::value; + +/* To propose: + If __has_iec559_behavior<__quiet_NaN, T> is true the following holds: + - nan == nan is false + - isnan(nan) is true + - isnan(nan + x) is true + - isnan(inf/inf) is true + - isnan(0/0) is true + - isunordered(nan, x) is true + + If __has_iec559_behavior<__infinity, T> is true the following holds (x is + neither nan nor inf): + - isinf(inf) is true + - isinf(inf + x) is true + - isinf(1/0) is true + */ +template <template <typename> class _Trait, typename _Tp> + struct __has_iec559_behavior : false_type {}; + +template <template <typename> class _Trait, typename _Tp> + inline constexpr bool __has_iec559_behavior_v + = __has_iec559_behavior<_Trait, _Tp>::value; + +#if !__FINITE_MATH_ONLY__ +#if __FLT_HAS_QUIET_NAN__ +template <> + struct __has_iec559_behavior<__quiet_NaN, float> : true_type {}; +#endif + +#if __DBL_HAS_QUIET_NAN__ +template <> + struct __has_iec559_behavior<__quiet_NaN, double> : true_type {}; +#endif + +#if __LDBL_HAS_QUIET_NAN__ +template <> + struct __has_iec559_behavior<__quiet_NaN, long double> : true_type {}; +#endif + +#if __FLT_HAS_INFINITY__ +template <> + struct __has_iec559_behavior<__infinity, float> : true_type {}; +#endif + +#if __DBL_HAS_INFINITY__ +template <> + struct __has_iec559_behavior<__infinity, double> : true_type {}; +#endif + +#if __LDBL_HAS_INFINITY__ +template <> + struct __has_iec559_behavior<__infinity, long double> : true_type {}; +#endif + +#ifdef __SUPPORT_SNAN__ +#if __FLT_HAS_QUIET_NAN__ +template <> + struct __has_iec559_behavior<__signaling_NaN, float> : true_type {}; +#endif + +#if __DBL_HAS_QUIET_NAN__ +template <> + struct __has_iec559_behavior<__signaling_NaN, double> : true_type {}; +#endif + +#if __LDBL_HAS_QUIET_NAN__ +template <> + struct __has_iec559_behavior<__signaling_NaN, long double> : true_type {}; +#endif + +#endif +#endif // __FINITE_MATH_ONLY__ + +} // namespace std diff --git a/libstdc++-v3/include/experimental/bits/simd.h b/libstdc++-v3/include/experimental/bits/simd.h new file mode 100644 index 00000000000..00eec50d64f --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd.h @@ -0,0 +1,5051 @@ +// 
Definition of the public simd interfaces -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. + +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. + +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_H +#define _GLIBCXX_EXPERIMENTAL_SIMD_H + +#if __cplusplus >= 201703L + +#include "simd_detail.h" +#include "numeric_traits.h" +#include <bit> +#include <bitset> +#ifdef _GLIBCXX_DEBUG_UB +#include <cstdio> // for stderr +#endif +#include <cstring> +#include <functional> +#include <iosfwd> +#include <utility> + +#if _GLIBCXX_SIMD_X86INTRIN +#include <x86intrin.h> +#elif _GLIBCXX_SIMD_HAVE_NEON +#include <arm_neon.h> +#endif + +/* There are several closely related types, with the following naming + * convention: + * _Tp: vectorizable (arithmetic) type (or any type) + * _TV: __vector_type_t<_Tp, _Np> + * _TW: _SimdWrapper<_Tp, _Np> + * _TI: __intrinsic_type_t<_Tp, _Np> + * _TVT: _VectorTraits<_TV> or _VectorTraits<_TW> + * If one additional type is needed use _U instead of _T. + * Otherwise use _T\d, _TV\d, _TW\d, TI\d, _TVT\d. + * + * More naming conventions: + * _Ap or _Abi: An ABI tag from the simd_abi namespace + * _Ip: often used for integer types with sizeof(_Ip) == sizeof(_Tp), + * _IV, _IW as for _TV, _TW + * _Np: number of elements (not bytes) + * _Bytes: number of bytes + * + * Variable names: + * __k: mask object (vector- or bitmask) + */ +_GLIBCXX_SIMD_BEGIN_NAMESPACE + +#if !_GLIBCXX_SIMD_X86INTRIN +using __m128 [[__gnu__::__vector_size__(16)]] = float; +using __m128d [[__gnu__::__vector_size__(16)]] = double; +using __m128i [[__gnu__::__vector_size__(16)]] = long long; +using __m256 [[__gnu__::__vector_size__(32)]] = float; +using __m256d [[__gnu__::__vector_size__(32)]] = double; +using __m256i [[__gnu__::__vector_size__(32)]] = long long; +using __m512 [[__gnu__::__vector_size__(64)]] = float; +using __m512d [[__gnu__::__vector_size__(64)]] = double; +using __m512i [[__gnu__::__vector_size__(64)]] = long long; +#endif + +namespace simd_abi { +// simd_abi forward declarations {{{ +// implementation details: +struct _Scalar; + +template <int _Np> + struct _Fixed; + +// There are two major ABIs that appear on different architectures. +// Both have non-boolean values packed into an N Byte register +// -> #elements = N / sizeof(T) +// Masks differ: +// 1. Use value vector registers for masks (all 0 or all 1) +// 2. Use bitmasks (mask registers) with one bit per value in the corresponding +// value vector +// +// Both can be partially used, masking off the rest when doing horizontal +// operations or operations that can trap (e.g. 
FP_INVALID or integer division +// by 0). This is encoded as the number of used bytes. +template <int _UsedBytes> + struct _VecBuiltin; + +template <int _UsedBytes> + struct _VecBltnBtmsk; + +template <typename _Tp, int _Np> + using _VecN = _VecBuiltin<sizeof(_Tp) * _Np>; + +template <int _UsedBytes = 16> + using _Sse = _VecBuiltin<_UsedBytes>; + +template <int _UsedBytes = 32> + using _Avx = _VecBuiltin<_UsedBytes>; + +template <int _UsedBytes = 64> + using _Avx512 = _VecBltnBtmsk<_UsedBytes>; + +template <int _UsedBytes = 16> + using _Neon = _VecBuiltin<_UsedBytes>; + +// implementation-defined: +using __sse = _Sse<>; +using __avx = _Avx<>; +using __avx512 = _Avx512<>; +using __neon = _Neon<>; +using __neon128 = _Neon<16>; +using __neon64 = _Neon<8>; + +// standard: +template <typename _Tp, size_t _Np, typename...> + struct deduce; + +template <int _Np> + using fixed_size = _Fixed<_Np>; + +using scalar = _Scalar; + +// }}} +} // namespace simd_abi +// forward declarations is_simd(_mask), simd(_mask), simd_size {{{ +template <typename _Tp> + struct is_simd; + +template <typename _Tp> + struct is_simd_mask; + +template <typename _Tp, typename _Abi> + class simd; + +template <typename _Tp, typename _Abi> + class simd_mask; + +template <typename _Tp, typename _Abi> + struct simd_size; + +// }}} +// load/store flags {{{ +struct element_aligned_tag +{ + template <typename _Tp, typename _Up = typename _Tp::value_type> + static constexpr size_t _S_alignment = alignof(_Up); + + template <typename _Tp, typename _Up> + _GLIBCXX_SIMD_INTRINSIC static constexpr _Up* + _S_apply(_Up* __ptr) + { return __ptr; } +}; + +struct vector_aligned_tag +{ + template <typename _Tp, typename _Up = typename _Tp::value_type> + static constexpr size_t _S_alignment + = std::__bit_ceil(sizeof(_Up) * _Tp::size()); + + template <typename _Tp, typename _Up> + _GLIBCXX_SIMD_INTRINSIC static constexpr _Up* + _S_apply(_Up* __ptr) + { + return static_cast<_Up*>( + __builtin_assume_aligned(__ptr, _S_alignment<_Tp, _Up>)); + } +}; + +template <size_t _Np> struct overaligned_tag +{ + template <typename _Tp, typename _Up = typename _Tp::value_type> + static constexpr size_t _S_alignment = _Np; + + template <typename _Tp, typename _Up> + _GLIBCXX_SIMD_INTRINSIC static constexpr _Up* + _S_apply(_Up* __ptr) + { return static_cast<_Up*>(__builtin_assume_aligned(__ptr, _Np)); } +}; + +inline constexpr element_aligned_tag element_aligned = {}; + +inline constexpr vector_aligned_tag vector_aligned = {}; + +template <size_t _Np> + inline constexpr overaligned_tag<_Np> overaligned = {}; + +// }}} +template <size_t _X> + using _SizeConstant = integral_constant<size_t, _X>; + +// unrolled/pack execution helpers +// __execute_n_times{{{ +template <typename _Fp, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __execute_on_index_sequence(_Fp&& __f, index_sequence<_I...>) + { ((void)__f(_SizeConstant<_I>()), ...); } + +template <typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __execute_on_index_sequence(_Fp&&, index_sequence<>) + { } + +template <size_t _Np, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __execute_n_times(_Fp&& __f) + { + __execute_on_index_sequence(static_cast<_Fp&&>(__f), + make_index_sequence<_Np>{}); + } + +// }}} +// __generate_from_n_evaluations{{{ +template <typename _R, typename _Fp, size_t... 
_I> + _GLIBCXX_SIMD_INTRINSIC constexpr _R + __execute_on_index_sequence_with_return(_Fp&& __f, index_sequence<_I...>) + { return _R{__f(_SizeConstant<_I>())...}; } + +template <size_t _Np, typename _R, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr _R + __generate_from_n_evaluations(_Fp&& __f) + { + return __execute_on_index_sequence_with_return<_R>( + static_cast<_Fp&&>(__f), make_index_sequence<_Np>{}); + } + +// }}} +// __call_with_n_evaluations{{{ +template <size_t... _I, typename _F0, typename _FArgs> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __call_with_n_evaluations(index_sequence<_I...>, _F0&& __f0, _FArgs&& __fargs) + { return __f0(__fargs(_SizeConstant<_I>())...); } + +template <size_t _Np, typename _F0, typename _FArgs> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __call_with_n_evaluations(_F0&& __f0, _FArgs&& __fargs) + { + return __call_with_n_evaluations(make_index_sequence<_Np>{}, + static_cast<_F0&&>(__f0), + static_cast<_FArgs&&>(__fargs)); + } + +// }}} +// __call_with_subscripts{{{ +template <size_t _First = 0, size_t... _It, typename _Tp, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __call_with_subscripts(_Tp&& __x, index_sequence<_It...>, _Fp&& __fun) + { return __fun(__x[_First + _It]...); } + +template <size_t _Np, size_t _First = 0, typename _Tp, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __call_with_subscripts(_Tp&& __x, _Fp&& __fun) + { + return __call_with_subscripts<_First>(static_cast<_Tp&&>(__x), + make_index_sequence<_Np>(), + static_cast<_Fp&&>(__fun)); + } + +// }}} + +// vvv ---- type traits ---- vvv +// integer type aliases{{{ +using _UChar = unsigned char; +using _SChar = signed char; +using _UShort = unsigned short; +using _UInt = unsigned int; +using _ULong = unsigned long; +using _ULLong = unsigned long long; +using _LLong = long long; + +//}}} +// __first_of_pack{{{ +template <typename _T0, typename...> + struct __first_of_pack + { using type = _T0; }; + +template <typename... _Ts> + using __first_of_pack_t = typename __first_of_pack<_Ts...>::type; + +//}}} +// __value_type_or_identity_t {{{ +template <typename _Tp> + typename _Tp::value_type + __value_type_or_identity_impl(int); + +template <typename _Tp> + _Tp + __value_type_or_identity_impl(float); + +template <typename _Tp> + using __value_type_or_identity_t + = decltype(__value_type_or_identity_impl<_Tp>(int())); + +// }}} +// __is_vectorizable {{{ +template <typename _Tp> + struct __is_vectorizable : public is_arithmetic<_Tp> {}; + +template <> + struct __is_vectorizable<bool> : public false_type {}; + +template <typename _Tp> + inline constexpr bool __is_vectorizable_v = __is_vectorizable<_Tp>::value; + +// Deduces to a vectorizable type +template <typename _Tp, typename = enable_if_t<__is_vectorizable_v<_Tp>>> + using _Vectorizable = _Tp; + +// }}} +// _LoadStorePtr / __is_possible_loadstore_conversion {{{ +template <typename _Ptr, typename _ValueType> + struct __is_possible_loadstore_conversion + : conjunction<__is_vectorizable<_Ptr>, __is_vectorizable<_ValueType>> {}; + +template <> + struct __is_possible_loadstore_conversion<bool, bool> : true_type {}; + +// Deduces to a type allowed for load/store with the given value type. 
+template <typename _Ptr, typename _ValueType, + typename = enable_if_t< + __is_possible_loadstore_conversion<_Ptr, _ValueType>::value>> + using _LoadStorePtr = _Ptr; + +// }}} +// __is_bitmask{{{ +template <typename _Tp, typename = void_t<>> + struct __is_bitmask : false_type {}; + +template <typename _Tp> + inline constexpr bool __is_bitmask_v = __is_bitmask<_Tp>::value; + +// the __mmaskXX case: +template <typename _Tp> + struct __is_bitmask<_Tp, + void_t<decltype(declval<unsigned&>() = declval<_Tp>() & 1u)>> + : true_type {}; + +// }}} +// __int_for_sizeof{{{ +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wpedantic" +template <size_t _Bytes> + constexpr auto + __int_for_sizeof() + { + if constexpr (_Bytes == sizeof(int)) + return int(); + #ifdef __clang__ + else if constexpr (_Bytes == sizeof(char)) + return char(); + #else + else if constexpr (_Bytes == sizeof(_SChar)) + return _SChar(); + #endif + else if constexpr (_Bytes == sizeof(short)) + return short(); + #ifndef __clang__ + else if constexpr (_Bytes == sizeof(long)) + return long(); + #endif + else if constexpr (_Bytes == sizeof(_LLong)) + return _LLong(); + #ifdef __SIZEOF_INT128__ + else if constexpr (_Bytes == sizeof(__int128)) + return __int128(); + #endif // __SIZEOF_INT128__ + else if constexpr (_Bytes % sizeof(int) == 0) + { + constexpr size_t _Np = _Bytes / sizeof(int); + struct _Ip + { + int _M_data[_Np]; + + _GLIBCXX_SIMD_INTRINSIC constexpr _Ip + operator&(_Ip __rhs) const + { + return __generate_from_n_evaluations<_Np, _Ip>( + [&](auto __i) { return __rhs._M_data[__i] & _M_data[__i]; }); + } + + _GLIBCXX_SIMD_INTRINSIC constexpr _Ip + operator|(_Ip __rhs) const + { + return __generate_from_n_evaluations<_Np, _Ip>( + [&](auto __i) { return __rhs._M_data[__i] | _M_data[__i]; }); + } + + _GLIBCXX_SIMD_INTRINSIC constexpr _Ip + operator^(_Ip __rhs) const + { + return __generate_from_n_evaluations<_Np, _Ip>( + [&](auto __i) { return __rhs._M_data[__i] ^ _M_data[__i]; }); + } + + _GLIBCXX_SIMD_INTRINSIC constexpr _Ip + operator~() const + { + return __generate_from_n_evaluations<_Np, _Ip>( + [&](auto __i) { return ~_M_data[__i]; }); + } + }; + return _Ip{}; + } + else + static_assert(_Bytes != _Bytes, "this should be unreachable"); + } +#pragma GCC diagnostic pop + +template <typename _Tp> + using __int_for_sizeof_t = decltype(__int_for_sizeof<sizeof(_Tp)>()); + +template <size_t _Np> + using __int_with_sizeof_t = decltype(__int_for_sizeof<_Np>()); + +// }}} +// __is_fixed_size_abi{{{ +template <typename _Tp> + struct __is_fixed_size_abi : false_type {}; + +template <int _Np> + struct __is_fixed_size_abi<simd_abi::fixed_size<_Np>> : true_type {}; + +template <typename _Tp> + inline constexpr bool __is_fixed_size_abi_v = __is_fixed_size_abi<_Tp>::value; + +// }}} +// constexpr feature detection{{{ +constexpr inline bool __have_mmx = _GLIBCXX_SIMD_HAVE_MMX; +constexpr inline bool __have_sse = _GLIBCXX_SIMD_HAVE_SSE; +constexpr inline bool __have_sse2 = _GLIBCXX_SIMD_HAVE_SSE2; +constexpr inline bool __have_sse3 = _GLIBCXX_SIMD_HAVE_SSE3; +constexpr inline bool __have_ssse3 = _GLIBCXX_SIMD_HAVE_SSSE3; +constexpr inline bool __have_sse4_1 = _GLIBCXX_SIMD_HAVE_SSE4_1; +constexpr inline bool __have_sse4_2 = _GLIBCXX_SIMD_HAVE_SSE4_2; +constexpr inline bool __have_xop = _GLIBCXX_SIMD_HAVE_XOP; +constexpr inline bool __have_avx = _GLIBCXX_SIMD_HAVE_AVX; +constexpr inline bool __have_avx2 = _GLIBCXX_SIMD_HAVE_AVX2; +constexpr inline bool __have_bmi = _GLIBCXX_SIMD_HAVE_BMI1; +constexpr inline bool __have_bmi2 = 
_GLIBCXX_SIMD_HAVE_BMI2; +constexpr inline bool __have_lzcnt = _GLIBCXX_SIMD_HAVE_LZCNT; +constexpr inline bool __have_sse4a = _GLIBCXX_SIMD_HAVE_SSE4A; +constexpr inline bool __have_fma = _GLIBCXX_SIMD_HAVE_FMA; +constexpr inline bool __have_fma4 = _GLIBCXX_SIMD_HAVE_FMA4; +constexpr inline bool __have_f16c = _GLIBCXX_SIMD_HAVE_F16C; +constexpr inline bool __have_popcnt = _GLIBCXX_SIMD_HAVE_POPCNT; +constexpr inline bool __have_avx512f = _GLIBCXX_SIMD_HAVE_AVX512F; +constexpr inline bool __have_avx512dq = _GLIBCXX_SIMD_HAVE_AVX512DQ; +constexpr inline bool __have_avx512vl = _GLIBCXX_SIMD_HAVE_AVX512VL; +constexpr inline bool __have_avx512bw = _GLIBCXX_SIMD_HAVE_AVX512BW; +constexpr inline bool __have_avx512dq_vl = __have_avx512dq && __have_avx512vl; +constexpr inline bool __have_avx512bw_vl = __have_avx512bw && __have_avx512vl; + +constexpr inline bool __have_neon = _GLIBCXX_SIMD_HAVE_NEON; +constexpr inline bool __have_neon_a32 = _GLIBCXX_SIMD_HAVE_NEON_A32; +constexpr inline bool __have_neon_a64 = _GLIBCXX_SIMD_HAVE_NEON_A64; +constexpr inline bool __support_neon_float = +#if defined __GCC_IEC_559 + __GCC_IEC_559 == 0; +#elif defined __FAST_MATH__ + true; +#else + false; +#endif + +#ifdef __POWER9_VECTOR__ +constexpr inline bool __have_power9vec = true; +#else +constexpr inline bool __have_power9vec = false; +#endif +#if defined __POWER8_VECTOR__ +constexpr inline bool __have_power8vec = true; +#else +constexpr inline bool __have_power8vec = __have_power9vec; +#endif +#if defined __VSX__ +constexpr inline bool __have_power_vsx = true; +#else +constexpr inline bool __have_power_vsx = __have_power8vec; +#endif +#if defined __ALTIVEC__ +constexpr inline bool __have_power_vmx = true; +#else +constexpr inline bool __have_power_vmx = __have_power_vsx; +#endif + +// }}} +// __is_scalar_abi {{{ +template <typename _Abi> + constexpr bool + __is_scalar_abi() + { return is_same_v<simd_abi::scalar, _Abi>; } + +// }}} +// __abi_bytes_v {{{ +template <template <int> class _Abi, int _Bytes> + constexpr int + __abi_bytes_impl(_Abi<_Bytes>*) + { return _Bytes; } + +template <typename _Tp> + constexpr int + __abi_bytes_impl(_Tp*) + { return -1; } + +template <typename _Abi> + inline constexpr int __abi_bytes_v + = __abi_bytes_impl(static_cast<_Abi*>(nullptr)); + +// }}} +// __is_builtin_bitmask_abi {{{ +template <typename _Abi> + constexpr bool + __is_builtin_bitmask_abi() + { return is_same_v<simd_abi::_VecBltnBtmsk<__abi_bytes_v<_Abi>>, _Abi>; } + +// }}} +// __is_sse_abi {{{ +template <typename _Abi> + constexpr bool + __is_sse_abi() + { + constexpr auto _Bytes = __abi_bytes_v<_Abi>; + return _Bytes <= 16 && is_same_v<simd_abi::_VecBuiltin<_Bytes>, _Abi>; + } + +// }}} +// __is_avx_abi {{{ +template <typename _Abi> + constexpr bool + __is_avx_abi() + { + constexpr auto _Bytes = __abi_bytes_v<_Abi>; + return _Bytes > 16 && _Bytes <= 32 + && is_same_v<simd_abi::_VecBuiltin<_Bytes>, _Abi>; + } + +// }}} +// __is_avx512_abi {{{ +template <typename _Abi> + constexpr bool + __is_avx512_abi() + { + constexpr auto _Bytes = __abi_bytes_v<_Abi>; + return _Bytes <= 64 && is_same_v<simd_abi::_Avx512<_Bytes>, _Abi>; + } + +// }}} +// __is_neon_abi {{{ +template <typename _Abi> + constexpr bool + __is_neon_abi() + { + constexpr auto _Bytes = __abi_bytes_v<_Abi>; + return _Bytes <= 16 && is_same_v<simd_abi::_VecBuiltin<_Bytes>, _Abi>; + } + +// }}} +// __make_dependent_t {{{ +template <typename, typename _Up> + struct __make_dependent + { using type = _Up; }; + +template <typename _Tp, typename _Up> + using 
__make_dependent_t = typename __make_dependent<_Tp, _Up>::type; + +// }}} +// ^^^ ---- type traits ---- ^^^ + +// __invoke_ub{{{ +template <typename... _Args> + [[noreturn]] _GLIBCXX_SIMD_ALWAYS_INLINE void + __invoke_ub([[maybe_unused]] const char* __msg, + [[maybe_unused]] const _Args&... __args) + { +#ifdef _GLIBCXX_DEBUG_UB + __builtin_fprintf(stderr, __msg, __args...); + __builtin_trap(); +#else + __builtin_unreachable(); +#endif + } + +// }}} +// __assert_unreachable{{{ +template <typename _Tp> + struct __assert_unreachable + { static_assert(!is_same_v<_Tp, _Tp>, "this should be unreachable"); }; + +// }}} +// __size_or_zero_v {{{ +template <typename _Tp, typename _Ap, size_t _Np = simd_size<_Tp, _Ap>::value> + constexpr size_t + __size_or_zero_dispatch(int) + { return _Np; } + +template <typename _Tp, typename _Ap> + constexpr size_t + __size_or_zero_dispatch(float) + { return 0; } + +template <typename _Tp, typename _Ap> + inline constexpr size_t __size_or_zero_v + = __size_or_zero_dispatch<_Tp, _Ap>(0); + +// }}} +// __div_roundup {{{ +inline constexpr size_t +__div_roundup(size_t __a, size_t __b) +{ return (__a + __b - 1) / __b; } + +// }}} +// _ExactBool{{{ +class _ExactBool +{ + const bool _M_data; + +public: + _GLIBCXX_SIMD_INTRINSIC constexpr _ExactBool(bool __b) : _M_data(__b) {} + + _ExactBool(int) = delete; + + _GLIBCXX_SIMD_INTRINSIC constexpr operator bool() const { return _M_data; } +}; + +// }}} +// __may_alias{{{ +/**@internal + * Helper __may_alias<_Tp> that turns _Tp into the type to be used for an + * aliasing pointer. This adds the __may_alias attribute to _Tp (with compilers + * that support it). + */ +template <typename _Tp> + using __may_alias [[__gnu__::__may_alias__]] = _Tp; + +// }}} +// _UnsupportedBase {{{ +// simd and simd_mask base for unsupported <_Tp, _Abi> +struct _UnsupportedBase +{ + _UnsupportedBase() = delete; + _UnsupportedBase(const _UnsupportedBase&) = delete; + _UnsupportedBase& operator=(const _UnsupportedBase&) = delete; + ~_UnsupportedBase() = delete; +}; + +// }}} +// _InvalidTraits {{{ +/** + * @internal + * Defines the implementation of __a given <_Tp, _Abi>. + * + * Implementations must ensure that only valid <_Tp, _Abi> instantiations are + * possible. Static assertions in the type definition do not suffice. It is + * important that SFINAE works. + */ +struct _InvalidTraits +{ + using _IsValid = false_type; + using _SimdBase = _UnsupportedBase; + using _MaskBase = _UnsupportedBase; + + static constexpr size_t _S_full_size = 0; + static constexpr bool _S_is_partial = false; + + static constexpr size_t _S_simd_align = 1; + struct _SimdImpl; + struct _SimdMember {}; + struct _SimdCastType; + + static constexpr size_t _S_mask_align = 1; + struct _MaskImpl; + struct _MaskMember {}; + struct _MaskCastType; +}; + +// }}} +// _SimdTraits {{{ +template <typename _Tp, typename _Abi, typename = void_t<>> + struct _SimdTraits : _InvalidTraits {}; + +// }}} +// __private_init, __bitset_init{{{ +/** + * @internal + * Tag used for private init constructor of simd and simd_mask + */ +inline constexpr struct _PrivateInit {} __private_init = {}; + +inline constexpr struct _BitsetInit {} __bitset_init = {}; + +// }}} +// __is_narrowing_conversion<_From, _To>{{{ +template <typename _From, typename _To, bool = is_arithmetic_v<_From>, + bool = is_arithmetic_v<_To>> + struct __is_narrowing_conversion; + +// ignore "signed/unsigned mismatch" in the following trait. +// The implicit conversions will do the right thing here. 
+template <typename _From, typename _To> + struct __is_narrowing_conversion<_From, _To, true, true> + : public __bool_constant<( + __digits_v<_From> > __digits_v<_To> + || __finite_max_v<_From> > __finite_max_v<_To> + || __finite_min_v<_From> < __finite_min_v<_To> + || (is_signed_v<_From> && is_unsigned_v<_To>))> {}; + +template <typename _Tp> + struct __is_narrowing_conversion<_Tp, bool, true, true> + : public true_type {}; + +template <> + struct __is_narrowing_conversion<bool, bool, true, true> + : public false_type {}; + +template <typename _Tp> + struct __is_narrowing_conversion<_Tp, _Tp, true, true> + : public false_type {}; + +template <typename _From, typename _To> + struct __is_narrowing_conversion<_From, _To, false, true> + : public negation<is_convertible<_From, _To>> {}; + +// }}} +// __converts_to_higher_integer_rank{{{ +template <typename _From, typename _To, bool = (sizeof(_From) < sizeof(_To))> + struct __converts_to_higher_integer_rank : public true_type {}; + +// this may fail for char -> short if sizeof(char) == sizeof(short) +template <typename _From, typename _To> + struct __converts_to_higher_integer_rank<_From, _To, false> + : public is_same<decltype(declval<_From>() + declval<_To>()), _To> {}; + +// }}} +// __data(simd/simd_mask) {{{ +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC constexpr const auto& + __data(const simd<_Tp, _Ap>& __x); + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& + __data(simd<_Tp, _Ap>& __x); + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC constexpr const auto& + __data(const simd_mask<_Tp, _Ap>& __x); + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& + __data(simd_mask<_Tp, _Ap>& __x); + +// }}} +// _SimdConverter {{{ +template <typename _FromT, typename _FromA, typename _ToT, typename _ToA, + typename = void> + struct _SimdConverter; + +template <typename _Tp, typename _Ap> + struct _SimdConverter<_Tp, _Ap, _Tp, _Ap, void> + { + template <typename _Up> + _GLIBCXX_SIMD_INTRINSIC const _Up& + operator()(const _Up& __x) + { return __x; } + }; + +// }}} +// __to_value_type_or_member_type {{{ +template <typename _V> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __to_value_type_or_member_type(const _V& __x) -> decltype(__data(__x)) + { return __data(__x); } + +template <typename _V> + _GLIBCXX_SIMD_INTRINSIC constexpr const typename _V::value_type& + __to_value_type_or_member_type(const typename _V::value_type& __x) + { return __x; } + +// }}} +// __bool_storage_member_type{{{ +template <size_t _Size> + struct __bool_storage_member_type; + +template <size_t _Size> + using __bool_storage_member_type_t = + typename __bool_storage_member_type<_Size>::type; + +// }}} +// _SimdTuple {{{ +// why not tuple? +// 1. tuple gives no guarantee about the storage order, but I require +// storage +// equivalent to array<_Tp, _Np> +// 2. direct access to the element type (first template argument) +// 3. enforces equal element type, only different _Abi types are allowed +template <typename _Tp, typename... 
_Abis> + struct _SimdTuple; + +//}}} +// __fixed_size_storage_t {{{ +template <typename _Tp, int _Np> + struct __fixed_size_storage; + +template <typename _Tp, int _Np> + using __fixed_size_storage_t = typename __fixed_size_storage<_Tp, _Np>::type; + +// }}} +// _SimdWrapper fwd decl{{{ +template <typename _Tp, size_t _Size, typename = void_t<>> + struct _SimdWrapper; + +template <typename _Tp> + using _SimdWrapper8 = _SimdWrapper<_Tp, 8 / sizeof(_Tp)>; +template <typename _Tp> + using _SimdWrapper16 = _SimdWrapper<_Tp, 16 / sizeof(_Tp)>; +template <typename _Tp> + using _SimdWrapper32 = _SimdWrapper<_Tp, 32 / sizeof(_Tp)>; +template <typename _Tp> + using _SimdWrapper64 = _SimdWrapper<_Tp, 64 / sizeof(_Tp)>; + +// }}} +// __is_simd_wrapper {{{ +template <typename _Tp> + struct __is_simd_wrapper : false_type {}; + +template <typename _Tp, size_t _Np> + struct __is_simd_wrapper<_SimdWrapper<_Tp, _Np>> : true_type {}; + +template <typename _Tp> + inline constexpr bool __is_simd_wrapper_v = __is_simd_wrapper<_Tp>::value; + +// }}} +// _BitOps {{{ +struct _BitOps +{ + // _S_bit_iteration {{{ + template <typename _Tp, typename _Fp> + static void + _S_bit_iteration(_Tp __mask, _Fp&& __f) + { + static_assert(sizeof(_ULLong) >= sizeof(_Tp)); + conditional_t<sizeof(_Tp) <= sizeof(_UInt), _UInt, _ULLong> __k; + if constexpr (is_convertible_v<_Tp, decltype(__k)>) + __k = __mask; + else + __k = __mask.to_ullong(); + while(__k) + { + __f(std::__countr_zero(__k)); + __k &= (__k - 1); + } + } + + //}}} +}; + +//}}} +// __increment, __decrement {{{ +template <typename _Tp = void> + struct __increment + { constexpr _Tp operator()(_Tp __a) const { return ++__a; } }; + +template <> + struct __increment<void> + { + template <typename _Tp> + constexpr _Tp + operator()(_Tp __a) const + { return ++__a; } + }; + +template <typename _Tp = void> + struct __decrement + { constexpr _Tp operator()(_Tp __a) const { return --__a; } }; + +template <> + struct __decrement<void> + { + template <typename _Tp> + constexpr _Tp + operator()(_Tp __a) const + { return --__a; } + }; + +// }}} +// _ValuePreserving(OrInt) {{{ +template <typename _From, typename _To, + typename = enable_if_t<negation< + __is_narrowing_conversion<__remove_cvref_t<_From>, _To>>::value>> + using _ValuePreserving = _From; + +template <typename _From, typename _To, + typename _DecayedFrom = __remove_cvref_t<_From>, + typename = enable_if_t<conjunction< + is_convertible<_From, _To>, + disjunction< + is_same<_DecayedFrom, _To>, is_same<_DecayedFrom, int>, + conjunction<is_same<_DecayedFrom, _UInt>, is_unsigned<_To>>, + negation<__is_narrowing_conversion<_DecayedFrom, _To>>>>::value>> + using _ValuePreservingOrInt = _From; + +// }}} +// __intrinsic_type {{{ +template <typename _Tp, size_t _Bytes, typename = void_t<>> + struct __intrinsic_type; + +template <typename _Tp, size_t _Size> + using __intrinsic_type_t = + typename __intrinsic_type<_Tp, _Size * sizeof(_Tp)>::type; + +template <typename _Tp> + using __intrinsic_type2_t = typename __intrinsic_type<_Tp, 2>::type; +template <typename _Tp> + using __intrinsic_type4_t = typename __intrinsic_type<_Tp, 4>::type; +template <typename _Tp> + using __intrinsic_type8_t = typename __intrinsic_type<_Tp, 8>::type; +template <typename _Tp> + using __intrinsic_type16_t = typename __intrinsic_type<_Tp, 16>::type; +template <typename _Tp> + using __intrinsic_type32_t = typename __intrinsic_type<_Tp, 32>::type; +template <typename _Tp> + using __intrinsic_type64_t = typename __intrinsic_type<_Tp, 64>::type; + +// }}} 
+// _BitMask {{{ +template <size_t _Np, bool _Sanitized = false> + struct _BitMask; + +template <size_t _Np, bool _Sanitized> + struct __is_bitmask<_BitMask<_Np, _Sanitized>, void> : true_type {}; + +template <size_t _Np> + using _SanitizedBitMask = _BitMask<_Np, true>; + +template <size_t _Np, bool _Sanitized> + struct _BitMask + { + static_assert(_Np > 0); + + static constexpr size_t _NBytes = __div_roundup(_Np, __CHAR_BIT__); + + using _Tp = conditional_t<_Np == 1, bool, + make_unsigned_t<__int_with_sizeof_t<std::min( + sizeof(_ULLong), std::__bit_ceil(_NBytes))>>>; + + static constexpr int _S_array_size = __div_roundup(_NBytes, sizeof(_Tp)); + + _Tp _M_bits[_S_array_size]; + + static constexpr int _S_unused_bits + = _Np == 1 ? 0 : _S_array_size * sizeof(_Tp) * __CHAR_BIT__ - _Np; + + static constexpr _Tp _S_bitmask = +_Tp(~_Tp()) >> _S_unused_bits; + + constexpr _BitMask() noexcept = default; + + constexpr _BitMask(unsigned long long __x) noexcept + : _M_bits{static_cast<_Tp>(__x)} {} + + _BitMask(bitset<_Np> __x) noexcept : _BitMask(__x.to_ullong()) {} + + constexpr _BitMask(const _BitMask&) noexcept = default; + + template <bool _RhsSanitized, typename = enable_if_t<_RhsSanitized == false + && _Sanitized == true>> + constexpr _BitMask(const _BitMask<_Np, _RhsSanitized>& __rhs) noexcept + : _BitMask(__rhs._M_sanitized()) {} + + constexpr operator _SimdWrapper<bool, _Np>() const noexcept + { + static_assert(_S_array_size == 1); + return _M_bits[0]; + } + + // precondition: is sanitized + constexpr _Tp + _M_to_bits() const noexcept + { + static_assert(_S_array_size == 1); + return _M_bits[0]; + } + + // precondition: is sanitized + constexpr unsigned long long + to_ullong() const noexcept + { + static_assert(_S_array_size == 1); + return _M_bits[0]; + } + + // precondition: is sanitized + constexpr unsigned long + to_ulong() const noexcept + { + static_assert(_S_array_size == 1); + return _M_bits[0]; + } + + constexpr bitset<_Np> + _M_to_bitset() const noexcept + { + static_assert(_S_array_size == 1); + return _M_bits[0]; + } + + constexpr decltype(auto) + _M_sanitized() const noexcept + { + if constexpr (_Sanitized) + return *this; + else if constexpr (_Np == 1) + return _SanitizedBitMask<_Np>(_M_bits[0]); + else + { + _SanitizedBitMask<_Np> __r = {}; + for (int __i = 0; __i < _S_array_size; ++__i) + __r._M_bits[__i] = _M_bits[__i]; + if constexpr (_S_unused_bits > 0) + __r._M_bits[_S_array_size - 1] &= _S_bitmask; + return __r; + } + } + + template <size_t _Mp, bool _LSanitized> + constexpr _BitMask<_Np + _Mp, _Sanitized> + _M_prepend(_BitMask<_Mp, _LSanitized> __lsb) const noexcept + { + constexpr size_t _RN = _Np + _Mp; + using _Rp = _BitMask<_RN, _Sanitized>; + if constexpr (_Rp::_S_array_size == 1) + { + _Rp __r{{_M_bits[0]}}; + __r._M_bits[0] <<= _Mp; + __r._M_bits[0] |= __lsb._M_sanitized()._M_bits[0]; + return __r; + } + else + __assert_unreachable<_Rp>(); + } + + // Return a new _BitMask with size _NewSize while dropping _DropLsb least + // significant bits. If the operation implicitly produces a sanitized bitmask, + // the result type will have _Sanitized set. 
+    template <size_t _DropLsb, size_t _NewSize = _Np - _DropLsb>
+      constexpr auto
+      _M_extract() const noexcept
+      {
+        static_assert(_Np > _DropLsb);
+        static_assert(_DropLsb + _NewSize <= sizeof(_ULLong) * __CHAR_BIT__,
+                      "not implemented for bitmasks larger than one ullong");
+        if constexpr (_NewSize == 1)
+          // must sanitize because the return _Tp is bool
+          return _SanitizedBitMask<1>(_M_bits[0] & (_Tp(1) << _DropLsb));
+        else
+          return _BitMask<_NewSize,
+                          ((_NewSize + _DropLsb == sizeof(_Tp) * __CHAR_BIT__
+                            && _NewSize + _DropLsb <= _Np)
+                           || ((_Sanitized || _Np == sizeof(_Tp) * __CHAR_BIT__)
+                               && _NewSize + _DropLsb >= _Np))>(_M_bits[0]
+                                                                >> _DropLsb);
+      }
+
+    // True if all bits are set. Implicitly sanitizes if _Sanitized == false.
+    constexpr bool
+    all() const noexcept
+    {
+      if constexpr (_Np == 1)
+        return _M_bits[0];
+      else if constexpr (!_Sanitized)
+        return _M_sanitized().all();
+      else
+        {
+          constexpr _Tp __allbits = ~_Tp();
+          for (int __i = 0; __i < _S_array_size - 1; ++__i)
+            if (_M_bits[__i] != __allbits)
+              return false;
+          return _M_bits[_S_array_size - 1] == _S_bitmask;
+        }
+    }
+
+    // True if at least one bit is set. Implicitly sanitizes if _Sanitized ==
+    // false.
+    constexpr bool
+    any() const noexcept
+    {
+      if constexpr (_Np == 1)
+        return _M_bits[0];
+      else if constexpr (!_Sanitized)
+        return _M_sanitized().any();
+      else
+        {
+          for (int __i = 0; __i < _S_array_size - 1; ++__i)
+            if (_M_bits[__i] != 0)
+              return true;
+          return _M_bits[_S_array_size - 1] != 0;
+        }
+    }
+
+    // True if no bit is set. Implicitly sanitizes if _Sanitized == false.
+    constexpr bool
+    none() const noexcept
+    {
+      if constexpr (_Np == 1)
+        return !_M_bits[0];
+      else if constexpr (!_Sanitized)
+        return _M_sanitized().none();
+      else
+        {
+          for (int __i = 0; __i < _S_array_size - 1; ++__i)
+            if (_M_bits[__i] != 0)
+              return false;
+          return _M_bits[_S_array_size - 1] == 0;
+        }
+    }
+
+    // Returns the number of set bits. Implicitly sanitizes if _Sanitized ==
+    // false.
+    constexpr int
+    count() const noexcept
+    {
+      if constexpr (_Np == 1)
+        return _M_bits[0];
+      else if constexpr (!_Sanitized)
+        return _M_sanitized().count();
+      else
+        {
+          int __result = __builtin_popcountll(_M_bits[0]);
+          for (int __i = 1; __i < _S_array_size; ++__i)
+            __result += __builtin_popcountll(_M_bits[__i]);
+          return __result;
+        }
+    }
+
+    // Returns the bit at offset __i as bool.
+    constexpr bool
+    operator[](size_t __i) const noexcept
+    {
+      if constexpr (_Np == 1)
+        return _M_bits[0];
+      else if constexpr (_S_array_size == 1)
+        return (_M_bits[0] >> __i) & 1;
+      else
+        {
+          const size_t __j = __i / (sizeof(_Tp) * __CHAR_BIT__);
+          const size_t __shift = __i % (sizeof(_Tp) * __CHAR_BIT__);
+          return (_M_bits[__j] >> __shift) & 1;
+        }
+    }
+
+    template <size_t __i>
+      constexpr bool
+      operator[](_SizeConstant<__i>) const noexcept
+      {
+        static_assert(__i < _Np);
+        constexpr size_t __j = __i / (sizeof(_Tp) * __CHAR_BIT__);
+        constexpr size_t __shift = __i % (sizeof(_Tp) * __CHAR_BIT__);
+        return static_cast<bool>(_M_bits[__j] & (_Tp(1) << __shift));
+      }
+
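+    // Example (illustrative): a _BitMask<4> stores its bits in one unsigned
+    // char, of which only the low 4 bits are significant. Bits above _Np are
+    // unspecified until _M_sanitized() zeroes them:
+    //   _BitMask<4> __k(0b1111'0101);     // high nibble may be garbage
+    //   __k._M_sanitized()._M_to_bits();  // 0b0101
+    //   __k._M_sanitized().count();       // 2
+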
+    // Set the bit at offset __i to __x.
+    constexpr void
+    set(size_t __i, bool __x) noexcept
+    {
+      if constexpr (_Np == 1)
+        _M_bits[0] = __x;
+      else if constexpr (_S_array_size == 1)
+        {
+          _M_bits[0] &= ~_Tp(_Tp(1) << __i);
+          _M_bits[0] |= _Tp(_Tp(__x) << __i);
+        }
+      else
+        {
+          const size_t __j = __i / (sizeof(_Tp) * __CHAR_BIT__);
+          const size_t __shift = __i % (sizeof(_Tp) * __CHAR_BIT__);
+          _M_bits[__j] &= ~_Tp(_Tp(1) << __shift);
+          _M_bits[__j] |= _Tp(_Tp(__x) << __shift);
+        }
+    }
+
+    template <size_t __i>
+      constexpr void
+      set(_SizeConstant<__i>, bool __x) noexcept
+      {
+        static_assert(__i < _Np);
+        if constexpr (_Np == 1)
+          _M_bits[0] = __x;
+        else
+          {
+            constexpr size_t __j = __i / (sizeof(_Tp) * __CHAR_BIT__);
+            constexpr size_t __shift = __i % (sizeof(_Tp) * __CHAR_BIT__);
+            constexpr _Tp __mask = ~_Tp(_Tp(1) << __shift);
+            _M_bits[__j] &= __mask;
+            _M_bits[__j] |= _Tp(_Tp(__x) << __shift);
+          }
+      }
+
+    // Inverts all bits. Sanitized input leads to sanitized output.
+    constexpr _BitMask
+    operator~() const noexcept
+    {
+      if constexpr (_Np == 1)
+        return !_M_bits[0];
+      else
+        {
+          _BitMask __result{};
+          for (int __i = 0; __i < _S_array_size - 1; ++__i)
+            __result._M_bits[__i] = ~_M_bits[__i];
+          if constexpr (_Sanitized)
+            __result._M_bits[_S_array_size - 1]
+              = _M_bits[_S_array_size - 1] ^ _S_bitmask;
+          else
+            __result._M_bits[_S_array_size - 1] = ~_M_bits[_S_array_size - 1];
+          return __result;
+        }
+    }
+
+    constexpr _BitMask&
+    operator^=(const _BitMask& __b) & noexcept
+    {
+      __execute_n_times<_S_array_size>(
+        [&](auto __i) { _M_bits[__i] ^= __b._M_bits[__i]; });
+      return *this;
+    }
+
+    constexpr _BitMask&
+    operator|=(const _BitMask& __b) & noexcept
+    {
+      __execute_n_times<_S_array_size>(
+        [&](auto __i) { _M_bits[__i] |= __b._M_bits[__i]; });
+      return *this;
+    }
+
+    constexpr _BitMask&
+    operator&=(const _BitMask& __b) & noexcept
+    {
+      __execute_n_times<_S_array_size>(
+        [&](auto __i) { _M_bits[__i] &= __b._M_bits[__i]; });
+      return *this;
+    }
+
+    friend constexpr _BitMask
+    operator^(const _BitMask& __a, const _BitMask& __b) noexcept
+    {
+      _BitMask __r = __a;
+      __r ^= __b;
+      return __r;
+    }
+
+    friend constexpr _BitMask
+    operator|(const _BitMask& __a, const _BitMask& __b) noexcept
+    {
+      _BitMask __r = __a;
+      __r |= __b;
+      return __r;
+    }
+
+    friend constexpr _BitMask
+    operator&(const _BitMask& __a, const _BitMask& __b) noexcept
+    {
+      _BitMask __r = __a;
+      __r &= __b;
+      return __r;
+    }
+
+    _GLIBCXX_SIMD_INTRINSIC
+    constexpr bool
+    _M_is_constprop() const
+    {
+      if constexpr (_S_array_size == 0)
+        return __builtin_constant_p(_M_bits[0]);
+      else
+        {
+          for (int __i = 0; __i < _S_array_size; ++__i)
+            if (!__builtin_constant_p(_M_bits[__i]))
+              return false;
+          return true;
+        }
+    }
+  };
+
+// }}}
+
+// vvv ---- builtin vector types [[gnu::vector_size(N)]] and operations ---- vvv
+// __min_vector_size {{{
+template <typename _Tp = void>
+  static inline constexpr int __min_vector_size = 2 * sizeof(_Tp);
+
+#if _GLIBCXX_SIMD_HAVE_NEON
+template <>
+  inline constexpr int __min_vector_size<void> = 8;
+#else
+template <>
+  inline constexpr int __min_vector_size<void> = 16;
+#endif
+
+// }}}
+// __vector_type {{{
+template <typename _Tp, size_t _Np, typename = void>
+  struct __vector_type_n {};
+
+// substitution failure for 0-element case
+template <typename _Tp>
+  struct __vector_type_n<_Tp, 0, void> {};
+
+// special case 1-element to be _Tp itself
+template <typename _Tp>
+  struct __vector_type_n<_Tp, 1, enable_if_t<__is_vectorizable_v<_Tp>>>
+  { using type = _Tp; };
+
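+// Example (illustrative): with the specializations above and the general case
+// below,
+//   __vector_type_t<float, 1> is plain float (scalar special case),
+//   __vector_type_t<float, 4> is a 16-byte [[gnu::vector_size(16)]] float, and
+//   __vector_type_t<float, 3> is also 16 bytes, because _S_Np2 rounds the
+//   requested byte count up to the next power of two.
+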
+// else, use GNU-style builtin vector types
+template <typename _Tp, size_t _Np>
+  struct __vector_type_n<_Tp, _Np,
+                         enable_if_t<__is_vectorizable_v<_Tp> && _Np >= 2>>
+  {
+    static constexpr size_t _S_Np2 = std::__bit_ceil(_Np * sizeof(_Tp));
+
+    static constexpr size_t _S_Bytes =
+#ifdef __i386__
+      // Using [[gnu::vector_size(8)]] would wreak havoc on the FPU because
+      // those objects are passed via MMX registers and nothing ever calls
+      // EMMS.
+      _S_Np2 == 8 ? 16 :
+#endif
+      _S_Np2 < __min_vector_size<_Tp> ? __min_vector_size<_Tp>
+                                      : _S_Np2;
+
+    using type [[__gnu__::__vector_size__(_S_Bytes)]] = _Tp;
+  };
+
+template <typename _Tp, size_t _Bytes, size_t = _Bytes % sizeof(_Tp)>
+  struct __vector_type;
+
+template <typename _Tp, size_t _Bytes>
+  struct __vector_type<_Tp, _Bytes, 0>
+  : __vector_type_n<_Tp, _Bytes / sizeof(_Tp)> {};
+
+template <typename _Tp, size_t _Size>
+  using __vector_type_t = typename __vector_type_n<_Tp, _Size>::type;
+
+template <typename _Tp>
+  using __vector_type2_t = typename __vector_type<_Tp, 2>::type;
+template <typename _Tp>
+  using __vector_type4_t = typename __vector_type<_Tp, 4>::type;
+template <typename _Tp>
+  using __vector_type8_t = typename __vector_type<_Tp, 8>::type;
+template <typename _Tp>
+  using __vector_type16_t = typename __vector_type<_Tp, 16>::type;
+template <typename _Tp>
+  using __vector_type32_t = typename __vector_type<_Tp, 32>::type;
+template <typename _Tp>
+  using __vector_type64_t = typename __vector_type<_Tp, 64>::type;
+
+// }}}
+// __is_vector_type {{{
+template <typename _Tp, typename = void_t<>>
+  struct __is_vector_type : false_type {};
+
+template <typename _Tp>
+  struct __is_vector_type<
+    _Tp, void_t<typename __vector_type<
+           remove_reference_t<decltype(declval<_Tp>()[0])>, sizeof(_Tp)>::type>>
+  : is_same<_Tp, typename __vector_type<
+                   remove_reference_t<decltype(declval<_Tp>()[0])>,
+                   sizeof(_Tp)>::type> {};
+
+template <typename _Tp>
+  inline constexpr bool __is_vector_type_v = __is_vector_type<_Tp>::value;
+
+// }}}
+// _VectorTraits{{{
+template <typename _Tp, typename = void_t<>>
+  struct _VectorTraitsImpl;
+
+template <typename _Tp>
+  struct _VectorTraitsImpl<_Tp, enable_if_t<__is_vector_type_v<_Tp>>>
+  {
+    using type = _Tp;
+    using value_type = remove_reference_t<decltype(declval<_Tp>()[0])>;
+    static constexpr int _S_full_size = sizeof(_Tp) / sizeof(value_type);
+    using _Wrapper = _SimdWrapper<value_type, _S_full_size>;
+    template <typename _Up, int _W = _S_full_size>
+      static constexpr bool _S_is
+        = is_same_v<value_type, _Up> && _W == _S_full_size;
+  };
+
+template <typename _Tp, size_t _Np>
+  struct _VectorTraitsImpl<_SimdWrapper<_Tp, _Np>,
+                           void_t<__vector_type_t<_Tp, _Np>>>
+  {
+    using type = __vector_type_t<_Tp, _Np>;
+    using value_type = _Tp;
+    static constexpr int _S_full_size = sizeof(type) / sizeof(value_type);
+    using _Wrapper = _SimdWrapper<_Tp, _Np>;
+    static constexpr bool _S_is_partial = (_Np != _S_full_size);
+    static constexpr int _S_partial_width = _Np;
+    template <typename _Up, int _W = _S_full_size>
+      static constexpr bool _S_is
+        = is_same_v<value_type, _Up> && _W == _S_full_size;
+  };
+
+template <typename _Tp, typename = typename _VectorTraitsImpl<_Tp>::type>
+  using _VectorTraits = _VectorTraitsImpl<_Tp>;
+
+// }}}
+// __as_vector{{{
+template <typename _V>
+  _GLIBCXX_SIMD_INTRINSIC constexpr auto
+  __as_vector(_V __x)
+  {
+    if constexpr (__is_vector_type_v<_V>)
+      return __x;
+    else if constexpr (is_simd<_V>::value || is_simd_mask<_V>::value)
+      return __data(__x)._M_data;
+    else if constexpr (__is_vectorizable_v<_V>)
+      return
__vector_type_t<_V, 2>{__x}; + else + return __x._M_data; + } + +// }}} +// __as_wrapper{{{ +template <size_t _Np = 0, typename _V> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __as_wrapper(_V __x) + { + if constexpr (__is_vector_type_v<_V>) + return _SimdWrapper<typename _VectorTraits<_V>::value_type, + (_Np > 0 ? _Np : _VectorTraits<_V>::_S_full_size)>(__x); + else if constexpr (is_simd<_V>::value || is_simd_mask<_V>::value) + { + static_assert(_V::size() == _Np); + return __data(__x); + } + else + { + static_assert(_V::_S_size == _Np); + return __x; + } + } + +// }}} +// __intrin_bitcast{{{ +template <typename _To, typename _From> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __intrin_bitcast(_From __v) + { + static_assert(__is_vector_type_v<_From> && __is_vector_type_v<_To>); + if constexpr (sizeof(_To) == sizeof(_From)) + return reinterpret_cast<_To>(__v); + else if constexpr (sizeof(_From) > sizeof(_To)) + if constexpr (sizeof(_To) >= 16) + return reinterpret_cast<const __may_alias<_To>&>(__v); + else + { + _To __r; + __builtin_memcpy(&__r, &__v, sizeof(_To)); + return __r; + } +#if _GLIBCXX_SIMD_X86INTRIN && !defined __clang__ + else if constexpr (__have_avx && sizeof(_From) == 16 && sizeof(_To) == 32) + return reinterpret_cast<_To>(__builtin_ia32_ps256_ps( + reinterpret_cast<__vector_type_t<float, 4>>(__v))); + else if constexpr (__have_avx512f && sizeof(_From) == 16 + && sizeof(_To) == 64) + return reinterpret_cast<_To>(__builtin_ia32_ps512_ps( + reinterpret_cast<__vector_type_t<float, 4>>(__v))); + else if constexpr (__have_avx512f && sizeof(_From) == 32 + && sizeof(_To) == 64) + return reinterpret_cast<_To>(__builtin_ia32_ps512_256ps( + reinterpret_cast<__vector_type_t<float, 8>>(__v))); +#endif // _GLIBCXX_SIMD_X86INTRIN + else if constexpr (sizeof(__v) <= 8) + return reinterpret_cast<_To>( + __vector_type_t<__int_for_sizeof_t<_From>, sizeof(_To) / sizeof(_From)>{ + reinterpret_cast<__int_for_sizeof_t<_From>>(__v)}); + else + { + static_assert(sizeof(_To) > sizeof(_From)); + _To __r = {}; + __builtin_memcpy(&__r, &__v, sizeof(_From)); + return __r; + } + } + +// }}} +// __vector_bitcast{{{ +template <typename _To, size_t _NN = 0, typename _From, + typename _FromVT = _VectorTraits<_From>, + size_t _Np = _NN == 0 ? sizeof(_From) / sizeof(_To) : _NN> + _GLIBCXX_SIMD_INTRINSIC constexpr __vector_type_t<_To, _Np> + __vector_bitcast(_From __x) + { + using _R = __vector_type_t<_To, _Np>; + return __intrin_bitcast<_R>(__x); + } + +template <typename _To, size_t _NN = 0, typename _Tp, size_t _Nx, + size_t _Np + = _NN == 0 ? 
sizeof(_SimdWrapper<_Tp, _Nx>) / sizeof(_To) : _NN> + _GLIBCXX_SIMD_INTRINSIC constexpr __vector_type_t<_To, _Np> + __vector_bitcast(const _SimdWrapper<_Tp, _Nx>& __x) + { + static_assert(_Np > 1); + return __intrin_bitcast<__vector_type_t<_To, _Np>>(__x._M_data); + } + +// }}} +// __convert_x86 declarations {{{ +#ifdef _GLIBCXX_SIMD_WORKAROUND_PR85048 +template <typename _To, typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _To __convert_x86(_Tp); + +template <typename _To, typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _To __convert_x86(_Tp, _Tp); + +template <typename _To, typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _To __convert_x86(_Tp, _Tp, _Tp, _Tp); + +template <typename _To, typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _To __convert_x86(_Tp, _Tp, _Tp, _Tp, _Tp, _Tp, _Tp, _Tp); + +template <typename _To, typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _To __convert_x86(_Tp, _Tp, _Tp, _Tp, _Tp, _Tp, _Tp, _Tp, _Tp, _Tp, _Tp, _Tp, + _Tp, _Tp, _Tp, _Tp); +#endif // _GLIBCXX_SIMD_WORKAROUND_PR85048 + +//}}} +// __bit_cast {{{ +template <typename _To, typename _From> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __bit_cast(const _From __x) + { + // TODO: implement with / replace by __builtin_bit_cast ASAP + static_assert(sizeof(_To) == sizeof(_From)); + constexpr bool __to_is_vectorizable + = is_arithmetic_v<_To> || is_enum_v<_To>; + constexpr bool __from_is_vectorizable + = is_arithmetic_v<_From> || is_enum_v<_From>; + if constexpr (__is_vector_type_v<_To> && __is_vector_type_v<_From>) + return reinterpret_cast<_To>(__x); + else if constexpr (__is_vector_type_v<_To> && __from_is_vectorizable) + { + using _FV [[gnu::vector_size(sizeof(_From))]] = _From; + return reinterpret_cast<_To>(_FV{__x}); + } + else if constexpr (__to_is_vectorizable && __from_is_vectorizable) + { + using _TV [[gnu::vector_size(sizeof(_To))]] = _To; + using _FV [[gnu::vector_size(sizeof(_From))]] = _From; + return reinterpret_cast<_TV>(_FV{__x})[0]; + } + else if constexpr (__to_is_vectorizable && __is_vector_type_v<_From>) + { + using _TV [[gnu::vector_size(sizeof(_To))]] = _To; + return reinterpret_cast<_TV>(__x)[0]; + } + else + { + _To __r; + __builtin_memcpy(reinterpret_cast<char*>(&__r), + reinterpret_cast<const char*>(&__x), sizeof(_To)); + return __r; + } + } + +// }}} +// __to_intrin {{{ +template <typename _Tp, typename _TVT = _VectorTraits<_Tp>, + typename _R + = __intrinsic_type_t<typename _TVT::value_type, _TVT::_S_full_size>> + _GLIBCXX_SIMD_INTRINSIC constexpr _R + __to_intrin(_Tp __x) + { + static_assert(sizeof(__x) <= sizeof(_R), + "__to_intrin may never drop values off the end"); + if constexpr (sizeof(__x) == sizeof(_R)) + return reinterpret_cast<_R>(__as_vector(__x)); + else + { + using _Up = __int_for_sizeof_t<_Tp>; + return reinterpret_cast<_R>( + __vector_type_t<_Up, sizeof(_R) / sizeof(_Up)>{__bit_cast<_Up>(__x)}); + } + } + +// }}} +// __make_vector{{{ +template <typename _Tp, typename... _Args> + _GLIBCXX_SIMD_INTRINSIC constexpr __vector_type_t<_Tp, sizeof...(_Args)> + __make_vector(const _Args&... __args) + { + return __vector_type_t<_Tp, sizeof...(_Args)>{static_cast<_Tp>(__args)...}; + } + +// }}} +// __vector_broadcast{{{ +template <size_t _Np, typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr __vector_type_t<_Tp, _Np> + __vector_broadcast(_Tp __x) + { + return __call_with_n_evaluations<_Np>( + [](auto... 
__xx) { return __vector_type_t<_Tp, _Np>{__xx...}; }, + [&__x](int) { return __x; }); + } + +// }}} +// __generate_vector{{{ + template <typename _Tp, size_t _Np, typename _Gp, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr __vector_type_t<_Tp, _Np> + __generate_vector_impl(_Gp&& __gen, index_sequence<_I...>) + { + return __vector_type_t<_Tp, _Np>{ + static_cast<_Tp>(__gen(_SizeConstant<_I>()))...}; + } + +template <typename _V, typename _VVT = _VectorTraits<_V>, typename _Gp> + _GLIBCXX_SIMD_INTRINSIC constexpr _V + __generate_vector(_Gp&& __gen) + { + if constexpr (__is_vector_type_v<_V>) + return __generate_vector_impl<typename _VVT::value_type, + _VVT::_S_full_size>( + static_cast<_Gp&&>(__gen), make_index_sequence<_VVT::_S_full_size>()); + else + return __generate_vector_impl<typename _VVT::value_type, + _VVT::_S_partial_width>( + static_cast<_Gp&&>(__gen), + make_index_sequence<_VVT::_S_partial_width>()); + } + +template <typename _Tp, size_t _Np, typename _Gp> + _GLIBCXX_SIMD_INTRINSIC constexpr __vector_type_t<_Tp, _Np> + __generate_vector(_Gp&& __gen) + { + return __generate_vector_impl<_Tp, _Np>(static_cast<_Gp&&>(__gen), + make_index_sequence<_Np>()); + } + +// }}} +// __xor{{{ +template <typename _TW> + _GLIBCXX_SIMD_INTRINSIC constexpr _TW + __xor(_TW __a, _TW __b) noexcept + { + if constexpr (__is_vector_type_v<_TW> || __is_simd_wrapper_v<_TW>) + { + using _Tp = typename conditional_t<__is_simd_wrapper_v<_TW>, _TW, + _VectorTraitsImpl<_TW>>::value_type; + if constexpr (is_floating_point_v<_Tp>) + { + using _Ip = make_unsigned_t<__int_for_sizeof_t<_Tp>>; + return __vector_bitcast<_Tp>(__vector_bitcast<_Ip>(__a) + ^ __vector_bitcast<_Ip>(__b)); + } + else if constexpr (__is_vector_type_v<_TW>) + return __a ^ __b; + else + return __a._M_data ^ __b._M_data; + } + else + return __a ^ __b; + } + +// }}} +// __or{{{ +template <typename _TW> + _GLIBCXX_SIMD_INTRINSIC constexpr _TW + __or(_TW __a, _TW __b) noexcept + { + if constexpr (__is_vector_type_v<_TW> || __is_simd_wrapper_v<_TW>) + { + using _Tp = typename conditional_t<__is_simd_wrapper_v<_TW>, _TW, + _VectorTraitsImpl<_TW>>::value_type; + if constexpr (is_floating_point_v<_Tp>) + { + using _Ip = make_unsigned_t<__int_for_sizeof_t<_Tp>>; + return __vector_bitcast<_Tp>(__vector_bitcast<_Ip>(__a) + | __vector_bitcast<_Ip>(__b)); + } + else if constexpr (__is_vector_type_v<_TW>) + return __a | __b; + else + return __a._M_data | __b._M_data; + } + else + return __a | __b; + } + +// }}} +// __and{{{ +template <typename _TW> + _GLIBCXX_SIMD_INTRINSIC constexpr _TW + __and(_TW __a, _TW __b) noexcept + { + if constexpr (__is_vector_type_v<_TW> || __is_simd_wrapper_v<_TW>) + { + using _Tp = typename conditional_t<__is_simd_wrapper_v<_TW>, _TW, + _VectorTraitsImpl<_TW>>::value_type; + if constexpr (is_floating_point_v<_Tp>) + { + using _Ip = make_unsigned_t<__int_for_sizeof_t<_Tp>>; + return __vector_bitcast<_Tp>(__vector_bitcast<_Ip>(__a) + & __vector_bitcast<_Ip>(__b)); + } + else if constexpr (__is_vector_type_v<_TW>) + return __a & __b; + else + return __a._M_data & __b._M_data; + } + else + return __a & __b; + } + +// }}} +// __andnot{{{ +#if _GLIBCXX_SIMD_X86INTRIN && !defined __clang__ +static constexpr struct +{ + _GLIBCXX_SIMD_INTRINSIC __v4sf + operator()(__v4sf __a, __v4sf __b) const noexcept + { return __builtin_ia32_andnps(__a, __b); } + + _GLIBCXX_SIMD_INTRINSIC __v2df + operator()(__v2df __a, __v2df __b) const noexcept + { return __builtin_ia32_andnpd(__a, __b); } + + _GLIBCXX_SIMD_INTRINSIC __v2di + operator()(__v2di 
__a, __v2di __b) const noexcept + { return __builtin_ia32_pandn128(__a, __b); } + + _GLIBCXX_SIMD_INTRINSIC __v8sf + operator()(__v8sf __a, __v8sf __b) const noexcept + { return __builtin_ia32_andnps256(__a, __b); } + + _GLIBCXX_SIMD_INTRINSIC __v4df + operator()(__v4df __a, __v4df __b) const noexcept + { return __builtin_ia32_andnpd256(__a, __b); } + + _GLIBCXX_SIMD_INTRINSIC __v4di + operator()(__v4di __a, __v4di __b) const noexcept + { + if constexpr (__have_avx2) + return __builtin_ia32_andnotsi256(__a, __b); + else + return reinterpret_cast<__v4di>( + __builtin_ia32_andnpd256(reinterpret_cast<__v4df>(__a), + reinterpret_cast<__v4df>(__b))); + } + + _GLIBCXX_SIMD_INTRINSIC __v16sf + operator()(__v16sf __a, __v16sf __b) const noexcept + { + if constexpr (__have_avx512dq) + return _mm512_andnot_ps(__a, __b); + else + return reinterpret_cast<__v16sf>( + _mm512_andnot_si512(reinterpret_cast<__v8di>(__a), + reinterpret_cast<__v8di>(__b))); + } + + _GLIBCXX_SIMD_INTRINSIC __v8df + operator()(__v8df __a, __v8df __b) const noexcept + { + if constexpr (__have_avx512dq) + return _mm512_andnot_pd(__a, __b); + else + return reinterpret_cast<__v8df>( + _mm512_andnot_si512(reinterpret_cast<__v8di>(__a), + reinterpret_cast<__v8di>(__b))); + } + + _GLIBCXX_SIMD_INTRINSIC __v8di + operator()(__v8di __a, __v8di __b) const noexcept + { return _mm512_andnot_si512(__a, __b); } +} _S_x86_andnot; +#endif // _GLIBCXX_SIMD_X86INTRIN && !__clang__ + +template <typename _TW> + _GLIBCXX_SIMD_INTRINSIC constexpr _TW + __andnot(_TW __a, _TW __b) noexcept + { + if constexpr (__is_vector_type_v<_TW> || __is_simd_wrapper_v<_TW>) + { + using _TVT = conditional_t<__is_simd_wrapper_v<_TW>, _TW, + _VectorTraitsImpl<_TW>>; + using _Tp = typename _TVT::value_type; +#if _GLIBCXX_SIMD_X86INTRIN && !defined __clang__ + if constexpr (sizeof(_TW) >= 16) + { + const auto __ai = __to_intrin(__a); + const auto __bi = __to_intrin(__b); + if (!__builtin_is_constant_evaluated() + && !(__builtin_constant_p(__ai) && __builtin_constant_p(__bi))) + { + const auto __r = _S_x86_andnot(__ai, __bi); + if constexpr (is_convertible_v<decltype(__r), _TW>) + return __r; + else + return reinterpret_cast<typename _TVT::type>(__r); + } + } +#endif // _GLIBCXX_SIMD_X86INTRIN + using _Ip = make_unsigned_t<__int_for_sizeof_t<_Tp>>; + return __vector_bitcast<_Tp>(~__vector_bitcast<_Ip>(__a) + & __vector_bitcast<_Ip>(__b)); + } + else + return ~__a & __b; + } + +// }}} +// __not{{{ +template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _GLIBCXX_SIMD_INTRINSIC constexpr _Tp + __not(_Tp __a) noexcept + { + if constexpr (is_floating_point_v<typename _TVT::value_type>) + return reinterpret_cast<typename _TVT::type>( + ~__vector_bitcast<unsigned>(__a)); + else + return ~__a; + } + +// }}} +// __concat{{{ +template <typename _Tp, typename _TVT = _VectorTraits<_Tp>, + typename _R = __vector_type_t<typename _TVT::value_type, + _TVT::_S_full_size * 2>> + constexpr _R + __concat(_Tp a_, _Tp b_) + { +#ifdef _GLIBCXX_SIMD_WORKAROUND_XXX_1 + using _W + = conditional_t<is_floating_point_v<typename _TVT::value_type>, double, + conditional_t<(sizeof(_Tp) >= 2 * sizeof(long long)), + long long, typename _TVT::value_type>>; + constexpr int input_width = sizeof(_Tp) / sizeof(_W); + const auto __a = __vector_bitcast<_W>(a_); + const auto __b = __vector_bitcast<_W>(b_); + using _Up = __vector_type_t<_W, sizeof(_R) / sizeof(_W)>; +#else + constexpr int input_width = _TVT::_S_full_size; + const _Tp& __a = a_; + const _Tp& __b = b_; + using _Up = _R; +#endif + if 
constexpr (input_width == 2) + return reinterpret_cast<_R>(_Up{__a[0], __a[1], __b[0], __b[1]}); + else if constexpr (input_width == 4) + return reinterpret_cast<_R>( + _Up{__a[0], __a[1], __a[2], __a[3], __b[0], __b[1], __b[2], __b[3]}); + else if constexpr (input_width == 8) + return reinterpret_cast<_R>( + _Up{__a[0], __a[1], __a[2], __a[3], __a[4], __a[5], __a[6], __a[7], + __b[0], __b[1], __b[2], __b[3], __b[4], __b[5], __b[6], __b[7]}); + else if constexpr (input_width == 16) + return reinterpret_cast<_R>( + _Up{__a[0], __a[1], __a[2], __a[3], __a[4], __a[5], __a[6], + __a[7], __a[8], __a[9], __a[10], __a[11], __a[12], __a[13], + __a[14], __a[15], __b[0], __b[1], __b[2], __b[3], __b[4], + __b[5], __b[6], __b[7], __b[8], __b[9], __b[10], __b[11], + __b[12], __b[13], __b[14], __b[15]}); + else if constexpr (input_width == 32) + return reinterpret_cast<_R>( + _Up{__a[0], __a[1], __a[2], __a[3], __a[4], __a[5], __a[6], + __a[7], __a[8], __a[9], __a[10], __a[11], __a[12], __a[13], + __a[14], __a[15], __a[16], __a[17], __a[18], __a[19], __a[20], + __a[21], __a[22], __a[23], __a[24], __a[25], __a[26], __a[27], + __a[28], __a[29], __a[30], __a[31], __b[0], __b[1], __b[2], + __b[3], __b[4], __b[5], __b[6], __b[7], __b[8], __b[9], + __b[10], __b[11], __b[12], __b[13], __b[14], __b[15], __b[16], + __b[17], __b[18], __b[19], __b[20], __b[21], __b[22], __b[23], + __b[24], __b[25], __b[26], __b[27], __b[28], __b[29], __b[30], + __b[31]}); + } + +// }}} +// __zero_extend {{{ +template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + struct _ZeroExtendProxy + { + using value_type = typename _TVT::value_type; + static constexpr size_t _Np = _TVT::_S_full_size; + const _Tp __x; + + template <typename _To, typename _ToVT = _VectorTraits<_To>, + typename + = enable_if_t<is_same_v<typename _ToVT::value_type, value_type>>> + _GLIBCXX_SIMD_INTRINSIC operator _To() const + { + constexpr size_t _ToN = _ToVT::_S_full_size; + if constexpr (_ToN == _Np) + return __x; + else if constexpr (_ToN == 2 * _Np) + { +#ifdef _GLIBCXX_SIMD_WORKAROUND_XXX_3 + if constexpr (__have_avx && _TVT::template _S_is<float, 4>) + return __vector_bitcast<value_type>( + _mm256_insertf128_ps(__m256(), __x, 0)); + else if constexpr (__have_avx && _TVT::template _S_is<double, 2>) + return __vector_bitcast<value_type>( + _mm256_insertf128_pd(__m256d(), __x, 0)); + else if constexpr (__have_avx2 && _Np * sizeof(value_type) == 16) + return __vector_bitcast<value_type>( + _mm256_insertf128_si256(__m256i(), __to_intrin(__x), 0)); + else if constexpr (__have_avx512f && _TVT::template _S_is<float, 8>) + { + if constexpr (__have_avx512dq) + return __vector_bitcast<value_type>( + _mm512_insertf32x8(__m512(), __x, 0)); + else + return reinterpret_cast<__m512>( + _mm512_insertf64x4(__m512d(), + reinterpret_cast<__m256d>(__x), 0)); + } + else if constexpr (__have_avx512f + && _TVT::template _S_is<double, 4>) + return __vector_bitcast<value_type>( + _mm512_insertf64x4(__m512d(), __x, 0)); + else if constexpr (__have_avx512f && _Np * sizeof(value_type) == 32) + return __vector_bitcast<value_type>( + _mm512_inserti64x4(__m512i(), __to_intrin(__x), 0)); +#endif + return __concat(__x, _Tp()); + } + else if constexpr (_ToN == 4 * _Np) + { +#ifdef _GLIBCXX_SIMD_WORKAROUND_XXX_3 + if constexpr (__have_avx512dq && _TVT::template _S_is<double, 2>) + { + return __vector_bitcast<value_type>( + _mm512_insertf64x2(__m512d(), __x, 0)); + } + else if constexpr (__have_avx512f + && is_floating_point_v<value_type>) + { + return __vector_bitcast<value_type>( + 
_mm512_insertf32x4(__m512(), reinterpret_cast<__m128>(__x), + 0)); + } + else if constexpr (__have_avx512f && _Np * sizeof(value_type) == 16) + { + return __vector_bitcast<value_type>( + _mm512_inserti32x4(__m512i(), __to_intrin(__x), 0)); + } +#endif + return __concat(__concat(__x, _Tp()), + __vector_type_t<value_type, _Np * 2>()); + } + else if constexpr (_ToN == 8 * _Np) + return __concat(operator __vector_type_t<value_type, _Np * 4>(), + __vector_type_t<value_type, _Np * 4>()); + else if constexpr (_ToN == 16 * _Np) + return __concat(operator __vector_type_t<value_type, _Np * 8>(), + __vector_type_t<value_type, _Np * 8>()); + else + __assert_unreachable<_Tp>(); + } + }; + +template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _GLIBCXX_SIMD_INTRINSIC _ZeroExtendProxy<_Tp, _TVT> + __zero_extend(_Tp __x) + { return {__x}; } + +// }}} +// __extract<_Np, By>{{{ +template <int _Offset, + int _SplitBy, + typename _Tp, + typename _TVT = _VectorTraits<_Tp>, + typename _R = __vector_type_t<typename _TVT::value_type, + _TVT::_S_full_size / _SplitBy>> + _GLIBCXX_SIMD_INTRINSIC constexpr _R + __extract(_Tp __in) + { + using value_type = typename _TVT::value_type; +#if _GLIBCXX_SIMD_X86INTRIN // {{{ + if constexpr (sizeof(_Tp) == 64 && _SplitBy == 4 && _Offset > 0) + { + if constexpr (__have_avx512dq && is_same_v<double, value_type>) + return _mm512_extractf64x2_pd(__to_intrin(__in), _Offset); + else if constexpr (is_floating_point_v<value_type>) + return __vector_bitcast<value_type>( + _mm512_extractf32x4_ps(__intrin_bitcast<__m512>(__in), _Offset)); + else + return reinterpret_cast<_R>( + _mm512_extracti32x4_epi32(__intrin_bitcast<__m512i>(__in), + _Offset)); + } + else +#endif // _GLIBCXX_SIMD_X86INTRIN }}} + { +#ifdef _GLIBCXX_SIMD_WORKAROUND_XXX_1 + using _W = conditional_t< + is_floating_point_v<value_type>, double, + conditional_t<(sizeof(_R) >= 16), long long, value_type>>; + static_assert(sizeof(_R) % sizeof(_W) == 0); + constexpr int __return_width = sizeof(_R) / sizeof(_W); + using _Up = __vector_type_t<_W, __return_width>; + const auto __x = __vector_bitcast<_W>(__in); +#else + constexpr int __return_width = _TVT::_S_full_size / _SplitBy; + using _Up = _R; + const __vector_type_t<value_type, _TVT::_S_full_size>& __x + = __in; // only needed for _Tp = _SimdWrapper<value_type, _Np> +#endif + constexpr int _O = _Offset * __return_width; + return __call_with_subscripts<__return_width, _O>( + __x, [](auto... 
__entries) { + return reinterpret_cast<_R>(_Up{__entries...}); + }); + } + } + +// }}} +// __lo/__hi64[z]{{{ +template <typename _Tp, + typename _R + = __vector_type8_t<typename _VectorTraits<_Tp>::value_type>> + _GLIBCXX_SIMD_INTRINSIC constexpr _R + __lo64(_Tp __x) + { + _R __r{}; + __builtin_memcpy(&__r, &__x, 8); + return __r; + } + +template <typename _Tp, + typename _R + = __vector_type8_t<typename _VectorTraits<_Tp>::value_type>> + _GLIBCXX_SIMD_INTRINSIC constexpr _R + __hi64(_Tp __x) + { + static_assert(sizeof(_Tp) == 16, "use __hi64z if you meant it"); + _R __r{}; + __builtin_memcpy(&__r, reinterpret_cast<const char*>(&__x) + 8, 8); + return __r; + } + +template <typename _Tp, + typename _R + = __vector_type8_t<typename _VectorTraits<_Tp>::value_type>> + _GLIBCXX_SIMD_INTRINSIC constexpr _R + __hi64z([[maybe_unused]] _Tp __x) + { + _R __r{}; + if constexpr (sizeof(_Tp) == 16) + __builtin_memcpy(&__r, reinterpret_cast<const char*>(&__x) + 8, 8); + return __r; + } + +// }}} +// __lo/__hi128{{{ +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __lo128(_Tp __x) + { return __extract<0, sizeof(_Tp) / 16>(__x); } + +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __hi128(_Tp __x) + { + static_assert(sizeof(__x) == 32); + return __extract<1, 2>(__x); + } + +// }}} +// __lo/__hi256{{{ +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __lo256(_Tp __x) + { + static_assert(sizeof(__x) == 64); + return __extract<0, 2>(__x); + } + +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __hi256(_Tp __x) + { + static_assert(sizeof(__x) == 64); + return __extract<1, 2>(__x); + } + +// }}} +// __auto_bitcast{{{ +template <typename _Tp> + struct _AutoCast + { + static_assert(__is_vector_type_v<_Tp>); + + const _Tp __x; + + template <typename _Up, typename _UVT = _VectorTraits<_Up>> + _GLIBCXX_SIMD_INTRINSIC constexpr operator _Up() const + { return __intrin_bitcast<typename _UVT::type>(__x); } + }; + +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr _AutoCast<_Tp> + __auto_bitcast(const _Tp& __x) + { return {__x}; } + +template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC constexpr + _AutoCast<typename _SimdWrapper<_Tp, _Np>::_BuiltinType> + __auto_bitcast(const _SimdWrapper<_Tp, _Np>& __x) + { return {__x._M_data}; } + +// }}} +// ^^^ ---- builtin vector types [[gnu::vector_size(N)]] and operations ---- ^^^ + +#if _GLIBCXX_SIMD_HAVE_SSE_ABI +// __bool_storage_member_type{{{ +#if _GLIBCXX_SIMD_HAVE_AVX512F && _GLIBCXX_SIMD_X86INTRIN +template <size_t _Size> + struct __bool_storage_member_type + { + static_assert((_Size & (_Size - 1)) != 0, + "This trait may only be used for non-power-of-2 sizes. 
" + "Power-of-2 sizes must be specialized."); + using type = + typename __bool_storage_member_type<std::__bit_ceil(_Size)>::type; + }; + +template <> + struct __bool_storage_member_type<1> { using type = bool; }; + +template <> + struct __bool_storage_member_type<2> { using type = __mmask8; }; + +template <> + struct __bool_storage_member_type<4> { using type = __mmask8; }; + +template <> + struct __bool_storage_member_type<8> { using type = __mmask8; }; + +template <> + struct __bool_storage_member_type<16> { using type = __mmask16; }; + +template <> + struct __bool_storage_member_type<32> { using type = __mmask32; }; + +template <> + struct __bool_storage_member_type<64> { using type = __mmask64; }; +#endif // _GLIBCXX_SIMD_HAVE_AVX512F + +// }}} +// __intrinsic_type (x86){{{ +// the following excludes bool via __is_vectorizable +#if _GLIBCXX_SIMD_HAVE_SSE +template <typename _Tp, size_t _Bytes> + struct __intrinsic_type<_Tp, _Bytes, + enable_if_t<__is_vectorizable_v<_Tp> && _Bytes <= 64>> + { + static_assert(!is_same_v<_Tp, long double>, + "no __intrinsic_type support for long double on x86"); + + static constexpr size_t _S_VBytes = _Bytes <= 16 ? 16 + : _Bytes <= 32 ? 32 + : 64; + + using type [[__gnu__::__vector_size__(_S_VBytes)]] + = conditional_t<is_integral_v<_Tp>, long long int, _Tp>; + }; +#endif // _GLIBCXX_SIMD_HAVE_SSE + +// }}} +#endif // _GLIBCXX_SIMD_HAVE_SSE_ABI +// __intrinsic_type (ARM){{{ +#if _GLIBCXX_SIMD_HAVE_NEON +template <typename _Tp, size_t _Bytes> + struct __intrinsic_type<_Tp, _Bytes, + enable_if_t<__is_vectorizable_v<_Tp> && _Bytes <= 16>> + { + static constexpr int _S_VBytes = _Bytes <= 8 ? 8 : 16; + using _Ip = __int_for_sizeof_t<_Tp>; + using _Up = conditional_t< + is_floating_point_v<_Tp>, _Tp, + conditional_t<is_unsigned_v<_Tp>, make_unsigned_t<_Ip>, _Ip>>; + using type [[__gnu__::__vector_size__(_S_VBytes)]] = _Up; + }; +#endif // _GLIBCXX_SIMD_HAVE_NEON + +// }}} +// __intrinsic_type (PPC){{{ +#ifdef __ALTIVEC__ +template <typename _Tp> + struct __intrinsic_type_impl; + +#define _GLIBCXX_SIMD_PPC_INTRIN(_Tp) \ + template <> \ + struct __intrinsic_type_impl<_Tp> { using type = __vector _Tp; } +_GLIBCXX_SIMD_PPC_INTRIN(float); +_GLIBCXX_SIMD_PPC_INTRIN(double); +_GLIBCXX_SIMD_PPC_INTRIN(signed char); +_GLIBCXX_SIMD_PPC_INTRIN(unsigned char); +_GLIBCXX_SIMD_PPC_INTRIN(signed short); +_GLIBCXX_SIMD_PPC_INTRIN(unsigned short); +_GLIBCXX_SIMD_PPC_INTRIN(signed int); +_GLIBCXX_SIMD_PPC_INTRIN(unsigned int); +_GLIBCXX_SIMD_PPC_INTRIN(signed long); +_GLIBCXX_SIMD_PPC_INTRIN(unsigned long); +_GLIBCXX_SIMD_PPC_INTRIN(signed long long); +_GLIBCXX_SIMD_PPC_INTRIN(unsigned long long); +#undef _GLIBCXX_SIMD_PPC_INTRIN + +template <typename _Tp, size_t _Bytes> + struct __intrinsic_type<_Tp, _Bytes, + enable_if_t<__is_vectorizable_v<_Tp> && _Bytes <= 16>> + { + static_assert(!is_same_v<_Tp, long double>, + "no __intrinsic_type support for long double on PPC"); +#ifndef __VSX__ + static_assert(!is_same_v<_Tp, double>, + "no __intrinsic_type support for double on PPC w/o VSX"); +#endif +#ifndef __POWER8_VECTOR__ + static_assert( + !(is_integral_v<_Tp> && sizeof(_Tp) > 4), + "no __intrinsic_type support for integers larger than 4 Bytes " + "on PPC w/o POWER8 vectors"); +#endif + using type = typename __intrinsic_type_impl<conditional_t< + is_floating_point_v<_Tp>, _Tp, __int_for_sizeof_t<_Tp>>>::type; + }; +#endif // __ALTIVEC__ + +// }}} +// _SimdWrapper<bool>{{{1 +template <size_t _Width> + struct _SimdWrapper<bool, _Width, + void_t<typename 
__bool_storage_member_type<_Width>::type>> + { + using _BuiltinType = typename __bool_storage_member_type<_Width>::type; + using value_type = bool; + + static constexpr size_t _S_full_size = sizeof(_BuiltinType) * __CHAR_BIT__; + + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper<bool, _S_full_size> + __as_full_vector() const { return _M_data; } + + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper() = default; + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper(_BuiltinType __k) + : _M_data(__k) {}; + + _GLIBCXX_SIMD_INTRINSIC operator const _BuiltinType&() const + { return _M_data; } + + _GLIBCXX_SIMD_INTRINSIC operator _BuiltinType&() + { return _M_data; } + + _GLIBCXX_SIMD_INTRINSIC _BuiltinType __intrin() const + { return _M_data; } + + _GLIBCXX_SIMD_INTRINSIC constexpr value_type operator[](size_t __i) const + { return _M_data & (_BuiltinType(1) << __i); } + + template <size_t __i> + _GLIBCXX_SIMD_INTRINSIC constexpr value_type + operator[](_SizeConstant<__i>) const + { return _M_data & (_BuiltinType(1) << __i); } + + _GLIBCXX_SIMD_INTRINSIC constexpr void _M_set(size_t __i, value_type __x) + { + if (__x) + _M_data |= (_BuiltinType(1) << __i); + else + _M_data &= ~(_BuiltinType(1) << __i); + } + + _GLIBCXX_SIMD_INTRINSIC + constexpr bool _M_is_constprop() const + { return __builtin_constant_p(_M_data); } + + _GLIBCXX_SIMD_INTRINSIC constexpr bool _M_is_constprop_none_of() const + { + if (__builtin_constant_p(_M_data)) + { + constexpr int __nbits = sizeof(_BuiltinType) * __CHAR_BIT__; + constexpr _BuiltinType __active_mask + = ~_BuiltinType() >> (__nbits - _Width); + return (_M_data & __active_mask) == 0; + } + return false; + } + + _GLIBCXX_SIMD_INTRINSIC constexpr bool _M_is_constprop_all_of() const + { + if (__builtin_constant_p(_M_data)) + { + constexpr int __nbits = sizeof(_BuiltinType) * __CHAR_BIT__; + constexpr _BuiltinType __active_mask + = ~_BuiltinType() >> (__nbits - _Width); + return (_M_data & __active_mask) == __active_mask; + } + return false; + } + + _BuiltinType _M_data; + }; + +// _SimdWrapperBase{{{1 +template <bool _MustZeroInitPadding, typename _BuiltinType> + struct _SimdWrapperBase; + +template <typename _BuiltinType> + struct _SimdWrapperBase<false, _BuiltinType> // no padding or no SNaNs + { + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapperBase() = default; + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapperBase(_BuiltinType __init) + : _M_data(__init) + {} + + _BuiltinType _M_data; + }; + +template <typename _BuiltinType> + struct _SimdWrapperBase<true, _BuiltinType> // with padding that needs to + // never become SNaN + { + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapperBase() : _M_data() {} + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapperBase(_BuiltinType __init) + : _M_data(__init) + {} + + _BuiltinType _M_data; + }; + +// }}} +// _SimdWrapper{{{ +template <typename _Tp, size_t _Width> + struct _SimdWrapper< + _Tp, _Width, + void_t<__vector_type_t<_Tp, _Width>, __intrinsic_type_t<_Tp, _Width>>> + : _SimdWrapperBase<__has_iec559_behavior<__signaling_NaN, _Tp>::value + && sizeof(_Tp) * _Width + == sizeof(__vector_type_t<_Tp, _Width>), + __vector_type_t<_Tp, _Width>> + { + using _Base + = _SimdWrapperBase<__has_iec559_behavior<__signaling_NaN, _Tp>::value + && sizeof(_Tp) * _Width + == sizeof(__vector_type_t<_Tp, _Width>), + __vector_type_t<_Tp, _Width>>; + + static_assert(__is_vectorizable_v<_Tp>); + static_assert(_Width >= 2); // 1 doesn't make sense, use _Tp directly then + + using _BuiltinType = __vector_type_t<_Tp, _Width>; + using value_type = _Tp; + + static 
inline constexpr size_t _S_full_size + = sizeof(_BuiltinType) / sizeof(value_type); + static inline constexpr int _S_size = _Width; + static inline constexpr bool _S_is_partial = _S_full_size != _S_size; + + using _Base::_M_data; + + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper<_Tp, _S_full_size> + __as_full_vector() const + { return _M_data; } + + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper(initializer_list<_Tp> __init) + : _Base(__generate_from_n_evaluations<_Width, _BuiltinType>( + [&](auto __i) { return __init.begin()[__i.value]; })) {} + + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper() = default; + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper(const _SimdWrapper&) + = default; + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper(_SimdWrapper&&) = default; + + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper& + operator=(const _SimdWrapper&) = default; + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper& + operator=(_SimdWrapper&&) = default; + + template <typename _V, typename = enable_if_t<disjunction_v< + is_same<_V, __vector_type_t<_Tp, _Width>>, + is_same<_V, __intrinsic_type_t<_Tp, _Width>>>>> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper(_V __x) + // __vector_bitcast can convert e.g. __m128 to __vector(2) float + : _Base(__vector_bitcast<_Tp, _Width>(__x)) {} + + template <typename... _As, + typename = enable_if_t<((is_same_v<simd_abi::scalar, _As> && ...) + && sizeof...(_As) <= _Width)>> + _GLIBCXX_SIMD_INTRINSIC constexpr + operator _SimdTuple<_Tp, _As...>() const + { + const auto& dd = _M_data; // workaround for GCC7 ICE + return __generate_from_n_evaluations<sizeof...(_As), + _SimdTuple<_Tp, _As...>>([&]( + auto __i) constexpr { return dd[int(__i)]; }); + } + + _GLIBCXX_SIMD_INTRINSIC constexpr operator const _BuiltinType&() const + { return _M_data; } + + _GLIBCXX_SIMD_INTRINSIC constexpr operator _BuiltinType&() + { return _M_data; } + + _GLIBCXX_SIMD_INTRINSIC constexpr _Tp operator[](size_t __i) const + { return _M_data[__i]; } + + template <size_t __i> + _GLIBCXX_SIMD_INTRINSIC constexpr _Tp operator[](_SizeConstant<__i>) const + { return _M_data[__i]; } + + _GLIBCXX_SIMD_INTRINSIC constexpr void _M_set(size_t __i, _Tp __x) + { _M_data[__i] = __x; } + + _GLIBCXX_SIMD_INTRINSIC + constexpr bool _M_is_constprop() const + { return __builtin_constant_p(_M_data); } + + _GLIBCXX_SIMD_INTRINSIC constexpr bool _M_is_constprop_none_of() const + { + if (__builtin_constant_p(_M_data)) + { + bool __r = true; + if constexpr (is_floating_point_v<_Tp>) + { + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __intdata = __vector_bitcast<_Ip>(_M_data); + __execute_n_times<_Width>( + [&](auto __i) { __r &= __intdata[__i.value] == _Ip(); }); + } + else + __execute_n_times<_Width>( + [&](auto __i) { __r &= _M_data[__i.value] == _Tp(); }); + return __r; + } + return false; + } + + _GLIBCXX_SIMD_INTRINSIC constexpr bool _M_is_constprop_all_of() const + { + if (__builtin_constant_p(_M_data)) + { + bool __r = true; + if constexpr (is_floating_point_v<_Tp>) + { + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __intdata = __vector_bitcast<_Ip>(_M_data); + __execute_n_times<_Width>( + [&](auto __i) { __r &= __intdata[__i.value] == ~_Ip(); }); + } + else + __execute_n_times<_Width>( + [&](auto __i) { __r &= _M_data[__i.value] == ~_Tp(); }); + return __r; + } + return false; + } + }; + +// }}} + +// __vectorized_sizeof {{{ +template <typename _Tp> + constexpr size_t + __vectorized_sizeof() + { + if constexpr (!__is_vectorizable_v<_Tp>) + return 0; + + if constexpr (sizeof(_Tp) <= 8) + { + // 
X86:
+      if constexpr (__have_avx512bw)
+        return 64;
+      if constexpr (__have_avx512f && sizeof(_Tp) >= 4)
+        return 64;
+      if constexpr (__have_avx2)
+        return 32;
+      if constexpr (__have_avx && is_floating_point_v<_Tp>)
+        return 32;
+      if constexpr (__have_sse2)
+        return 16;
+      if constexpr (__have_sse && is_same_v<_Tp, float>)
+        return 16;
+      /* The following is too much trouble because of mixed MMX and x87 code.
+       * While nothing here explicitly calls MMX instructions or registers,
+       * they are still emitted but no EMMS cleanup is done.
+      if constexpr (__have_mmx && sizeof(_Tp) <= 4 && is_integral_v<_Tp>)
+        return 8;
+       */
+
+      // PowerPC:
+      if constexpr (__have_power8vec
+                    || (__have_power_vmx && (sizeof(_Tp) < 8))
+                    || (__have_power_vsx && is_floating_point_v<_Tp>))
+        return 16;
+
+      // ARM:
+      if constexpr (__have_neon_a64
+                    || (__have_neon_a32 && !is_same_v<_Tp, double>))
+        return 16;
+      if constexpr (__have_neon
+                    && sizeof(_Tp) < 8
+                    // Only allow fp if the user allows non-IEC559 fp (e.g.
+                    // via -ffast-math). ARMv7 NEON fp does not conform to
+                    // IEC559.
+                    && (__support_neon_float || !is_floating_point_v<_Tp>))
+        return 16;
+    }
+
+    return sizeof(_Tp);
+  }
+
+// }}}
+namespace simd_abi {
+// most of simd_abi is defined in simd_detail.h
+template <typename _Tp>
+  inline constexpr int max_fixed_size
+    = (__have_avx512bw && sizeof(_Tp) == 1) ? 64 : 32;
+
+// compatible {{{
+#if defined __x86_64__ || defined __aarch64__
+template <typename _Tp>
+  using compatible = conditional_t<(sizeof(_Tp) <= 8), _VecBuiltin<16>, scalar>;
+#elif defined __ARM_NEON
+// FIXME: not sure, probably needs to be scalar (or dependent on the hard-float
+// ABI?)
+template <typename _Tp>
+  using compatible
+    = conditional_t<(sizeof(_Tp) < 8
+                     && (__support_neon_float || !is_floating_point_v<_Tp>)),
+                    _VecBuiltin<16>, scalar>;
+#else
+template <typename>
+  using compatible = scalar;
+#endif
+
+// }}}
+// native {{{
+template <typename _Tp>
+  constexpr auto
+  __determine_native_abi()
+  {
+    constexpr size_t __bytes = __vectorized_sizeof<_Tp>();
+    if constexpr (__bytes == sizeof(_Tp))
+      return static_cast<scalar*>(nullptr);
+    else if constexpr (__have_avx512vl || (__have_avx512f && __bytes == 64))
+      return static_cast<_VecBltnBtmsk<__bytes>*>(nullptr);
+    else
+      return static_cast<_VecBuiltin<__bytes>*>(nullptr);
+  }
+
+template <typename _Tp, typename = enable_if_t<__is_vectorizable_v<_Tp>>>
+  using native = remove_pointer_t<decltype(__determine_native_abi<_Tp>())>;
+
+// }}}
+// __default_abi {{{
+#if defined _GLIBCXX_SIMD_DEFAULT_ABI
+template <typename _Tp>
+  using __default_abi = _GLIBCXX_SIMD_DEFAULT_ABI<_Tp>;
+#else
+template <typename _Tp>
+  using __default_abi = compatible<_Tp>;
+#endif
+
+// }}}
+} // namespace simd_abi
+
+// traits {{{1
+// is_abi_tag {{{2
+template <typename _Tp, typename = void_t<>>
+  struct is_abi_tag : false_type {};
+
+template <typename _Tp>
+  struct is_abi_tag<_Tp, void_t<typename _Tp::_IsValidAbiTag>>
+  : public _Tp::_IsValidAbiTag {};
+
+template <typename _Tp>
+  inline constexpr bool is_abi_tag_v = is_abi_tag<_Tp>::value;
+
+// is_simd(_mask) {{{2
+template <typename _Tp>
+  struct is_simd : public false_type {};
+
+template <typename _Tp>
+  inline constexpr bool is_simd_v = is_simd<_Tp>::value;
+
+template <typename _Tp>
+  struct is_simd_mask : public false_type {};
+
+template <typename _Tp>
+  inline constexpr bool is_simd_mask_v = is_simd_mask<_Tp>::value;
+
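+// Example (illustrative): on an x86-64 machine with AVX2 but no AVX-512,
+// __vectorized_sizeof<float>() is 32, so simd_abi::native<float> is
+// _VecBuiltin<32> (8 floats per register), while simd_abi::compatible<float>
+// remains _VecBuiltin<16> (4 floats) so that TUs compiled with different -m
+// flags agree on one ABI.
+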
+// simd_size {{{2
+template <typename _Tp, typename _Abi, typename = void>
+  struct __simd_size_impl {};
+
+template <typename _Tp, typename _Abi>
+  struct __simd_size_impl<
+    _Tp, _Abi,
+    enable_if_t<conjunction_v<__is_vectorizable<_Tp>, is_abi_tag<_Abi>>>>
+  : _SizeConstant<_Abi::template _S_size<_Tp>> {};
+
+template <typename _Tp, typename _Abi = simd_abi::__default_abi<_Tp>>
+  struct simd_size : __simd_size_impl<_Tp, _Abi> {};
+
+template <typename _Tp, typename _Abi = simd_abi::__default_abi<_Tp>>
+  inline constexpr size_t simd_size_v = simd_size<_Tp, _Abi>::value;
+
+// simd_abi::deduce {{{2
+template <typename _Tp, size_t _Np, typename = void>
+  struct __deduce_impl;
+
+namespace simd_abi {
+/**
+ * @tparam _Tp The requested `value_type` for the elements.
+ * @tparam _Np The requested number of elements.
+ * @tparam _Abis This parameter is ignored, since this implementation cannot
+ * make any use of it. Either a good native ABI is matched and used as `type`
+ * alias, or the `fixed_size<_Np>` ABI is used, which internally is built from
+ * the best matching native ABIs.
+ */
+template <typename _Tp, size_t _Np, typename...>
+  struct deduce : __deduce_impl<_Tp, _Np> {};
+
+template <typename _Tp, size_t _Np, typename... _Abis>
+  using deduce_t = typename deduce<_Tp, _Np, _Abis...>::type;
+} // namespace simd_abi
+
+// }}}2
+// rebind_simd {{{2
+template <typename _Tp, typename _V, typename = void>
+  struct rebind_simd;
+
+template <typename _Tp, typename _Up, typename _Abi>
+  struct rebind_simd<
+    _Tp, simd<_Up, _Abi>,
+    void_t<simd_abi::deduce_t<_Tp, simd_size_v<_Up, _Abi>, _Abi>>>
+  {
+    using type
+      = simd<_Tp, simd_abi::deduce_t<_Tp, simd_size_v<_Up, _Abi>, _Abi>>;
+  };
+
+template <typename _Tp, typename _Up, typename _Abi>
+  struct rebind_simd<
+    _Tp, simd_mask<_Up, _Abi>,
+    void_t<simd_abi::deduce_t<_Tp, simd_size_v<_Up, _Abi>, _Abi>>>
+  {
+    using type
+      = simd_mask<_Tp, simd_abi::deduce_t<_Tp, simd_size_v<_Up, _Abi>, _Abi>>;
+  };
+
+template <typename _Tp, typename _V>
+  using rebind_simd_t = typename rebind_simd<_Tp, _V>::type;
+
+// resize_simd {{{2
+template <int _Np, typename _V, typename = void>
+  struct resize_simd;
+
+template <int _Np, typename _Tp, typename _Abi>
+  struct resize_simd<_Np, simd<_Tp, _Abi>,
+                     void_t<simd_abi::deduce_t<_Tp, _Np, _Abi>>>
+  { using type = simd<_Tp, simd_abi::deduce_t<_Tp, _Np, _Abi>>; };
+
+template <int _Np, typename _Tp, typename _Abi>
+  struct resize_simd<_Np, simd_mask<_Tp, _Abi>,
+                     void_t<simd_abi::deduce_t<_Tp, _Np, _Abi>>>
+  { using type = simd_mask<_Tp, simd_abi::deduce_t<_Tp, _Np, _Abi>>; };
+
+template <int _Np, typename _V>
+  using resize_simd_t = typename resize_simd<_Np, _V>::type;
+
+// }}}2
+// memory_alignment {{{2
+template <typename _Tp, typename _Up = typename _Tp::value_type>
+  struct memory_alignment
+  : public _SizeConstant<vector_aligned_tag::_S_alignment<_Tp, _Up>> {};
+
+template <typename _Tp, typename _Up = typename _Tp::value_type>
+  inline constexpr size_t memory_alignment_v = memory_alignment<_Tp, _Up>::value;
+
+// class template simd [simd] {{{1
+template <typename _Tp, typename _Abi = simd_abi::__default_abi<_Tp>>
+  class simd;
+
+template <typename _Tp, typename _Abi>
+  struct is_simd<simd<_Tp, _Abi>> : public true_type {};
+
+template <typename _Tp>
+  using native_simd = simd<_Tp, simd_abi::native<_Tp>>;
+
+template <typename _Tp, int _Np>
+  using fixed_size_simd = simd<_Tp, simd_abi::fixed_size<_Np>>;
+
+template <typename _Tp, size_t _Np>
+  using __deduced_simd = simd<_Tp, simd_abi::deduce_t<_Tp, _Np>>;
+
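+// Example (illustrative): rebind_simd swaps the value_type while keeping the
+// element count; resize_simd keeps the value_type while changing the count:
+//   using _V  = native_simd<float>;      // e.g. 8 elements with AVX2
+//   using _Vi = rebind_simd_t<int, _V>;  // 8 ints
+//   using _V4 = resize_simd_t<4, _V>;    // 4 floats
+//   static_assert(_Vi::size() == _V::size());
+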
+// class template simd_mask [simd_mask] {{{1
+template <typename _Tp, typename _Abi = simd_abi::__default_abi<_Tp>>
+  class simd_mask;
+
+template <typename _Tp, typename _Abi>
+  struct is_simd_mask<simd_mask<_Tp, _Abi>> : public true_type {};
+
+template <typename _Tp>
+  using native_simd_mask = simd_mask<_Tp, simd_abi::native<_Tp>>;
+
+template <typename _Tp, int _Np>
+  using fixed_size_simd_mask = simd_mask<_Tp, simd_abi::fixed_size<_Np>>;
+
+template <typename _Tp, size_t _Np>
+  using __deduced_simd_mask = simd_mask<_Tp, simd_abi::deduce_t<_Tp, _Np>>;
+
+// casts [simd.casts] {{{1
+// static_simd_cast {{{2
+template <typename _Tp, typename _Up, typename _Ap, bool = is_simd_v<_Tp>,
+          typename = void>
+  struct __static_simd_cast_return_type;
+
+template <typename _Tp, typename _A0, typename _Up, typename _Ap>
+  struct __static_simd_cast_return_type<simd_mask<_Tp, _A0>, _Up, _Ap, false,
+                                        void>
+  : __static_simd_cast_return_type<simd<_Tp, _A0>, _Up, _Ap> {};
+
+template <typename _Tp, typename _Up, typename _Ap>
+  struct __static_simd_cast_return_type<
+    _Tp, _Up, _Ap, true, enable_if_t<_Tp::size() == simd_size_v<_Up, _Ap>>>
+  { using type = _Tp; };
+
+template <typename _Tp, typename _Ap>
+  struct __static_simd_cast_return_type<_Tp, _Tp, _Ap, false,
+#ifdef _GLIBCXX_SIMD_FIX_P2TS_ISSUE66
+                                        enable_if_t<__is_vectorizable_v<_Tp>>
+#else
+                                        void
+#endif
+                                        >
+  { using type = simd<_Tp, _Ap>; };
+
+template <typename _Tp, typename = void>
+  struct __safe_make_signed { using type = _Tp; };
+
+template <typename _Tp>
+  struct __safe_make_signed<_Tp, enable_if_t<is_integral_v<_Tp>>>
+  {
+    // the extra make_unsigned_t is because of PR85951
+    using type = make_signed_t<make_unsigned_t<_Tp>>;
+  };
+
+template <typename _Tp>
+  using safe_make_signed_t = typename __safe_make_signed<_Tp>::type;
+
+template <typename _Tp, typename _Up, typename _Ap>
+  struct __static_simd_cast_return_type<_Tp, _Up, _Ap, false,
+#ifdef _GLIBCXX_SIMD_FIX_P2TS_ISSUE66
+                                        enable_if_t<__is_vectorizable_v<_Tp>>
+#else
+                                        void
+#endif
+                                        >
+  {
+    using type = conditional_t<
+      (is_integral_v<_Up> && is_integral_v<_Tp> &&
+#ifndef _GLIBCXX_SIMD_FIX_P2TS_ISSUE65
+       is_signed_v<_Up> != is_signed_v<_Tp> &&
+#endif
+       is_same_v<safe_make_signed_t<_Up>, safe_make_signed_t<_Tp>>),
+      simd<_Tp, _Ap>, fixed_size_simd<_Tp, simd_size_v<_Up, _Ap>>>;
+  };
+
+template <typename _Tp, typename _Up, typename _Ap,
+          typename _R
+            = typename __static_simd_cast_return_type<_Tp, _Up, _Ap>::type>
+  _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR _R
+  static_simd_cast(const simd<_Up, _Ap>& __x)
+  {
+    if constexpr (is_same<_R, simd<_Up, _Ap>>::value)
+      return __x;
+    else
+      {
+        _SimdConverter<_Up, _Ap, typename _R::value_type, typename _R::abi_type>
+          __c;
+        return _R(__private_init, __c(__data(__x)));
+      }
+  }
+
+namespace __proposed {
+template <typename _Tp, typename _Up, typename _Ap,
+          typename _R
+            = typename __static_simd_cast_return_type<_Tp, _Up, _Ap>::type>
+  _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR typename _R::mask_type
+  static_simd_cast(const simd_mask<_Up, _Ap>& __x)
+  {
+    using _RM = typename _R::mask_type;
+    return {__private_init, _RM::abi_type::_MaskImpl::template _S_convert<
+                              typename _RM::simd_type::value_type>(__x)};
+  }
+} // namespace __proposed
+
+// simd_cast {{{2
+template <typename _Tp, typename _Up, typename _Ap,
+          typename _To = __value_type_or_identity_t<_Tp>>
+  _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR auto
+  simd_cast(const simd<_ValuePreserving<_Up, _To>, _Ap>& __x)
+    -> decltype(static_simd_cast<_Tp>(__x))
+  { return static_simd_cast<_Tp>(__x); }
+
+namespace __proposed {
+template <typename _Tp, typename _Up, typename _Ap,
typename _To = __value_type_or_identity_t<_Tp>> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR auto + simd_cast(const simd_mask<_ValuePreserving<_Up, _To>, _Ap>& __x) + -> decltype(static_simd_cast<_Tp>(__x)) + { return static_simd_cast<_Tp>(__x); } +} // namespace __proposed + +// }}}2 +// resizing_simd_cast {{{ +namespace __proposed { +/* Proposed spec: + +template <class T, class U, class Abi> +T resizing_simd_cast(const simd<U, Abi>& x) + +p1 Constraints: + - is_simd_v<T> is true and + - T::value_type is the same type as U + +p2 Returns: + A simd object with the i^th element initialized to x[i] for all i in the + range of [0, min(T::size(), simd_size_v<U, Abi>)). If T::size() is larger + than simd_size_v<U, Abi>, the remaining elements are value-initialized. + +template <class T, class U, class Abi> +T resizing_simd_cast(const simd_mask<U, Abi>& x) + +p1 Constraints: is_simd_mask_v<T> is true + +p2 Returns: + A simd_mask object with the i^th element initialized to x[i] for all i in +the range of [0, min(T::size(), simd_size_v<U, Abi>)). If T::size() is larger + than simd_size_v<U, Abi>, the remaining elements are initialized to false. + + */ + +template <typename _Tp, typename _Up, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR enable_if_t< + conjunction_v<is_simd<_Tp>, is_same<typename _Tp::value_type, _Up>>, _Tp> + resizing_simd_cast(const simd<_Up, _Ap>& __x) + { + if constexpr (is_same_v<typename _Tp::abi_type, _Ap>) + return __x; + else if constexpr (simd_size_v<_Up, _Ap> == 1) + { + _Tp __r{}; + __r[0] = __x[0]; + return __r; + } + else if constexpr (_Tp::size() == 1) + return __x[0]; + else if constexpr (sizeof(_Tp) == sizeof(__x) + && !__is_fixed_size_abi_v<_Ap>) + return {__private_init, + __vector_bitcast<typename _Tp::value_type, _Tp::size()>( + _Ap::_S_masked(__data(__x))._M_data)}; + else + { + _Tp __r{}; + __builtin_memcpy(&__data(__r), &__data(__x), + sizeof(_Up) + * std::min(_Tp::size(), simd_size_v<_Up, _Ap>)); + return __r; + } + } + +template <typename _Tp, typename _Up, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR + enable_if_t<is_simd_mask_v<_Tp>, _Tp> + resizing_simd_cast(const simd_mask<_Up, _Ap>& __x) + { + return {__private_init, _Tp::abi_type::_MaskImpl::template _S_convert< + typename _Tp::simd_type::value_type>(__x)}; + } +} // namespace __proposed + +// }}} +// to_fixed_size {{{2 +template <typename _Tp, int _Np> + _GLIBCXX_SIMD_INTRINSIC fixed_size_simd<_Tp, _Np> + to_fixed_size(const fixed_size_simd<_Tp, _Np>& __x) + { return __x; } + +template <typename _Tp, int _Np> + _GLIBCXX_SIMD_INTRINSIC fixed_size_simd_mask<_Tp, _Np> + to_fixed_size(const fixed_size_simd_mask<_Tp, _Np>& __x) + { return __x; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC auto + to_fixed_size(const simd<_Tp, _Ap>& __x) + { + return simd<_Tp, simd_abi::fixed_size<simd_size_v<_Tp, _Ap>>>([&__x]( + auto __i) constexpr { return __x[__i]; }); + } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC auto + to_fixed_size(const simd_mask<_Tp, _Ap>& __x) + { + constexpr int _Np = simd_mask<_Tp, _Ap>::size(); + fixed_size_simd_mask<_Tp, _Np> __r; + __execute_n_times<_Np>([&](auto __i) constexpr { __r[__i] = __x[__i]; }); + return __r; + } + +// to_native {{{2 +template <typename _Tp, int _Np> + _GLIBCXX_SIMD_INTRINSIC + enable_if_t<(_Np == native_simd<_Tp>::size()), native_simd<_Tp>> + to_native(const fixed_size_simd<_Tp, _Np>& __x) + { + alignas(memory_alignment_v<native_simd<_Tp>>) _Tp __mem[_Np]; + __x.copy_to(__mem, 
vector_aligned); + return {__mem, vector_aligned}; + } + +template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC + enable_if_t<(_Np == native_simd_mask<_Tp>::size()), native_simd_mask<_Tp>> + to_native(const fixed_size_simd_mask<_Tp, _Np>& __x) + { + return native_simd_mask<_Tp>([&](auto __i) constexpr { return __x[__i]; }); + } + +// to_compatible {{{2 +template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC enable_if_t<(_Np == simd<_Tp>::size()), simd<_Tp>> + to_compatible(const simd<_Tp, simd_abi::fixed_size<_Np>>& __x) + { + alignas(memory_alignment_v<simd<_Tp>>) _Tp __mem[_Np]; + __x.copy_to(__mem, vector_aligned); + return {__mem, vector_aligned}; + } + +template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC + enable_if_t<(_Np == simd_mask<_Tp>::size()), simd_mask<_Tp>> + to_compatible(const simd_mask<_Tp, simd_abi::fixed_size<_Np>>& __x) + { return simd_mask<_Tp>([&](auto __i) constexpr { return __x[__i]; }); } + +// masked assignment [simd_mask.where] {{{1 + +// where_expression {{{1 +// const_where_expression<M, T> {{{2 +template <typename _M, typename _Tp> + class const_where_expression + { + using _V = _Tp; + static_assert(is_same_v<_V, __remove_cvref_t<_Tp>>); + + struct _Wrapper { using value_type = _V; }; + + protected: + using _Impl = typename _V::_Impl; + + using value_type = + typename conditional_t<is_arithmetic_v<_V>, _Wrapper, _V>::value_type; + + _GLIBCXX_SIMD_INTRINSIC friend const _M& + __get_mask(const const_where_expression& __x) + { return __x._M_k; } + + _GLIBCXX_SIMD_INTRINSIC friend const _Tp& + __get_lvalue(const const_where_expression& __x) + { return __x._M_value; } + + const _M& _M_k; + _Tp& _M_value; + + public: + const_where_expression(const const_where_expression&) = delete; + const_where_expression& operator=(const const_where_expression&) = delete; + + _GLIBCXX_SIMD_INTRINSIC const_where_expression(const _M& __kk, const _Tp& dd) + : _M_k(__kk), _M_value(const_cast<_Tp&>(dd)) {} + + _GLIBCXX_SIMD_INTRINSIC _V + operator-() const&& + { + return {__private_init, + _Impl::template _S_masked_unary<negate>(__data(_M_k), + __data(_M_value))}; + } + + template <typename _Up, typename _Flags> + [[nodiscard]] _GLIBCXX_SIMD_INTRINSIC _V + copy_from(const _LoadStorePtr<_Up, value_type>* __mem, _Flags) const&& + { + return {__private_init, + _Impl::_S_masked_load(__data(_M_value), __data(_M_k), + _Flags::template _S_apply<_V>(__mem))}; + } + + template <typename _Up, typename _Flags> + _GLIBCXX_SIMD_INTRINSIC void + copy_to(_LoadStorePtr<_Up, value_type>* __mem, _Flags) const&& + { + _Impl::_S_masked_store(__data(_M_value), + _Flags::template _S_apply<_V>(__mem), + __data(_M_k)); + } + }; + +// const_where_expression<bool, T> {{{2 +template <typename _Tp> + class const_where_expression<bool, _Tp> + { + using _M = bool; + using _V = _Tp; + + static_assert(is_same_v<_V, __remove_cvref_t<_Tp>>); + + struct _Wrapper { using value_type = _V; }; + + protected: + using value_type = + typename conditional_t<is_arithmetic_v<_V>, _Wrapper, _V>::value_type; + + _GLIBCXX_SIMD_INTRINSIC friend const _M& + __get_mask(const const_where_expression& __x) + { return __x._M_k; } + + _GLIBCXX_SIMD_INTRINSIC friend const _Tp& + __get_lvalue(const const_where_expression& __x) + { return __x._M_value; } + + const bool _M_k; + _Tp& _M_value; + + public: + const_where_expression(const const_where_expression&) = delete; + const_where_expression& operator=(const const_where_expression&) = delete; + + _GLIBCXX_SIMD_INTRINSIC const_where_expression(const bool __kk, const 
_Tp& dd) + : _M_k(__kk), _M_value(const_cast<_Tp&>(dd)) {} + + _GLIBCXX_SIMD_INTRINSIC _V operator-() const&& + { return _M_k ? -_M_value : _M_value; } + + template <typename _Up, typename _Flags> + [[nodiscard]] _GLIBCXX_SIMD_INTRINSIC _V + copy_from(const _LoadStorePtr<_Up, value_type>* __mem, _Flags) const&& + { return _M_k ? static_cast<_V>(__mem[0]) : _M_value; } + + template <typename _Up, typename _Flags> + _GLIBCXX_SIMD_INTRINSIC void + copy_to(_LoadStorePtr<_Up, value_type>* __mem, _Flags) const&& + { + if (_M_k) + __mem[0] = _M_value; + } + }; + +// where_expression<M, T> {{{2 +template <typename _M, typename _Tp> + class where_expression : public const_where_expression<_M, _Tp> + { + using _Impl = typename const_where_expression<_M, _Tp>::_Impl; + + static_assert(!is_const<_Tp>::value, + "where_expression may only be instantiated with a non-const " + "_Tp parameter"); + + using typename const_where_expression<_M, _Tp>::value_type; + using const_where_expression<_M, _Tp>::_M_k; + using const_where_expression<_M, _Tp>::_M_value; + + static_assert( + is_same<typename _M::abi_type, typename _Tp::abi_type>::value, ""); + static_assert(_M::size() == _Tp::size(), ""); + + _GLIBCXX_SIMD_INTRINSIC friend _Tp& __get_lvalue(where_expression& __x) + { return __x._M_value; } + + public: + where_expression(const where_expression&) = delete; + where_expression& operator=(const where_expression&) = delete; + + _GLIBCXX_SIMD_INTRINSIC where_expression(const _M& __kk, _Tp& dd) + : const_where_expression<_M, _Tp>(__kk, dd) {} + + template <typename _Up> + _GLIBCXX_SIMD_INTRINSIC void operator=(_Up&& __x) && + { + _Impl::_S_masked_assign(__data(_M_k), __data(_M_value), + __to_value_type_or_member_type<_Tp>( + static_cast<_Up&&>(__x))); + } + +#define _GLIBCXX_SIMD_OP_(__op, __name) \ + template <typename _Up> \ + _GLIBCXX_SIMD_INTRINSIC void operator __op##=(_Up&& __x)&& \ + { \ + _Impl::template _S_masked_cassign( \ + __data(_M_k), __data(_M_value), \ + __to_value_type_or_member_type<_Tp>(static_cast<_Up&&>(__x)), \ + [](auto __impl, auto __lhs, auto __rhs) constexpr { \ + return __impl.__name(__lhs, __rhs); \ + }); \ + } \ + static_assert(true) + _GLIBCXX_SIMD_OP_(+, _S_plus); + _GLIBCXX_SIMD_OP_(-, _S_minus); + _GLIBCXX_SIMD_OP_(*, _S_multiplies); + _GLIBCXX_SIMD_OP_(/, _S_divides); + _GLIBCXX_SIMD_OP_(%, _S_modulus); + _GLIBCXX_SIMD_OP_(&, _S_bit_and); + _GLIBCXX_SIMD_OP_(|, _S_bit_or); + _GLIBCXX_SIMD_OP_(^, _S_bit_xor); + _GLIBCXX_SIMD_OP_(<<, _S_shift_left); + _GLIBCXX_SIMD_OP_(>>, _S_shift_right); +#undef _GLIBCXX_SIMD_OP_ + + _GLIBCXX_SIMD_INTRINSIC void operator++() && + { + __data(_M_value) + = _Impl::template _S_masked_unary<__increment>(__data(_M_k), + __data(_M_value)); + } + + _GLIBCXX_SIMD_INTRINSIC void operator++(int) && + { + __data(_M_value) + = _Impl::template _S_masked_unary<__increment>(__data(_M_k), + __data(_M_value)); + } + + _GLIBCXX_SIMD_INTRINSIC void operator--() && + { + __data(_M_value) + = _Impl::template _S_masked_unary<__decrement>(__data(_M_k), + __data(_M_value)); + } + + _GLIBCXX_SIMD_INTRINSIC void operator--(int) && + { + __data(_M_value) + = _Impl::template _S_masked_unary<__decrement>(__data(_M_k), + __data(_M_value)); + } + + // intentionally hides const_where_expression::copy_from + template <typename _Up, typename _Flags> + _GLIBCXX_SIMD_INTRINSIC void + copy_from(const _LoadStorePtr<_Up, value_type>* __mem, _Flags) && + { + __data(_M_value) + = _Impl::_S_masked_load(__data(_M_value), __data(_M_k), + _Flags::template _S_apply<_Tp>(__mem)); + } + }; + 
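+// Example (illustrative only; `v` and the literals below are hypothetical +// user code, not part of this patch): masked assignment through where(): +//   native_simd<float> v = ...; +//   where(v < 0.f, v) = 0.f;    // set all negative elements to zero +//   where(v > 1.f, v) *= 0.5f;  // compound ops apply only to the elements +//                               // selected by the mask +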
+// where_expression<bool, T> {{{2 +template <typename _Tp> + class where_expression<bool, _Tp> : public const_where_expression<bool, _Tp> + { + using _M = bool; + using typename const_where_expression<_M, _Tp>::value_type; + using const_where_expression<_M, _Tp>::_M_k; + using const_where_expression<_M, _Tp>::_M_value; + + public: + where_expression(const where_expression&) = delete; + where_expression& operator=(const where_expression&) = delete; + + _GLIBCXX_SIMD_INTRINSIC where_expression(const _M& __kk, _Tp& dd) + : const_where_expression<_M, _Tp>(__kk, dd) {} + +#define _GLIBCXX_SIMD_OP_(__op) \ + template <typename _Up> \ + _GLIBCXX_SIMD_INTRINSIC void operator __op(_Up&& __x)&& \ + { if (_M_k) _M_value __op static_cast<_Up&&>(__x); } + + _GLIBCXX_SIMD_OP_(=) + _GLIBCXX_SIMD_OP_(+=) + _GLIBCXX_SIMD_OP_(-=) + _GLIBCXX_SIMD_OP_(*=) + _GLIBCXX_SIMD_OP_(/=) + _GLIBCXX_SIMD_OP_(%=) + _GLIBCXX_SIMD_OP_(&=) + _GLIBCXX_SIMD_OP_(|=) + _GLIBCXX_SIMD_OP_(^=) + _GLIBCXX_SIMD_OP_(<<=) + _GLIBCXX_SIMD_OP_(>>=) + #undef _GLIBCXX_SIMD_OP_ + + _GLIBCXX_SIMD_INTRINSIC void operator++() && + { if (_M_k) ++_M_value; } + + _GLIBCXX_SIMD_INTRINSIC void operator++(int) && + { if (_M_k) ++_M_value; } + + _GLIBCXX_SIMD_INTRINSIC void operator--() && + { if (_M_k) --_M_value; } + + _GLIBCXX_SIMD_INTRINSIC void operator--(int) && + { if (_M_k) --_M_value; } + + // intentionally hides const_where_expression::copy_from + template <typename _Up, typename _Flags> + _GLIBCXX_SIMD_INTRINSIC void + copy_from(const _LoadStorePtr<_Up, value_type>* __mem, _Flags) && + { if (_M_k) _M_value = __mem[0]; } + }; + +// where {{{1 +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC where_expression<simd_mask<_Tp, _Ap>, simd<_Tp, _Ap>> + where(const typename simd<_Tp, _Ap>::mask_type& __k, simd<_Tp, _Ap>& __value) + { return {__k, __value}; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC + const_where_expression<simd_mask<_Tp, _Ap>, simd<_Tp, _Ap>> + where(const typename simd<_Tp, _Ap>::mask_type& __k, + const simd<_Tp, _Ap>& __value) + { return {__k, __value}; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC + where_expression<simd_mask<_Tp, _Ap>, simd_mask<_Tp, _Ap>> + where(const remove_const_t<simd_mask<_Tp, _Ap>>& __k, + simd_mask<_Tp, _Ap>& __value) + { return {__k, __value}; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC + const_where_expression<simd_mask<_Tp, _Ap>, simd_mask<_Tp, _Ap>> + where(const remove_const_t<simd_mask<_Tp, _Ap>>& __k, + const simd_mask<_Tp, _Ap>& __value) + { return {__k, __value}; } + +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC where_expression<bool, _Tp> + where(_ExactBool __k, _Tp& __value) + { return {__k, __value}; } + +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC const_where_expression<bool, _Tp> + where(_ExactBool __k, const _Tp& __value) + { return {__k, __value}; } + + template <typename _Tp, typename _Ap> + void where(bool __k, simd<_Tp, _Ap>& __value) = delete; + + template <typename _Tp, typename _Ap> + void where(bool __k, const simd<_Tp, _Ap>& __value) = delete; + +// proposed mask iterations {{{1 +namespace __proposed { +template <size_t _Np> + class where_range + { + const bitset<_Np> __bits; + + public: + where_range(bitset<_Np> __b) : __bits(__b) {} + + class iterator + { + size_t __mask; + size_t __bit; + + _GLIBCXX_SIMD_INTRINSIC void __next_bit() + { __bit = __builtin_ctzl(__mask); } + + _GLIBCXX_SIMD_INTRINSIC void __reset_lsb() + { + // 01100100 - 1 = 01100011 + __mask &= (__mask - 1); + 
// __asm__("btr %1,%0" : "+r"(__mask) : "r"(__bit)); + } + + public: + iterator(decltype(__mask) __m) : __mask(__m) { __next_bit(); } + iterator(const iterator&) = default; + iterator(iterator&&) = default; + + _GLIBCXX_SIMD_ALWAYS_INLINE size_t operator->() const + { return __bit; } + + _GLIBCXX_SIMD_ALWAYS_INLINE size_t operator*() const + { return __bit; } + + _GLIBCXX_SIMD_ALWAYS_INLINE iterator& operator++() + { + __reset_lsb(); + __next_bit(); + return *this; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE iterator operator++(int) + { + iterator __tmp = *this; + __reset_lsb(); + __next_bit(); + return __tmp; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE bool operator==(const iterator& __rhs) const + { return __mask == __rhs.__mask; } + + _GLIBCXX_SIMD_ALWAYS_INLINE bool operator!=(const iterator& __rhs) const + { return __mask != __rhs.__mask; } + }; + + iterator begin() const + { return __bits.to_ullong(); } + + iterator end() const + { return 0; } + }; + +template <typename _Tp, typename _Ap> + where_range<simd_size_v<_Tp, _Ap>> + where(const simd_mask<_Tp, _Ap>& __k) + { return __k.__to_bitset(); } + +} // namespace __proposed + +// }}}1 +// reductions [simd.reductions] {{{1 + template <typename _Tp, typename _Abi, typename _BinaryOperation = plus<>> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR _Tp + reduce(const simd<_Tp, _Abi>& __v, + _BinaryOperation __binary_op = _BinaryOperation()) + { return _Abi::_SimdImpl::_S_reduce(__v, __binary_op); } + +template <typename _M, typename _V, typename _BinaryOperation = plus<>> + _GLIBCXX_SIMD_INTRINSIC typename _V::value_type + reduce(const const_where_expression<_M, _V>& __x, + typename _V::value_type __identity_element, + _BinaryOperation __binary_op) + { + if (__builtin_expect(none_of(__get_mask(__x)), false)) + return __identity_element; + + _V __tmp = __identity_element; + _V::_Impl::_S_masked_assign(__data(__get_mask(__x)), __data(__tmp), + __data(__get_lvalue(__x))); + return reduce(__tmp, __binary_op); + } + +template <typename _M, typename _V> + _GLIBCXX_SIMD_INTRINSIC typename _V::value_type + reduce(const const_where_expression<_M, _V>& __x, plus<> __binary_op = {}) + { return reduce(__x, 0, __binary_op); } + +template <typename _M, typename _V> + _GLIBCXX_SIMD_INTRINSIC typename _V::value_type + reduce(const const_where_expression<_M, _V>& __x, multiplies<> __binary_op) + { return reduce(__x, 1, __binary_op); } + +template <typename _M, typename _V> + _GLIBCXX_SIMD_INTRINSIC typename _V::value_type + reduce(const const_where_expression<_M, _V>& __x, bit_and<> __binary_op) + { return reduce(__x, ~typename _V::value_type(), __binary_op); } + +template <typename _M, typename _V> + _GLIBCXX_SIMD_INTRINSIC typename _V::value_type + reduce(const const_where_expression<_M, _V>& __x, bit_or<> __binary_op) + { return reduce(__x, 0, __binary_op); } + +template <typename _M, typename _V> + _GLIBCXX_SIMD_INTRINSIC typename _V::value_type + reduce(const const_where_expression<_M, _V>& __x, bit_xor<> __binary_op) + { return reduce(__x, 0, __binary_op); } + +// }}}1 +// algorithms [simd.alg] {{{ +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR simd<_Tp, _Ap> + min(const simd<_Tp, _Ap>& __a, const simd<_Tp, _Ap>& __b) + { return {__private_init, _Ap::_SimdImpl::_S_min(__data(__a), __data(__b))}; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR simd<_Tp, _Ap> + max(const simd<_Tp, _Ap>& __a, const simd<_Tp, _Ap>& __b) + { return {__private_init, 
_Ap::_SimdImpl::_S_max(__data(__a), __data(__b))}; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR + pair<simd<_Tp, _Ap>, simd<_Tp, _Ap>> + minmax(const simd<_Tp, _Ap>& __a, const simd<_Tp, _Ap>& __b) + { + const auto pair_of_members + = _Ap::_SimdImpl::_S_minmax(__data(__a), __data(__b)); + return {simd<_Tp, _Ap>(__private_init, pair_of_members.first), + simd<_Tp, _Ap>(__private_init, pair_of_members.second)}; + } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR simd<_Tp, _Ap> + clamp(const simd<_Tp, _Ap>& __v, const simd<_Tp, _Ap>& __lo, + const simd<_Tp, _Ap>& __hi) + { + using _Impl = typename _Ap::_SimdImpl; + return {__private_init, + _Impl::_S_min(__data(__hi), + _Impl::_S_max(__data(__lo), __data(__v)))}; + } + +// }}} + +template <size_t... _Sizes, typename _Tp, typename _Ap, + typename = enable_if_t<((_Sizes + ...) == simd<_Tp, _Ap>::size())>> + inline tuple<simd<_Tp, simd_abi::deduce_t<_Tp, _Sizes>>...> + split(const simd<_Tp, _Ap>&); + +// __extract_part {{{ +template <int _Index, int _Total, int _Combine = 1, typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_CONST + _SimdWrapper<_Tp, _Np / _Total * _Combine> + __extract_part(const _SimdWrapper<_Tp, _Np> __x); + +template <int Index, int Parts, int _Combine = 1, typename _Tp, typename _A0, + typename... _As> + _GLIBCXX_SIMD_INTRINSIC auto + __extract_part(const _SimdTuple<_Tp, _A0, _As...>& __x); + +// }}} +// _SizeList {{{ +template <size_t _V0, size_t... _Values> + struct _SizeList + { + template <size_t _I> + static constexpr size_t _S_at(_SizeConstant<_I> = {}) + { + if constexpr (_I == 0) + return _V0; + else + return _SizeList<_Values...>::template _S_at<_I - 1>(); + } + + template <size_t _I> + static constexpr auto _S_before(_SizeConstant<_I> = {}) + { + if constexpr (_I == 0) + return _SizeConstant<0>(); + else + return _SizeConstant< + _V0 + _SizeList<_Values...>::template _S_before<_I - 1>()>(); + } + + template <size_t _Np> + static constexpr auto _S_pop_front(_SizeConstant<_Np> = {}) + { + if constexpr (_Np == 0) + return _SizeList(); + else + return _SizeList<_Values...>::template _S_pop_front<_Np - 1>(); + } + }; + +// }}} +// __extract_center {{{ +template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC _SimdWrapper<_Tp, _Np / 2> + __extract_center(_SimdWrapper<_Tp, _Np> __x) + { + static_assert(_Np >= 4); + static_assert(_Np % 4 == 0); // x0 - x1 - x2 - x3 -> return {x1, x2} +#if _GLIBCXX_SIMD_X86INTRIN // {{{ + if constexpr (__have_avx512f && sizeof(_Tp) * _Np == 64) + { + const auto __intrin = __to_intrin(__x); + if constexpr (is_integral_v<_Tp>) + return __vector_bitcast<_Tp>(_mm512_castsi512_si256( + _mm512_shuffle_i32x4(__intrin, __intrin, + 1 + 2 * 0x4 + 2 * 0x10 + 3 * 0x40))); + else if constexpr (sizeof(_Tp) == 4) + return __vector_bitcast<_Tp>(_mm512_castps512_ps256( + _mm512_shuffle_f32x4(__intrin, __intrin, + 1 + 2 * 0x4 + 2 * 0x10 + 3 * 0x40))); + else if constexpr (sizeof(_Tp) == 8) + return __vector_bitcast<_Tp>(_mm512_castpd512_pd256( + _mm512_shuffle_f64x2(__intrin, __intrin, + 1 + 2 * 0x4 + 2 * 0x10 + 3 * 0x40))); + else + __assert_unreachable<_Tp>(); + } + else if constexpr (sizeof(_Tp) * _Np == 32 && is_floating_point_v<_Tp>) + return __vector_bitcast<_Tp>( + _mm_shuffle_pd(__lo128(__vector_bitcast<double>(__x)), + __hi128(__vector_bitcast<double>(__x)), 1)); + else if constexpr (sizeof(__x) == 32 && sizeof(_Tp) * _Np <= 32) + return __vector_bitcast<_Tp>( + 
_mm_alignr_epi8(__hi128(__vector_bitcast<_LLong>(__x)), + __lo128(__vector_bitcast<_LLong>(__x)), + sizeof(_Tp) * _Np / 4)); + else +#endif // _GLIBCXX_SIMD_X86INTRIN }}} + { + __vector_type_t<_Tp, _Np / 2> __r; + __builtin_memcpy(&__r, + reinterpret_cast<const char*>(&__x) + + sizeof(_Tp) * _Np / 4, + sizeof(_Tp) * _Np / 2); + return __r; + } + } + +template <typename _Tp, typename _A0, typename... _As> + _GLIBCXX_SIMD_INTRINSIC + _SimdWrapper<_Tp, _SimdTuple<_Tp, _A0, _As...>::_S_size() / 2> + __extract_center(const _SimdTuple<_Tp, _A0, _As...>& __x) + { + if constexpr (sizeof...(_As) == 0) + return __extract_center(__x.first); + else + return __extract_part<1, 4, 2>(__x); + } + +// }}} +// __split_wrapper {{{ +template <size_t... _Sizes, typename _Tp, typename... _As> + auto + __split_wrapper(_SizeList<_Sizes...>, const _SimdTuple<_Tp, _As...>& __x) + { + return split<_Sizes...>( + fixed_size_simd<_Tp, _SimdTuple<_Tp, _As...>::_S_size()>(__private_init, + __x)); + } + +// }}} + +// split<simd>(simd) {{{ +template <typename _V, typename _Ap, + size_t Parts = simd_size_v<typename _V::value_type, _Ap> / _V::size()> + enable_if_t<simd_size_v<typename _V::value_type, _Ap> == Parts * _V::size() + && is_simd_v<_V>, array<_V, Parts>> + split(const simd<typename _V::value_type, _Ap>& __x) + { + using _Tp = typename _V::value_type; + if constexpr (Parts == 1) + { + return {simd_cast<_V>(__x)}; + } + else if (__x._M_is_constprop()) + { + return __generate_from_n_evaluations<Parts, array<_V, Parts>>([&]( + auto __i) constexpr { + return _V([&](auto __j) constexpr { + return __x[__i * _V::size() + __j]; + }); + }); + } + else if constexpr ( + __is_fixed_size_abi_v<_Ap> + && (is_same_v<typename _V::abi_type, simd_abi::scalar> + || (__is_fixed_size_abi_v<typename _V::abi_type> + && sizeof(_V) == sizeof(_Tp) * _V::size() // _V doesn't have padding + ))) + { + // fixed_size -> fixed_size (w/o padding) or scalar +#ifdef _GLIBCXX_SIMD_USE_ALIASING_LOADS + const __may_alias<_Tp>* const __element_ptr + = reinterpret_cast<const __may_alias<_Tp>*>(&__data(__x)); + return __generate_from_n_evaluations<Parts, array<_V, Parts>>([&]( + auto __i) constexpr { + return _V(__element_ptr + __i * _V::size(), vector_aligned); + }); +#else + const auto& __xx = __data(__x); + return __generate_from_n_evaluations<Parts, array<_V, Parts>>([&]( + auto __i) constexpr { + [[maybe_unused]] constexpr size_t __offset + = decltype(__i)::value * _V::size(); + return _V([&](auto __j) constexpr { + constexpr _SizeConstant<__j + __offset> __k; + return __xx[__k]; + }); + }); +#endif + } + else if constexpr (is_same_v<typename _V::abi_type, simd_abi::scalar>) + { + // normally memcpy should work here as well + return __generate_from_n_evaluations<Parts, array<_V, Parts>>([&]( + auto __i) constexpr { return __x[__i]; }); + } + else + { + return __generate_from_n_evaluations<Parts, array<_V, Parts>>([&]( + auto __i) constexpr { + if constexpr (__is_fixed_size_abi_v<typename _V::abi_type>) + return _V([&](auto __j) constexpr { + return __x[__i * _V::size() + __j]; + }); + else + return _V(__private_init, + __extract_part<decltype(__i)::value, Parts>(__data(__x))); + }); + } + } + +// }}} +// split<simd_mask>(simd_mask) {{{ +template <typename _V, typename _Ap, + size_t _Parts + = simd_size_v<typename _V::simd_type::value_type, _Ap> / _V::size()> + enable_if_t<is_simd_mask_v<_V> && simd_size_v<typename + _V::simd_type::value_type, _Ap> == _Parts * _V::size(), array<_V, _Parts>> + split(const simd_mask<typename _V::simd_type::value_type, 
_Ap>& __x) + { + if constexpr (is_same_v<_Ap, typename _V::abi_type>) + return {__x}; + else if constexpr (_Parts == 1) + return {__proposed::static_simd_cast<_V>(__x)}; + else if constexpr (_Parts == 2 && __is_sse_abi<typename _V::abi_type>() + && __is_avx_abi<_Ap>()) + return {_V(__private_init, __lo128(__data(__x))), + _V(__private_init, __hi128(__data(__x)))}; + else if constexpr (_V::size() <= __CHAR_BIT__ * sizeof(_ULLong)) + { + const bitset __bits = __x.__to_bitset(); + return __generate_from_n_evaluations<_Parts, array<_V, _Parts>>([&]( + auto __i) constexpr { + constexpr size_t __offset = __i * _V::size(); + return _V(__bitset_init, (__bits >> __offset).to_ullong()); + }); + } + else + { + return __generate_from_n_evaluations<_Parts, array<_V, _Parts>>([&]( + auto __i) constexpr { + constexpr size_t __offset = __i * _V::size(); + return _V( + __private_init, [&](auto __j) constexpr { + return __x[__j + __offset]; + }); + }); + } + } + +// }}} +// split<_Sizes...>(simd) {{{ +template <size_t... _Sizes, typename _Tp, typename _Ap, typename> + _GLIBCXX_SIMD_ALWAYS_INLINE + tuple<simd<_Tp, simd_abi::deduce_t<_Tp, _Sizes>>...> + split(const simd<_Tp, _Ap>& __x) + { + using _SL = _SizeList<_Sizes...>; + using _Tuple = tuple<__deduced_simd<_Tp, _Sizes>...>; + constexpr size_t _Np = simd_size_v<_Tp, _Ap>; + constexpr size_t _N0 = _SL::template _S_at<0>(); + using _V = __deduced_simd<_Tp, _N0>; + + if (__x._M_is_constprop()) + return __generate_from_n_evaluations<sizeof...(_Sizes), _Tuple>([&]( + auto __i) constexpr { + using _Vi = __deduced_simd<_Tp, _SL::_S_at(__i)>; + constexpr size_t __offset = _SL::_S_before(__i); + return _Vi([&](auto __j) constexpr { return __x[__offset + __j]; }); + }); + else if constexpr (_Np == _N0) + { + static_assert(sizeof...(_Sizes) == 1); + return {simd_cast<_V>(__x)}; + } + else if constexpr // split from fixed_size, such that __x::first.size == _N0 + (__is_fixed_size_abi_v< + _Ap> && __fixed_size_storage_t<_Tp, _Np>::_S_first_size == _N0) + { + static_assert( + !__is_fixed_size_abi_v<typename _V::abi_type>, + "How can <_Tp, _Np> be a single _SimdTuple entry but a " + "fixed_size_simd " + "when deduced?"); + // extract first and recurse (__split_wrapper is needed to deduce a new + // _Sizes pack) + return tuple_cat(make_tuple(_V(__private_init, __data(__x).first)), + __split_wrapper(_SL::template _S_pop_front<1>(), + __data(__x).second)); + } + else if constexpr ((!is_same_v<simd_abi::scalar, + simd_abi::deduce_t<_Tp, _Sizes>> && ...) 
+ && (!__is_fixed_size_abi_v< + simd_abi::deduce_t<_Tp, _Sizes>> && ...)) + { + if constexpr (((_Sizes * 2 == _Np) && ...)) + return {{__private_init, __extract_part<0, 2>(__data(__x))}, + {__private_init, __extract_part<1, 2>(__data(__x))}}; + else if constexpr (is_same_v<_SizeList<_Sizes...>, + _SizeList<_Np / 3, _Np / 3, _Np / 3>>) + return {{__private_init, __extract_part<0, 3>(__data(__x))}, + {__private_init, __extract_part<1, 3>(__data(__x))}, + {__private_init, __extract_part<2, 3>(__data(__x))}}; + else if constexpr (is_same_v<_SizeList<_Sizes...>, + _SizeList<2 * _Np / 3, _Np / 3>>) + return {{__private_init, __extract_part<0, 3, 2>(__data(__x))}, + {__private_init, __extract_part<2, 3>(__data(__x))}}; + else if constexpr (is_same_v<_SizeList<_Sizes...>, + _SizeList<_Np / 3, 2 * _Np / 3>>) + return {{__private_init, __extract_part<0, 3>(__data(__x))}, + {__private_init, __extract_part<1, 3, 2>(__data(__x))}}; + else if constexpr (is_same_v<_SizeList<_Sizes...>, + _SizeList<_Np / 2, _Np / 4, _Np / 4>>) + return {{__private_init, __extract_part<0, 2>(__data(__x))}, + {__private_init, __extract_part<2, 4>(__data(__x))}, + {__private_init, __extract_part<3, 4>(__data(__x))}}; + else if constexpr (is_same_v<_SizeList<_Sizes...>, + _SizeList<_Np / 4, _Np / 4, _Np / 2>>) + return {{__private_init, __extract_part<0, 4>(__data(__x))}, + {__private_init, __extract_part<1, 4>(__data(__x))}, + {__private_init, __extract_part<1, 2>(__data(__x))}}; + else if constexpr (is_same_v<_SizeList<_Sizes...>, + _SizeList<_Np / 4, _Np / 2, _Np / 4>>) + return {{__private_init, __extract_part<0, 4>(__data(__x))}, + {__private_init, __extract_center(__data(__x))}, + {__private_init, __extract_part<3, 4>(__data(__x))}}; + else if constexpr (((_Sizes * 4 == _Np) && ...)) + return {{__private_init, __extract_part<0, 4>(__data(__x))}, + {__private_init, __extract_part<1, 4>(__data(__x))}, + {__private_init, __extract_part<2, 4>(__data(__x))}, + {__private_init, __extract_part<3, 4>(__data(__x))}}; + // else fall through + } +#ifdef _GLIBCXX_SIMD_USE_ALIASING_LOADS + const __may_alias<_Tp>* const __element_ptr + = reinterpret_cast<const __may_alias<_Tp>*>(&__x); + return __generate_from_n_evaluations<sizeof...(_Sizes), _Tuple>([&]( + auto __i) constexpr { + using _Vi = __deduced_simd<_Tp, _SL::_S_at(__i)>; + constexpr size_t __offset = _SL::_S_before(__i); + constexpr size_t __base_align = alignof(simd<_Tp, _Ap>); + constexpr size_t __a + = __base_align - ((__offset * sizeof(_Tp)) % __base_align); + constexpr size_t __b = ((__a - 1) & __a) ^ __a; + constexpr size_t __alignment = __b == 0 ? __a : __b; + return _Vi(__element_ptr + __offset, overaligned<__alignment>); + }); +#else + return __generate_from_n_evaluations<sizeof...(_Sizes), _Tuple>([&]( + auto __i) constexpr { + using _Vi = __deduced_simd<_Tp, _SL::_S_at(__i)>; + const auto& __xx = __data(__x); + using _Offset = decltype(_SL::_S_before(__i)); + return _Vi([&](auto __j) constexpr { + constexpr _SizeConstant<_Offset::value + __j> __k; + return __xx[__k]; + }); + }); +#endif + } + +// }}} + +// __subscript_in_pack {{{ +template <size_t _I, typename _Tp, typename _Ap, typename... _As> + _GLIBCXX_SIMD_INTRINSIC constexpr _Tp + __subscript_in_pack(const simd<_Tp, _Ap>& __x, const simd<_Tp, _As>&... __xs) + { + if constexpr (_I < simd_size_v<_Tp, _Ap>) + return __x[_I]; + else + return __subscript_in_pack<_I - simd_size_v<_Tp, _Ap>>(__xs...); + } + +// }}} +// __store_pack_of_simd {{{ +template <typename _Tp, typename _A0, typename... 
_As> + _GLIBCXX_SIMD_INTRINSIC void + __store_pack_of_simd(char* __mem, const simd<_Tp, _A0>& __x0, + const simd<_Tp, _As>&... __xs) + { + constexpr size_t __n_bytes = sizeof(_Tp) * simd_size_v<_Tp, _A0>; + __builtin_memcpy(__mem, &__data(__x0), __n_bytes); + if constexpr (sizeof...(__xs) > 0) + __store_pack_of_simd(__mem + __n_bytes, __xs...); + } + +// }}} +// concat(simd...) {{{ +template <typename _Tp, typename... _As> + inline _GLIBCXX_SIMD_CONSTEXPR + simd<_Tp, simd_abi::deduce_t<_Tp, (simd_size_v<_Tp, _As> + ...)>> + concat(const simd<_Tp, _As>&... __xs) + { + using _Rp = __deduced_simd<_Tp, (simd_size_v<_Tp, _As> + ...)>; + if constexpr (sizeof...(__xs) == 1) + return simd_cast<_Rp>(__xs...); + else if ((... && __xs._M_is_constprop())) + return simd<_Tp, + simd_abi::deduce_t<_Tp, (simd_size_v<_Tp, _As> + ...)>>([&]( + auto __i) constexpr { return __subscript_in_pack<__i>(__xs...); }); + else + { + _Rp __r{}; + __store_pack_of_simd(reinterpret_cast<char*>(&__data(__r)), __xs...); + return __r; + } + } + +// }}} +// concat(array<simd>) {{{ +template <typename _Tp, typename _Abi, size_t _Np> + _GLIBCXX_SIMD_ALWAYS_INLINE + _GLIBCXX_SIMD_CONSTEXPR __deduced_simd<_Tp, simd_size_v<_Tp, _Abi> * _Np> + concat(const array<simd<_Tp, _Abi>, _Np>& __x) + { + return __call_with_subscripts<_Np>(__x, [](const auto&... __xs) { + return concat(__xs...); + }); + } + +// }}} + +// _SmartReference {{{ +template <typename _Up, typename _Accessor = _Up, + typename _ValueType = typename _Up::value_type> + class _SmartReference + { + friend _Accessor; + int _M_index; + _Up& _M_obj; + + _GLIBCXX_SIMD_INTRINSIC constexpr _ValueType _M_read() const noexcept + { + if constexpr (is_arithmetic_v<_Up>) + return _M_obj; + else + return _M_obj[_M_index]; + } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr void _M_write(_Tp&& __x) const + { _Accessor::_S_set(_M_obj, _M_index, static_cast<_Tp&&>(__x)); } + + public: + _GLIBCXX_SIMD_INTRINSIC constexpr + _SmartReference(_Up& __o, int __i) noexcept + : _M_index(__i), _M_obj(__o) {} + + using value_type = _ValueType; + + _GLIBCXX_SIMD_INTRINSIC _SmartReference(const _SmartReference&) = delete; + + _GLIBCXX_SIMD_INTRINSIC constexpr operator value_type() const noexcept + { return _M_read(); } + + template <typename _Tp, + typename + = _ValuePreservingOrInt<__remove_cvref_t<_Tp>, value_type>> + _GLIBCXX_SIMD_INTRINSIC constexpr _SmartReference operator=(_Tp&& __x) && + { + _M_write(static_cast<_Tp&&>(__x)); + return {_M_obj, _M_index}; + } + +#define _GLIBCXX_SIMD_OP_(__op) \ + template <typename _Tp, \ + typename _TT \ + = decltype(declval<value_type>() __op declval<_Tp>()), \ + typename = _ValuePreservingOrInt<__remove_cvref_t<_Tp>, _TT>, \ + typename = _ValuePreservingOrInt<_TT, value_type>> \ + _GLIBCXX_SIMD_INTRINSIC constexpr _SmartReference \ + operator __op##=(_Tp&& __x) && \ + { \ + const value_type& __lhs = _M_read(); \ + _M_write(__lhs __op __x); \ + return {_M_obj, _M_index}; \ + } + _GLIBCXX_SIMD_ALL_ARITHMETICS(_GLIBCXX_SIMD_OP_); + _GLIBCXX_SIMD_ALL_SHIFTS(_GLIBCXX_SIMD_OP_); + _GLIBCXX_SIMD_ALL_BINARY(_GLIBCXX_SIMD_OP_); +#undef _GLIBCXX_SIMD_OP_ + + template <typename _Tp = void, + typename + = decltype(++declval<conditional_t<true, value_type, _Tp>&>())> + _GLIBCXX_SIMD_INTRINSIC constexpr _SmartReference operator++() && + { + value_type __x = _M_read(); + _M_write(++__x); + return {_M_obj, _M_index}; + } + + template <typename _Tp = void, + typename + = decltype(declval<conditional_t<true, value_type, _Tp>&>()++)> + 
_GLIBCXX_SIMD_INTRINSIC constexpr value_type operator++(int) && + { + const value_type __r = _M_read(); + value_type __x = __r; + _M_write(++__x); + return __r; + } + + template <typename _Tp = void, + typename + = decltype(--declval<conditional_t<true, value_type, _Tp>&>())> + _GLIBCXX_SIMD_INTRINSIC constexpr _SmartReference operator--() && + { + value_type __x = _M_read(); + _M_write(--__x); + return {_M_obj, _M_index}; + } + + template <typename _Tp = void, + typename + = decltype(declval<conditional_t<true, value_type, _Tp>&>()--)> + _GLIBCXX_SIMD_INTRINSIC constexpr value_type operator--(int) && + { + const value_type __r = _M_read(); + value_type __x = __r; + _M_write(--__x); + return __r; + } + + _GLIBCXX_SIMD_INTRINSIC friend void + swap(_SmartReference&& __a, _SmartReference&& __b) noexcept( + conjunction< + is_nothrow_constructible<value_type, _SmartReference&&>, + is_nothrow_assignable<_SmartReference&&, value_type&&>>::value) + { + value_type __tmp = static_cast<_SmartReference&&>(__a); + static_cast<_SmartReference&&>(__a) = static_cast<value_type>(__b); + static_cast<_SmartReference&&>(__b) = std::move(__tmp); + } + + _GLIBCXX_SIMD_INTRINSIC friend void + swap(value_type& __a, _SmartReference&& __b) noexcept( + conjunction< + is_nothrow_constructible<value_type, value_type&&>, + is_nothrow_assignable<value_type&, value_type&&>, + is_nothrow_assignable<_SmartReference&&, value_type&&>>::value) + { + value_type __tmp(std::move(__a)); + __a = static_cast<value_type>(__b); + static_cast<_SmartReference&&>(__b) = std::move(__tmp); + } + + _GLIBCXX_SIMD_INTRINSIC friend void + swap(_SmartReference&& __a, value_type& __b) noexcept( + conjunction< + is_nothrow_constructible<value_type, _SmartReference&&>, + is_nothrow_assignable<value_type&, value_type&&>, + is_nothrow_assignable<_SmartReference&&, value_type&&>>::value) + { + value_type __tmp(__a); + static_cast<_SmartReference&&>(__a) = std::move(__b); + __b = std::move(__tmp); + } + }; + +// }}} +// __scalar_abi_wrapper {{{ +template <int _Bytes> + struct __scalar_abi_wrapper + { + template <typename _Tp> static constexpr size_t _S_full_size = 1; + template <typename _Tp> static constexpr size_t _S_size = 1; + template <typename _Tp> static constexpr size_t _S_is_partial = false; + + template <typename _Tp, typename _Abi = simd_abi::scalar> + static constexpr bool _S_is_valid_v + = _Abi::template _IsValid<_Tp>::value && sizeof(_Tp) == _Bytes; + }; + +// }}} +// __decay_abi metafunction {{{ +template <typename _Tp> + struct __decay_abi { using type = _Tp; }; + +template <int _Bytes> + struct __decay_abi<__scalar_abi_wrapper<_Bytes>> + { using type = simd_abi::scalar; }; + +// }}} +// __find_next_valid_abi metafunction {{{1 +// Given an ABI tag A<N>, find an N2 < N such that A<N2>::_S_is_valid_v<_Tp> == +// true, N2 is a power-of-2, and A<N2>::_S_is_partial<_Tp> is false. Break +// recursion at 2 elements in the resulting ABI tag. In this case +// type::_S_is_valid_v<_Tp> may be false. 
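+// For example (illustrative; assuming an x86 target where 16 bytes is the +// largest valid vector size, i.e. SSE without AVX), +// __find_next_valid_abi<simd_abi::_VecBuiltin, 32, float>::type is +// _VecBuiltin<16>: _VecBuiltin<32> is not valid without AVX, and the next +// smaller power-of-2 size, 16 bytes, holds 4 floats with no partial use. +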
+template <template <int> class _Abi, int _Bytes, typename _Tp> + struct __find_next_valid_abi + { + static constexpr auto _S_choose() + { + constexpr int _NextBytes = std::__bit_ceil(_Bytes) / 2; + using _NextAbi = _Abi<_NextBytes>; + if constexpr (_NextBytes < sizeof(_Tp) * 2) // break recursion + return _Abi<_Bytes>(); + else if constexpr (_NextAbi::template _S_is_partial<_Tp> == false + && _NextAbi::template _S_is_valid_v<_Tp>) + return _NextAbi(); + else + return __find_next_valid_abi<_Abi, _NextBytes, _Tp>::_S_choose(); + } + + using type = decltype(_S_choose()); + }; + +template <int _Bytes, typename _Tp> + struct __find_next_valid_abi<__scalar_abi_wrapper, _Bytes, _Tp> + { using type = simd_abi::scalar; }; + +// _AbiList {{{1 +template <template <int> class...> + struct _AbiList + { + template <typename, int> static constexpr bool _S_has_valid_abi = false; + template <typename, int> using _FirstValidAbi = void; + template <typename, int> using _BestAbi = void; + }; + +template <template <int> class _A0, template <int> class... _Rest> + struct _AbiList<_A0, _Rest...> + { + template <typename _Tp, int _Np> + static constexpr bool _S_has_valid_abi + = _A0<sizeof(_Tp) * _Np>::template _S_is_valid_v< + _Tp> || _AbiList<_Rest...>::template _S_has_valid_abi<_Tp, _Np>; + + template <typename _Tp, int _Np> + using _FirstValidAbi = conditional_t< + _A0<sizeof(_Tp) * _Np>::template _S_is_valid_v<_Tp>, + typename __decay_abi<_A0<sizeof(_Tp) * _Np>>::type, + typename _AbiList<_Rest...>::template _FirstValidAbi<_Tp, _Np>>; + + template <typename _Tp, int _Np> + static constexpr auto _S_determine_best_abi() + { + static_assert(_Np >= 1); + constexpr int _Bytes = sizeof(_Tp) * _Np; + if constexpr (_Np == 1) + return __make_dependent_t<_Tp, simd_abi::scalar>{}; + else + { + constexpr int __fullsize = _A0<_Bytes>::template _S_full_size<_Tp>; + // _A0<_Bytes> is good if: + // 1. The ABI tag is valid for _Tp + // 2. The storage overhead is no more than padding to fill the next + // power-of-2 number of bytes + if constexpr (_A0<_Bytes>::template _S_is_valid_v< + _Tp> && __fullsize / 2 < _Np) + return typename __decay_abi<_A0<_Bytes>>::type{}; + else + { + using _B = + typename __find_next_valid_abi<_A0, _Bytes, _Tp>::type; + if constexpr (_B::template _S_is_valid_v< + _Tp> && _B::template _S_size<_Tp> <= _Np) + return _B{}; + else + return + typename _AbiList<_Rest...>::template _BestAbi<_Tp, _Np>{}; + } + } + } + + template <typename _Tp, int _Np> + using _BestAbi = decltype(_S_determine_best_abi<_Tp, _Np>()); + }; + +// }}}1 + +// the following lists all native ABIs, which makes them accessible to +// simd_abi::deduce and select_best_vector_type_t (for fixed_size). Order +// matters: Whatever comes first has higher priority. 
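+// For example (illustrative; the actual choice depends on the enabled ISA +// extensions), with AVX512F a request for 16 floats (64 bytes) resolves to +// simd_abi::_VecBltnBtmsk<64> rather than simd_abi::_VecBuiltin<64>, simply +// because _VecBltnBtmsk is listed first below. +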
+using _AllNativeAbis = _AbiList<simd_abi::_VecBltnBtmsk, simd_abi::_VecBuiltin, + __scalar_abi_wrapper>; + +// valid _SimdTraits specialization {{{1 +template <typename _Tp, typename _Abi> + struct _SimdTraits<_Tp, _Abi, void_t<typename _Abi::template _IsValid<_Tp>>> + : _Abi::template __traits<_Tp> {}; + +// __deduce_impl specializations {{{1 +// try all native ABIs (including scalar) first +template <typename _Tp, size_t _Np> + struct __deduce_impl< + _Tp, _Np, enable_if_t<_AllNativeAbis::template _S_has_valid_abi<_Tp, _Np>>> + { using type = _AllNativeAbis::_FirstValidAbi<_Tp, _Np>; }; + +// fall back to fixed_size only if scalar and native ABIs don't match +template <typename _Tp, size_t _Np, typename = void> + struct __deduce_fixed_size_fallback {}; + +template <typename _Tp, size_t _Np> + struct __deduce_fixed_size_fallback<_Tp, _Np, + enable_if_t<simd_abi::fixed_size<_Np>::template _S_is_valid_v<_Tp>>> + { using type = simd_abi::fixed_size<_Np>; }; + +template <typename _Tp, size_t _Np, typename> + struct __deduce_impl : public __deduce_fixed_size_fallback<_Tp, _Np> {}; + +//}}}1 + +// simd_mask {{{ +template <typename _Tp, typename _Abi> + class simd_mask : public _SimdTraits<_Tp, _Abi>::_MaskBase + { + // types, tags, and friends {{{ + using _Traits = _SimdTraits<_Tp, _Abi>; + using _MemberType = typename _Traits::_MaskMember; + + // We map all masks with equal element sizeof to a single integer type, the + // one given by __int_for_sizeof_t<_Tp>. This is the approach + // [[gnu::vector_size(N)]] types take as well and it reduces the number of + // template specializations in the implementation classes. + using _Ip = __int_for_sizeof_t<_Tp>; + static constexpr _Ip* _S_type_tag = nullptr; + + friend typename _Traits::_MaskBase; + friend class simd<_Tp, _Abi>; // to construct masks on return + friend typename _Traits::_SimdImpl; // to construct masks on return and + // inspect data on masked operations + public: + using _Impl = typename _Traits::_MaskImpl; + friend _Impl; + + // }}} + // member types {{{ + using value_type = bool; + using reference = _SmartReference<_MemberType, _Impl, value_type>; + using simd_type = simd<_Tp, _Abi>; + using abi_type = _Abi; + + // }}} + static constexpr size_t size() // {{{ + { return __size_or_zero_v<_Tp, _Abi>; } + + // }}} + // constructors & assignment {{{ + simd_mask() = default; + simd_mask(const simd_mask&) = default; + simd_mask(simd_mask&&) = default; + simd_mask& operator=(const simd_mask&) = default; + simd_mask& operator=(simd_mask&&) = default; + + // }}} + // access to internal representation (optional feature) {{{ + _GLIBCXX_SIMD_ALWAYS_INLINE explicit + simd_mask(typename _Traits::_MaskCastType __init) + : _M_data{__init} {} + // conversion to the internal type is done in _MaskBase + + // }}} + // bitset interface (extension to be proposed) {{{ + // TS_FEEDBACK: + // Conversion of simd_mask to and from bitset makes it much easier to + // interface with other facilities. I suggest adding `static + // simd_mask::from_bitset` and `simd_mask::to_bitset`. 
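+ // Example of the intended round-trip (illustrative only; this + // implementation spells the functions __from_bitset and __to_bitset): + //   fixed_size_simd_mask<int, 8> k + //     = fixed_size_simd_mask<int, 8>::__from_bitset(bitset<8>(0b1010'1010)); + //   // k.__to_bitset().count() == size_t(popcount(k)), i.e. 4 + 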
+ _GLIBCXX_SIMD_ALWAYS_INLINE static simd_mask + __from_bitset(bitset<size()> bs) + { return {__bitset_init, bs}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE bitset<size()> + __to_bitset() const + { return _Impl::_S_to_bits(_M_data)._M_to_bitset(); } + + // }}} + // explicit broadcast constructor {{{ + _GLIBCXX_SIMD_ALWAYS_INLINE explicit _GLIBCXX_SIMD_CONSTEXPR + simd_mask(value_type __x) + : _M_data(_Impl::template _S_broadcast<_Ip>(__x)) {} + + // }}} + // implicit type conversion constructor {{{ + #ifdef _GLIBCXX_SIMD_ENABLE_IMPLICIT_MASK_CAST + // proposed improvement + template <typename _Up, typename _A2, + typename = enable_if_t<simd_size_v<_Up, _A2> == size()>> + _GLIBCXX_SIMD_ALWAYS_INLINE explicit(sizeof(_MemberType) + != sizeof(typename _SimdTraits<_Up, _A2>::_MaskMember)) + simd_mask(const simd_mask<_Up, _A2>& __x) + : simd_mask(__proposed::static_simd_cast<simd_mask>(__x)) {} + #else + // conforming to ISO/IEC 19570:2018 + template <typename _Up, typename = enable_if_t<conjunction< + is_same<abi_type, simd_abi::fixed_size<size()>>, + is_same<_Up, _Up>>::value>> + _GLIBCXX_SIMD_ALWAYS_INLINE + simd_mask(const simd_mask<_Up, simd_abi::fixed_size<size()>>& __x) + : _M_data(_Impl::_S_from_bitmask(__data(__x), _S_type_tag)) {} + #endif + + // }}} + // load constructor {{{ + template <typename _Flags> + _GLIBCXX_SIMD_ALWAYS_INLINE + simd_mask(const value_type* __mem, _Flags) + : _M_data(_Impl::template _S_load<_Ip>( + _Flags::template _S_apply<simd_mask>(__mem))) {} + + template <typename _Flags> + _GLIBCXX_SIMD_ALWAYS_INLINE + simd_mask(const value_type* __mem, simd_mask __k, _Flags) + : _M_data{} + { + _M_data + = _Impl::_S_masked_load(_M_data, __k._M_data, + _Flags::template _S_apply<simd_mask>(__mem)); + } + + // }}} + // loads [simd_mask.load] {{{ + template <typename _Flags> + _GLIBCXX_SIMD_ALWAYS_INLINE void + copy_from(const value_type* __mem, _Flags) + { + _M_data = _Impl::template _S_load<_Ip>( + _Flags::template _S_apply<simd_mask>(__mem)); + } + + // }}} + // stores [simd_mask.store] {{{ + template <typename _Flags> + _GLIBCXX_SIMD_ALWAYS_INLINE void + copy_to(value_type* __mem, _Flags) const + { _Impl::_S_store(_M_data, _Flags::template _S_apply<simd_mask>(__mem)); } + + // }}} + // scalar access {{{ + _GLIBCXX_SIMD_ALWAYS_INLINE reference + operator[](size_t __i) + { + if (__i >= size()) + __invoke_ub("Subscript %d is out of range [0, %d]", __i, size() - 1); + return {_M_data, int(__i)}; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE value_type + operator[](size_t __i) const + { + if (__i >= size()) + __invoke_ub("Subscript %d is out of range [0, %d]", __i, size() - 1); + if constexpr (__is_scalar_abi<_Abi>()) + return _M_data; + else + return static_cast<bool>(_M_data[__i]); + } + + // }}} + // negation {{{ + _GLIBCXX_SIMD_ALWAYS_INLINE simd_mask + operator!() const + { return {__private_init, _Impl::_S_bit_not(_M_data)}; } + + // }}} + // simd_mask binary operators [simd_mask.binary] {{{ + #ifdef _GLIBCXX_SIMD_ENABLE_IMPLICIT_MASK_CAST + // simd_mask<int> && simd_mask<uint> needs disambiguation + template <typename _Up, typename _A2, + typename + = enable_if_t<is_convertible_v<simd_mask<_Up, _A2>, simd_mask>>> + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask + operator&&(const simd_mask& __x, const simd_mask<_Up, _A2>& __y) + { + return {__private_init, + _Impl::_S_logical_and(__x._M_data, simd_mask(__y)._M_data)}; + } + + template <typename _Up, typename _A2, + typename + = enable_if_t<is_convertible_v<simd_mask<_Up, _A2>, simd_mask>>> + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask + 
operator||(const simd_mask& __x, const simd_mask<_Up, _A2>& __y) + { + return {__private_init, + _Impl::_S_logical_or(__x._M_data, simd_mask(__y)._M_data)}; + } + #endif // _GLIBCXX_SIMD_ENABLE_IMPLICIT_MASK_CAST + + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask + operator&&(const simd_mask& __x, const simd_mask& __y) + { + return {__private_init, _Impl::_S_logical_and(__x._M_data, __y._M_data)}; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask + operator||(const simd_mask& __x, const simd_mask& __y) + { + return {__private_init, _Impl::_S_logical_or(__x._M_data, __y._M_data)}; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask + operator&(const simd_mask& __x, const simd_mask& __y) + { return {__private_init, _Impl::_S_bit_and(__x._M_data, __y._M_data)}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask + operator|(const simd_mask& __x, const simd_mask& __y) + { return {__private_init, _Impl::_S_bit_or(__x._M_data, __y._M_data)}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask + operator^(const simd_mask& __x, const simd_mask& __y) + { return {__private_init, _Impl::_S_bit_xor(__x._M_data, __y._M_data)}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask& + operator&=(simd_mask& __x, const simd_mask& __y) + { + __x._M_data = _Impl::_S_bit_and(__x._M_data, __y._M_data); + return __x; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask& + operator|=(simd_mask& __x, const simd_mask& __y) + { + __x._M_data = _Impl::_S_bit_or(__x._M_data, __y._M_data); + return __x; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE friend simd_mask& + operator^=(simd_mask& __x, const simd_mask& __y) + { + __x._M_data = _Impl::_S_bit_xor(__x._M_data, __y._M_data); + return __x; + } + + // }}} + // simd_mask compares [simd_mask.comparison] {{{ + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd_mask + operator==(const simd_mask& __x, const simd_mask& __y) + { return !operator!=(__x, __y); } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd_mask + operator!=(const simd_mask& __x, const simd_mask& __y) + { return {__private_init, _Impl::_S_bit_xor(__x._M_data, __y._M_data)}; } + + // }}} + // private_init ctor {{{ + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR + simd_mask(_PrivateInit, typename _Traits::_MaskMember __init) + : _M_data(__init) {} + + // }}} + // private_init generator ctor {{{ + template <typename _Fp, typename = decltype(bool(declval<_Fp>()(size_t())))> + _GLIBCXX_SIMD_INTRINSIC constexpr + simd_mask(_PrivateInit, _Fp&& __gen) + : _M_data() + { + __execute_n_times<size()>([&](auto __i) constexpr { + _Impl::_S_set(_M_data, __i, __gen(__i)); + }); + } + + // }}} + // bitset_init ctor {{{ + _GLIBCXX_SIMD_INTRINSIC simd_mask(_BitsetInit, bitset<size()> __init) + : _M_data( + _Impl::_S_from_bitmask(_SanitizedBitMask<size()>(__init), _S_type_tag)) + {} + + // }}} + // __cvt {{{ + // TS_FEEDBACK: + // The conversion operator this implements should be a ctor on simd_mask. + // Once you call .__cvt() on a simd_mask it converts conveniently. 
+ // A useful variation: add `explicit(sizeof(_Tp) != sizeof(_Up))` + struct _CvtProxy + { + template <typename _Up, typename _A2, + typename + = enable_if_t<simd_size_v<_Up, _A2> == simd_size_v<_Tp, _Abi>>> + operator simd_mask<_Up, _A2>() && + { + using namespace std::experimental::__proposed; + return static_simd_cast<simd_mask<_Up, _A2>>(_M_data); + } + + const simd_mask<_Tp, _Abi>& _M_data; + }; + + _GLIBCXX_SIMD_INTRINSIC _CvtProxy + __cvt() const + { return {*this}; } + + // }}} + // operator?: overloads (suggested extension) {{{ + #ifdef __GXX_CONDITIONAL_IS_OVERLOADABLE__ + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd_mask + operator?:(const simd_mask& __k, const simd_mask& __where_true, + const simd_mask& __where_false) + { + auto __ret = __where_false; + _Impl::_S_masked_assign(__k._M_data, __ret._M_data, __where_true._M_data); + return __ret; + } + + template <typename _U1, typename _U2, + typename _Rp = simd<common_type_t<_U1, _U2>, _Abi>, + typename = enable_if_t<conjunction_v< + is_convertible<_U1, _Rp>, is_convertible<_U2, _Rp>, + is_convertible<simd_mask, typename _Rp::mask_type>>>> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend _Rp + operator?:(const simd_mask& __k, const _U1& __where_true, + const _U2& __where_false) + { + _Rp __ret = __where_false; + _Rp::_Impl::_S_masked_assign( + __data(static_cast<typename _Rp::mask_type>(__k)), __data(__ret), + __data(static_cast<_Rp>(__where_true))); + return __ret; + } + + #ifdef _GLIBCXX_SIMD_ENABLE_IMPLICIT_MASK_CAST + template <typename _Kp, typename _Ak, typename _Up, typename _Au, + typename = enable_if_t< + conjunction_v<is_convertible<simd_mask<_Kp, _Ak>, simd_mask>, + is_convertible<simd_mask<_Up, _Au>, simd_mask>>>> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd_mask + operator?:(const simd_mask<_Kp, _Ak>& __k, const simd_mask& __where_true, + const simd_mask<_Up, _Au>& __where_false) + { + simd_mask __ret = __where_false; + _Impl::_S_masked_assign(simd_mask(__k)._M_data, __ret._M_data, + __where_true._M_data); + return __ret; + } + #endif // _GLIBCXX_SIMD_ENABLE_IMPLICIT_MASK_CAST + #endif // __GXX_CONDITIONAL_IS_OVERLOADABLE__ + + // }}} + // _M_is_constprop {{{ + _GLIBCXX_SIMD_INTRINSIC constexpr bool + _M_is_constprop() const + { + if constexpr (__is_scalar_abi<_Abi>()) + return __builtin_constant_p(_M_data); + else + return _M_data._M_is_constprop(); + } + + // }}} + + private: + friend const auto& __data<_Tp, abi_type>(const simd_mask&); + friend auto& __data<_Tp, abi_type>(simd_mask&); + alignas(_Traits::_S_mask_align) _MemberType _M_data; + }; + +// }}} + +// __data(simd_mask) {{{ +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC constexpr const auto& + __data(const simd_mask<_Tp, _Ap>& __x) + { return __x._M_data; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& + __data(simd_mask<_Tp, _Ap>& __x) + { return __x._M_data; } + +// }}} + +// simd_mask reductions [simd_mask.reductions] {{{ +template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR bool + all_of(const simd_mask<_Tp, _Abi>& __k) noexcept + { + if (__builtin_is_constant_evaluated() || __k._M_is_constprop()) + { + for (size_t __i = 0; __i < simd_size_v<_Tp, _Abi>; ++__i) + if (!__k[__i]) + return false; + return true; + } + else + return _Abi::_MaskImpl::_S_all_of(__k); + } + +template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR bool + any_of(const simd_mask<_Tp, _Abi>& __k) 
noexcept + { + if (__builtin_is_constant_evaluated() || __k._M_is_constprop()) + { + for (size_t __i = 0; __i < simd_size_v<_Tp, _Abi>; ++__i) + if (__k[__i]) + return true; + return false; + } + else + return _Abi::_MaskImpl::_S_any_of(__k); + } + +template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR bool + none_of(const simd_mask<_Tp, _Abi>& __k) noexcept + { + if (__builtin_is_constant_evaluated() || __k._M_is_constprop()) + { + for (size_t __i = 0; __i < simd_size_v<_Tp, _Abi>; ++__i) + if (__k[__i]) + return false; + return true; + } + else + return _Abi::_MaskImpl::_S_none_of(__k); + } + +template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR bool + some_of(const simd_mask<_Tp, _Abi>& __k) noexcept + { + if (__builtin_is_constant_evaluated() || __k._M_is_constprop()) + { + for (size_t __i = 1; __i < simd_size_v<_Tp, _Abi>; ++__i) + if (__k[__i] != __k[__i - 1]) + return true; + return false; + } + else + return _Abi::_MaskImpl::_S_some_of(__k); + } + +template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR int + popcount(const simd_mask<_Tp, _Abi>& __k) noexcept + { + if (__builtin_is_constant_evaluated() || __k._M_is_constprop()) + { + const int __r = __call_with_subscripts<simd_size_v<_Tp, _Abi>>( + __k, [](auto... __elements) { return ((__elements != 0) + ...); }); + if (__builtin_is_constant_evaluated() || __builtin_constant_p(__r)) + return __r; + } + return _Abi::_MaskImpl::_S_popcount(__k); + } + +template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR int + find_first_set(const simd_mask<_Tp, _Abi>& __k) + { + if (__builtin_is_constant_evaluated() || __k._M_is_constprop()) + { + constexpr size_t _Np = simd_size_v<_Tp, _Abi>; + const size_t _Idx = __call_with_n_evaluations<_Np>( + [](auto... __indexes) { return std::min({__indexes...}); }, + [&](auto __i) { return __k[__i] ? +__i : _Np; }); + if (_Idx >= _Np) + __invoke_ub("find_first_set(empty mask) is UB"); + if (__builtin_constant_p(_Idx)) + return _Idx; + } + return _Abi::_MaskImpl::_S_find_first_set(__k); + } + +template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR int + find_last_set(const simd_mask<_Tp, _Abi>& __k) + { + if (__builtin_is_constant_evaluated() || __k._M_is_constprop()) + { + constexpr size_t _Np = simd_size_v<_Tp, _Abi>; + const int _Idx = __call_with_n_evaluations<_Np>( + [](auto... __indexes) { return std::max({__indexes...}); }, + [&](auto __i) { return __k[__i] ? 
int(__i) : -1; }); + if (_Idx < 0) + __invoke_ub("find_last_set(empty mask) is UB"); + if (__builtin_constant_p(_Idx)) + return _Idx; + } + return _Abi::_MaskImpl::_S_find_last_set(__k); + } + +_GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR bool +all_of(_ExactBool __x) noexcept +{ return __x; } + +_GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR bool +any_of(_ExactBool __x) noexcept +{ return __x; } + +_GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR bool +none_of(_ExactBool __x) noexcept +{ return !__x; } + +_GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR bool +some_of(_ExactBool) noexcept +{ return false; } + +_GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR int +popcount(_ExactBool __x) noexcept +{ return __x; } + +_GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR int +find_first_set(_ExactBool) +{ return 0; } + +_GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR int +find_last_set(_ExactBool) +{ return 0; } + +// }}} + +// _SimdIntOperators{{{1 +template <typename _V, typename _Impl, bool> + class _SimdIntOperators {}; + +template <typename _V, typename _Impl> + class _SimdIntOperators<_V, _Impl, true> + { + _GLIBCXX_SIMD_INTRINSIC const _V& __derived() const + { return *static_cast<const _V*>(this); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _GLIBCXX_SIMD_CONSTEXPR _V + _S_make_derived(_Tp&& __d) + { return {__private_init, static_cast<_Tp&&>(__d)}; } + + public: + _GLIBCXX_SIMD_CONSTEXPR friend _V& operator%=(_V& __lhs, const _V& __x) + { return __lhs = __lhs % __x; } + + _GLIBCXX_SIMD_CONSTEXPR friend _V& operator&=(_V& __lhs, const _V& __x) + { return __lhs = __lhs & __x; } + + _GLIBCXX_SIMD_CONSTEXPR friend _V& operator|=(_V& __lhs, const _V& __x) + { return __lhs = __lhs | __x; } + + _GLIBCXX_SIMD_CONSTEXPR friend _V& operator^=(_V& __lhs, const _V& __x) + { return __lhs = __lhs ^ __x; } + + _GLIBCXX_SIMD_CONSTEXPR friend _V& operator<<=(_V& __lhs, const _V& __x) + { return __lhs = __lhs << __x; } + + _GLIBCXX_SIMD_CONSTEXPR friend _V& operator>>=(_V& __lhs, const _V& __x) + { return __lhs = __lhs >> __x; } + + _GLIBCXX_SIMD_CONSTEXPR friend _V& operator<<=(_V& __lhs, int __x) + { return __lhs = __lhs << __x; } + + _GLIBCXX_SIMD_CONSTEXPR friend _V& operator>>=(_V& __lhs, int __x) + { return __lhs = __lhs >> __x; } + + _GLIBCXX_SIMD_CONSTEXPR friend _V operator%(const _V& __x, const _V& __y) + { + return _SimdIntOperators::_S_make_derived( + _Impl::_S_modulus(__data(__x), __data(__y))); + } + + _GLIBCXX_SIMD_CONSTEXPR friend _V operator&(const _V& __x, const _V& __y) + { + return _SimdIntOperators::_S_make_derived( + _Impl::_S_bit_and(__data(__x), __data(__y))); + } + + _GLIBCXX_SIMD_CONSTEXPR friend _V operator|(const _V& __x, const _V& __y) + { + return _SimdIntOperators::_S_make_derived( + _Impl::_S_bit_or(__data(__x), __data(__y))); + } + + _GLIBCXX_SIMD_CONSTEXPR friend _V operator^(const _V& __x, const _V& __y) + { + return _SimdIntOperators::_S_make_derived( + _Impl::_S_bit_xor(__data(__x), __data(__y))); + } + + _GLIBCXX_SIMD_CONSTEXPR friend _V operator<<(const _V& __x, const _V& __y) + { + return _SimdIntOperators::_S_make_derived( + _Impl::_S_bit_shift_left(__data(__x), __data(__y))); + } + + _GLIBCXX_SIMD_CONSTEXPR friend _V operator>>(const _V& __x, const _V& __y) + { + return _SimdIntOperators::_S_make_derived( + _Impl::_S_bit_shift_right(__data(__x), __data(__y))); + } + + template <typename _VV = _V> + _GLIBCXX_SIMD_CONSTEXPR friend _V operator<<(const _V& __x, int __y) + { + using _Tp = typename _VV::value_type; + if (__y 
< 0) + __invoke_ub("The behavior is undefined if the right operand of a " + "shift operation is negative. [expr.shift]\nA shift by " + "%d was requested", + __y); + if (size_t(__y) >= sizeof(declval<_Tp>() << __y) * __CHAR_BIT__) + __invoke_ub( + "The behavior is undefined if the right operand of a " + "shift operation is greater than or equal to the width of the " + "promoted left operand. [expr.shift]\nA shift by %d was requested", + __y); + return _SimdIntOperators::_S_make_derived( + _Impl::_S_bit_shift_left(__data(__x), __y)); + } + + template <typename _VV = _V> + _GLIBCXX_SIMD_CONSTEXPR friend _V operator>>(const _V& __x, int __y) + { + using _Tp = typename _VV::value_type; + if (__y < 0) + __invoke_ub( + "The behavior is undefined if the right operand of a shift " + "operation is negative. [expr.shift]\nA shift by %d was requested", + __y); + if (size_t(__y) >= sizeof(declval<_Tp>() << __y) * __CHAR_BIT__) + __invoke_ub( + "The behavior is undefined if the right operand of a shift " + "operation is greater than or equal to the width of the promoted " + "left operand. [expr.shift]\nA shift by %d was requested", + __y); + return _SimdIntOperators::_S_make_derived( + _Impl::_S_bit_shift_right(__data(__x), __y)); + } + + // unary operators (for integral _Tp) + _GLIBCXX_SIMD_CONSTEXPR _V operator~() const + { return {__private_init, _Impl::_S_complement(__derived()._M_data)}; } + }; + +//}}}1 + +// simd {{{ +template <typename _Tp, typename _Abi> + class simd : public _SimdIntOperators< + simd<_Tp, _Abi>, typename _SimdTraits<_Tp, _Abi>::_SimdImpl, + conjunction<is_integral<_Tp>, + typename _SimdTraits<_Tp, _Abi>::_IsValid>::value>, + public _SimdTraits<_Tp, _Abi>::_SimdBase + { + using _Traits = _SimdTraits<_Tp, _Abi>; + using _MemberType = typename _Traits::_SimdMember; + using _CastType = typename _Traits::_SimdCastType; + static constexpr _Tp* _S_type_tag = nullptr; + friend typename _Traits::_SimdBase; + + public: + using _Impl = typename _Traits::_SimdImpl; + friend _Impl; + friend _SimdIntOperators<simd, _Impl, true>; + + using value_type = _Tp; + using reference = _SmartReference<_MemberType, _Impl, value_type>; + using mask_type = simd_mask<_Tp, _Abi>; + using abi_type = _Abi; + + static constexpr size_t size() + { return __size_or_zero_v<_Tp, _Abi>; } + + _GLIBCXX_SIMD_CONSTEXPR simd() = default; + _GLIBCXX_SIMD_CONSTEXPR simd(const simd&) = default; + _GLIBCXX_SIMD_CONSTEXPR simd(simd&&) noexcept = default; + _GLIBCXX_SIMD_CONSTEXPR simd& operator=(const simd&) = default; + _GLIBCXX_SIMD_CONSTEXPR simd& operator=(simd&&) noexcept = default; + + // implicit broadcast constructor + template <typename _Up, + typename = enable_if_t<!is_same_v<__remove_cvref_t<_Up>, bool>>> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR + simd(_ValuePreservingOrInt<_Up, value_type>&& __x) + : _M_data( + _Impl::_S_broadcast(static_cast<value_type>(static_cast<_Up&&>(__x)))) + {} + + // implicit type conversion constructor (convert from fixed_size to + // fixed_size) + template <typename _Up> + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR + simd(const simd<_Up, simd_abi::fixed_size<size()>>& __x, + enable_if_t< + conjunction< + is_same<simd_abi::fixed_size<size()>, abi_type>, + negation<__is_narrowing_conversion<_Up, value_type>>, + __converts_to_higher_integer_rank<_Up, value_type>>::value, + void*> = nullptr) + : simd{static_cast<array<_Up, size()>>(__x).data(), vector_aligned} {} + + // explicit type conversion constructor +#ifdef _GLIBCXX_SIMD_ENABLE_STATIC_CAST + template 
<typename _Up, typename _A2, + typename = decltype(static_simd_cast<simd>( + declval<const simd<_Up, _A2>&>()))> + _GLIBCXX_SIMD_ALWAYS_INLINE explicit _GLIBCXX_SIMD_CONSTEXPR + simd(const simd<_Up, _A2>& __x) + : simd(static_simd_cast<simd>(__x)) {} +#endif // _GLIBCXX_SIMD_ENABLE_STATIC_CAST + + // generator constructor + template <typename _Fp> + _GLIBCXX_SIMD_ALWAYS_INLINE explicit _GLIBCXX_SIMD_CONSTEXPR + simd(_Fp&& __gen, _ValuePreservingOrInt<decltype(declval<_Fp>()( + declval<_SizeConstant<0>&>())), + value_type>* = nullptr) + : _M_data(_Impl::_S_generator(static_cast<_Fp&&>(__gen), _S_type_tag)) {} + + // load constructor + template <typename _Up, typename _Flags> + _GLIBCXX_SIMD_ALWAYS_INLINE + simd(const _Up* __mem, _Flags) + : _M_data( + _Impl::_S_load(_Flags::template _S_apply<simd>(__mem), _S_type_tag)) + {} + + // loads [simd.load] + template <typename _Up, typename _Flags> + _GLIBCXX_SIMD_ALWAYS_INLINE void + copy_from(const _Vectorizable<_Up>* __mem, _Flags) + { + _M_data = static_cast<decltype(_M_data)>( + _Impl::_S_load(_Flags::template _S_apply<simd>(__mem), _S_type_tag)); + } + + // stores [simd.store] + template <typename _Up, typename _Flags> + _GLIBCXX_SIMD_ALWAYS_INLINE void + copy_to(_Vectorizable<_Up>* __mem, _Flags) const + { + _Impl::_S_store(_M_data, _Flags::template _S_apply<simd>(__mem), + _S_type_tag); + } + + // scalar access + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR reference + operator[](size_t __i) + { return {_M_data, int(__i)}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR value_type + operator[]([[maybe_unused]] size_t __i) const + { + if constexpr (__is_scalar_abi<_Abi>()) + { + _GLIBCXX_DEBUG_ASSERT(__i == 0); + return _M_data; + } + else + return _M_data[__i]; + } + + // increment and decrement: + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR simd& + operator++() + { + _Impl::_S_increment(_M_data); + return *this; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR simd + operator++(int) + { + simd __r = *this; + _Impl::_S_increment(_M_data); + return __r; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR simd& + operator--() + { + _Impl::_S_decrement(_M_data); + return *this; + } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR simd + operator--(int) + { + simd __r = *this; + _Impl::_S_decrement(_M_data); + return __r; + } + + // unary operators (for any _Tp) + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR mask_type + operator!() const + { return {__private_init, _Impl::_S_negate(_M_data)}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR simd + operator+() const + { return *this; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR simd + operator-() const + { return {__private_init, _Impl::_S_unary_minus(_M_data)}; } + + // access to internal representation (suggested extension) + _GLIBCXX_SIMD_ALWAYS_INLINE explicit _GLIBCXX_SIMD_CONSTEXPR + simd(_CastType __init) : _M_data(__init) {} + + // compound assignment [simd.cassign] + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd& + operator+=(simd& __lhs, const simd& __x) + { return __lhs = __lhs + __x; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd& + operator-=(simd& __lhs, const simd& __x) + { return __lhs = __lhs - __x; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd& + operator*=(simd& __lhs, const simd& __x) + { return __lhs = __lhs * __x; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd& + operator/=(simd& __lhs, const simd& __x) + { 
return __lhs = __lhs / __x; } + + // binary operators [simd.binary] + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd + operator+(const simd& __x, const simd& __y) + { return {__private_init, _Impl::_S_plus(__x._M_data, __y._M_data)}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd + operator-(const simd& __x, const simd& __y) + { return {__private_init, _Impl::_S_minus(__x._M_data, __y._M_data)}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd + operator*(const simd& __x, const simd& __y) + { return {__private_init, _Impl::_S_multiplies(__x._M_data, __y._M_data)}; } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd + operator/(const simd& __x, const simd& __y) + { return {__private_init, _Impl::_S_divides(__x._M_data, __y._M_data)}; } + + // compares [simd.comparison] + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend mask_type + operator==(const simd& __x, const simd& __y) + { return simd::_S_make_mask(_Impl::_S_equal_to(__x._M_data, __y._M_data)); } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend mask_type + operator!=(const simd& __x, const simd& __y) + { + return simd::_S_make_mask( + _Impl::_S_not_equal_to(__x._M_data, __y._M_data)); + } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend mask_type + operator<(const simd& __x, const simd& __y) + { return simd::_S_make_mask(_Impl::_S_less(__x._M_data, __y._M_data)); } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend mask_type + operator<=(const simd& __x, const simd& __y) + { + return simd::_S_make_mask(_Impl::_S_less_equal(__x._M_data, __y._M_data)); + } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend mask_type + operator>(const simd& __x, const simd& __y) + { return simd::_S_make_mask(_Impl::_S_less(__y._M_data, __x._M_data)); } + + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend mask_type + operator>=(const simd& __x, const simd& __y) + { + return simd::_S_make_mask(_Impl::_S_less_equal(__y._M_data, __x._M_data)); + } + + // operator?: overloads (suggested extension) {{{ +#ifdef __GXX_CONDITIONAL_IS_OVERLOADABLE__ + _GLIBCXX_SIMD_ALWAYS_INLINE _GLIBCXX_SIMD_CONSTEXPR friend simd + operator?:(const mask_type& __k, const simd& __where_true, + const simd& __where_false) + { + auto __ret = __where_false; + _Impl::_S_masked_assign(__data(__k), __data(__ret), __data(__where_true)); + return __ret; + } + +#endif // __GXX_CONDITIONAL_IS_OVERLOADABLE__ + // }}} + + // "private" because of the first argument's namespace + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR + simd(_PrivateInit, const _MemberType& __init) + : _M_data(__init) {} + + // "private" because of the first argument's namespace + _GLIBCXX_SIMD_INTRINSIC + simd(_BitsetInit, bitset<size()> __init) : _M_data() + { where(mask_type(__bitset_init, __init), *this) = ~*this; } + + _GLIBCXX_SIMD_INTRINSIC constexpr bool + _M_is_constprop() const + { + if constexpr (__is_scalar_abi<_Abi>()) + return __builtin_constant_p(_M_data); + else + return _M_data._M_is_constprop(); + } + + private: + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR static mask_type + _S_make_mask(typename mask_type::_MemberType __k) + { return {__private_init, __k}; } + + friend const auto& __data<value_type, abi_type>(const simd&); + friend auto& __data<value_type, abi_type>(simd&); + alignas(_Traits::_S_simd_align) _MemberType _M_data; + }; + +// }}} +// __data {{{ +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC constexpr const auto& + 
__data(const simd<_Tp, _Ap>& __x) + { return __x._M_data; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& + __data(simd<_Tp, _Ap>& __x) + { return __x._M_data; } + +// }}} +namespace __float_bitwise_operators { //{{{ +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR simd<_Tp, _Ap> + operator^(const simd<_Tp, _Ap>& __a, const simd<_Tp, _Ap>& __b) + { + return {__private_init, + _Ap::_SimdImpl::_S_bit_xor(__data(__a), __data(__b))}; + } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR simd<_Tp, _Ap> + operator|(const simd<_Tp, _Ap>& __a, const simd<_Tp, _Ap>& __b) + { + return {__private_init, + _Ap::_SimdImpl::_S_bit_or(__data(__a), __data(__b))}; + } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_SIMD_CONSTEXPR simd<_Tp, _Ap> + operator&(const simd<_Tp, _Ap>& __a, const simd<_Tp, _Ap>& __b) + { + return {__private_init, + _Ap::_SimdImpl::_S_bit_and(__data(__a), __data(__b))}; + } +} // namespace __float_bitwise_operators }}} + +_GLIBCXX_SIMD_END_NAMESPACE + +#endif // __cplusplus >= 201703L +#endif // _GLIBCXX_EXPERIMENTAL_SIMD_H + +// vim: foldmethod=marker foldmarker={{{,}}} diff --git a/libstdc++-v3/include/experimental/bits/simd_builtin.h b/libstdc++-v3/include/experimental/bits/simd_builtin.h new file mode 100644 index 00000000000..f2c99faa4ee --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd_builtin.h @@ -0,0 +1,2949 @@ +// Simd Abi specific implementations -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. + +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. + +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_ABIS_H_ +#define _GLIBCXX_EXPERIMENTAL_SIMD_ABIS_H_ + +#if __cplusplus >= 201703L + +#include <array> +#include <cmath> +#include <cstdlib> + +_GLIBCXX_SIMD_BEGIN_NAMESPACE +// _S_allbits{{{ +template <typename _V> + static inline _GLIBCXX_SIMD_USE_CONSTEXPR _V _S_allbits + = reinterpret_cast<_V>(~__vector_type_t<char, sizeof(_V) / sizeof(char)>()); + +// }}} +// _S_signmask, _S_absmask{{{ +template <typename _V, typename = _VectorTraits<_V>> + static inline _GLIBCXX_SIMD_USE_CONSTEXPR _V _S_signmask + = __xor(_V() + 1, _V() - 1); + +template <typename _V, typename = _VectorTraits<_V>> + static inline _GLIBCXX_SIMD_USE_CONSTEXPR _V _S_absmask + = __andnot(_S_signmask<_V>, _S_allbits<_V>); + +//}}} +// __vector_permute<Indices...>{{{ +// Index == -1 requests zeroing of the output element +template <int... 
_Indices, typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _Tp + __vector_permute(_Tp __x) + { + static_assert(sizeof...(_Indices) == _TVT::_S_full_size); + return __make_vector<typename _TVT::value_type>( + (_Indices == -1 ? 0 : __x[_Indices == -1 ? 0 : _Indices])...); + } + +// }}} +// __vector_shuffle<Indices...>{{{ +// Index == -1 requests zeroing of the output element +template <int... _Indices, typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _Tp + __vector_shuffle(_Tp __x, _Tp __y) + { + return _Tp{(_Indices == -1 ? 0 + : _Indices < _TVT::_S_full_size + ? __x[_Indices] + : __y[_Indices - _TVT::_S_full_size])...}; + } + +// }}} +// __make_wrapper{{{ +template <typename _Tp, typename... _Args> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper<_Tp, sizeof...(_Args)> + __make_wrapper(const _Args&... __args) + { return __make_vector<_Tp>(__args...); } + +// }}} +// __wrapper_bitcast{{{ +template <typename _Tp, size_t _ToN = 0, typename _Up, size_t _M, + size_t _Np = _ToN != 0 ? _ToN : sizeof(_Up) * _M / sizeof(_Tp)> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper<_Tp, _Np> + __wrapper_bitcast(_SimdWrapper<_Up, _M> __x) + { + static_assert(_Np > 1); + return __intrin_bitcast<__vector_type_t<_Tp, _Np>>(__x._M_data); + } + +// }}} +// __shift_elements_right{{{ +// if (__shift % 2ⁿ == 0) => the low n Bytes are correct +template <unsigned __shift, typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _GLIBCXX_SIMD_INTRINSIC _Tp + __shift_elements_right(_Tp __v) + { + [[maybe_unused]] const auto __iv = __to_intrin(__v); + static_assert(__shift <= sizeof(_Tp)); + if constexpr (__shift == 0) + return __v; + else if constexpr (__shift == sizeof(_Tp)) + return _Tp(); +#if _GLIBCXX_SIMD_X86INTRIN // {{{ + else if constexpr (__have_sse && __shift == 8 + && _TVT::template _S_is<float, 4>) + return _mm_movehl_ps(__iv, __iv); + else if constexpr (__have_sse2 && __shift == 8 + && _TVT::template _S_is<double, 2>) + return _mm_unpackhi_pd(__iv, __iv); + else if constexpr (__have_sse2 && sizeof(_Tp) == 16) + return reinterpret_cast<typename _TVT::type>( + _mm_srli_si128(reinterpret_cast<__m128i>(__iv), __shift)); + else if constexpr (__shift == 16 && sizeof(_Tp) == 32) + { + /*if constexpr (__have_avx && _TVT::template _S_is<double, 4>) + return _mm256_permute2f128_pd(__iv, __iv, 0x81); + else if constexpr (__have_avx && _TVT::template _S_is<float, 8>) + return _mm256_permute2f128_ps(__iv, __iv, 0x81); + else if constexpr (__have_avx) + return reinterpret_cast<typename _TVT::type>( + _mm256_permute2f128_si256(__iv, __iv, 0x81)); + else*/ + return __zero_extend(__hi128(__v)); + } + else if constexpr (__have_avx2 && sizeof(_Tp) == 32 && __shift < 16) + { + const auto __vll = __vector_bitcast<_LLong>(__v); + return reinterpret_cast<typename _TVT::type>( + _mm256_alignr_epi8(_mm256_permute2x128_si256(__vll, __vll, 0x81), + __vll, __shift)); + } + else if constexpr (__have_avx && sizeof(_Tp) == 32 && __shift < 16) + { + const auto __vll = __vector_bitcast<_LLong>(__v); + return reinterpret_cast<typename _TVT::type>( + __concat(_mm_alignr_epi8(__hi128(__vll), __lo128(__vll), __shift), + _mm_srli_si128(__hi128(__vll), __shift))); + } + else if constexpr (sizeof(_Tp) == 32 && __shift > 16) + return __zero_extend(__shift_elements_right<__shift - 16>(__hi128(__v))); + else if constexpr (sizeof(_Tp) == 64 && __shift == 32) + return __zero_extend(__hi256(__v)); + else if constexpr (__have_avx512f && sizeof(_Tp) == 64) + { + if constexpr (__shift >= 48) + return __zero_extend( + __shift_elements_right<__shift - 
48>(__extract<3, 4>(__v))); + else if constexpr (__shift >= 32) + return __zero_extend( + __shift_elements_right<__shift - 32>(__hi256(__v))); + else if constexpr (__shift % 8 == 0) + return reinterpret_cast<typename _TVT::type>( + _mm512_alignr_epi64(__m512i(), __intrin_bitcast<__m512i>(__v), + __shift / 8)); + else if constexpr (__shift % 4 == 0) + return reinterpret_cast<typename _TVT::type>( + _mm512_alignr_epi32(__m512i(), __intrin_bitcast<__m512i>(__v), + __shift / 4)); + else if constexpr (__have_avx512bw && __shift < 16) + { + const auto __vll = __vector_bitcast<_LLong>(__v); + return reinterpret_cast<typename _TVT::type>( + _mm512_alignr_epi8(_mm512_shuffle_i32x4(__vll, __vll, 0xf9), + __vll, __shift)); + } + else if constexpr (__have_avx512bw && __shift < 32) + { + const auto __vll = __vector_bitcast<_LLong>(__v); + return reinterpret_cast<typename _TVT::type>( + _mm512_alignr_epi8(_mm512_shuffle_i32x4(__vll, __m512i(), 0xee), + _mm512_shuffle_i32x4(__vll, __vll, 0xf9), + __shift - 16)); + } + else + __assert_unreachable<_Tp>(); + } + /* + } else if constexpr (__shift % 16 == 0 && sizeof(_Tp) == 64) + return __auto_bitcast(__extract<__shift / 16, 4>(__v)); + */ +#endif // _GLIBCXX_SIMD_X86INTRIN }}} + else + { + constexpr int __chunksize = __shift % 8 == 0 ? 8 + : __shift % 4 == 0 ? 4 + : __shift % 2 == 0 ? 2 + : 1; + auto __w = __vector_bitcast<__int_with_sizeof_t<__chunksize>>(__v); + using _Up = decltype(__w); + return __intrin_bitcast<_Tp>( + __call_with_n_evaluations<(sizeof(_Tp) - __shift) / __chunksize>( + [](auto... __chunks) { return _Up{__chunks...}; }, + [&](auto __i) { return __w[__shift / __chunksize + __i]; })); + } + } + +// }}} +// __extract_part(_SimdWrapper<_Tp, _Np>) {{{ +template <int _Index, int _Total, int _Combine, typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_CONST + _SimdWrapper<_Tp, _Np / _Total * _Combine> + __extract_part(const _SimdWrapper<_Tp, _Np> __x) + { + if constexpr (_Index % 2 == 0 && _Total % 2 == 0 && _Combine % 2 == 0) + return __extract_part<_Index / 2, _Total / 2, _Combine / 2>(__x); + else + { + constexpr size_t __values_per_part = _Np / _Total; + constexpr size_t __values_to_skip = _Index * __values_per_part; + constexpr size_t __return_size = __values_per_part * _Combine; + using _R = __vector_type_t<_Tp, __return_size>; + static_assert((_Index + _Combine) * __values_per_part * sizeof(_Tp) + <= sizeof(__x), + "out of bounds __extract_part"); + // the following assertion would ensure no "padding" to be read + // static_assert(_Total >= _Index + _Combine, "_Total must be greater + // than _Index"); + + // static_assert(__return_size * _Total == _Np, "_Np must be divisible + // by _Total"); + if (__x._M_is_constprop()) + return __generate_from_n_evaluations<__return_size, _R>( + [&](auto __i) { return __x[__values_to_skip + __i]; }); + if constexpr (_Index == 0 && _Total == 1) + return __x; + else if constexpr (_Index == 0) + return __intrin_bitcast<_R>(__as_vector(__x)); +#if _GLIBCXX_SIMD_X86INTRIN // {{{ + else if constexpr (sizeof(__x) == 32 + && __return_size * sizeof(_Tp) <= 16) + { + constexpr size_t __bytes_to_skip = __values_to_skip * sizeof(_Tp); + if constexpr (__bytes_to_skip == 16) + return __vector_bitcast<_Tp, __return_size>( + __hi128(__as_vector(__x))); + else + return __vector_bitcast<_Tp, __return_size>( + _mm_alignr_epi8(__hi128(__vector_bitcast<_LLong>(__x)), + __lo128(__vector_bitcast<_LLong>(__x)), + __bytes_to_skip)); + } +#endif // _GLIBCXX_SIMD_X86INTRIN }}} + else if constexpr (_Index > 0 + && 
(__values_to_skip % __return_size != 0 + || sizeof(_R) >= 8) + && (__values_to_skip + __return_size) * sizeof(_Tp) + <= 64 + && sizeof(__x) >= 16) + return __intrin_bitcast<_R>( + __shift_elements_right<__values_to_skip * sizeof(_Tp)>( + __as_vector(__x))); + else + { + _R __r = {}; + __builtin_memcpy(&__r, + reinterpret_cast<const char*>(&__x) + + sizeof(_Tp) * __values_to_skip, + __return_size * sizeof(_Tp)); + return __r; + } + } + } + +// }}} +// __extract_part(_SimdWrapper<bool, _Np>) {{{ +template <int _Index, int _Total, int _Combine = 1, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper<bool, _Np / _Total * _Combine> + __extract_part(const _SimdWrapper<bool, _Np> __x) + { + static_assert(_Combine == 1, "_Combine != 1 not implemented"); + static_assert(__have_avx512f && _Np == _Np); + static_assert(_Total >= 2 && _Index + _Combine <= _Total && _Index >= 0); + return __x._M_data >> (_Index * _Np / _Total); + } + +// }}} + +// __vector_convert {{{ +// implementation requires an index sequence +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, + index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])...}; + } + +template <typename _To, typename _From, size_t... 
_I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, _From __i, + index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])..., + static_cast<_Tp>(__i[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, _From __i, _From __j, + index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])..., + static_cast<_Tp>(__i[_I])..., static_cast<_Tp>(__j[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, _From __i, _From __j, + _From __k, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])..., + static_cast<_Tp>(__i[_I])..., static_cast<_Tp>(__j[_I])..., + static_cast<_Tp>(__k[_I])...}; + } + +template <typename _To, typename _From, size_t... 
_I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, _From __i, _From __j, + _From __k, _From __l, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])..., + static_cast<_Tp>(__i[_I])..., static_cast<_Tp>(__j[_I])..., + static_cast<_Tp>(__k[_I])..., static_cast<_Tp>(__l[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, _From __i, _From __j, + _From __k, _From __l, _From __m, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])..., + static_cast<_Tp>(__i[_I])..., static_cast<_Tp>(__j[_I])..., + static_cast<_Tp>(__k[_I])..., static_cast<_Tp>(__l[_I])..., + static_cast<_Tp>(__m[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, _From __i, _From __j, + _From __k, _From __l, _From __m, _From __n, + index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])..., + static_cast<_Tp>(__i[_I])..., static_cast<_Tp>(__j[_I])..., + static_cast<_Tp>(__k[_I])..., static_cast<_Tp>(__l[_I])..., + static_cast<_Tp>(__m[_I])..., static_cast<_Tp>(__n[_I])...}; + } + +template <typename _To, typename _From, size_t... _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, _From __i, _From __j, + _From __k, _From __l, _From __m, _From __n, _From __o, + index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])..., + static_cast<_Tp>(__i[_I])..., static_cast<_Tp>(__j[_I])..., + static_cast<_Tp>(__k[_I])..., static_cast<_Tp>(__l[_I])..., + static_cast<_Tp>(__m[_I])..., static_cast<_Tp>(__n[_I])..., + static_cast<_Tp>(__o[_I])...}; + } + +template <typename _To, typename _From, size_t... 
_I> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_From __a, _From __b, _From __c, _From __d, _From __e, + _From __f, _From __g, _From __h, _From __i, _From __j, + _From __k, _From __l, _From __m, _From __n, _From __o, + _From __p, index_sequence<_I...>) + { + using _Tp = typename _VectorTraits<_To>::value_type; + return _To{static_cast<_Tp>(__a[_I])..., static_cast<_Tp>(__b[_I])..., + static_cast<_Tp>(__c[_I])..., static_cast<_Tp>(__d[_I])..., + static_cast<_Tp>(__e[_I])..., static_cast<_Tp>(__f[_I])..., + static_cast<_Tp>(__g[_I])..., static_cast<_Tp>(__h[_I])..., + static_cast<_Tp>(__i[_I])..., static_cast<_Tp>(__j[_I])..., + static_cast<_Tp>(__k[_I])..., static_cast<_Tp>(__l[_I])..., + static_cast<_Tp>(__m[_I])..., static_cast<_Tp>(__n[_I])..., + static_cast<_Tp>(__o[_I])..., static_cast<_Tp>(__p[_I])...}; + } + +// Defer actual conversion to the overload that takes an index sequence. Note +// that this function adds zeros or drops values off the end if you don't ensure +// matching width. +template <typename _To, typename... _From, size_t _FromSize> + _GLIBCXX_SIMD_INTRINSIC constexpr _To + __vector_convert(_SimdWrapper<_From, _FromSize>... __xs) + { +#ifdef _GLIBCXX_SIMD_WORKAROUND_PR85048 + using _From0 = __first_of_pack_t<_From...>; + using _FW = _SimdWrapper<_From0, _FromSize>; + if (!_FW::_S_is_partial && !(... && __xs._M_is_constprop())) + { + if constexpr ((sizeof...(_From) & (sizeof...(_From) - 1)) + == 0) // power-of-two number of arguments + return __convert_x86<_To>(__as_vector(__xs)...); + else // append zeros and recurse until the above branch is taken + return __vector_convert<_To>(__xs..., _FW{}); + } + else +#endif + return __vector_convert<_To>( + __as_vector(__xs)..., + make_index_sequence<(sizeof...(__xs) == 1 ? std::min( + _VectorTraits<_To>::_S_full_size, int(_FromSize)) + : _FromSize)>()); + } + +// }}} +// __convert function{{{ +template <typename _To, typename _From, typename... _More> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __convert(_From __v0, _More... __vs) + { + static_assert((true && ... && is_same_v<_From, _More>) ); + if constexpr (__is_vectorizable_v<_From>) + { + using _V = typename _VectorTraits<_To>::type; + using _Tp = typename _VectorTraits<_To>::value_type; + return _V{static_cast<_Tp>(__v0), static_cast<_Tp>(__vs)...}; + } + else if constexpr (__is_vector_type_v<_From>) + return __convert<_To>(__as_wrapper(__v0), __as_wrapper(__vs)...); + else // _SimdWrapper arguments + { + constexpr size_t __input_size = _From::_S_size * (1 + sizeof...(_More)); + if constexpr (__is_vectorizable_v<_To>) + return __convert<__vector_type_t<_To, __input_size>>(__v0, __vs...); + else if constexpr (!__is_vector_type_v<_To>) + return _To(__convert<typename _To::_BuiltinType>(__v0, __vs...)); + else + { + static_assert( + sizeof...(_More) == 0 + || _VectorTraits<_To>::_S_full_size >= __input_size, + "__convert(...) requires the input to fit into the output"); + return __vector_convert<_To>(__v0, __vs...); + } + } + } + +// }}} +// __convert_all{{{ +// Converts __v into array<_To, N>, where N is _NParts if non-zero or +// otherwise deduced from _To such that N * #elements(_To) <= #elements(__v). 
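+// For example (illustrative): with __v a 16-element vector of char and
+// _To = __vector_type_t<int, 4>, N is deduced as 16/4 = 4, so
+// __convert_all<__vector_type_t<int, 4>>(__v) yields
+// array<__vector_type_t<int, 4>, 4>, each element holding four converted
+// values.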
+// Note: this function may return less than all converted elements +template <typename _To, + size_t _NParts = 0, // allows to convert fewer or more (only last + // _To, to be partially filled) than all + size_t _Offset = 0, // where to start, # of elements (not Bytes or + // Parts) + typename _From, typename _FromVT = _VectorTraits<_From>> + _GLIBCXX_SIMD_INTRINSIC auto + __convert_all(_From __v) + { + if constexpr (is_arithmetic_v<_To> && _NParts != 1) + { + static_assert(_Offset < _FromVT::_S_full_size); + constexpr auto _Np + = _NParts == 0 ? _FromVT::_S_partial_width - _Offset : _NParts; + return __generate_from_n_evaluations<_Np, array<_To, _Np>>( + [&](auto __i) { return static_cast<_To>(__v[__i + _Offset]); }); + } + else + { + static_assert(__is_vector_type_v<_To>); + using _ToVT = _VectorTraits<_To>; + if constexpr (__is_vector_type_v<_From>) + return __convert_all<_To, _NParts>(__as_wrapper(__v)); + else if constexpr (_NParts == 1) + { + static_assert(_Offset % _ToVT::_S_full_size == 0); + return array<_To, 1>{__vector_convert<_To>( + __extract_part<_Offset / _ToVT::_S_full_size, + __div_roundup(_FromVT::_S_partial_width, + _ToVT::_S_full_size)>(__v))}; + } +#if _GLIBCXX_SIMD_X86INTRIN // {{{ + else if constexpr (!__have_sse4_1 && _Offset == 0 + && is_integral_v<typename _FromVT::value_type> + && sizeof(typename _FromVT::value_type) + < sizeof(typename _ToVT::value_type) + && !(sizeof(typename _FromVT::value_type) == 4 + && is_same_v<typename _ToVT::value_type, double>)) + { + using _ToT = typename _ToVT::value_type; + using _FromT = typename _FromVT::value_type; + constexpr size_t _Np + = _NParts != 0 + ? _NParts + : (_FromVT::_S_partial_width / _ToVT::_S_full_size); + using _R = array<_To, _Np>; + // __adjust modifies its input to have _Np (use _SizeConstant) + // entries so that no unnecessary intermediate conversions are + // requested and, more importantly, no intermediate conversions are + // missing + [[maybe_unused]] auto __adjust + = [](auto __n, + auto __vv) -> _SimdWrapper<_FromT, decltype(__n)::value> { + return __vector_bitcast<_FromT, decltype(__n)::value>(__vv); + }; + [[maybe_unused]] const auto __vi = __to_intrin(__v); + auto&& __make_array = [](auto __x0, [[maybe_unused]] auto __x1) { + if constexpr (_Np == 1) + return _R{__intrin_bitcast<_To>(__x0)}; + else + return _R{__intrin_bitcast<_To>(__x0), + __intrin_bitcast<_To>(__x1)}; + }; + + if constexpr (_Np == 0) + return _R{}; + else if constexpr (sizeof(_FromT) == 1 && sizeof(_ToT) == 2) + { + static_assert(is_integral_v<_FromT>); + static_assert(is_integral_v<_ToT>); + if constexpr (is_unsigned_v<_FromT>) + return __make_array(_mm_unpacklo_epi8(__vi, __m128i()), + _mm_unpackhi_epi8(__vi, __m128i())); + else + return __make_array( + _mm_srai_epi16(_mm_unpacklo_epi8(__vi, __vi), 8), + _mm_srai_epi16(_mm_unpackhi_epi8(__vi, __vi), 8)); + } + else if constexpr (sizeof(_FromT) == 2 && sizeof(_ToT) == 4) + { + static_assert(is_integral_v<_FromT>); + if constexpr (is_floating_point_v<_ToT>) + { + const auto __ints + = __convert_all<__vector_type16_t<int>, _Np>( + __adjust(_SizeConstant<_Np * 4>(), __v)); + return __generate_from_n_evaluations<_Np, _R>( + [&](auto __i) { + return __vector_convert<_To>(__as_wrapper(__ints[__i])); + }); + } + else if constexpr (is_unsigned_v<_FromT>) + return __make_array(_mm_unpacklo_epi16(__vi, __m128i()), + _mm_unpackhi_epi16(__vi, __m128i())); + else + return __make_array( + _mm_srai_epi32(_mm_unpacklo_epi16(__vi, __vi), 16), + _mm_srai_epi32(_mm_unpackhi_epi16(__vi, __vi), 16)); 
+ } + else if constexpr (sizeof(_FromT) == 4 && sizeof(_ToT) == 8 + && is_integral_v<_FromT> && is_integral_v<_ToT>) + { + if constexpr (is_unsigned_v<_FromT>) + return __make_array(_mm_unpacklo_epi32(__vi, __m128i()), + _mm_unpackhi_epi32(__vi, __m128i())); + else + return __make_array( + _mm_unpacklo_epi32(__vi, _mm_srai_epi32(__vi, 31)), + _mm_unpackhi_epi32(__vi, _mm_srai_epi32(__vi, 31))); + } + else if constexpr (sizeof(_FromT) == 1 && sizeof(_ToT) >= 4 + && is_signed_v<_FromT>) + { + const __m128i __vv[2] = {_mm_unpacklo_epi8(__vi, __vi), + _mm_unpackhi_epi8(__vi, __vi)}; + const __vector_type_t<int, 4> __vvvv[4] = { + __vector_bitcast<int>(_mm_unpacklo_epi16(__vv[0], __vv[0])), + __vector_bitcast<int>(_mm_unpackhi_epi16(__vv[0], __vv[0])), + __vector_bitcast<int>(_mm_unpacklo_epi16(__vv[1], __vv[1])), + __vector_bitcast<int>(_mm_unpackhi_epi16(__vv[1], __vv[1]))}; + if constexpr (sizeof(_ToT) == 4) + return __generate_from_n_evaluations<_Np, _R>([&](auto __i) { + return __vector_convert<_To>( + _SimdWrapper<int, 4>(__vvvv[__i] >> 24)); + }); + else if constexpr (is_integral_v<_ToT>) + return __generate_from_n_evaluations<_Np, _R>([&](auto __i) { + const auto __signbits = __to_intrin(__vvvv[__i / 2] >> 31); + const auto __sx32 = __to_intrin(__vvvv[__i / 2] >> 24); + return __vector_bitcast<_ToT>( + __i % 2 == 0 ? _mm_unpacklo_epi32(__sx32, __signbits) + : _mm_unpackhi_epi32(__sx32, __signbits)); + }); + else + return __generate_from_n_evaluations<_Np, _R>([&](auto __i) { + const _SimdWrapper<int, 4> __int4 = __vvvv[__i / 2] >> 24; + return __vector_convert<_To>( + __i % 2 == 0 ? 
__int4 + : _SimdWrapper<int, 4>( + _mm_unpackhi_epi64(__to_intrin(__int4), + __to_intrin(__int4)))); + }); + } + else if constexpr (sizeof(_FromT) == 1 && sizeof(_ToT) == 4) + { + const auto __shorts = __convert_all<__vector_type16_t< + conditional_t<is_signed_v<_FromT>, short, unsigned short>>>( + __adjust(_SizeConstant<(_Np + 1) / 2 * 8>(), __v)); + return __generate_from_n_evaluations<_Np, _R>([&](auto __i) { + return __convert_all<_To>(__shorts[__i / 2])[__i % 2]; + }); + } + else if constexpr (sizeof(_FromT) == 2 && sizeof(_ToT) == 8 + && is_signed_v<_FromT> && is_integral_v<_ToT>) + { + const __m128i __vv[2] = {_mm_unpacklo_epi16(__vi, __vi), + _mm_unpackhi_epi16(__vi, __vi)}; + const __vector_type16_t<int> __vvvv[4] + = {__vector_bitcast<int>( + _mm_unpacklo_epi32(_mm_srai_epi32(__vv[0], 16), + _mm_srai_epi32(__vv[0], 31))), + __vector_bitcast<int>( + _mm_unpackhi_epi32(_mm_srai_epi32(__vv[0], 16), + _mm_srai_epi32(__vv[0], 31))), + __vector_bitcast<int>( + _mm_unpacklo_epi32(_mm_srai_epi32(__vv[1], 16), + _mm_srai_epi32(__vv[1], 31))), + __vector_bitcast<int>( + _mm_unpackhi_epi32(_mm_srai_epi32(__vv[1], 16), + _mm_srai_epi32(__vv[1], 31)))}; + return __generate_from_n_evaluations<_Np, _R>([&](auto __i) { + return __vector_bitcast<_ToT>(__vvvv[__i]); + }); + } + else if constexpr (sizeof(_FromT) <= 2 && sizeof(_ToT) == 8) + { + const auto __ints + = __convert_all<__vector_type16_t<conditional_t< + is_signed_v<_FromT> || is_floating_point_v<_ToT>, int, + unsigned int>>>( + __adjust(_SizeConstant<(_Np + 1) / 2 * 4>(), __v)); + return __generate_from_n_evaluations<_Np, _R>([&](auto __i) { + return __convert_all<_To>(__ints[__i / 2])[__i % 2]; + }); + } + else + __assert_unreachable<_To>(); + } +#endif // _GLIBCXX_SIMD_X86INTRIN }}} + else if constexpr ((_FromVT::_S_partial_width - _Offset) + > _ToVT::_S_full_size) + { + /* + static_assert( + (_FromVT::_S_partial_width & (_FromVT::_S_partial_width - 1)) == + 0, + "__convert_all only supports power-of-2 number of elements. + Otherwise " "the return type cannot be array<_To, N>."); + */ + constexpr size_t _NTotal + = (_FromVT::_S_partial_width - _Offset) / _ToVT::_S_full_size; + constexpr size_t _Np = _NParts == 0 ? 
_NTotal : _NParts; + static_assert( + _Np <= _NTotal + || (_Np == _NTotal + 1 + && (_FromVT::_S_partial_width - _Offset) % _ToVT::_S_full_size + > 0)); + using _R = array<_To, _Np>; + if constexpr (_Np == 1) + return _R{__vector_convert<_To>( + __extract_part<_Offset, _FromVT::_S_partial_width, + _ToVT::_S_full_size>(__v))}; + else + return __generate_from_n_evaluations<_Np, _R>([&]( + auto __i) constexpr { + auto __part + = __extract_part<__i * _ToVT::_S_full_size + _Offset, + _FromVT::_S_partial_width, + _ToVT::_S_full_size>(__v); + return __vector_convert<_To>(__part); + }); + } + else if constexpr (_Offset == 0) + return array<_To, 1>{__vector_convert<_To>(__v)}; + else + return array<_To, 1>{__vector_convert<_To>( + __extract_part<_Offset, _FromVT::_S_partial_width, + _FromVT::_S_partial_width - _Offset>(__v))}; + } + } + +// }}} + +// _GnuTraits {{{ +template <typename _Tp, typename _Mp, typename _Abi, size_t _Np> + struct _GnuTraits + { + using _IsValid = true_type; + using _SimdImpl = typename _Abi::_SimdImpl; + using _MaskImpl = typename _Abi::_MaskImpl; + + // simd and simd_mask member types {{{ + using _SimdMember = _SimdWrapper<_Tp, _Np>; + using _MaskMember = _SimdWrapper<_Mp, _Np>; + static constexpr size_t _S_simd_align = alignof(_SimdMember); + static constexpr size_t _S_mask_align = alignof(_MaskMember); + + // }}} + // size metadata {{{ + static constexpr size_t _S_full_size = _SimdMember::_S_full_size; + static constexpr bool _S_is_partial = _SimdMember::_S_is_partial; + + // }}} + // _SimdBase / base class for simd, providing extra conversions {{{ + struct _SimdBase2 + { + explicit operator __intrinsic_type_t<_Tp, _Np>() const + { + return __to_intrin(static_cast<const simd<_Tp, _Abi>*>(this)->_M_data); + } + explicit operator __vector_type_t<_Tp, _Np>() const + { + return static_cast<const simd<_Tp, _Abi>*>(this)->_M_data.__builtin(); + } + }; + + struct _SimdBase1 + { + explicit operator __intrinsic_type_t<_Tp, _Np>() const + { return __data(*static_cast<const simd<_Tp, _Abi>*>(this)); } + }; + + using _SimdBase = conditional_t< + is_same<__intrinsic_type_t<_Tp, _Np>, __vector_type_t<_Tp, _Np>>::value, + _SimdBase1, _SimdBase2>; + + // }}} + // _MaskBase {{{ + struct _MaskBase2 + { + explicit operator __intrinsic_type_t<_Tp, _Np>() const + { + return static_cast<const simd_mask<_Tp, _Abi>*>(this) + ->_M_data.__intrin(); + } + explicit operator __vector_type_t<_Tp, _Np>() const + { + return static_cast<const simd_mask<_Tp, _Abi>*>(this)->_M_data._M_data; + } + }; + + struct _MaskBase1 + { + explicit operator __intrinsic_type_t<_Tp, _Np>() const + { return __data(*static_cast<const simd_mask<_Tp, _Abi>*>(this)); } + }; + + using _MaskBase = conditional_t< + is_same<__intrinsic_type_t<_Tp, _Np>, __vector_type_t<_Tp, _Np>>::value, + _MaskBase1, _MaskBase2>; + + // }}} + // _MaskCastType {{{ + // parameter type of one explicit simd_mask constructor + class _MaskCastType + { + using _Up = __intrinsic_type_t<_Tp, _Np>; + _Up _M_data; + + public: + _MaskCastType(_Up __x) : _M_data(__x) {} + operator _MaskMember() const { return _M_data; } + }; + + // }}} + // _SimdCastType {{{ + // parameter type of one explicit simd constructor + class _SimdCastType1 + { + using _Ap = __intrinsic_type_t<_Tp, _Np>; + _SimdMember _M_data; + + public: + _SimdCastType1(_Ap __a) : _M_data(__vector_bitcast<_Tp>(__a)) {} + operator _SimdMember() const { return _M_data; } + }; + + class _SimdCastType2 + { + using _Ap = __intrinsic_type_t<_Tp, _Np>; + using _B = __vector_type_t<_Tp, _Np>; + 
_SimdMember _M_data; + + public: + _SimdCastType2(_Ap __a) : _M_data(__vector_bitcast<_Tp>(__a)) {} + _SimdCastType2(_B __b) : _M_data(__b) {} + operator _SimdMember() const { return _M_data; } + }; + + using _SimdCastType = conditional_t< + is_same<__intrinsic_type_t<_Tp, _Np>, __vector_type_t<_Tp, _Np>>::value, + _SimdCastType1, _SimdCastType2>; + //}}} + }; + +// }}} +struct _CommonImplX86; +struct _CommonImplNeon; +struct _CommonImplBuiltin; +template <typename _Abi> struct _SimdImplBuiltin; +template <typename _Abi> struct _MaskImplBuiltin; +template <typename _Abi> struct _SimdImplX86; +template <typename _Abi> struct _MaskImplX86; +template <typename _Abi> struct _SimdImplNeon; +template <typename _Abi> struct _MaskImplNeon; +template <typename _Abi> struct _SimdImplPpc; + +// simd_abi::_VecBuiltin {{{ +template <int _UsedBytes> + struct simd_abi::_VecBuiltin + { + template <typename _Tp> + static constexpr size_t _S_size = _UsedBytes / sizeof(_Tp); + + // validity traits {{{ + struct _IsValidAbiTag : __bool_constant<(_UsedBytes > 1)> {}; + + template <typename _Tp> + struct _IsValidSizeFor + : __bool_constant<(_UsedBytes / sizeof(_Tp) > 1 + && _UsedBytes % sizeof(_Tp) == 0 + && _UsedBytes <= __vectorized_sizeof<_Tp>() + && (!__have_avx512f || _UsedBytes <= 32))> {}; + + template <typename _Tp> + struct _IsValid : conjunction<_IsValidAbiTag, __is_vectorizable<_Tp>, + _IsValidSizeFor<_Tp>> {}; + + template <typename _Tp> + static constexpr bool _S_is_valid_v = _IsValid<_Tp>::value; + + // }}} + // _SimdImpl/_MaskImpl {{{ +#if _GLIBCXX_SIMD_X86INTRIN + using _CommonImpl = _CommonImplX86; + using _SimdImpl = _SimdImplX86<_VecBuiltin<_UsedBytes>>; + using _MaskImpl = _MaskImplX86<_VecBuiltin<_UsedBytes>>; +#elif _GLIBCXX_SIMD_HAVE_NEON + using _CommonImpl = _CommonImplNeon; + using _SimdImpl = _SimdImplNeon<_VecBuiltin<_UsedBytes>>; + using _MaskImpl = _MaskImplNeon<_VecBuiltin<_UsedBytes>>; +#else + using _CommonImpl = _CommonImplBuiltin; +#ifdef __ALTIVEC__ + using _SimdImpl = _SimdImplPpc<_VecBuiltin<_UsedBytes>>; +#else + using _SimdImpl = _SimdImplBuiltin<_VecBuiltin<_UsedBytes>>; +#endif + using _MaskImpl = _MaskImplBuiltin<_VecBuiltin<_UsedBytes>>; +#endif + + // }}} + // __traits {{{ + template <typename _Tp> + using _MaskValueType = __int_for_sizeof_t<_Tp>; + + template <typename _Tp> + using __traits + = conditional_t<_S_is_valid_v<_Tp>, + _GnuTraits<_Tp, _MaskValueType<_Tp>, + _VecBuiltin<_UsedBytes>, _S_size<_Tp>>, + _InvalidTraits>; + + //}}} + // size metadata {{{ + template <typename _Tp> + static constexpr size_t _S_full_size = __traits<_Tp>::_S_full_size; + + template <typename _Tp> + static constexpr bool _S_is_partial = __traits<_Tp>::_S_is_partial; + + // }}} + // implicit masks {{{ + template <typename _Tp> + using _MaskMember = _SimdWrapper<_MaskValueType<_Tp>, _S_size<_Tp>>; + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp> + _S_implicit_mask() + { + using _UV = typename _MaskMember<_Tp>::_BuiltinType; + if constexpr (!_MaskMember<_Tp>::_S_is_partial) + return ~_UV(); + else + { + constexpr auto __size = _S_size<_Tp>; + _GLIBCXX_SIMD_USE_CONSTEXPR auto __r = __generate_vector<_UV>( + [](auto __i) constexpr { return __i < __size ? 
-1 : 0; }); + return __r; + } + } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr __intrinsic_type_t<_Tp, + _S_size<_Tp>> + _S_implicit_mask_intrin() + { + return __to_intrin( + __vector_bitcast<_Tp>(_S_implicit_mask<_Tp>()._M_data)); + } + + template <typename _TW, typename _TVT = _VectorTraits<_TW>> + _GLIBCXX_SIMD_INTRINSIC static constexpr _TW _S_masked(_TW __x) + { + using _Tp = typename _TVT::value_type; + if constexpr (!_MaskMember<_Tp>::_S_is_partial) + return __x; + else + return __and(__as_vector(__x), + __vector_bitcast<_Tp>(_S_implicit_mask<_Tp>())); + } + + template <typename _TW, typename _TVT = _VectorTraits<_TW>> + _GLIBCXX_SIMD_INTRINSIC static constexpr auto + __make_padding_nonzero(_TW __x) + { + using _Tp = typename _TVT::value_type; + if constexpr (!_S_is_partial<_Tp>) + return __x; + else + { + _GLIBCXX_SIMD_USE_CONSTEXPR auto __implicit_mask + = __vector_bitcast<_Tp>(_S_implicit_mask<_Tp>()); + if constexpr (is_integral_v<_Tp>) + return __or(__x, ~__implicit_mask); + else + { + _GLIBCXX_SIMD_USE_CONSTEXPR auto __one + = __andnot(__implicit_mask, + __vector_broadcast<_S_full_size<_Tp>>(_Tp(1))); + // it's not enough to return `x | 1_in_padding` because the + // padding in x might be inf or nan (independent of + // __FINITE_MATH_ONLY__, because it's about padding bits) + return __or(__and(__x, __implicit_mask), __one); + } + } + } + // }}} + }; + +// }}} +// simd_abi::_VecBltnBtmsk {{{ +template <int _UsedBytes> + struct simd_abi::_VecBltnBtmsk + { + template <typename _Tp> + static constexpr size_t _S_size = _UsedBytes / sizeof(_Tp); + + // validity traits {{{ + struct _IsValidAbiTag : __bool_constant<(_UsedBytes > 1)> {}; + + template <typename _Tp> + struct _IsValidSizeFor + : __bool_constant<(_UsedBytes / sizeof(_Tp) > 1 + && _UsedBytes % sizeof(_Tp) == 0 && _UsedBytes <= 64 + && (_UsedBytes > 32 || __have_avx512vl))> {}; + + // Bitmasks require at least AVX512F. If sizeof(_Tp) < 4, AVX512BW is also + // required. 
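+ // For example (editorial illustration): with -mavx512f alone,
+ // _VecBltnBtmsk<64> is usable for int, float, and double lanes (their
+ // compares produce __mmask16 or __mmask8 values), whereas short and char
+ // lanes additionally need -mavx512bw for the __mmask32 and __mmask64
+ // operations.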
+ template <typename _Tp> + struct _IsValid + : conjunction< + _IsValidAbiTag, __bool_constant<__have_avx512f>, + __bool_constant<__have_avx512bw || (sizeof(_Tp) >= 4)>, + __bool_constant<(__vectorized_sizeof<_Tp>() > sizeof(_Tp))>, + _IsValidSizeFor<_Tp>> {}; + + template <typename _Tp> + static constexpr bool _S_is_valid_v = _IsValid<_Tp>::value; + + // }}} + // simd/_MaskImpl {{{ + #if _GLIBCXX_SIMD_X86INTRIN + using _CommonImpl = _CommonImplX86; + using _SimdImpl = _SimdImplX86<_VecBltnBtmsk<_UsedBytes>>; + using _MaskImpl = _MaskImplX86<_VecBltnBtmsk<_UsedBytes>>; + #else + template <int> + struct _MissingImpl; + + using _CommonImpl = _MissingImpl<_UsedBytes>; + using _SimdImpl = _MissingImpl<_UsedBytes>; + using _MaskImpl = _MissingImpl<_UsedBytes>; + #endif + + // }}} + // __traits {{{ + template <typename _Tp> + using _MaskMember = _SimdWrapper<bool, _S_size<_Tp>>; + + template <typename _Tp> + using __traits = conditional_t< + _S_is_valid_v<_Tp>, + _GnuTraits<_Tp, bool, _VecBltnBtmsk<_UsedBytes>, _S_size<_Tp>>, + _InvalidTraits>; + + //}}} + // size metadata {{{ + template <typename _Tp> + static constexpr size_t _S_full_size = __traits<_Tp>::_S_full_size; + template <typename _Tp> + static constexpr bool _S_is_partial = __traits<_Tp>::_S_is_partial; + + // }}} + // implicit mask {{{ + private: + template <typename _Tp> + using _ImplicitMask = _SimdWrapper<bool, _S_size<_Tp>>; + + public: + template <size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr __bool_storage_member_type_t<_Np> + __implicit_mask_n() + { + using _Tp = __bool_storage_member_type_t<_Np>; + return _Np < sizeof(_Tp) * __CHAR_BIT__ ? _Tp((1ULL << _Np) - 1) : ~_Tp(); + } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr _ImplicitMask<_Tp> + _S_implicit_mask() + { return __implicit_mask_n<_S_size<_Tp>>(); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr __bool_storage_member_type_t< + _S_size<_Tp>> + _S_implicit_mask_intrin() + { return __implicit_mask_n<_S_size<_Tp>>(); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_masked(_SimdWrapper<_Tp, _Np> __x) + { + if constexpr (is_same_v<_Tp, bool>) + if constexpr (_Np < 8 || (_Np & (_Np - 1)) != 0) + return _MaskImpl::_S_bit_and( + __x, _SimdWrapper<_Tp, _Np>( + __bool_storage_member_type_t<_Np>((1ULL << _Np) - 1))); + else + return __x; + else + return _S_masked(__x._M_data); + } + + template <typename _TV> + _GLIBCXX_SIMD_INTRINSIC static constexpr _TV + _S_masked(_TV __x) + { + using _Tp = typename _VectorTraits<_TV>::value_type; + static_assert( + !__is_bitmask_v<_TV>, + "_VecBltnBtmsk::_S_masked cannot work on bitmasks, since it doesn't " + "know the number of elements. 
Use _SimdWrapper<bool, N> instead."); + if constexpr (_S_is_partial<_Tp>) + { + constexpr size_t _Np = _S_size<_Tp>; + return __make_dependent_t<_TV, _CommonImpl>::_S_blend( + _S_implicit_mask<_Tp>(), _SimdWrapper<_Tp, _Np>(), + _SimdWrapper<_Tp, _Np>(__x)); + } + else + return __x; + } + + template <typename _TV, typename _TVT = _VectorTraits<_TV>> + _GLIBCXX_SIMD_INTRINSIC static constexpr auto + __make_padding_nonzero(_TV __x) + { + using _Tp = typename _TVT::value_type; + if constexpr (!_S_is_partial<_Tp>) + return __x; + else + { + constexpr size_t _Np = _S_size<_Tp>; + if constexpr (is_integral_v<typename _TVT::value_type>) + return __x + | __generate_vector<_Tp, _S_full_size<_Tp>>( + [](auto __i) -> _Tp { + if (__i < _Np) + return 0; + else + return 1; + }); + else + return __make_dependent_t<_TV, _CommonImpl>::_S_blend( + _S_implicit_mask<_Tp>(), + _SimdWrapper<_Tp, _Np>( + __vector_broadcast<_S_full_size<_Tp>>(_Tp(1))), + _SimdWrapper<_Tp, _Np>(__x)) + ._M_data; + } + } + + // }}} + }; + +//}}} +// _CommonImplBuiltin {{{ +struct _CommonImplBuiltin +{ + // _S_converts_via_decomposition{{{ + // This lists all cases where a __vector_convert needs to fall back to + // conversion of individual scalars (i.e. decompose the input vector into + // scalars, convert, compose output vector). In those cases, _S_masked_load & + // _S_masked_store prefer to use the _S_bit_iteration implementation. + template <typename _From, typename _To, size_t _ToSize> + static inline constexpr bool __converts_via_decomposition_v + = sizeof(_From) != sizeof(_To); + + // }}} + // _S_load{{{ + template <typename _Tp, size_t _Np, size_t _Bytes = _Np * sizeof(_Tp)> + _GLIBCXX_SIMD_INTRINSIC static __vector_type_t<_Tp, _Np> + _S_load(const void* __p) + { + static_assert(_Np > 1); + static_assert(_Bytes % sizeof(_Tp) == 0); + using _Rp = __vector_type_t<_Tp, _Np>; + if constexpr (sizeof(_Rp) == _Bytes) + { + _Rp __r; + __builtin_memcpy(&__r, __p, _Bytes); + return __r; + } + else + { +#ifdef _GLIBCXX_SIMD_WORKAROUND_PR90424 + using _Up = conditional_t< + is_integral_v<_Tp>, + conditional_t<_Bytes % 4 == 0, + conditional_t<_Bytes % 8 == 0, long long, int>, + conditional_t<_Bytes % 2 == 0, short, signed char>>, + conditional_t<(_Bytes < 8 || _Np % 2 == 1 || _Np == 2), _Tp, + double>>; + using _V = __vector_type_t<_Up, _Np * sizeof(_Tp) / sizeof(_Up)>; + if constexpr (sizeof(_V) != sizeof(_Rp)) + { // on i386 with 4 < _Bytes <= 8 + _Rp __r{}; + __builtin_memcpy(&__r, __p, _Bytes); + return __r; + } + else +#else // _GLIBCXX_SIMD_WORKAROUND_PR90424 + using _V = _Rp; +#endif // _GLIBCXX_SIMD_WORKAROUND_PR90424 + { + _V __r{}; + static_assert(_Bytes <= sizeof(_V)); + __builtin_memcpy(&__r, __p, _Bytes); + return reinterpret_cast<_Rp>(__r); + } + } + } + + // }}} + // _S_store {{{ + template <size_t _ReqBytes = 0, typename _TV> + _GLIBCXX_SIMD_INTRINSIC static void _S_store(_TV __x, void* __addr) + { + constexpr size_t _Bytes = _ReqBytes == 0 ? 
sizeof(__x) : _ReqBytes; + static_assert(sizeof(__x) >= _Bytes); + + if constexpr (__is_vector_type_v<_TV>) + { + using _Tp = typename _VectorTraits<_TV>::value_type; + constexpr size_t _Np = _Bytes / sizeof(_Tp); + static_assert(_Np * sizeof(_Tp) == _Bytes); + +#ifdef _GLIBCXX_SIMD_WORKAROUND_PR90424 + using _Up = conditional_t< + (is_integral_v<_Tp> || _Bytes < 4), + conditional_t<(sizeof(__x) > sizeof(long long)), long long, _Tp>, + float>; + const auto __v = __vector_bitcast<_Up>(__x); +#else // _GLIBCXX_SIMD_WORKAROUND_PR90424 + const __vector_type_t<_Tp, _Np> __v = __x; +#endif // _GLIBCXX_SIMD_WORKAROUND_PR90424 + + if constexpr ((_Bytes & (_Bytes - 1)) != 0) + { + constexpr size_t _MoreBytes = std::__bit_ceil(_Bytes); + alignas(decltype(__v)) char __tmp[_MoreBytes]; + __builtin_memcpy(__tmp, &__v, _MoreBytes); + __builtin_memcpy(__addr, __tmp, _Bytes); + } + else + __builtin_memcpy(__addr, &__v, _Bytes); + } + else + __builtin_memcpy(__addr, &__x, _Bytes); + } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void _S_store(_SimdWrapper<_Tp, _Np> __x, + void* __addr) + { _S_store<_Np * sizeof(_Tp)>(__x._M_data, __addr); } + + // }}} + // _S_store_bool_array(_BitMask) {{{ + template <size_t _Np, bool _Sanitized> + _GLIBCXX_SIMD_INTRINSIC static constexpr void + _S_store_bool_array(_BitMask<_Np, _Sanitized> __x, bool* __mem) + { + if constexpr (_Np == 1) + __mem[0] = __x[0]; + else if constexpr (_Np == 2) + { + short __bool2 = (__x._M_to_bits() * 0x81) & 0x0101; + _S_store<_Np>(__bool2, __mem); + } + else if constexpr (_Np == 3) + { + int __bool3 = (__x._M_to_bits() * 0x4081) & 0x010101; + _S_store<_Np>(__bool3, __mem); + } + else + { + __execute_n_times<__div_roundup(_Np, 4)>([&](auto __i) { + constexpr int __offset = __i * 4; + constexpr int __remaining = _Np - __offset; + if constexpr (__remaining > 4 && __remaining <= 7) + { + const _ULLong __bool7 + = (__x.template _M_extract<__offset>()._M_to_bits() + * 0x40810204081ULL) + & 0x0101010101010101ULL; + _S_store<__remaining>(__bool7, __mem + __offset); + } + else if constexpr (__remaining >= 4) + { + int __bits = __x.template _M_extract<__offset>()._M_to_bits(); + if constexpr (__remaining > 7) + __bits &= 0xf; + const int __bool4 = (__bits * 0x204081) & 0x01010101; + _S_store<4>(__bool4, __mem + __offset); + } + }); + } + } + + // }}} + // _S_blend{{{ + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr auto + _S_blend(_SimdWrapper<__int_for_sizeof_t<_Tp>, _Np> __k, + _SimdWrapper<_Tp, _Np> __at0, _SimdWrapper<_Tp, _Np> __at1) + { return __k._M_data ? 
__at1._M_data : __at0._M_data; } + + // }}} +}; + +// }}} +// _SimdImplBuiltin {{{1 +template <typename _Abi> + struct _SimdImplBuiltin + { + // member types {{{2 + template <typename _Tp> + static constexpr size_t _S_max_store_size = 16; + + using abi_type = _Abi; + + template <typename _Tp> + using _TypeTag = _Tp*; + + template <typename _Tp> + using _SimdMember = typename _Abi::template __traits<_Tp>::_SimdMember; + + template <typename _Tp> + using _MaskMember = typename _Abi::template _MaskMember<_Tp>; + + template <typename _Tp> + static constexpr size_t _S_size = _Abi::template _S_size<_Tp>; + + template <typename _Tp> + static constexpr size_t _S_full_size = _Abi::template _S_full_size<_Tp>; + + using _CommonImpl = typename _Abi::_CommonImpl; + using _SuperImpl = typename _Abi::_SimdImpl; + using _MaskImpl = typename _Abi::_MaskImpl; + + // _M_make_simd(_SimdWrapper/__intrinsic_type_t) {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static simd<_Tp, _Abi> + _M_make_simd(_SimdWrapper<_Tp, _Np> __x) + { return {__private_init, __x}; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static simd<_Tp, _Abi> + _M_make_simd(__intrinsic_type_t<_Tp, _Np> __x) + { return {__private_init, __vector_bitcast<_Tp>(__x)}; } + + // _S_broadcast {{{2 + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdMember<_Tp> + _S_broadcast(_Tp __x) noexcept + { return __vector_broadcast<_S_full_size<_Tp>>(__x); } + + // _S_generator {{{2 + template <typename _Fp, typename _Tp> + inline static constexpr _SimdMember<_Tp> _S_generator(_Fp&& __gen, + _TypeTag<_Tp>) + { + return __generate_vector<_Tp, _S_full_size<_Tp>>([&]( + auto __i) constexpr { + if constexpr (__i < _S_size<_Tp>) + return __gen(__i); + else + return 0; + }); + } + + // _S_load {{{2 + template <typename _Tp, typename _Up> + _GLIBCXX_SIMD_INTRINSIC static _SimdMember<_Tp> + _S_load(const _Up* __mem, _TypeTag<_Tp>) noexcept + { + constexpr size_t _Np = _S_size<_Tp>; + constexpr size_t __max_load_size + = (sizeof(_Up) >= 4 && __have_avx512f) || __have_avx512bw ? 64 + : (is_floating_point_v<_Up> && __have_avx) || __have_avx2 ? 32 + : 16; + constexpr size_t __bytes_to_load = sizeof(_Up) * _Np; + if constexpr (sizeof(_Up) > 8) + return __generate_vector<_Tp, _SimdMember<_Tp>::_S_full_size>([&]( + auto __i) constexpr { + return static_cast<_Tp>(__i < _Np ? __mem[__i] : 0); + }); + else if constexpr (is_same_v<_Up, _Tp>) + return _CommonImpl::template _S_load<_Tp, _S_full_size<_Tp>, + _Np * sizeof(_Tp)>(__mem); + else if constexpr (__bytes_to_load <= __max_load_size) + return __convert<_SimdMember<_Tp>>( + _CommonImpl::template _S_load<_Up, _Np>(__mem)); + else if constexpr (__bytes_to_load % __max_load_size == 0) + { + constexpr size_t __n_loads = __bytes_to_load / __max_load_size; + constexpr size_t __elements_per_load = _Np / __n_loads; + return __call_with_n_evaluations<__n_loads>( + [](auto... __uncvted) { + return __convert<_SimdMember<_Tp>>(__uncvted...); + }, + [&](auto __i) { + return _CommonImpl::template _S_load<_Up, __elements_per_load>( + __mem + __i * __elements_per_load); + }); + } + else if constexpr (__bytes_to_load % (__max_load_size / 2) == 0 + && __max_load_size > 16) + { // e.g. int[] -> <char, 12> with AVX2 + constexpr size_t __n_loads + = __bytes_to_load / (__max_load_size / 2); + constexpr size_t __elements_per_load = _Np / __n_loads; + return __call_with_n_evaluations<__n_loads>( + [](auto... 
__uncvted) { + return __convert<_SimdMember<_Tp>>(__uncvted...); + }, + [&](auto __i) { + return _CommonImpl::template _S_load<_Up, __elements_per_load>( + __mem + __i * __elements_per_load); + }); + } + else // e.g. int[] -> <char, 9> + return __call_with_subscripts( + __mem, make_index_sequence<_Np>(), [](auto... __args) { + return __vector_type_t<_Tp, _S_full_size<_Tp>>{ + static_cast<_Tp>(__args)...}; + }); + } + + // _S_masked_load {{{2 + template <typename _Tp, size_t _Np, typename _Up> + static inline _SimdWrapper<_Tp, _Np> + _S_masked_load(_SimdWrapper<_Tp, _Np> __merge, _MaskMember<_Tp> __k, + const _Up* __mem) noexcept + { + _BitOps::_S_bit_iteration(_MaskImpl::_S_to_bits(__k), [&](auto __i) { + __merge._M_set(__i, static_cast<_Tp>(__mem[__i])); + }); + return __merge; + } + + // _S_store {{{2 + template <typename _Tp, typename _Up> + _GLIBCXX_SIMD_INTRINSIC static void + _S_store(_SimdMember<_Tp> __v, _Up* __mem, _TypeTag<_Tp>) noexcept + { + // TODO: converting int -> "smaller int" can be optimized with AVX512 + constexpr size_t _Np = _S_size<_Tp>; + constexpr size_t __max_store_size + = _SuperImpl::template _S_max_store_size<_Up>; + if constexpr (sizeof(_Up) > 8) + __execute_n_times<_Np>([&](auto __i) constexpr { + __mem[__i] = __v[__i]; + }); + else if constexpr (is_same_v<_Up, _Tp>) + _CommonImpl::_S_store(__v, __mem); + else if constexpr (sizeof(_Up) * _Np <= __max_store_size) + _CommonImpl::_S_store(_SimdWrapper<_Up, _Np>(__convert<_Up>(__v)), + __mem); + else + { + constexpr size_t __vsize = __max_store_size / sizeof(_Up); + // round up to convert the last partial vector as well: + constexpr size_t __stores = __div_roundup(_Np, __vsize); + constexpr size_t __full_stores = _Np / __vsize; + using _V = __vector_type_t<_Up, __vsize>; + const array<_V, __stores> __converted + = __convert_all<_V, __stores>(__v); + __execute_n_times<__full_stores>([&](auto __i) constexpr { + _CommonImpl::_S_store(__converted[__i], __mem + __i * __vsize); + }); + if constexpr (__full_stores < __stores) + _CommonImpl::template _S_store<(_Np - __full_stores * __vsize) + * sizeof(_Up)>( + __converted[__full_stores], __mem + __full_stores * __vsize); + } + } + + // _S_masked_store_nocvt {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_store_nocvt(_SimdWrapper<_Tp, _Np> __v, _Tp* __mem, + _MaskMember<_Tp> __k) + { + _BitOps::_S_bit_iteration( + _MaskImpl::_S_to_bits(__k), [&](auto __i) constexpr { + __mem[__i] = __v[__i]; + }); + } + + // _S_masked_store {{{2 + template <typename _TW, typename _TVT = _VectorTraits<_TW>, + typename _Tp = typename _TVT::value_type, typename _Up> + static inline void + _S_masked_store(const _TW __v, _Up* __mem, const _MaskMember<_Tp> __k) + noexcept + { + constexpr size_t _TV_size = _S_size<_Tp>; + [[maybe_unused]] const auto __vi = __to_intrin(__v); + constexpr size_t __max_store_size + = _SuperImpl::template _S_max_store_size<_Up>; + if constexpr ( + is_same_v< + _Tp, + _Up> || (is_integral_v<_Tp> && is_integral_v<_Up> && sizeof(_Tp) == sizeof(_Up))) + { + // bitwise or no conversion, reinterpret: + const _MaskMember<_Up> __kk = [&]() { + if constexpr (__is_bitmask_v<decltype(__k)>) + return _MaskMember<_Up>(__k._M_data); + else + return __wrapper_bitcast<__int_for_sizeof_t<_Up>>(__k); + }(); + _SuperImpl::_S_masked_store_nocvt(__wrapper_bitcast<_Up>(__v), + __mem, __kk); + } + else if constexpr (__vectorized_sizeof<_Up>() > sizeof(_Up) + && !_CommonImpl:: + template __converts_via_decomposition_v< + _Tp, _Up, 
__max_store_size>) + { // conversion via decomposition is better handled via the + // bit_iteration fallback below + constexpr size_t _UW_size + = std::min(_TV_size, __max_store_size / sizeof(_Up)); + static_assert(_UW_size <= _TV_size); + using _UW = _SimdWrapper<_Up, _UW_size>; + using _UV = __vector_type_t<_Up, _UW_size>; + using _UAbi = simd_abi::deduce_t<_Up, _UW_size>; + if constexpr (_UW_size == _TV_size) // one convert+store + { + const _UW __converted = __convert<_UW>(__v); + _SuperImpl::_S_masked_store_nocvt( + __converted, __mem, + _UAbi::_MaskImpl::template _S_convert< + __int_for_sizeof_t<_Up>>(__k)); + } + else + { + static_assert(_UW_size * sizeof(_Up) == __max_store_size); + constexpr size_t _NFullStores = _TV_size / _UW_size; + constexpr size_t _NAllStores + = __div_roundup(_TV_size, _UW_size); + constexpr size_t _NParts = _S_full_size<_Tp> / _UW_size; + const array<_UV, _NAllStores> __converted + = __convert_all<_UV, _NAllStores>(__v); + __execute_n_times<_NFullStores>([&](auto __i) { + _SuperImpl::_S_masked_store_nocvt( + _UW(__converted[__i]), __mem + __i * _UW_size, + _UAbi::_MaskImpl::template _S_convert< + __int_for_sizeof_t<_Up>>( + __extract_part<__i, _NParts>(__k.__as_full_vector()))); + }); + if constexpr (_NAllStores + > _NFullStores) // one partial at the end + _SuperImpl::_S_masked_store_nocvt( + _UW(__converted[_NFullStores]), + __mem + _NFullStores * _UW_size, + _UAbi::_MaskImpl::template _S_convert< + __int_for_sizeof_t<_Up>>( + __extract_part<_NFullStores, _NParts>( + __k.__as_full_vector()))); + } + } + else + _BitOps::_S_bit_iteration( + _MaskImpl::_S_to_bits(__k), [&](auto __i) constexpr { + __mem[__i] = static_cast<_Up>(__v[__i]); + }); + } + + // _S_complement {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_complement(_SimdWrapper<_Tp, _Np> __x) noexcept + { return ~__x._M_data; } + + // _S_unary_minus {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_unary_minus(_SimdWrapper<_Tp, _Np> __x) noexcept + { + // GCC doesn't use the psign instructions, but pxor & psub seem to be + // just as good a choice as pcmpeqd & psign. So meh.
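// (Editorial sketch, not part of the patch: the negation below can be
// observed in isolation with GNU vector extensions; assuming x86 with SSE2,
// GCC typically lowers it to pxor, to materialize a zero, plus psub:
//   typedef int _V4int __attribute__((vector_size(16)));
//   _V4int __negate(_V4int __v) { return -__v; }
// )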
+ return -__x._M_data; + } + + // arithmetic operators {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_plus(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data + __y._M_data; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_minus(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data - __y._M_data; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_multiplies(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data * __y._M_data; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_divides(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { + // Note that division by 0 is always UB, so we must ensure we avoid the + // case for partial registers + if constexpr (!_Abi::template _S_is_partial<_Tp>) + return __x._M_data / __y._M_data; + else + return __x._M_data / _Abi::__make_padding_nonzero(__y._M_data); + } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_modulus(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { + if constexpr (!_Abi::template _S_is_partial<_Tp>) + return __x._M_data % __y._M_data; + else + return __as_vector(__x) + % _Abi::__make_padding_nonzero(__as_vector(__y)); + } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_and(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __and(__x, __y); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_or(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __or(__x, __y); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_xor(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __xor(__x, __y); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_bit_shift_left(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data << __y._M_data; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_bit_shift_right(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data >> __y._M_data; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_shift_left(_SimdWrapper<_Tp, _Np> __x, int __y) + { return __x._M_data << __y; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_shift_right(_SimdWrapper<_Tp, _Np> __x, int __y) + { return __x._M_data >> __y; } + + // compares {{{2 + // _S_equal_to {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp> + _S_equal_to(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data == __y._M_data; } + + // _S_not_equal_to {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp> + _S_not_equal_to(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data != __y._M_data; } + + // _S_less {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp> + 
_S_less(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data < __y._M_data; } + + // _S_less_equal {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp> + _S_less_equal(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { return __x._M_data <= __y._M_data; } + + // _S_negate {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp> + _S_negate(_SimdWrapper<_Tp, _Np> __x) noexcept + { return !__x._M_data; } + + // _S_min, _S_max, _S_minmax {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_NORMAL_MATH _GLIBCXX_SIMD_INTRINSIC static constexpr + _SimdWrapper<_Tp, _Np> + _S_min(_SimdWrapper<_Tp, _Np> __a, _SimdWrapper<_Tp, _Np> __b) + { return __a._M_data < __b._M_data ? __a._M_data : __b._M_data; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_NORMAL_MATH _GLIBCXX_SIMD_INTRINSIC static constexpr + _SimdWrapper<_Tp, _Np> + _S_max(_SimdWrapper<_Tp, _Np> __a, _SimdWrapper<_Tp, _Np> __b) + { return __a._M_data > __b._M_data ? __a._M_data : __b._M_data; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_NORMAL_MATH _GLIBCXX_SIMD_INTRINSIC static constexpr + pair<_SimdWrapper<_Tp, _Np>, _SimdWrapper<_Tp, _Np>> + _S_minmax(_SimdWrapper<_Tp, _Np> __a, _SimdWrapper<_Tp, _Np> __b) + { + return {__a._M_data < __b._M_data ? __a._M_data : __b._M_data, + __a._M_data < __b._M_data ? __b._M_data : __a._M_data}; + } + + // reductions {{{2 + template <size_t _Np, size_t... _Is, size_t... _Zeros, typename _Tp, + typename _BinaryOperation> + _GLIBCXX_SIMD_INTRINSIC static _Tp + _S_reduce_partial(index_sequence<_Is...>, index_sequence<_Zeros...>, + simd<_Tp, _Abi> __x, _BinaryOperation&& __binary_op) + { + using _V = __vector_type_t<_Tp, _Np / 2>; + static_assert(sizeof(_V) <= sizeof(__x)); + // _S_full_size is the size of the smallest native SIMD register that + // can store _Np/2 elements: + using _FullSimd = __deduced_simd<_Tp, _VectorTraits<_V>::_S_full_size>; + using _HalfSimd = __deduced_simd<_Tp, _Np / 2>; + const auto __xx = __as_vector(__x); + return _HalfSimd::abi_type::_SimdImpl::_S_reduce( + static_cast<_HalfSimd>(__as_vector(__binary_op( + static_cast<_FullSimd>(__intrin_bitcast<_V>(__xx)), + static_cast<_FullSimd>(__intrin_bitcast<_V>( + __vector_permute<(_Np / 2 + _Is)..., (int(_Zeros * 0) - 1)...>( + __xx)))))), + __binary_op); + } + + template <typename _Tp, typename _BinaryOperation> + _GLIBCXX_SIMD_INTRINSIC static constexpr _Tp + _S_reduce(simd<_Tp, _Abi> __x, _BinaryOperation&& __binary_op) + { + constexpr size_t _Np = simd_size_v<_Tp, _Abi>; + if constexpr (_Np == 1) + return __x[0]; + else if constexpr (_Np == 2) + return __binary_op(simd<_Tp, simd_abi::scalar>(__x[0]), + simd<_Tp, simd_abi::scalar>(__x[1]))[0]; + else if constexpr (_Abi::template _S_is_partial<_Tp>) //{{{ + { + [[maybe_unused]] constexpr auto __full_size + = _Abi::template _S_full_size<_Tp>; + if constexpr (_Np == 3) + return __binary_op( + __binary_op(simd<_Tp, simd_abi::scalar>(__x[0]), + simd<_Tp, simd_abi::scalar>(__x[1])), + simd<_Tp, simd_abi::scalar>(__x[2]))[0]; + else if constexpr (is_same_v<__remove_cvref_t<_BinaryOperation>, + plus<>>) + { + using _Ap = simd_abi::deduce_t<_Tp, __full_size>; + return _Ap::_SimdImpl::_S_reduce( + simd<_Tp, _Ap>(__private_init, + _Abi::_S_masked(__as_vector(__x))), + __binary_op); + } + else if constexpr (is_same_v<__remove_cvref_t<_BinaryOperation>, + multiplies<>>) + { + using _Ap = simd_abi::deduce_t<_Tp, __full_size>; + using 
_TW = _SimdWrapper<_Tp, __full_size>; + _GLIBCXX_SIMD_USE_CONSTEXPR auto __implicit_mask_full + = _Abi::template _S_implicit_mask<_Tp>().__as_full_vector(); + _GLIBCXX_SIMD_USE_CONSTEXPR _TW __one + = __vector_broadcast<__full_size>(_Tp(1)); + const _TW __x_full = __data(__x).__as_full_vector(); + const _TW __x_padded_with_ones + = _Ap::_CommonImpl::_S_blend(__implicit_mask_full, __one, + __x_full); + return _Ap::_SimdImpl::_S_reduce( + simd<_Tp, _Ap>(__private_init, __x_padded_with_ones), + __binary_op); + } + else if constexpr (_Np & 1) + { + using _Ap = simd_abi::deduce_t<_Tp, _Np - 1>; + return __binary_op( + simd<_Tp, simd_abi::scalar>(_Ap::_SimdImpl::_S_reduce( + simd<_Tp, _Ap>( + __intrin_bitcast<__vector_type_t<_Tp, _Np - 1>>( + __as_vector(__x))), + __binary_op)), + simd<_Tp, simd_abi::scalar>(__x[_Np - 1]))[0]; + } + else + return _S_reduce_partial<_Np>( + make_index_sequence<_Np / 2>(), + make_index_sequence<__full_size - _Np / 2>(), __x, __binary_op); + } //}}} + else if constexpr (sizeof(__x) == 16) //{{{ + { + if constexpr (_Np == 16) + { + const auto __y = __data(__x); + __x = __binary_op( + _M_make_simd<_Tp, _Np>( + __vector_permute<0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, + 7, 7>(__y)), + _M_make_simd<_Tp, _Np>( + __vector_permute<8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, + 14, 14, 15, 15>(__y))); + } + if constexpr (_Np >= 8) + { + const auto __y = __vector_bitcast<short>(__data(__x)); + __x = __binary_op( + _M_make_simd<_Tp, _Np>(__vector_bitcast<_Tp>( + __vector_permute<0, 0, 1, 1, 2, 2, 3, 3>(__y))), + _M_make_simd<_Tp, _Np>(__vector_bitcast<_Tp>( + __vector_permute<4, 4, 5, 5, 6, 6, 7, 7>(__y)))); + } + if constexpr (_Np >= 4) + { + using _Up = conditional_t<is_floating_point_v<_Tp>, float, int>; + const auto __y = __vector_bitcast<_Up>(__data(__x)); + __x = __binary_op(__x, + _M_make_simd<_Tp, _Np>(__vector_bitcast<_Tp>( + __vector_permute<3, 2, 1, 0>(__y)))); + } + using _Up = conditional_t<is_floating_point_v<_Tp>, double, _LLong>; + const auto __y = __vector_bitcast<_Up>(__data(__x)); + __x = __binary_op(__x, _M_make_simd<_Tp, _Np>(__vector_bitcast<_Tp>( + __vector_permute<1, 1>(__y)))); + return __x[0]; + } //}}} + else + { + static_assert(sizeof(__x) > __min_vector_size<_Tp>); + static_assert((_Np & (_Np - 1)) == 0); // _Np must be a power of 2 + using _Ap = simd_abi::deduce_t<_Tp, _Np / 2>; + using _V = simd<_Tp, _Ap>; + return _Ap::_SimdImpl::_S_reduce( + __binary_op(_V(__private_init, __extract<0, 2>(__as_vector(__x))), + _V(__private_init, + __extract<1, 2>(__as_vector(__x)))), + static_cast<_BinaryOperation&&>(__binary_op)); + } + } + + // math {{{2 + // frexp, modf and copysign implemented in simd_math.h +#define _GLIBCXX_SIMD_MATH_FALLBACK(__name) \ + template <typename _Tp, typename... _More> \ + static _Tp _S_##__name(const _Tp& __x, const _More&... __more) \ + { \ + return __generate_vector<_Tp>( \ + [&](auto __i) { return __name(__x[__i], __more[__i]...); }); \ + } + +#define _GLIBCXX_SIMD_MATH_FALLBACK_MASKRET(__name) \ + template <typename _Tp, typename... _More> \ + static typename _Tp::mask_type _S_##__name(const _Tp& __x, \ + const _More&... __more) \ + { \ + return __generate_vector<_Tp>( \ + [&](auto __i) { return __name(__x[__i], __more[__i]...); }); \ + } + +#define _GLIBCXX_SIMD_MATH_FALLBACK_FIXEDRET(_RetTp, __name) \ + template <typename _Tp, typename... _More> \ + static auto _S_##__name(const _Tp& __x, const _More&... 
__more) \ + { \ + return __fixed_size_storage_t<_RetTp, \ + _VectorTraits<_Tp>::_S_partial_width>:: \ + _S_generate([&](auto __meta) constexpr { \ + return __meta._S_generator( \ + [&](auto __i) { \ + return __name(__x[__meta._S_offset + __i], \ + __more[__meta._S_offset + __i]...); \ + }, \ + static_cast<_RetTp*>(nullptr)); \ + }); \ + } + + _GLIBCXX_SIMD_MATH_FALLBACK(acos) + _GLIBCXX_SIMD_MATH_FALLBACK(asin) + _GLIBCXX_SIMD_MATH_FALLBACK(atan) + _GLIBCXX_SIMD_MATH_FALLBACK(atan2) + _GLIBCXX_SIMD_MATH_FALLBACK(cos) + _GLIBCXX_SIMD_MATH_FALLBACK(sin) + _GLIBCXX_SIMD_MATH_FALLBACK(tan) + _GLIBCXX_SIMD_MATH_FALLBACK(acosh) + _GLIBCXX_SIMD_MATH_FALLBACK(asinh) + _GLIBCXX_SIMD_MATH_FALLBACK(atanh) + _GLIBCXX_SIMD_MATH_FALLBACK(cosh) + _GLIBCXX_SIMD_MATH_FALLBACK(sinh) + _GLIBCXX_SIMD_MATH_FALLBACK(tanh) + _GLIBCXX_SIMD_MATH_FALLBACK(exp) + _GLIBCXX_SIMD_MATH_FALLBACK(exp2) + _GLIBCXX_SIMD_MATH_FALLBACK(expm1) + _GLIBCXX_SIMD_MATH_FALLBACK(ldexp) + _GLIBCXX_SIMD_MATH_FALLBACK_FIXEDRET(int, ilogb) + _GLIBCXX_SIMD_MATH_FALLBACK(log) + _GLIBCXX_SIMD_MATH_FALLBACK(log10) + _GLIBCXX_SIMD_MATH_FALLBACK(log1p) + _GLIBCXX_SIMD_MATH_FALLBACK(log2) + _GLIBCXX_SIMD_MATH_FALLBACK(logb) + + // modf implemented in simd_math.h + _GLIBCXX_SIMD_MATH_FALLBACK(scalbn) + _GLIBCXX_SIMD_MATH_FALLBACK(scalbln) + _GLIBCXX_SIMD_MATH_FALLBACK(cbrt) + _GLIBCXX_SIMD_MATH_FALLBACK(fabs) + _GLIBCXX_SIMD_MATH_FALLBACK(pow) + _GLIBCXX_SIMD_MATH_FALLBACK(sqrt) + _GLIBCXX_SIMD_MATH_FALLBACK(erf) + _GLIBCXX_SIMD_MATH_FALLBACK(erfc) + _GLIBCXX_SIMD_MATH_FALLBACK(lgamma) + _GLIBCXX_SIMD_MATH_FALLBACK(tgamma) + + _GLIBCXX_SIMD_MATH_FALLBACK_FIXEDRET(long, lrint) + _GLIBCXX_SIMD_MATH_FALLBACK_FIXEDRET(long long, llrint) + + _GLIBCXX_SIMD_MATH_FALLBACK_FIXEDRET(long, lround) + _GLIBCXX_SIMD_MATH_FALLBACK_FIXEDRET(long long, llround) + + _GLIBCXX_SIMD_MATH_FALLBACK(fmod) + _GLIBCXX_SIMD_MATH_FALLBACK(remainder) + + template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + static _Tp + _S_remquo(const _Tp __x, const _Tp __y, + __fixed_size_storage_t<int, _TVT::_S_partial_width>* __z) + { + return __generate_vector<_Tp>([&](auto __i) { + int __tmp; + auto __r = remquo(__x[__i], __y[__i], &__tmp); + __z->_M_set(__i, __tmp); + return __r; + }); + } + + // copysign in simd_math.h + _GLIBCXX_SIMD_MATH_FALLBACK(nextafter) + _GLIBCXX_SIMD_MATH_FALLBACK(fdim) + _GLIBCXX_SIMD_MATH_FALLBACK(fmax) + _GLIBCXX_SIMD_MATH_FALLBACK(fmin) + _GLIBCXX_SIMD_MATH_FALLBACK(fma) + + template <typename _Tp, size_t _Np> + static constexpr _MaskMember<_Tp> + _S_isgreater(_SimdWrapper<_Tp, _Np> __x, + _SimdWrapper<_Tp, _Np> __y) noexcept + { + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __xn = __vector_bitcast<_Ip>(__x); + const auto __yn = __vector_bitcast<_Ip>(__y); + const auto __xp = __xn < 0 ? -(__xn & __finite_max_v<_Ip>) : __xn; + const auto __yp = __yn < 0 ? -(__yn & __finite_max_v<_Ip>) : __yn; + return __andnot(_SuperImpl::_S_isunordered(__x, __y)._M_data, + __xp > __yp); + } + + template <typename _Tp, size_t _Np> + static constexpr _MaskMember<_Tp> + _S_isgreaterequal(_SimdWrapper<_Tp, _Np> __x, + _SimdWrapper<_Tp, _Np> __y) noexcept + { + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __xn = __vector_bitcast<_Ip>(__x); + const auto __yn = __vector_bitcast<_Ip>(__y); + const auto __xp = __xn < 0 ? -(__xn & __finite_max_v<_Ip>) : __xn; + const auto __yp = __yn < 0 ? 
-(__yn & __finite_max_v<_Ip>) : __yn; + return __andnot(_SuperImpl::_S_isunordered(__x, __y)._M_data, + __xp >= __yp); + } + + template <typename _Tp, size_t _Np> + static constexpr _MaskMember<_Tp> + _S_isless(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) noexcept + { + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __xn = __vector_bitcast<_Ip>(__x); + const auto __yn = __vector_bitcast<_Ip>(__y); + const auto __xp = __xn < 0 ? -(__xn & __finite_max_v<_Ip>) : __xn; + const auto __yp = __yn < 0 ? -(__yn & __finite_max_v<_Ip>) : __yn; + return __andnot(_SuperImpl::_S_isunordered(__x, __y)._M_data, + __xp < __yp); + } + + template <typename _Tp, size_t _Np> + static constexpr _MaskMember<_Tp> + _S_islessequal(_SimdWrapper<_Tp, _Np> __x, + _SimdWrapper<_Tp, _Np> __y) noexcept + { + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __xn = __vector_bitcast<_Ip>(__x); + const auto __yn = __vector_bitcast<_Ip>(__y); + const auto __xp = __xn < 0 ? -(__xn & __finite_max_v<_Ip>) : __xn; + const auto __yp = __yn < 0 ? -(__yn & __finite_max_v<_Ip>) : __yn; + return __andnot(_SuperImpl::_S_isunordered(__x, __y)._M_data, + __xp <= __yp); + } + + template <typename _Tp, size_t _Np> + static constexpr _MaskMember<_Tp> + _S_islessgreater(_SimdWrapper<_Tp, _Np> __x, + _SimdWrapper<_Tp, _Np> __y) noexcept + { + return __andnot(_SuperImpl::_S_isunordered(__x, __y), + _SuperImpl::_S_not_equal_to(__x, __y)); + } + +#undef _GLIBCXX_SIMD_MATH_FALLBACK +#undef _GLIBCXX_SIMD_MATH_FALLBACK_MASKRET +#undef _GLIBCXX_SIMD_MATH_FALLBACK_FIXEDRET + // _S_abs {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_abs(_SimdWrapper<_Tp, _Np> __x) noexcept + { + // if (__builtin_is_constant_evaluated()) + // { + // return __x._M_data < 0 ? -__x._M_data : __x._M_data; + // } + if constexpr (is_floating_point_v<_Tp>) + // `v < 0 ? -v : v` cannot compile to the efficient implementation of + // masking off the sign bit, because it must consider v == -0 + + // ~(-0.) & v would be easy, but breaks with -fno-signed-zeros + return __and(_S_absmask<__vector_type_t<_Tp, _Np>>, __x._M_data); + else + return __x._M_data < 0 ? -__x._M_data : __x._M_data; + } + + // }}}3 + // _S_plus_minus {{{ + // Returns __x + __y - __y without -fassociative-math optimizing to __x. + // - _TV must be __vector_type_t<floating-point type, N>. + // - _UV must be _TV or floating-point type.
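// (Editorial sketch, not part of the patch: callers such as _S_nearbyint
// below depend on the rounding performed by the intermediate __x + __y, so
// it must not be folded away. A scalar rendition of the same idea, assuming
// float, |x| < 2^23, and round-to-nearest, with volatile standing in for
// the asm barrier used below:
//   float __nearbyint_sketch(float __x)
//   {
//     const float __shifter = 0x1p23f; // 2^23: ulp becomes 1 after the add
//     const float __s = __x < 0 ? -__shifter : __shifter;
//     volatile float __t = __x + __s;  // rounds __x to an integer value
//     return __t - __s;                // volatile keeps (x+s)-s from folding to x
//   }
// )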
+ template <typename _TV, typename _UV> + _GLIBCXX_SIMD_INTRINSIC static constexpr _TV _S_plus_minus(_TV __x, + _UV __y) noexcept + { + #if defined __i386__ && !defined __SSE_MATH__ + if constexpr (sizeof(__x) == 8) + { // operations on __x would use the FPU + static_assert(is_same_v<_TV, __vector_type_t<float, 2>>); + const auto __x4 = __vector_bitcast<float, 4>(__x); + if constexpr (is_same_v<_TV, _UV>) + return __vector_bitcast<float, 2>( + _S_plus_minus(__x4, __vector_bitcast<float, 4>(__y))); + else + return __vector_bitcast<float, 2>(_S_plus_minus(__x4, __y)); + } + #endif + #if !defined __clang__ && __GCC_IEC_559 == 0 + if (__builtin_is_constant_evaluated() + || (__builtin_constant_p(__x) && __builtin_constant_p(__y))) + return (__x + __y) - __y; + else + return [&] { + __x += __y; + if constexpr(__have_sse) + { + if constexpr (sizeof(__x) >= 16) + asm("" : "+x"(__x)); + else if constexpr (is_same_v<__vector_type_t<float, 2>, _TV>) + asm("" : "+x"(__x[0]), "+x"(__x[1])); + else + __assert_unreachable<_TV>(); + } + else if constexpr(__have_neon) + asm("" : "+w"(__x)); + else if constexpr (__have_power_vmx) + { + if constexpr (is_same_v<__vector_type_t<float, 2>, _TV>) + asm("" : "+fgr"(__x[0]), "+fgr"(__x[1])); + else + asm("" : "+v"(__x)); + } + else + asm("" : "+g"(__x)); + return __x - __y; + }(); + #else + return (__x + __y) - __y; + #endif + } + + // }}} + // _S_nearbyint {{{3 + template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_nearbyint(_Tp __x_) noexcept + { + using value_type = typename _TVT::value_type; + using _V = typename _TVT::type; + const _V __x = __x_; + const _V __absx = __and(__x, _S_absmask<_V>); + static_assert(__CHAR_BIT__ * sizeof(1ull) >= __digits_v<value_type>); + _GLIBCXX_SIMD_USE_CONSTEXPR _V __shifter_abs + = _V() + (1ull << (__digits_v<value_type> - 1)); + const _V __shifter = __or(__and(_S_signmask<_V>, __x), __shifter_abs); + const _V __shifted = _S_plus_minus(__x, __shifter); + return __absx < __shifter_abs ? __shifted : __x; + } + + // _S_rint {{{3 + template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_rint(_Tp __x) noexcept + { + return _SuperImpl::_S_nearbyint(__x); + } + + // _S_trunc {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_trunc(_SimdWrapper<_Tp, _Np> __x) + { + using _V = __vector_type_t<_Tp, _Np>; + const _V __absx = __and(__x._M_data, _S_absmask<_V>); + static_assert(__CHAR_BIT__ * sizeof(1ull) >= __digits_v<_Tp>); + constexpr _Tp __shifter = 1ull << (__digits_v<_Tp> - 1); + _V __truncated = _S_plus_minus(__absx, __shifter); + __truncated -= __truncated > __absx ? _V() + 1 : _V(); + return __absx < __shifter ? __or(__xor(__absx, __x._M_data), __truncated) + : __x._M_data; + } + + // _S_round {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_round(_SimdWrapper<_Tp, _Np> __x) + { + const auto __abs_x = _SuperImpl::_S_abs(__x); + const auto __t_abs = _SuperImpl::_S_trunc(__abs_x)._M_data; + const auto __r_abs // round(abs(x)) = + = __t_abs + (__abs_x._M_data - __t_abs >= _Tp(.5) ? 
_Tp(1) : 0); + return __or(__xor(__abs_x._M_data, __x._M_data), __r_abs); + } + + // _S_floor {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_floor(_SimdWrapper<_Tp, _Np> __x) + { + const auto __y = _SuperImpl::_S_trunc(__x)._M_data; + const auto __negative_input + = __vector_bitcast<_Tp>(__x._M_data < __vector_broadcast<_Np, _Tp>(0)); + const auto __mask + = __andnot(__vector_bitcast<_Tp>(__y == __x._M_data), __negative_input); + return __or(__andnot(__mask, __y), + __and(__mask, __y - __vector_broadcast<_Np, _Tp>(1))); + } + + // _S_ceil {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_ceil(_SimdWrapper<_Tp, _Np> __x) + { + const auto __y = _SuperImpl::_S_trunc(__x)._M_data; + const auto __negative_input + = __vector_bitcast<_Tp>(__x._M_data < __vector_broadcast<_Np, _Tp>(0)); + const auto __inv_mask + = __or(__vector_bitcast<_Tp>(__y == __x._M_data), __negative_input); + return __or(__and(__inv_mask, __y), + __andnot(__inv_mask, __y + __vector_broadcast<_Np, _Tp>(1))); + } + + // _S_isnan {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp> + _S_isnan([[maybe_unused]] _SimdWrapper<_Tp, _Np> __x) + { + #if __FINITE_MATH_ONLY__ + return {}; // false + #elif !defined __SUPPORT_SNAN__ + return ~(__x._M_data == __x._M_data); + #elif defined __STDC_IEC_559__ + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __absn = __vector_bitcast<_Ip>(_SuperImpl::_S_abs(__x)); + const auto __infn + = __vector_bitcast<_Ip>(__vector_broadcast<_Np>(__infinity_v<_Tp>)); + return __infn < __absn; + #else + #error "Not implemented: how to support SNaN but non-IEC559 floating-point?" + #endif + } + + // _S_isfinite {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp> + _S_isfinite([[maybe_unused]] _SimdWrapper<_Tp, _Np> __x) + { + #if __FINITE_MATH_ONLY__ + using _UV = typename _MaskMember<_Tp>::_BuiltinType; + _GLIBCXX_SIMD_USE_CONSTEXPR _UV __alltrue = ~_UV(); + return __alltrue; + #else + // if all exponent bits are set, __x is either inf or NaN + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __absn = __vector_bitcast<_Ip>(_SuperImpl::_S_abs(__x)); + const auto __maxn + = __vector_bitcast<_Ip>(__vector_broadcast<_Np>(__finite_max_v<_Tp>)); + return __absn <= __maxn; + #endif + } + + // _S_isunordered {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp> + _S_isunordered(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { + return __or(_S_isnan(__x), _S_isnan(__y)); + } + + // _S_signbit {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp> + _S_signbit(_SimdWrapper<_Tp, _Np> __x) + { + using _Ip = __int_for_sizeof_t<_Tp>; + return __vector_bitcast<_Ip>(__x) < 0; + // Arithmetic right shift (SRA) would also work (instead of compare), but + // 64-bit SRA isn't available on x86 before AVX512. And in general, + // compares are more likely to be efficient than SRA. 
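// (Editorial sketch, not part of the patch: with GNU vector extensions the
// two formulations are interchangeable for, e.g., 32-bit lanes:
//   __mask = __bits < 0;                                  // compare, as above
//   __mask = __bits >> (sizeof(int) * __CHAR_BIT__ - 1);  // arithmetic shift
// both yield all-ones lanes exactly where the sign bit is set.)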
+ } + + // _S_isinf {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp> + _S_isinf([[maybe_unused]] _SimdWrapper<_Tp, _Np> __x) + { + #if __FINITE_MATH_ONLY__ + return {}; // false + #else + return _SuperImpl::template _S_equal_to<_Tp, _Np>(_SuperImpl::_S_abs(__x), + __vector_broadcast<_Np>( + __infinity_v<_Tp>)); + // alternative: + // compare to inf using the corresponding integer type + /* + return + __vector_bitcast<_Tp>(__vector_bitcast<__int_for_sizeof_t<_Tp>>( + _S_abs(__x)._M_data) + == + __vector_bitcast<__int_for_sizeof_t<_Tp>>(__vector_broadcast<_Np>( + __infinity_v<_Tp>))); + */ + #endif + } + + // _S_isnormal {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp> + _S_isnormal(_SimdWrapper<_Tp, _Np> __x) + { + using _Ip = __int_for_sizeof_t<_Tp>; + const auto __absn = __vector_bitcast<_Ip>(_SuperImpl::_S_abs(__x)); + const auto __minn + = __vector_bitcast<_Ip>(__vector_broadcast<_Np>(__norm_min_v<_Tp>)); + #if __FINITE_MATH_ONLY__ + return __absn >= __minn; + #else + const auto __maxn + = __vector_bitcast<_Ip>(__vector_broadcast<_Np>(__finite_max_v<_Tp>)); + return __minn <= __absn && __absn <= __maxn; + #endif + } + + // _S_fpclassify {{{3 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static __fixed_size_storage_t<int, _Np> + _S_fpclassify(_SimdWrapper<_Tp, _Np> __x) + { + using _I = __int_for_sizeof_t<_Tp>; + const auto __xn + = __vector_bitcast<_I>(__to_intrin(_SuperImpl::_S_abs(__x))); + constexpr size_t _NI = sizeof(__xn) / sizeof(_I); + _GLIBCXX_SIMD_USE_CONSTEXPR auto __minn + = __vector_bitcast<_I>(__vector_broadcast<_NI>(__norm_min_v<_Tp>)); + _GLIBCXX_SIMD_USE_CONSTEXPR auto __infn + = __vector_bitcast<_I>(__vector_broadcast<_NI>(__infinity_v<_Tp>)); + + _GLIBCXX_SIMD_USE_CONSTEXPR auto __fp_normal + = __vector_broadcast<_NI, _I>(FP_NORMAL); + #if !__FINITE_MATH_ONLY__ + _GLIBCXX_SIMD_USE_CONSTEXPR auto __fp_nan + = __vector_broadcast<_NI, _I>(FP_NAN); + _GLIBCXX_SIMD_USE_CONSTEXPR auto __fp_infinite + = __vector_broadcast<_NI, _I>(FP_INFINITE); + #endif + #ifndef __FAST_MATH__ + _GLIBCXX_SIMD_USE_CONSTEXPR auto __fp_subnormal + = __vector_broadcast<_NI, _I>(FP_SUBNORMAL); + #endif + _GLIBCXX_SIMD_USE_CONSTEXPR auto __fp_zero + = __vector_broadcast<_NI, _I>(FP_ZERO); + + __vector_type_t<_I, _NI> + __tmp = __xn < __minn + #ifdef __FAST_MATH__ + ? __fp_zero + #else + ? (__xn == 0 ? __fp_zero : __fp_subnormal) + #endif + #if __FINITE_MATH_ONLY__ + : __fp_normal; + #else + : (__xn < __infn ? __fp_normal + : (__xn == __infn ? 
__fp_infinite : __fp_nan)); + #endif + + if constexpr (sizeof(_I) == sizeof(int)) + { + using _FixedInt = __fixed_size_storage_t<int, _Np>; + const auto __as_int = __vector_bitcast<int, _Np>(__tmp); + if constexpr (_FixedInt::_S_tuple_size == 1) + return {__as_int}; + else if constexpr (_FixedInt::_S_tuple_size == 2 + && is_same_v< + typename _FixedInt::_SecondType::_FirstAbi, + simd_abi::scalar>) + return {__extract<0, 2>(__as_int), __as_int[_Np - 1]}; + else if constexpr (_FixedInt::_S_tuple_size == 2) + return {__extract<0, 2>(__as_int), + __auto_bitcast(__extract<1, 2>(__as_int))}; + else + __assert_unreachable<_Tp>(); + } + else if constexpr (_Np == 2 && sizeof(_I) == 8 + && __fixed_size_storage_t<int, _Np>::_S_tuple_size == 2) + { + const auto __aslong = __vector_bitcast<_LLong>(__tmp); + return {int(__aslong[0]), {int(__aslong[1])}}; + } + #if _GLIBCXX_SIMD_X86INTRIN + else if constexpr (sizeof(_Tp) == 8 && sizeof(__tmp) == 32 + && __fixed_size_storage_t<int, _Np>::_S_tuple_size == 1) + return {_mm_packs_epi32(__to_intrin(__lo128(__tmp)), + __to_intrin(__hi128(__tmp)))}; + else if constexpr (sizeof(_Tp) == 8 && sizeof(__tmp) == 64 + && __fixed_size_storage_t<int, _Np>::_S_tuple_size == 1) + return {_mm512_cvtepi64_epi32(__to_intrin(__tmp))}; + #endif // _GLIBCXX_SIMD_X86INTRIN + else if constexpr (__fixed_size_storage_t<int, _Np>::_S_tuple_size == 1) + return {__call_with_subscripts<_Np>(__vector_bitcast<_LLong>(__tmp), + [](auto... __l) { + return __make_wrapper<int>(__l...); + })}; + else + __assert_unreachable<_Tp>(); + } + + // _S_increment & _S_decrement{{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_increment(_SimdWrapper<_Tp, _Np>& __x) + { __x = __x._M_data + 1; } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_decrement(_SimdWrapper<_Tp, _Np>& __x) + { __x = __x._M_data - 1; } + + // smart_reference access {{{2 + template <typename _Tp, size_t _Np, typename _Up> + _GLIBCXX_SIMD_INTRINSIC constexpr static void + _S_set(_SimdWrapper<_Tp, _Np>& __v, int __i, _Up&& __x) noexcept + { __v._M_set(__i, static_cast<_Up&&>(__x)); } + + // _S_masked_assign{{{2 + template <typename _Tp, typename _K, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_assign(_SimdWrapper<_K, _Np> __k, _SimdWrapper<_Tp, _Np>& __lhs, + __type_identity_t<_SimdWrapper<_Tp, _Np>> __rhs) + { + if (__k._M_is_constprop_none_of()) + return; + else if (__k._M_is_constprop_all_of()) + __lhs = __rhs; + else + __lhs = _CommonImpl::_S_blend(__k, __lhs, __rhs); + } + + template <typename _Tp, typename _K, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_assign(_SimdWrapper<_K, _Np> __k, _SimdWrapper<_Tp, _Np>& __lhs, + __type_identity_t<_Tp> __rhs) + { + if (__k._M_is_constprop_none_of()) + return; + else if (__k._M_is_constprop_all_of()) + __lhs = __vector_broadcast<_Np>(__rhs); + else if (__builtin_constant_p(__rhs) && __rhs == 0) + { + if constexpr (!is_same_v<bool, _K>) + // the __andnot optimization only makes sense if __k._M_data is a + // vector register + __lhs._M_data + = __andnot(__vector_bitcast<_Tp>(__k), __lhs._M_data); + else + // for AVX512/__mmask, a _mm512_maskz_mov is best + __lhs + = _CommonImpl::_S_blend(__k, __lhs, _SimdWrapper<_Tp, _Np>()); + } + else + __lhs = _CommonImpl::_S_blend(__k, __lhs, + _SimdWrapper<_Tp, _Np>( + __vector_broadcast<_Np>(__rhs))); + } + + // _S_masked_cassign {{{2 + template <typename _Op, typename _Tp, typename _K, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + 
_S_masked_cassign(const _SimdWrapper<_K, _Np> __k, + _SimdWrapper<_Tp, _Np>& __lhs, + const __type_identity_t<_SimdWrapper<_Tp, _Np>> __rhs, + _Op __op) + { + if (__k._M_is_constprop_none_of()) + return; + else if (__k._M_is_constprop_all_of()) + __lhs = __op(_SuperImpl{}, __lhs, __rhs); + else + __lhs = _CommonImpl::_S_blend(__k, __lhs, + __op(_SuperImpl{}, __lhs, __rhs)); + } + + template <typename _Op, typename _Tp, typename _K, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_cassign(const _SimdWrapper<_K, _Np> __k, + _SimdWrapper<_Tp, _Np>& __lhs, + const __type_identity_t<_Tp> __rhs, _Op __op) + { _S_masked_cassign(__k, __lhs, __vector_broadcast<_Np>(__rhs), __op); } + + // _S_masked_unary {{{2 + template <template <typename> class _Op, typename _Tp, typename _K, + size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_masked_unary(const _SimdWrapper<_K, _Np> __k, + const _SimdWrapper<_Tp, _Np> __v) + { + if (__k._M_is_constprop_none_of()) + return __v; + auto __vv = _M_make_simd(__v); + _Op<decltype(__vv)> __op; + if (__k._M_is_constprop_all_of()) + return __data(__op(__vv)); + else + return _CommonImpl::_S_blend(__k, __v, __data(__op(__vv))); + } + + //}}}2 + }; + +// _MaskImplBuiltinMixin {{{1 +struct _MaskImplBuiltinMixin +{ + template <typename _Tp> + using _TypeTag = _Tp*; + + // _S_to_maskvector {{{ + template <typename _Up, size_t _ToN = 1> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Up, _ToN> + _S_to_maskvector(bool __x) + { + static_assert(is_same_v<_Up, __int_for_sizeof_t<_Up>>); + return __x ? __vector_type_t<_Up, _ToN>{~_Up()} + : __vector_type_t<_Up, _ToN>{}; + } + + template <typename _Up, size_t _UpN = 0, size_t _Np, bool _Sanitized, + size_t _ToN = _UpN == 0 ? _Np : _UpN> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Up, _ToN> + _S_to_maskvector(_BitMask<_Np, _Sanitized> __x) + { + static_assert(is_same_v<_Up, __int_for_sizeof_t<_Up>>); + return __generate_vector<__vector_type_t<_Up, _ToN>>([&]( + auto __i) constexpr { + if constexpr (__i < _Np) + return __x[__i] ? ~_Up() : _Up(); + else + return _Up(); + }); + } + + template <typename _Up, size_t _UpN = 0, typename _Tp, size_t _Np, + size_t _ToN = _UpN == 0 ? 
_Np : _UpN> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Up, _ToN> + _S_to_maskvector(_SimdWrapper<_Tp, _Np> __x) + { + static_assert(is_same_v<_Up, __int_for_sizeof_t<_Up>>); + using _TW = _SimdWrapper<_Tp, _Np>; + using _UW = _SimdWrapper<_Up, _ToN>; + if constexpr (sizeof(_Up) == sizeof(_Tp) && sizeof(_TW) == sizeof(_UW)) + return __wrapper_bitcast<_Up, _ToN>(__x); + else if constexpr (is_same_v<_Tp, bool>) // bits -> vector + return _S_to_maskvector<_Up, _ToN>(_BitMask<_Np>(__x._M_data)); + else + { // vector -> vector + /* + [[maybe_unused]] const auto __y = __vector_bitcast<_Up>(__x._M_data); + if constexpr (sizeof(_Tp) == 8 && sizeof(_Up) == 4 && sizeof(__y) == + 16) return __vector_permute<1, 3, -1, -1>(__y); else if constexpr + (sizeof(_Tp) == 4 && sizeof(_Up) == 2 + && sizeof(__y) == 16) + return __vector_permute<1, 3, 5, 7, -1, -1, -1, -1>(__y); + else if constexpr (sizeof(_Tp) == 8 && sizeof(_Up) == 2 + && sizeof(__y) == 16) + return __vector_permute<3, 7, -1, -1, -1, -1, -1, -1>(__y); + else if constexpr (sizeof(_Tp) == 2 && sizeof(_Up) == 1 + && sizeof(__y) == 16) + return __vector_permute<1, 3, 5, 7, 9, 11, 13, 15, -1, -1, -1, -1, + -1, -1, -1, -1>(__y); else if constexpr (sizeof(_Tp) == 4 && + sizeof(_Up) == 1 + && sizeof(__y) == 16) + return __vector_permute<3, 7, 11, 15, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1>(__y); else if constexpr (sizeof(_Tp) == 8 && + sizeof(_Up) == 1 + && sizeof(__y) == 16) + return __vector_permute<7, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1>(__y); else + */ + { + return __generate_vector<__vector_type_t<_Up, _ToN>>([&]( + auto __i) constexpr { + if constexpr (__i < _Np) + return _Up(__x[__i.value]); + else + return _Up(); + }); + } + } + } + + // }}} + // _S_to_bits {{{ + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SanitizedBitMask<_Np> + _S_to_bits(_SimdWrapper<_Tp, _Np> __x) + { + static_assert(!is_same_v<_Tp, bool>); + static_assert(_Np <= __CHAR_BIT__ * sizeof(_ULLong)); + using _Up = make_unsigned_t<__int_for_sizeof_t<_Tp>>; + const auto __bools + = __vector_bitcast<_Up>(__x) >> (sizeof(_Up) * __CHAR_BIT__ - 1); + _ULLong __r = 0; + __execute_n_times<_Np>( + [&](auto __i) { __r |= _ULLong(__bools[__i.value]) << __i; }); + return __r; + } + + // }}} +}; + +// _MaskImplBuiltin {{{1 +template <typename _Abi> + struct _MaskImplBuiltin : _MaskImplBuiltinMixin + { + using _MaskImplBuiltinMixin::_S_to_bits; + using _MaskImplBuiltinMixin::_S_to_maskvector; + + // member types {{{ + template <typename _Tp> + using _SimdMember = typename _Abi::template __traits<_Tp>::_SimdMember; + + template <typename _Tp> + using _MaskMember = typename _Abi::template _MaskMember<_Tp>; + + using _SuperImpl = typename _Abi::_MaskImpl; + using _CommonImpl = typename _Abi::_CommonImpl; + + template <typename _Tp> + static constexpr size_t _S_size = simd_size_v<_Tp, _Abi>; + + // }}} + // _S_broadcast {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp> + _S_broadcast(bool __x) + { + return __x ? 
_Abi::template _S_implicit_mask<_Tp>() + : _MaskMember<_Tp>(); + } + + // }}} + // _S_load {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember<_Tp> + _S_load(const bool* __mem) + { + using _I = __int_for_sizeof_t<_Tp>; + if constexpr (sizeof(_Tp) == sizeof(bool)) + { + const auto __bools + = _CommonImpl::template _S_load<_I, _S_size<_Tp>>(__mem); + // bool is {0, 1}, everything else is UB + return __bools > 0; + } + else + return __generate_vector<_I, _S_size<_Tp>>([&](auto __i) constexpr { + return __mem[__i] ? ~_I() : _I(); + }); + } + + // }}} + // _S_convert {{{ + template <typename _Tp, size_t _Np, bool _Sanitized> + _GLIBCXX_SIMD_INTRINSIC static constexpr auto + _S_convert(_BitMask<_Np, _Sanitized> __x) + { + if constexpr (__is_builtin_bitmask_abi<_Abi>()) + return _SimdWrapper<bool, simd_size_v<_Tp, _Abi>>(__x._M_to_bits()); + else + return _SuperImpl::template _S_to_maskvector<__int_for_sizeof_t<_Tp>, + _S_size<_Tp>>( + __x._M_sanitized()); + } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr auto + _S_convert(_SimdWrapper<bool, _Np> __x) + { + if constexpr (__is_builtin_bitmask_abi<_Abi>()) + return _SimdWrapper<bool, simd_size_v<_Tp, _Abi>>(__x._M_data); + else + return _SuperImpl::template _S_to_maskvector<__int_for_sizeof_t<_Tp>, + _S_size<_Tp>>( + _BitMask<_Np>(__x._M_data)._M_sanitized()); + } + + template <typename _Tp, typename _Up, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr auto + _S_convert(_SimdWrapper<_Up, _Np> __x) + { + if constexpr (__is_builtin_bitmask_abi<_Abi>()) + return _SimdWrapper<bool, simd_size_v<_Tp, _Abi>>( + _SuperImpl::_S_to_bits(__x)); + else + return _SuperImpl::template _S_to_maskvector<__int_for_sizeof_t<_Tp>, + _S_size<_Tp>>(__x); + } + + template <typename _Tp, typename _Up, typename _UAbi> + _GLIBCXX_SIMD_INTRINSIC static constexpr auto + _S_convert(simd_mask<_Up, _UAbi> __x) + { + if constexpr (__is_builtin_bitmask_abi<_Abi>()) + { + using _R = _SimdWrapper<bool, simd_size_v<_Tp, _Abi>>; + if constexpr (__is_builtin_bitmask_abi<_UAbi>()) // bits -> bits + return _R(__data(__x)); + else if constexpr (__is_scalar_abi<_UAbi>()) // bool -> bits + return _R(__data(__x)); + else if constexpr (__is_fixed_size_abi_v<_UAbi>) // bitset -> bits + return _R(__data(__x)._M_to_bits()); + else // vector -> bits + return _R(_UAbi::_MaskImpl::_S_to_bits(__data(__x))._M_to_bits()); + } + else + return _SuperImpl::template _S_to_maskvector<__int_for_sizeof_t<_Tp>, + _S_size<_Tp>>( + __data(__x)); + } + + // }}} + // _S_masked_load {{{2 + template <typename _Tp, size_t _Np> + static inline _SimdWrapper<_Tp, _Np> + _S_masked_load(_SimdWrapper<_Tp, _Np> __merge, + _SimdWrapper<_Tp, _Np> __mask, const bool* __mem) noexcept + { + // AVX(2) has 32/64 bit maskload, but nothing at 8 bit granularity + auto __tmp = __wrapper_bitcast<__int_for_sizeof_t<_Tp>>(__merge); + _BitOps::_S_bit_iteration(_SuperImpl::_S_to_bits(__mask), + [&](auto __i) { + __tmp._M_set(__i, -__mem[__i]); + }); + __merge = __wrapper_bitcast<_Tp>(__tmp); + return __merge; + } + + // _S_store {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void _S_store(_SimdWrapper<_Tp, _Np> __v, + bool* __mem) noexcept + { + __execute_n_times<_Np>([&](auto __i) constexpr { + __mem[__i] = __v[__i]; + }); + } + + // _S_masked_store {{{2 + template <typename _Tp, size_t _Np> + static inline void + _S_masked_store(const _SimdWrapper<_Tp, _Np> __v, bool* __mem, + const _SimdWrapper<_Tp, _Np> __k) noexcept + { + 
_BitOps::_S_bit_iteration( + _SuperImpl::_S_to_bits(__k), [&](auto __i) constexpr { + __mem[__i] = __v[__i]; + }); + } + + // _S_from_bitmask{{{2 + template <size_t _Np, typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember<_Tp> + _S_from_bitmask(_SanitizedBitMask<_Np> __bits, _TypeTag<_Tp>) + { + return _SuperImpl::template _S_to_maskvector<_Tp, _S_size<_Tp>>(__bits); + } + + // logical and bitwise operators {{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_logical_and(const _SimdWrapper<_Tp, _Np>& __x, + const _SimdWrapper<_Tp, _Np>& __y) + { return __and(__x._M_data, __y._M_data); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_logical_or(const _SimdWrapper<_Tp, _Np>& __x, + const _SimdWrapper<_Tp, _Np>& __y) + { return __or(__x._M_data, __y._M_data); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_not(const _SimdWrapper<_Tp, _Np>& __x) + { + if constexpr (_Abi::template _S_is_partial<_Tp>) + return __andnot(__x, __wrapper_bitcast<_Tp>( + _Abi::template _S_implicit_mask<_Tp>())); + else + return __not(__x._M_data); + } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_and(const _SimdWrapper<_Tp, _Np>& __x, + const _SimdWrapper<_Tp, _Np>& __y) + { return __and(__x._M_data, __y._M_data); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_or(const _SimdWrapper<_Tp, _Np>& __x, + const _SimdWrapper<_Tp, _Np>& __y) + { return __or(__x._M_data, __y._M_data); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_xor(const _SimdWrapper<_Tp, _Np>& __x, + const _SimdWrapper<_Tp, _Np>& __y) + { return __xor(__x._M_data, __y._M_data); } + + // smart_reference access {{{2 + template <typename _Tp, size_t _Np> + static constexpr void _S_set(_SimdWrapper<_Tp, _Np>& __k, int __i, + bool __x) noexcept + { + if constexpr (is_same_v<_Tp, bool>) + __k._M_set(__i, __x); + else + { + static_assert(is_same_v<_Tp, __int_for_sizeof_t<_Tp>>); + if (__builtin_is_constant_evaluated()) + { + __k = __generate_from_n_evaluations<_Np, + __vector_type_t<_Tp, _Np>>( + [&](auto __j) { + if (__i == __j) + return _Tp(-__x); + else + return __k[+__j]; + }); + } + else + __k._M_data[__i] = -__x; + } + } + + // _S_masked_assign{{{2 + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_assign(_SimdWrapper<_Tp, _Np> __k, + _SimdWrapper<_Tp, _Np>& __lhs, + __type_identity_t<_SimdWrapper<_Tp, _Np>> __rhs) + { __lhs = _CommonImpl::_S_blend(__k, __lhs, __rhs); } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_assign(_SimdWrapper<_Tp, _Np> __k, + _SimdWrapper<_Tp, _Np>& __lhs, bool __rhs) + { + if (__builtin_constant_p(__rhs)) + { + if (__rhs == false) + __lhs = __andnot(__k, __lhs); + else + __lhs = __or(__k, __lhs); + return; + } + __lhs = _CommonImpl::_S_blend(__k, __lhs, + __data(simd_mask<_Tp, _Abi>(__rhs))); + } + + //}}}2 + // _S_all_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool + _S_all_of(simd_mask<_Tp, _Abi> __k) + { + return __call_with_subscripts( + __data(__k), make_index_sequence<_S_size<_Tp>>(), + [](const auto... __ent) constexpr { return (... 
&& !(__ent == 0)); }); + } + + // }}} + // _S_any_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool + _S_any_of(simd_mask<_Tp, _Abi> __k) + { + return __call_with_subscripts( + __data(__k), make_index_sequence<_S_size<_Tp>>(), + [](const auto... __ent) constexpr { return (... || !(__ent == 0)); }); + } + + // }}} + // _S_none_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool + _S_none_of(simd_mask<_Tp, _Abi> __k) + { + return __call_with_subscripts( + __data(__k), make_index_sequence<_S_size<_Tp>>(), + [](const auto... __ent) constexpr { return (... && (__ent == 0)); }); + } + + // }}} + // _S_some_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool + _S_some_of(simd_mask<_Tp, _Abi> __k) + { + const int __n_true = _S_popcount(__k); + return __n_true > 0 && __n_true < int(_S_size<_Tp>); + } + + // }}} + // _S_popcount {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int + _S_popcount(simd_mask<_Tp, _Abi> __k) + { + using _I = __int_for_sizeof_t<_Tp>; + if constexpr (is_default_constructible_v<simd<_I, _Abi>>) + return -reduce( + simd<_I, _Abi>(__private_init, __wrapper_bitcast<_I>(__data(__k)))); + else + return -reduce(__bit_cast<rebind_simd_t<_I, simd<_Tp, _Abi>>>( + simd<_Tp, _Abi>(__private_init, __data(__k)))); + } + + // }}} + // _S_find_first_set {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int + _S_find_first_set(simd_mask<_Tp, _Abi> __k) + { + return std::__countr_zero( + _SuperImpl::_S_to_bits(__data(__k))._M_to_bits()); + } + + // }}} + // _S_find_last_set {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int + _S_find_last_set(simd_mask<_Tp, _Abi> __k) + { + return std::__bit_width( + _SuperImpl::_S_to_bits(__data(__k))._M_to_bits()) - 1; + } + + // }}} + }; + +//}}}1 +_GLIBCXX_SIMD_END_NAMESPACE +#endif // __cplusplus >= 201703L +#endif // _GLIBCXX_EXPERIMENTAL_SIMD_ABIS_H_ + +// vim: foldmethod=marker foldmarker={{{,}}} sw=2 noet ts=8 sts=2 tw=80 diff --git a/libstdc++-v3/include/experimental/bits/simd_converter.h b/libstdc++-v3/include/experimental/bits/simd_converter.h new file mode 100644 index 00000000000..dc4598743f9 --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd_converter.h @@ -0,0 +1,354 @@ +// Generic simd conversions -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. + +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. 
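// (Editorial note, not part of the patch: each _SimdConverter specialization
// in this file is a function object that maps the storage of one
// (value type, ABI) pair to another. Conceptually, for the scalar -> scalar
// case defined first below:
//   _SimdConverter<int, simd_abi::scalar, float, simd_abi::scalar> __cvt;
//   float __f = __cvt(42); // static_cast<float>(42)
// )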
+ +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_CONVERTER_H_ +#define _GLIBCXX_EXPERIMENTAL_SIMD_CONVERTER_H_ + +#if __cplusplus >= 201703L + +_GLIBCXX_SIMD_BEGIN_NAMESPACE +// _SimdConverter scalar -> scalar {{{ +template <typename _From, typename _To> + struct _SimdConverter<_From, simd_abi::scalar, _To, simd_abi::scalar, + enable_if_t<!is_same_v<_From, _To>>> + { + _GLIBCXX_SIMD_INTRINSIC constexpr _To operator()(_From __a) const noexcept + { return static_cast<_To>(__a); } + }; + +// }}} +// _SimdConverter scalar -> "native" {{{ +template <typename _From, typename _To, typename _Abi> + struct _SimdConverter<_From, simd_abi::scalar, _To, _Abi, + enable_if_t<!is_same_v<_Abi, simd_abi::scalar>>> + { + using _Ret = typename _Abi::template __traits<_To>::_SimdMember; + + template <typename... _More> + _GLIBCXX_SIMD_INTRINSIC constexpr _Ret + operator()(_From __a, _More... __more) const noexcept + { + static_assert(sizeof...(_More) + 1 == _Abi::template _S_size<_To>); + static_assert(conjunction_v<is_same<_From, _More>...>); + return __make_vector<_To>(__a, __more...); + } + }; + +// }}} +// _SimdConverter "native 1" -> "native 2" {{{ +template <typename _From, typename _To, typename _AFrom, typename _ATo> + struct _SimdConverter< + _From, _AFrom, _To, _ATo, + enable_if_t<!disjunction_v< + __is_fixed_size_abi<_AFrom>, __is_fixed_size_abi<_ATo>, + is_same<_AFrom, simd_abi::scalar>, is_same<_ATo, simd_abi::scalar>, + conjunction<is_same<_From, _To>, is_same<_AFrom, _ATo>>>>> + { + using _Arg = typename _AFrom::template __traits<_From>::_SimdMember; + using _Ret = typename _ATo::template __traits<_To>::_SimdMember; + using _V = __vector_type_t<_To, simd_size_v<_To, _ATo>>; + + template <typename... _More> + _GLIBCXX_SIMD_INTRINSIC constexpr _Ret + operator()(_Arg __a, _More... __more) const noexcept + { return __vector_convert<_V>(__a, __more...); } + }; + +// }}} +// _SimdConverter scalar -> fixed_size<1> {{{1 +template <typename _From, typename _To> + struct _SimdConverter<_From, simd_abi::scalar, _To, simd_abi::fixed_size<1>, + void> + { + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple<_To, simd_abi::scalar> + operator()(_From __x) const noexcept + { return {static_cast<_To>(__x)}; } + }; + +// _SimdConverter fixed_size<1> -> scalar {{{1 +template <typename _From, typename _To> + struct _SimdConverter<_From, simd_abi::fixed_size<1>, _To, simd_abi::scalar, + void> + { + _GLIBCXX_SIMD_INTRINSIC constexpr _To + operator()(_SimdTuple<_From, simd_abi::scalar> __x) const noexcept + { return {static_cast<_To>(__x.first)}; } + }; + +// _SimdConverter fixed_size<_Np> -> fixed_size<_Np> {{{1 +template <typename _From, typename _To, int _Np> + struct _SimdConverter<_From, simd_abi::fixed_size<_Np>, _To, + simd_abi::fixed_size<_Np>, + enable_if_t<!is_same_v<_From, _To>>> + { + using _Ret = __fixed_size_storage_t<_To, _Np>; + using _Arg = __fixed_size_storage_t<_From, _Np>; + + _GLIBCXX_SIMD_INTRINSIC constexpr _Ret + operator()(const _Arg& __x) const noexcept + { + if constexpr (is_same_v<_From, _To>) + return __x; + + // special case (optimize) int signedness casts + else if constexpr (sizeof(_From) == sizeof(_To) + && is_integral_v<_From> && is_integral_v<_To>) + return __bit_cast<_Ret>(__x); + + // special case if all ABI tags in _Ret are scalar + else if constexpr (__is_scalar_abi<typename _Ret::_FirstAbi>()) + { + return __call_with_subscripts( + __x, make_index_sequence<_Np>(), + [](auto... 
__values) constexpr->_Ret { + return __make_simd_tuple<_To, decltype((void) __values, + simd_abi::scalar())...>( + static_cast<_To>(__values)...); + }); + } + + // from one vector to one vector + else if constexpr (_Arg::_S_first_size == _Ret::_S_first_size) + { + _SimdConverter<_From, typename _Arg::_FirstAbi, _To, + typename _Ret::_FirstAbi> + __native_cvt; + if constexpr (_Arg::_S_tuple_size == 1) + return {__native_cvt(__x.first)}; + else + { + constexpr size_t _NRemain = _Np - _Arg::_S_first_size; + _SimdConverter<_From, simd_abi::fixed_size<_NRemain>, _To, + simd_abi::fixed_size<_NRemain>> + __remainder_cvt; + return {__native_cvt(__x.first), __remainder_cvt(__x.second)}; + } + } + + // from one vector to multiple vectors + else if constexpr (_Arg::_S_first_size > _Ret::_S_first_size) + { + const auto __multiple_return_chunks + = __convert_all<__vector_type_t<_To, _Ret::_S_first_size>>( + __x.first); + constexpr auto __converted = __multiple_return_chunks.size() + * _Ret::_FirstAbi::template _S_size<_To>; + constexpr auto __remaining = _Np - __converted; + if constexpr (_Arg::_S_tuple_size == 1 && __remaining == 0) + return __to_simd_tuple<_To, _Np>(__multiple_return_chunks); + else if constexpr (_Arg::_S_tuple_size == 1) + { // e.g. <int, 3> -> <double, 2, 1> or <short, 7> -> <double, 4, 2, + // 1> + using _RetRem + = __remove_cvref_t<decltype(__simd_tuple_pop_front<__converted>( + _Ret()))>; + const auto __return_chunks2 + = __convert_all<__vector_type_t<_To, _RetRem::_S_first_size>, 0, + __converted>(__x.first); + constexpr auto __converted2 + = __converted + + __return_chunks2.size() * _RetRem::_S_first_size; + if constexpr (__converted2 == _Np) + return __to_simd_tuple<_To, _Np>(__multiple_return_chunks, + __return_chunks2); + else + { + using _RetRem2 = __remove_cvref_t< + decltype(__simd_tuple_pop_front<__return_chunks2.size() + * _RetRem::_S_first_size>( + _RetRem()))>; + const auto __return_chunks3 = __convert_all< + __vector_type_t<_To, _RetRem2::_S_first_size>, 0, + __converted2>(__x.first); + constexpr auto __converted3 + = __converted2 + + __return_chunks3.size() * _RetRem2::_S_first_size; + if constexpr (__converted3 == _Np) + return __to_simd_tuple<_To, _Np>(__multiple_return_chunks, + __return_chunks2, + __return_chunks3); + else + { + using _RetRem3 + = __remove_cvref_t<decltype(__simd_tuple_pop_front< + __return_chunks3.size() + * _RetRem2::_S_first_size>( + _RetRem2()))>; + const auto __return_chunks4 = __convert_all< + __vector_type_t<_To, _RetRem3::_S_first_size>, 0, + __converted3>(__x.first); + constexpr auto __converted4 + = __converted3 + + __return_chunks4.size() * _RetRem3::_S_first_size; + if constexpr (__converted4 == _Np) + return __to_simd_tuple<_To, _Np>( + __multiple_return_chunks, __return_chunks2, + __return_chunks3, __return_chunks4); + else + __assert_unreachable<_To>(); + } + } + } + else + { + constexpr size_t _NRemain = _Np - _Arg::_S_first_size; + _SimdConverter<_From, simd_abi::fixed_size<_NRemain>, _To, + simd_abi::fixed_size<_NRemain>> + __remainder_cvt; + return __simd_tuple_concat( + __to_simd_tuple<_To, _Arg::_S_first_size>( + __multiple_return_chunks), + __remainder_cvt(__x.second)); + } + } + + // from multiple vectors to one vector + // _Arg::_S_first_size < _Ret::_S_first_size + // a) heterogeneous input at the end of the tuple (possible with partial + // native registers in _Ret) + else if constexpr (_Ret::_S_tuple_size == 1 + && _Np % _Arg::_S_first_size != 0) + { + static_assert(_Ret::_FirstAbi::template _S_is_partial<_To>); + 
return _Ret{__generate_from_n_evaluations< + _Np, typename _VectorTraits<typename _Ret::_FirstType>::type>( + [&](auto __i) { return static_cast<_To>(__x[__i]); })}; + } + else + { + static_assert(_Arg::_S_tuple_size > 1); + constexpr auto __n + = __div_roundup(_Ret::_S_first_size, _Arg::_S_first_size); + return __call_with_n_evaluations<__n>( + [&__x](auto... __uncvted) { + // assuming _Arg Abi tags for all __i are _Arg::_FirstAbi + _SimdConverter<_From, typename _Arg::_FirstAbi, _To, + typename _Ret::_FirstAbi> + __native_cvt; + if constexpr (_Ret::_S_tuple_size == 1) + return _Ret{__native_cvt(__uncvted...)}; + else + return _Ret{ + __native_cvt(__uncvted...), + _SimdConverter< + _From, simd_abi::fixed_size<_Np - _Ret::_S_first_size>, _To, + simd_abi::fixed_size<_Np - _Ret::_S_first_size>>()( + __simd_tuple_pop_front<_Ret::_S_first_size>(__x))}; + }, + [&__x](auto __i) { return __get_tuple_at<__i>(__x); }); + } + } + }; + +// _SimdConverter "native" -> fixed_size<_Np> {{{1 +// i.e. 1 register to ? registers +template <typename _From, typename _Ap, typename _To, int _Np> + struct _SimdConverter<_From, _Ap, _To, simd_abi::fixed_size<_Np>, + enable_if_t<!__is_fixed_size_abi_v<_Ap>>> + { + static_assert( + _Np == simd_size_v<_From, _Ap>, + "_SimdConverter to fixed_size only works for equal element counts"); + + using _Ret = __fixed_size_storage_t<_To, _Np>; + + _GLIBCXX_SIMD_INTRINSIC constexpr _Ret + operator()(typename _SimdTraits<_From, _Ap>::_SimdMember __x) const noexcept + { + if constexpr (_Ret::_S_tuple_size == 1) + return {__vector_convert<typename _Ret::_FirstType::_BuiltinType>(__x)}; + else + { + using _FixedNp = simd_abi::fixed_size<_Np>; + _SimdConverter<_From, _FixedNp, _To, _FixedNp> __fixed_cvt; + using _FromFixedStorage = __fixed_size_storage_t<_From, _Np>; + if constexpr (_FromFixedStorage::_S_tuple_size == 1) + return __fixed_cvt(_FromFixedStorage{__x}); + else if constexpr (_FromFixedStorage::_S_tuple_size == 2) + { + _FromFixedStorage __tmp; + static_assert(sizeof(__tmp) <= sizeof(__x)); + __builtin_memcpy(&__tmp.first, &__x, sizeof(__tmp.first)); + __builtin_memcpy(&__tmp.second.first, + reinterpret_cast<const char*>(&__x) + + sizeof(__tmp.first), + sizeof(__tmp.second.first)); + return __fixed_cvt(__tmp); + } + else + __assert_unreachable<_From>(); + } + } + }; + +// _SimdConverter fixed_size<_Np> -> "native" {{{1 +// i.e. ? registers to 1 register +template <typename _From, int _Np, typename _To, typename _Ap> + struct _SimdConverter<_From, simd_abi::fixed_size<_Np>, _To, _Ap, + enable_if_t<!__is_fixed_size_abi_v<_Ap>>> + { + static_assert( + _Np == simd_size_v<_To, _Ap>, + "_SimdConverter from fixed_size only works for equal element counts"); + + using _Arg = __fixed_size_storage_t<_From, _Np>; + + _GLIBCXX_SIMD_INTRINSIC constexpr + typename _SimdTraits<_To, _Ap>::_SimdMember + operator()(_Arg __x) const noexcept + { + if constexpr (_Arg::_S_tuple_size == 1) + return __vector_convert<__vector_type_t<_To, _Np>>(__x.first); + else if constexpr (_Arg::_S_is_homogeneous) + return __call_with_n_evaluations<_Arg::_S_tuple_size>( + [](auto... 
__members) { + if constexpr ((is_convertible_v<decltype(__members), _To> && ...)) + return __vector_type_t<_To, _Np>{static_cast<_To>(__members)...}; + else + return __vector_convert<__vector_type_t<_To, _Np>>(__members...); + }, + [&](auto __i) { return __get_tuple_at<__i>(__x); }); + else if constexpr (__fixed_size_storage_t<_To, _Np>::_S_tuple_size == 1) + { + _SimdConverter<_From, simd_abi::fixed_size<_Np>, _To, + simd_abi::fixed_size<_Np>> + __fixed_cvt; + return __fixed_cvt(__x).first; + } + else + { + const _SimdWrapper<_From, _Np> __xv + = __generate_from_n_evaluations<_Np, __vector_type_t<_From, _Np>>( + [&](auto __i) { return __x[__i]; }); + return __vector_convert<__vector_type_t<_To, _Np>>(__xv); + } + } + }; + +// }}}1 +_GLIBCXX_SIMD_END_NAMESPACE +#endif // __cplusplus >= 201703L +#endif // _GLIBCXX_EXPERIMENTAL_SIMD_CONVERTER_H_ + +// vim: foldmethod=marker sw=2 noet ts=8 sts=2 tw=80 diff --git a/libstdc++-v3/include/experimental/bits/simd_detail.h b/libstdc++-v3/include/experimental/bits/simd_detail.h new file mode 100644 index 00000000000..a49a9d88b7f --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd_detail.h @@ -0,0 +1,306 @@ +// Internal macros for the simd implementation -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. + +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. + +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_DETAIL_H_ +#define _GLIBCXX_EXPERIMENTAL_SIMD_DETAIL_H_ + +#if __cplusplus >= 201703L + +#include <cstddef> +#include <cstdint> + + +#define _GLIBCXX_SIMD_BEGIN_NAMESPACE \ + namespace std _GLIBCXX_VISIBILITY(default) \ + { \ + _GLIBCXX_BEGIN_NAMESPACE_VERSION \ + namespace experimental { \ + inline namespace parallelism_v2 { +#define _GLIBCXX_SIMD_END_NAMESPACE \ + } \ + } \ + _GLIBCXX_END_NAMESPACE_VERSION \ + } + +// ISA extension detection. 
The following defines all the _GLIBCXX_SIMD_HAVE_XXX +// macros ARM{{{ +#if defined __ARM_NEON +#define _GLIBCXX_SIMD_HAVE_NEON 1 +#else +#define _GLIBCXX_SIMD_HAVE_NEON 0 +#endif +#if defined __ARM_NEON && (__ARM_ARCH >= 8 || defined __aarch64__) +#define _GLIBCXX_SIMD_HAVE_NEON_A32 1 +#else +#define _GLIBCXX_SIMD_HAVE_NEON_A32 0 +#endif +#if defined __ARM_NEON && defined __aarch64__ +#define _GLIBCXX_SIMD_HAVE_NEON_A64 1 +#else +#define _GLIBCXX_SIMD_HAVE_NEON_A64 0 +#endif +//}}} +// x86{{{ +#ifdef __MMX__ +#define _GLIBCXX_SIMD_HAVE_MMX 1 +#else +#define _GLIBCXX_SIMD_HAVE_MMX 0 +#endif +#if defined __SSE__ || defined __x86_64__ +#define _GLIBCXX_SIMD_HAVE_SSE 1 +#else +#define _GLIBCXX_SIMD_HAVE_SSE 0 +#endif +#if defined __SSE2__ || defined __x86_64__ +#define _GLIBCXX_SIMD_HAVE_SSE2 1 +#else +#define _GLIBCXX_SIMD_HAVE_SSE2 0 +#endif +#ifdef __SSE3__ +#define _GLIBCXX_SIMD_HAVE_SSE3 1 +#else +#define _GLIBCXX_SIMD_HAVE_SSE3 0 +#endif +#ifdef __SSSE3__ +#define _GLIBCXX_SIMD_HAVE_SSSE3 1 +#else +#define _GLIBCXX_SIMD_HAVE_SSSE3 0 +#endif +#ifdef __SSE4_1__ +#define _GLIBCXX_SIMD_HAVE_SSE4_1 1 +#else +#define _GLIBCXX_SIMD_HAVE_SSE4_1 0 +#endif +#ifdef __SSE4_2__ +#define _GLIBCXX_SIMD_HAVE_SSE4_2 1 +#else +#define _GLIBCXX_SIMD_HAVE_SSE4_2 0 +#endif +#ifdef __XOP__ +#define _GLIBCXX_SIMD_HAVE_XOP 1 +#else +#define _GLIBCXX_SIMD_HAVE_XOP 0 +#endif +#ifdef __AVX__ +#define _GLIBCXX_SIMD_HAVE_AVX 1 +#else +#define _GLIBCXX_SIMD_HAVE_AVX 0 +#endif +#ifdef __AVX2__ +#define _GLIBCXX_SIMD_HAVE_AVX2 1 +#else +#define _GLIBCXX_SIMD_HAVE_AVX2 0 +#endif +#ifdef __BMI__ +#define _GLIBCXX_SIMD_HAVE_BMI1 1 +#else +#define _GLIBCXX_SIMD_HAVE_BMI1 0 +#endif +#ifdef __BMI2__ +#define _GLIBCXX_SIMD_HAVE_BMI2 1 +#else +#define _GLIBCXX_SIMD_HAVE_BMI2 0 +#endif +#ifdef __LZCNT__ +#define _GLIBCXX_SIMD_HAVE_LZCNT 1 +#else +#define _GLIBCXX_SIMD_HAVE_LZCNT 0 +#endif +#ifdef __SSE4A__ +#define _GLIBCXX_SIMD_HAVE_SSE4A 1 +#else +#define _GLIBCXX_SIMD_HAVE_SSE4A 0 +#endif +#ifdef __FMA__ +#define _GLIBCXX_SIMD_HAVE_FMA 1 +#else +#define _GLIBCXX_SIMD_HAVE_FMA 0 +#endif +#ifdef __FMA4__ +#define _GLIBCXX_SIMD_HAVE_FMA4 1 +#else +#define _GLIBCXX_SIMD_HAVE_FMA4 0 +#endif +#ifdef __F16C__ +#define _GLIBCXX_SIMD_HAVE_F16C 1 +#else +#define _GLIBCXX_SIMD_HAVE_F16C 0 +#endif +#ifdef __POPCNT__ +#define _GLIBCXX_SIMD_HAVE_POPCNT 1 +#else +#define _GLIBCXX_SIMD_HAVE_POPCNT 0 +#endif +#ifdef __AVX512F__ +#define _GLIBCXX_SIMD_HAVE_AVX512F 1 +#else +#define _GLIBCXX_SIMD_HAVE_AVX512F 0 +#endif +#ifdef __AVX512DQ__ +#define _GLIBCXX_SIMD_HAVE_AVX512DQ 1 +#else +#define _GLIBCXX_SIMD_HAVE_AVX512DQ 0 +#endif +#ifdef __AVX512VL__ +#define _GLIBCXX_SIMD_HAVE_AVX512VL 1 +#else +#define _GLIBCXX_SIMD_HAVE_AVX512VL 0 +#endif +#ifdef __AVX512BW__ +#define _GLIBCXX_SIMD_HAVE_AVX512BW 1 +#else +#define _GLIBCXX_SIMD_HAVE_AVX512BW 0 +#endif + +#if _GLIBCXX_SIMD_HAVE_SSE +#define _GLIBCXX_SIMD_HAVE_SSE_ABI 1 +#else +#define _GLIBCXX_SIMD_HAVE_SSE_ABI 0 +#endif +#if _GLIBCXX_SIMD_HAVE_SSE2 +#define _GLIBCXX_SIMD_HAVE_FULL_SSE_ABI 1 +#else +#define _GLIBCXX_SIMD_HAVE_FULL_SSE_ABI 0 +#endif + +#if _GLIBCXX_SIMD_HAVE_AVX +#define _GLIBCXX_SIMD_HAVE_AVX_ABI 1 +#else +#define _GLIBCXX_SIMD_HAVE_AVX_ABI 0 +#endif +#if _GLIBCXX_SIMD_HAVE_AVX2 +#define _GLIBCXX_SIMD_HAVE_FULL_AVX_ABI 1 +#else +#define _GLIBCXX_SIMD_HAVE_FULL_AVX_ABI 0 +#endif + +#if _GLIBCXX_SIMD_HAVE_AVX512F +#define _GLIBCXX_SIMD_HAVE_AVX512_ABI 1 +#else +#define _GLIBCXX_SIMD_HAVE_AVX512_ABI 0 +#endif +#if _GLIBCXX_SIMD_HAVE_AVX512BW +#define 
_GLIBCXX_SIMD_HAVE_FULL_AVX512_ABI 1 +#else +#define _GLIBCXX_SIMD_HAVE_FULL_AVX512_ABI 0 +#endif + +#if defined __x86_64__ && !_GLIBCXX_SIMD_HAVE_SSE2 +#error "Use of SSE2 is required on AMD64" +#endif +//}}} + +#ifdef __clang__ +#define _GLIBCXX_SIMD_NORMAL_MATH +#else +#define _GLIBCXX_SIMD_NORMAL_MATH \ + [[__gnu__::__optimize__("finite-math-only,no-signed-zeros")]] +#endif +#define _GLIBCXX_SIMD_NEVER_INLINE [[__gnu__::__noinline__]] +#define _GLIBCXX_SIMD_INTRINSIC \ + [[__gnu__::__always_inline__, __gnu__::__artificial__]] inline +#define _GLIBCXX_SIMD_ALWAYS_INLINE [[__gnu__::__always_inline__]] inline +#define _GLIBCXX_SIMD_IS_UNLIKELY(__x) __builtin_expect(__x, 0) +#define _GLIBCXX_SIMD_IS_LIKELY(__x) __builtin_expect(__x, 1) + +#if defined __STRICT_ANSI__ && __STRICT_ANSI__ +#define _GLIBCXX_SIMD_CONSTEXPR +#define _GLIBCXX_SIMD_USE_CONSTEXPR_API const +#else +#define _GLIBCXX_SIMD_CONSTEXPR constexpr +#define _GLIBCXX_SIMD_USE_CONSTEXPR_API constexpr +#endif + +#if defined __clang__ +#define _GLIBCXX_SIMD_USE_CONSTEXPR const +#else +#define _GLIBCXX_SIMD_USE_CONSTEXPR constexpr +#endif + +#define _GLIBCXX_SIMD_LIST_BINARY(__macro) __macro(|) __macro(&) __macro(^) +#define _GLIBCXX_SIMD_LIST_SHIFTS(__macro) __macro(<<) __macro(>>) +#define _GLIBCXX_SIMD_LIST_ARITHMETICS(__macro) \ + __macro(+) __macro(-) __macro(*) __macro(/) __macro(%) + +#define _GLIBCXX_SIMD_ALL_BINARY(__macro) \ + _GLIBCXX_SIMD_LIST_BINARY(__macro) static_assert(true) +#define _GLIBCXX_SIMD_ALL_SHIFTS(__macro) \ + _GLIBCXX_SIMD_LIST_SHIFTS(__macro) static_assert(true) +#define _GLIBCXX_SIMD_ALL_ARITHMETICS(__macro) \ + _GLIBCXX_SIMD_LIST_ARITHMETICS(__macro) static_assert(true) + +#ifdef _GLIBCXX_SIMD_NO_ALWAYS_INLINE +#undef _GLIBCXX_SIMD_ALWAYS_INLINE +#define _GLIBCXX_SIMD_ALWAYS_INLINE inline +#undef _GLIBCXX_SIMD_INTRINSIC +#define _GLIBCXX_SIMD_INTRINSIC inline +#endif + +#if _GLIBCXX_SIMD_HAVE_SSE || _GLIBCXX_SIMD_HAVE_MMX +#define _GLIBCXX_SIMD_X86INTRIN 1 +#else +#define _GLIBCXX_SIMD_X86INTRIN 0 +#endif + +// workaround macros {{{ +// use aliasing loads to help GCC understand the data accesses better +// This also seems to hide a miscompilation on swap(x[i], x[i + 1]) with +// fixed_size_simd<float, 16> x. 
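+// (Editorial sketch, not part of the original commit: an "aliasing load" +// reads elements through a __may_alias-qualified pointer, e.g. +// return reinterpret_cast<const __may_alias<_Tp>*>(this)[__i]; +// which is what _SimdTuple::operator[] in simd_fixed_size.h does when +// this macro is defined.)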
+#define _GLIBCXX_SIMD_USE_ALIASING_LOADS 1 + +// vector conversions on x86 not optimized: +#if _GLIBCXX_SIMD_X86INTRIN +#define _GLIBCXX_SIMD_WORKAROUND_PR85048 1 +#endif + +// integer division not optimized +#define _GLIBCXX_SIMD_WORKAROUND_PR90993 1 + +// very bad codegen for extraction and concatenation of 128/256 "subregisters" +// with sizeof(element type) < 8: https://godbolt.org/g/mqUsgM +#if _GLIBCXX_SIMD_X86INTRIN +#define _GLIBCXX_SIMD_WORKAROUND_XXX_1 1 +#endif + +// bad codegen for 8-byte memcpy to __vector_type_t<char, 16> +#define _GLIBCXX_SIMD_WORKAROUND_PR90424 1 + +// bad codegen for zero-extend using simple concat(__x, 0) +#if _GLIBCXX_SIMD_X86INTRIN +#define _GLIBCXX_SIMD_WORKAROUND_XXX_3 1 +#endif + +// https://github.com/cplusplus/parallelism-ts/issues/65 (incorrect return type +// of static_simd_cast) +#define _GLIBCXX_SIMD_FIX_P2TS_ISSUE65 1 + +// https://github.com/cplusplus/parallelism-ts/issues/66 (incorrect SFINAE +// constraint on (static)_simd_cast) +#define _GLIBCXX_SIMD_FIX_P2TS_ISSUE66 1 +// }}} + +#endif // __cplusplus >= 201703L +#endif // _GLIBCXX_EXPERIMENTAL_SIMD_DETAIL_H_ + +// vim: foldmethod=marker diff --git a/libstdc++-v3/include/experimental/bits/simd_fixed_size.h b/libstdc++-v3/include/experimental/bits/simd_fixed_size.h new file mode 100644 index 00000000000..fba8c7e466e --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd_fixed_size.h @@ -0,0 +1,2066 @@ +// Simd fixed_size ABI specific implementations -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. + +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. + +/* + * The fixed_size ABI gives the following guarantees: + * - simd objects are passed via the stack + * - memory layout of `simd<_Tp, _Np>` is equivalent to `array<_Tp, _Np>` + * - alignment of `simd<_Tp, _Np>` is `_Np * sizeof(_Tp)` if _Np is a + * power-of-2 value, otherwise `std::__bit_ceil(_Np * sizeof(_Tp))` (Note: + * if the alignment were to exceed the system/compiler maximum, it is bounded + * to that maximum) + * - simd_mask objects are passed like bitset<_Np> + * - memory layout of `simd_mask<_Tp, _Np>` is equivalent to `bitset<_Np>` + * - alignment of `simd_mask<_Tp, _Np>` is equal to the alignment of + * `bitset<_Np>` + */ + +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_FIXED_SIZE_H_ +#define _GLIBCXX_EXPERIMENTAL_SIMD_FIXED_SIZE_H_ + +#if __cplusplus >= 201703L + +#include <array> + +_GLIBCXX_SIMD_BEGIN_NAMESPACE + +// __simd_tuple_element {{{ +template <size_t _I, typename _Tp> + struct __simd_tuple_element; + +template <typename _Tp, typename _A0, typename... 
_As> + struct __simd_tuple_element<0, _SimdTuple<_Tp, _A0, _As...>> + { using type = simd<_Tp, _A0>; }; + +template <size_t _I, typename _Tp, typename _A0, typename... _As> + struct __simd_tuple_element<_I, _SimdTuple<_Tp, _A0, _As...>> + { + using type = + typename __simd_tuple_element<_I - 1, _SimdTuple<_Tp, _As...>>::type; + }; + +template <size_t _I, typename _Tp> + using __simd_tuple_element_t = typename __simd_tuple_element<_I, _Tp>::type; + +// }}} +// __simd_tuple_concat {{{ + +template <typename _Tp, typename... _A0s, typename... _A1s> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple<_Tp, _A0s..., _A1s...> + __simd_tuple_concat(const _SimdTuple<_Tp, _A0s...>& __left, + const _SimdTuple<_Tp, _A1s...>& __right) + { + if constexpr (sizeof...(_A0s) == 0) + return __right; + else if constexpr (sizeof...(_A1s) == 0) + return __left; + else + return {__left.first, __simd_tuple_concat(__left.second, __right)}; + } + +template <typename _Tp, typename _A10, typename... _A1s> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple<_Tp, simd_abi::scalar, _A10, + _A1s...> + __simd_tuple_concat(const _Tp& __left, + const _SimdTuple<_Tp, _A10, _A1s...>& __right) + { return {__left, __right}; } + +// }}} +// __simd_tuple_pop_front {{{ +// Returns the tail of __x, i.e. __x with its first _Np elements dropped. +// Precondition: _Np must match the number of elements in __first (recursively) +template <size_t _Np, typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr decltype(auto) + __simd_tuple_pop_front(_Tp&& __x) + { + if constexpr (_Np == 0) + return static_cast<_Tp&&>(__x); + else + { + using _Up = __remove_cvref_t<_Tp>; + static_assert(_Np >= _Up::_S_first_size); + return __simd_tuple_pop_front<_Np - _Up::_S_first_size>(__x.second); + } + } + +// }}} +// __get_simd_at<_Np> {{{1 +struct __as_simd {}; + +struct __as_simd_tuple {}; + +template <typename _Tp, typename _A0, typename... _Abis> + _GLIBCXX_SIMD_INTRINSIC constexpr simd<_Tp, _A0> + __simd_tuple_get_impl(__as_simd, const _SimdTuple<_Tp, _A0, _Abis...>& __t, + _SizeConstant<0>) + { return {__private_init, __t.first}; } + +template <typename _Tp, typename _A0, typename... _Abis> + _GLIBCXX_SIMD_INTRINSIC constexpr const auto& + __simd_tuple_get_impl(__as_simd_tuple, + const _SimdTuple<_Tp, _A0, _Abis...>& __t, + _SizeConstant<0>) + { return __t.first; } + +template <typename _Tp, typename _A0, typename... _Abis> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& + __simd_tuple_get_impl(__as_simd_tuple, _SimdTuple<_Tp, _A0, _Abis...>& __t, + _SizeConstant<0>) + { return __t.first; } + +template <typename _R, size_t _Np, typename _Tp, typename... _Abis> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __simd_tuple_get_impl(_R, const _SimdTuple<_Tp, _Abis...>& __t, + _SizeConstant<_Np>) + { return __simd_tuple_get_impl(_R(), __t.second, _SizeConstant<_Np - 1>()); } + +template <size_t _Np, typename _Tp, typename... _Abis> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& + __simd_tuple_get_impl(__as_simd_tuple, _SimdTuple<_Tp, _Abis...>& __t, + _SizeConstant<_Np>) + { + return __simd_tuple_get_impl(__as_simd_tuple(), __t.second, + _SizeConstant<_Np - 1>()); + } + +template <size_t _Np, typename _Tp, typename... _Abis> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __get_simd_at(const _SimdTuple<_Tp, _Abis...>& __t) + { return __simd_tuple_get_impl(__as_simd(), __t, _SizeConstant<_Np>()); } + +// }}} +// __get_tuple_at<_Np> {{{ +template <size_t _Np, typename _Tp, typename... 
_Abis> + _GLIBCXX_SIMD_INTRINSIC constexpr auto + __get_tuple_at(const _SimdTuple<_Tp, _Abis...>& __t) + { + return __simd_tuple_get_impl(__as_simd_tuple(), __t, _SizeConstant<_Np>()); + } + +template <size_t _Np, typename _Tp, typename... _Abis> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& + __get_tuple_at(_SimdTuple<_Tp, _Abis...>& __t) + { + return __simd_tuple_get_impl(__as_simd_tuple(), __t, _SizeConstant<_Np>()); + } + +// __tuple_element_meta {{{1 +template <typename _Tp, typename _Abi, size_t _Offset> + struct __tuple_element_meta : public _Abi::_SimdImpl + { + static_assert(is_same_v<typename _Abi::_SimdImpl::abi_type, + _Abi>); // this fails e.g. when _SimdImpl is an + // alias for _SimdImplBuiltin<_DifferentAbi> + using value_type = _Tp; + using abi_type = _Abi; + using _Traits = _SimdTraits<_Tp, _Abi>; + using _MaskImpl = typename _Abi::_MaskImpl; + using _MaskMember = typename _Traits::_MaskMember; + using simd_type = simd<_Tp, _Abi>; + static constexpr size_t _S_offset = _Offset; + static constexpr size_t _S_size() { return simd_size<_Tp, _Abi>::value; } + static constexpr _MaskImpl _S_mask_impl = {}; + + template <size_t _Np, bool _Sanitized> + _GLIBCXX_SIMD_INTRINSIC static auto + _S_submask(_BitMask<_Np, _Sanitized> __bits) + { return __bits.template _M_extract<_Offset, _S_size()>(); } + + template <size_t _Np, bool _Sanitized> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember + _S_make_mask(_BitMask<_Np, _Sanitized> __bits) + { + return _MaskImpl::template _S_convert<_Tp>( + __bits.template _M_extract<_Offset, _S_size()>()._M_sanitized()); + } + + _GLIBCXX_SIMD_INTRINSIC static _ULLong + _S_mask_to_shifted_ullong(_MaskMember __k) + { return _MaskImpl::_S_to_bits(__k).to_ullong() << _Offset; } + }; + +template <size_t _Offset, typename _Tp, typename _Abi, typename... 
_As> + __tuple_element_meta<_Tp, _Abi, _Offset> + __make_meta(const _SimdTuple<_Tp, _Abi, _As...>&) + { return {}; } + +// }}}1 +// _WithOffset wrapper class {{{ +template <size_t _Offset, typename _Base> + struct _WithOffset : public _Base + { + static inline constexpr size_t _S_offset = _Offset; + + _GLIBCXX_SIMD_INTRINSIC char* _M_as_charptr() + { + return reinterpret_cast<char*>(this) + + _S_offset * sizeof(typename _Base::value_type); + } + + _GLIBCXX_SIMD_INTRINSIC const char* _M_as_charptr() const + { + return reinterpret_cast<const char*>(this) + + _S_offset * sizeof(typename _Base::value_type); + } + }; + +// make _WithOffset<_WithOffset> ill-formed to use: +template <size_t _O0, size_t _O1, typename _Base> + struct _WithOffset<_O0, _WithOffset<_O1, _Base>> {}; + +template <size_t _Offset, typename _Tp> + decltype(auto) + __add_offset(_Tp& __base) + { return static_cast<_WithOffset<_Offset, __remove_cvref_t<_Tp>>&>(__base); } + +template <size_t _Offset, typename _Tp> + decltype(auto) + __add_offset(const _Tp& __base) + { + return static_cast<const _WithOffset<_Offset, __remove_cvref_t<_Tp>>&>( + __base); + } + +template <size_t _Offset, size_t _ExistingOffset, typename _Tp> + decltype(auto) + __add_offset(_WithOffset<_ExistingOffset, _Tp>& __base) + { + return static_cast<_WithOffset<_Offset + _ExistingOffset, _Tp>&>( + static_cast<_Tp&>(__base)); + } + +template <size_t _Offset, size_t _ExistingOffset, typename _Tp> + decltype(auto) + __add_offset(const _WithOffset<_ExistingOffset, _Tp>& __base) + { + return static_cast<const _WithOffset<_Offset + _ExistingOffset, _Tp>&>( + static_cast<const _Tp&>(__base)); + } + +template <typename _Tp> + constexpr inline size_t __offset = 0; + +template <size_t _Offset, typename _Tp> + constexpr inline size_t __offset<_WithOffset<_Offset, _Tp>> + = _WithOffset<_Offset, _Tp>::_S_offset; + +template <typename _Tp> + constexpr inline size_t __offset<const _Tp> = __offset<_Tp>; + +template <typename _Tp> + constexpr inline size_t __offset<_Tp&> = __offset<_Tp>; + +template <typename _Tp> + constexpr inline size_t __offset<_Tp&&> = __offset<_Tp>; + +// }}} +// _SimdTuple specializations {{{1 +// empty {{{2 +template <typename _Tp> + struct _SimdTuple<_Tp> + { + using value_type = _Tp; + static constexpr size_t _S_tuple_size = 0; + static constexpr size_t _S_size() { return 0; } + }; + +// _SimdTupleData {{{2 +template <typename _FirstType, typename _SecondType> + struct _SimdTupleData + { + _FirstType first; + _SecondType second; + + _GLIBCXX_SIMD_INTRINSIC + constexpr bool _M_is_constprop() const + { + if constexpr (is_class_v<_FirstType>) + return first._M_is_constprop() && second._M_is_constprop(); + else + return __builtin_constant_p(first) && second._M_is_constprop(); + } + }; + +template <typename _FirstType, typename _Tp> + struct _SimdTupleData<_FirstType, _SimdTuple<_Tp>> + { + _FirstType first; + static constexpr _SimdTuple<_Tp> second = {}; + + _GLIBCXX_SIMD_INTRINSIC + constexpr bool _M_is_constprop() const + { + if constexpr (is_class_v<_FirstType>) + return first._M_is_constprop(); + else + return __builtin_constant_p(first); + } + }; + +// 1 or more {{{2 +template <typename _Tp, typename _Abi0, typename... 
_Abis> + struct _SimdTuple<_Tp, _Abi0, _Abis...> + : _SimdTupleData<typename _SimdTraits<_Tp, _Abi0>::_SimdMember, + _SimdTuple<_Tp, _Abis...>> + { + static_assert(!__is_fixed_size_abi_v<_Abi0>); + using value_type = _Tp; + using _FirstType = typename _SimdTraits<_Tp, _Abi0>::_SimdMember; + using _FirstAbi = _Abi0; + using _SecondType = _SimdTuple<_Tp, _Abis...>; + static constexpr size_t _S_tuple_size = sizeof...(_Abis) + 1; + + static constexpr size_t _S_size() + { return simd_size_v<_Tp, _Abi0> + _SecondType::_S_size(); } + + static constexpr size_t _S_first_size = simd_size_v<_Tp, _Abi0>; + static constexpr bool _S_is_homogeneous = (is_same_v<_Abi0, _Abis> && ...); + + using _Base = _SimdTupleData<typename _SimdTraits<_Tp, _Abi0>::_SimdMember, + _SimdTuple<_Tp, _Abis...>>; + using _Base::first; + using _Base::second; + + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple() = default; + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple(const _SimdTuple&) = default; + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple& operator=(const _SimdTuple&) + = default; + + template <typename _Up> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple(_Up&& __x) + : _Base{static_cast<_Up&&>(__x)} {} + + template <typename _Up, typename _Up2> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple(_Up&& __x, _Up2&& __y) + : _Base{static_cast<_Up&&>(__x), static_cast<_Up2&&>(__y)} {} + + template <typename _Up> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple(_Up&& __x, _SimdTuple<_Tp>) + : _Base{static_cast<_Up&&>(__x)} {} + + _GLIBCXX_SIMD_INTRINSIC char* _M_as_charptr() + { return reinterpret_cast<char*>(this); } + + _GLIBCXX_SIMD_INTRINSIC const char* _M_as_charptr() const + { return reinterpret_cast<const char*>(this); } + + template <size_t _Np> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& _M_at() + { + if constexpr (_Np == 0) + return first; + else + return second.template _M_at<_Np - 1>(); + } + + template <size_t _Np> + _GLIBCXX_SIMD_INTRINSIC constexpr const auto& _M_at() const + { + if constexpr (_Np == 0) + return first; + else + return second.template _M_at<_Np - 1>(); + } + + template <size_t _Np> + _GLIBCXX_SIMD_INTRINSIC constexpr auto _M_simd_at() const + { + if constexpr (_Np == 0) + return simd<_Tp, _Abi0>(__private_init, first); + else + return second.template _M_simd_at<_Np - 1>(); + } + + template <size_t _Offset = 0, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdTuple + _S_generate(_Fp&& __gen, _SizeConstant<_Offset> = {}) + { + auto&& __first = __gen(__tuple_element_meta<_Tp, _Abi0, _Offset>()); + if constexpr (_S_tuple_size == 1) + return {__first}; + else + return {__first, + _SecondType::_S_generate( + static_cast<_Fp&&>(__gen), + _SizeConstant<_Offset + simd_size_v<_Tp, _Abi0>>())}; + } + + template <size_t _Offset = 0, typename _Fp, typename... _More> + _GLIBCXX_SIMD_INTRINSIC _SimdTuple + _M_apply_wrapped(_Fp&& __fun, const _More&... 
__more) const + { + auto&& __first + = __fun(__make_meta<_Offset>(*this), first, __more.first...); + if constexpr (_S_tuple_size == 1) + return {__first}; + else + return { + __first, + second.template _M_apply_wrapped<_Offset + simd_size_v<_Tp, _Abi0>>( + static_cast<_Fp&&>(__fun), __more.second...)}; + } + + template <typename _Tup> + _GLIBCXX_SIMD_INTRINSIC constexpr decltype(auto) + _M_extract_argument(_Tup&& __tup) const + { + using _TupT = typename __remove_cvref_t<_Tup>::value_type; + if constexpr (is_same_v<_SimdTuple, __remove_cvref_t<_Tup>>) + return __tup.first; + else if (__builtin_is_constant_evaluated()) + return __fixed_size_storage_t<_TupT, _S_first_size>::_S_generate([&]( + auto __meta) constexpr { + return __meta._S_generator( + [&](auto __i) constexpr { return __tup[__i]; }, + static_cast<_TupT*>(nullptr)); + }); + else + return [&]() { + __fixed_size_storage_t<_TupT, _S_first_size> __r; + __builtin_memcpy(__r._M_as_charptr(), __tup._M_as_charptr(), + sizeof(__r)); + return __r; + }(); + } + + template <typename _Tup> + _GLIBCXX_SIMD_INTRINSIC constexpr auto& + _M_skip_argument(_Tup&& __tup) const + { + static_assert(_S_tuple_size > 1); + using _Up = __remove_cvref_t<_Tup>; + constexpr size_t __off = __offset<_Up>; + if constexpr (_S_first_size == _Up::_S_first_size && __off == 0) + return __tup.second; + else if constexpr (_S_first_size > _Up::_S_first_size + && _S_first_size % _Up::_S_first_size == 0 + && __off == 0) + return __simd_tuple_pop_front<_S_first_size>(__tup); + else if constexpr (_S_first_size + __off < _Up::_S_first_size) + return __add_offset<_S_first_size>(__tup); + else if constexpr (_S_first_size + __off == _Up::_S_first_size) + return __tup.second; + else + __assert_unreachable<_Tup>(); + } + + template <size_t _Offset, typename... _More> + _GLIBCXX_SIMD_INTRINSIC constexpr void + _M_assign_front(const _SimdTuple<_Tp, _Abi0, _More...>& __x) & + { + static_assert(_Offset == 0); + first = __x.first; + if constexpr (sizeof...(_More) > 0) + { + static_assert(sizeof...(_Abis) >= sizeof...(_More)); + second.template _M_assign_front<0>(__x.second); + } + } + + template <size_t _Offset> + _GLIBCXX_SIMD_INTRINSIC constexpr void + _M_assign_front(const _FirstType& __x) & + { + static_assert(_Offset == 0); + first = __x; + } + + template <size_t _Offset, typename... _As> + _GLIBCXX_SIMD_INTRINSIC constexpr void + _M_assign_front(const _SimdTuple<_Tp, _As...>& __x) & + { + __builtin_memcpy(_M_as_charptr() + _Offset * sizeof(value_type), + __x._M_as_charptr(), + sizeof(_Tp) * _SimdTuple<_Tp, _As...>::_S_size()); + } + + /* + * Iterate over the first objects in this _SimdTuple and call __fun for each + * of them. If additional arguments are passed via __more, chunk them into + * _SimdTuple or __vector_type_t objects of the same number of values. + */ + template <typename _Fp, typename... _More> + _GLIBCXX_SIMD_INTRINSIC constexpr _SimdTuple + _M_apply_per_chunk(_Fp&& __fun, _More&&... __more) const + { + if constexpr ((... + || conjunction_v< + is_lvalue_reference<_More>, + negation<is_const<remove_reference_t<_More>>>>) ) + { + // need to write back at least one of __more after calling __fun + auto&& __first = [&](auto... 
__args) constexpr + { + auto __r = __fun(__tuple_element_meta<_Tp, _Abi0, 0>(), first, + __args...); + [[maybe_unused]] auto&& __ignore_me = {( + [](auto&& __dst, const auto& __src) { + if constexpr (is_assignable_v<decltype(__dst), + decltype(__dst)>) + { + __dst.template _M_assign_front<__offset<decltype(__dst)>>( + __src); + } + }(static_cast<_More&&>(__more), __args), + 0)...}; + return __r; + } + (_M_extract_argument(__more)...); + if constexpr (_S_tuple_size == 1) + return {__first}; + else + return {__first, + second._M_apply_per_chunk(static_cast<_Fp&&>(__fun), + _M_skip_argument(__more)...)}; + } + else + { + auto&& __first = __fun(__tuple_element_meta<_Tp, _Abi0, 0>(), first, + _M_extract_argument(__more)...); + if constexpr (_S_tuple_size == 1) + return {__first}; + else + return {__first, + second._M_apply_per_chunk(static_cast<_Fp&&>(__fun), + _M_skip_argument(__more)...)}; + } + } + + template <typename _R = _Tp, typename _Fp, typename... _More> + _GLIBCXX_SIMD_INTRINSIC auto _M_apply_r(_Fp&& __fun, + const _More&... __more) const + { + auto&& __first = __fun(__tuple_element_meta<_Tp, _Abi0, 0>(), first, + __more.first...); + if constexpr (_S_tuple_size == 1) + return __first; + else + return __simd_tuple_concat<_R>( + __first, second.template _M_apply_r<_R>(static_cast<_Fp&&>(__fun), + __more.second...)); + } + + template <typename _Fp, typename... _More> + _GLIBCXX_SIMD_INTRINSIC constexpr friend _SanitizedBitMask<_S_size()> + _M_test(const _Fp& __fun, const _SimdTuple& __x, const _More&... __more) + { + const _SanitizedBitMask<_S_first_size> __first + = _Abi0::_MaskImpl::_S_to_bits( + __fun(__tuple_element_meta<_Tp, _Abi0, 0>(), __x.first, + __more.first...)); + if constexpr (_S_tuple_size == 1) + return __first; + else + return _M_test(__fun, __x.second, __more.second...) + ._M_prepend(__first); + } + + template <typename _Up, _Up _I> + _GLIBCXX_SIMD_INTRINSIC constexpr _Tp + operator[](integral_constant<_Up, _I>) const noexcept + { + if constexpr (_I < simd_size_v<_Tp, _Abi0>) + return _M_subscript_read(_I); + else + return second[integral_constant<_Up, _I - simd_size_v<_Tp, _Abi0>>()]; + } + + _Tp operator[](size_t __i) const noexcept + { + if constexpr (_S_tuple_size == 1) + return _M_subscript_read(__i); + else + { +#ifdef _GLIBCXX_SIMD_USE_ALIASING_LOADS + return reinterpret_cast<const __may_alias<_Tp>*>(this)[__i]; +#else + if constexpr (__is_scalar_abi<_Abi0>()) + { + const _Tp* ptr = &first; + return ptr[__i]; + } + else + return __i < simd_size_v<_Tp, _Abi0> + ? 
_M_subscript_read(__i) + : second[__i - simd_size_v<_Tp, _Abi0>]; +#endif + } + } + + void _M_set(size_t __i, _Tp __val) noexcept + { + if constexpr (_S_tuple_size == 1) + return _M_subscript_write(__i, __val); + else + { +#ifdef _GLIBCXX_SIMD_USE_ALIASING_LOADS + reinterpret_cast<__may_alias<_Tp>*>(this)[__i] = __val; +#else + if (__i < simd_size_v<_Tp, _Abi0>) + _M_subscript_write(__i, __val); + else + second._M_set(__i - simd_size_v<_Tp, _Abi0>, __val); +#endif + } + } + + private: + // _M_subscript_read/_write {{{ + _Tp _M_subscript_read([[maybe_unused]] size_t __i) const noexcept + { + if constexpr (__is_vectorizable_v<_FirstType>) + return first; + else + return first[__i]; + } + + void _M_subscript_write([[maybe_unused]] size_t __i, _Tp __y) noexcept + { + if constexpr (__is_vectorizable_v<_FirstType>) + first = __y; + else + first._M_set(__i, __y); + } + + // }}} + }; + +// __make_simd_tuple {{{1 +template <typename _Tp, typename _A0> + _GLIBCXX_SIMD_INTRINSIC _SimdTuple<_Tp, _A0> + __make_simd_tuple(simd<_Tp, _A0> __x0) + { return {__data(__x0)}; } + +template <typename _Tp, typename _A0, typename... _As> + _GLIBCXX_SIMD_INTRINSIC _SimdTuple<_Tp, _A0, _As...> + __make_simd_tuple(const simd<_Tp, _A0>& __x0, const simd<_Tp, _As>&... __xs) + { return {__data(__x0), __make_simd_tuple(__xs...)}; } + +template <typename _Tp, typename _A0> + _GLIBCXX_SIMD_INTRINSIC _SimdTuple<_Tp, _A0> + __make_simd_tuple(const typename _SimdTraits<_Tp, _A0>::_SimdMember& __arg0) + { return {__arg0}; } + +template <typename _Tp, typename _A0, typename _A1, typename... _Abis> + _GLIBCXX_SIMD_INTRINSIC _SimdTuple<_Tp, _A0, _A1, _Abis...> + __make_simd_tuple( + const typename _SimdTraits<_Tp, _A0>::_SimdMember& __arg0, + const typename _SimdTraits<_Tp, _A1>::_SimdMember& __arg1, + const typename _SimdTraits<_Tp, _Abis>::_SimdMember&... __args) + { return {__arg0, __make_simd_tuple<_Tp, _A1, _Abis...>(__arg1, __args...)}; } + +// __to_simd_tuple {{{1 +template <typename _Tp, size_t _Np, typename _V, size_t _NV, typename... _VX> + _GLIBCXX_SIMD_INTRINSIC constexpr __fixed_size_storage_t<_Tp, _Np> + __to_simd_tuple(const array<_V, _NV>& __from, const _VX... __fromX); + +template <typename _Tp, size_t _Np, + size_t _Offset = 0, // skip this many elements in __from0 + typename _R = __fixed_size_storage_t<_Tp, _Np>, typename _V0, + typename _V0VT = _VectorTraits<_V0>, typename... _VX> + _GLIBCXX_SIMD_INTRINSIC _R constexpr __to_simd_tuple(const _V0 __from0, + const _VX... 
__fromX) + { + static_assert(is_same_v<typename _V0VT::value_type, _Tp>); + static_assert(_Offset < _V0VT::_S_full_size); + using _R0 = __vector_type_t<_Tp, _R::_S_first_size>; + if constexpr (_R::_S_tuple_size == 1) + { + if constexpr (_Np == 1) + return _R{__from0[_Offset]}; + else if constexpr (_Offset == 0 && _V0VT::_S_full_size >= _Np) + return _R{__intrin_bitcast<_R0>(__from0)}; + else if constexpr (_Offset * 2 == _V0VT::_S_full_size + && _V0VT::_S_full_size / 2 >= _Np) + return _R{__intrin_bitcast<_R0>(__extract_part<1, 2>(__from0))}; + else if constexpr (_Offset * 4 == _V0VT::_S_full_size + && _V0VT::_S_full_size / 4 >= _Np) + return _R{__intrin_bitcast<_R0>(__extract_part<1, 4>(__from0))}; + else + __assert_unreachable<_Tp>(); + } + else + { + if constexpr (1 == _R::_S_first_size) + { // extract one scalar and recurse + if constexpr (_Offset + 1 < _V0VT::_S_full_size) + return _R{__from0[_Offset], + __to_simd_tuple<_Tp, _Np - 1, _Offset + 1>(__from0, + __fromX...)}; + else + return _R{__from0[_Offset], + __to_simd_tuple<_Tp, _Np - 1, 0>(__fromX...)}; + } + + // place __from0 into _R::first and recurse for __fromX -> _R::second + else if constexpr (_V0VT::_S_full_size == _R::_S_first_size + && _Offset == 0) + return _R{__from0, + __to_simd_tuple<_Tp, _Np - _R::_S_first_size>(__fromX...)}; + + // place lower part of __from0 into _R::first and recurse with _Offset + else if constexpr (_V0VT::_S_full_size > _R::_S_first_size + && _Offset == 0) + return _R{__intrin_bitcast<_R0>(__from0), + __to_simd_tuple<_Tp, _Np - _R::_S_first_size, + _R::_S_first_size>(__from0, __fromX...)}; + + // place lower part of second quarter of __from0 into _R::first and + // recurse with _Offset + else if constexpr (_Offset * 4 == _V0VT::_S_full_size + && _V0VT::_S_full_size >= 4 * _R::_S_first_size) + return _R{__intrin_bitcast<_R0>(__extract_part<2, 4>(__from0)), + __to_simd_tuple<_Tp, _Np - _R::_S_first_size, + _Offset + _R::_S_first_size>(__from0, + __fromX...)}; + + // place lower half of high half of __from0 into _R::first and recurse + // with _Offset + else if constexpr (_Offset * 2 == _V0VT::_S_full_size + && _V0VT::_S_full_size >= 4 * _R::_S_first_size) + return _R{__intrin_bitcast<_R0>(__extract_part<2, 4>(__from0)), + __to_simd_tuple<_Tp, _Np - _R::_S_first_size, + _Offset + _R::_S_first_size>(__from0, + __fromX...)}; + + // place high half of __from0 into _R::first and recurse with __fromX + else if constexpr (_Offset * 2 == _V0VT::_S_full_size + && _V0VT::_S_full_size / 2 >= _R::_S_first_size) + return _R{__intrin_bitcast<_R0>(__extract_part<1, 2>(__from0)), + __to_simd_tuple<_Tp, _Np - _R::_S_first_size, 0>( + __fromX...)}; + + // ill-formed if some unforeseen pattern is needed + else + __assert_unreachable<_Tp>(); + } + } + +template <typename _Tp, size_t _Np, typename _V, size_t _NV, typename... _VX> + _GLIBCXX_SIMD_INTRINSIC constexpr __fixed_size_storage_t<_Tp, _Np> + __to_simd_tuple(const array<_V, _NV>& __from, const _VX... __fromX) + { + if constexpr (is_same_v<_Tp, _V>) + { + static_assert( + sizeof...(_VX) == 0, + "An array of scalars must be the last argument to __to_simd_tuple"); + return __call_with_subscripts( + __from, + make_index_sequence<_NV>(), [&](const auto... __args) constexpr { + return __simd_tuple_concat( + _SimdTuple<_Tp, simd_abi::scalar>{__args}..., _SimdTuple<_Tp>()); + }); + } + else + return __call_with_subscripts( + __from, + make_index_sequence<_NV>(), [&](const auto... 
__args) constexpr { + return __to_simd_tuple<_Tp, _Np>(__args..., __fromX...); + }); + } + +template <size_t, typename _Tp> + using __to_tuple_helper = _Tp; + +template <typename _Tp, typename _A0, size_t _NOut, size_t _Np, + size_t... _Indexes> + _GLIBCXX_SIMD_INTRINSIC __fixed_size_storage_t<_Tp, _NOut> + __to_simd_tuple_impl(index_sequence<_Indexes...>, + const array<__vector_type_t<_Tp, simd_size_v<_Tp, _A0>>, _Np>& __args) + { + return __make_simd_tuple<_Tp, __to_tuple_helper<_Indexes, _A0>...>( + __args[_Indexes]...); + } + +template <typename _Tp, typename _A0, size_t _NOut, size_t _Np, + typename _R = __fixed_size_storage_t<_Tp, _NOut>> + _GLIBCXX_SIMD_INTRINSIC _R + __to_simd_tuple_sized( + const array<__vector_type_t<_Tp, simd_size_v<_Tp, _A0>>, _Np>& __args) + { + static_assert(_Np * simd_size_v<_Tp, _A0> >= _NOut); + return __to_simd_tuple_impl<_Tp, _A0, _NOut>( + make_index_sequence<_R::_S_tuple_size>(), __args); + } + +// __optimize_simd_tuple {{{1 +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC _SimdTuple<_Tp> + __optimize_simd_tuple(const _SimdTuple<_Tp>) + { return {}; } + +template <typename _Tp, typename _Ap> + _GLIBCXX_SIMD_INTRINSIC const _SimdTuple<_Tp, _Ap>& + __optimize_simd_tuple(const _SimdTuple<_Tp, _Ap>& __x) + { return __x; } + +template <typename _Tp, typename _A0, typename _A1, typename... _Abis, + typename _R = __fixed_size_storage_t< + _Tp, _SimdTuple<_Tp, _A0, _A1, _Abis...>::_S_size()>> + _GLIBCXX_SIMD_INTRINSIC _R + __optimize_simd_tuple(const _SimdTuple<_Tp, _A0, _A1, _Abis...>& __x) + { + using _Tup = _SimdTuple<_Tp, _A0, _A1, _Abis...>; + if constexpr (is_same_v<_R, _Tup>) + return __x; + else if constexpr (is_same_v<typename _R::_FirstType, + typename _Tup::_FirstType>) + return {__x.first, __optimize_simd_tuple(__x.second)}; + else if constexpr (__is_scalar_abi<_A0>() + || _A0::template _S_is_partial<_Tp>) + return {__generate_from_n_evaluations<_R::_S_first_size, + typename _R::_FirstType>( + [&](auto __i) { return __x[__i]; }), + __optimize_simd_tuple( + __simd_tuple_pop_front<_R::_S_first_size>(__x))}; + else if constexpr (is_same_v<_A0, _A1> + && _R::_S_first_size == simd_size_v<_Tp, _A0> + simd_size_v<_Tp, _A1>) + return {__concat(__x.template _M_at<0>(), __x.template _M_at<1>()), + __optimize_simd_tuple(__x.second.second)}; + else if constexpr (sizeof...(_Abis) >= 2 + && _R::_S_first_size == (4 * simd_size_v<_Tp, _A0>) + && simd_size_v<_Tp, _A0> == __simd_tuple_element_t< + (sizeof...(_Abis) >= 2 ? 3 : 0), _Tup>::size()) + return { + __concat(__concat(__x.template _M_at<0>(), __x.template _M_at<1>()), + __concat(__x.template _M_at<2>(), __x.template _M_at<3>())), + __optimize_simd_tuple(__x.second.second.second.second)}; + else + { + static_assert(sizeof(_R) == sizeof(__x)); + _R __r; + __builtin_memcpy(__r._M_as_charptr(), __x._M_as_charptr(), + sizeof(_Tp) * _R::_S_size()); + return __r; + } + } + +// __for_each(const _SimdTuple &, Fun) {{{1 +template <size_t _Offset = 0, typename _Tp, typename _A0, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __for_each(const _SimdTuple<_Tp, _A0>& __t, _Fp&& __fun) + { static_cast<_Fp&&>(__fun)(__make_meta<_Offset>(__t), __t.first); } + +template <size_t _Offset = 0, typename _Tp, typename _A0, typename _A1, + typename... 
_As, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __for_each(const _SimdTuple<_Tp, _A0, _A1, _As...>& __t, _Fp&& __fun) + { + __fun(__make_meta<_Offset>(__t), __t.first); + __for_each<_Offset + simd_size<_Tp, _A0>::value>(__t.second, + static_cast<_Fp&&>(__fun)); + } + +// __for_each(_SimdTuple &, Fun) {{{1 +template <size_t _Offset = 0, typename _Tp, typename _A0, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __for_each(_SimdTuple<_Tp, _A0>& __t, _Fp&& __fun) + { static_cast<_Fp&&>(__fun)(__make_meta<_Offset>(__t), __t.first); } + +template <size_t _Offset = 0, typename _Tp, typename _A0, typename _A1, + typename... _As, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __for_each(_SimdTuple<_Tp, _A0, _A1, _As...>& __t, _Fp&& __fun) + { + __fun(__make_meta<_Offset>(__t), __t.first); + __for_each<_Offset + simd_size<_Tp, _A0>::value>(__t.second, + static_cast<_Fp&&>(__fun)); + } + +// __for_each(_SimdTuple &, const _SimdTuple &, Fun) {{{1 +template <size_t _Offset = 0, typename _Tp, typename _A0, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __for_each(_SimdTuple<_Tp, _A0>& __a, const _SimdTuple<_Tp, _A0>& __b, + _Fp&& __fun) + { + static_cast<_Fp&&>(__fun)(__make_meta<_Offset>(__a), __a.first, __b.first); + } + +template <size_t _Offset = 0, typename _Tp, typename _A0, typename _A1, + typename... _As, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __for_each(_SimdTuple<_Tp, _A0, _A1, _As...>& __a, + const _SimdTuple<_Tp, _A0, _A1, _As...>& __b, _Fp&& __fun) + { + __fun(__make_meta<_Offset>(__a), __a.first, __b.first); + __for_each<_Offset + simd_size<_Tp, _A0>::value>(__a.second, __b.second, + static_cast<_Fp&&>(__fun)); + } + +// __for_each(const _SimdTuple &, const _SimdTuple &, Fun) {{{1 +template <size_t _Offset = 0, typename _Tp, typename _A0, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __for_each(const _SimdTuple<_Tp, _A0>& __a, const _SimdTuple<_Tp, _A0>& __b, + _Fp&& __fun) + { + static_cast<_Fp&&>(__fun)(__make_meta<_Offset>(__a), __a.first, __b.first); + } + +template <size_t _Offset = 0, typename _Tp, typename _A0, typename _A1, + typename... _As, typename _Fp> + _GLIBCXX_SIMD_INTRINSIC constexpr void + __for_each(const _SimdTuple<_Tp, _A0, _A1, _As...>& __a, + const _SimdTuple<_Tp, _A0, _A1, _As...>& __b, _Fp&& __fun) + { + __fun(__make_meta<_Offset>(__a), __a.first, __b.first); + __for_each<_Offset + simd_size<_Tp, _A0>::value>(__a.second, __b.second, + static_cast<_Fp&&>(__fun)); + } + +// }}}1 +// __extract_part(_SimdTuple) {{{ +template <int _Index, int _Total, int _Combine, typename _Tp, typename _A0, + typename... 
_As> + _GLIBCXX_SIMD_INTRINSIC auto // __vector_type_t or _SimdTuple + __extract_part(const _SimdTuple<_Tp, _A0, _As...>& __x) + { + // worst cases: + // (a) 4, 4, 4 => 3, 3, 3, 3 (_Total = 4) + // (b) 2, 2, 2 => 3, 3 (_Total = 2) + // (c) 4, 2 => 2, 2, 2 (_Total = 3) + using _Tuple = _SimdTuple<_Tp, _A0, _As...>; + static_assert(_Index + _Combine <= _Total && _Index >= 0 && _Total >= 1); + constexpr size_t _Np = _Tuple::_S_size(); + static_assert(_Np >= _Total && _Np % _Total == 0); + constexpr size_t __values_per_part = _Np / _Total; + [[maybe_unused]] constexpr size_t __values_to_skip + = _Index * __values_per_part; + constexpr size_t __return_size = __values_per_part * _Combine; + using _RetAbi = simd_abi::deduce_t<_Tp, __return_size>; + + // handle (optimize) the simple cases + if constexpr (_Index == 0 && _Tuple::_S_first_size == __return_size) + return __x.first._M_data; + else if constexpr (_Index == 0 && _Total == _Combine) + return __x; + else if constexpr (_Index == 0 && _Tuple::_S_first_size >= __return_size) + return __intrin_bitcast<__vector_type_t<_Tp, __return_size>>( + __as_vector(__x.first)); + + // recurse to skip unused data members at the beginning of _SimdTuple + else if constexpr (__values_to_skip >= _Tuple::_S_first_size) + { // recurse + if constexpr (_Tuple::_S_first_size % __values_per_part == 0) + { + constexpr int __parts_in_first + = _Tuple::_S_first_size / __values_per_part; + return __extract_part<_Index - __parts_in_first, + _Total - __parts_in_first, _Combine>( + __x.second); + } + else + return __extract_part<__values_to_skip - _Tuple::_S_first_size, + _Np - _Tuple::_S_first_size, __return_size>( + __x.second); + } + + // extract from multiple _SimdTuple data members + else if constexpr (__return_size > _Tuple::_S_first_size - __values_to_skip) + { +#ifdef _GLIBCXX_SIMD_USE_ALIASING_LOADS + const __may_alias<_Tp>* const element_ptr + = reinterpret_cast<const __may_alias<_Tp>*>(&__x) + __values_to_skip; + return __as_vector(simd<_Tp, _RetAbi>(element_ptr, element_aligned)); +#else + [[maybe_unused]] constexpr size_t __offset = __values_to_skip; + return __as_vector(simd<_Tp, _RetAbi>([&](auto __i) constexpr { + constexpr _SizeConstant<__i + __offset> __k; + return __x[__k]; + })); +#endif + } + + // all of the return values are in __x.first + else if constexpr (_Tuple::_S_first_size % __values_per_part == 0) + return __extract_part<_Index, _Tuple::_S_first_size / __values_per_part, + _Combine>(__x.first); + else + return __extract_part<__values_to_skip, _Tuple::_S_first_size, + _Combine * __values_per_part>(__x.first); + } + +// }}} +// __fixed_size_storage_t<_Tp, _Np>{{{ +template <typename _Tp, int _Np, typename _Tuple, + typename _Next = simd<_Tp, _AllNativeAbis::_BestAbi<_Tp, _Np>>, + int _Remain = _Np - int(_Next::size())> + struct __fixed_size_storage_builder; + +template <typename _Tp, int _Np> + struct __fixed_size_storage + : public __fixed_size_storage_builder<_Tp, _Np, _SimdTuple<_Tp>> {}; + +template <typename _Tp, int _Np, typename... _As, typename _Next> + struct __fixed_size_storage_builder<_Tp, _Np, _SimdTuple<_Tp, _As...>, _Next, + 0> + { using type = _SimdTuple<_Tp, _As..., typename _Next::abi_type>; }; + +template <typename _Tp, int _Np, typename... 
_As, typename _Next, int _Remain> + struct __fixed_size_storage_builder<_Tp, _Np, _SimdTuple<_Tp, _As...>, _Next, + _Remain> + { + using type = typename __fixed_size_storage_builder< + _Tp, _Remain, _SimdTuple<_Tp, _As..., typename _Next::abi_type>>::type; + }; + +// }}} +// _AbisInSimdTuple {{{ +template <typename _Tp> + struct _SeqOp; + +template <size_t _I0, size_t... _Is> + struct _SeqOp<index_sequence<_I0, _Is...>> + { + using _FirstPlusOne = index_sequence<_I0 + 1, _Is...>; + using _NotFirstPlusOne = index_sequence<_I0, (_Is + 1)...>; + template <size_t _First, size_t _Add> + using _Prepend = index_sequence<_First, _I0 + _Add, (_Is + _Add)...>; + }; + +template <typename _Tp> + struct _AbisInSimdTuple; + +template <typename _Tp> + struct _AbisInSimdTuple<_SimdTuple<_Tp>> + { + using _Counts = index_sequence<0>; + using _Begins = index_sequence<0>; + }; + +template <typename _Tp, typename _Ap> + struct _AbisInSimdTuple<_SimdTuple<_Tp, _Ap>> + { + using _Counts = index_sequence<1>; + using _Begins = index_sequence<0>; + }; + +template <typename _Tp, typename _A0, typename... _As> + struct _AbisInSimdTuple<_SimdTuple<_Tp, _A0, _A0, _As...>> + { + using _Counts = typename _SeqOp<typename _AbisInSimdTuple< + _SimdTuple<_Tp, _A0, _As...>>::_Counts>::_FirstPlusOne; + using _Begins = typename _SeqOp<typename _AbisInSimdTuple< + _SimdTuple<_Tp, _A0, _As...>>::_Begins>::_NotFirstPlusOne; + }; + +template <typename _Tp, typename _A0, typename _A1, typename... _As> + struct _AbisInSimdTuple<_SimdTuple<_Tp, _A0, _A1, _As...>> + { + using _Counts = typename _SeqOp<typename _AbisInSimdTuple< + _SimdTuple<_Tp, _A1, _As...>>::_Counts>::template _Prepend<1, 0>; + using _Begins = typename _SeqOp<typename _AbisInSimdTuple< + _SimdTuple<_Tp, _A1, _As...>>::_Begins>::template _Prepend<0, 1>; + }; + +// }}} +// __autocvt_to_simd {{{ +template <typename _Tp, bool = is_arithmetic_v<__remove_cvref_t<_Tp>>> + struct __autocvt_to_simd + { + _Tp _M_data; + using _TT = __remove_cvref_t<_Tp>; + + operator _TT() + { return _M_data; } + + operator _TT&() + { + static_assert(is_lvalue_reference<_Tp>::value, ""); + static_assert(!is_const<_Tp>::value, ""); + return _M_data; + } + + operator _TT*() + { + static_assert(is_lvalue_reference<_Tp>::value, ""); + static_assert(!is_const<_Tp>::value, ""); + return &_M_data; + } + + constexpr inline __autocvt_to_simd(_Tp dd) : _M_data(dd) {} + + template <typename _Abi> + operator simd<typename _TT::value_type, _Abi>() + { return {__private_init, _M_data}; } + + template <typename _Abi> + operator simd<typename _TT::value_type, _Abi>&() + { + return *reinterpret_cast<simd<typename _TT::value_type, _Abi>*>( + &_M_data); + } + + template <typename _Abi> + operator simd<typename _TT::value_type, _Abi>*() + { + return reinterpret_cast<simd<typename _TT::value_type, _Abi>*>( + &_M_data); + } + }; + +template <typename _Tp> + __autocvt_to_simd(_Tp &&) -> __autocvt_to_simd<_Tp>; + +template <typename _Tp> + struct __autocvt_to_simd<_Tp, true> + { + using _TT = __remove_cvref_t<_Tp>; + _Tp _M_data; + fixed_size_simd<_TT, 1> _M_fd; + + constexpr inline __autocvt_to_simd(_Tp dd) : _M_data(dd), _M_fd(_M_data) {} + + ~__autocvt_to_simd() + { _M_data = __data(_M_fd).first; } + + operator fixed_size_simd<_TT, 1>() + { return _M_fd; } + + operator fixed_size_simd<_TT, 1> &() + { + static_assert(is_lvalue_reference<_Tp>::value, ""); + static_assert(!is_const<_Tp>::value, ""); + return _M_fd; + } + + operator fixed_size_simd<_TT, 1> *() + { + static_assert(is_lvalue_reference<_Tp>::value, ""); 
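+ // Editorial note (not in the original commit): as in the conversion + // operators above, a mutable lvalue is required because the destructor + // writes _M_fd back into the referenced scalar.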
+ static_assert(!is_const<_Tp>::value, ""); + return &_M_fd; + } + }; + +// }}} + +struct _CommonImplFixedSize; +template <int _Np> struct _SimdImplFixedSize; +template <int _Np> struct _MaskImplFixedSize; +// simd_abi::_Fixed {{{ +template <int _Np> + struct simd_abi::_Fixed + { + template <typename _Tp> static constexpr size_t _S_size = _Np; + template <typename _Tp> static constexpr size_t _S_full_size = _Np; + // validity traits {{{ + struct _IsValidAbiTag : public __bool_constant<(_Np > 0)> {}; + + template <typename _Tp> + struct _IsValidSizeFor + : __bool_constant<(_Np <= simd_abi::max_fixed_size<_Tp>)> {}; + + template <typename _Tp> + struct _IsValid : conjunction<_IsValidAbiTag, __is_vectorizable<_Tp>, + _IsValidSizeFor<_Tp>> {}; + + template <typename _Tp> + static constexpr bool _S_is_valid_v = _IsValid<_Tp>::value; + + // }}} + // _S_masked {{{ + _GLIBCXX_SIMD_INTRINSIC static constexpr _SanitizedBitMask<_Np> + _S_masked(_BitMask<_Np> __x) + { return __x._M_sanitized(); } + + _GLIBCXX_SIMD_INTRINSIC static constexpr _SanitizedBitMask<_Np> + _S_masked(_SanitizedBitMask<_Np> __x) + { return __x; } + + // }}} + // _*Impl {{{ + using _CommonImpl = _CommonImplFixedSize; + using _SimdImpl = _SimdImplFixedSize<_Np>; + using _MaskImpl = _MaskImplFixedSize<_Np>; + + // }}} + // __traits {{{ + template <typename _Tp, bool = _S_is_valid_v<_Tp>> + struct __traits : _InvalidTraits {}; + + template <typename _Tp> + struct __traits<_Tp, true> + { + using _IsValid = true_type; + using _SimdImpl = _SimdImplFixedSize<_Np>; + using _MaskImpl = _MaskImplFixedSize<_Np>; + + // simd and simd_mask member types {{{ + using _SimdMember = __fixed_size_storage_t<_Tp, _Np>; + using _MaskMember = _SanitizedBitMask<_Np>; + + static constexpr size_t _S_simd_align + = std::__bit_ceil(_Np * sizeof(_Tp)); + + static constexpr size_t _S_mask_align = alignof(_MaskMember); + + // }}} + // _SimdBase / base class for simd, providing extra conversions {{{ + struct _SimdBase + { + // The following ensures that function arguments are passed via the stack. + // This is important for ABI compatibility across TU boundaries. + _SimdBase(const _SimdBase&) {} + _SimdBase() = default; + + explicit operator const _SimdMember &() const + { return static_cast<const simd<_Tp, _Fixed>*>(this)->_M_data; } + + explicit operator array<_Tp, _Np>() const + { + array<_Tp, _Np> __r; + // _SimdMember can be larger because of higher alignment + static_assert(sizeof(__r) <= sizeof(_SimdMember), ""); + __builtin_memcpy(__r.data(), &static_cast<const _SimdMember&>(*this), + sizeof(__r)); + return __r; + } + }; + + // }}} + // _MaskBase {{{ + // empty. The bitset interface suffices + struct _MaskBase {}; + + // }}} + // _SimdCastType {{{ + struct _SimdCastType + { + _SimdCastType(const array<_Tp, _Np>&); + _SimdCastType(const _SimdMember& dd) : _M_data(dd) {} + explicit operator const _SimdMember &() const { return _M_data; } + + private: + const _SimdMember& _M_data; + }; + + // }}} + // _MaskCastType {{{ + class _MaskCastType + { + _MaskCastType() = delete; + }; + // }}} + }; + // }}} + }; + +// }}} +// _CommonImplFixedSize {{{ +struct _CommonImplFixedSize +{ + // _S_store {{{ + template <typename _Tp, typename... 
_As> + _GLIBCXX_SIMD_INTRINSIC static void + _S_store(const _SimdTuple<_Tp, _As...>& __x, void* __addr) + { + constexpr size_t _Np = _SimdTuple<_Tp, _As...>::_S_size(); + __builtin_memcpy(__addr, &__x, _Np * sizeof(_Tp)); + } + + // }}} +}; + +// }}} +// _SimdImplFixedSize {{{1 +// fixed_size should not inherit from _SimdMathFallback in order for +// specializations in the used _SimdTuple Abis to get used +template <int _Np> + struct _SimdImplFixedSize + { + // member types {{{2 + using _MaskMember = _SanitizedBitMask<_Np>; + + template <typename _Tp> + using _SimdMember = __fixed_size_storage_t<_Tp, _Np>; + + template <typename _Tp> + static constexpr size_t _S_tuple_size = _SimdMember<_Tp>::_S_tuple_size; + + template <typename _Tp> + using _Simd = simd<_Tp, simd_abi::fixed_size<_Np>>; + + template <typename _Tp> + using _TypeTag = _Tp*; + + // broadcast {{{2 + template <typename _Tp> + static constexpr inline _SimdMember<_Tp> _S_broadcast(_Tp __x) noexcept + { + return _SimdMember<_Tp>::_S_generate([&](auto __meta) constexpr { + return __meta._S_broadcast(__x); + }); + } + + // _S_generator {{{2 + template <typename _Fp, typename _Tp> + static constexpr inline _SimdMember<_Tp> _S_generator(_Fp&& __gen, + _TypeTag<_Tp>) + { + return _SimdMember<_Tp>::_S_generate([&__gen](auto __meta) constexpr { + return __meta._S_generator( + [&](auto __i) constexpr { + return __i < _Np ? __gen(_SizeConstant<__meta._S_offset + __i>()) + : 0; + }, + _TypeTag<_Tp>()); + }); + } + + // _S_load {{{2 + template <typename _Tp, typename _Up> + static inline _SimdMember<_Tp> _S_load(const _Up* __mem, + _TypeTag<_Tp>) noexcept + { + return _SimdMember<_Tp>::_S_generate([&](auto __meta) { + return __meta._S_load(&__mem[__meta._S_offset], _TypeTag<_Tp>()); + }); + } + + // _S_masked_load {{{2 + template <typename _Tp, typename... _As, typename _Up> + static inline _SimdTuple<_Tp, _As...> + _S_masked_load(const _SimdTuple<_Tp, _As...>& __old, + const _MaskMember __bits, const _Up* __mem) noexcept + { + auto __merge = __old; + __for_each(__merge, [&](auto __meta, auto& __native) { + if (__meta._S_submask(__bits).any()) +#pragma GCC diagnostic push + // __mem + __meta._S_offset could be UB ([expr.add]/4.3), but it punts + // the responsibility for avoiding UB to the caller of the masked load + // via the mask. Consequently, the compiler may assume this branch is + // unreachable if the pointer arithmetic is UB. +#pragma GCC diagnostic ignored "-Warray-bounds" + __native + = __meta._S_masked_load(__native, __meta._S_make_mask(__bits), + __mem + __meta._S_offset); +#pragma GCC diagnostic pop + }); + return __merge; + } + + // _S_store {{{2 + template <typename _Tp, typename _Up> + static inline void _S_store(const _SimdMember<_Tp>& __v, _Up* __mem, + _TypeTag<_Tp>) noexcept + { + __for_each(__v, [&](auto __meta, auto __native) { + __meta._S_store(__native, &__mem[__meta._S_offset], _TypeTag<_Tp>()); + }); + } + + // _S_masked_store {{{2 + template <typename _Tp, typename... _As, typename _Up> + static inline void _S_masked_store(const _SimdTuple<_Tp, _As...>& __v, + _Up* __mem, + const _MaskMember __bits) noexcept + { + __for_each(__v, [&](auto __meta, auto __native) { + if (__meta._S_submask(__bits).any()) +#pragma GCC diagnostic push + // __mem + __meta._S_offset could be UB ([expr.add]/4.3), but it punts + // the responsibility for avoiding UB to the caller of the masked + // store via the mask. Consequently, the compiler may assume this + // branch is unreachable if the pointer arithmetic is UB. 
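+ // (Editorial example, not part of the original commit: with + // fixed_size_simd<float, 8> chunked as 4 + 4 and a store to a buffer of + // only 3 floats with just lanes 0-2 selected, __mem + 4 is never + // evaluated, but GCC may still warn, hence the suppression below.)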
+#pragma GCC diagnostic ignored "-Warray-bounds" + __meta._S_masked_store(__native, __mem + __meta._S_offset, + __meta._S_make_mask(__bits)); +#pragma GCC diagnostic pop + }); + } + + // negation {{{2 + template <typename _Tp, typename... _As> + static inline _MaskMember + _S_negate(const _SimdTuple<_Tp, _As...>& __x) noexcept + { + _MaskMember __bits = 0; + __for_each( + __x, [&__bits](auto __meta, auto __native) constexpr { + __bits + |= __meta._S_mask_to_shifted_ullong(__meta._S_negate(__native)); + }); + return __bits; + } + + // reductions {{{2 + template <typename _Tp, typename _BinaryOperation> + static constexpr inline _Tp _S_reduce(const _Simd<_Tp>& __x, + const _BinaryOperation& __binary_op) + { + using _Tup = _SimdMember<_Tp>; + const _Tup& __tup = __data(__x); + if constexpr (_Tup::_S_tuple_size == 1) + return _Tup::_FirstAbi::_SimdImpl::_S_reduce( + __tup.template _M_simd_at<0>(), __binary_op); + else if constexpr (_Tup::_S_tuple_size == 2 && _Tup::_S_size() > 2 + && _Tup::_SecondType::_S_size() == 1) + { + return __binary_op(simd<_Tp, simd_abi::scalar>( + reduce(__tup.template _M_simd_at<0>(), + __binary_op)), + __tup.template _M_simd_at<1>())[0]; + } + else if constexpr (_Tup::_S_tuple_size == 2 && _Tup::_S_size() > 4 + && _Tup::_SecondType::_S_size() == 2) + { + return __binary_op( + simd<_Tp, simd_abi::scalar>( + reduce(__tup.template _M_simd_at<0>(), __binary_op)), + simd<_Tp, simd_abi::scalar>( + reduce(__tup.template _M_simd_at<1>(), __binary_op)))[0]; + } + else + { + const auto& __x2 = __call_with_n_evaluations< + __div_roundup(_Tup::_S_tuple_size, 2)>( + [](auto __first_simd, auto... __remaining) { + if constexpr (sizeof...(__remaining) == 0) + return __first_simd; + else + { + using _Tup2 + = _SimdTuple<_Tp, + typename decltype(__first_simd)::abi_type, + typename decltype(__remaining)::abi_type...>; + return fixed_size_simd<_Tp, _Tup2::_S_size()>( + __private_init, + __make_simd_tuple(__first_simd, __remaining...)); + } + }, + [&](auto __i) { + auto __left = __tup.template _M_simd_at<2 * __i>(); + if constexpr (2 * __i + 1 == _Tup::_S_tuple_size) + return __left; + else + { + auto __right = __tup.template _M_simd_at<2 * __i + 1>(); + using _LT = decltype(__left); + using _RT = decltype(__right); + if constexpr (_LT::size() == _RT::size()) + return __binary_op(__left, __right); + else + { + _GLIBCXX_SIMD_USE_CONSTEXPR_API + typename _LT::mask_type __k( + __private_init, + [](auto __j) constexpr { return __j < _RT::size(); }); + _LT __ext_right = __left; + where(__k, __ext_right) + = __proposed::resizing_simd_cast<_LT>(__right); + where(__k, __left) = __binary_op(__left, __ext_right); + return __left; + } + } + }); + return reduce(__x2, __binary_op); + } + } + + // _S_min, _S_max {{{2 + template <typename _Tp, typename... _As> + static inline constexpr _SimdTuple<_Tp, _As...> + _S_min(const _SimdTuple<_Tp, _As...>& __a, + const _SimdTuple<_Tp, _As...>& __b) + { + return __a._M_apply_per_chunk( + [](auto __impl, auto __aa, auto __bb) constexpr { + return __impl._S_min(__aa, __bb); + }, + __b); + } + + template <typename _Tp, typename... _As> + static inline constexpr _SimdTuple<_Tp, _As...> + _S_max(const _SimdTuple<_Tp, _As...>& __a, + const _SimdTuple<_Tp, _As...>& __b) + { + return __a._M_apply_per_chunk( + [](auto __impl, auto __aa, auto __bb) constexpr { + return __impl._S_max(__aa, __bb); + }, + __b); + } + + // _S_complement {{{2 + template <typename _Tp, typename... 
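
For unequal chunk sizes the reduction above widens the smaller operand and updates only the lanes that actually carry a right-hand element; from user code all of this is just reduce(). A hedged sketch:

#include <experimental/simd>
#include <functional>
namespace stdx = std::experimental;

int sum13(const int* data)
{
  stdx::fixed_size_simd<int, 13> v;   // e.g. 4+4+4+1 native chunks internally
  v.copy_from(data, stdx::element_aligned);
  return stdx::reduce(v, std::plus<>());   // ends up in _S_reduce above
}
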
_As> + static inline constexpr _SimdTuple<_Tp, _As...> + _S_complement(const _SimdTuple<_Tp, _As...>& __x) noexcept + { + return __x._M_apply_per_chunk([](auto __impl, auto __xx) constexpr { + return __impl._S_complement(__xx); + }); + } + + // _S_unary_minus {{{2 + template <typename _Tp, typename... _As> + static inline constexpr _SimdTuple<_Tp, _As...> + _S_unary_minus(const _SimdTuple<_Tp, _As...>& __x) noexcept + { + return __x._M_apply_per_chunk([](auto __impl, auto __xx) constexpr { + return __impl._S_unary_minus(__xx); + }); + } + + // arithmetic operators {{{2 + +#define _GLIBCXX_SIMD_FIXED_OP(name_, op_) \ + template <typename _Tp, typename... _As> \ + static inline constexpr _SimdTuple<_Tp, _As...> name_( \ + const _SimdTuple<_Tp, _As...> __x, const _SimdTuple<_Tp, _As...> __y) \ + { \ + return __x._M_apply_per_chunk( \ + [](auto __impl, auto __xx, auto __yy) constexpr { \ + return __impl.name_(__xx, __yy); \ + }, \ + __y); \ + } + + _GLIBCXX_SIMD_FIXED_OP(_S_plus, +) + _GLIBCXX_SIMD_FIXED_OP(_S_minus, -) + _GLIBCXX_SIMD_FIXED_OP(_S_multiplies, *) + _GLIBCXX_SIMD_FIXED_OP(_S_divides, /) + _GLIBCXX_SIMD_FIXED_OP(_S_modulus, %) + _GLIBCXX_SIMD_FIXED_OP(_S_bit_and, &) + _GLIBCXX_SIMD_FIXED_OP(_S_bit_or, |) + _GLIBCXX_SIMD_FIXED_OP(_S_bit_xor, ^) + _GLIBCXX_SIMD_FIXED_OP(_S_bit_shift_left, <<) + _GLIBCXX_SIMD_FIXED_OP(_S_bit_shift_right, >>) +#undef _GLIBCXX_SIMD_FIXED_OP + + template <typename _Tp, typename... _As> + static inline constexpr _SimdTuple<_Tp, _As...> + _S_bit_shift_left(const _SimdTuple<_Tp, _As...>& __x, int __y) + { + return __x._M_apply_per_chunk([__y](auto __impl, auto __xx) constexpr { + return __impl._S_bit_shift_left(__xx, __y); + }); + } + + template <typename _Tp, typename... _As> + static inline constexpr _SimdTuple<_Tp, _As...> + _S_bit_shift_right(const _SimdTuple<_Tp, _As...>& __x, int __y) + { + return __x._M_apply_per_chunk([__y](auto __impl, auto __xx) constexpr { + return __impl._S_bit_shift_right(__xx, __y); + }); + } + + // math {{{2 +#define _GLIBCXX_SIMD_APPLY_ON_TUPLE(_RetTp, __name) \ + template <typename _Tp, typename... _As, typename... _More> \ + static inline __fixed_size_storage_t<_RetTp, _Np> \ + _S_##__name(const _SimdTuple<_Tp, _As...>& __x, \ + const _More&... __more) \ + { \ + if constexpr (sizeof...(_More) == 0) \ + { \ + if constexpr (is_same_v<_Tp, _RetTp>) \ + return __x._M_apply_per_chunk( \ + [](auto __impl, auto __xx) constexpr { \ + using _V = typename decltype(__impl)::simd_type; \ + return __data(__name(_V(__private_init, __xx))); \ + }); \ + else \ + return __optimize_simd_tuple( \ + __x.template _M_apply_r<_RetTp>([](auto __impl, auto __xx) { \ + return __impl._S_##__name(__xx); \ + })); \ + } \ + else if constexpr ( \ + is_same_v< \ + _Tp, \ + _RetTp> && (... && is_same_v<_SimdTuple<_Tp, _As...>, _More>) ) \ + return __x._M_apply_per_chunk( \ + [](auto __impl, auto __xx, auto... __pack) constexpr { \ + using _V = typename decltype(__impl)::simd_type; \ + return __data(__name(_V(__private_init, __xx), \ + _V(__private_init, __pack)...)); \ + }, \ + __more...); \ + else if constexpr (is_same_v<_Tp, _RetTp>) \ + return __x._M_apply_per_chunk( \ + [](auto __impl, auto __xx, auto... 
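
The macro instantiates the element-wise operator pair; shifting every element by the same int gets the two dedicated per-chunk overloads above. User-visibly (hedged sketch):

#include <experimental/simd>
namespace stdx = std::experimental;

void shifts(stdx::fixed_size_simd<unsigned, 5>& v,
            const stdx::fixed_size_simd<unsigned, 5>& w)
{
  v = v << 3;   // uniform shift -> _S_bit_shift_left(tuple, int)
  v = v << w;   // element-wise  -> the macro-generated overload
}
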
__pack) constexpr { \ + using _V = typename decltype(__impl)::simd_type; \ + return __data(__name(_V(__private_init, __xx), \ + __autocvt_to_simd(__pack)...)); \ + }, \ + __more...); \ + else \ + __assert_unreachable<_Tp>(); \ + } + + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, acos) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, asin) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, atan) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, atan2) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, cos) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, sin) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, tan) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, acosh) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, asinh) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, atanh) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, cosh) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, sinh) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, tanh) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, exp) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, exp2) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, expm1) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(int, ilogb) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, log) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, log10) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, log1p) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, log2) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, logb) + // modf implemented in simd_math.h + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, + scalbn) // double scalbn(double x, int exp); + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, scalbln) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, cbrt) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, abs) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, fabs) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, pow) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, sqrt) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, erf) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, erfc) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, lgamma) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, tgamma) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, trunc) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, ceil) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, floor) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, nearbyint) + + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, rint) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(long, lrint) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(long long, llrint) + + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, round) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(long, lround) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(long long, llround) + + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, ldexp) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, fmod) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, remainder) + // copysign in simd_math.h + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, nextafter) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, fdim) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, fmax) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, fmin) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, fma) + _GLIBCXX_SIMD_APPLY_ON_TUPLE(int, fpclassify) +#undef _GLIBCXX_SIMD_APPLY_ON_TUPLE + + template <typename _Tp, typename... _Abis> + static _SimdTuple<_Tp, _Abis...> _S_remquo( + const _SimdTuple<_Tp, _Abis...>& __x, + const _SimdTuple<_Tp, _Abis...>& __y, + __fixed_size_storage_t<int, _SimdTuple<_Tp, _Abis...>::_S_size()>* __z) + { + return __x._M_apply_per_chunk( + [](auto __impl, const auto __xx, const auto __yy, auto& __zz) { + return __impl._S_remquo(__xx, __yy, &__zz); + }, + __y, *__z); + } + + template <typename _Tp, typename... _As> + static inline _SimdTuple<_Tp, _As...> + _S_frexp(const _SimdTuple<_Tp, _As...>& __x, + __fixed_size_storage_t<int, _Np>& __exp) noexcept + { + return __x._M_apply_per_chunk( + [](auto __impl, const auto& __a, auto& __b) { + return __data( + frexp(typename decltype(__impl)::simd_type(__private_init, __a), + __autocvt_to_simd(__b))); + }, + __exp); + } + +#define _GLIBCXX_SIMD_TEST_ON_TUPLE_(name_) \ + template <typename _Tp, typename... 
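
Each name in this list expands to a per-chunk dispatch, so a single call on a fixed_size simd runs the vectorized <cmath> implementation chunk by chunk; functions whose scalar return type differs from the argument type, such as ilogb, yield an int simd of the same width. A hedged sketch:

#include <experimental/simd>
namespace stdx = std::experimental;

void math_demo(const stdx::fixed_size_simd<float, 6>& x)
{
  auto s = stdx::sin(x);     // fixed_size_simd<float, 6>
  auto e = stdx::ilogb(x);   // fixed_size_simd<int, 6>
  (void) s; (void) e;
}
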
_As> \ + static inline _MaskMember \ + _S_##name_(const _SimdTuple<_Tp, _As...>& __x) noexcept \ + { \ + return _M_test([](auto __impl, \ + auto __xx) { return __impl._S_##name_(__xx); }, \ + __x); \ + } + + _GLIBCXX_SIMD_TEST_ON_TUPLE_(isinf) + _GLIBCXX_SIMD_TEST_ON_TUPLE_(isfinite) + _GLIBCXX_SIMD_TEST_ON_TUPLE_(isnan) + _GLIBCXX_SIMD_TEST_ON_TUPLE_(isnormal) + _GLIBCXX_SIMD_TEST_ON_TUPLE_(signbit) +#undef _GLIBCXX_SIMD_TEST_ON_TUPLE_ + + // _S_increment & _S_decrement{{{2 + template <typename... _Ts> + _GLIBCXX_SIMD_INTRINSIC static constexpr void + _S_increment(_SimdTuple<_Ts...>& __x) + { + __for_each( + __x, [](auto __meta, auto& native) constexpr { + __meta._S_increment(native); + }); + } + + template <typename... _Ts> + _GLIBCXX_SIMD_INTRINSIC static constexpr void + _S_decrement(_SimdTuple<_Ts...>& __x) + { + __for_each( + __x, [](auto __meta, auto& native) constexpr { + __meta._S_decrement(native); + }); + } + + // compares {{{2 +#define _GLIBCXX_SIMD_CMP_OPERATIONS(__cmp) \ + template <typename _Tp, typename... _As> \ + _GLIBCXX_SIMD_INTRINSIC constexpr static _MaskMember \ + __cmp(const _SimdTuple<_Tp, _As...>& __x, \ + const _SimdTuple<_Tp, _As...>& __y) \ + { \ + return _M_test( \ + [](auto __impl, auto __xx, auto __yy) constexpr { \ + return __impl.__cmp(__xx, __yy); \ + }, \ + __x, __y); \ + } + + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_equal_to) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_not_equal_to) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_less) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_less_equal) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_isless) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_islessequal) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_isgreater) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_isgreaterequal) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_islessgreater) + _GLIBCXX_SIMD_CMP_OPERATIONS(_S_isunordered) +#undef _GLIBCXX_SIMD_CMP_OPERATIONS + + // smart_reference access {{{2 + template <typename _Tp, typename... _As, typename _Up> + _GLIBCXX_SIMD_INTRINSIC static void _S_set(_SimdTuple<_Tp, _As...>& __v, + int __i, _Up&& __x) noexcept + { __v._M_set(__i, static_cast<_Up&&>(__x)); } + + // _S_masked_assign {{{2 + template <typename _Tp, typename... _As> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_assign(const _MaskMember __bits, _SimdTuple<_Tp, _As...>& __lhs, + const __type_identity_t<_SimdTuple<_Tp, _As...>>& __rhs) + { + __for_each( + __lhs, __rhs, + [&](auto __meta, auto& __native_lhs, auto __native_rhs) constexpr { + __meta._S_masked_assign(__meta._S_make_mask(__bits), __native_lhs, + __native_rhs); + }); + } + + // Optimization for the case where the RHS is a scalar. No need to broadcast + // the scalar to a simd first. + template <typename _Tp, typename... _As> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_assign(const _MaskMember __bits, _SimdTuple<_Tp, _As...>& __lhs, + const __type_identity_t<_Tp> __rhs) + { + __for_each( + __lhs, [&](auto __meta, auto& __native_lhs) constexpr { + __meta._S_masked_assign(__meta._S_make_mask(__bits), __native_lhs, + __rhs); + }); + } + + // _S_masked_cassign {{{2 + template <typename _Op, typename _Tp, typename... _As> + static inline void _S_masked_cassign(const _MaskMember __bits, + _SimdTuple<_Tp, _As...>& __lhs, + const _SimdTuple<_Tp, _As...>& __rhs, + _Op __op) + { + __for_each( + __lhs, __rhs, + [&](auto __meta, auto& __native_lhs, auto __native_rhs) constexpr { + __meta.template _S_masked_cassign(__meta._S_make_mask(__bits), + __native_lhs, __native_rhs, __op); + }); + } + + // Optimization for the case where the RHS is a scalar. 
No need to broadcast + // the scalar to a simd first. + template <typename _Op, typename _Tp, typename... _As> + static inline void _S_masked_cassign(const _MaskMember __bits, + _SimdTuple<_Tp, _As...>& __lhs, + const _Tp& __rhs, _Op __op) + { + __for_each( + __lhs, [&](auto __meta, auto& __native_lhs) constexpr { + __meta.template _S_masked_cassign(__meta._S_make_mask(__bits), + __native_lhs, __rhs, __op); + }); + } + + // _S_masked_unary {{{2 + template <template <typename> class _Op, typename _Tp, typename... _As> + static inline _SimdTuple<_Tp, _As...> + _S_masked_unary(const _MaskMember __bits, + const _SimdTuple<_Tp, _As...> __v) // TODO: const-ref __v? + { + return __v._M_apply_wrapped([&__bits](auto __meta, + auto __native) constexpr { + return __meta.template _S_masked_unary<_Op>(__meta._S_make_mask( + __bits), + __native); + }); + } + + // }}}2 + }; + +// _MaskImplFixedSize {{{1 +template <int _Np> + struct _MaskImplFixedSize + { + static_assert( + sizeof(_ULLong) * __CHAR_BIT__ >= _Np, + "The fixed_size implementation relies on one _ULLong being able to store " + "all boolean elements."); // required in load & store + + // member types {{{ + using _Abi = simd_abi::fixed_size<_Np>; + + using _MaskMember = _SanitizedBitMask<_Np>; + + template <typename _Tp> + using _FirstAbi = typename __fixed_size_storage_t<_Tp, _Np>::_FirstAbi; + + template <typename _Tp> + using _TypeTag = _Tp*; + + // }}} + // _S_broadcast {{{ + template <typename> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember + _S_broadcast(bool __x) + { return __x ? ~_MaskMember() : _MaskMember(); } + + // }}} + // _S_load {{{ + template <typename> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember + _S_load(const bool* __mem) + { + using _Ip = __int_for_sizeof_t<bool>; + // the following load uses element_aligned and relies on __mem already + // carrying alignment information from when this load function was + // called. + const simd<_Ip, _Abi> __bools(reinterpret_cast<const __may_alias<_Ip>*>( + __mem), + element_aligned); + return __data(__bools != 0); + } + + // }}} + // _S_to_bits {{{ + template <bool _Sanitized> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SanitizedBitMask<_Np> + _S_to_bits(_BitMask<_Np, _Sanitized> __x) + { + if constexpr (_Sanitized) + return __x; + else + return __x._M_sanitized(); + } + + // }}} + // _S_convert {{{ + template <typename _Tp, typename _Up, typename _UAbi> + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember + _S_convert(simd_mask<_Up, _UAbi> __x) + { + return _UAbi::_MaskImpl::_S_to_bits(__data(__x)) + .template _M_extract<0, _Np>(); + } + + // }}} + // _S_from_bitmask {{{2 + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _MaskMember + _S_from_bitmask(_MaskMember __bits, _TypeTag<_Tp>) noexcept + { return __bits; } + + // _S_load {{{2 + static inline _MaskMember _S_load(const bool* __mem) noexcept + { + // TODO: _UChar is not necessarily the best type to use here. For smaller + // _Np _UShort, _UInt, _ULLong, float, and double can be more efficient. 
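
Both directions of the bool-array conversion surface as simd_mask::copy_from / copy_to. A round trip (hedged sketch):

#include <experimental/simd>
namespace stdx = std::experimental;

void mask_roundtrip(const bool* in, bool* out)
{
  stdx::fixed_size_simd_mask<float, 8> k;
  k.copy_from(in, stdx::element_aligned);   // bools -> bitmask
  k = !k;
  k.copy_to(out, stdx::element_aligned);    // bitmask -> bools
}
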
+ _ULLong __r = 0; + using _Vs = __fixed_size_storage_t<_UChar, _Np>; + __for_each(_Vs{}, [&](auto __meta, auto) { + __r |= __meta._S_mask_to_shifted_ullong( + __meta._S_mask_impl._S_load(&__mem[__meta._S_offset], + _SizeConstant<__meta._S_size()>())); + }); + return __r; + } + + // _S_masked_load {{{2 + static inline _MaskMember _S_masked_load(_MaskMember __merge, + _MaskMember __mask, + const bool* __mem) noexcept + { + _BitOps::_S_bit_iteration(__mask.to_ullong(), [&](auto __i) { + __merge.set(__i, __mem[__i]); + }); + return __merge; + } + + // _S_store {{{2 + static inline void _S_store(const _MaskMember __bitmask, + bool* __mem) noexcept + { + if constexpr (_Np == 1) + __mem[0] = __bitmask[0]; + else + _FirstAbi<_UChar>::_CommonImpl::_S_store_bool_array(__bitmask, __mem); + } + + // _S_masked_store {{{2 + static inline void _S_masked_store(const _MaskMember __v, bool* __mem, + const _MaskMember __k) noexcept + { + _BitOps::_S_bit_iteration(__k, [&](auto __i) { __mem[__i] = __v[__i]; }); + } + + // logical and bitwise operators {{{2 + _GLIBCXX_SIMD_INTRINSIC static _MaskMember + _S_logical_and(const _MaskMember& __x, const _MaskMember& __y) noexcept + { return __x & __y; } + + _GLIBCXX_SIMD_INTRINSIC static _MaskMember + _S_logical_or(const _MaskMember& __x, const _MaskMember& __y) noexcept + { return __x | __y; } + + _GLIBCXX_SIMD_INTRINSIC static constexpr _MaskMember + _S_bit_not(const _MaskMember& __x) noexcept + { return ~__x; } + + _GLIBCXX_SIMD_INTRINSIC static _MaskMember + _S_bit_and(const _MaskMember& __x, const _MaskMember& __y) noexcept + { return __x & __y; } + + _GLIBCXX_SIMD_INTRINSIC static _MaskMember + _S_bit_or(const _MaskMember& __x, const _MaskMember& __y) noexcept + { return __x | __y; } + + _GLIBCXX_SIMD_INTRINSIC static _MaskMember + _S_bit_xor(const _MaskMember& __x, const _MaskMember& __y) noexcept + { return __x ^ __y; } + + // smart_reference access {{{2 + _GLIBCXX_SIMD_INTRINSIC static void _S_set(_MaskMember& __k, int __i, + bool __x) noexcept + { __k.set(__i, __x); } + + // _S_masked_assign {{{2 + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_assign(const _MaskMember __k, _MaskMember& __lhs, + const _MaskMember __rhs) + { __lhs = (__lhs & ~__k) | (__rhs & __k); } + + // Optimization for the case where the RHS is a scalar. 
+ _GLIBCXX_SIMD_INTRINSIC static void _S_masked_assign(const _MaskMember __k, + _MaskMember& __lhs, + const bool __rhs) + { + if (__rhs) + __lhs |= __k; + else + __lhs &= ~__k; + } + + // }}}2 + // _S_all_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool _S_all_of(simd_mask<_Tp, _Abi> __k) + { return __data(__k).all(); } + + // }}} + // _S_any_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool _S_any_of(simd_mask<_Tp, _Abi> __k) + { return __data(__k).any(); } + + // }}} + // _S_none_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool _S_none_of(simd_mask<_Tp, _Abi> __k) + { return __data(__k).none(); } + + // }}} + // _S_some_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool + _S_some_of([[maybe_unused]] simd_mask<_Tp, _Abi> __k) + { + if constexpr (_Np == 1) + return false; + else + return __data(__k).any() && !__data(__k).all(); + } + + // }}} + // _S_popcount {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int _S_popcount(simd_mask<_Tp, _Abi> __k) + { return __data(__k).count(); } + + // }}} + // _S_find_first_set {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int + _S_find_first_set(simd_mask<_Tp, _Abi> __k) + { return std::__countr_zero(__data(__k).to_ullong()); } + + // }}} + // _S_find_last_set {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int + _S_find_last_set(simd_mask<_Tp, _Abi> __k) + { return std::__bit_width(__data(__k).to_ullong()) - 1; } + + // }}} + }; +// }}}1 + +_GLIBCXX_SIMD_END_NAMESPACE +#endif // __cplusplus >= 201703L +#endif // _GLIBCXX_EXPERIMENTAL_SIMD_FIXED_SIZE_H_ + +// vim: foldmethod=marker sw=2 noet ts=8 sts=2 tw=80 diff --git a/libstdc++-v3/include/experimental/bits/simd_math.h b/libstdc++-v3/include/experimental/bits/simd_math.h new file mode 100644 index 00000000000..bbaa899faa2 --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd_math.h @@ -0,0 +1,1500 @@ +// Math overloads for simd -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. + +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. 
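
The mask reductions at the end of _MaskImplFixedSize above map directly onto the underlying bitset operations; at the user level (hedged sketch):

#include <experimental/simd>
namespace stdx = std::experimental;

void mask_queries(const stdx::fixed_size_simd_mask<int, 8>& k)
{
  bool a = stdx::all_of(k);          // __data(k).all()
  int  n = stdx::popcount(k);        // __data(k).count()
  int  i = stdx::find_first_set(k);  // countr_zero; requires any_of(k)
  (void) a; (void) n; (void) i;
}
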
+ +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_MATH_H_ +#define _GLIBCXX_EXPERIMENTAL_SIMD_MATH_H_ + +#if __cplusplus >= 201703L + +#include <utility> +#include <iomanip> + +_GLIBCXX_SIMD_BEGIN_NAMESPACE +template <typename _Tp, typename _V> + using _Samesize = fixed_size_simd<_Tp, _V::size()>; + +// _Math_return_type {{{ +template <typename _DoubleR, typename _Tp, typename _Abi> + struct _Math_return_type; + +template <typename _DoubleR, typename _Tp, typename _Abi> + using _Math_return_type_t = + typename _Math_return_type<_DoubleR, _Tp, _Abi>::type; + +template <typename _Tp, typename _Abi> + struct _Math_return_type<double, _Tp, _Abi> + { using type = simd<_Tp, _Abi>; }; + +template <typename _Tp, typename _Abi> + struct _Math_return_type<bool, _Tp, _Abi> + { using type = simd_mask<_Tp, _Abi>; }; + +template <typename _DoubleR, typename _Tp, typename _Abi> + struct _Math_return_type + { using type = fixed_size_simd<_DoubleR, simd_size_v<_Tp, _Abi>>; }; + +//}}} +// _GLIBCXX_SIMD_MATH_CALL_ {{{ +#define _GLIBCXX_SIMD_MATH_CALL_(__name) \ +template <typename _Tp, typename _Abi, typename..., \ + typename _R = _Math_return_type_t< \ + decltype(std::__name(declval<double>())), _Tp, _Abi>> \ + enable_if_t<is_floating_point_v<_Tp>, _R> \ + __name(simd<_Tp, _Abi> __x) \ + { return {__private_init, _Abi::_SimdImpl::_S_##__name(__data(__x))}; } + +// }}} +//_Extra_argument_type{{{ +template <typename _Up, typename _Tp, typename _Abi> + struct _Extra_argument_type; + +template <typename _Tp, typename _Abi> + struct _Extra_argument_type<_Tp*, _Tp, _Abi> + { + using type = simd<_Tp, _Abi>*; + static constexpr double* declval(); + static constexpr bool __needs_temporary_scalar = true; + + _GLIBCXX_SIMD_INTRINSIC static constexpr auto _S_data(type __x) + { return &__data(*__x); } + }; + +template <typename _Up, typename _Tp, typename _Abi> + struct _Extra_argument_type<_Up*, _Tp, _Abi> + { + static_assert(is_integral_v<_Up>); + using type = fixed_size_simd<_Up, simd_size_v<_Tp, _Abi>>*; + static constexpr _Up* declval(); + static constexpr bool __needs_temporary_scalar = true; + + _GLIBCXX_SIMD_INTRINSIC static constexpr auto _S_data(type __x) + { return &__data(*__x); } + }; + +template <typename _Tp, typename _Abi> + struct _Extra_argument_type<_Tp, _Tp, _Abi> + { + using type = simd<_Tp, _Abi>; + static constexpr double declval(); + static constexpr bool __needs_temporary_scalar = false; + + _GLIBCXX_SIMD_INTRINSIC static constexpr decltype(auto) + _S_data(const type& __x) + { return __data(__x); } + }; + +template <typename _Up, typename _Tp, typename _Abi> + struct _Extra_argument_type + { + static_assert(is_integral_v<_Up>); + using type = fixed_size_simd<_Up, simd_size_v<_Tp, _Abi>>; + static constexpr _Up declval(); + static constexpr bool __needs_temporary_scalar = false; + + _GLIBCXX_SIMD_INTRINSIC static constexpr decltype(auto) + _S_data(const type& __x) + { return __data(__x); } + }; + +//}}} +// _GLIBCXX_SIMD_MATH_CALL2_ {{{ +#define _GLIBCXX_SIMD_MATH_CALL2_(__name, arg2_) \ +template < \ + typename _Tp, typename _Abi, typename..., \ + typename _Arg2 = _Extra_argument_type<arg2_, _Tp, _Abi>, \ + typename _R = _Math_return_type_t< \ + decltype(std::__name(declval<double>(), _Arg2::declval())), _Tp, _Abi>> \ + enable_if_t<is_floating_point_v<_Tp>, _R> \ + __name(const simd<_Tp, _Abi>& __x, const typename _Arg2::type& __y) \ + { \ + return {__private_init, \ + _Abi::_SimdImpl::_S_##__name(__data(__x), _Arg2::_S_data(__y))}; \ + } \ +template <typename _Up, typename _Tp, typename _Abi> \ + 
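
_Math_return_type encodes the TS return-type rule: a scalar double return maps to simd<T, Abi>, bool maps to simd_mask<T, Abi>, and everything else becomes a fixed_size simd of the scalar return type. A hedged illustration:

#include <experimental/simd>
#include <type_traits>
namespace stdx = std::experimental;

using V = stdx::native_simd<float>;
static_assert(std::is_same_v<decltype(stdx::sqrt(V())), V>);
static_assert(std::is_same_v<decltype(stdx::isnan(V())),
                             stdx::native_simd_mask<float>>);
static_assert(std::is_same_v<decltype(stdx::ilogb(V())),
                             stdx::fixed_size_simd<int, V::size()>>);
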
_GLIBCXX_SIMD_INTRINSIC _Math_return_type_t< \ + decltype(std::__name( \ + declval<double>(), \ + declval<enable_if_t< \ + conjunction_v< \ + is_same<arg2_, _Tp>, \ + negation<is_same<__remove_cvref_t<_Up>, simd<_Tp, _Abi>>>, \ + is_convertible<_Up, simd<_Tp, _Abi>>, is_floating_point<_Tp>>, \ + double>>())), \ + _Tp, _Abi> \ + __name(_Up&& __xx, const simd<_Tp, _Abi>& __yy) \ + { return __name(simd<_Tp, _Abi>(static_cast<_Up&&>(__xx)), __yy); } + +// }}} +// _GLIBCXX_SIMD_MATH_CALL3_ {{{ +#define _GLIBCXX_SIMD_MATH_CALL3_(__name, arg2_, arg3_) \ +template <typename _Tp, typename _Abi, typename..., \ + typename _Arg2 = _Extra_argument_type<arg2_, _Tp, _Abi>, \ + typename _Arg3 = _Extra_argument_type<arg3_, _Tp, _Abi>, \ + typename _R = _Math_return_type_t< \ + decltype(std::__name(declval<double>(), _Arg2::declval(), \ + _Arg3::declval())), \ + _Tp, _Abi>> \ + enable_if_t<is_floating_point_v<_Tp>, _R> \ + __name(const simd<_Tp, _Abi>& __x, const typename _Arg2::type& __y, \ + const typename _Arg3::type& __z) \ + { \ + return {__private_init, \ + _Abi::_SimdImpl::_S_##__name(__data(__x), _Arg2::_S_data(__y), \ + _Arg3::_S_data(__z))}; \ + } \ +template < \ + typename _T0, typename _T1, typename _T2, typename..., \ + typename _U0 = __remove_cvref_t<_T0>, \ + typename _U1 = __remove_cvref_t<_T1>, \ + typename _U2 = __remove_cvref_t<_T2>, \ + typename _Simd = conditional_t<is_simd_v<_U1>, _U1, _U2>, \ + typename = enable_if_t<conjunction_v< \ + is_simd<_Simd>, is_convertible<_T0&&, _Simd>, \ + is_convertible<_T1&&, _Simd>, is_convertible<_T2&&, _Simd>, \ + negation<conjunction< \ + is_simd<_U0>, is_floating_point<__value_type_or_identity_t<_U0>>>>>>> \ + _GLIBCXX_SIMD_INTRINSIC decltype(__name(declval<const _Simd&>(), \ + declval<const _Simd&>(), \ + declval<const _Simd&>())) \ + __name(_T0&& __xx, _T1&& __yy, _T2&& __zz) \ + { \ + return __name(_Simd(static_cast<_T0&&>(__xx)), \ + _Simd(static_cast<_T1&&>(__yy)), \ + _Simd(static_cast<_T2&&>(__zz))); \ + } + +// }}} +// __cosSeries {{{ +template <typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE static simd<float, _Abi> + __cosSeries(const simd<float, _Abi>& __x) + { + const simd<float, _Abi> __x2 = __x * __x; + simd<float, _Abi> __y; + __y = 0x1.ap-16f; // 1/8! + __y = __y * __x2 - 0x1.6c1p-10f; // -1/6! + __y = __y * __x2 + 0x1.555556p-5f; // 1/4! + return __y * (__x2 * __x2) - .5f * __x2 + 1.f; + } + +template <typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE static simd<double, _Abi> + __cosSeries(const simd<double, _Abi>& __x) + { + const simd<double, _Abi> __x2 = __x * __x; + simd<double, _Abi> __y; + __y = 0x1.AC00000000000p-45; // 1/16! + __y = __y * __x2 - 0x1.9394000000000p-37; // -1/14! + __y = __y * __x2 + 0x1.1EED8C0000000p-29; // 1/12! + __y = __y * __x2 - 0x1.27E4FB7400000p-22; // -1/10! + __y = __y * __x2 + 0x1.A01A01A018000p-16; // 1/8! + __y = __y * __x2 - 0x1.6C16C16C16C00p-10; // -1/6! + __y = __y * __x2 + 0x1.5555555555554p-5; // 1/4! + return (__y * __x2 - .5f) * __x2 + 1.f; + } + +// }}} +// __sinSeries {{{ +template <typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE static simd<float, _Abi> + __sinSeries(const simd<float, _Abi>& __x) + { + const simd<float, _Abi> __x2 = __x * __x; + simd<float, _Abi> __y; + __y = -0x1.9CC000p-13f; // -1/7! + __y = __y * __x2 + 0x1.111100p-7f; // 1/5! + __y = __y * __x2 - 0x1.555556p-3f; // -1/3! 
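
These series helpers are plain Horner evaluations of truncated Taylor expansions with slightly retuned hex-float coefficients. The float variant, written out as a scalar function for clarity (same coefficients as __sinSeries below; valid for |x| <= pi/4):

float sin_series_scalar(float x)
{
  const float x2 = x * x;
  float y = -0x1.9CC000p-13f;    // ~ -1/7!
  y = y * x2 + 0x1.111100p-7f;   // ~  1/5!
  y = y * x2 - 0x1.555556p-3f;   // ~ -1/3!
  return y * (x2 * x) + x;       // x - x^3/3! + x^5/5! - x^7/7!
}
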
+    return __y * (__x2 * __x) + __x;
+  }
+
+template <typename _Abi>
+  _GLIBCXX_SIMD_ALWAYS_INLINE static simd<double, _Abi>
+  __sinSeries(const simd<double, _Abi>& __x)
+  {
+    // __x = [0, 0.7854 = pi/4]
+    // __x² = [0, 0.6169 = pi²/8]
+    const simd<double, _Abi> __x2 = __x * __x;
+    simd<double, _Abi> __y;
+    __y = -0x1.ACF0000000000p-41;             // -1/15!
+    __y = __y * __x2 + 0x1.6124400000000p-33; //  1/13!
+    __y = __y * __x2 - 0x1.AE64567000000p-26; // -1/11!
+    __y = __y * __x2 + 0x1.71DE3A5540000p-19; //  1/9!
+    __y = __y * __x2 - 0x1.A01A01A01A000p-13; // -1/7!
+    __y = __y * __x2 + 0x1.1111111111110p-7;  //  1/5!
+    __y = __y * __x2 - 0x1.5555555555555p-3;  // -1/3!
+    return __y * (__x2 * __x) + __x;
+  }
+
+// }}}
+// __zero_low_bits {{{
+template <int _Bits, typename _Tp, typename _Abi>
+  _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi>
+  __zero_low_bits(simd<_Tp, _Abi> __x)
+  {
+    const simd<_Tp, _Abi> __bitmask
+      = __bit_cast<_Tp>(~make_unsigned_t<__int_for_sizeof_t<_Tp>>() << _Bits);
+    return {__private_init,
+	    _Abi::_SimdImpl::_S_bit_and(__data(__x), __data(__bitmask))};
+  }
+
+// }}}
+// __fold_input {{{
+
+/**@internal
+ * Fold @p x into [-¼π, ¼π] and remember the quadrant it came from:
+ * quadrant 0: [-¼π, ¼π]
+ * quadrant 1: [ ¼π, ¾π]
+ * quadrant 2: [ ¾π, 1¼π]
+ * quadrant 3: [1¼π, 1¾π]
+ *
+ * The algorithm determines `y` as the multiple for which `x - y * ¼π` falls
+ * into [-¼π, ¼π]. Using a bitmask, `y` is reduced to `quadrant`. `y` can be
+ * calculated as
+ * ```
+ * y = trunc(x / ¼π);
+ * y += fmod(y, 2);
+ * ```
+ * This can be simplified by moving the (implicit) division by 2 into the
+ * truncation expression. The `+= fmod` effect can then be achieved by using
+ * rounding instead of truncation: `y = round(x / ½π) * 2`. If precision
+ * allows, `2/π * x` is better (faster).
+ */ +template <typename _Tp, typename _Abi> + struct _Folded + { + simd<_Tp, _Abi> _M_x; + rebind_simd_t<int, simd<_Tp, _Abi>> _M_quadrant; + }; + +namespace __math_float { +inline constexpr float __pi_over_4 = 0x1.921FB6p-1f; // π/4 +inline constexpr float __2_over_pi = 0x1.45F306p-1f; // 2/π +inline constexpr float __pi_2_5bits0 + = 0x1.921fc0p0f; // π/2, 5 0-bits (least significant) +inline constexpr float __pi_2_5bits0_rem + = -0x1.5777a6p-21f; // π/2 - __pi_2_5bits0 +} // namespace __math_float +namespace __math_double { +inline constexpr double __pi_over_4 = 0x1.921fb54442d18p-1; // π/4 +inline constexpr double __2_over_pi = 0x1.45F306DC9C883p-1; // 2/π +inline constexpr double __pi_2 = 0x1.921fb54442d18p0; // π/2 +} // namespace __math_double + +template <typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _Folded<float, _Abi> + __fold_input(const simd<float, _Abi>& __x) + { + using _V = simd<float, _Abi>; + using _IV = rebind_simd_t<int, _V>; + using namespace __math_float; + _Folded<float, _Abi> __r; + __r._M_x = abs(__x); +#if 0 + // zero most mantissa bits: + constexpr float __1_over_pi = 0x1.45F306p-2f; // 1/π + const auto __y = (__r._M_x * __1_over_pi + 0x1.8p23f) - 0x1.8p23f; + // split π into 4 parts, the first three with 13 trailing zeros (to make the + // following multiplications precise): + constexpr float __pi0 = 0x1.920000p1f; + constexpr float __pi1 = 0x1.fb4000p-11f; + constexpr float __pi2 = 0x1.444000p-23f; + constexpr float __pi3 = 0x1.68c234p-38f; + __r._M_x - __y*__pi0 - __y*__pi1 - __y*__pi2 - __y*__pi3 +#else + if (_GLIBCXX_SIMD_IS_UNLIKELY(all_of(__r._M_x < __pi_over_4))) + __r._M_quadrant = 0; + else if (_GLIBCXX_SIMD_IS_LIKELY(all_of(__r._M_x < 6 * __pi_over_4))) + { + const _V __y = nearbyint(__r._M_x * __2_over_pi); + __r._M_quadrant = static_simd_cast<_IV>(__y) & 3; // __y mod 4 + __r._M_x -= __y * __pi_2_5bits0; + __r._M_x -= __y * __pi_2_5bits0_rem; + } + else + { + using __math_double::__2_over_pi; + using __math_double::__pi_2; + using _VD = rebind_simd_t<double, _V>; + _VD __xd = static_simd_cast<_VD>(__r._M_x); + _VD __y = nearbyint(__xd * __2_over_pi); + __r._M_quadrant = static_simd_cast<_IV>(__y) & 3; // = __y mod 4 + __r._M_x = static_simd_cast<_V>(__xd - __y * __pi_2); + } +#endif + return __r; + } + +template <typename _Abi> + _GLIBCXX_SIMD_ALWAYS_INLINE _Folded<double, _Abi> + __fold_input(const simd<double, _Abi>& __x) + { + using _V = simd<double, _Abi>; + using _IV = rebind_simd_t<int, _V>; + using namespace __math_double; + + _Folded<double, _Abi> __r; + __r._M_x = abs(__x); + if (_GLIBCXX_SIMD_IS_UNLIKELY(all_of(__r._M_x < __pi_over_4))) + { + __r._M_quadrant = 0; + return __r; + } + const _V __y = nearbyint(__r._M_x / (2 * __pi_over_4)); + __r._M_quadrant = static_simd_cast<_IV>(__y) & 3; + + if (_GLIBCXX_SIMD_IS_LIKELY(all_of(__r._M_x < 1025 * __pi_over_4))) + { + // x - y * pi/2, y uses no more than 11 mantissa bits + __r._M_x -= __y * 0x1.921FB54443000p0; + __r._M_x -= __y * -0x1.73DCB3B39A000p-43; + __r._M_x -= __y * 0x1.45C06E0E68948p-86; + } + else if (_GLIBCXX_SIMD_IS_LIKELY(all_of(__y <= 0x1.0p30))) + { + // x - y * pi/2, y uses no more than 29 mantissa bits + __r._M_x -= __y * 0x1.921FB40000000p0; + __r._M_x -= __y * 0x1.4442D00000000p-24; + __r._M_x -= __y * 0x1.8469898CC5170p-48; + } + else + { + // x - y * pi/2, y may require all mantissa bits + const _V __y_hi = __zero_low_bits<26>(__y); + const _V __y_lo = __y - __y_hi; + const auto __pi_2_1 = 0x1.921FB50000000p0; + const auto __pi_2_2 = 0x1.110B460000000p-26; + const auto 
__pi_2_3 = 0x1.1A62630000000p-54; + const auto __pi_2_4 = 0x1.8A2E03707344Ap-81; + __r._M_x = __r._M_x - __y_hi * __pi_2_1 + - max(__y_hi * __pi_2_2, __y_lo * __pi_2_1) + - min(__y_hi * __pi_2_2, __y_lo * __pi_2_1) + - max(__y_hi * __pi_2_3, __y_lo * __pi_2_2) + - min(__y_hi * __pi_2_3, __y_lo * __pi_2_2) + - max(__y * __pi_2_4, __y_lo * __pi_2_3) + - min(__y * __pi_2_4, __y_lo * __pi_2_3); + } + return __r; + } + +// }}} +// __extract_exponent_as_int {{{ +template <typename _Tp, typename _Abi> + rebind_simd_t<int, simd<_Tp, _Abi>> + __extract_exponent_as_int(const simd<_Tp, _Abi>& __v) + { + using _Vp = simd<_Tp, _Abi>; + using _Up = make_unsigned_t<__int_for_sizeof_t<_Tp>>; + using namespace std::experimental::__float_bitwise_operators; + const _Vp __exponent_mask + = __infinity_v<_Tp>; // 0x7f800000 or 0x7ff0000000000000 + return static_simd_cast<rebind_simd_t<int, _Vp>>( + __bit_cast<rebind_simd_t<_Up, _Vp>>(__v & __exponent_mask) + >> (__digits_v<_Tp> - 1)); + } + +// }}} +// __impl_or_fallback {{{ +template <typename ImplFun, typename FallbackFun, typename... _Args> + _GLIBCXX_SIMD_INTRINSIC auto + __impl_or_fallback_dispatch(int, ImplFun&& __impl_fun, FallbackFun&&, + _Args&&... __args) + -> decltype(__impl_fun(static_cast<_Args&&>(__args)...)) + { return __impl_fun(static_cast<_Args&&>(__args)...); } + +template <typename ImplFun, typename FallbackFun, typename... _Args> + inline auto + __impl_or_fallback_dispatch(float, ImplFun&&, FallbackFun&& __fallback_fun, + _Args&&... __args) + -> decltype(__fallback_fun(static_cast<_Args&&>(__args)...)) + { return __fallback_fun(static_cast<_Args&&>(__args)...); } + +template <typename... _Args> + _GLIBCXX_SIMD_INTRINSIC auto + __impl_or_fallback(_Args&&... __args) + { + return __impl_or_fallback_dispatch(int(), static_cast<_Args&&>(__args)...); + } +//}}} + +// trigonometric functions {{{ +_GLIBCXX_SIMD_MATH_CALL_(acos) +_GLIBCXX_SIMD_MATH_CALL_(asin) +_GLIBCXX_SIMD_MATH_CALL_(atan) +_GLIBCXX_SIMD_MATH_CALL2_(atan2, _Tp) + +/* + * algorithm for sine and cosine: + * + * The result can be calculated with sine or cosine depending on the π/4 section + * the input is in. sine ≈ __x + __x³ cosine ≈ 1 - __x² + * + * sine: + * Map -__x to __x and invert the output + * Extend precision of __x - n * π/4 by calculating + * ((__x - n * p1) - n * p2) - n * p3 (p1 + p2 + p3 = π/4) + * + * Calculate Taylor series with tuned coefficients. + * Fix sign. 
+ */ +// cos{{{ +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + cos(const simd<_Tp, _Abi>& __x) + { + using _V = simd<_Tp, _Abi>; + if constexpr (__is_scalar_abi<_Abi>() || __is_fixed_size_abi_v<_Abi>) + return {__private_init, _Abi::_SimdImpl::_S_cos(__data(__x))}; + else + { + if constexpr (is_same_v<_Tp, float>) + if (_GLIBCXX_SIMD_IS_UNLIKELY(any_of(abs(__x) >= 393382))) + return static_simd_cast<_V>( + cos(static_simd_cast<rebind_simd_t<double, _V>>(__x))); + + const auto __f = __fold_input(__x); + // quadrant | effect + // 0 | cosSeries, + + // 1 | sinSeries, - + // 2 | cosSeries, - + // 3 | sinSeries, + + using namespace std::experimental::__float_bitwise_operators; + const _V __sign_flip + = _V(-0.f) & static_simd_cast<_V>((1 + __f._M_quadrant) << 30); + + const auto __need_cos = (__f._M_quadrant & 1) == 0; + if (_GLIBCXX_SIMD_IS_UNLIKELY(all_of(__need_cos))) + return __sign_flip ^ __cosSeries(__f._M_x); + else if (_GLIBCXX_SIMD_IS_UNLIKELY(none_of(__need_cos))) + return __sign_flip ^ __sinSeries(__f._M_x); + else // some_of(__need_cos) + { + _V __r = __sinSeries(__f._M_x); + where(__need_cos.__cvt(), __r) = __cosSeries(__f._M_x); + return __r ^ __sign_flip; + } + } + } + +template <typename _Tp> + _GLIBCXX_SIMD_ALWAYS_INLINE + enable_if_t<is_floating_point<_Tp>::value, simd<_Tp, simd_abi::scalar>> + cos(simd<_Tp, simd_abi::scalar> __x) + { return std::cos(__data(__x)); } + +//}}} +// sin{{{ +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + sin(const simd<_Tp, _Abi>& __x) + { + using _V = simd<_Tp, _Abi>; + if constexpr (__is_scalar_abi<_Abi>() || __is_fixed_size_abi_v<_Abi>) + return {__private_init, _Abi::_SimdImpl::_S_sin(__data(__x))}; + else + { + if constexpr (is_same_v<_Tp, float>) + if (_GLIBCXX_SIMD_IS_UNLIKELY(any_of(abs(__x) >= 527449))) + return static_simd_cast<_V>( + sin(static_simd_cast<rebind_simd_t<double, _V>>(__x))); + + const auto __f = __fold_input(__x); + // quadrant | effect + // 0 | sinSeries + // 1 | cosSeries + // 2 | sinSeries, sign flip + // 3 | cosSeries, sign flip + using namespace std::experimental::__float_bitwise_operators; + const auto __sign_flip + = (__x ^ static_simd_cast<_V>(1 - __f._M_quadrant)) & _V(_Tp(-0.)); + + const auto __need_sin = (__f._M_quadrant & 1) == 0; + if (_GLIBCXX_SIMD_IS_UNLIKELY(all_of(__need_sin))) + return __sign_flip ^ __sinSeries(__f._M_x); + else if (_GLIBCXX_SIMD_IS_UNLIKELY(none_of(__need_sin))) + return __sign_flip ^ __cosSeries(__f._M_x); + else // some_of(__need_sin) + { + _V __r = __cosSeries(__f._M_x); + where(__need_sin.__cvt(), __r) = __sinSeries(__f._M_x); + return __sign_flip ^ __r; + } + } + } + +template <typename _Tp> + _GLIBCXX_SIMD_ALWAYS_INLINE + enable_if_t<is_floating_point<_Tp>::value, simd<_Tp, simd_abi::scalar>> + sin(simd<_Tp, simd_abi::scalar> __x) + { return std::sin(__data(__x)); } + +//}}} +_GLIBCXX_SIMD_MATH_CALL_(tan) +_GLIBCXX_SIMD_MATH_CALL_(acosh) +_GLIBCXX_SIMD_MATH_CALL_(asinh) +_GLIBCXX_SIMD_MATH_CALL_(atanh) +_GLIBCXX_SIMD_MATH_CALL_(cosh) +_GLIBCXX_SIMD_MATH_CALL_(sinh) +_GLIBCXX_SIMD_MATH_CALL_(tanh) +// }}} +// exponential functions {{{ +_GLIBCXX_SIMD_MATH_CALL_(exp) +_GLIBCXX_SIMD_MATH_CALL_(exp2) +_GLIBCXX_SIMD_MATH_CALL_(expm1) + +// }}} +// frexp {{{ +#if _GLIBCXX_SIMD_X86INTRIN +template <typename _Tp, size_t _Np> + _SimdWrapper<_Tp, _Np> + __getexp(_SimdWrapper<_Tp, _Np> __x) + { + if constexpr (__have_avx512vl && __is_sse_ps<_Tp, _Np>()) + return 
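
For reference, the quadrant dispatch in cos() above reduces to the following scalar sketch (hedged: sin_series/cos_series stand for scalar versions of the __sinSeries/__cosSeries polynomials, the constants are those from __math_float, and only the |x| < 6*pi/4 fast path is covered):

#include <cmath>

float sin_series(float x);   // the polynomials sketched earlier
float cos_series(float x);

float cos_sketch(float x)
{
  constexpr float two_over_pi = 0x1.45F306p-1f;
  constexpr float pi_2_hi = 0x1.921fc0p0f;      // __pi_2_5bits0
  constexpr float pi_2_lo = -0x1.5777a6p-21f;   // __pi_2_5bits0_rem
  const float ax = std::fabs(x);
  const float y = std::nearbyint(ax * two_over_pi);
  const int quadrant = static_cast<int>(y) & 3;
  const float xr = ax - y * pi_2_hi - y * pi_2_lo;   // extended-precision fold
  const float r = (quadrant & 1) ? sin_series(xr) : cos_series(xr);
  return (quadrant == 1 || quadrant == 2) ? -r : r;  // sign per quadrant table
}
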
__auto_bitcast(_mm_getexp_ps(__to_intrin(__x))); + else if constexpr (__have_avx512f && __is_sse_ps<_Tp, _Np>()) + return __auto_bitcast(_mm512_getexp_ps(__auto_bitcast(__to_intrin(__x)))); + else if constexpr (__have_avx512vl && __is_sse_pd<_Tp, _Np>()) + return _mm_getexp_pd(__x); + else if constexpr (__have_avx512f && __is_sse_pd<_Tp, _Np>()) + return __lo128(_mm512_getexp_pd(__auto_bitcast(__x))); + else if constexpr (__have_avx512vl && __is_avx_ps<_Tp, _Np>()) + return _mm256_getexp_ps(__x); + else if constexpr (__have_avx512f && __is_avx_ps<_Tp, _Np>()) + return __lo256(_mm512_getexp_ps(__auto_bitcast(__x))); + else if constexpr (__have_avx512vl && __is_avx_pd<_Tp, _Np>()) + return _mm256_getexp_pd(__x); + else if constexpr (__have_avx512f && __is_avx_pd<_Tp, _Np>()) + return __lo256(_mm512_getexp_pd(__auto_bitcast(__x))); + else if constexpr (__is_avx512_ps<_Tp, _Np>()) + return _mm512_getexp_ps(__x); + else if constexpr (__is_avx512_pd<_Tp, _Np>()) + return _mm512_getexp_pd(__x); + else + __assert_unreachable<_Tp>(); + } + +template <typename _Tp, size_t _Np> + _SimdWrapper<_Tp, _Np> + __getmant_avx512(_SimdWrapper<_Tp, _Np> __x) + { + if constexpr (__have_avx512vl && __is_sse_ps<_Tp, _Np>()) + return __auto_bitcast(_mm_getmant_ps(__to_intrin(__x), _MM_MANT_NORM_p5_1, + _MM_MANT_SIGN_src)); + else if constexpr (__have_avx512f && __is_sse_ps<_Tp, _Np>()) + return __auto_bitcast(_mm512_getmant_ps(__auto_bitcast(__to_intrin(__x)), + _MM_MANT_NORM_p5_1, + _MM_MANT_SIGN_src)); + else if constexpr (__have_avx512vl && __is_sse_pd<_Tp, _Np>()) + return _mm_getmant_pd(__x, _MM_MANT_NORM_p5_1, _MM_MANT_SIGN_src); + else if constexpr (__have_avx512f && __is_sse_pd<_Tp, _Np>()) + return __lo128(_mm512_getmant_pd(__auto_bitcast(__x), _MM_MANT_NORM_p5_1, + _MM_MANT_SIGN_src)); + else if constexpr (__have_avx512vl && __is_avx_ps<_Tp, _Np>()) + return _mm256_getmant_ps(__x, _MM_MANT_NORM_p5_1, _MM_MANT_SIGN_src); + else if constexpr (__have_avx512f && __is_avx_ps<_Tp, _Np>()) + return __lo256(_mm512_getmant_ps(__auto_bitcast(__x), _MM_MANT_NORM_p5_1, + _MM_MANT_SIGN_src)); + else if constexpr (__have_avx512vl && __is_avx_pd<_Tp, _Np>()) + return _mm256_getmant_pd(__x, _MM_MANT_NORM_p5_1, _MM_MANT_SIGN_src); + else if constexpr (__have_avx512f && __is_avx_pd<_Tp, _Np>()) + return __lo256(_mm512_getmant_pd(__auto_bitcast(__x), _MM_MANT_NORM_p5_1, + _MM_MANT_SIGN_src)); + else if constexpr (__is_avx512_ps<_Tp, _Np>()) + return _mm512_getmant_ps(__x, _MM_MANT_NORM_p5_1, _MM_MANT_SIGN_src); + else if constexpr (__is_avx512_pd<_Tp, _Np>()) + return _mm512_getmant_pd(__x, _MM_MANT_NORM_p5_1, _MM_MANT_SIGN_src); + else + __assert_unreachable<_Tp>(); + } +#endif // _GLIBCXX_SIMD_X86INTRIN + +/** + * splits @p __v into exponent and mantissa, the sign is kept with the mantissa + * + * The return value will be in the range [0.5, 1.0[ + * The @p __e value will be an integer defining the power-of-two exponent + */ +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + frexp(const simd<_Tp, _Abi>& __x, _Samesize<int, simd<_Tp, _Abi>>* __exp) + { + if constexpr (simd_size_v<_Tp, _Abi> == 1) + { + int __tmp; + const auto __r = std::frexp(__x[0], &__tmp); + (*__exp)[0] = __tmp; + return __r; + } + else if constexpr (__is_fixed_size_abi_v<_Abi>) + { + return {__private_init, + _Abi::_SimdImpl::_S_frexp(__data(__x), __data(*__exp))}; +#if _GLIBCXX_SIMD_X86INTRIN + } + else if constexpr (__have_avx512f) + { + constexpr size_t _Np = simd_size_v<_Tp, _Abi>; + constexpr size_t 
_NI = _Np < 4 ? 4 : _Np; + const auto __v = __data(__x); + const auto __isnonzero + = _Abi::_SimdImpl::_S_isnonzerovalue_mask(__v._M_data); + const _SimdWrapper<int, _NI> __exp_plus1 + = 1 + __convert<_SimdWrapper<int, _NI>>(__getexp(__v))._M_data; + const _SimdWrapper<int, _Np> __e = __wrapper_bitcast<int, _Np>( + _Abi::_CommonImpl::_S_blend(_SimdWrapper<bool, _NI>(__isnonzero), + _SimdWrapper<int, _NI>(), __exp_plus1)); + simd_abi::deduce_t<int, _Np>::_CommonImpl::_S_store(__e, __exp); + return {__private_init, + _Abi::_CommonImpl::_S_blend(_SimdWrapper<bool, _Np>( + __isnonzero), + __v, __getmant_avx512(__v))}; +#endif // _GLIBCXX_SIMD_X86INTRIN + } + else + { + // fallback implementation + static_assert(sizeof(_Tp) == 4 || sizeof(_Tp) == 8); + using _V = simd<_Tp, _Abi>; + using _IV = rebind_simd_t<int, _V>; + using namespace std::experimental::__proposed; + using namespace std::experimental::__float_bitwise_operators; + + constexpr int __exp_adjust = sizeof(_Tp) == 4 ? 0x7e : 0x3fe; + constexpr int __exp_offset = sizeof(_Tp) == 4 ? 0x70 : 0x200; + constexpr _Tp __subnorm_scale = sizeof(_Tp) == 4 ? 0x1p112 : 0x1p512; + _GLIBCXX_SIMD_USE_CONSTEXPR_API _V __exponent_mask + = __infinity_v<_Tp>; // 0x7f800000 or 0x7ff0000000000000 + _GLIBCXX_SIMD_USE_CONSTEXPR_API _V __p5_1_exponent + = -(2 - __epsilon_v<_Tp>) / 2; // 0xbf7fffff or 0xbfefffffffffffff + + _V __mant = __p5_1_exponent & (__exponent_mask | __x); // +/-[.5, 1) + const _IV __exponent_bits = __extract_exponent_as_int(__x); + if (_GLIBCXX_SIMD_IS_LIKELY(all_of(isnormal(__x)))) + { + *__exp + = simd_cast<_Samesize<int, _V>>(__exponent_bits - __exp_adjust); + return __mant; + } + +#if __FINITE_MATH_ONLY__ + // at least one element of __x is 0 or subnormal, the rest is normal + // (inf and NaN are excluded by -ffinite-math-only) + const auto __iszero_inf_nan = __x == 0; +#else + const auto __as_int + = __bit_cast<rebind_simd_t<__int_for_sizeof_t<_Tp>, _V>>(abs(__x)); + const auto __inf + = __bit_cast<rebind_simd_t<__int_for_sizeof_t<_Tp>, _V>>( + _V(__infinity_v<_Tp>)); + const auto __iszero_inf_nan = static_simd_cast<typename _V::mask_type>( + __as_int == 0 || __as_int >= __inf); +#endif + + const _V __scaled_subnormal = __x * __subnorm_scale; + const _V __mant_subnormal + = __p5_1_exponent & (__exponent_mask | __scaled_subnormal); + where(!isnormal(__x), __mant) = __mant_subnormal; + where(__iszero_inf_nan, __mant) = __x; + _IV __e = __extract_exponent_as_int(__scaled_subnormal); + using _MaskType = + typename conditional_t<sizeof(typename _V::value_type) == sizeof(int), + _V, _IV>::mask_type; + const _MaskType __value_isnormal = isnormal(__x).__cvt(); + where(__value_isnormal.__cvt(), __e) = __exponent_bits; + static_assert(sizeof(_IV) == sizeof(__value_isnormal)); + const _IV __offset + = (__bit_cast<_IV>(__value_isnormal) & _IV(__exp_adjust)) + | (__bit_cast<_IV>(static_simd_cast<_MaskType>(__exponent_bits == 0) + & static_simd_cast<_MaskType>(__x != 0)) + & _IV(__exp_adjust + __exp_offset)); + *__exp = simd_cast<_Samesize<int, _V>>(__e - __offset); + return __mant; + } + } + +// }}} +_GLIBCXX_SIMD_MATH_CALL2_(ldexp, int) +_GLIBCXX_SIMD_MATH_CALL_(ilogb) + +// logarithms {{{ +_GLIBCXX_SIMD_MATH_CALL_(log) +_GLIBCXX_SIMD_MATH_CALL_(log10) +_GLIBCXX_SIMD_MATH_CALL_(log1p) +_GLIBCXX_SIMD_MATH_CALL_(log2) + +//}}} +// logb{{{ +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point<_Tp>::value, simd<_Tp, _Abi>> + logb(const simd<_Tp, _Abi>& __x) + { + constexpr size_t _Np = simd_size_v<_Tp, _Abi>; + if constexpr (_Np 
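
The generic fallback manufactures the mantissa by forcing the exponent field to that of 0.5 while keeping sign and mantissa bits. In scalar binary32 terms (a hedged sketch of the same bit-pattern arithmetic; assumes a normal, nonzero input):

#include <cstdint>
#include <cstring>

float frexp_mantissa(float x)   // isnormal(x) assumed
{
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof bits);
  // (exponent_mask | x) sets the exponent field to all-ones; ANDing with the
  // bit pattern of -(2 - epsilon)/2 == 0xbf7fffff then overwrites it with the
  // exponent of 0.5 while preserving sign and mantissa bits:
  bits = 0xbf7fffffu & (bits | 0x7f800000u);
  float r;
  std::memcpy(&r, &bits, sizeof r);
  return r;                     // +/-[0.5, 1)
}
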
== 1) + return std::logb(__x[0]); + else if constexpr (__is_fixed_size_abi_v<_Abi>) + { + return {__private_init, + __data(__x)._M_apply_per_chunk([](auto __impl, auto __xx) { + using _V = typename decltype(__impl)::simd_type; + return __data( + std::experimental::logb(_V(__private_init, __xx))); + })}; + } +#if _GLIBCXX_SIMD_X86INTRIN // {{{ + else if constexpr (__have_avx512vl && __is_sse_ps<_Tp, _Np>()) + return {__private_init, + __auto_bitcast(_mm_getexp_ps(__to_intrin(__as_vector(__x))))}; + else if constexpr (__have_avx512vl && __is_sse_pd<_Tp, _Np>()) + return {__private_init, _mm_getexp_pd(__data(__x))}; + else if constexpr (__have_avx512vl && __is_avx_ps<_Tp, _Np>()) + return {__private_init, _mm256_getexp_ps(__data(__x))}; + else if constexpr (__have_avx512vl && __is_avx_pd<_Tp, _Np>()) + return {__private_init, _mm256_getexp_pd(__data(__x))}; + else if constexpr (__have_avx512f && __is_avx_ps<_Tp, _Np>()) + return {__private_init, + __lo256(_mm512_getexp_ps(__auto_bitcast(__data(__x))))}; + else if constexpr (__have_avx512f && __is_avx_pd<_Tp, _Np>()) + return {__private_init, + __lo256(_mm512_getexp_pd(__auto_bitcast(__data(__x))))}; + else if constexpr (__is_avx512_ps<_Tp, _Np>()) + return {__private_init, _mm512_getexp_ps(__data(__x))}; + else if constexpr (__is_avx512_pd<_Tp, _Np>()) + return {__private_init, _mm512_getexp_pd(__data(__x))}; +#endif // _GLIBCXX_SIMD_X86INTRIN }}} + else + { + using _V = simd<_Tp, _Abi>; + using namespace std::experimental::__proposed; + auto __is_normal = isnormal(__x); + + // work on abs(__x) to reflect the return value on Linux for negative + // inputs (domain-error => implementation-defined value is returned) + const _V abs_x = abs(__x); + + // __exponent(__x) returns the exponent value (bias removed) as + // simd<_Up> with integral _Up + auto&& __exponent = [](const _V& __v) { + using namespace std::experimental::__proposed; + using _IV = rebind_simd_t< + conditional_t<sizeof(_Tp) == sizeof(_LLong), _LLong, int>, _V>; + return (__bit_cast<_IV>(__v) >> (__digits_v<_Tp> - 1)) + - (__max_exponent_v<_Tp> - 1); + }; + _V __r = static_simd_cast<_V>(__exponent(abs_x)); + if (_GLIBCXX_SIMD_IS_LIKELY(all_of(__is_normal))) + // without corner cases (nan, inf, subnormal, zero) we have our + // answer: + return __r; + const auto __is_zero = __x == 0; + const auto __is_nan = isnan(__x); + const auto __is_inf = isinf(__x); + where(__is_zero, __r) = -__infinity_v<_Tp>; + where(__is_nan, __r) = __x; + where(__is_inf, __r) = __infinity_v<_Tp>; + __is_normal |= __is_zero || __is_nan || __is_inf; + if (all_of(__is_normal)) + // at this point everything but subnormals is handled + return __r; + // subnormals repeat the exponent extraction after multiplication of the + // input with __a floating point value that has 112 (0x70) in its exponent + // (not too big for sp and large enough for dp) + const _V __scaled = abs_x * _Tp(0x1p112); + _V __scaled_exp = static_simd_cast<_V>(__exponent(__scaled) - 112); + where(__is_normal, __scaled_exp) = __r; + return __scaled_exp; + } + } + +//}}} +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + modf(const simd<_Tp, _Abi>& __x, simd<_Tp, _Abi>* __iptr) + { + if constexpr (__is_scalar_abi<_Abi>() + || (__is_fixed_size_abi_v< + _Abi> && simd_size_v<_Tp, _Abi> == 1)) + { + _Tp __tmp; + _Tp __r = std::modf(__x[0], &__tmp); + __iptr[0] = __tmp; + return __r; + } + else + { + const auto __integral = trunc(__x); + *__iptr = __integral; + auto __r = __x - __integral; +#if 
!__FINITE_MATH_ONLY__ + where(isinf(__x), __r) = _Tp(); +#endif + return copysign(__r, __x); + } + } + +_GLIBCXX_SIMD_MATH_CALL2_(scalbn, int) +_GLIBCXX_SIMD_MATH_CALL2_(scalbln, long) + +_GLIBCXX_SIMD_MATH_CALL_(cbrt) + +_GLIBCXX_SIMD_MATH_CALL_(abs) +_GLIBCXX_SIMD_MATH_CALL_(fabs) + +// [parallel.simd.math] only asks for is_floating_point_v<_Tp> and forgot to +// allow signed integral _Tp +template <typename _Tp, typename _Abi> + enable_if_t<!is_floating_point_v<_Tp> && is_signed_v<_Tp>, simd<_Tp, _Abi>> + abs(const simd<_Tp, _Abi>& __x) + { return {__private_init, _Abi::_SimdImpl::_S_abs(__data(__x))}; } + +template <typename _Tp, typename _Abi> + enable_if_t<!is_floating_point_v<_Tp> && is_signed_v<_Tp>, simd<_Tp, _Abi>> + fabs(const simd<_Tp, _Abi>& __x) + { return {__private_init, _Abi::_SimdImpl::_S_abs(__data(__x))}; } + +// the following are overloads for functions in <cstdlib> and not covered by +// [parallel.simd.math]. I don't see much value in making them work, though +/* +template <typename _Abi> simd<long, _Abi> labs(const simd<long, _Abi> &__x) +{ return {__private_init, _Abi::_SimdImpl::abs(__data(__x))}; } + +template <typename _Abi> simd<long long, _Abi> llabs(const simd<long long, _Abi> +&__x) +{ return {__private_init, _Abi::_SimdImpl::abs(__data(__x))}; } +*/ + +#define _GLIBCXX_SIMD_CVTING2(_NAME) \ +template <typename _Tp, typename _Abi> \ + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> _NAME( \ + const simd<_Tp, _Abi>& __x, const __type_identity_t<simd<_Tp, _Abi>>& __y) \ + { \ + return _NAME(__x, __y); \ + } \ + \ +template <typename _Tp, typename _Abi> \ + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> _NAME( \ + const __type_identity_t<simd<_Tp, _Abi>>& __x, const simd<_Tp, _Abi>& __y) \ + { \ + return _NAME(__x, __y); \ + } + +#define _GLIBCXX_SIMD_CVTING3(_NAME) \ +template <typename _Tp, typename _Abi> \ + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> _NAME( \ + const __type_identity_t<simd<_Tp, _Abi>>& __x, const simd<_Tp, _Abi>& __y, \ + const simd<_Tp, _Abi>& __z) \ + { \ + return _NAME(__x, __y, __z); \ + } \ + \ +template <typename _Tp, typename _Abi> \ + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> _NAME( \ + const simd<_Tp, _Abi>& __x, const __type_identity_t<simd<_Tp, _Abi>>& __y, \ + const simd<_Tp, _Abi>& __z) \ + { \ + return _NAME(__x, __y, __z); \ + } \ + \ +template <typename _Tp, typename _Abi> \ + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> _NAME( \ + const simd<_Tp, _Abi>& __x, const simd<_Tp, _Abi>& __y, \ + const __type_identity_t<simd<_Tp, _Abi>>& __z) \ + { \ + return _NAME(__x, __y, __z); \ + } \ + \ +template <typename _Tp, typename _Abi> \ + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> _NAME( \ + const simd<_Tp, _Abi>& __x, const __type_identity_t<simd<_Tp, _Abi>>& __y, \ + const __type_identity_t<simd<_Tp, _Abi>>& __z) \ + { \ + return _NAME(__x, __y, __z); \ + } \ + \ +template <typename _Tp, typename _Abi> \ + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> _NAME( \ + const __type_identity_t<simd<_Tp, _Abi>>& __x, const simd<_Tp, _Abi>& __y, \ + const __type_identity_t<simd<_Tp, _Abi>>& __z) \ + { \ + return _NAME(__x, __y, __z); \ + } \ + \ +template <typename _Tp, typename _Abi> \ + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> _NAME( \ + const __type_identity_t<simd<_Tp, _Abi>>& __x, \ + const __type_identity_t<simd<_Tp, _Abi>>& __y, const simd<_Tp, _Abi>& __z) \ + { \ + return _NAME(__x, __y, __z); \ + } + +template <typename _R, typename _ToApply, typename _Tp, typename... 
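
The extra overloads above paper over the [parallel.simd.math] wording defect: with them, abs works for any signed element type, not just floating point (hedged sketch):

#include <experimental/simd>
namespace stdx = std::experimental;

void abs_demo()
{
  stdx::native_simd<int>   vi(-3);
  stdx::native_simd<float> vf(-3.f);
  auto ai = stdx::abs(vi);   // via the signed-integral overload above
  auto af = stdx::abs(vf);   // via the [parallel.simd.math] overload
  (void) ai; (void) af;
}
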
_Tps> + _GLIBCXX_SIMD_INTRINSIC _R + __fixed_size_apply(_ToApply&& __apply, const _Tp& __arg0, + const _Tps&... __args) + { + return {__private_init, + __data(__arg0)._M_apply_per_chunk( + [&](auto __impl, const auto&... __inner) { + using _V = typename decltype(__impl)::simd_type; + return __data(__apply(_V(__private_init, __inner)...)); + }, + __data(__args)...)}; + } + +template <typename _VV> + __remove_cvref_t<_VV> + __hypot(_VV __x, _VV __y) + { + using _V = __remove_cvref_t<_VV>; + using _Tp = typename _V::value_type; + if constexpr (_V::size() == 1) + return std::hypot(_Tp(__x[0]), _Tp(__y[0])); + else if constexpr (__is_fixed_size_abi_v<typename _V::abi_type>) + { + return __fixed_size_apply<_V>([](auto __a, + auto __b) { return hypot(__a, __b); }, + __x, __y); + } + else + { + // A simple solution for _Tp == float would be to cast to double and + // simply calculate sqrt(x²+y²) as it can't over-/underflow anymore with + // dp. It still needs the Annex F fixups though and isn't faster on + // Skylake-AVX512 (not even for SSE and AVX vectors, and really bad for + // AVX-512). + using namespace __float_bitwise_operators; + _V __absx = abs(__x); // no error + _V __absy = abs(__y); // no error + _V __hi = max(__absx, __absy); // no error + _V __lo = min(__absy, __absx); // no error + + // round __hi down to the next power-of-2: + _GLIBCXX_SIMD_USE_CONSTEXPR_API _V __inf(__infinity_v<_Tp>); + +#ifndef __FAST_MATH__ + if constexpr (__have_neon && !__have_neon_a32) + { // With ARMv7 NEON, we have no subnormals and must use slightly + // different strategy + const _V __hi_exp = __hi & __inf; + _V __scale_back = __hi_exp; + // For large exponents (max & max/2) the inversion comes too close + // to subnormals. Subtract 3 from the exponent: + where(__hi_exp > 1, __scale_back) = __hi_exp * _Tp(0.125); + // Invert and adjust for the off-by-one error of inversion via xor: + const _V __scale = (__scale_back ^ __inf) * _Tp(.5); + const _V __h1 = __hi * __scale; + const _V __l1 = __lo * __scale; + _V __r = __scale_back * sqrt(__h1 * __h1 + __l1 * __l1); + // Fix up hypot(0, 0) to not be NaN: + where(__hi == 0, __r) = 0; + return __r; + } +#endif + +#ifdef __FAST_MATH__ + // With fast-math, ignore precision of subnormals and inputs from + // __finite_max_v/2 to __finite_max_v. This removes all + // branching/masking. + if constexpr (true) +#else + if (_GLIBCXX_SIMD_IS_LIKELY(all_of(isnormal(__x)) + && all_of(isnormal(__y)))) +#endif + { + const _V __hi_exp = __hi & __inf; + //((__hi + __hi) & __inf) ^ __inf almost works for computing + //__scale, + // except when (__hi + __hi) & __inf == __inf, in which case __scale + // becomes 0 (should be min/2 instead) and thus loses the + // information from __lo. 
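
The scaling step avoids a division: for a power of two, XOR-ing the exponent field with the infinity bit pattern complements the biased exponent, which is the reciprocal up to a factor of two. A scalar sketch of the same trick (binary32, normal input assumed):

#include <cstdint>
#include <cstring>

float reciprocal_pow2(float hi_exp)   // hi_exp == 2^k, normal
{
  std::uint32_t bits;
  std::memcpy(&bits, &hi_exp, sizeof bits);
  bits ^= 0x7f800000u;                // complement exponent; off by one step
  float r;
  std::memcpy(&r, &bits, sizeof r);
  return r * 0.5f;                    // exactly 2^-k
}
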
+#ifdef __FAST_MATH__ + using _Ip = __int_for_sizeof_t<_Tp>; + using _IV = rebind_simd_t<_Ip, _V>; + const auto __as_int = __bit_cast<_IV>(__hi_exp); + const _V __scale + = __bit_cast<_V>(2 * __bit_cast<_Ip>(_Tp(1)) - __as_int); +#else + const _V __scale = (__hi_exp ^ __inf) * _Tp(.5); +#endif + _GLIBCXX_SIMD_USE_CONSTEXPR_API _V __mant_mask + = __norm_min_v<_Tp> - __denorm_min_v<_Tp>; + const _V __h1 = (__hi & __mant_mask) | _V(1); + const _V __l1 = __lo * __scale; + return __hi_exp * sqrt(__h1 * __h1 + __l1 * __l1); + } + else + { + // slower path to support subnormals + // if __hi is subnormal, avoid scaling by inf & final mul by 0 + // (which yields NaN) by using min() + _V __scale = _V(1 / __norm_min_v<_Tp>); + // invert exponent w/o error and w/o using the slow divider unit: + // xor inverts the exponent but off by 1. Multiplication with .5 + // adjusts for the discrepancy. + where(__hi >= __norm_min_v<_Tp>, __scale) + = ((__hi & __inf) ^ __inf) * _Tp(.5); + // adjust final exponent for subnormal inputs + _V __hi_exp = __norm_min_v<_Tp>; + where(__hi >= __norm_min_v<_Tp>, __hi_exp) + = __hi & __inf; // no error + _V __h1 = __hi * __scale; // no error + _V __l1 = __lo * __scale; // no error + + // sqrt(x²+y²) = e*sqrt((x/e)²+(y/e)²): + // this ensures no overflow in the argument to sqrt + _V __r = __hi_exp * sqrt(__h1 * __h1 + __l1 * __l1); +#ifdef __STDC_IEC_559__ + // fixup for Annex F requirements + // the naive fixup goes like this: + // + // where(__l1 == 0, __r) = __hi; + // where(isunordered(__x, __y), __r) = __quiet_NaN_v<_Tp>; + // where(isinf(__absx) || isinf(__absy), __r) = __inf; + // + // The fixup can be prepared in parallel with the sqrt, requiring a + // single blend step after hi_exp * sqrt, reducing latency and + // throughput: + _V __fixup = __hi; // __lo == 0 + where(isunordered(__x, __y), __fixup) = __quiet_NaN_v<_Tp>; + where(isinf(__absx) || isinf(__absy), __fixup) = __inf; + where(!(__lo == 0 || isunordered(__x, __y) + || (isinf(__absx) || isinf(__absy))), + __fixup) + = __r; + __r = __fixup; +#endif + return __r; + } + } + } + +template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> + hypot(const simd<_Tp, _Abi>& __x, const simd<_Tp, _Abi>& __y) + { + return __hypot<conditional_t<__is_fixed_size_abi_v<_Abi>, + const simd<_Tp, _Abi>&, simd<_Tp, _Abi>>>(__x, + __y); + } + +_GLIBCXX_SIMD_CVTING2(hypot) + + template <typename _VV> + __remove_cvref_t<_VV> + __hypot(_VV __x, _VV __y, _VV __z) + { + using _V = __remove_cvref_t<_VV>; + using _Abi = typename _V::abi_type; + using _Tp = typename _V::value_type; + /* FIXME: enable after PR77776 is resolved + if constexpr (_V::size() == 1) + return std::hypot(_Tp(__x[0]), _Tp(__y[0]), _Tp(__z[0])); + else + */ + if constexpr (__is_fixed_size_abi_v<_Abi> && _V::size() > 1) + { + return __fixed_size_apply<simd<_Tp, _Abi>>( + [](auto __a, auto __b, auto __c) { return hypot(__a, __b, __c); }, + __x, __y, __z); + } + else + { + using namespace __float_bitwise_operators; + const _V __absx = abs(__x); // no error + const _V __absy = abs(__y); // no error + const _V __absz = abs(__z); // no error + _V __hi = max(max(__absx, __absy), __absz); // no error + _V __l0 = min(__absz, max(__absx, __absy)); // no error + _V __l1 = min(__absy, __absx); // no error + if constexpr (__digits_v<_Tp> == 64 && __max_exponent_v<_Tp> == 0x4000 + && __min_exponent_v<_Tp> == -0x3FFD && _V::size() == 1) + { // Seems like x87 fp80, where bit 63 is always 1 unless subnormal or + // NaN. 
In this case the bit-tricks don't work, they require IEC559 + // binary32 or binary64 format. +#ifdef __STDC_IEC_559__ + // fixup for Annex F requirements + if (isinf(__absx[0]) || isinf(__absy[0]) || isinf(__absz[0])) + return __infinity_v<_Tp>; + else if (isunordered(__absx[0], __absy[0] + __absz[0])) + return __quiet_NaN_v<_Tp>; + else if (__l0[0] == 0 && __l1[0] == 0) + return __hi; +#endif + _V __hi_exp = __hi; + const _ULLong __tmp = 0x8000'0000'0000'0000ull; + __builtin_memcpy(&__data(__hi_exp), &__tmp, 8); + const _V __scale = 1 / __hi_exp; + __hi *= __scale; + __l0 *= __scale; + __l1 *= __scale; + return __hi_exp * sqrt((__l0 * __l0 + __l1 * __l1) + __hi * __hi); + } + else + { + // round __hi down to the next power-of-2: + _GLIBCXX_SIMD_USE_CONSTEXPR_API _V __inf(__infinity_v<_Tp>); + +#ifndef __FAST_MATH__ + if constexpr (_V::size() > 1 && __have_neon && !__have_neon_a32) + { // With ARMv7 NEON, we have no subnormals and must use slightly + // different strategy + const _V __hi_exp = __hi & __inf; + _V __scale_back = __hi_exp; + // For large exponents (max & max/2) the inversion comes too + // close to subnormals. Subtract 3 from the exponent: + where(__hi_exp > 1, __scale_back) = __hi_exp * _Tp(0.125); + // Invert and adjust for the off-by-one error of inversion via + // xor: + const _V __scale = (__scale_back ^ __inf) * _Tp(.5); + const _V __h1 = __hi * __scale; + __l0 *= __scale; + __l1 *= __scale; + _V __lo = __l0 * __l0 + + __l1 * __l1; // add the two smaller values first + asm("" : "+m"(__lo)); + _V __r = __scale_back * sqrt(__h1 * __h1 + __lo); + // Fix up hypot(0, 0, 0) to not be NaN: + where(__hi == 0, __r) = 0; + return __r; + } +#endif + +#ifdef __FAST_MATH__ + // With fast-math, ignore precision of subnormals and inputs from + // __finite_max_v/2 to __finite_max_v. This removes all + // branching/masking. + if constexpr (true) +#else + if (_GLIBCXX_SIMD_IS_LIKELY(all_of(isnormal(__x)) + && all_of(isnormal(__y)) + && all_of(isnormal(__z)))) +#endif + { + const _V __hi_exp = __hi & __inf; + //((__hi + __hi) & __inf) ^ __inf almost works for computing + //__scale, except when (__hi + __hi) & __inf == __inf, in which + // case __scale + // becomes 0 (should be min/2 instead) and thus loses the + // information from __lo. +#ifdef __FAST_MATH__ + using _Ip = __int_for_sizeof_t<_Tp>; + using _IV = rebind_simd_t<_Ip, _V>; + const auto __as_int = __bit_cast<_IV>(__hi_exp); + const _V __scale + = __bit_cast<_V>(2 * __bit_cast<_Ip>(_Tp(1)) - __as_int); +#else + const _V __scale = (__hi_exp ^ __inf) * _Tp(.5); +#endif + constexpr _Tp __mant_mask + = __norm_min_v<_Tp> - __denorm_min_v<_Tp>; + const _V __h1 = (__hi & _V(__mant_mask)) | _V(1); + __l0 *= __scale; + __l1 *= __scale; + const _V __lo + = __l0 * __l0 + + __l1 * __l1; // add the two smaller values first + return __hi_exp * sqrt(__lo + __h1 * __h1); + } + else + { + // slower path to support subnormals + // if __hi is subnormal, avoid scaling by inf & final mul by 0 + // (which yields NaN) by using min() + _V __scale = _V(1 / __norm_min_v<_Tp>); + // invert exponent w/o error and w/o using the slow divider + // unit: xor inverts the exponent but off by 1. Multiplication + // with .5 adjusts for the discrepancy. 
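+		// E.g., assuming IEC559 binary32: __hi & __inf == 4.f has
+		// biased exponent 129; xor with __inf complements the
+		// exponent field (129 ^ 0xff == 126, i.e. 0.5f == 2 / 4.f),
+		// and the subsequent * .5 yields the exact reciprocal
+		// 0.25f == 1 / 4.f.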
+ where(__hi >= __norm_min_v<_Tp>, __scale) + = ((__hi & __inf) ^ __inf) * _Tp(.5); + // adjust final exponent for subnormal inputs + _V __hi_exp = __norm_min_v<_Tp>; + where(__hi >= __norm_min_v<_Tp>, __hi_exp) + = __hi & __inf; // no error + _V __h1 = __hi * __scale; // no error + __l0 *= __scale; // no error + __l1 *= __scale; // no error + _V __lo = __l0 * __l0 + + __l1 * __l1; // add the two smaller values first + _V __r = __hi_exp * sqrt(__lo + __h1 * __h1); +#ifdef __STDC_IEC_559__ + // fixup for Annex F requirements + _V __fixup = __hi; // __lo == 0 + // where(__lo == 0, __fixup) = __hi; + where(isunordered(__x, __y + __z), __fixup) + = __quiet_NaN_v<_Tp>; + where(isinf(__absx) || isinf(__absy) || isinf(__absz), __fixup) + = __inf; + // Instead of __lo == 0, the following could depend on __h1² == + // __h1² + __lo (i.e. __hi is so much larger than the other two + // inputs that the result is exactly __hi). While this may + // improve precision, it is likely to reduce efficiency if the + // ISA has FMAs (because __h1² + __lo is an FMA, but the + // intermediate + // __h1² must be kept) + where(!(__lo == 0 || isunordered(__x, __y + __z) + || isinf(__absx) || isinf(__absy) || isinf(__absz)), + __fixup) + = __r; + __r = __fixup; +#endif + return __r; + } + } + } + } + + template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_INTRINSIC simd<_Tp, _Abi> + hypot(const simd<_Tp, _Abi>& __x, const simd<_Tp, _Abi>& __y, + const simd<_Tp, _Abi>& __z) + { + return __hypot<conditional_t<__is_fixed_size_abi_v<_Abi>, + const simd<_Tp, _Abi>&, simd<_Tp, _Abi>>>(__x, + __y, + __z); + } + +_GLIBCXX_SIMD_CVTING3(hypot) + +_GLIBCXX_SIMD_MATH_CALL2_(pow, _Tp) + +_GLIBCXX_SIMD_MATH_CALL_(sqrt) +_GLIBCXX_SIMD_MATH_CALL_(erf) +_GLIBCXX_SIMD_MATH_CALL_(erfc) +_GLIBCXX_SIMD_MATH_CALL_(lgamma) +_GLIBCXX_SIMD_MATH_CALL_(tgamma) +_GLIBCXX_SIMD_MATH_CALL_(ceil) +_GLIBCXX_SIMD_MATH_CALL_(floor) +_GLIBCXX_SIMD_MATH_CALL_(nearbyint) +_GLIBCXX_SIMD_MATH_CALL_(rint) +_GLIBCXX_SIMD_MATH_CALL_(lrint) +_GLIBCXX_SIMD_MATH_CALL_(llrint) + +_GLIBCXX_SIMD_MATH_CALL_(round) +_GLIBCXX_SIMD_MATH_CALL_(lround) +_GLIBCXX_SIMD_MATH_CALL_(llround) + +_GLIBCXX_SIMD_MATH_CALL_(trunc) + +_GLIBCXX_SIMD_MATH_CALL2_(fmod, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(remainder, _Tp) +_GLIBCXX_SIMD_MATH_CALL3_(remquo, _Tp, int*) + +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + copysign(const simd<_Tp, _Abi>& __x, const simd<_Tp, _Abi>& __y) + { + if constexpr (simd_size_v<_Tp, _Abi> == 1) + return std::copysign(__x[0], __y[0]); + else if constexpr (is_same_v<_Tp, long double> && sizeof(_Tp) == 12) + // Remove this case once __bit_cast is implemented via __builtin_bit_cast. + // It is necessary, because __signmask below cannot be computed at compile + // time. 
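+    // (For IEC559 formats, the __signmask used in the else branch below,
+    // i.e. _V(1) ^ _V(-1), has only the sign bit set in each element,
+    // e.g. 0x8000'0000 per element for binary32.)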
+ return simd<_Tp, _Abi>( + [&](auto __i) { return std::copysign(__x[__i], __y[__i]); }); + else + { + using _V = simd<_Tp, _Abi>; + using namespace std::experimental::__float_bitwise_operators; + _GLIBCXX_SIMD_USE_CONSTEXPR_API auto __signmask = _V(1) ^ _V(-1); + return (__x & (__x ^ __signmask)) | (__y & __signmask); + } + } + +_GLIBCXX_SIMD_MATH_CALL2_(nextafter, _Tp) +// not covered in [parallel.simd.math]: +// _GLIBCXX_SIMD_MATH_CALL2_(nexttoward, long double) +_GLIBCXX_SIMD_MATH_CALL2_(fdim, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(fmax, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(fmin, _Tp) + +_GLIBCXX_SIMD_MATH_CALL3_(fma, _Tp, _Tp) +_GLIBCXX_SIMD_MATH_CALL_(fpclassify) +_GLIBCXX_SIMD_MATH_CALL_(isfinite) + +// isnan and isinf require special treatment because old glibc may declare +// `int isinf(double)`. +template <typename _Tp, typename _Abi, typename..., + typename _R = _Math_return_type_t<bool, _Tp, _Abi>> + enable_if_t<is_floating_point_v<_Tp>, _R> + isinf(simd<_Tp, _Abi> __x) + { return {__private_init, _Abi::_SimdImpl::_S_isinf(__data(__x))}; } + +template <typename _Tp, typename _Abi, typename..., + typename _R = _Math_return_type_t<bool, _Tp, _Abi>> + enable_if_t<is_floating_point_v<_Tp>, _R> + isnan(simd<_Tp, _Abi> __x) + { return {__private_init, _Abi::_SimdImpl::_S_isnan(__data(__x))}; } + +_GLIBCXX_SIMD_MATH_CALL_(isnormal) + +template <typename..., typename _Tp, typename _Abi> + simd_mask<_Tp, _Abi> + signbit(simd<_Tp, _Abi> __x) + { + if constexpr (is_integral_v<_Tp>) + { + if constexpr (is_unsigned_v<_Tp>) + return simd_mask<_Tp, _Abi>{}; // false + else + return __x < 0; + } + else + return {__private_init, _Abi::_SimdImpl::_S_signbit(__data(__x))}; + } + +_GLIBCXX_SIMD_MATH_CALL2_(isgreater, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(isgreaterequal, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(isless, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(islessequal, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(islessgreater, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(isunordered, _Tp) + +/* not covered in [parallel.simd.math] +template <typename _Abi> __doublev<_Abi> nan(const char* tagp); +template <typename _Abi> __floatv<_Abi> nanf(const char* tagp); +template <typename _Abi> __ldoublev<_Abi> nanl(const char* tagp); + +template <typename _V> struct simd_div_t { + _V quot, rem; +}; + +template <typename _Abi> +simd_div_t<_SCharv<_Abi>> div(_SCharv<_Abi> numer, + _SCharv<_Abi> denom); +template <typename _Abi> +simd_div_t<__shortv<_Abi>> div(__shortv<_Abi> numer, + __shortv<_Abi> denom); +template <typename _Abi> +simd_div_t<__intv<_Abi>> div(__intv<_Abi> numer, __intv<_Abi> denom); +template <typename _Abi> +simd_div_t<__longv<_Abi>> div(__longv<_Abi> numer, + __longv<_Abi> denom); +template <typename _Abi> +simd_div_t<__llongv<_Abi>> div(__llongv<_Abi> numer, + __llongv<_Abi> denom); +*/ + +// special math {{{ +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + assoc_laguerre(const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __n, + const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __m, + const simd<_Tp, _Abi>& __x) + { + return simd<_Tp, _Abi>([&](auto __i) { + return std::assoc_laguerre(__n[__i], __m[__i], __x[__i]); + }); + } + +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + assoc_legendre(const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __n, + const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __m, + const simd<_Tp, _Abi>& __x) + { + return simd<_Tp, _Abi>([&](auto __i) { + return std::assoc_legendre(__n[__i], __m[__i], 
__x[__i]); + }); + } + +_GLIBCXX_SIMD_MATH_CALL2_(beta, _Tp) +_GLIBCXX_SIMD_MATH_CALL_(comp_ellint_1) +_GLIBCXX_SIMD_MATH_CALL_(comp_ellint_2) +_GLIBCXX_SIMD_MATH_CALL2_(comp_ellint_3, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(cyl_bessel_i, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(cyl_bessel_j, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(cyl_bessel_k, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(cyl_neumann, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(ellint_1, _Tp) +_GLIBCXX_SIMD_MATH_CALL2_(ellint_2, _Tp) +_GLIBCXX_SIMD_MATH_CALL3_(ellint_3, _Tp, _Tp) +_GLIBCXX_SIMD_MATH_CALL_(expint) + +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + hermite(const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __n, + const simd<_Tp, _Abi>& __x) + { + return simd<_Tp, _Abi>( + [&](auto __i) { return std::hermite(__n[__i], __x[__i]); }); + } + +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + laguerre(const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __n, + const simd<_Tp, _Abi>& __x) + { + return simd<_Tp, _Abi>( + [&](auto __i) { return std::laguerre(__n[__i], __x[__i]); }); + } + +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + legendre(const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __n, + const simd<_Tp, _Abi>& __x) + { + return simd<_Tp, _Abi>( + [&](auto __i) { return std::legendre(__n[__i], __x[__i]); }); + } + +_GLIBCXX_SIMD_MATH_CALL_(riemann_zeta) + +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + sph_bessel(const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __n, + const simd<_Tp, _Abi>& __x) + { + return simd<_Tp, _Abi>( + [&](auto __i) { return std::sph_bessel(__n[__i], __x[__i]); }); + } + +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + sph_legendre(const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __l, + const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __m, + const simd<_Tp, _Abi>& __theta) + { + return simd<_Tp, _Abi>([&](auto __i) { + return std::sph_legendre(__l[__i], __m[__i], __theta[__i]); + }); + } + +template <typename _Tp, typename _Abi> + enable_if_t<is_floating_point_v<_Tp>, simd<_Tp, _Abi>> + sph_neumann(const fixed_size_simd<unsigned, simd_size_v<_Tp, _Abi>>& __n, + const simd<_Tp, _Abi>& __x) + { + return simd<_Tp, _Abi>( + [&](auto __i) { return std::sph_neumann(__n[__i], __x[__i]); }); + } +// }}} + +#undef _GLIBCXX_SIMD_MATH_CALL_ +#undef _GLIBCXX_SIMD_MATH_CALL2_ +#undef _GLIBCXX_SIMD_MATH_CALL3_ + +_GLIBCXX_SIMD_END_NAMESPACE + +#endif // __cplusplus >= 201703L +#endif // _GLIBCXX_EXPERIMENTAL_SIMD_MATH_H_ + +// vim: foldmethod=marker sw=2 ts=8 noet sts=2 diff --git a/libstdc++-v3/include/experimental/bits/simd_neon.h b/libstdc++-v3/include/experimental/bits/simd_neon.h new file mode 100644 index 00000000000..a3a8ffe165f --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd_neon.h @@ -0,0 +1,519 @@ +// Simd NEON specific implementations -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. 
+ +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. + +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_NEON_H_ +#define _GLIBCXX_EXPERIMENTAL_SIMD_NEON_H_ + +#if __cplusplus >= 201703L + +#if !_GLIBCXX_SIMD_HAVE_NEON +#error "simd_neon.h may only be included when NEON on ARM is available" +#endif + +_GLIBCXX_SIMD_BEGIN_NAMESPACE + +// _CommonImplNeon {{{ +struct _CommonImplNeon : _CommonImplBuiltin +{ + // _S_store {{{ + using _CommonImplBuiltin::_S_store; + + // }}} +}; + +// }}} +// _SimdImplNeon {{{ +template <typename _Abi> + struct _SimdImplNeon : _SimdImplBuiltin<_Abi> + { + using _Base = _SimdImplBuiltin<_Abi>; + + template <typename _Tp> + using _MaskMember = typename _Base::template _MaskMember<_Tp>; + + template <typename _Tp> + static constexpr size_t _S_max_store_size = 16; + + // _S_masked_load {{{ + template <typename _Tp, size_t _Np, typename _Up> + static inline _SimdWrapper<_Tp, _Np> + _S_masked_load(_SimdWrapper<_Tp, _Np> __merge, _MaskMember<_Tp> __k, + const _Up* __mem) noexcept + { + __execute_n_times<_Np>([&](auto __i) { + if (__k[__i] != 0) + __merge._M_set(__i, static_cast<_Tp>(__mem[__i])); + }); + return __merge; + } + + // }}} + // _S_masked_store_nocvt {{{ + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_store_nocvt(_SimdWrapper<_Tp, _Np> __v, _Tp* __mem, + _MaskMember<_Tp> __k) + { + __execute_n_times<_Np>([&](auto __i) { + if (__k[__i] != 0) + __mem[__i] = __v[__i]; + }); + } + + // }}} + // _S_reduce {{{ + template <typename _Tp, typename _BinaryOperation> + _GLIBCXX_SIMD_INTRINSIC static _Tp + _S_reduce(simd<_Tp, _Abi> __x, _BinaryOperation&& __binary_op) + { + constexpr size_t _Np = __x.size(); + if constexpr (sizeof(__x) == 16 && _Np >= 4 + && !_Abi::template _S_is_partial<_Tp>) + { + const auto __halves = split<simd<_Tp, simd_abi::_Neon<8>>>(__x); + const auto __y = __binary_op(__halves[0], __halves[1]); + return _SimdImplNeon<simd_abi::_Neon<8>>::_S_reduce( + __y, static_cast<_BinaryOperation&&>(__binary_op)); + } + else if constexpr (_Np == 8) + { + __x = __binary_op(__x, _Base::template _M_make_simd<_Tp, _Np>( + __vector_permute<1, 0, 3, 2, 5, 4, 7, 6>( + __x._M_data))); + __x = __binary_op(__x, _Base::template _M_make_simd<_Tp, _Np>( + __vector_permute<3, 2, 1, 0, 7, 6, 5, 4>( + __x._M_data))); + __x = __binary_op(__x, _Base::template _M_make_simd<_Tp, _Np>( + __vector_permute<7, 6, 5, 4, 3, 2, 1, 0>( + __x._M_data))); + return __x[0]; + } + else if constexpr (_Np == 4) + { + __x + = __binary_op(__x, _Base::template _M_make_simd<_Tp, _Np>( + __vector_permute<1, 0, 3, 2>(__x._M_data))); + __x + = __binary_op(__x, _Base::template _M_make_simd<_Tp, _Np>( + __vector_permute<3, 2, 1, 0>(__x._M_data))); + return __x[0]; + } + else if constexpr (_Np == 2) + { + __x = __binary_op(__x, _Base::template _M_make_simd<_Tp, _Np>( + __vector_permute<1, 0>(__x._M_data))); + return __x[0]; + } + else + 
return _Base::_S_reduce(__x, + static_cast<_BinaryOperation&&>(__binary_op)); + } + + // }}} + // math {{{ + // _S_sqrt {{{ + template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_sqrt(_Tp __x) + { + if constexpr (__have_neon_a64) + { + const auto __intrin = __to_intrin(__x); + if constexpr (_TVT::template _S_is<float, 2>) + return vsqrt_f32(__intrin); + else if constexpr (_TVT::template _S_is<float, 4>) + return vsqrtq_f32(__intrin); + else if constexpr (_TVT::template _S_is<double, 1>) + return vsqrt_f64(__intrin); + else if constexpr (_TVT::template _S_is<double, 2>) + return vsqrtq_f64(__intrin); + else + __assert_unreachable<_Tp>(); + } + else + return _Base::_S_sqrt(__x); + } + + // }}} + // _S_trunc {{{ + template <typename _TW, typename _TVT = _VectorTraits<_TW>> + _GLIBCXX_SIMD_INTRINSIC static _TW _S_trunc(_TW __x) + { + using _Tp = typename _TVT::value_type; + if constexpr (__have_neon_a32) + { + const auto __intrin = __to_intrin(__x); + if constexpr (_TVT::template _S_is<float, 2>) + return vrnd_f32(__intrin); + else if constexpr (_TVT::template _S_is<float, 4>) + return vrndq_f32(__intrin); + else if constexpr (_TVT::template _S_is<double, 1>) + return vrnd_f64(__intrin); + else if constexpr (_TVT::template _S_is<double, 2>) + return vrndq_f64(__intrin); + else + __assert_unreachable<_Tp>(); + } + else if constexpr (is_same_v<_Tp, float>) + { + auto __intrin = __to_intrin(__x); + if constexpr (sizeof(__x) == 16) + __intrin = vcvtq_f32_s32(vcvtq_s32_f32(__intrin)); + else + __intrin = vcvt_f32_s32(vcvt_s32_f32(__intrin)); + return _Base::_S_abs(__x)._M_data < 0x1p23f + ? __vector_bitcast<float>(__intrin) + : __x._M_data; + } + else + return _Base::_S_trunc(__x); + } + + // }}} + // _S_round {{{ + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static _SimdWrapper<_Tp, _Np> + _S_round(_SimdWrapper<_Tp, _Np> __x) + { + if constexpr (__have_neon_a32) + { + const auto __intrin = __to_intrin(__x); + if constexpr (sizeof(_Tp) == 4 && sizeof(__x) == 8) + return vrnda_f32(__intrin); + else if constexpr (sizeof(_Tp) == 4 && sizeof(__x) == 16) + return vrndaq_f32(__intrin); + else if constexpr (sizeof(_Tp) == 8 && sizeof(__x) == 8) + return vrnda_f64(__intrin); + else if constexpr (sizeof(_Tp) == 8 && sizeof(__x) == 16) + return vrndaq_f64(__intrin); + else + __assert_unreachable<_Tp>(); + } + else + return _Base::_S_round(__x); + } + + // }}} + // _S_floor {{{ + template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_floor(_Tp __x) + { + if constexpr (__have_neon_a32) + { + const auto __intrin = __to_intrin(__x); + if constexpr (_TVT::template _S_is<float, 2>) + return vrndm_f32(__intrin); + else if constexpr (_TVT::template _S_is<float, 4>) + return vrndmq_f32(__intrin); + else if constexpr (_TVT::template _S_is<double, 1>) + return vrndm_f64(__intrin); + else if constexpr (_TVT::template _S_is<double, 2>) + return vrndmq_f64(__intrin); + else + __assert_unreachable<_Tp>(); + } + else + return _Base::_S_floor(__x); + } + + // }}} + // _S_ceil {{{ + template <typename _Tp, typename _TVT = _VectorTraits<_Tp>> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_ceil(_Tp __x) + { + if constexpr (__have_neon_a32) + { + const auto __intrin = __to_intrin(__x); + if constexpr (_TVT::template _S_is<float, 2>) + return vrndp_f32(__intrin); + else if constexpr (_TVT::template _S_is<float, 4>) + return vrndpq_f32(__intrin); + else if constexpr (_TVT::template _S_is<double, 1>) + return vrndp_f64(__intrin); 
+ else if constexpr (_TVT::template _S_is<double, 2>) + return vrndpq_f64(__intrin); + else + __assert_unreachable<_Tp>(); + } + else + return _Base::_S_ceil(__x); + } + + //}}} }}} + }; // }}} +// _MaskImplNeonMixin {{{ +struct _MaskImplNeonMixin +{ + using _Base = _MaskImplBuiltinMixin; + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SanitizedBitMask<_Np> + _S_to_bits(_SimdWrapper<_Tp, _Np> __x) + { + if (__builtin_is_constant_evaluated()) + return _Base::_S_to_bits(__x); + + using _I = __int_for_sizeof_t<_Tp>; + if constexpr (sizeof(__x) == 16) + { + auto __asint = __vector_bitcast<_I>(__x); +#ifdef __aarch64__ + [[maybe_unused]] constexpr auto __zero = decltype(__asint)(); +#else + [[maybe_unused]] constexpr auto __zero = decltype(__lo64(__asint))(); +#endif + if constexpr (sizeof(_Tp) == 1) + { + constexpr auto __bitsel + = __generate_from_n_evaluations<16, __vector_type_t<_I, 16>>( + [&](auto __i) { + return static_cast<_I>( + __i < _Np ? (__i < 8 ? 1 << __i : 1 << (__i - 8)) : 0); + }); + __asint &= __bitsel; +#ifdef __aarch64__ + return __vector_bitcast<_UShort>( + vpaddq_s8(vpaddq_s8(vpaddq_s8(__asint, __zero), __zero), + __zero))[0]; +#else + return __vector_bitcast<_UShort>( + vpadd_s8(vpadd_s8(vpadd_s8(__lo64(__asint), __hi64(__asint)), + __zero), + __zero))[0]; +#endif + } + else if constexpr (sizeof(_Tp) == 2) + { + constexpr auto __bitsel + = __generate_from_n_evaluations<8, __vector_type_t<_I, 8>>( + [&](auto __i) { + return static_cast<_I>(__i < _Np ? 1 << __i : 0); + }); + __asint &= __bitsel; +#ifdef __aarch64__ + return vpaddq_s16(vpaddq_s16(vpaddq_s16(__asint, __zero), __zero), + __zero)[0]; +#else + return vpadd_s16( + vpadd_s16(vpadd_s16(__lo64(__asint), __hi64(__asint)), __zero), + __zero)[0]; +#endif + } + else if constexpr (sizeof(_Tp) == 4) + { + constexpr auto __bitsel + = __generate_from_n_evaluations<4, __vector_type_t<_I, 4>>( + [&](auto __i) { + return static_cast<_I>(__i < _Np ? 1 << __i : 0); + }); + __asint &= __bitsel; +#ifdef __aarch64__ + return vpaddq_s32(vpaddq_s32(__asint, __zero), __zero)[0]; +#else + return vpadd_s32(vpadd_s32(__lo64(__asint), __hi64(__asint)), + __zero)[0]; +#endif + } + else if constexpr (sizeof(_Tp) == 8) + return (__asint[0] & 1) | (__asint[1] & 2); + else + __assert_unreachable<_Tp>(); + } + else if constexpr (sizeof(__x) == 8) + { + auto __asint = __vector_bitcast<_I>(__x); + [[maybe_unused]] constexpr auto __zero = decltype(__asint)(); + if constexpr (sizeof(_Tp) == 1) + { + constexpr auto __bitsel + = __generate_from_n_evaluations<8, __vector_type_t<_I, 8>>( + [&](auto __i) { + return static_cast<_I>(__i < _Np ? 1 << __i : 0); + }); + __asint &= __bitsel; + return vpadd_s8(vpadd_s8(vpadd_s8(__asint, __zero), __zero), + __zero)[0]; + } + else if constexpr (sizeof(_Tp) == 2) + { + constexpr auto __bitsel + = __generate_from_n_evaluations<4, __vector_type_t<_I, 4>>( + [&](auto __i) { + return static_cast<_I>(__i < _Np ? 
1 << __i : 0); + }); + __asint &= __bitsel; + return vpadd_s16(vpadd_s16(__asint, __zero), __zero)[0]; + } + else if constexpr (sizeof(_Tp) == 4) + { + __asint &= __make_vector<_I>(0x1, 0x2); + return vpadd_s32(__asint, __zero)[0]; + } + else + __assert_unreachable<_Tp>(); + } + else + return _Base::_S_to_bits(__x); + } +}; + +// }}} +// _MaskImplNeon {{{ +template <typename _Abi> + struct _MaskImplNeon : _MaskImplNeonMixin, _MaskImplBuiltin<_Abi> + { + using _MaskImplBuiltinMixin::_S_to_maskvector; + using _MaskImplNeonMixin::_S_to_bits; + using _Base = _MaskImplBuiltin<_Abi>; + using _Base::_S_convert; + + // _S_all_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool _S_all_of(simd_mask<_Tp, _Abi> __k) + { + const auto __kk + = __vector_bitcast<char>(__k._M_data) + | ~__vector_bitcast<char>(_Abi::template _S_implicit_mask<_Tp>()); + if constexpr (sizeof(__k) == 16) + { + const auto __x = __vector_bitcast<long long>(__kk); + return __x[0] + __x[1] == -2; + } + else if constexpr (sizeof(__k) <= 8) + return __bit_cast<__int_for_sizeof_t<decltype(__kk)>>(__kk) == -1; + else + __assert_unreachable<_Tp>(); + } + + // }}} + // _S_any_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool _S_any_of(simd_mask<_Tp, _Abi> __k) + { + const auto __kk + = __vector_bitcast<char>(__k._M_data) + | ~__vector_bitcast<char>(_Abi::template _S_implicit_mask<_Tp>()); + if constexpr (sizeof(__k) == 16) + { + const auto __x = __vector_bitcast<long long>(__kk); + return (__x[0] | __x[1]) != 0; + } + else if constexpr (sizeof(__k) <= 8) + return __bit_cast<__int_for_sizeof_t<decltype(__kk)>>(__kk) != 0; + else + __assert_unreachable<_Tp>(); + } + + // }}} + // _S_none_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool _S_none_of(simd_mask<_Tp, _Abi> __k) + { + const auto __kk = _Abi::_S_masked(__k._M_data); + if constexpr (sizeof(__k) == 16) + { + const auto __x = __vector_bitcast<long long>(__kk); + return (__x[0] | __x[1]) == 0; + } + else if constexpr (sizeof(__k) <= 8) + return __bit_cast<__int_for_sizeof_t<decltype(__kk)>>(__kk) == 0; + else + __assert_unreachable<_Tp>(); + } + + // }}} + // _S_some_of {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static bool _S_some_of(simd_mask<_Tp, _Abi> __k) + { + if constexpr (sizeof(__k) <= 8) + { + const auto __kk = __vector_bitcast<char>(__k._M_data) + | ~__vector_bitcast<char>( + _Abi::template _S_implicit_mask<_Tp>()); + using _Up = make_unsigned_t<__int_for_sizeof_t<decltype(__kk)>>; + return __bit_cast<_Up>(__kk) + 1 > 1; + } + else + return _Base::_S_some_of(__k); + } + + // }}} + // _S_popcount {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int _S_popcount(simd_mask<_Tp, _Abi> __k) + { + if constexpr (sizeof(_Tp) == 1) + { + const auto __s8 = __vector_bitcast<_SChar>(__k._M_data); + int8x8_t __tmp = __lo64(__s8) + __hi64z(__s8); + return -vpadd_s8(vpadd_s8(vpadd_s8(__tmp, int8x8_t()), int8x8_t()), + int8x8_t())[0]; + } + else if constexpr (sizeof(_Tp) == 2) + { + const auto __s16 = __vector_bitcast<short>(__k._M_data); + int16x4_t __tmp = __lo64(__s16) + __hi64z(__s16); + return -vpadd_s16(vpadd_s16(__tmp, int16x4_t()), int16x4_t())[0]; + } + else if constexpr (sizeof(_Tp) == 4) + { + const auto __s32 = __vector_bitcast<int>(__k._M_data); + int32x2_t __tmp = __lo64(__s32) + __hi64z(__s32); + return -vpadd_s32(__tmp, int32x2_t())[0]; + } + else if constexpr (sizeof(_Tp) == 8) + { + static_assert(sizeof(__k) == 16); + const auto __s64 = __vector_bitcast<long>(__k._M_data); + return 
-(__s64[0] + __s64[1]); + } + } + + // }}} + // _S_find_first_set {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int + _S_find_first_set(simd_mask<_Tp, _Abi> __k) + { + // TODO: the _Base implementation is not optimal for NEON + return _Base::_S_find_first_set(__k); + } + + // }}} + // _S_find_last_set {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static int + _S_find_last_set(simd_mask<_Tp, _Abi> __k) + { + // TODO: the _Base implementation is not optimal for NEON + return _Base::_S_find_last_set(__k); + } + + // }}} + }; // }}} + +_GLIBCXX_SIMD_END_NAMESPACE +#endif // __cplusplus >= 201703L +#endif // _GLIBCXX_EXPERIMENTAL_SIMD_NEON_H_ +// vim: foldmethod=marker sw=2 noet ts=8 sts=2 tw=80 diff --git a/libstdc++-v3/include/experimental/bits/simd_ppc.h b/libstdc++-v3/include/experimental/bits/simd_ppc.h new file mode 100644 index 00000000000..c00d2323ac6 --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd_ppc.h @@ -0,0 +1,123 @@ +// Simd PowerPC specific implementations -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. + +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. + +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_PPC_H_ +#define _GLIBCXX_EXPERIMENTAL_SIMD_PPC_H_ + +#if __cplusplus >= 201703L + +#ifndef __ALTIVEC__ +#error "simd_ppc.h may only be included when AltiVec/VMX is available" +#endif + +_GLIBCXX_SIMD_BEGIN_NAMESPACE + +// _SimdImplPpc {{{ +template <typename _Abi> + struct _SimdImplPpc : _SimdImplBuiltin<_Abi> + { + using _Base = _SimdImplBuiltin<_Abi>; + + // Byte and halfword shift instructions on PPC only consider the low 3 or 4 + // bits of the RHS. Consequently, shifting by sizeof(_Tp)*CHAR_BIT (or more) + // is UB without extra measures. To match scalar behavior, byte and halfword + // shifts need an extra fixup step. 
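+    // E.g. for simd<unsigned char>, a shift by 8 would execute as a shift
+    // by 8 % 8 == 0 in hardware (vslb considers only the low 3 bits of
+    // each shift count), returning the left operand unchanged, whereas the
+    // expected element-wise result is 0. The fixups below therefore zero
+    // the result (or clamp the count for signed right shifts) whenever the
+    // count is >= the element width.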
+ + // _S_bit_shift_left {{{ + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_shift_left(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { + __x = _Base::_S_bit_shift_left(__x, __y); + if constexpr (sizeof(_Tp) < sizeof(int)) + __x._M_data + = (__y._M_data < sizeof(_Tp) * __CHAR_BIT__) & __x._M_data; + return __x; + } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_shift_left(_SimdWrapper<_Tp, _Np> __x, int __y) + { + __x = _Base::_S_bit_shift_left(__x, __y); + if constexpr (sizeof(_Tp) < sizeof(int)) + { + if (__y >= sizeof(_Tp) * __CHAR_BIT__) + return {}; + } + return __x; + } + + // }}} + // _S_bit_shift_right {{{ + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_shift_right(_SimdWrapper<_Tp, _Np> __x, _SimdWrapper<_Tp, _Np> __y) + { + if constexpr (sizeof(_Tp) < sizeof(int)) + { + constexpr int __nbits = sizeof(_Tp) * __CHAR_BIT__; + if constexpr (is_unsigned_v<_Tp>) + return (__y._M_data < __nbits) + & _Base::_S_bit_shift_right(__x, __y)._M_data; + else + { + _Base::_S_masked_assign(_SimdWrapper<_Tp, _Np>(__y._M_data + >= __nbits), + __y, __nbits - 1); + return _Base::_S_bit_shift_right(__x, __y); + } + } + else + return _Base::_S_bit_shift_right(__x, __y); + } + + template <typename _Tp, size_t _Np> + _GLIBCXX_SIMD_INTRINSIC static constexpr _SimdWrapper<_Tp, _Np> + _S_bit_shift_right(_SimdWrapper<_Tp, _Np> __x, int __y) + { + if constexpr (sizeof(_Tp) < sizeof(int)) + { + constexpr int __nbits = sizeof(_Tp) * __CHAR_BIT__; + if (__y >= __nbits) + { + if constexpr (is_unsigned_v<_Tp>) + return {}; + else + return _Base::_S_bit_shift_right(__x, __nbits - 1); + } + } + return _Base::_S_bit_shift_right(__x, __y); + } + + // }}} + }; + +// }}} + +_GLIBCXX_SIMD_END_NAMESPACE +#endif // __cplusplus >= 201703L +#endif // _GLIBCXX_EXPERIMENTAL_SIMD_PPC_H_ + +// vim: foldmethod=marker sw=2 noet ts=8 sts=2 tw=80 diff --git a/libstdc++-v3/include/experimental/bits/simd_scalar.h b/libstdc++-v3/include/experimental/bits/simd_scalar.h new file mode 100644 index 00000000000..7680bc39c30 --- /dev/null +++ b/libstdc++-v3/include/experimental/bits/simd_scalar.h @@ -0,0 +1,772 @@ +// Simd scalar ABI specific implementations -*- C++ -*- + +// Copyright (C) 2020 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. + +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// Under Section 7 of GPL version 3, you are granted additional +// permissions described in the GCC Runtime Library Exception, version +// 3.1, as published by the Free Software Foundation. + +// You should have received a copy of the GNU General Public License and +// a copy of the GCC Runtime Library Exception along with this program; +// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +// <http://www.gnu.org/licenses/>. 
+ +#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_SCALAR_H_ +#define _GLIBCXX_EXPERIMENTAL_SIMD_SCALAR_H_ +#if __cplusplus >= 201703L + +#include <cmath> + +_GLIBCXX_SIMD_BEGIN_NAMESPACE + +// __promote_preserving_unsigned{{{ +// work around crazy semantics of unsigned integers of lower rank than int: +// Before applying an operator the operands are promoted to int. In which case +// over- or underflow is UB, even though the operand types were unsigned. +template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr decltype(auto) + __promote_preserving_unsigned(const _Tp& __x) + { + if constexpr (is_signed_v<decltype(+__x)> && is_unsigned_v<_Tp>) + return static_cast<unsigned int>(__x); + else + return __x; + } + +// }}} + +struct _CommonImplScalar; +struct _CommonImplBuiltin; +struct _SimdImplScalar; +struct _MaskImplScalar; + +// simd_abi::_Scalar {{{ +struct simd_abi::_Scalar +{ + template <typename _Tp> + static constexpr size_t _S_size = 1; + + template <typename _Tp> + static constexpr size_t _S_full_size = 1; + + template <typename _Tp> + static constexpr bool _S_is_partial = false; + + struct _IsValidAbiTag : true_type {}; + + template <typename _Tp> + struct _IsValidSizeFor : true_type {}; + + template <typename _Tp> + struct _IsValid : __is_vectorizable<_Tp> {}; + + template <typename _Tp> + static constexpr bool _S_is_valid_v = _IsValid<_Tp>::value; + + _GLIBCXX_SIMD_INTRINSIC static constexpr bool _S_masked(bool __x) + { return __x; } + + using _CommonImpl = _CommonImplScalar; + using _SimdImpl = _SimdImplScalar; + using _MaskImpl = _MaskImplScalar; + + template <typename _Tp, bool = _S_is_valid_v<_Tp>> + struct __traits : _InvalidTraits {}; + + template <typename _Tp> + struct __traits<_Tp, true> + { + using _IsValid = true_type; + using _SimdImpl = _SimdImplScalar; + using _MaskImpl = _MaskImplScalar; + using _SimdMember = _Tp; + using _MaskMember = bool; + + static constexpr size_t _S_simd_align = alignof(_SimdMember); + static constexpr size_t _S_mask_align = alignof(_MaskMember); + + // nothing the user can spell converts to/from simd/simd_mask + struct _SimdCastType { _SimdCastType() = delete; }; + struct _MaskCastType { _MaskCastType() = delete; }; + struct _SimdBase {}; + struct _MaskBase {}; + }; +}; + +// }}} +// _CommonImplScalar {{{ +struct _CommonImplScalar +{ + // _S_store {{{ + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static void _S_store(_Tp __x, void* __addr) + { __builtin_memcpy(__addr, &__x, sizeof(_Tp)); } + + // }}} + // _S_store_bool_array(_BitMask) {{{ + template <size_t _Np, bool _Sanitized> + _GLIBCXX_SIMD_INTRINSIC static constexpr void + _S_store_bool_array(_BitMask<_Np, _Sanitized> __x, bool* __mem) + { + __make_dependent_t<decltype(__x), _CommonImplBuiltin>::_S_store_bool_array( + __x, __mem); + } + + // }}} +}; + +// }}} +// _SimdImplScalar {{{ +struct _SimdImplScalar +{ + // member types {{{2 + using abi_type = simd_abi::scalar; + + template <typename _Tp> + using _TypeTag = _Tp*; + + // _S_broadcast {{{2 + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr _Tp _S_broadcast(_Tp __x) noexcept + { return __x; } + + // _S_generator {{{2 + template <typename _Fp, typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static constexpr _Tp _S_generator(_Fp&& __gen, + _TypeTag<_Tp>) + { return __gen(_SizeConstant<0>()); } + + // _S_load {{{2 + template <typename _Tp, typename _Up> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_load(const _Up* __mem, + _TypeTag<_Tp>) noexcept + { return static_cast<_Tp>(__mem[0]); } + + // _S_masked_load {{{2 + template <typename 
_Tp, typename _Up> + static inline _Tp _S_masked_load(_Tp __merge, bool __k, + const _Up* __mem) noexcept + { + if (__k) + __merge = static_cast<_Tp>(__mem[0]); + return __merge; + } + + // _S_store {{{2 + template <typename _Tp, typename _Up> + static inline void _S_store(_Tp __v, _Up* __mem, _TypeTag<_Tp>) noexcept + { __mem[0] = static_cast<_Up>(__v); } + + // _S_masked_store {{{2 + template <typename _Tp, typename _Up> + static inline void _S_masked_store(const _Tp __v, _Up* __mem, + const bool __k) noexcept + { if (__k) __mem[0] = __v; } + + // _S_negate {{{2 + template <typename _Tp> + static constexpr inline bool _S_negate(_Tp __x) noexcept + { return !__x; } + + // _S_reduce {{{2 + template <typename _Tp, typename _BinaryOperation> + static constexpr inline _Tp + _S_reduce(const simd<_Tp, simd_abi::scalar>& __x, _BinaryOperation&) + { return __x._M_data; } + + // _S_min, _S_max {{{2 + template <typename _Tp> + static constexpr inline _Tp _S_min(const _Tp __a, const _Tp __b) + { return std::min(__a, __b); } + + template <typename _Tp> + static constexpr inline _Tp _S_max(const _Tp __a, const _Tp __b) + { return std::max(__a, __b); } + + // _S_complement {{{2 + template <typename _Tp> + static constexpr inline _Tp _S_complement(_Tp __x) noexcept + { return static_cast<_Tp>(~__x); } + + // _S_unary_minus {{{2 + template <typename _Tp> + static constexpr inline _Tp _S_unary_minus(_Tp __x) noexcept + { return static_cast<_Tp>(-__x); } + + // arithmetic operators {{{2 + template <typename _Tp> + static constexpr inline _Tp _S_plus(_Tp __x, _Tp __y) + { + return static_cast<_Tp>(__promote_preserving_unsigned(__x) + + __promote_preserving_unsigned(__y)); + } + + template <typename _Tp> + static constexpr inline _Tp _S_minus(_Tp __x, _Tp __y) + { + return static_cast<_Tp>(__promote_preserving_unsigned(__x) + - __promote_preserving_unsigned(__y)); + } + + template <typename _Tp> + static constexpr inline _Tp _S_multiplies(_Tp __x, _Tp __y) + { + return static_cast<_Tp>(__promote_preserving_unsigned(__x) + * __promote_preserving_unsigned(__y)); + } + + template <typename _Tp> + static constexpr inline _Tp _S_divides(_Tp __x, _Tp __y) + { + return static_cast<_Tp>(__promote_preserving_unsigned(__x) + / __promote_preserving_unsigned(__y)); + } + + template <typename _Tp> + static constexpr inline _Tp _S_modulus(_Tp __x, _Tp __y) + { + return static_cast<_Tp>(__promote_preserving_unsigned(__x) + % __promote_preserving_unsigned(__y)); + } + + template <typename _Tp> + static constexpr inline _Tp _S_bit_and(_Tp __x, _Tp __y) + { + if constexpr (is_floating_point_v<_Tp>) + { + using _Ip = __int_for_sizeof_t<_Tp>; + return __bit_cast<_Tp>(__bit_cast<_Ip>(__x) & __bit_cast<_Ip>(__y)); + } + else + return static_cast<_Tp>(__promote_preserving_unsigned(__x) + & __promote_preserving_unsigned(__y)); + } + + template <typename _Tp> + static constexpr inline _Tp _S_bit_or(_Tp __x, _Tp __y) + { + if constexpr (is_floating_point_v<_Tp>) + { + using _Ip = __int_for_sizeof_t<_Tp>; + return __bit_cast<_Tp>(__bit_cast<_Ip>(__x) | __bit_cast<_Ip>(__y)); + } + else + return static_cast<_Tp>(__promote_preserving_unsigned(__x) + | __promote_preserving_unsigned(__y)); + } + + template <typename _Tp> + static constexpr inline _Tp _S_bit_xor(_Tp __x, _Tp __y) + { + if constexpr (is_floating_point_v<_Tp>) + { + using _Ip = __int_for_sizeof_t<_Tp>; + return __bit_cast<_Tp>(__bit_cast<_Ip>(__x) ^ __bit_cast<_Ip>(__y)); + } + else + return static_cast<_Tp>(__promote_preserving_unsigned(__x) + ^ 
__promote_preserving_unsigned(__y)); + } + + template <typename _Tp> + static constexpr inline _Tp _S_bit_shift_left(_Tp __x, int __y) + { return static_cast<_Tp>(__promote_preserving_unsigned(__x) << __y); } + + template <typename _Tp> + static constexpr inline _Tp _S_bit_shift_right(_Tp __x, int __y) + { return static_cast<_Tp>(__promote_preserving_unsigned(__x) >> __y); } + + // math {{{2 + // frexp, modf and copysign implemented in simd_math.h + template <typename _Tp> + using _ST = _SimdTuple<_Tp, simd_abi::scalar>; + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_acos(_Tp __x) + { return std::acos(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_asin(_Tp __x) + { return std::asin(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_atan(_Tp __x) + { return std::atan(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_cos(_Tp __x) + { return std::cos(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_sin(_Tp __x) + { return std::sin(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_tan(_Tp __x) + { return std::tan(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_acosh(_Tp __x) + { return std::acosh(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_asinh(_Tp __x) + { return std::asinh(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_atanh(_Tp __x) + { return std::atanh(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_cosh(_Tp __x) + { return std::cosh(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_sinh(_Tp __x) + { return std::sinh(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_tanh(_Tp __x) + { return std::tanh(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_atan2(_Tp __x, _Tp __y) + { return std::atan2(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_exp(_Tp __x) + { return std::exp(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_exp2(_Tp __x) + { return std::exp2(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_expm1(_Tp __x) + { return std::expm1(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_log(_Tp __x) + { return std::log(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_log10(_Tp __x) + { return std::log10(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_log1p(_Tp __x) + { return std::log1p(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_log2(_Tp __x) + { return std::log2(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_logb(_Tp __x) + { return std::logb(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _ST<int> _S_ilogb(_Tp __x) + { return {std::ilogb(__x)}; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_pow(_Tp __x, _Tp __y) + { return std::pow(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_abs(_Tp __x) + { return std::abs(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_fabs(_Tp __x) + { return std::fabs(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_sqrt(_Tp __x) + { return std::sqrt(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_cbrt(_Tp __x) + { return 
std::cbrt(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_erf(_Tp __x) + { return std::erf(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_erfc(_Tp __x) + { return std::erfc(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_lgamma(_Tp __x) + { return std::lgamma(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_tgamma(_Tp __x) + { return std::tgamma(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_trunc(_Tp __x) + { return std::trunc(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_floor(_Tp __x) + { return std::floor(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_ceil(_Tp __x) + { return std::ceil(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_nearbyint(_Tp __x) + { return std::nearbyint(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_rint(_Tp __x) + { return std::rint(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _ST<long> _S_lrint(_Tp __x) + { return {std::lrint(__x)}; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _ST<long long> _S_llrint(_Tp __x) + { return {std::llrint(__x)}; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_round(_Tp __x) + { return std::round(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _ST<long> _S_lround(_Tp __x) + { return {std::lround(__x)}; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _ST<long long> _S_llround(_Tp __x) + { return {std::llround(__x)}; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_ldexp(_Tp __x, _ST<int> __y) + { return std::ldexp(__x, __y.first); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_scalbn(_Tp __x, _ST<int> __y) + { return std::scalbn(__x, __y.first); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_scalbln(_Tp __x, _ST<long> __y) + { return std::scalbln(__x, __y.first); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_fmod(_Tp __x, _Tp __y) + { return std::fmod(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_remainder(_Tp __x, _Tp __y) + { return std::remainder(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_nextafter(_Tp __x, _Tp __y) + { return std::nextafter(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_fdim(_Tp __x, _Tp __y) + { return std::fdim(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_fmax(_Tp __x, _Tp __y) + { return std::fmax(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_fmin(_Tp __x, _Tp __y) + { return std::fmin(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_fma(_Tp __x, _Tp __y, _Tp __z) + { return std::fma(__x, __y, __z); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC static _Tp _S_remquo(_Tp __x, _Tp __y, _ST<int>* __z) + { return std::remquo(__x, __y, &__z->first); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static _ST<int> _S_fpclassify(_Tp __x) + { return {std::fpclassify(__x)}; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_isfinite(_Tp __x) + { return std::isfinite(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_isinf(_Tp __x) + { return std::isinf(__x); } + + template <typename 
_Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_isnan(_Tp __x) + { return std::isnan(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_isnormal(_Tp __x) + { return std::isnormal(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_signbit(_Tp __x) + { return std::signbit(__x); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_isgreater(_Tp __x, _Tp __y) + { return std::isgreater(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_isgreaterequal(_Tp __x, + _Tp __y) + { return std::isgreaterequal(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_isless(_Tp __x, _Tp __y) + { return std::isless(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_islessequal(_Tp __x, _Tp __y) + { return std::islessequal(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_islessgreater(_Tp __x, + _Tp __y) + { return std::islessgreater(__x, __y); } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_isunordered(_Tp __x, + _Tp __y) + { return std::isunordered(__x, __y); } + + // _S_increment & _S_decrement{{{2 + template <typename _Tp> + constexpr static inline void _S_increment(_Tp& __x) + { ++__x; } + + template <typename _Tp> + constexpr static inline void _S_decrement(_Tp& __x) + { --__x; } + + + // compares {{{2 + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_equal_to(_Tp __x, _Tp __y) + { return __x == __y; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_not_equal_to(_Tp __x, + _Tp __y) + { return __x != __y; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_less(_Tp __x, _Tp __y) + { return __x < __y; } + + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool _S_less_equal(_Tp __x, + _Tp __y) + { return __x <= __y; } + + // smart_reference access {{{2 + template <typename _Tp, typename _Up> + constexpr static void _S_set(_Tp& __v, [[maybe_unused]] int __i, + _Up&& __x) noexcept + { + _GLIBCXX_DEBUG_ASSERT(__i == 0); + __v = static_cast<_Up&&>(__x); + } + + // _S_masked_assign {{{2 + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static void + _S_masked_assign(bool __k, _Tp& __lhs, _Tp __rhs) + { if (__k) __lhs = __rhs; } + + // _S_masked_cassign {{{2 + template <typename _Op, typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static void + _S_masked_cassign(const bool __k, _Tp& __lhs, const _Tp __rhs, _Op __op) + { if (__k) __lhs = __op(_SimdImplScalar{}, __lhs, __rhs); } + + // _S_masked_unary {{{2 + template <template <typename> class _Op, typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static _Tp _S_masked_unary(const bool __k, + const _Tp __v) + { return static_cast<_Tp>(__k ? 
_Op<_Tp>{}(__v) : __v); } + + // }}}2 +}; + +// }}} +// _MaskImplScalar {{{ +struct _MaskImplScalar +{ + // member types {{{ + template <typename _Tp> + using _TypeTag = _Tp*; + + // }}} + // _S_broadcast {{{ + template <typename> + _GLIBCXX_SIMD_INTRINSIC static constexpr bool _S_broadcast(bool __x) + { return __x; } + + // }}} + // _S_load {{{ + template <typename> + _GLIBCXX_SIMD_INTRINSIC static constexpr bool _S_load(const bool* __mem) + { return __mem[0]; } + + // }}} + // _S_to_bits {{{ + _GLIBCXX_SIMD_INTRINSIC static constexpr _SanitizedBitMask<1> + _S_to_bits(bool __x) + { return __x; } + + // }}} + // _S_convert {{{ + template <typename, bool _Sanitized> + _GLIBCXX_SIMD_INTRINSIC static constexpr bool + _S_convert(_BitMask<1, _Sanitized> __x) + { return __x[0]; } + + template <typename, typename _Up, typename _UAbi> + _GLIBCXX_SIMD_INTRINSIC static constexpr bool + _S_convert(simd_mask<_Up, _UAbi> __x) + { return __x[0]; } + + // }}} + // _S_from_bitmask {{{2 + template <typename _Tp> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool + _S_from_bitmask(_SanitizedBitMask<1> __bits, _TypeTag<_Tp>) noexcept + { return __bits[0]; } + + // _S_masked_load {{{2 + _GLIBCXX_SIMD_INTRINSIC constexpr static bool + _S_masked_load(bool __merge, bool __mask, const bool* __mem) noexcept + { + if (__mask) + __merge = __mem[0]; + return __merge; + } + + // _S_store {{{2 + _GLIBCXX_SIMD_INTRINSIC static void _S_store(bool __v, bool* __mem) noexcept + { __mem[0] = __v; } + + // _S_masked_store {{{2 + _GLIBCXX_SIMD_INTRINSIC static void + _S_masked_store(const bool __v, bool* __mem, const bool __k) noexcept + { + if (__k) + __mem[0] = __v; + } + + // logical and bitwise operators {{{2 + static constexpr bool _S_logical_and(bool __x, bool __y) + { return __x && __y; } + + static constexpr bool _S_logical_or(bool __x, bool __y) + { return __x || __y; } + + static constexpr bool _S_bit_not(bool __x) + { return !__x; } + + static constexpr bool _S_bit_and(bool __x, bool __y) + { return __x && __y; } + + static constexpr bool _S_bit_or(bool __x, bool __y) + { return __x || __y; } + + static constexpr bool _S_bit_xor(bool __x, bool __y) + { return __x != __y; } + + // smart_reference access {{{2 + constexpr static void _S_set(bool& __k, [[maybe_unused]] int __i, + bool __x) noexcept + { + _GLIBCXX_DEBUG_ASSERT(__i == 0); + __k = __x; + } + + // _S_masked_assign {{{2 + _GLIBCXX_SIMD_INTRINSIC static void _S_masked_assign(bool __k, bool& __lhs, + bool __rhs) + { + if (__k) + __lhs = __rhs; + } + + // }}}2 + // _S_all_of {{{ + template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool + _S_all_of(simd_mask<_Tp, _Abi> __k) + { return __k._M_data; } + + // }}} + // _S_any_of {{{ + template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool + _S_any_of(simd_mask<_Tp, _Abi> __k) + { return __k._M_data; } + + // }}} + // _S_none_of {{{ + template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool + _S_none_of(simd_mask<_Tp, _Abi> __k) + { return !__k._M_data; } + + // }}} + // _S_some_of {{{ + template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_INTRINSIC constexpr static bool + _S_some_of(simd_mask<_Tp, _Abi>) + { return false; } + + // }}} + // _S_popcount {{{ + template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_INTRINSIC constexpr static int + _S_popcount(simd_mask<_Tp, _Abi> __k) + { return __k._M_data; } + + // }}} + // _S_find_first_set {{{ + template <typename _Tp, typename _Abi> + _GLIBCXX_SIMD_INTRINSIC constexpr static 
+  // _S_find_first_set {{{
+  template <typename _Tp, typename _Abi>
+    _GLIBCXX_SIMD_INTRINSIC constexpr static int
+    _S_find_first_set(simd_mask<_Tp, _Abi>)
+    { return 0; }
+
+  // }}}
+  // _S_find_last_set {{{
+  template <typename _Tp, typename _Abi>
+    _GLIBCXX_SIMD_INTRINSIC constexpr static int
+    _S_find_last_set(simd_mask<_Tp, _Abi>)
+    { return 0; }
+
+  // }}}
+};
+
+// }}}
+
+_GLIBCXX_SIMD_END_NAMESPACE
+#endif  // __cplusplus >= 201703L
+#endif  // _GLIBCXX_EXPERIMENTAL_SIMD_SCALAR_H_
+
+// vim: foldmethod=marker sw=2 noet ts=8 sts=2 tw=80
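
That is the whole scalar ABI: element count one everywhere, no vector registers involved. A short, hedged usage sketch against the public interface (alias illustrative):

  #include <experimental/simd>
  namespace stdx = std::experimental;

  using V = stdx::simd<float, stdx::simd_abi::scalar>;
  static_assert(V::size() == 1);
  V x = 1.5f;      // broadcast construction of the single element
  float y = x[0];  // 0 is the only valid subscript
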
diff --git a/libstdc++-v3/include/experimental/bits/simd_x86.h b/libstdc++-v3/include/experimental/bits/simd_x86.h
new file mode 100644
index 00000000000..d1d7b9d4bf3
--- /dev/null
+++ b/libstdc++-v3/include/experimental/bits/simd_x86.h
@@ -0,0 +1,5169 @@
+// Simd x86 specific implementations -*- C++ -*-
+
+// Copyright (C) 2020 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library. This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+
+// Under Section 7 of GPL version 3, you are granted additional
+// permissions described in the GCC Runtime Library Exception, version
+// 3.1, as published by the Free Software Foundation.
+
+// You should have received a copy of the GNU General Public License and
+// a copy of the GCC Runtime Library Exception along with this program;
+// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
+// <http://www.gnu.org/licenses/>.
+
+#ifndef _GLIBCXX_EXPERIMENTAL_SIMD_X86_H_
+#define _GLIBCXX_EXPERIMENTAL_SIMD_X86_H_
+
+#if __cplusplus >= 201703L
+
+#if !_GLIBCXX_SIMD_X86INTRIN
+#error \
+  "simd_x86.h may only be included when MMX or SSE on x86(_64) are available"
+#endif
+
+_GLIBCXX_SIMD_BEGIN_NAMESPACE
+
+// __to_masktype {{{
+// Given <T, N> return <__int_for_sizeof_t<T>, N>. For _SimdWrapper and
+// __vector_type_t.
+template <typename _Tp, size_t _Np>
+  _GLIBCXX_SIMD_INTRINSIC constexpr _SimdWrapper<__int_for_sizeof_t<_Tp>, _Np>
+  __to_masktype(_SimdWrapper<_Tp, _Np> __x)
+  {
+    return reinterpret_cast<__vector_type_t<__int_for_sizeof_t<_Tp>, _Np>>(
+      __x._M_data);
+  }
+
+template <typename _TV,
+          typename _TVT
+          = enable_if_t<__is_vector_type_v<_TV>, _VectorTraits<_TV>>,
+          typename _Up = __int_for_sizeof_t<typename _TVT::value_type>>
+  _GLIBCXX_SIMD_INTRINSIC constexpr __vector_type_t<_Up, _TVT::_S_full_size>
+  __to_masktype(_TV __x)
+  { return reinterpret_cast<__vector_type_t<_Up, _TVT::_S_full_size>>(__x); }
+
+// }}}
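
At the builtin-vector level a mask is just a same-width signed-integer vector, since GCC's vector comparisons produce 0/-1 integer lanes; __to_masktype only reinterprets the bits. A standalone sketch of the idea using GNU vector extensions (the typedef and function names are illustrative, not from the patch):

  typedef float v4sf __attribute__((vector_size(16)));
  typedef int   v4si __attribute__((vector_size(16)));

  // same total size, same lane count; only the element type changes
  v4si
  as_masktype(v4sf v)
  { return reinterpret_cast<v4si>(v); }

  v4si
  less(v4sf a, v4sf b)
  { return a < b; }  // each lane is already 0 or -1
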
+// __interleave128_lo {{{
+template <typename _Ap, typename _B, typename _Tp = common_type_t<_Ap, _B>,
+          typename _Trait = _VectorTraits<_Tp>>
+  _GLIBCXX_SIMD_INTRINSIC constexpr _Tp
+  __interleave128_lo(const _Ap& __av, const _B& __bv)
+  {
+    const _Tp __a(__av);
+    const _Tp __b(__bv);
+    if constexpr (sizeof(_Tp) == 16 && _Trait::_S_full_size == 2)
+      return _Tp{__a[0], __b[0]};
+    else if constexpr (sizeof(_Tp) == 16 && _Trait::_S_full_size == 4)
+      return _Tp{__a[0], __b[0], __a[1], __b[1]};
+    else if constexpr (sizeof(_Tp) == 16 && _Trait::_S_full_size == 8)
+      return _Tp{__a[0], __b[0], __a[1], __b[1],
+                 __a[2], __b[2], __a[3], __b[3]};
+    else if constexpr (sizeof(_Tp) == 16 && _Trait::_S_full_size == 16)
+      return _Tp{__a[0], __b[0], __a[1], __b[1], __a[2], __b[2],
+                 __a[3], __b[3], __a[4], __b[4], __a[5], __b[5],
+                 __a[6], __b[6], __a[7], __b[7]};
+    else if constexpr (sizeof(_Tp) == 32 && _Trait::_S_full_size == 4)
+      return _Tp{__a[0], __b[0], __a[2], __b[2]};
+    else if constexpr (sizeof(_Tp) == 32 && _Trait::_S_full_size == 8)
+      return _Tp{__a[0], __b[0], __a[1], __b[1],
+                 __a[4], __b[4], __a[5], __b[5]};
+    else if constexpr (sizeof(_Tp) == 32 && _Trait::_S_full_size == 16)
+      return _Tp{__a[0], __b[0], __a[1], __b[1], __a[2], __b[2],
+                 __a[3], __b[3], __a[8], __b[8], __a[9], __b[9],
+                 __a[10], __b[10], __a[11], __b[11]};
+    else if constexpr (sizeof(_Tp) == 32 && _Trait::_S_full_size == 32)
+      return _Tp{__a[0], __b[0], __a[1], __b[1], __a[2], __b[2], __a[3],
+                 __b[3], __a[4], __b[4], __a[5], __b[5], __a[6], __b[6],
+                 __a[7], __b[7], __a[16], __b[16], __a[17], __b[17], __a[18],
+                 __b[18], __a[19], __b[19], __a[20], __b[20], __a[21], __b[21],
+                 __a[22], __b[22], __a[23], __b[23]};
+    else if constexpr (sizeof(_Tp) == 64 && _Trait::_S_full_size == 8)
+      return _Tp{__a[0], __b[0], __a[2], __b[2],
+                 __a[4], __b[4], __a[6], __b[6]};
+    else if constexpr (sizeof(_Tp) == 64 && _Trait::_S_full_size == 16)
+      return _Tp{__a[0], __b[0], __a[1], __b[1], __a[4], __b[4],
+                 __a[5], __b[5], __a[8], __b[8], __a[9], __b[9],
+                 __a[12], __b[12], __a[13], __b[13]};
+    else if constexpr (sizeof(_Tp) == 64 && _Trait::_S_full_size == 32)
+      return _Tp{__a[0], __b[0], __a[1], __b[1], __a[2], __b[2], __a[3],
+                 __b[3], __a[8], __b[8], __a[9], __b[9], __a[10], __b[10],
+                 __a[11], __b[11], __a[16], __b[16], __a[17], __b[17], __a[18],
+                 __b[18], __a[19], __b[19], __a[24], __b[24], __a[25], __b[25],
+                 __a[26], __b[26], __a[27], __b[27]};
+    else if constexpr (sizeof(_Tp) == 64 && _Trait::_S_full_size == 64)
+      return _Tp{__a[0], __b[0], __a[1], __b[1], __a[2], __b[2], __a[3],
+                 __b[3], __a[4], __b[4], __a[5], __b[5], __a[6], __b[6],
+                 __a[7], __b[7], __a[16], __b[16], __a[17], __b[17], __a[18],
+                 __b[18], __a[19], __b[19], __a[20], __b[20], __a[21], __b[21],
+                 __a[22], __b[22], __a[23], __b[23], __a[32], __b[32], __a[33],
+                 __b[33], __a[34], __b[34], __a[35], __b[35], __a[36], __b[36],
+                 __a[37], __b[37], __a[38], __b[38], __a[39], __b[39], __a[48],
+                 __b[48], __a[49], __b[49], __a[50], __b[50], __a[51], __b[51],
+                 __a[52], __b[52], __a[53], __b[53], __a[54], __b[54], __a[55],
+                 __b[55]};
+    else
+      __assert_unreachable<_Tp>();
+  }
+
+// }}}
+// __is_zero{{{
+template <typename _Tp, typename _TVT = _VectorTraits<_Tp>>
+  _GLIBCXX_SIMD_INTRINSIC constexpr bool
+  __is_zero(_Tp __a)
+  {
+    if (!__builtin_is_constant_evaluated())
+      {
+        if constexpr (__have_avx)
+          {
+            if constexpr (_TVT::template _S_is<float, 8>)
+              return _mm256_testz_ps(__a, __a);
+            else if constexpr (_TVT::template _S_is<double, 4>)
+              return _mm256_testz_pd(__a, __a);
+            else if constexpr (sizeof(_Tp) == 32)
+              return _mm256_testz_si256(__to_intrin(__a), __to_intrin(__a));
+            else if constexpr (_TVT::template _S_is<float>)
+              return _mm_testz_ps(__to_intrin(__a), __to_intrin(__a));
+            else if constexpr (_TVT::template _S_is<double, 2>)
+              return _mm_testz_pd(__a, __a);
+            else
+              return _mm_testz_si128(__to_intrin(__a), __to_intrin(__a));
+          }
+        else if constexpr (__have_sse4_1)
+          return _mm_testz_si128(__intrin_bitcast<__m128i>(__a),
+                                 __intrin_bitcast<__m128i>(__a));
+      }
+    else if constexpr (sizeof(_Tp) <= 8)
+      return reinterpret_cast<__int_for_sizeof_t<_Tp>>(__a) == 0;
+    else
+      {
+        const auto __b = __vector_bitcast<_LLong>(__a);
+        if constexpr (sizeof(__b) == 16)
+          return (__b[0] | __b[1]) == 0;
+        else if constexpr (sizeof(__b) == 32)
+          return __is_zero(__lo128(__b) | __hi128(__b));
+        else if constexpr (sizeof(__b) == 64)
+          return __is_zero(__lo256(__b) | __hi256(__b));
+        else
+          __assert_unreachable<_Tp>();
+      }
+  }
+
+// }}}
+// __movemask{{{
+template <typename _Tp, typename _TVT = _VectorTraits<_Tp>>
+  _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_CONST int
+  __movemask(_Tp __a)
+  {
+    if constexpr (sizeof(_Tp) == 32)
+      {
+        if constexpr (_TVT::template _S_is<float>)
+          return _mm256_movemask_ps(__to_intrin(__a));
+        else if constexpr (_TVT::template _S_is<double>)
+          return _mm256_movemask_pd(__to_intrin(__a));
+        else
+          return _mm256_movemask_epi8(__to_intrin(__a));
+      }
+    else if constexpr (_TVT::template _S_is<float>)
+      return _mm_movemask_ps(__to_intrin(__a));
+    else if constexpr (_TVT::template _S_is<double>)
+      return _mm_movemask_pd(__to_intrin(__a));
+    else
+      return _mm_movemask_epi8(__to_intrin(__a));
+  }
+
+// }}}
+// __testz{{{
+template <typename _TI, typename _TVT = _VectorTraits<_TI>>
+  _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_CONST constexpr int
+  __testz(_TI __a, _TI __b)
+  {
+    static_assert(is_same_v<_TI, __intrinsic_type_t<typename _TVT::value_type,
+                                                    _TVT::_S_full_size>>);
+    if (!__builtin_is_constant_evaluated())
+      {
+        if constexpr (sizeof(_TI) == 32)
+          {
+            if constexpr (_TVT::template _S_is<float>)
+              return _mm256_testz_ps(__to_intrin(__a), __to_intrin(__b));
+            else if constexpr (_TVT::template _S_is<double>)
+              return _mm256_testz_pd(__to_intrin(__a), __to_intrin(__b));
+            else
+              return _mm256_testz_si256(__to_intrin(__a), __to_intrin(__b));
+          }
+        else if constexpr (_TVT::template _S_is<float> && __have_avx)
+          return _mm_testz_ps(__to_intrin(__a), __to_intrin(__b));
+        else if constexpr (_TVT::template _S_is<double> && __have_avx)
+          return _mm_testz_pd(__to_intrin(__a), __to_intrin(__b));
+        else if constexpr (__have_sse4_1)
+          return _mm_testz_si128(__intrin_bitcast<__m128i>(__to_intrin(__a)),
+                                 __intrin_bitcast<__m128i>(__to_intrin(__b)));
+        else
+          return __movemask(0 == __and(__a, __b)) != 0;
+      }
+    else
+      return __is_zero(__and(__a, __b));
+  }
+
+// }}}
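
Everything in the pre-AVX512 mask protocol funnels through __movemask: one bit per lane, taken from the lane's sign bit. An illustrative all_of over a 4-float comparison result, in plain intrinsics (assumes SSE; the function name is hypothetical, not from the patch):

  #include <xmmintrin.h>

  bool
  sketch_all_of(__m128 mask)
  {
    // one bit per 32-bit lane; 0xf means all four sign bits are set
    return _mm_movemask_ps(mask) == 0xf;
  }
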
+// __testc{{{
+// requires SSE4.1 or above
+template <typename _TI, typename _TVT = _VectorTraits<_TI>>
+  _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_CONST constexpr int
+  __testc(_TI __a, _TI __b)
+  {
+    static_assert(is_same_v<_TI, __intrinsic_type_t<typename _TVT::value_type,
+                                                    _TVT::_S_full_size>>);
+    if (__builtin_is_constant_evaluated())
+      return __is_zero(__andnot(__a, __b));
+
+    if constexpr (sizeof(_TI) == 32)
+      {
+        if constexpr (_TVT::template _S_is<float>)
+          return _mm256_testc_ps(__a, __b);
+        else if constexpr (_TVT::template _S_is<double>)
+          return _mm256_testc_pd(__a, __b);
+        else
+          return _mm256_testc_si256(__to_intrin(__a), __to_intrin(__b));
+      }
+    else if constexpr (_TVT::template _S_is<float> && __have_avx)
+      return _mm_testc_ps(__to_intrin(__a), __to_intrin(__b));
+    else if constexpr (_TVT::template _S_is<double> && __have_avx)
+      return _mm_testc_pd(__to_intrin(__a), __to_intrin(__b));
+    else
+      {
+        static_assert(is_same_v<_TI, _TI> && __have_sse4_1);
+        return _mm_testc_si128(__intrin_bitcast<__m128i>(__to_intrin(__a)),
+                               __intrin_bitcast<__m128i>(__to_intrin(__b)));
+      }
+  }
+
+// }}}
+// __testnzc{{{
+template <typename _TI, typename _TVT = _VectorTraits<_TI>>
+  _GLIBCXX_SIMD_INTRINSIC _GLIBCXX_CONST constexpr int
+  __testnzc(_TI __a, _TI __b)
+  {
+    static_assert(is_same_v<_TI, __intrinsic_type_t<typename _TVT::value_type,
+                                                    _TVT::_S_full_size>>);
+    if (!__builtin_is_constant_evaluated())
+      {
+        if constexpr (sizeof(_TI) == 32)
+          {
+            if constexpr (_TVT::template _S_is<float>)
+              return _mm256_testnzc_ps(__a, __b);
+            else if constexpr (_TVT::template _S_is<double>)
+              return _mm256_testnzc_pd(__a, __b);
+            else
+              return _mm256_testnzc_si256(__to_intrin(__a), __to_intrin(__b));
+          }
+        else if constexpr (_TVT::template _S_is<float> && __have_avx)
+          return _mm_testnzc_ps(__to_intrin(__a), __to_intrin(__b));
+        else if constexpr (_TVT::template _S_is<double> && __have_avx)
+          return _mm_testnzc_pd(__to_intrin(__a), __to_intrin(__b));
+        else if constexpr (__have_sse4_1)
+          return _mm_testnzc_si128(__intrin_bitcast<__m128i>(__to_intrin(__a)),
+                                   __intrin_bitcast<__m128i>(__to_intrin(__b)));
+        else
+          return __movemask(0 == __and(__a, __b)) == 0
+                 && __movemask(0 == __andnot(__a, __b)) == 0;
+      }
+    else
+      return !(__is_zero(__and(__a, __b)) || __is_zero(__andnot(__a, __b)));
+  }
+
+// }}}
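
The three helpers model the ZF/CF conditions of the ptest instruction family, as the constexpr fallbacks above spell out. In scalar terms (a hedged model with hypothetical names; uint64_t stands in for a register):

  #include <cstdint>

  bool testz_model(uint64_t a, uint64_t b)    // ZF: a AND b == 0
  { return (a & b) == 0; }

  bool testc_model(uint64_t a, uint64_t b)    // CF: NOT(a) AND b == 0
  { return (~a & b) == 0; }

  bool testnzc_model(uint64_t a, uint64_t b)  // neither ZF nor CF
  { return !testz_model(a, b) && !testc_model(a, b); }
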
+// __xzyw{{{
+// shuffles the complete vector, swapping the inner two quarters. Often useful
+// for AVX for fixing up a shuffle result.
+template <typename _Tp, typename _TVT = _VectorTraits<_Tp>>
+  _GLIBCXX_SIMD_INTRINSIC _Tp
+  __xzyw(_Tp __a)
+  {
+    if constexpr (sizeof(_Tp) == 16)
+      {
+        const auto __x = __vector_bitcast<conditional_t<
+          is_floating_point_v<typename _TVT::value_type>, float, int>>(__a);
+        return reinterpret_cast<_Tp>(
+          decltype(__x){__x[0], __x[2], __x[1], __x[3]});
+      }
+    else if constexpr (sizeof(_Tp) == 32)
+      {
+        const auto __x = __vector_bitcast<conditional_t<
+          is_floating_point_v<typename _TVT::value_type>, double, _LLong>>(__a);
+        return reinterpret_cast<_Tp>(
+          decltype(__x){__x[0], __x[2], __x[1], __x[3]});
+      }
+    else if constexpr (sizeof(_Tp) == 64)
+      {
+        const auto __x = __vector_bitcast<conditional_t<
+          is_floating_point_v<typename _TVT::value_type>, double, _LLong>>(__a);
+        return reinterpret_cast<_Tp>(decltype(__x){__x[0], __x[1], __x[4],
+                                                   __x[5], __x[2], __x[3],
+                                                   __x[6], __x[7]});
+      }
+    else
+      __assert_unreachable<_Tp>();
+  }
+
+// }}}
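
__xzyw in isolation, for the 32-byte case (GNU vector extensions, illustrative names): the two inner 64-bit quarters swap, turning the "two interleaved 128-bit lanes" layout that AVX in-lane shuffles produce back into linear element order:

  typedef long long v4di __attribute__((vector_size(32)));

  v4di
  sketch_xzyw(v4di x)
  { return v4di{x[0], x[2], x[1], x[3]}; }
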
+// __maskload_epi32{{{
+template <typename _Tp>
+  _GLIBCXX_SIMD_INTRINSIC auto
+  __maskload_epi32(const int* __ptr, _Tp __k)
+  {
+    if constexpr (sizeof(__k) == 16)
+      return _mm_maskload_epi32(__ptr, __k);
+    else
+      return _mm256_maskload_epi32(__ptr, __k);
+  }
+
+// }}}
+// __maskload_epi64{{{
+template <typename _Tp>
+  _GLIBCXX_SIMD_INTRINSIC auto
+  __maskload_epi64(const _LLong* __ptr, _Tp __k)
+  {
+    if constexpr (sizeof(__k) == 16)
+      return _mm_maskload_epi64(__ptr, __k);
+    else
+      return _mm256_maskload_epi64(__ptr, __k);
+  }
+
+// }}}
+// __maskload_ps{{{
+template <typename _Tp>
+  _GLIBCXX_SIMD_INTRINSIC auto
+  __maskload_ps(const float* __ptr, _Tp __k)
+  {
+    if constexpr (sizeof(__k) == 16)
+      return _mm_maskload_ps(__ptr, __k);
+    else
+      return _mm256_maskload_ps(__ptr, __k);
+  }
+
+// }}}
+// __maskload_pd{{{
+template <typename _Tp>
+  _GLIBCXX_SIMD_INTRINSIC auto
+  __maskload_pd(const double* __ptr, _Tp __k)
+  {
+    if constexpr (sizeof(__k) == 16)
+      return _mm_maskload_pd(__ptr, __k);
+    else
+      return _mm256_maskload_pd(__ptr, __k);
+  }
+
+// }}}
+
+#ifdef _GLIBCXX_SIMD_WORKAROUND_PR85048
+#include "simd_x86_conversions.h"
+#endif
+
+// ISA & type detection {{{
+template <typename _Tp, size_t _Np>
+  constexpr bool
+  __is_sse_ps()
+  {
+    return __have_sse
+           && is_same_v<_Tp,
+                        float> && sizeof(__intrinsic_type_t<_Tp, _Np>) == 16;
+  }
+
+template <typename _Tp, size_t _Np>
+  constexpr bool
+  __is_sse_pd()
+  {
+    return __have_sse2
+           && is_same_v<_Tp,
+                        double> && sizeof(__intrinsic_type_t<_Tp, _Np>) == 16;
+  }
+
+template <typename _Tp, size_t _Np>
+  constexpr bool
+  __is_avx_ps()
+  {
+    return __have_avx
+           && is_same_v<_Tp,
+                        float> && sizeof(__intrinsic_type_t<_Tp, _Np>) == 32;
+  }
+
+template <typename _Tp, size_t _Np>
+  constexpr bool
+  __is_avx_pd()
+  {
+    return __have_avx
+           && is_same_v<_Tp,
+                        double> && sizeof(__intrinsic_type_t<_Tp, _Np>) == 32;
+  }
+
+template <typename _Tp, size_t _Np>
+  constexpr bool
+  __is_avx512_ps()
+  {
+    return __have_avx512f
+           && is_same_v<_Tp,
+                        float> && sizeof(__intrinsic_type_t<_Tp, _Np>) == 64;
+  }
+
+template <typename _Tp, size_t _Np>
+  constexpr bool
+  __is_avx512_pd()
+  {
+    return __have_avx512f
+           && is_same_v<_Tp,
+                        double> && sizeof(__intrinsic_type_t<_Tp, _Np>) == 64;
+  }
+
+// }}}
+struct _MaskImplX86Mixin;
+
+// _CommonImplX86 {{{
+struct _CommonImplX86 : _CommonImplBuiltin
+{
+#ifdef _GLIBCXX_SIMD_WORKAROUND_PR85048
+  // _S_converts_via_decomposition {{{
+  template <typename _From, typename _To, size_t _ToSize>
+    static constexpr bool _S_converts_via_decomposition()
+    {
+      if constexpr (is_integral_v<
+                      _From> && is_integral_v<_To> && sizeof(_From) == 8
+                    && _ToSize == 16)
+        return (sizeof(_To) == 2 && !__have_ssse3)
+               || (sizeof(_To) == 1 && !__have_avx512f);
+      else if constexpr (is_floating_point_v<_From> && is_integral_v<_To>)
+        return ((sizeof(_From) == 4 || sizeof(_From) == 8) && sizeof(_To) == 8
+                && !__have_avx512dq)
+               || (sizeof(_From) == 8 && sizeof(_To) == 4 && !__have_sse4_1
+                   && _ToSize == 16);
+      else if constexpr (
+        is_integral_v<_From> && is_floating_point_v<_To> && sizeof(_From) == 8
+        && !__have_avx512dq)
+        return (sizeof(_To) == 4 && _ToSize == 16)
+               || (sizeof(_To) == 8 && _ToSize < 64);
+      else
+        return false;
+    }
+
+  template <typename _From, typename _To, size_t _ToSize>
+    static inline constexpr bool __converts_via_decomposition_v
+      = _S_converts_via_decomposition<_From, _To, _ToSize>();
+
+  // }}}
+#endif
+  // _S_store {{{
+  using _CommonImplBuiltin::_S_store;
+
+  template <typename _Tp, size_t _Np>
+    _GLIBCXX_SIMD_INTRINSIC static void _S_store(_SimdWrapper<_Tp, _Np> __x,
+                                                 void* __addr)
+    {
+      constexpr size_t _Bytes = _Np * sizeof(_Tp);
+
+      if constexpr ((_Bytes & (_Bytes - 1)) != 0 && __have_avx512bw_vl)
+        {
+          const auto __v = __to_intrin(__x);
+
+          if constexpr (_Bytes & 1)
+            {
+              if constexpr (_Bytes < 16)
+                _mm_mask_storeu_epi8(__addr, 0xffffu >> (16 - _Bytes),
+                                     __intrin_bitcast<__m128i>(__v));
+              else if constexpr (_Bytes < 32)
+                _mm256_mask_storeu_epi8(__addr, 0xffffffffu >> (32 - _Bytes),
+                                        __intrin_bitcast<__m256i>(__v));
+              else
+                _mm512_mask_storeu_epi8(__addr,
+                                        0xffffffffffffffffull >> (64 - _Bytes),
+                                        __intrin_bitcast<__m512i>(__v));
+            }
+          else if constexpr (_Bytes & 2)
+            {
+              if constexpr (_Bytes < 16)
+                _mm_mask_storeu_epi16(__addr, 0xffu >> (8 - _Bytes / 2),
+                                      __intrin_bitcast<__m128i>(__v));
+              else if constexpr (_Bytes < 32)
+                _mm256_mask_storeu_epi16(__addr, 0xffffu >> (16 - _Bytes / 2),
+                                         __intrin_bitcast<__m256i>(__v));
+              else
+                _mm512_mask_storeu_epi16(__addr,
+                                         0xffffffffull >> (32 - _Bytes / 2),
+                                         __intrin_bitcast<__m512i>(__v));
+            }
+          else if constexpr (_Bytes & 4)
+            {
+              if constexpr (_Bytes < 16)
+                _mm_mask_storeu_epi32(__addr, 0xfu >> (4 - _Bytes / 4),
+                                      __intrin_bitcast<__m128i>(__v));
+              else if constexpr (_Bytes < 32)
+                _mm256_mask_storeu_epi32(__addr, 0xffu >> (8 - _Bytes / 4),
+                                         __intrin_bitcast<__m256i>(__v));
+              else
+                _mm512_mask_storeu_epi32(__addr, 0xffffull >> (16 - _Bytes / 4),
+                                         __intrin_bitcast<__m512i>(__v));
[...]

[diff truncated at 524288 bytes]
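
The masked-store branches above all derive a contiguous low-bits mask from the (non-power-of-two) byte count. The same computation in scalar form, for the 16-byte epi8 case (hypothetical helper name, not from the patch):

  #include <cstdint>

  // low n of 16 bits set, valid for 1 <= n <= 16
  uint16_t
  tail_mask16(unsigned n)
  { return static_cast<uint16_t>(0xffffu >> (16 - n)); }

For n == 3 this yields 0x0007, so _mm_mask_storeu_epi8 writes only the first three bytes and leaves the rest of the destination untouched.
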
Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210127163918.E607F3846405@sourceware.org \
    --to=redi@gcc.gnu.org \
    --cc=gcc-cvs@gcc.gnu.org \
    --cc=libstdc++-cvs@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line
before the message body.