public inbox for gcc-patches@gcc.gnu.org
* [PATCH] RISC-V: Add popcount fallback expander.
@ 2023-10-18  9:20 Robin Dapp
From: Robin Dapp @ 2023-10-18  9:20 UTC
  To: gcc-patches, palmer, Kito Cheng, jeffreyalaw, juzhe.zhong; +Cc: rdapp.gcc

Hi,

as I didn't manage to get back to the generic vectorizer fallback for
popcount in time (the generic costing problem is still unresolved), I
figured I'd rather implement the popcount fallback in the riscv backend.
It uses the WWG (Wilkes-Wheeler-Gill) algorithm that libgcc also uses.
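
For reference, here is a scalar sketch of the steps the expander emits per
vector element.  It is only an illustration: the function names
popcount64_wwg and popcount32_wwg are made up for this example and are not
part of the patch or of libgcc.  The 32-bit variant assumes the masks are
truncated to the element width and the final shift is "element bits - 8",
which is what the expander does via gen_int_mode and GET_MODE_BITSIZE.

```c
#include <stdint.h>

/* Hypothetical helper (not in the patch): WWG popcount for 64-bit values.  */
static unsigned
popcount64_wwg (uint64_t x)
{
  /* Each 2-bit field now holds the number of set bits in that field.  */
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  /* Sum adjacent 2-bit fields into 4-bit fields.  */
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  /* Sum adjacent 4-bit fields into bytes.  */
  x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
  /* The multiplication accumulates all byte sums in the top byte.  */
  return (unsigned) ((x * 0x0101010101010101ULL) >> 56);
}

/* Hypothetical helper (not in the patch): same steps for 32-bit values,
   with truncated masks and a final shift of 32 - 8 = 24.  */
static unsigned
popcount32_wwg (uint32_t x)
{
  x = x - ((x >> 1) & 0x55555555U);
  x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);
  x = (x + (x >> 4)) & 0x0f0f0f0fU;
  return (x * 0x01010101U) >> 24;
}
```

The vector version performs the same sequence with vand.vx/vmul.vx-style
scalar-operand instructions for the masks and ordinary vector shifts, adds
and subtracts for the rest.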

rvv.exp is unchanged; the vect and dg.exp testsuites are currently running.

Regards
 Robin

gcc/ChangeLog:

	* config/riscv/autovec.md (popcount<mode>2): New expander.
	* config/riscv/riscv-protos.h (expand_popcount): Declare.
	* config/riscv/riscv-v.cc (expand_popcount): Vectorize popcount
	with the WWG algorithm.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/unop/popcount-1.c: New test.
	* gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c: New test.
	* gcc.target/riscv/rvv/autovec/unop/popcount.c: New test.
---
 gcc/config/riscv/autovec.md                   |   14 +
 gcc/config/riscv/riscv-protos.h               |    1 +
 gcc/config/riscv/riscv-v.cc                   |   71 +
 .../riscv/rvv/autovec/unop/popcount-1.c       |   20 +
 .../riscv/rvv/autovec/unop/popcount-run-1.c   |   49 +
 .../riscv/rvv/autovec/unop/popcount.c         | 1464 +++++++++++++++++
 6 files changed, 1619 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index c5b1e52cbf9..dfe836f705d 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1484,6 +1484,20 @@ (define_expand "xorsign<mode>3"
   DONE;
 })
 
+;; -------------------------------------------------------------------------------
+;; - [INT] POPCOUNT.
+;; -------------------------------------------------------------------------------
+
+(define_expand "popcount<mode>2"
+  [(match_operand:VI 0 "register_operand")
+   (match_operand:VI 1 "register_operand")]
+  "TARGET_VECTOR"
+{
+  riscv_vector::expand_popcount (operands);
+  DONE;
+})
+
+
 ;; -------------------------------------------------------------------------
 ;; ---- [INT] Highpart multiplication
 ;; -------------------------------------------------------------------------
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 49bdcdf2f93..4aeccdd961b 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -515,6 +515,7 @@ void expand_fold_extract_last (rtx *);
 void expand_cond_unop (unsigned, rtx *);
 void expand_cond_binop (unsigned, rtx *);
 void expand_cond_ternop (unsigned, rtx *);
+void expand_popcount (rtx *);
 
 /* Rounding mode bitfield for fixed point VXRM.  */
 enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 21d86c3f917..8b594b7127e 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4152,4 +4152,75 @@ expand_vec_lfloor (rtx op_0, rtx op_1, machine_mode vec_fp_mode,
   emit_vec_cvt_x_f (op_0, op_1, UNARY_OP_FRM_RDN, vec_fp_mode);
 }
 
+/* Vectorize popcount by the Wilkes-Wheeler-Gill algorithm that libgcc uses as
+   well.  */
+void
+expand_popcount (rtx *ops)
+{
+  rtx dst = ops[0];
+  rtx src = ops[1];
+  machine_mode mode = GET_MODE (dst);
+  scalar_mode imode = GET_MODE_INNER (mode);
+  static const uint64_t m5 = 0x5555555555555555ULL;
+  static const uint64_t m3 = 0x3333333333333333ULL;
+  static const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL;
+  static const uint64_t m1 = 0x0101010101010101ULL;
+
+  rtx x1 = gen_reg_rtx (mode);
+  rtx x2 = gen_reg_rtx (mode);
+  rtx x3 = gen_reg_rtx (mode);
+  rtx x4 = gen_reg_rtx (mode);
+
+  /* x1 = src - ((src >> 1) & 0x555...);  */
+  rtx shift1 = expand_binop (mode, lshr_optab, src, GEN_INT (1), NULL, true,
+			     OPTAB_DIRECT);
+
+  rtx and1 = gen_reg_rtx (mode);
+  rtx ops1[] = {and1, shift1, gen_int_mode (m5, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+		   ops1);
+
+  x1 = expand_binop (mode, sub_optab, src, and1, NULL, true, OPTAB_DIRECT);
+
+  /* x2 = (x1 & 0x3333333333333333ULL) + ((x1 >> 2) & 0x3333333333333333ULL);
+   */
+  rtx and2 = gen_reg_rtx (mode);
+  rtx ops2[] = {and2, x1, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+		   ops2);
+
+  rtx shift2 = expand_binop (mode, lshr_optab, x1, GEN_INT (2), NULL, true,
+			     OPTAB_DIRECT);
+
+  rtx and22 = gen_reg_rtx (mode);
+  rtx ops22[] = {and22, shift2, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+		   ops22);
+
+  x2 = expand_binop (mode, add_optab, and2, and22, NULL, true, OPTAB_DIRECT);
+
+  /* x3 = (x2 + (x2 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;  */
+  rtx shift3 = expand_binop (mode, lshr_optab, x2, GEN_INT (4), NULL, true,
+			     OPTAB_DIRECT);
+
+  rtx plus3
+    = expand_binop (mode, add_optab, x2, shift3, NULL, true, OPTAB_DIRECT);
+
+  rtx ops3[] = {x3, plus3, gen_int_mode (mf, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+		   ops3);
+
+  /* dst = (x3 * 0x0101...) >> (element bits - 8); 56 for 64-bit elements.  */
+  rtx mul4 = gen_reg_rtx (mode);
+  rtx ops4[] = {mul4, x3, gen_int_mode (m1, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (MULT, mode), riscv_vector::BINARY_OP,
+		   ops4);
+
+  x4 = expand_binop (mode, lshr_optab, mul4,
+		     GEN_INT (GET_MODE_BITSIZE (imode) - 8), NULL, true,
+		     OPTAB_DIRECT);
+
+  emit_move_insn (dst, x4);
+}
+
 } // namespace riscv_vector
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
new file mode 100644
index 00000000000..3169ebbff71
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv_zvfh -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-vect-details" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noipa))
+popcount_32 (uint32_t *restrict dst, uint32_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcount (src[i]);
+}
+
+void __attribute__ ((noipa))
+popcount_64 (uint64_t *restrict dst, uint64_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcountll (src[i]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 2 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
new file mode 100644
index 00000000000..38f1633da99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
@@ -0,0 +1,49 @@
+/* { dg-do run { target { riscv_v } } } */
+
+#include "popcount-1.c"
+
+extern void abort (void) __attribute__ ((noreturn));
+
+unsigned int data[] = {
+  0x11111100, 6,
+  0xe0e0f0f0, 14,
+  0x9900aab3, 13,
+  0x00040003, 3,
+  0x000e000c, 5,
+  0x22227777, 16,
+  0x12341234, 10,
+  0x0, 0
+};
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  unsigned int count = sizeof (data) / sizeof (data[0]) / 2;
+
+  uint32_t in32[count];
+  uint32_t out32[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in32[i] = data[i * 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_32 (out32, in32, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out32[i] != data[i * 2 + 1])
+      abort ();
+
+  count /= 2;
+  uint64_t in64[count];
+  uint64_t out64[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in64[i] = ((uint64_t) data[i * 4] << 32) | data[i * 4 + 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_64 (out64, in64, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out64[i] != data[i * 4 + 1] + data[i * 4 + 3])
+      abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
new file mode 100644
index 00000000000..585a522aa81
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
@@ -0,0 +1,1464 @@
+/* { dg-do run { target { riscv_v } } } */
+/* { dg-additional-options { -O2 -fdump-tree-vect-details -fno-vect-cost-model } }  */
+
+#include "stdint-gcc.h"
+#include <assert.h>
+
+#define DEF64(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+				 int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcountll (src[i]);                                  \
+  }
+
+#define DEF32(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+				 int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcount (src[i]);                                    \
+  }
+
+#define DEFCTZ64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+			    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctzll (src[i]);                                       \
+  }
+
+#define DEFCTZ32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+			    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctz (src[i]);                                         \
+  }
+
+#define DEFFFS64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+			    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffsll (src[i]);                                       \
+  }
+
+#define DEFFFS32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+			    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffs (src[i]);                                         \
+  }
+
+#define DEF_ALL()                                                              \
+  DEF64 (uint64_t, uint64_t)                                                   \
+  DEF64 (uint64_t, uint32_t)                                                   \
+  DEF64 (uint64_t, uint16_t)                                                   \
+  DEF64 (uint64_t, uint8_t)                                                    \
+  DEF64 (uint64_t, int64_t)                                                    \
+  DEF64 (uint64_t, int32_t)                                                    \
+  DEF64 (uint64_t, int16_t)                                                    \
+  DEF64 (uint64_t, int8_t)                                                     \
+  DEF64 (int64_t, uint64_t)                                                    \
+  DEF64 (int64_t, uint32_t)                                                    \
+  DEF64 (int64_t, uint16_t)                                                    \
+  DEF64 (int64_t, uint8_t)                                                     \
+  DEF64 (int64_t, int64_t)                                                     \
+  DEF64 (int64_t, int32_t)                                                     \
+  DEF64 (int64_t, int16_t)                                                     \
+  DEF64 (int64_t, int8_t)                                                      \
+  DEF64 (uint32_t, uint64_t)                                                   \
+  DEF64 (uint32_t, uint32_t)                                                   \
+  DEF64 (uint32_t, uint16_t)                                                   \
+  DEF64 (uint32_t, uint8_t)                                                    \
+  DEF64 (uint32_t, int64_t)                                                    \
+  DEF64 (uint32_t, int32_t)                                                    \
+  DEF64 (uint32_t, int16_t)                                                    \
+  DEF64 (uint32_t, int8_t)                                                     \
+  DEF64 (int32_t, uint64_t)                                                    \
+  DEF64 (int32_t, uint32_t)                                                    \
+  DEF64 (int32_t, uint16_t)                                                    \
+  DEF64 (int32_t, uint8_t)                                                     \
+  DEF64 (int32_t, int64_t)                                                     \
+  DEF64 (int32_t, int32_t)                                                     \
+  DEF64 (int32_t, int16_t)                                                     \
+  DEF64 (int32_t, int8_t)                                                      \
+  DEF64 (uint16_t, uint64_t)                                                   \
+  DEF64 (uint16_t, uint32_t)                                                   \
+  DEF64 (uint16_t, uint16_t)                                                   \
+  DEF64 (uint16_t, uint8_t)                                                    \
+  DEF64 (uint16_t, int64_t)                                                    \
+  DEF64 (uint16_t, int32_t)                                                    \
+  DEF64 (uint16_t, int16_t)                                                    \
+  DEF64 (uint16_t, int8_t)                                                     \
+  DEF64 (int16_t, uint64_t)                                                    \
+  DEF64 (int16_t, uint32_t)                                                    \
+  DEF64 (int16_t, uint16_t)                                                    \
+  DEF64 (int16_t, uint8_t)                                                     \
+  DEF64 (int16_t, int64_t)                                                     \
+  DEF64 (int16_t, int32_t)                                                     \
+  DEF64 (int16_t, int16_t)                                                     \
+  DEF64 (int16_t, int8_t)                                                      \
+  DEF64 (uint8_t, uint64_t)                                                    \
+  DEF64 (uint8_t, uint32_t)                                                    \
+  DEF64 (uint8_t, uint16_t)                                                    \
+  DEF64 (uint8_t, uint8_t)                                                     \
+  DEF64 (uint8_t, int64_t)                                                     \
+  DEF64 (uint8_t, int32_t)                                                     \
+  DEF64 (uint8_t, int16_t)                                                     \
+  DEF64 (uint8_t, int8_t)                                                      \
+  DEF64 (int8_t, uint64_t)                                                     \
+  DEF64 (int8_t, uint32_t)                                                     \
+  DEF64 (int8_t, uint16_t)                                                     \
+  DEF64 (int8_t, uint8_t)                                                      \
+  DEF64 (int8_t, int64_t)                                                      \
+  DEF64 (int8_t, int32_t)                                                      \
+  DEF64 (int8_t, int16_t)                                                      \
+  DEF64 (int8_t, int8_t)                                                       \
+  DEF32 (uint64_t, uint64_t)                                                   \
+  DEF32 (uint64_t, uint32_t)                                                   \
+  DEF32 (uint64_t, uint16_t)                                                   \
+  DEF32 (uint64_t, uint8_t)                                                    \
+  DEF32 (uint64_t, int64_t)                                                    \
+  DEF32 (uint64_t, int32_t)                                                    \
+  DEF32 (uint64_t, int16_t)                                                    \
+  DEF32 (uint64_t, int8_t)                                                     \
+  DEF32 (int64_t, uint64_t)                                                    \
+  DEF32 (int64_t, uint32_t)                                                    \
+  DEF32 (int64_t, uint16_t)                                                    \
+  DEF32 (int64_t, uint8_t)                                                     \
+  DEF32 (int64_t, int64_t)                                                     \
+  DEF32 (int64_t, int32_t)                                                     \
+  DEF32 (int64_t, int16_t)                                                     \
+  DEF32 (int64_t, int8_t)                                                      \
+  DEF32 (uint32_t, uint64_t)                                                   \
+  DEF32 (uint32_t, uint32_t)                                                   \
+  DEF32 (uint32_t, uint16_t)                                                   \
+  DEF32 (uint32_t, uint8_t)                                                    \
+  DEF32 (uint32_t, int64_t)                                                    \
+  DEF32 (uint32_t, int32_t)                                                    \
+  DEF32 (uint32_t, int16_t)                                                    \
+  DEF32 (uint32_t, int8_t)                                                     \
+  DEF32 (int32_t, uint64_t)                                                    \
+  DEF32 (int32_t, uint32_t)                                                    \
+  DEF32 (int32_t, uint16_t)                                                    \
+  DEF32 (int32_t, uint8_t)                                                     \
+  DEF32 (int32_t, int64_t)                                                     \
+  DEF32 (int32_t, int32_t)                                                     \
+  DEF32 (int32_t, int16_t)                                                     \
+  DEF32 (int32_t, int8_t)                                                      \
+  DEF32 (uint16_t, uint64_t)                                                   \
+  DEF32 (uint16_t, uint32_t)                                                   \
+  DEF32 (uint16_t, uint16_t)                                                   \
+  DEF32 (uint16_t, uint8_t)                                                    \
+  DEF32 (uint16_t, int64_t)                                                    \
+  DEF32 (uint16_t, int32_t)                                                    \
+  DEF32 (uint16_t, int16_t)                                                    \
+  DEF32 (uint16_t, int8_t)                                                     \
+  DEF32 (int16_t, uint64_t)                                                    \
+  DEF32 (int16_t, uint32_t)                                                    \
+  DEF32 (int16_t, uint16_t)                                                    \
+  DEF32 (int16_t, uint8_t)                                                     \
+  DEF32 (int16_t, int64_t)                                                     \
+  DEF32 (int16_t, int32_t)                                                     \
+  DEF32 (int16_t, int16_t)                                                     \
+  DEF32 (int16_t, int8_t)                                                      \
+  DEF32 (uint8_t, uint64_t)                                                    \
+  DEF32 (uint8_t, uint32_t)                                                    \
+  DEF32 (uint8_t, uint16_t)                                                    \
+  DEF32 (uint8_t, uint8_t)                                                     \
+  DEF32 (uint8_t, int64_t)                                                     \
+  DEF32 (uint8_t, int32_t)                                                     \
+  DEF32 (uint8_t, int16_t)                                                     \
+  DEF32 (uint8_t, int8_t)                                                      \
+  DEF32 (int8_t, uint64_t)                                                     \
+  DEF32 (int8_t, uint32_t)                                                     \
+  DEF32 (int8_t, uint16_t)                                                     \
+  DEF32 (int8_t, uint8_t)                                                      \
+  DEF32 (int8_t, int64_t)                                                      \
+  DEF32 (int8_t, int32_t)                                                      \
+  DEF32 (int8_t, int16_t)                                                      \
+  DEF32 (int8_t, int8_t)                                                       \
+  DEFCTZ64 (uint64_t, uint64_t)                                                \
+  DEFCTZ64 (uint64_t, uint32_t)                                                \
+  DEFCTZ64 (uint64_t, uint16_t)                                                \
+  DEFCTZ64 (uint64_t, uint8_t)                                                 \
+  DEFCTZ64 (uint64_t, int64_t)                                                 \
+  DEFCTZ64 (uint64_t, int32_t)                                                 \
+  DEFCTZ64 (uint64_t, int16_t)                                                 \
+  DEFCTZ64 (uint64_t, int8_t)                                                  \
+  DEFCTZ64 (int64_t, uint64_t)                                                 \
+  DEFCTZ64 (int64_t, uint32_t)                                                 \
+  DEFCTZ64 (int64_t, uint16_t)                                                 \
+  DEFCTZ64 (int64_t, uint8_t)                                                  \
+  DEFCTZ64 (int64_t, int64_t)                                                  \
+  DEFCTZ64 (int64_t, int32_t)                                                  \
+  DEFCTZ64 (int64_t, int16_t)                                                  \
+  DEFCTZ64 (int64_t, int8_t)                                                   \
+  DEFCTZ64 (uint32_t, uint64_t)                                                \
+  DEFCTZ64 (uint32_t, uint32_t)                                                \
+  DEFCTZ64 (uint32_t, uint16_t)                                                \
+  DEFCTZ64 (uint32_t, uint8_t)                                                 \
+  DEFCTZ64 (uint32_t, int64_t)                                                 \
+  DEFCTZ64 (uint32_t, int32_t)                                                 \
+  DEFCTZ64 (uint32_t, int16_t)                                                 \
+  DEFCTZ64 (uint32_t, int8_t)                                                  \
+  DEFCTZ64 (int32_t, uint64_t)                                                 \
+  DEFCTZ64 (int32_t, uint32_t)                                                 \
+  DEFCTZ64 (int32_t, uint16_t)                                                 \
+  DEFCTZ64 (int32_t, uint8_t)                                                  \
+  DEFCTZ64 (int32_t, int64_t)                                                  \
+  DEFCTZ64 (int32_t, int32_t)                                                  \
+  DEFCTZ64 (int32_t, int16_t)                                                  \
+  DEFCTZ64 (int32_t, int8_t)                                                   \
+  DEFCTZ64 (uint16_t, uint64_t)                                                \
+  DEFCTZ64 (uint16_t, uint32_t)                                                \
+  DEFCTZ64 (uint16_t, uint16_t)                                                \
+  DEFCTZ64 (uint16_t, uint8_t)                                                 \
+  DEFCTZ64 (uint16_t, int64_t)                                                 \
+  DEFCTZ64 (uint16_t, int32_t)                                                 \
+  DEFCTZ64 (uint16_t, int16_t)                                                 \
+  DEFCTZ64 (uint16_t, int8_t)                                                  \
+  DEFCTZ64 (int16_t, uint64_t)                                                 \
+  DEFCTZ64 (int16_t, uint32_t)                                                 \
+  DEFCTZ64 (int16_t, uint16_t)                                                 \
+  DEFCTZ64 (int16_t, uint8_t)                                                  \
+  DEFCTZ64 (int16_t, int64_t)                                                  \
+  DEFCTZ64 (int16_t, int32_t)                                                  \
+  DEFCTZ64 (int16_t, int16_t)                                                  \
+  DEFCTZ64 (int16_t, int8_t)                                                   \
+  DEFCTZ64 (uint8_t, uint64_t)                                                 \
+  DEFCTZ64 (uint8_t, uint32_t)                                                 \
+  DEFCTZ64 (uint8_t, uint16_t)                                                 \
+  DEFCTZ64 (uint8_t, uint8_t)                                                  \
+  DEFCTZ64 (uint8_t, int64_t)                                                  \
+  DEFCTZ64 (uint8_t, int32_t)                                                  \
+  DEFCTZ64 (uint8_t, int16_t)                                                  \
+  DEFCTZ64 (uint8_t, int8_t)                                                   \
+  DEFCTZ64 (int8_t, uint64_t)                                                  \
+  DEFCTZ64 (int8_t, uint32_t)                                                  \
+  DEFCTZ64 (int8_t, uint16_t)                                                  \
+  DEFCTZ64 (int8_t, uint8_t)                                                   \
+  DEFCTZ64 (int8_t, int64_t)                                                   \
+  DEFCTZ64 (int8_t, int32_t)                                                   \
+  DEFCTZ64 (int8_t, int16_t)                                                   \
+  DEFCTZ64 (int8_t, int8_t)                                                    \
+  DEFCTZ32 (uint64_t, uint64_t)                                                \
+  DEFCTZ32 (uint64_t, uint32_t)                                                \
+  DEFCTZ32 (uint64_t, uint16_t)                                                \
+  DEFCTZ32 (uint64_t, uint8_t)                                                 \
+  DEFCTZ32 (uint64_t, int64_t)                                                 \
+  DEFCTZ32 (uint64_t, int32_t)                                                 \
+  DEFCTZ32 (uint64_t, int16_t)                                                 \
+  DEFCTZ32 (uint64_t, int8_t)                                                  \
+  DEFCTZ32 (int64_t, uint64_t)                                                 \
+  DEFCTZ32 (int64_t, uint32_t)                                                 \
+  DEFCTZ32 (int64_t, uint16_t)                                                 \
+  DEFCTZ32 (int64_t, uint8_t)                                                  \
+  DEFCTZ32 (int64_t, int64_t)                                                  \
+  DEFCTZ32 (int64_t, int32_t)                                                  \
+  DEFCTZ32 (int64_t, int16_t)                                                  \
+  DEFCTZ32 (int64_t, int8_t)                                                   \
+  DEFCTZ32 (uint32_t, uint64_t)                                                \
+  DEFCTZ32 (uint32_t, uint32_t)                                                \
+  DEFCTZ32 (uint32_t, uint16_t)                                                \
+  DEFCTZ32 (uint32_t, uint8_t)                                                 \
+  DEFCTZ32 (uint32_t, int64_t)                                                 \
+  DEFCTZ32 (uint32_t, int32_t)                                                 \
+  DEFCTZ32 (uint32_t, int16_t)                                                 \
+  DEFCTZ32 (uint32_t, int8_t)                                                  \
+  DEFCTZ32 (int32_t, uint64_t)                                                 \
+  DEFCTZ32 (int32_t, uint32_t)                                                 \
+  DEFCTZ32 (int32_t, uint16_t)                                                 \
+  DEFCTZ32 (int32_t, uint8_t)                                                  \
+  DEFCTZ32 (int32_t, int64_t)                                                  \
+  DEFCTZ32 (int32_t, int32_t)                                                  \
+  DEFCTZ32 (int32_t, int16_t)                                                  \
+  DEFCTZ32 (int32_t, int8_t)                                                   \
+  DEFCTZ32 (uint16_t, uint64_t)                                                \
+  DEFCTZ32 (uint16_t, uint32_t)                                                \
+  DEFCTZ32 (uint16_t, uint16_t)                                                \
+  DEFCTZ32 (uint16_t, uint8_t)                                                 \
+  DEFCTZ32 (uint16_t, int64_t)                                                 \
+  DEFCTZ32 (uint16_t, int32_t)                                                 \
+  DEFCTZ32 (uint16_t, int16_t)                                                 \
+  DEFCTZ32 (uint16_t, int8_t)                                                  \
+  DEFCTZ32 (int16_t, uint64_t)                                                 \
+  DEFCTZ32 (int16_t, uint32_t)                                                 \
+  DEFCTZ32 (int16_t, uint16_t)                                                 \
+  DEFCTZ32 (int16_t, uint8_t)                                                  \
+  DEFCTZ32 (int16_t, int64_t)                                                  \
+  DEFCTZ32 (int16_t, int32_t)                                                  \
+  DEFCTZ32 (int16_t, int16_t)                                                  \
+  DEFCTZ32 (int16_t, int8_t)                                                   \
+  DEFCTZ32 (uint8_t, uint64_t)                                                 \
+  DEFCTZ32 (uint8_t, uint32_t)                                                 \
+  DEFCTZ32 (uint8_t, uint16_t)                                                 \
+  DEFCTZ32 (uint8_t, uint8_t)                                                  \
+  DEFCTZ32 (uint8_t, int64_t)                                                  \
+  DEFCTZ32 (uint8_t, int32_t)                                                  \
+  DEFCTZ32 (uint8_t, int16_t)                                                  \
+  DEFCTZ32 (uint8_t, int8_t)                                                   \
+  DEFCTZ32 (int8_t, uint64_t)                                                  \
+  DEFCTZ32 (int8_t, uint32_t)                                                  \
+  DEFCTZ32 (int8_t, uint16_t)                                                  \
+  DEFCTZ32 (int8_t, uint8_t)                                                   \
+  DEFCTZ32 (int8_t, int64_t)                                                   \
+  DEFCTZ32 (int8_t, int32_t)                                                   \
+  DEFCTZ32 (int8_t, int16_t)                                                   \
+  DEFCTZ32 (int8_t, int8_t)                                                    \
+  DEFFFS64 (uint64_t, uint64_t)                                                \
+  DEFFFS64 (uint64_t, uint32_t)                                                \
+  DEFFFS64 (uint64_t, uint16_t)                                                \
+  DEFFFS64 (uint64_t, uint8_t)                                                 \
+  DEFFFS64 (uint64_t, int64_t)                                                 \
+  DEFFFS64 (uint64_t, int32_t)                                                 \
+  DEFFFS64 (uint64_t, int16_t)                                                 \
+  DEFFFS64 (uint64_t, int8_t)                                                  \
+  DEFFFS64 (int64_t, uint64_t)                                                 \
+  DEFFFS64 (int64_t, uint32_t)                                                 \
+  DEFFFS64 (int64_t, uint16_t)                                                 \
+  DEFFFS64 (int64_t, uint8_t)                                                  \
+  DEFFFS64 (int64_t, int64_t)                                                  \
+  DEFFFS64 (int64_t, int32_t)                                                  \
+  DEFFFS64 (int64_t, int16_t)                                                  \
+  DEFFFS64 (int64_t, int8_t)                                                   \
+  DEFFFS64 (uint32_t, uint64_t)                                                \
+  DEFFFS64 (uint32_t, uint32_t)                                                \
+  DEFFFS64 (uint32_t, uint16_t)                                                \
+  DEFFFS64 (uint32_t, uint8_t)                                                 \
+  DEFFFS64 (uint32_t, int64_t)                                                 \
+  DEFFFS64 (uint32_t, int32_t)                                                 \
+  DEFFFS64 (uint32_t, int16_t)                                                 \
+  DEFFFS64 (uint32_t, int8_t)                                                  \
+  DEFFFS64 (int32_t, uint64_t)                                                 \
+  DEFFFS64 (int32_t, uint32_t)                                                 \
+  DEFFFS64 (int32_t, uint16_t)                                                 \
+  DEFFFS64 (int32_t, uint8_t)                                                  \
+  DEFFFS64 (int32_t, int64_t)                                                  \
+  DEFFFS64 (int32_t, int32_t)                                                  \
+  DEFFFS64 (int32_t, int16_t)                                                  \
+  DEFFFS64 (int32_t, int8_t)                                                   \
+  DEFFFS64 (uint16_t, uint64_t)                                                \
+  DEFFFS64 (uint16_t, uint32_t)                                                \
+  DEFFFS64 (uint16_t, uint16_t)                                                \
+  DEFFFS64 (uint16_t, uint8_t)                                                 \
+  DEFFFS64 (uint16_t, int64_t)                                                 \
+  DEFFFS64 (uint16_t, int32_t)                                                 \
+  DEFFFS64 (uint16_t, int16_t)                                                 \
+  DEFFFS64 (uint16_t, int8_t)                                                  \
+  DEFFFS64 (int16_t, uint64_t)                                                 \
+  DEFFFS64 (int16_t, uint32_t)                                                 \
+  DEFFFS64 (int16_t, uint16_t)                                                 \
+  DEFFFS64 (int16_t, uint8_t)                                                  \
+  DEFFFS64 (int16_t, int64_t)                                                  \
+  DEFFFS64 (int16_t, int32_t)                                                  \
+  DEFFFS64 (int16_t, int16_t)                                                  \
+  DEFFFS64 (int16_t, int8_t)                                                   \
+  DEFFFS64 (uint8_t, uint64_t)                                                 \
+  DEFFFS64 (uint8_t, uint32_t)                                                 \
+  DEFFFS64 (uint8_t, uint16_t)                                                 \
+  DEFFFS64 (uint8_t, uint8_t)                                                  \
+  DEFFFS64 (uint8_t, int64_t)                                                  \
+  DEFFFS64 (uint8_t, int32_t)                                                  \
+  DEFFFS64 (uint8_t, int16_t)                                                  \
+  DEFFFS64 (uint8_t, int8_t)                                                   \
+  DEFFFS64 (int8_t, uint64_t)                                                  \
+  DEFFFS64 (int8_t, uint32_t)                                                  \
+  DEFFFS64 (int8_t, uint16_t)                                                  \
+  DEFFFS64 (int8_t, uint8_t)                                                   \
+  DEFFFS64 (int8_t, int64_t)                                                   \
+  DEFFFS64 (int8_t, int32_t)                                                   \
+  DEFFFS64 (int8_t, int16_t)                                                   \
+  DEFFFS64 (int8_t, int8_t)                                                    \
+  DEFFFS32 (uint64_t, uint64_t)                                                \
+  DEFFFS32 (uint64_t, uint32_t)                                                \
+  DEFFFS32 (uint64_t, uint16_t)                                                \
+  DEFFFS32 (uint64_t, uint8_t)                                                 \
+  DEFFFS32 (uint64_t, int64_t)                                                 \
+  DEFFFS32 (uint64_t, int32_t)                                                 \
+  DEFFFS32 (uint64_t, int16_t)                                                 \
+  DEFFFS32 (uint64_t, int8_t)                                                  \
+  DEFFFS32 (int64_t, uint64_t)                                                 \
+  DEFFFS32 (int64_t, uint32_t)                                                 \
+  DEFFFS32 (int64_t, uint16_t)                                                 \
+  DEFFFS32 (int64_t, uint8_t)                                                  \
+  DEFFFS32 (int64_t, int64_t)                                                  \
+  DEFFFS32 (int64_t, int32_t)                                                  \
+  DEFFFS32 (int64_t, int16_t)                                                  \
+  DEFFFS32 (int64_t, int8_t)                                                   \
+  DEFFFS32 (uint32_t, uint64_t)                                                \
+  DEFFFS32 (uint32_t, uint32_t)                                                \
+  DEFFFS32 (uint32_t, uint16_t)                                                \
+  DEFFFS32 (uint32_t, uint8_t)                                                 \
+  DEFFFS32 (uint32_t, int64_t)                                                 \
+  DEFFFS32 (uint32_t, int32_t)                                                 \
+  DEFFFS32 (uint32_t, int16_t)                                                 \
+  DEFFFS32 (uint32_t, int8_t)                                                  \
+  DEFFFS32 (int32_t, uint64_t)                                                 \
+  DEFFFS32 (int32_t, uint32_t)                                                 \
+  DEFFFS32 (int32_t, uint16_t)                                                 \
+  DEFFFS32 (int32_t, uint8_t)                                                  \
+  DEFFFS32 (int32_t, int64_t)                                                  \
+  DEFFFS32 (int32_t, int32_t)                                                  \
+  DEFFFS32 (int32_t, int16_t)                                                  \
+  DEFFFS32 (int32_t, int8_t)                                                   \
+  DEFFFS32 (uint16_t, uint64_t)                                                \
+  DEFFFS32 (uint16_t, uint32_t)                                                \
+  DEFFFS32 (uint16_t, uint16_t)                                                \
+  DEFFFS32 (uint16_t, uint8_t)                                                 \
+  DEFFFS32 (uint16_t, int64_t)                                                 \
+  DEFFFS32 (uint16_t, int32_t)                                                 \
+  DEFFFS32 (uint16_t, int16_t)                                                 \
+  DEFFFS32 (uint16_t, int8_t)                                                  \
+  DEFFFS32 (int16_t, uint64_t)                                                 \
+  DEFFFS32 (int16_t, uint32_t)                                                 \
+  DEFFFS32 (int16_t, uint16_t)                                                 \
+  DEFFFS32 (int16_t, uint8_t)                                                  \
+  DEFFFS32 (int16_t, int64_t)                                                  \
+  DEFFFS32 (int16_t, int32_t)                                                  \
+  DEFFFS32 (int16_t, int16_t)                                                  \
+  DEFFFS32 (int16_t, int8_t)                                                   \
+  DEFFFS32 (uint8_t, uint64_t)                                                 \
+  DEFFFS32 (uint8_t, uint32_t)                                                 \
+  DEFFFS32 (uint8_t, uint16_t)                                                 \
+  DEFFFS32 (uint8_t, uint8_t)                                                  \
+  DEFFFS32 (uint8_t, int64_t)                                                  \
+  DEFFFS32 (uint8_t, int32_t)                                                  \
+  DEFFFS32 (uint8_t, int16_t)                                                  \
+  DEFFFS32 (uint8_t, int8_t)                                                   \
+  DEFFFS32 (int8_t, uint64_t)                                                  \
+  DEFFFS32 (int8_t, uint32_t)                                                  \
+  DEFFFS32 (int8_t, uint16_t)                                                  \
+  DEFFFS32 (int8_t, uint8_t)                                                   \
+  DEFFFS32 (int8_t, int64_t)                                                   \
+  DEFFFS32 (int8_t, int32_t)                                                   \
+  DEFFFS32 (int8_t, int16_t)                                                   \
+  DEFFFS32 (int8_t, int8_t)
+
+DEF_ALL ()
+
+#define SZ 512
+
+#define TEST64(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test64_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	long long ia = i + 1;                                                  \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST64N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test64n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	long long ia = i + 1;                                                  \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST32(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test32_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TEST32N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test32n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TESTCTZ64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	long long ia = i + 1;                                                  \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	long long ia = i + 1;                                                  \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTCTZ32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	long long ia = i + 1;                                                  \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	long long ia = i + 1;                                                  \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TESTFFS32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TEST_ALL()                                                             \
+  TEST64 (uint64_t, uint64_t)                                                  \
+  TEST64 (uint64_t, uint32_t)                                                  \
+  TEST64 (uint64_t, uint16_t)                                                  \
+  TEST64 (uint64_t, uint8_t)                                                   \
+  TEST64 (uint64_t, int64_t)                                                   \
+  TEST64 (uint64_t, int32_t)                                                   \
+  TEST64 (uint64_t, int16_t)                                                   \
+  TEST64 (uint64_t, int8_t)                                                    \
+  TEST64N (int64_t, uint64_t)                                                  \
+  TEST64N (int64_t, uint32_t)                                                  \
+  TEST64N (int64_t, uint16_t)                                                  \
+  TEST64N (int64_t, uint8_t)                                                   \
+  TEST64N (int64_t, int64_t)                                                   \
+  TEST64N (int64_t, int32_t)                                                   \
+  TEST64N (int64_t, int16_t)                                                   \
+  TEST64N (int64_t, int8_t)                                                    \
+  TEST64 (uint32_t, uint64_t)                                                  \
+  TEST64 (uint32_t, uint32_t)                                                  \
+  TEST64 (uint32_t, uint16_t)                                                  \
+  TEST64 (uint32_t, uint8_t)                                                   \
+  TEST64 (uint32_t, int64_t)                                                   \
+  TEST64 (uint32_t, int32_t)                                                   \
+  TEST64 (uint32_t, int16_t)                                                   \
+  TEST64 (uint32_t, int8_t)                                                    \
+  TEST64N (int32_t, uint64_t)                                                  \
+  TEST64N (int32_t, uint32_t)                                                  \
+  TEST64N (int32_t, uint16_t)                                                  \
+  TEST64N (int32_t, uint8_t)                                                   \
+  TEST64N (int32_t, int64_t)                                                   \
+  TEST64N (int32_t, int32_t)                                                   \
+  TEST64N (int32_t, int16_t)                                                   \
+  TEST64N (int32_t, int8_t)                                                    \
+  TEST64 (uint16_t, uint64_t)                                                  \
+  TEST64 (uint16_t, uint32_t)                                                  \
+  TEST64 (uint16_t, uint16_t)                                                  \
+  TEST64 (uint16_t, uint8_t)                                                   \
+  TEST64 (uint16_t, int64_t)                                                   \
+  TEST64 (uint16_t, int32_t)                                                   \
+  TEST64 (uint16_t, int16_t)                                                   \
+  TEST64 (uint16_t, int8_t)                                                    \
+  TEST64N (int16_t, uint64_t)                                                  \
+  TEST64N (int16_t, uint32_t)                                                  \
+  TEST64N (int16_t, uint16_t)                                                  \
+  TEST64N (int16_t, uint8_t)                                                   \
+  TEST64N (int16_t, int64_t)                                                   \
+  TEST64N (int16_t, int32_t)                                                   \
+  TEST64N (int16_t, int16_t)                                                   \
+  TEST64N (int16_t, int8_t)                                                    \
+  TEST64 (uint8_t, uint64_t)                                                   \
+  TEST64 (uint8_t, uint32_t)                                                   \
+  TEST64 (uint8_t, uint16_t)                                                   \
+  TEST64 (uint8_t, uint8_t)                                                    \
+  TEST64 (uint8_t, int64_t)                                                    \
+  TEST64 (uint8_t, int32_t)                                                    \
+  TEST64 (uint8_t, int16_t)                                                    \
+  TEST64 (uint8_t, int8_t)                                                     \
+  TEST64N (int8_t, uint64_t)                                                   \
+  TEST64N (int8_t, uint32_t)                                                   \
+  TEST64N (int8_t, uint16_t)                                                   \
+  TEST64N (int8_t, uint8_t)                                                    \
+  TEST64N (int8_t, int64_t)                                                    \
+  TEST64N (int8_t, int32_t)                                                    \
+  TEST64N (int8_t, int16_t)                                                    \
+  TEST64N (int8_t, int8_t)                                                     \
+  TEST32 (uint64_t, uint64_t)                                                  \
+  TEST32 (uint64_t, uint32_t)                                                  \
+  TEST32 (uint64_t, uint16_t)                                                  \
+  TEST32 (uint64_t, uint8_t)                                                   \
+  TEST32 (uint64_t, int64_t)                                                   \
+  TEST32 (uint64_t, int32_t)                                                   \
+  TEST32 (uint64_t, int16_t)                                                   \
+  TEST32 (uint64_t, int8_t)                                                    \
+  TEST32N (int64_t, uint64_t)                                                  \
+  TEST32N (int64_t, uint32_t)                                                  \
+  TEST32N (int64_t, uint16_t)                                                  \
+  TEST32N (int64_t, uint8_t)                                                   \
+  TEST32N (int64_t, int64_t)                                                   \
+  TEST32N (int64_t, int32_t)                                                   \
+  TEST32N (int64_t, int16_t)                                                   \
+  TEST32N (int64_t, int8_t)                                                    \
+  TEST32 (uint32_t, uint64_t)                                                  \
+  TEST32 (uint32_t, uint32_t)                                                  \
+  TEST32 (uint32_t, uint16_t)                                                  \
+  TEST32 (uint32_t, uint8_t)                                                   \
+  TEST32 (uint32_t, int64_t)                                                   \
+  TEST32 (uint32_t, int32_t)                                                   \
+  TEST32 (uint32_t, int16_t)                                                   \
+  TEST32 (uint32_t, int8_t)                                                    \
+  TEST32N (int32_t, uint64_t)                                                  \
+  TEST32N (int32_t, uint32_t)                                                  \
+  TEST32N (int32_t, uint16_t)                                                  \
+  TEST32N (int32_t, uint8_t)                                                   \
+  TEST32N (int32_t, int64_t)                                                   \
+  TEST32N (int32_t, int32_t)                                                   \
+  TEST32N (int32_t, int16_t)                                                   \
+  TEST32N (int32_t, int8_t)                                                    \
+  TEST32 (uint16_t, uint64_t)                                                  \
+  TEST32 (uint16_t, uint32_t)                                                  \
+  TEST32 (uint16_t, uint16_t)                                                  \
+  TEST32 (uint16_t, uint8_t)                                                   \
+  TEST32 (uint16_t, int64_t)                                                   \
+  TEST32 (uint16_t, int32_t)                                                   \
+  TEST32 (uint16_t, int16_t)                                                   \
+  TEST32 (uint16_t, int8_t)                                                    \
+  TEST32N (int16_t, uint64_t)                                                  \
+  TEST32N (int16_t, uint32_t)                                                  \
+  TEST32N (int16_t, uint16_t)                                                  \
+  TEST32N (int16_t, uint8_t)                                                   \
+  TEST32N (int16_t, int64_t)                                                   \
+  TEST32N (int16_t, int32_t)                                                   \
+  TEST32N (int16_t, int16_t)                                                   \
+  TEST32N (int16_t, int8_t)                                                    \
+  TEST32 (uint8_t, uint64_t)                                                   \
+  TEST32 (uint8_t, uint32_t)                                                   \
+  TEST32 (uint8_t, uint16_t)                                                   \
+  TEST32 (uint8_t, uint8_t)                                                    \
+  TEST32 (uint8_t, int64_t)                                                    \
+  TEST32 (uint8_t, int32_t)                                                    \
+  TEST32 (uint8_t, int16_t)                                                    \
+  TEST32 (uint8_t, int8_t)                                                     \
+  TEST32N (int8_t, uint64_t)                                                   \
+  TEST32N (int8_t, uint32_t)                                                   \
+  TEST32N (int8_t, uint16_t)                                                   \
+  TEST32N (int8_t, uint8_t)                                                    \
+  TEST32N (int8_t, int64_t)                                                    \
+  TEST32N (int8_t, int32_t)                                                    \
+  TEST32N (int8_t, int16_t)                                                    \
+  TEST32N (int8_t, int8_t)                                                     \
+  TESTCTZ64 (uint64_t, uint64_t)                                               \
+  TESTCTZ64 (uint64_t, uint32_t)                                               \
+  TESTCTZ64 (uint64_t, uint16_t)                                               \
+  TESTCTZ64 (uint64_t, uint8_t)                                                \
+  TESTCTZ64 (uint64_t, int64_t)                                                \
+  TESTCTZ64 (uint64_t, int32_t)                                                \
+  TESTCTZ64 (uint64_t, int16_t)                                                \
+  TESTCTZ64 (uint64_t, int8_t)                                                 \
+  TESTCTZ64N (int64_t, uint64_t)                                               \
+  TESTCTZ64N (int64_t, uint32_t)                                               \
+  TESTCTZ64N (int64_t, uint16_t)                                               \
+  TESTCTZ64N (int64_t, uint8_t)                                                \
+  TESTCTZ64N (int64_t, int64_t)                                                \
+  TESTCTZ64N (int64_t, int32_t)                                                \
+  TESTCTZ64N (int64_t, int16_t)                                                \
+  TESTCTZ64N (int64_t, int8_t)                                                 \
+  TESTCTZ64 (uint32_t, uint64_t)                                               \
+  TESTCTZ64 (uint32_t, uint32_t)                                               \
+  TESTCTZ64 (uint32_t, uint16_t)                                               \
+  TESTCTZ64 (uint32_t, uint8_t)                                                \
+  TESTCTZ64 (uint32_t, int64_t)                                                \
+  TESTCTZ64 (uint32_t, int32_t)                                                \
+  TESTCTZ64 (uint32_t, int16_t)                                                \
+  TESTCTZ64 (uint32_t, int8_t)                                                 \
+  TESTCTZ64N (int32_t, uint64_t)                                               \
+  TESTCTZ64N (int32_t, uint32_t)                                               \
+  TESTCTZ64N (int32_t, uint16_t)                                               \
+  TESTCTZ64N (int32_t, uint8_t)                                                \
+  TESTCTZ64N (int32_t, int64_t)                                                \
+  TESTCTZ64N (int32_t, int32_t)                                                \
+  TESTCTZ64N (int32_t, int16_t)                                                \
+  TESTCTZ64N (int32_t, int8_t)                                                 \
+  TESTCTZ64 (uint16_t, uint64_t)                                               \
+  TESTCTZ64 (uint16_t, uint32_t)                                               \
+  TESTCTZ64 (uint16_t, uint16_t)                                               \
+  TESTCTZ64 (uint16_t, uint8_t)                                                \
+  TESTCTZ64 (uint16_t, int64_t)                                                \
+  TESTCTZ64 (uint16_t, int32_t)                                                \
+  TESTCTZ64 (uint16_t, int16_t)                                                \
+  TESTCTZ64 (uint16_t, int8_t)                                                 \
+  TESTCTZ64N (int16_t, uint64_t)                                               \
+  TESTCTZ64N (int16_t, uint32_t)                                               \
+  TESTCTZ64N (int16_t, uint16_t)                                               \
+  TESTCTZ64N (int16_t, uint8_t)                                                \
+  TESTCTZ64N (int16_t, int64_t)                                                \
+  TESTCTZ64N (int16_t, int32_t)                                                \
+  TESTCTZ64N (int16_t, int16_t)                                                \
+  TESTCTZ64N (int16_t, int8_t)                                                 \
+  TESTCTZ64 (uint8_t, uint64_t)                                                \
+  TESTCTZ64 (uint8_t, uint32_t)                                                \
+  TESTCTZ64 (uint8_t, uint16_t)                                                \
+  TESTCTZ64 (uint8_t, uint8_t)                                                 \
+  TESTCTZ64 (uint8_t, int64_t)                                                 \
+  TESTCTZ64 (uint8_t, int32_t)                                                 \
+  TESTCTZ64 (uint8_t, int16_t)                                                 \
+  TESTCTZ64 (uint8_t, int8_t)                                                  \
+  TESTCTZ64N (int8_t, uint64_t)                                                \
+  TESTCTZ64N (int8_t, uint32_t)                                                \
+  TESTCTZ64N (int8_t, uint16_t)                                                \
+  TESTCTZ64N (int8_t, uint8_t)                                                 \
+  TESTCTZ64N (int8_t, int64_t)                                                 \
+  TESTCTZ64N (int8_t, int32_t)                                                 \
+  TESTCTZ64N (int8_t, int16_t)                                                 \
+  TESTCTZ64N (int8_t, int8_t)                                                  \
+  TESTCTZ32 (uint64_t, uint64_t)                                               \
+  TESTCTZ32 (uint64_t, uint32_t)                                               \
+  TESTCTZ32 (uint64_t, uint16_t)                                               \
+  TESTCTZ32 (uint64_t, uint8_t)                                                \
+  TESTCTZ32 (uint64_t, int64_t)                                                \
+  TESTCTZ32 (uint64_t, int32_t)                                                \
+  TESTCTZ32 (uint64_t, int16_t)                                                \
+  TESTCTZ32 (uint64_t, int8_t)                                                 \
+  TESTCTZ32N (int64_t, uint64_t)                                               \
+  TESTCTZ32N (int64_t, uint32_t)                                               \
+  TESTCTZ32N (int64_t, uint16_t)                                               \
+  TESTCTZ32N (int64_t, uint8_t)                                                \
+  TESTCTZ32N (int64_t, int64_t)                                                \
+  TESTCTZ32N (int64_t, int32_t)                                                \
+  TESTCTZ32N (int64_t, int16_t)                                                \
+  TESTCTZ32N (int64_t, int8_t)                                                 \
+  TESTCTZ32 (uint32_t, uint64_t)                                               \
+  TESTCTZ32 (uint32_t, uint32_t)                                               \
+  TESTCTZ32 (uint32_t, uint16_t)                                               \
+  TESTCTZ32 (uint32_t, uint8_t)                                                \
+  TESTCTZ32 (uint32_t, int64_t)                                                \
+  TESTCTZ32 (uint32_t, int32_t)                                                \
+  TESTCTZ32 (uint32_t, int16_t)                                                \
+  TESTCTZ32 (uint32_t, int8_t)                                                 \
+  TESTCTZ32N (int32_t, uint64_t)                                               \
+  TESTCTZ32N (int32_t, uint32_t)                                               \
+  TESTCTZ32N (int32_t, uint16_t)                                               \
+  TESTCTZ32N (int32_t, uint8_t)                                                \
+  TESTCTZ32N (int32_t, int64_t)                                                \
+  TESTCTZ32N (int32_t, int32_t)                                                \
+  TESTCTZ32N (int32_t, int16_t)                                                \
+  TESTCTZ32N (int32_t, int8_t)                                                 \
+  TESTCTZ32 (uint16_t, uint64_t)                                               \
+  TESTCTZ32 (uint16_t, uint32_t)                                               \
+  TESTCTZ32 (uint16_t, uint16_t)                                               \
+  TESTCTZ32 (uint16_t, uint8_t)                                                \
+  TESTCTZ32 (uint16_t, int64_t)                                                \
+  TESTCTZ32 (uint16_t, int32_t)                                                \
+  TESTCTZ32 (uint16_t, int16_t)                                                \
+  TESTCTZ32 (uint16_t, int8_t)                                                 \
+  TESTCTZ32N (int16_t, uint64_t)                                               \
+  TESTCTZ32N (int16_t, uint32_t)                                               \
+  TESTCTZ32N (int16_t, uint16_t)                                               \
+  TESTCTZ32N (int16_t, uint8_t)                                                \
+  TESTCTZ32N (int16_t, int64_t)                                                \
+  TESTCTZ32N (int16_t, int32_t)                                                \
+  TESTCTZ32N (int16_t, int16_t)                                                \
+  TESTCTZ32N (int16_t, int8_t)                                                 \
+  TESTCTZ32 (uint8_t, uint64_t)                                                \
+  TESTCTZ32 (uint8_t, uint32_t)                                                \
+  TESTCTZ32 (uint8_t, uint16_t)                                                \
+  TESTCTZ32 (uint8_t, uint8_t)                                                 \
+  TESTCTZ32 (uint8_t, int64_t)                                                 \
+  TESTCTZ32 (uint8_t, int32_t)                                                 \
+  TESTCTZ32 (uint8_t, int16_t)                                                 \
+  TESTCTZ32 (uint8_t, int8_t)                                                  \
+  TESTCTZ32N (int8_t, uint64_t)                                                \
+  TESTCTZ32N (int8_t, uint32_t)                                                \
+  TESTCTZ32N (int8_t, uint16_t)                                                \
+  TESTCTZ32N (int8_t, uint8_t)                                                 \
+  TESTCTZ32N (int8_t, int64_t)                                                 \
+  TESTCTZ32N (int8_t, int32_t)                                                 \
+  TESTCTZ32N (int8_t, int16_t)                                                 \
+  TESTCTZ32N (int8_t, int8_t)                                                  \
+  TESTFFS64 (uint64_t, uint64_t)                                               \
+  TESTFFS64 (uint64_t, uint32_t)                                               \
+  TESTFFS64 (uint64_t, uint16_t)                                               \
+  TESTFFS64 (uint64_t, uint8_t)                                                \
+  TESTFFS64 (uint64_t, int64_t)                                                \
+  TESTFFS64 (uint64_t, int32_t)                                                \
+  TESTFFS64 (uint64_t, int16_t)                                                \
+  TESTFFS64 (uint64_t, int8_t)                                                 \
+  TESTFFS64N (int64_t, uint64_t)                                               \
+  TESTFFS64N (int64_t, uint32_t)                                               \
+  TESTFFS64N (int64_t, uint16_t)                                               \
+  TESTFFS64N (int64_t, uint8_t)                                                \
+  TESTFFS64N (int64_t, int64_t)                                                \
+  TESTFFS64N (int64_t, int32_t)                                                \
+  TESTFFS64N (int64_t, int16_t)                                                \
+  TESTFFS64N (int64_t, int8_t)                                                 \
+  TESTFFS64 (uint32_t, uint64_t)                                               \
+  TESTFFS64 (uint32_t, uint32_t)                                               \
+  TESTFFS64 (uint32_t, uint16_t)                                               \
+  TESTFFS64 (uint32_t, uint8_t)                                                \
+  TESTFFS64 (uint32_t, int64_t)                                                \
+  TESTFFS64 (uint32_t, int32_t)                                                \
+  TESTFFS64 (uint32_t, int16_t)                                                \
+  TESTFFS64 (uint32_t, int8_t)                                                 \
+  TESTFFS64N (int32_t, uint64_t)                                               \
+  TESTFFS64N (int32_t, uint32_t)                                               \
+  TESTFFS64N (int32_t, uint16_t)                                               \
+  TESTFFS64N (int32_t, uint8_t)                                                \
+  TESTFFS64N (int32_t, int64_t)                                                \
+  TESTFFS64N (int32_t, int32_t)                                                \
+  TESTFFS64N (int32_t, int16_t)                                                \
+  TESTFFS64N (int32_t, int8_t)                                                 \
+  TESTFFS64 (uint16_t, uint64_t)                                               \
+  TESTFFS64 (uint16_t, uint32_t)                                               \
+  TESTFFS64 (uint16_t, uint16_t)                                               \
+  TESTFFS64 (uint16_t, uint8_t)                                                \
+  TESTFFS64 (uint16_t, int64_t)                                                \
+  TESTFFS64 (uint16_t, int32_t)                                                \
+  TESTFFS64 (uint16_t, int16_t)                                                \
+  TESTFFS64 (uint16_t, int8_t)                                                 \
+  TESTFFS64N (int16_t, uint64_t)                                               \
+  TESTFFS64N (int16_t, uint32_t)                                               \
+  TESTFFS64N (int16_t, uint16_t)                                               \
+  TESTFFS64N (int16_t, uint8_t)                                                \
+  TESTFFS64N (int16_t, int64_t)                                                \
+  TESTFFS64N (int16_t, int32_t)                                                \
+  TESTFFS64N (int16_t, int16_t)                                                \
+  TESTFFS64N (int16_t, int8_t)                                                 \
+  TESTFFS64 (uint8_t, uint64_t)                                                \
+  TESTFFS64 (uint8_t, uint32_t)                                                \
+  TESTFFS64 (uint8_t, uint16_t)                                                \
+  TESTFFS64 (uint8_t, uint8_t)                                                 \
+  TESTFFS64 (uint8_t, int64_t)                                                 \
+  TESTFFS64 (uint8_t, int32_t)                                                 \
+  TESTFFS64 (uint8_t, int16_t)                                                 \
+  TESTFFS64 (uint8_t, int8_t)                                                  \
+  TESTFFS64N (int8_t, uint64_t)                                                \
+  TESTFFS64N (int8_t, uint32_t)                                                \
+  TESTFFS64N (int8_t, uint16_t)                                                \
+  TESTFFS64N (int8_t, uint8_t)                                                 \
+  TESTFFS64N (int8_t, int64_t)                                                 \
+  TESTFFS64N (int8_t, int32_t)                                                 \
+  TESTFFS64N (int8_t, int16_t)                                                 \
+  TESTFFS64N (int8_t, int8_t)                                                  \
+  TESTFFS32 (uint64_t, uint64_t)                                               \
+  TESTFFS32 (uint64_t, uint32_t)                                               \
+  TESTFFS32 (uint64_t, uint16_t)                                               \
+  TESTFFS32 (uint64_t, uint8_t)                                                \
+  TESTFFS32 (uint64_t, int64_t)                                                \
+  TESTFFS32 (uint64_t, int32_t)                                                \
+  TESTFFS32 (uint64_t, int16_t)                                                \
+  TESTFFS32 (uint64_t, int8_t)                                                 \
+  TESTFFS32N (int64_t, uint64_t)                                               \
+  TESTFFS32N (int64_t, uint32_t)                                               \
+  TESTFFS32N (int64_t, uint16_t)                                               \
+  TESTFFS32N (int64_t, uint8_t)                                                \
+  TESTFFS32N (int64_t, int64_t)                                                \
+  TESTFFS32N (int64_t, int32_t)                                                \
+  TESTFFS32N (int64_t, int16_t)                                                \
+  TESTFFS32N (int64_t, int8_t)                                                 \
+  TESTFFS32 (uint32_t, uint64_t)                                               \
+  TESTFFS32 (uint32_t, uint32_t)                                               \
+  TESTFFS32 (uint32_t, uint16_t)                                               \
+  TESTFFS32 (uint32_t, uint8_t)                                                \
+  TESTFFS32 (uint32_t, int64_t)                                                \
+  TESTFFS32 (uint32_t, int32_t)                                                \
+  TESTFFS32 (uint32_t, int16_t)                                                \
+  TESTFFS32 (uint32_t, int8_t)                                                 \
+  TESTFFS32N (int32_t, uint64_t)                                               \
+  TESTFFS32N (int32_t, uint32_t)                                               \
+  TESTFFS32N (int32_t, uint16_t)                                               \
+  TESTFFS32N (int32_t, uint8_t)                                                \
+  TESTFFS32N (int32_t, int64_t)                                                \
+  TESTFFS32N (int32_t, int32_t)                                                \
+  TESTFFS32N (int32_t, int16_t)                                                \
+  TESTFFS32N (int32_t, int8_t)                                                 \
+  TESTFFS32 (uint16_t, uint64_t)                                               \
+  TESTFFS32 (uint16_t, uint32_t)                                               \
+  TESTFFS32 (uint16_t, uint16_t)                                               \
+  TESTFFS32 (uint16_t, uint8_t)                                                \
+  TESTFFS32 (uint16_t, int64_t)                                                \
+  TESTFFS32 (uint16_t, int32_t)                                                \
+  TESTFFS32 (uint16_t, int16_t)                                                \
+  TESTFFS32 (uint16_t, int8_t)                                                 \
+  TESTFFS32N (int16_t, uint64_t)                                               \
+  TESTFFS32N (int16_t, uint32_t)                                               \
+  TESTFFS32N (int16_t, uint16_t)                                               \
+  TESTFFS32N (int16_t, uint8_t)                                                \
+  TESTFFS32N (int16_t, int64_t)                                                \
+  TESTFFS32N (int16_t, int32_t)                                                \
+  TESTFFS32N (int16_t, int16_t)                                                \
+  TESTFFS32N (int16_t, int8_t)                                                 \
+  TESTFFS32 (uint8_t, uint64_t)                                                \
+  TESTFFS32 (uint8_t, uint32_t)                                                \
+  TESTFFS32 (uint8_t, uint16_t)                                                \
+  TESTFFS32 (uint8_t, uint8_t)                                                 \
+  TESTFFS32 (uint8_t, int64_t)                                                 \
+  TESTFFS32 (uint8_t, int32_t)                                                 \
+  TESTFFS32 (uint8_t, int16_t)                                                 \
+  TESTFFS32 (uint8_t, int8_t)                                                  \
+  TESTFFS32N (int8_t, uint64_t)                                                \
+  TESTFFS32N (int8_t, uint32_t)                                                \
+  TESTFFS32N (int8_t, uint16_t)                                                \
+  TESTFFS32N (int8_t, uint8_t)                                                 \
+  TESTFFS32N (int8_t, int64_t)                                                 \
+  TESTFFS32N (int8_t, int32_t)                                                 \
+  TESTFFS32N (int8_t, int16_t)                                                 \
+  TESTFFS32N (int8_t, int8_t)
+
+TEST_ALL ()
+
+#define RUN64(TYPEDST, TYPESRC) test64_##TYPEDST##TYPESRC ();
+#define RUN64N(TYPEDST, TYPESRC) test64n_##TYPEDST##TYPESRC ();
+#define RUN32(TYPEDST, TYPESRC) test32_##TYPEDST##TYPESRC ();
+#define RUN32N(TYPEDST, TYPESRC) test32n_##TYPEDST##TYPESRC ();
+#define RUNCTZ64(TYPEDST, TYPESRC) testctz64_##TYPEDST##TYPESRC ();
+#define RUNCTZ64N(TYPEDST, TYPESRC) testctz64n_##TYPEDST##TYPESRC ();
+#define RUNCTZ32(TYPEDST, TYPESRC) testctz32_##TYPEDST##TYPESRC ();
+#define RUNCTZ32N(TYPEDST, TYPESRC) testctz32n_##TYPEDST##TYPESRC ();
+#define RUNFFS64(TYPEDST, TYPESRC) testffs64_##TYPEDST##TYPESRC ();
+#define RUNFFS64N(TYPEDST, TYPESRC) testffs64n_##TYPEDST##TYPESRC ();
+#define RUNFFS32(TYPEDST, TYPESRC) testffs32_##TYPEDST##TYPESRC ();
+#define RUNFFS32N(TYPEDST, TYPESRC) testffs32n_##TYPEDST##TYPESRC ();
+
+#define RUN_ALL()                                                              \
+  RUN64 (uint64_t, uint64_t)                                                   \
+  RUN64 (uint64_t, uint32_t)                                                   \
+  RUN64 (uint64_t, uint16_t)                                                   \
+  RUN64 (uint64_t, uint8_t)                                                    \
+  RUN64 (uint64_t, int64_t)                                                    \
+  RUN64 (uint64_t, int32_t)                                                    \
+  RUN64 (uint64_t, int16_t)                                                    \
+  RUN64 (uint64_t, int8_t)                                                     \
+  RUN64N (int64_t, uint64_t)                                                   \
+  RUN64N (int64_t, uint32_t)                                                   \
+  RUN64N (int64_t, uint16_t)                                                   \
+  RUN64N (int64_t, uint8_t)                                                    \
+  RUN64N (int64_t, int64_t)                                                    \
+  RUN64N (int64_t, int32_t)                                                    \
+  RUN64N (int64_t, int16_t)                                                    \
+  RUN64N (int64_t, int8_t)                                                     \
+  RUN64 (uint32_t, uint64_t)                                                   \
+  RUN64 (uint32_t, uint32_t)                                                   \
+  RUN64 (uint32_t, uint16_t)                                                   \
+  RUN64 (uint32_t, uint8_t)                                                    \
+  RUN64 (uint32_t, int64_t)                                                    \
+  RUN64 (uint32_t, int32_t)                                                    \
+  RUN64 (uint32_t, int16_t)                                                    \
+  RUN64 (uint32_t, int8_t)                                                     \
+  RUN64N (int32_t, uint64_t)                                                   \
+  RUN64N (int32_t, uint32_t)                                                   \
+  RUN64N (int32_t, uint16_t)                                                   \
+  RUN64N (int32_t, uint8_t)                                                    \
+  RUN64N (int32_t, int64_t)                                                    \
+  RUN64N (int32_t, int32_t)                                                    \
+  RUN64N (int32_t, int16_t)                                                    \
+  RUN64N (int32_t, int8_t)                                                     \
+  RUN64 (uint16_t, uint64_t)                                                   \
+  RUN64 (uint16_t, uint32_t)                                                   \
+  RUN64 (uint16_t, uint16_t)                                                   \
+  RUN64 (uint16_t, uint8_t)                                                    \
+  RUN64 (uint16_t, int64_t)                                                    \
+  RUN64 (uint16_t, int32_t)                                                    \
+  RUN64 (uint16_t, int16_t)                                                    \
+  RUN64 (uint16_t, int8_t)                                                     \
+  RUN64N (int16_t, uint64_t)                                                   \
+  RUN64N (int16_t, uint32_t)                                                   \
+  RUN64N (int16_t, uint16_t)                                                   \
+  RUN64N (int16_t, uint8_t)                                                    \
+  RUN64N (int16_t, int64_t)                                                    \
+  RUN64N (int16_t, int32_t)                                                    \
+  RUN64N (int16_t, int16_t)                                                    \
+  RUN64N (int16_t, int8_t)                                                     \
+  RUN64 (uint8_t, uint64_t)                                                    \
+  RUN64 (uint8_t, uint32_t)                                                    \
+  RUN64 (uint8_t, uint16_t)                                                    \
+  RUN64 (uint8_t, uint8_t)                                                     \
+  RUN64 (uint8_t, int64_t)                                                     \
+  RUN64 (uint8_t, int32_t)                                                     \
+  RUN64 (uint8_t, int16_t)                                                     \
+  RUN64 (uint8_t, int8_t)                                                      \
+  RUN64N (int8_t, uint64_t)                                                     \
+  RUN64N (int8_t, uint32_t)                                                     \
+  RUN64N (int8_t, uint16_t)                                                     \
+  RUN64N (int8_t, uint8_t)                                                      \
+  RUN64N (int8_t, int64_t)                                                      \
+  RUN64N (int8_t, int32_t)                                                      \
+  RUN64N (int8_t, int16_t)                                                      \
+  RUN64N (int8_t, int8_t)                                                       \
+  RUN32 (uint64_t, uint64_t)                                                   \
+  RUN32 (uint64_t, uint32_t)                                                   \
+  RUN32 (uint64_t, uint16_t)                                                   \
+  RUN32 (uint64_t, uint8_t)                                                    \
+  RUN32 (uint64_t, int64_t)                                                    \
+  RUN32 (uint64_t, int32_t)                                                    \
+  RUN32 (uint64_t, int16_t)                                                    \
+  RUN32 (uint64_t, int8_t)                                                     \
+  RUN32N (int64_t, uint64_t)                                                    \
+  RUN32N (int64_t, uint32_t)                                                    \
+  RUN32N (int64_t, uint16_t)                                                    \
+  RUN32N (int64_t, uint8_t)                                                     \
+  RUN32N (int64_t, int64_t)                                                     \
+  RUN32N (int64_t, int32_t)                                                     \
+  RUN32N (int64_t, int16_t)                                                     \
+  RUN32N (int64_t, int8_t)                                                      \
+  RUN32 (uint32_t, uint64_t)                                                   \
+  RUN32 (uint32_t, uint32_t)                                                   \
+  RUN32 (uint32_t, uint16_t)                                                   \
+  RUN32 (uint32_t, uint8_t)                                                    \
+  RUN32 (uint32_t, int64_t)                                                    \
+  RUN32 (uint32_t, int32_t)                                                    \
+  RUN32 (uint32_t, int16_t)                                                    \
+  RUN32 (uint32_t, int8_t)                                                     \
+  RUN32N (int32_t, uint64_t)                                                    \
+  RUN32N (int32_t, uint32_t)                                                    \
+  RUN32N (int32_t, uint16_t)                                                    \
+  RUN32N (int32_t, uint8_t)                                                     \
+  RUN32N (int32_t, int64_t)                                                     \
+  RUN32N (int32_t, int32_t)                                                     \
+  RUN32N (int32_t, int16_t)                                                     \
+  RUN32N (int32_t, int8_t)                                                      \
+  RUN32 (uint16_t, uint64_t)                                                   \
+  RUN32 (uint16_t, uint32_t)                                                   \
+  RUN32 (uint16_t, uint16_t)                                                   \
+  RUN32 (uint16_t, uint8_t)                                                    \
+  RUN32 (uint16_t, int64_t)                                                    \
+  RUN32 (uint16_t, int32_t)                                                    \
+  RUN32 (uint16_t, int16_t)                                                    \
+  RUN32 (uint16_t, int8_t)                                                     \
+  RUN32N (int16_t, uint64_t)                                                    \
+  RUN32N (int16_t, uint32_t)                                                    \
+  RUN32N (int16_t, uint16_t)                                                    \
+  RUN32N (int16_t, uint8_t)                                                     \
+  RUN32N (int16_t, int64_t)                                                     \
+  RUN32N (int16_t, int32_t)                                                     \
+  RUN32N (int16_t, int16_t)                                                     \
+  RUN32N (int16_t, int8_t)                                                      \
+  RUN32 (uint8_t, uint64_t)                                                    \
+  RUN32 (uint8_t, uint32_t)                                                    \
+  RUN32 (uint8_t, uint16_t)                                                    \
+  RUN32 (uint8_t, uint8_t)                                                     \
+  RUN32 (uint8_t, int64_t)                                                     \
+  RUN32 (uint8_t, int32_t)                                                     \
+  RUN32 (uint8_t, int16_t)                                                     \
+  RUN32 (uint8_t, int8_t)                                                      \
+  RUN32N (int8_t, uint64_t)                                                     \
+  RUN32N (int8_t, uint32_t)                                                     \
+  RUN32N (int8_t, uint16_t)                                                     \
+  RUN32N (int8_t, uint8_t)                                                      \
+  RUN32N (int8_t, int64_t)                                                      \
+  RUN32N (int8_t, int32_t)                                                      \
+  RUN32N (int8_t, int16_t)                                                      \
+  RUN32N (int8_t, int8_t)                                                       \
+  RUNCTZ64 (uint64_t, uint64_t)                                                \
+  RUNCTZ64 (uint64_t, uint32_t)                                                \
+  RUNCTZ64 (uint64_t, uint16_t)                                                \
+  RUNCTZ64 (uint64_t, uint8_t)                                                 \
+  RUNCTZ64 (uint64_t, int64_t)                                                 \
+  RUNCTZ64 (uint64_t, int32_t)                                                 \
+  RUNCTZ64 (uint64_t, int16_t)                                                 \
+  RUNCTZ64 (uint64_t, int8_t)                                                  \
+  RUNCTZ64N (int64_t, uint64_t)                                                 \
+  RUNCTZ64N (int64_t, uint32_t)                                                 \
+  RUNCTZ64N (int64_t, uint16_t)                                                 \
+  RUNCTZ64N (int64_t, uint8_t)                                                  \
+  RUNCTZ64N (int64_t, int64_t)                                                  \
+  RUNCTZ64N (int64_t, int32_t)                                                  \
+  RUNCTZ64N (int64_t, int16_t)                                                  \
+  RUNCTZ64N (int64_t, int8_t)                                                   \
+  RUNCTZ64 (uint32_t, uint64_t)                                                \
+  RUNCTZ64 (uint32_t, uint32_t)                                                \
+  RUNCTZ64 (uint32_t, uint16_t)                                                \
+  RUNCTZ64 (uint32_t, uint8_t)                                                 \
+  RUNCTZ64 (uint32_t, int64_t)                                                 \
+  RUNCTZ64 (uint32_t, int32_t)                                                 \
+  RUNCTZ64 (uint32_t, int16_t)                                                 \
+  RUNCTZ64 (uint32_t, int8_t)                                                  \
+  RUNCTZ64N (int32_t, uint64_t)                                                 \
+  RUNCTZ64N (int32_t, uint32_t)                                                 \
+  RUNCTZ64N (int32_t, uint16_t)                                                 \
+  RUNCTZ64N (int32_t, uint8_t)                                                  \
+  RUNCTZ64N (int32_t, int64_t)                                                  \
+  RUNCTZ64N (int32_t, int32_t)                                                  \
+  RUNCTZ64N (int32_t, int16_t)                                                  \
+  RUNCTZ64N (int32_t, int8_t)                                                   \
+  RUNCTZ64 (uint16_t, uint64_t)                                                \
+  RUNCTZ64 (uint16_t, uint32_t)                                                \
+  RUNCTZ64 (uint16_t, uint16_t)                                                \
+  RUNCTZ64 (uint16_t, uint8_t)                                                 \
+  RUNCTZ64 (uint16_t, int64_t)                                                 \
+  RUNCTZ64 (uint16_t, int32_t)                                                 \
+  RUNCTZ64 (uint16_t, int16_t)                                                 \
+  RUNCTZ64 (uint16_t, int8_t)                                                  \
+  RUNCTZ64N (int16_t, uint64_t)                                                \
+  RUNCTZ64N (int16_t, uint32_t)                                                \
+  RUNCTZ64N (int16_t, uint16_t)                                                \
+  RUNCTZ64N (int16_t, uint8_t)                                                 \
+  RUNCTZ64N (int16_t, int64_t)                                                 \
+  RUNCTZ64N (int16_t, int32_t)                                                 \
+  RUNCTZ64N (int16_t, int16_t)                                                 \
+  RUNCTZ64N (int16_t, int8_t)                                                  \
+  RUNCTZ64 (uint8_t, uint64_t)                                                 \
+  RUNCTZ64 (uint8_t, uint32_t)                                                 \
+  RUNCTZ64 (uint8_t, uint16_t)                                                 \
+  RUNCTZ64 (uint8_t, uint8_t)                                                  \
+  RUNCTZ64 (uint8_t, int64_t)                                                  \
+  RUNCTZ64 (uint8_t, int32_t)                                                  \
+  RUNCTZ64 (uint8_t, int16_t)                                                  \
+  RUNCTZ64 (uint8_t, int8_t)                                                   \
+  RUNCTZ64N (int8_t, uint64_t)                                                 \
+  RUNCTZ64N (int8_t, uint32_t)                                                 \
+  RUNCTZ64N (int8_t, uint16_t)                                                 \
+  RUNCTZ64N (int8_t, uint8_t)                                                  \
+  RUNCTZ64N (int8_t, int64_t)                                                  \
+  RUNCTZ64N (int8_t, int32_t)                                                  \
+  RUNCTZ64N (int8_t, int16_t)                                                  \
+  RUNCTZ64N (int8_t, int8_t)                                                   \
+  RUNCTZ32 (uint64_t, uint64_t)                                                \
+  RUNCTZ32 (uint64_t, uint32_t)                                                \
+  RUNCTZ32 (uint64_t, uint16_t)                                                \
+  RUNCTZ32 (uint64_t, uint8_t)                                                 \
+  RUNCTZ32 (uint64_t, int64_t)                                                 \
+  RUNCTZ32 (uint64_t, int32_t)                                                 \
+  RUNCTZ32 (uint64_t, int16_t)                                                 \
+  RUNCTZ32 (uint64_t, int8_t)                                                  \
+  RUNCTZ32N (int64_t, uint64_t)                                                \
+  RUNCTZ32N (int64_t, uint32_t)                                                \
+  RUNCTZ32N (int64_t, uint16_t)                                                \
+  RUNCTZ32N (int64_t, uint8_t)                                                 \
+  RUNCTZ32N (int64_t, int64_t)                                                 \
+  RUNCTZ32N (int64_t, int32_t)                                                 \
+  RUNCTZ32N (int64_t, int16_t)                                                 \
+  RUNCTZ32N (int64_t, int8_t)                                                  \
+  RUNCTZ32 (uint32_t, uint64_t)                                                \
+  RUNCTZ32 (uint32_t, uint32_t)                                                \
+  RUNCTZ32 (uint32_t, uint16_t)                                                \
+  RUNCTZ32 (uint32_t, uint8_t)                                                 \
+  RUNCTZ32 (uint32_t, int64_t)                                                 \
+  RUNCTZ32 (uint32_t, int32_t)                                                 \
+  RUNCTZ32 (uint32_t, int16_t)                                                 \
+  RUNCTZ32 (uint32_t, int8_t)                                                  \
+  RUNCTZ32N (int32_t, uint64_t)                                                \
+  RUNCTZ32N (int32_t, uint32_t)                                                \
+  RUNCTZ32N (int32_t, uint16_t)                                                \
+  RUNCTZ32N (int32_t, uint8_t)                                                 \
+  RUNCTZ32N (int32_t, int64_t)                                                 \
+  RUNCTZ32N (int32_t, int32_t)                                                 \
+  RUNCTZ32N (int32_t, int16_t)                                                 \
+  RUNCTZ32N (int32_t, int8_t)                                                  \
+  RUNCTZ32 (uint16_t, uint64_t)                                                \
+  RUNCTZ32 (uint16_t, uint32_t)                                                \
+  RUNCTZ32 (uint16_t, uint16_t)                                                \
+  RUNCTZ32 (uint16_t, uint8_t)                                                 \
+  RUNCTZ32 (uint16_t, int64_t)                                                 \
+  RUNCTZ32 (uint16_t, int32_t)                                                 \
+  RUNCTZ32 (uint16_t, int16_t)                                                 \
+  RUNCTZ32 (uint16_t, int8_t)                                                  \
+  RUNCTZ32N (int16_t, uint64_t)                                                \
+  RUNCTZ32N (int16_t, uint32_t)                                                \
+  RUNCTZ32N (int16_t, uint16_t)                                                \
+  RUNCTZ32N (int16_t, uint8_t)                                                 \
+  RUNCTZ32N (int16_t, int64_t)                                                 \
+  RUNCTZ32N (int16_t, int32_t)                                                 \
+  RUNCTZ32N (int16_t, int16_t)                                                 \
+  RUNCTZ32N (int16_t, int8_t)                                                  \
+  RUNCTZ32 (uint8_t, uint64_t)                                                 \
+  RUNCTZ32 (uint8_t, uint32_t)                                                 \
+  RUNCTZ32 (uint8_t, uint16_t)                                                 \
+  RUNCTZ32 (uint8_t, uint8_t)                                                  \
+  RUNCTZ32 (uint8_t, int64_t)                                                  \
+  RUNCTZ32 (uint8_t, int32_t)                                                  \
+  RUNCTZ32 (uint8_t, int16_t)                                                  \
+  RUNCTZ32 (uint8_t, int8_t)                                                   \
+  RUNCTZ32N (int8_t, uint64_t)                                                 \
+  RUNCTZ32N (int8_t, uint32_t)                                                 \
+  RUNCTZ32N (int8_t, uint16_t)                                                 \
+  RUNCTZ32N (int8_t, uint8_t)                                                  \
+  RUNCTZ32N (int8_t, int64_t)                                                  \
+  RUNCTZ32N (int8_t, int32_t)                                                  \
+  RUNCTZ32N (int8_t, int16_t)                                                  \
+  RUNCTZ32N (int8_t, int8_t)                                                   \
+  RUNFFS64 (uint64_t, uint64_t)                                                \
+  RUNFFS64 (uint64_t, uint32_t)                                                \
+  RUNFFS64 (uint64_t, uint16_t)                                                \
+  RUNFFS64 (uint64_t, uint8_t)                                                 \
+  RUNFFS64 (uint64_t, int64_t)                                                 \
+  RUNFFS64 (uint64_t, int32_t)                                                 \
+  RUNFFS64 (uint64_t, int16_t)                                                 \
+  RUNFFS64 (uint64_t, int8_t)                                                  \
+  RUNFFS64N (int64_t, uint64_t)                                                \
+  RUNFFS64N (int64_t, uint32_t)                                                \
+  RUNFFS64N (int64_t, uint16_t)                                                \
+  RUNFFS64N (int64_t, uint8_t)                                                 \
+  RUNFFS64N (int64_t, int64_t)                                                 \
+  RUNFFS64N (int64_t, int32_t)                                                 \
+  RUNFFS64N (int64_t, int16_t)                                                 \
+  RUNFFS64N (int64_t, int8_t)                                                  \
+  RUNFFS64 (uint32_t, uint64_t)                                                \
+  RUNFFS64 (uint32_t, uint32_t)                                                \
+  RUNFFS64 (uint32_t, uint16_t)                                                \
+  RUNFFS64 (uint32_t, uint8_t)                                                 \
+  RUNFFS64 (uint32_t, int64_t)                                                 \
+  RUNFFS64 (uint32_t, int32_t)                                                 \
+  RUNFFS64 (uint32_t, int16_t)                                                 \
+  RUNFFS64 (uint32_t, int8_t)                                                  \
+  RUNFFS64N (int32_t, uint64_t)                                                \
+  RUNFFS64N (int32_t, uint32_t)                                                \
+  RUNFFS64N (int32_t, uint16_t)                                                \
+  RUNFFS64N (int32_t, uint8_t)                                                 \
+  RUNFFS64N (int32_t, int64_t)                                                 \
+  RUNFFS64N (int32_t, int32_t)                                                 \
+  RUNFFS64N (int32_t, int16_t)                                                 \
+  RUNFFS64N (int32_t, int8_t)                                                  \
+  RUNFFS64 (uint16_t, uint64_t)                                                \
+  RUNFFS64 (uint16_t, uint32_t)                                                \
+  RUNFFS64 (uint16_t, uint16_t)                                                \
+  RUNFFS64 (uint16_t, uint8_t)                                                 \
+  RUNFFS64 (uint16_t, int64_t)                                                 \
+  RUNFFS64 (uint16_t, int32_t)                                                 \
+  RUNFFS64 (uint16_t, int16_t)                                                 \
+  RUNFFS64 (uint16_t, int8_t)                                                  \
+  RUNFFS64N (int16_t, uint64_t)                                                \
+  RUNFFS64N (int16_t, uint32_t)                                                \
+  RUNFFS64N (int16_t, uint16_t)                                                \
+  RUNFFS64N (int16_t, uint8_t)                                                 \
+  RUNFFS64N (int16_t, int64_t)                                                 \
+  RUNFFS64N (int16_t, int32_t)                                                 \
+  RUNFFS64N (int16_t, int16_t)                                                 \
+  RUNFFS64N (int16_t, int8_t)                                                  \
+  RUNFFS64 (uint8_t, uint64_t)                                                 \
+  RUNFFS64 (uint8_t, uint32_t)                                                 \
+  RUNFFS64 (uint8_t, uint16_t)                                                 \
+  RUNFFS64 (uint8_t, uint8_t)                                                  \
+  RUNFFS64 (uint8_t, int64_t)                                                  \
+  RUNFFS64 (uint8_t, int32_t)                                                  \
+  RUNFFS64 (uint8_t, int16_t)                                                  \
+  RUNFFS64 (uint8_t, int8_t)                                                   \
+  RUNFFS64N (int8_t, uint64_t)                                                 \
+  RUNFFS64N (int8_t, uint32_t)                                                 \
+  RUNFFS64N (int8_t, uint16_t)                                                 \
+  RUNFFS64N (int8_t, uint8_t)                                                  \
+  RUNFFS64N (int8_t, int64_t)                                                  \
+  RUNFFS64N (int8_t, int32_t)                                                  \
+  RUNFFS64N (int8_t, int16_t)                                                  \
+  RUNFFS64N (int8_t, int8_t)                                                   \
+  RUNFFS32 (uint64_t, uint64_t)                                                \
+  RUNFFS32 (uint64_t, uint32_t)                                                \
+  RUNFFS32 (uint64_t, uint16_t)                                                \
+  RUNFFS32 (uint64_t, uint8_t)                                                 \
+  RUNFFS32 (uint64_t, int64_t)                                                 \
+  RUNFFS32 (uint64_t, int32_t)                                                 \
+  RUNFFS32 (uint64_t, int16_t)                                                 \
+  RUNFFS32 (uint64_t, int8_t)                                                  \
+  RUNFFS32N (int64_t, uint64_t)                                                \
+  RUNFFS32N (int64_t, uint32_t)                                                \
+  RUNFFS32N (int64_t, uint16_t)                                                \
+  RUNFFS32N (int64_t, uint8_t)                                                 \
+  RUNFFS32N (int64_t, int64_t)                                                 \
+  RUNFFS32N (int64_t, int32_t)                                                 \
+  RUNFFS32N (int64_t, int16_t)                                                 \
+  RUNFFS32N (int64_t, int8_t)                                                  \
+  RUNFFS32 (uint32_t, uint64_t)                                                \
+  RUNFFS32 (uint32_t, uint32_t)                                                \
+  RUNFFS32 (uint32_t, uint16_t)                                                \
+  RUNFFS32 (uint32_t, uint8_t)                                                 \
+  RUNFFS32 (uint32_t, int64_t)                                                 \
+  RUNFFS32 (uint32_t, int32_t)                                                 \
+  RUNFFS32 (uint32_t, int16_t)                                                 \
+  RUNFFS32 (uint32_t, int8_t)                                                  \
+  RUNFFS32N (int32_t, uint64_t)                                                \
+  RUNFFS32N (int32_t, uint32_t)                                                \
+  RUNFFS32N (int32_t, uint16_t)                                                \
+  RUNFFS32N (int32_t, uint8_t)                                                 \
+  RUNFFS32N (int32_t, int64_t)                                                 \
+  RUNFFS32N (int32_t, int32_t)                                                 \
+  RUNFFS32N (int32_t, int16_t)                                                 \
+  RUNFFS32N (int32_t, int8_t)                                                  \
+  RUNFFS32 (uint16_t, uint64_t)                                                \
+  RUNFFS32 (uint16_t, uint32_t)                                                \
+  RUNFFS32 (uint16_t, uint16_t)                                                \
+  RUNFFS32 (uint16_t, uint8_t)                                                 \
+  RUNFFS32 (uint16_t, int64_t)                                                 \
+  RUNFFS32 (uint16_t, int32_t)                                                 \
+  RUNFFS32 (uint16_t, int16_t)                                                 \
+  RUNFFS32 (uint16_t, int8_t)                                                  \
+  RUNFFS32N (int16_t, uint64_t)                                                \
+  RUNFFS32N (int16_t, uint32_t)                                                \
+  RUNFFS32N (int16_t, uint16_t)                                                \
+  RUNFFS32N (int16_t, uint8_t)                                                 \
+  RUNFFS32N (int16_t, int64_t)                                                 \
+  RUNFFS32N (int16_t, int32_t)                                                 \
+  RUNFFS32N (int16_t, int16_t)                                                 \
+  RUNFFS32N (int16_t, int8_t)                                                  \
+  RUNFFS32 (uint8_t, uint64_t)                                                 \
+  RUNFFS32 (uint8_t, uint32_t)                                                 \
+  RUNFFS32 (uint8_t, uint16_t)                                                 \
+  RUNFFS32 (uint8_t, uint8_t)                                                  \
+  RUNFFS32 (uint8_t, int64_t)                                                  \
+  RUNFFS32 (uint8_t, int32_t)                                                  \
+  RUNFFS32 (uint8_t, int16_t)                                                  \
+  RUNFFS32 (uint8_t, int8_t)                                                   \
+  RUNFFS32N (int8_t, uint64_t)                                                 \
+  RUNFFS32N (int8_t, uint32_t)                                                 \
+  RUNFFS32N (int8_t, uint16_t)                                                 \
+  RUNFFS32N (int8_t, uint8_t)                                                  \
+  RUNFFS32N (int8_t, int64_t)                                                  \
+  RUNFFS32N (int8_t, int32_t)                                                  \
+  RUNFFS32N (int8_t, int16_t)                                                  \
+  RUNFFS32N (int8_t, int8_t)
+
+int
+main ()
+{
+  RUN_ALL ()
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 229 "vect" } } */
-- 
2.41.0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] RISC-V: Add popcount fallback expander.
  2023-10-18  9:20 [PATCH] RISC-V: Add popcount fallback expander Robin Dapp
@ 2023-10-18  9:28 ` juzhe.zhong
  2023-10-18  9:32   ` Robin Dapp
  2023-10-18 11:43   ` Robin Dapp
       [not found] ` <202310181728104086621@rivai.ai>
  1 sibling, 2 replies; 10+ messages in thread
From: juzhe.zhong @ 2023-10-18  9:28 UTC (permalink / raw)
  To: Robin Dapp, gcc-patches, palmer, kito.cheng, jeffreyalaw; +Cc: Robin Dapp

[-- Attachment #1: Type: text/plain, Size: 125530 bytes --]

Could you try the following code:

int x[8];
int y[8];

void foo ()
{
  x[0] = __builtin_popcount (y[0]);
  x[1] = __builtin_popcount (y[1]);
  x[2] = __builtin_popcount (y[2]);
  x[3] = __builtin_popcount (y[3]);
  x[4] = __builtin_popcount (y[4]);
  x[5] = __builtin_popcount (y[5]);
  x[6] = __builtin_popcount (y[6]);
  x[7] = __builtin_popcount (y[7]);
}

I see you didn't extend VI -> V_VLSI, so I guess SLP will fail on popcount.


juzhe.zhong@rivai.ai
 
From: Robin Dapp
Date: 2023-10-18 17:20
To: gcc-patches; palmer; Kito Cheng; jeffreyalaw; juzhe.zhong@rivai.ai
CC: rdapp.gcc
Subject: [PATCH] RISC-V: Add popcount fallback expander.
Hi,
 
as I didn't manage to get back to the generic vectorizer fallback for
popcount in time (still the generic costing problem) I figured I'd
rather implement the popcount fallback in the riscv backend.
It uses the WWG algorithm from libgcc.
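
For reference, a minimal scalar sketch of the steps that the comments in
expand_popcount describe, written for a single 64-bit element.  The
names popcount64_wwg and popcount64_wwg_selfcheck are hypothetical and
only for illustration; the vector expander emits the equivalent
shift/and/add/mul operations per element instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar helper mirroring the four steps from the
   comments in expand_popcount, for one 64-bit element.  */
static unsigned
popcount64_wwg (uint64_t x)
{
  /* Each 2-bit field now holds the number of set bits it had.  */
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  /* Sum adjacent 2-bit counts into 4-bit fields.  */
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  /* Sum adjacent 4-bit counts into bytes.  */
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  /* The multiplication accumulates all byte counts in the top byte.  */
  return (unsigned) ((x * 0x0101010101010101ULL) >> 56);
}

/* Hypothetical self-check over a few easy inputs.  */
static void
popcount64_wwg_selfcheck (void)
{
  assert (popcount64_wwg (0) == 0);
  assert (popcount64_wwg (1) == 1);
  assert (popcount64_wwg (0xFFULL) == 8);
  assert (popcount64_wwg (~0ULL) == 64);
}
```

The final shift of 56 is specific to 64-bit elements in this sketch;
narrower element widths would need a correspondingly smaller shift.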
 
rvv.exp is unchanged, vect and dg.exp testsuites are currently running.
 
Regards
Robin
 
gcc/ChangeLog:
 
* config/riscv/autovec.md (popcount<mode>2): New expander.
* config/riscv/riscv-protos.h (expand_popcount): Define.
* config/riscv/riscv-v.cc (expand_popcount): Vectorize popcount
with the WWG algorithm.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/unop/popcount-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount.c: New test.
---
gcc/config/riscv/autovec.md                   |   14 +
gcc/config/riscv/riscv-protos.h               |    1 +
gcc/config/riscv/riscv-v.cc                   |   71 +
.../riscv/rvv/autovec/unop/popcount-1.c       |   20 +
.../riscv/rvv/autovec/unop/popcount-run-1.c   |   49 +
.../riscv/rvv/autovec/unop/popcount.c         | 1464 +++++++++++++++++
6 files changed, 1619 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
 
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index c5b1e52cbf9..dfe836f705d 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1484,6 +1484,20 @@ (define_expand "xorsign<mode>3"
   DONE;
})
+;; -------------------------------------------------------------------------------
+;; - [INT] POPCOUNT.
+;; -------------------------------------------------------------------------------
+
+(define_expand "popcount<mode>2"
+  [(match_operand:VI 0 "register_operand")
+   (match_operand:VI 1 "register_operand")]
+  "TARGET_VECTOR"
+{
+  riscv_vector::expand_popcount (operands);
+  DONE;
+})
+
+
;; -------------------------------------------------------------------------
;; ---- [INT] Highpart multiplication
;; -------------------------------------------------------------------------
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 49bdcdf2f93..4aeccdd961b 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -515,6 +515,7 @@ void expand_fold_extract_last (rtx *);
void expand_cond_unop (unsigned, rtx *);
void expand_cond_binop (unsigned, rtx *);
void expand_cond_ternop (unsigned, rtx *);
+void expand_popcount (rtx *);
/* Rounding mode bitfield for fixed point VXRM.  */
enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 21d86c3f917..8b594b7127e 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4152,4 +4152,75 @@ expand_vec_lfloor (rtx op_0, rtx op_1, machine_mode vec_fp_mode,
   emit_vec_cvt_x_f (op_0, op_1, UNARY_OP_FRM_RDN, vec_fp_mode);
}
+/* Vectorize popcount by the Wilkes-Wheeler-Gill algorithm that libgcc uses as
+   well.  */
+void
+expand_popcount (rtx *ops)
+{
+  rtx dst = ops[0];
+  rtx src = ops[1];
+  machine_mode mode = GET_MODE (dst);
+  scalar_mode imode = GET_MODE_INNER (mode);
+  static const uint64_t m5 = 0x5555555555555555ULL;
+  static const uint64_t m3 = 0x3333333333333333ULL;
+  static const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL;
+  static const uint64_t m1 = 0x0101010101010101ULL;
+
+  rtx x1 = gen_reg_rtx (mode);
+  rtx x2 = gen_reg_rtx (mode);
+  rtx x3 = gen_reg_rtx (mode);
+  rtx x4 = gen_reg_rtx (mode);
+
+  /* x1 = src - ((src >> 1) & 0x555...);  */
+  rtx shift1 = expand_binop (mode, lshr_optab, src, GEN_INT (1), NULL, true,
+                             OPTAB_DIRECT);
+
+  rtx and1 = gen_reg_rtx (mode);
+  rtx ops1[] = {and1, shift1, gen_int_mode (m5, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+                   ops1);
+
+  x1 = expand_binop (mode, sub_optab, src, and1, NULL, true, OPTAB_DIRECT);
+
+  /* x2 = (x1 & 0x3333333333333333ULL) + ((x1 >> 2) & 0x3333333333333333ULL);
+   */
+  rtx and2 = gen_reg_rtx (mode);
+  rtx ops2[] = {and2, x1, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+                   ops2);
+
+  rtx shift2 = expand_binop (mode, lshr_optab, x1, GEN_INT (2), NULL, true,
+                             OPTAB_DIRECT);
+
+  rtx and22 = gen_reg_rtx (mode);
+  rtx ops22[] = {and22, shift2, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+                   ops22);
+
+  x2 = expand_binop (mode, add_optab, and2, and22, NULL, true, OPTAB_DIRECT);
+
+  /* x3 = (x2 + (x2 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;  */
+  rtx shift3 = expand_binop (mode, lshr_optab, x2, GEN_INT (4), NULL, true,
+                             OPTAB_DIRECT);
+
+  rtx plus3
+    = expand_binop (mode, add_optab, x2, shift3, NULL, true, OPTAB_DIRECT);
+
+  rtx ops3[] = {x3, plus3, gen_int_mode (mf, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+                   ops3);
+
+  /* dst = (x3 * 0x0101...) >> (GET_MODE_BITSIZE (imode) - 8);  */
+  rtx mul4 = gen_reg_rtx (mode);
+  rtx ops4[] = {mul4, x3, gen_int_mode (m1, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (MULT, mode), riscv_vector::BINARY_OP,
+                   ops4);
+
+  x4 = expand_binop (mode, lshr_optab, mul4,
+                     GEN_INT (GET_MODE_BITSIZE (imode) - 8), NULL, true,
+                     OPTAB_DIRECT);
+
+  emit_move_insn (dst, x4);
+}
+
} // namespace riscv_vector
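
For reference, the per-element arithmetic that expand_popcount emits can be
modelled in plain C. This is only a sketch under assumptions, not part of the
patch: popcount_model and popcount_model_selfcheck are made-up names, and the
truncation of the masks to the element width stands in for what
gen_int_mode (mask, imode) does in the expander. It shows why the final shift
is the element width minus 8 rather than a fixed 56.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the vector sequence for one element of
   BITS bits (8, 16, 32 or 64).  Not taken from the patch.  */
static uint64_t
popcount_model (uint64_t x, unsigned bits)
{
  /* Truncate everything to the element width, as the vector insns do.  */
  const uint64_t width = bits == 64 ? ~0ULL : (1ULL << bits) - 1;
  const uint64_t m5 = 0x5555555555555555ULL & width;
  const uint64_t m3 = 0x3333333333333333ULL & width;
  const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL & width;
  const uint64_t m1 = 0x0101010101010101ULL & width;

  x &= width;
  /* Each 2-bit field now holds the number of set bits it contained.  */
  uint64_t x1 = x - ((x >> 1) & m5);
  /* Each 4-bit field holds a count of at most 4.  */
  uint64_t x2 = (x1 & m3) + ((x1 >> 2) & m3);
  /* Each byte holds a count of at most 8.  */
  uint64_t x3 = (x2 + (x2 >> 4)) & mf;
  /* The multiply sums all byte counts into the most significant byte,
     which the shift by BITS - 8 then moves down.  */
  return ((x3 * m1) & width) >> (bits - 8);
}

/* Compare the model against the compiler builtin for a few inputs.  */
static void
popcount_model_selfcheck (void)
{
  static const uint64_t inputs[] = { 0x0, 0x1, 0xe0e0f0f0, 0x9900aab3,
                                     0x22227777, 0xffffffffffffffffULL };
  static const unsigned widths[] = { 8, 16, 32, 64 };
  for (unsigned w = 0; w < 4; ++w)
    for (unsigned i = 0; i < sizeof (inputs) / sizeof (inputs[0]); ++i)
      {
        uint64_t mask = widths[w] == 64 ? ~0ULL : (1ULL << widths[w]) - 1;
        uint64_t v = inputs[i] & mask;
        assert (popcount_model (v, widths[w])
                == (uint64_t) __builtin_popcountll (v));
      }
}
```

Calling popcount_model_selfcheck () from a small driver checks the model
against __builtin_popcountll for 8-, 16-, 32- and 64-bit elements. For 8-bit
elements the shift amount is 0 and the multiply by 0x01 is a no-op.
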
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
new file mode 100644
index 00000000000..3169ebbff71
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv_zvfh -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-vect-details" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noipa))
+popcount_32 (uint32_t *restrict dst, uint32_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcount (src[i]);
+}
+
+void __attribute__ ((noipa))
+popcount_64 (uint64_t *restrict dst, uint64_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcountll (src[i]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 2 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
new file mode 100644
index 00000000000..38f1633da99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
@@ -0,0 +1,49 @@
+/* { dg-do run { target { riscv_v } } } */
+
+#include "popcount-1.c"
+
+extern void abort (void) __attribute__ ((noreturn));
+
+unsigned int data[] = {
+  0x11111100, 6,
+  0xe0e0f0f0, 14,
+  0x9900aab3, 13,
+  0x00040003, 3,
+  0x000e000c, 5,
+  0x22227777, 16,
+  0x12341234, 10,
+  0x0, 0
+};
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  unsigned int count = sizeof (data) / sizeof (data[0]) / 2;
+
+  uint32_t in32[count];
+  uint32_t out32[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in32[i] = data[i * 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_32 (out32, in32, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out32[i] != data[i * 2 + 1])
+      abort ();
+
+  count /= 2;
+  uint64_t in64[count];
+  uint64_t out64[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in64[i] = ((uint64_t) data[i * 4] << 32) | data[i * 4 + 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_64 (out64, in64, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out64[i] != data[i * 4 + 1] + data[i * 4 + 3])
+      abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
new file mode 100644
index 00000000000..585a522aa81
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
@@ -0,0 +1,1464 @@
+/* { dg-do run { target { riscv_v } } } */
+/* { dg-additional-options { -O2 -fdump-tree-vect-details -fno-vect-cost-model } }  */
+
+#include "stdint-gcc.h"
+#include <assert.h>
+
+#define DEF64(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+                                 int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcountll (src[i]);                                  \
+  }
+
+#define DEF32(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+                                 int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcount (src[i]);                                    \
+  }
+
+#define DEFCTZ64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+                            int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctzll (src[i]);                                       \
+  }
+
+#define DEFCTZ32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+                            int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctz (src[i]);                                         \
+  }
+
+#define DEFFFS64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+                            int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffsll (src[i]);                                       \
+  }
+
+#define DEFFFS32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+                            int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffs (src[i]);                                         \
+  }
+
+#define DEF_ALL()                                                              \
+  DEF64 (uint64_t, uint64_t)                                                   \
+  DEF64 (uint64_t, uint32_t)                                                   \
+  DEF64 (uint64_t, uint16_t)                                                   \
+  DEF64 (uint64_t, uint8_t)                                                    \
+  DEF64 (uint64_t, int64_t)                                                    \
+  DEF64 (uint64_t, int32_t)                                                    \
+  DEF64 (uint64_t, int16_t)                                                    \
+  DEF64 (uint64_t, int8_t)                                                     \
+  DEF64 (int64_t, uint64_t)                                                    \
+  DEF64 (int64_t, uint32_t)                                                    \
+  DEF64 (int64_t, uint16_t)                                                    \
+  DEF64 (int64_t, uint8_t)                                                     \
+  DEF64 (int64_t, int64_t)                                                     \
+  DEF64 (int64_t, int32_t)                                                     \
+  DEF64 (int64_t, int16_t)                                                     \
+  DEF64 (int64_t, int8_t)                                                      \
+  DEF64 (uint32_t, uint64_t)                                                   \
+  DEF64 (uint32_t, uint32_t)                                                   \
+  DEF64 (uint32_t, uint16_t)                                                   \
+  DEF64 (uint32_t, uint8_t)                                                    \
+  DEF64 (uint32_t, int64_t)                                                    \
+  DEF64 (uint32_t, int32_t)                                                    \
+  DEF64 (uint32_t, int16_t)                                                    \
+  DEF64 (uint32_t, int8_t)                                                     \
+  DEF64 (int32_t, uint64_t)                                                    \
+  DEF64 (int32_t, uint32_t)                                                    \
+  DEF64 (int32_t, uint16_t)                                                    \
+  DEF64 (int32_t, uint8_t)                                                     \
+  DEF64 (int32_t, int64_t)                                                     \
+  DEF64 (int32_t, int32_t)                                                     \
+  DEF64 (int32_t, int16_t)                                                     \
+  DEF64 (int32_t, int8_t)                                                      \
+  DEF64 (uint16_t, uint64_t)                                                   \
+  DEF64 (uint16_t, uint32_t)                                                   \
+  DEF64 (uint16_t, uint16_t)                                                   \
+  DEF64 (uint16_t, uint8_t)                                                    \
+  DEF64 (uint16_t, int64_t)                                                    \
+  DEF64 (uint16_t, int32_t)                                                    \
+  DEF64 (uint16_t, int16_t)                                                    \
+  DEF64 (uint16_t, int8_t)                                                     \
+  DEF64 (int16_t, uint64_t)                                                    \
+  DEF64 (int16_t, uint32_t)                                                    \
+  DEF64 (int16_t, uint16_t)                                                    \
+  DEF64 (int16_t, uint8_t)                                                     \
+  DEF64 (int16_t, int64_t)                                                     \
+  DEF64 (int16_t, int32_t)                                                     \
+  DEF64 (int16_t, int16_t)                                                     \
+  DEF64 (int16_t, int8_t)                                                      \
+  DEF64 (uint8_t, uint64_t)                                                    \
+  DEF64 (uint8_t, uint32_t)                                                    \
+  DEF64 (uint8_t, uint16_t)                                                    \
+  DEF64 (uint8_t, uint8_t)                                                     \
+  DEF64 (uint8_t, int64_t)                                                     \
+  DEF64 (uint8_t, int32_t)                                                     \
+  DEF64 (uint8_t, int16_t)                                                     \
+  DEF64 (uint8_t, int8_t)                                                      \
+  DEF64 (int8_t, uint64_t)                                                     \
+  DEF64 (int8_t, uint32_t)                                                     \
+  DEF64 (int8_t, uint16_t)                                                     \
+  DEF64 (int8_t, uint8_t)                                                      \
+  DEF64 (int8_t, int64_t)                                                      \
+  DEF64 (int8_t, int32_t)                                                      \
+  DEF64 (int8_t, int16_t)                                                      \
+  DEF64 (int8_t, int8_t)                                                       \
+  DEF32 (uint64_t, uint64_t)                                                   \
+  DEF32 (uint64_t, uint32_t)                                                   \
+  DEF32 (uint64_t, uint16_t)                                                   \
+  DEF32 (uint64_t, uint8_t)                                                    \
+  DEF32 (uint64_t, int64_t)                                                    \
+  DEF32 (uint64_t, int32_t)                                                    \
+  DEF32 (uint64_t, int16_t)                                                    \
+  DEF32 (uint64_t, int8_t)                                                     \
+  DEF32 (int64_t, uint64_t)                                                    \
+  DEF32 (int64_t, uint32_t)                                                    \
+  DEF32 (int64_t, uint16_t)                                                    \
+  DEF32 (int64_t, uint8_t)                                                     \
+  DEF32 (int64_t, int64_t)                                                     \
+  DEF32 (int64_t, int32_t)                                                     \
+  DEF32 (int64_t, int16_t)                                                     \
+  DEF32 (int64_t, int8_t)                                                      \
+  DEF32 (uint32_t, uint64_t)                                                   \
+  DEF32 (uint32_t, uint32_t)                                                   \
+  DEF32 (uint32_t, uint16_t)                                                   \
+  DEF32 (uint32_t, uint8_t)                                                    \
+  DEF32 (uint32_t, int64_t)                                                    \
+  DEF32 (uint32_t, int32_t)                                                    \
+  DEF32 (uint32_t, int16_t)                                                    \
+  DEF32 (uint32_t, int8_t)                                                     \
+  DEF32 (int32_t, uint64_t)                                                    \
+  DEF32 (int32_t, uint32_t)                                                    \
+  DEF32 (int32_t, uint16_t)                                                    \
+  DEF32 (int32_t, uint8_t)                                                     \
+  DEF32 (int32_t, int64_t)                                                     \
+  DEF32 (int32_t, int32_t)                                                     \
+  DEF32 (int32_t, int16_t)                                                     \
+  DEF32 (int32_t, int8_t)                                                      \
+  DEF32 (uint16_t, uint64_t)                                                   \
+  DEF32 (uint16_t, uint32_t)                                                   \
+  DEF32 (uint16_t, uint16_t)                                                   \
+  DEF32 (uint16_t, uint8_t)                                                    \
+  DEF32 (uint16_t, int64_t)                                                    \
+  DEF32 (uint16_t, int32_t)                                                    \
+  DEF32 (uint16_t, int16_t)                                                    \
+  DEF32 (uint16_t, int8_t)                                                     \
+  DEF32 (int16_t, uint64_t)                                                    \
+  DEF32 (int16_t, uint32_t)                                                    \
+  DEF32 (int16_t, uint16_t)                                                    \
+  DEF32 (int16_t, uint8_t)                                                     \
+  DEF32 (int16_t, int64_t)                                                     \
+  DEF32 (int16_t, int32_t)                                                     \
+  DEF32 (int16_t, int16_t)                                                     \
+  DEF32 (int16_t, int8_t)                                                      \
+  DEF32 (uint8_t, uint64_t)                                                    \
+  DEF32 (uint8_t, uint32_t)                                                    \
+  DEF32 (uint8_t, uint16_t)                                                    \
+  DEF32 (uint8_t, uint8_t)                                                     \
+  DEF32 (uint8_t, int64_t)                                                     \
+  DEF32 (uint8_t, int32_t)                                                     \
+  DEF32 (uint8_t, int16_t)                                                     \
+  DEF32 (uint8_t, int8_t)                                                      \
+  DEF32 (int8_t, uint64_t)                                                     \
+  DEF32 (int8_t, uint32_t)                                                     \
+  DEF32 (int8_t, uint16_t)                                                     \
+  DEF32 (int8_t, uint8_t)                                                      \
+  DEF32 (int8_t, int64_t)                                                      \
+  DEF32 (int8_t, int32_t)                                                      \
+  DEF32 (int8_t, int16_t)                                                      \
+  DEF32 (int8_t, int8_t)                                                       \
+  DEFCTZ64 (uint64_t, uint64_t)                                                \
+  DEFCTZ64 (uint64_t, uint32_t)                                                \
+  DEFCTZ64 (uint64_t, uint16_t)                                                \
+  DEFCTZ64 (uint64_t, uint8_t)                                                 \
+  DEFCTZ64 (uint64_t, int64_t)                                                 \
+  DEFCTZ64 (uint64_t, int32_t)                                                 \
+  DEFCTZ64 (uint64_t, int16_t)                                                 \
+  DEFCTZ64 (uint64_t, int8_t)                                                  \
+  DEFCTZ64 (int64_t, uint64_t)                                                 \
+  DEFCTZ64 (int64_t, uint32_t)                                                 \
+  DEFCTZ64 (int64_t, uint16_t)                                                 \
+  DEFCTZ64 (int64_t, uint8_t)                                                  \
+  DEFCTZ64 (int64_t, int64_t)                                                  \
+  DEFCTZ64 (int64_t, int32_t)                                                  \
+  DEFCTZ64 (int64_t, int16_t)                                                  \
+  DEFCTZ64 (int64_t, int8_t)                                                   \
+  DEFCTZ64 (uint32_t, uint64_t)                                                \
+  DEFCTZ64 (uint32_t, uint32_t)                                                \
+  DEFCTZ64 (uint32_t, uint16_t)                                                \
+  DEFCTZ64 (uint32_t, uint8_t)                                                 \
+  DEFCTZ64 (uint32_t, int64_t)                                                 \
+  DEFCTZ64 (uint32_t, int32_t)                                                 \
+  DEFCTZ64 (uint32_t, int16_t)                                                 \
+  DEFCTZ64 (uint32_t, int8_t)                                                  \
+  DEFCTZ64 (int32_t, uint64_t)                                                 \
+  DEFCTZ64 (int32_t, uint32_t)                                                 \
+  DEFCTZ64 (int32_t, uint16_t)                                                 \
+  DEFCTZ64 (int32_t, uint8_t)                                                  \
+  DEFCTZ64 (int32_t, int64_t)                                                  \
+  DEFCTZ64 (int32_t, int32_t)                                                  \
+  DEFCTZ64 (int32_t, int16_t)                                                  \
+  DEFCTZ64 (int32_t, int8_t)                                                   \
+  DEFCTZ64 (uint16_t, uint64_t)                                                \
+  DEFCTZ64 (uint16_t, uint32_t)                                                \
+  DEFCTZ64 (uint16_t, uint16_t)                                                \
+  DEFCTZ64 (uint16_t, uint8_t)                                                 \
+  DEFCTZ64 (uint16_t, int64_t)                                                 \
+  DEFCTZ64 (uint16_t, int32_t)                                                 \
+  DEFCTZ64 (uint16_t, int16_t)                                                 \
+  DEFCTZ64 (uint16_t, int8_t)                                                  \
+  DEFCTZ64 (int16_t, uint64_t)                                                 \
+  DEFCTZ64 (int16_t, uint32_t)                                                 \
+  DEFCTZ64 (int16_t, uint16_t)                                                 \
+  DEFCTZ64 (int16_t, uint8_t)                                                  \
+  DEFCTZ64 (int16_t, int64_t)                                                  \
+  DEFCTZ64 (int16_t, int32_t)                                                  \
+  DEFCTZ64 (int16_t, int16_t)                                                  \
+  DEFCTZ64 (int16_t, int8_t)                                                   \
+  DEFCTZ64 (uint8_t, uint64_t)                                                 \
+  DEFCTZ64 (uint8_t, uint32_t)                                                 \
+  DEFCTZ64 (uint8_t, uint16_t)                                                 \
+  DEFCTZ64 (uint8_t, uint8_t)                                                  \
+  DEFCTZ64 (uint8_t, int64_t)                                                  \
+  DEFCTZ64 (uint8_t, int32_t)                                                  \
+  DEFCTZ64 (uint8_t, int16_t)                                                  \
+  DEFCTZ64 (uint8_t, int8_t)                                                   \
+  DEFCTZ64 (int8_t, uint64_t)                                                  \
+  DEFCTZ64 (int8_t, uint32_t)                                                  \
+  DEFCTZ64 (int8_t, uint16_t)                                                  \
+  DEFCTZ64 (int8_t, uint8_t)                                                   \
+  DEFCTZ64 (int8_t, int64_t)                                                   \
+  DEFCTZ64 (int8_t, int32_t)                                                   \
+  DEFCTZ64 (int8_t, int16_t)                                                   \
+  DEFCTZ64 (int8_t, int8_t)                                                    \
+  DEFCTZ32 (uint64_t, uint64_t)                                                \
+  DEFCTZ32 (uint64_t, uint32_t)                                                \
+  DEFCTZ32 (uint64_t, uint16_t)                                                \
+  DEFCTZ32 (uint64_t, uint8_t)                                                 \
+  DEFCTZ32 (uint64_t, int64_t)                                                 \
+  DEFCTZ32 (uint64_t, int32_t)                                                 \
+  DEFCTZ32 (uint64_t, int16_t)                                                 \
+  DEFCTZ32 (uint64_t, int8_t)                                                  \
+  DEFCTZ32 (int64_t, uint64_t)                                                 \
+  DEFCTZ32 (int64_t, uint32_t)                                                 \
+  DEFCTZ32 (int64_t, uint16_t)                                                 \
+  DEFCTZ32 (int64_t, uint8_t)                                                  \
+  DEFCTZ32 (int64_t, int64_t)                                                  \
+  DEFCTZ32 (int64_t, int32_t)                                                  \
+  DEFCTZ32 (int64_t, int16_t)                                                  \
+  DEFCTZ32 (int64_t, int8_t)                                                   \
+  DEFCTZ32 (uint32_t, uint64_t)                                                \
+  DEFCTZ32 (uint32_t, uint32_t)                                                \
+  DEFCTZ32 (uint32_t, uint16_t)                                                \
+  DEFCTZ32 (uint32_t, uint8_t)                                                 \
+  DEFCTZ32 (uint32_t, int64_t)                                                 \
+  DEFCTZ32 (uint32_t, int32_t)                                                 \
+  DEFCTZ32 (uint32_t, int16_t)                                                 \
+  DEFCTZ32 (uint32_t, int8_t)                                                  \
+  DEFCTZ32 (int32_t, uint64_t)                                                 \
+  DEFCTZ32 (int32_t, uint32_t)                                                 \
+  DEFCTZ32 (int32_t, uint16_t)                                                 \
+  DEFCTZ32 (int32_t, uint8_t)                                                  \
+  DEFCTZ32 (int32_t, int64_t)                                                  \
+  DEFCTZ32 (int32_t, int32_t)                                                  \
+  DEFCTZ32 (int32_t, int16_t)                                                  \
+  DEFCTZ32 (int32_t, int8_t)                                                   \
+  DEFCTZ32 (uint16_t, uint64_t)                                                \
+  DEFCTZ32 (uint16_t, uint32_t)                                                \
+  DEFCTZ32 (uint16_t, uint16_t)                                                \
+  DEFCTZ32 (uint16_t, uint8_t)                                                 \
+  DEFCTZ32 (uint16_t, int64_t)                                                 \
+  DEFCTZ32 (uint16_t, int32_t)                                                 \
+  DEFCTZ32 (uint16_t, int16_t)                                                 \
+  DEFCTZ32 (uint16_t, int8_t)                                                  \
+  DEFCTZ32 (int16_t, uint64_t)                                                 \
+  DEFCTZ32 (int16_t, uint32_t)                                                 \
+  DEFCTZ32 (int16_t, uint16_t)                                                 \
+  DEFCTZ32 (int16_t, uint8_t)                                                  \
+  DEFCTZ32 (int16_t, int64_t)                                                  \
+  DEFCTZ32 (int16_t, int32_t)                                                  \
+  DEFCTZ32 (int16_t, int16_t)                                                  \
+  DEFCTZ32 (int16_t, int8_t)                                                   \
+  DEFCTZ32 (uint8_t, uint64_t)                                                 \
+  DEFCTZ32 (uint8_t, uint32_t)                                                 \
+  DEFCTZ32 (uint8_t, uint16_t)                                                 \
+  DEFCTZ32 (uint8_t, uint8_t)                                                  \
+  DEFCTZ32 (uint8_t, int64_t)                                                  \
+  DEFCTZ32 (uint8_t, int32_t)                                                  \
+  DEFCTZ32 (uint8_t, int16_t)                                                  \
+  DEFCTZ32 (uint8_t, int8_t)                                                   \
+  DEFCTZ32 (int8_t, uint64_t)                                                  \
+  DEFCTZ32 (int8_t, uint32_t)                                                  \
+  DEFCTZ32 (int8_t, uint16_t)                                                  \
+  DEFCTZ32 (int8_t, uint8_t)                                                   \
+  DEFCTZ32 (int8_t, int64_t)                                                   \
+  DEFCTZ32 (int8_t, int32_t)                                                   \
+  DEFCTZ32 (int8_t, int16_t)                                                   \
+  DEFCTZ32 (int8_t, int8_t)                                                    \
+  DEFFFS64 (uint64_t, uint64_t)                                                \
+  DEFFFS64 (uint64_t, uint32_t)                                                \
+  DEFFFS64 (uint64_t, uint16_t)                                                \
+  DEFFFS64 (uint64_t, uint8_t)                                                 \
+  DEFFFS64 (uint64_t, int64_t)                                                 \
+  DEFFFS64 (uint64_t, int32_t)                                                 \
+  DEFFFS64 (uint64_t, int16_t)                                                 \
+  DEFFFS64 (uint64_t, int8_t)                                                  \
+  DEFFFS64 (int64_t, uint64_t)                                                 \
+  DEFFFS64 (int64_t, uint32_t)                                                 \
+  DEFFFS64 (int64_t, uint16_t)                                                 \
+  DEFFFS64 (int64_t, uint8_t)                                                  \
+  DEFFFS64 (int64_t, int64_t)                                                  \
+  DEFFFS64 (int64_t, int32_t)                                                  \
+  DEFFFS64 (int64_t, int16_t)                                                  \
+  DEFFFS64 (int64_t, int8_t)                                                   \
+  DEFFFS64 (uint32_t, uint64_t)                                                \
+  DEFFFS64 (uint32_t, uint32_t)                                                \
+  DEFFFS64 (uint32_t, uint16_t)                                                \
+  DEFFFS64 (uint32_t, uint8_t)                                                 \
+  DEFFFS64 (uint32_t, int64_t)                                                 \
+  DEFFFS64 (uint32_t, int32_t)                                                 \
+  DEFFFS64 (uint32_t, int16_t)                                                 \
+  DEFFFS64 (uint32_t, int8_t)                                                  \
+  DEFFFS64 (int32_t, uint64_t)                                                 \
+  DEFFFS64 (int32_t, uint32_t)                                                 \
+  DEFFFS64 (int32_t, uint16_t)                                                 \
+  DEFFFS64 (int32_t, uint8_t)                                                  \
+  DEFFFS64 (int32_t, int64_t)                                                  \
+  DEFFFS64 (int32_t, int32_t)                                                  \
+  DEFFFS64 (int32_t, int16_t)                                                  \
+  DEFFFS64 (int32_t, int8_t)                                                   \
+  DEFFFS64 (uint16_t, uint64_t)                                                \
+  DEFFFS64 (uint16_t, uint32_t)                                                \
+  DEFFFS64 (uint16_t, uint16_t)                                                \
+  DEFFFS64 (uint16_t, uint8_t)                                                 \
+  DEFFFS64 (uint16_t, int64_t)                                                 \
+  DEFFFS64 (uint16_t, int32_t)                                                 \
+  DEFFFS64 (uint16_t, int16_t)                                                 \
+  DEFFFS64 (uint16_t, int8_t)                                                  \
+  DEFFFS64 (int16_t, uint64_t)                                                 \
+  DEFFFS64 (int16_t, uint32_t)                                                 \
+  DEFFFS64 (int16_t, uint16_t)                                                 \
+  DEFFFS64 (int16_t, uint8_t)                                                  \
+  DEFFFS64 (int16_t, int64_t)                                                  \
+  DEFFFS64 (int16_t, int32_t)                                                  \
+  DEFFFS64 (int16_t, int16_t)                                                  \
+  DEFFFS64 (int16_t, int8_t)                                                   \
+  DEFFFS64 (uint8_t, uint64_t)                                                 \
+  DEFFFS64 (uint8_t, uint32_t)                                                 \
+  DEFFFS64 (uint8_t, uint16_t)                                                 \
+  DEFFFS64 (uint8_t, uint8_t)                                                  \
+  DEFFFS64 (uint8_t, int64_t)                                                  \
+  DEFFFS64 (uint8_t, int32_t)                                                  \
+  DEFFFS64 (uint8_t, int16_t)                                                  \
+  DEFFFS64 (uint8_t, int8_t)                                                   \
+  DEFFFS64 (int8_t, uint64_t)                                                  \
+  DEFFFS64 (int8_t, uint32_t)                                                  \
+  DEFFFS64 (int8_t, uint16_t)                                                  \
+  DEFFFS64 (int8_t, uint8_t)                                                   \
+  DEFFFS64 (int8_t, int64_t)                                                   \
+  DEFFFS64 (int8_t, int32_t)                                                   \
+  DEFFFS64 (int8_t, int16_t)                                                   \
+  DEFFFS64 (int8_t, int8_t)                                                    \
+  DEFFFS32 (uint64_t, uint64_t)                                                \
+  DEFFFS32 (uint64_t, uint32_t)                                                \
+  DEFFFS32 (uint64_t, uint16_t)                                                \
+  DEFFFS32 (uint64_t, uint8_t)                                                 \
+  DEFFFS32 (uint64_t, int64_t)                                                 \
+  DEFFFS32 (uint64_t, int32_t)                                                 \
+  DEFFFS32 (uint64_t, int16_t)                                                 \
+  DEFFFS32 (uint64_t, int8_t)                                                  \
+  DEFFFS32 (int64_t, uint64_t)                                                 \
+  DEFFFS32 (int64_t, uint32_t)                                                 \
+  DEFFFS32 (int64_t, uint16_t)                                                 \
+  DEFFFS32 (int64_t, uint8_t)                                                  \
+  DEFFFS32 (int64_t, int64_t)                                                  \
+  DEFFFS32 (int64_t, int32_t)                                                  \
+  DEFFFS32 (int64_t, int16_t)                                                  \
+  DEFFFS32 (int64_t, int8_t)                                                   \
+  DEFFFS32 (uint32_t, uint64_t)                                                \
+  DEFFFS32 (uint32_t, uint32_t)                                                \
+  DEFFFS32 (uint32_t, uint16_t)                                                \
+  DEFFFS32 (uint32_t, uint8_t)                                                 \
+  DEFFFS32 (uint32_t, int64_t)                                                 \
+  DEFFFS32 (uint32_t, int32_t)                                                 \
+  DEFFFS32 (uint32_t, int16_t)                                                 \
+  DEFFFS32 (uint32_t, int8_t)                                                  \
+  DEFFFS32 (int32_t, uint64_t)                                                 \
+  DEFFFS32 (int32_t, uint32_t)                                                 \
+  DEFFFS32 (int32_t, uint16_t)                                                 \
+  DEFFFS32 (int32_t, uint8_t)                                                  \
+  DEFFFS32 (int32_t, int64_t)                                                  \
+  DEFFFS32 (int32_t, int32_t)                                                  \
+  DEFFFS32 (int32_t, int16_t)                                                  \
+  DEFFFS32 (int32_t, int8_t)                                                   \
+  DEFFFS32 (uint16_t, uint64_t)                                                \
+  DEFFFS32 (uint16_t, uint32_t)                                                \
+  DEFFFS32 (uint16_t, uint16_t)                                                \
+  DEFFFS32 (uint16_t, uint8_t)                                                 \
+  DEFFFS32 (uint16_t, int64_t)                                                 \
+  DEFFFS32 (uint16_t, int32_t)                                                 \
+  DEFFFS32 (uint16_t, int16_t)                                                 \
+  DEFFFS32 (uint16_t, int8_t)                                                  \
+  DEFFFS32 (int16_t, uint64_t)                                                 \
+  DEFFFS32 (int16_t, uint32_t)                                                 \
+  DEFFFS32 (int16_t, uint16_t)                                                 \
+  DEFFFS32 (int16_t, uint8_t)                                                  \
+  DEFFFS32 (int16_t, int64_t)                                                  \
+  DEFFFS32 (int16_t, int32_t)                                                  \
+  DEFFFS32 (int16_t, int16_t)                                                  \
+  DEFFFS32 (int16_t, int8_t)                                                   \
+  DEFFFS32 (uint8_t, uint64_t)                                                 \
+  DEFFFS32 (uint8_t, uint32_t)                                                 \
+  DEFFFS32 (uint8_t, uint16_t)                                                 \
+  DEFFFS32 (uint8_t, uint8_t)                                                  \
+  DEFFFS32 (uint8_t, int64_t)                                                  \
+  DEFFFS32 (uint8_t, int32_t)                                                  \
+  DEFFFS32 (uint8_t, int16_t)                                                  \
+  DEFFFS32 (uint8_t, int8_t)                                                   \
+  DEFFFS32 (int8_t, uint64_t)                                                  \
+  DEFFFS32 (int8_t, uint32_t)                                                  \
+  DEFFFS32 (int8_t, uint16_t)                                                  \
+  DEFFFS32 (int8_t, uint8_t)                                                   \
+  DEFFFS32 (int8_t, int64_t)                                                   \
+  DEFFFS32 (int8_t, int32_t)                                                   \
+  DEFFFS32 (int8_t, int16_t)                                                   \
+  DEFFFS32 (int8_t, int8_t)
+
+DEF_ALL ()
+
+#define SZ 512
+
+#define TEST64(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test64_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST64N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test64n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST32(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test32_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TEST32N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test32n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TESTCTZ64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTCTZ32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TESTFFS32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TEST_ALL()                                                             \
+  TEST64 (uint64_t, uint64_t)                                                  \
+  TEST64 (uint64_t, uint32_t)                                                  \
+  TEST64 (uint64_t, uint16_t)                                                  \
+  TEST64 (uint64_t, uint8_t)                                                   \
+  TEST64 (uint64_t, int64_t)                                                   \
+  TEST64 (uint64_t, int32_t)                                                   \
+  TEST64 (uint64_t, int16_t)                                                   \
+  TEST64 (uint64_t, int8_t)                                                    \
+  TEST64N (int64_t, uint64_t)                                                  \
+  TEST64N (int64_t, uint32_t)                                                  \
+  TEST64N (int64_t, uint16_t)                                                  \
+  TEST64N (int64_t, uint8_t)                                                   \
+  TEST64N (int64_t, int64_t)                                                   \
+  TEST64N (int64_t, int32_t)                                                   \
+  TEST64N (int64_t, int16_t)                                                   \
+  TEST64N (int64_t, int8_t)                                                    \
+  TEST64 (uint32_t, uint64_t)                                                  \
+  TEST64 (uint32_t, uint32_t)                                                  \
+  TEST64 (uint32_t, uint16_t)                                                  \
+  TEST64 (uint32_t, uint8_t)                                                   \
+  TEST64 (uint32_t, int64_t)                                                   \
+  TEST64 (uint32_t, int32_t)                                                   \
+  TEST64 (uint32_t, int16_t)                                                   \
+  TEST64 (uint32_t, int8_t)                                                    \
+  TEST64N (int32_t, uint64_t)                                                  \
+  TEST64N (int32_t, uint32_t)                                                  \
+  TEST64N (int32_t, uint16_t)                                                  \
+  TEST64N (int32_t, uint8_t)                                                   \
+  TEST64N (int32_t, int64_t)                                                   \
+  TEST64N (int32_t, int32_t)                                                   \
+  TEST64N (int32_t, int16_t)                                                   \
+  TEST64N (int32_t, int8_t)                                                    \
+  TEST64 (uint16_t, uint64_t)                                                  \
+  TEST64 (uint16_t, uint32_t)                                                  \
+  TEST64 (uint16_t, uint16_t)                                                  \
+  TEST64 (uint16_t, uint8_t)                                                   \
+  TEST64 (uint16_t, int64_t)                                                   \
+  TEST64 (uint16_t, int32_t)                                                   \
+  TEST64 (uint16_t, int16_t)                                                   \
+  TEST64 (uint16_t, int8_t)                                                    \
+  TEST64N (int16_t, uint64_t)                                                  \
+  TEST64N (int16_t, uint32_t)                                                  \
+  TEST64N (int16_t, uint16_t)                                                  \
+  TEST64N (int16_t, uint8_t)                                                   \
+  TEST64N (int16_t, int64_t)                                                   \
+  TEST64N (int16_t, int32_t)                                                   \
+  TEST64N (int16_t, int16_t)                                                   \
+  TEST64N (int16_t, int8_t)                                                    \
+  TEST64 (uint8_t, uint64_t)                                                   \
+  TEST64 (uint8_t, uint32_t)                                                   \
+  TEST64 (uint8_t, uint16_t)                                                   \
+  TEST64 (uint8_t, uint8_t)                                                    \
+  TEST64 (uint8_t, int64_t)                                                    \
+  TEST64 (uint8_t, int32_t)                                                    \
+  TEST64 (uint8_t, int16_t)                                                    \
+  TEST64 (uint8_t, int8_t)                                                     \
+  TEST64N (int8_t, uint64_t)                                                   \
+  TEST64N (int8_t, uint32_t)                                                   \
+  TEST64N (int8_t, uint16_t)                                                   \
+  TEST64N (int8_t, uint8_t)                                                    \
+  TEST64N (int8_t, int64_t)                                                    \
+  TEST64N (int8_t, int32_t)                                                    \
+  TEST64N (int8_t, int16_t)                                                    \
+  TEST64N (int8_t, int8_t)                                                     \
+  TEST32 (uint64_t, uint64_t)                                                  \
+  TEST32 (uint64_t, uint32_t)                                                  \
+  TEST32 (uint64_t, uint16_t)                                                  \
+  TEST32 (uint64_t, uint8_t)                                                   \
+  TEST32 (uint64_t, int64_t)                                                   \
+  TEST32 (uint64_t, int32_t)                                                   \
+  TEST32 (uint64_t, int16_t)                                                   \
+  TEST32 (uint64_t, int8_t)                                                    \
+  TEST32N (int64_t, uint64_t)                                                  \
+  TEST32N (int64_t, uint32_t)                                                  \
+  TEST32N (int64_t, uint16_t)                                                  \
+  TEST32N (int64_t, uint8_t)                                                   \
+  TEST32N (int64_t, int64_t)                                                   \
+  TEST32N (int64_t, int32_t)                                                   \
+  TEST32N (int64_t, int16_t)                                                   \
+  TEST32N (int64_t, int8_t)                                                    \
+  TEST32 (uint32_t, uint64_t)                                                  \
+  TEST32 (uint32_t, uint32_t)                                                  \
+  TEST32 (uint32_t, uint16_t)                                                  \
+  TEST32 (uint32_t, uint8_t)                                                   \
+  TEST32 (uint32_t, int64_t)                                                   \
+  TEST32 (uint32_t, int32_t)                                                   \
+  TEST32 (uint32_t, int16_t)                                                   \
+  TEST32 (uint32_t, int8_t)                                                    \
+  TEST32N (int32_t, uint64_t)                                                  \
+  TEST32N (int32_t, uint32_t)                                                  \
+  TEST32N (int32_t, uint16_t)                                                  \
+  TEST32N (int32_t, uint8_t)                                                   \
+  TEST32N (int32_t, int64_t)                                                   \
+  TEST32N (int32_t, int32_t)                                                   \
+  TEST32N (int32_t, int16_t)                                                   \
+  TEST32N (int32_t, int8_t)                                                    \
+  TEST32 (uint16_t, uint64_t)                                                  \
+  TEST32 (uint16_t, uint32_t)                                                  \
+  TEST32 (uint16_t, uint16_t)                                                  \
+  TEST32 (uint16_t, uint8_t)                                                   \
+  TEST32 (uint16_t, int64_t)                                                   \
+  TEST32 (uint16_t, int32_t)                                                   \
+  TEST32 (uint16_t, int16_t)                                                   \
+  TEST32 (uint16_t, int8_t)                                                    \
+  TEST32N (int16_t, uint64_t)                                                  \
+  TEST32N (int16_t, uint32_t)                                                  \
+  TEST32N (int16_t, uint16_t)                                                  \
+  TEST32N (int16_t, uint8_t)                                                   \
+  TEST32N (int16_t, int64_t)                                                   \
+  TEST32N (int16_t, int32_t)                                                   \
+  TEST32N (int16_t, int16_t)                                                   \
+  TEST32N (int16_t, int8_t)                                                    \
+  TEST32 (uint8_t, uint64_t)                                                   \
+  TEST32 (uint8_t, uint32_t)                                                   \
+  TEST32 (uint8_t, uint16_t)                                                   \
+  TEST32 (uint8_t, uint8_t)                                                    \
+  TEST32 (uint8_t, int64_t)                                                    \
+  TEST32 (uint8_t, int32_t)                                                    \
+  TEST32 (uint8_t, int16_t)                                                    \
+  TEST32 (uint8_t, int8_t)                                                     \
+  TEST32N (int8_t, uint64_t)                                                   \
+  TEST32N (int8_t, uint32_t)                                                   \
+  TEST32N (int8_t, uint16_t)                                                   \
+  TEST32N (int8_t, uint8_t)                                                    \
+  TEST32N (int8_t, int64_t)                                                    \
+  TEST32N (int8_t, int32_t)                                                    \
+  TEST32N (int8_t, int16_t)                                                    \
+  TEST32N (int8_t, int8_t)                                                     \
+  TESTCTZ64 (uint64_t, uint64_t)                                               \
+  TESTCTZ64 (uint64_t, uint32_t)                                               \
+  TESTCTZ64 (uint64_t, uint16_t)                                               \
+  TESTCTZ64 (uint64_t, uint8_t)                                                \
+  TESTCTZ64 (uint64_t, int64_t)                                                \
+  TESTCTZ64 (uint64_t, int32_t)                                                \
+  TESTCTZ64 (uint64_t, int16_t)                                                \
+  TESTCTZ64 (uint64_t, int8_t)                                                 \
+  TESTCTZ64N (int64_t, uint64_t)                                               \
+  TESTCTZ64N (int64_t, uint32_t)                                               \
+  TESTCTZ64N (int64_t, uint16_t)                                               \
+  TESTCTZ64N (int64_t, uint8_t)                                                \
+  TESTCTZ64N (int64_t, int64_t)                                                \
+  TESTCTZ64N (int64_t, int32_t)                                                \
+  TESTCTZ64N (int64_t, int16_t)                                                \
+  TESTCTZ64N (int64_t, int8_t)                                                 \
+  TESTCTZ64 (uint32_t, uint64_t)                                               \
+  TESTCTZ64 (uint32_t, uint32_t)                                               \
+  TESTCTZ64 (uint32_t, uint16_t)                                               \
+  TESTCTZ64 (uint32_t, uint8_t)                                                \
+  TESTCTZ64 (uint32_t, int64_t)                                                \
+  TESTCTZ64 (uint32_t, int32_t)                                                \
+  TESTCTZ64 (uint32_t, int16_t)                                                \
+  TESTCTZ64 (uint32_t, int8_t)                                                 \
+  TESTCTZ64N (int32_t, uint64_t)                                               \
+  TESTCTZ64N (int32_t, uint32_t)                                               \
+  TESTCTZ64N (int32_t, uint16_t)                                               \
+  TESTCTZ64N (int32_t, uint8_t)                                                \
+  TESTCTZ64N (int32_t, int64_t)                                                \
+  TESTCTZ64N (int32_t, int32_t)                                                \
+  TESTCTZ64N (int32_t, int16_t)                                                \
+  TESTCTZ64N (int32_t, int8_t)                                                 \
+  TESTCTZ64 (uint16_t, uint64_t)                                               \
+  TESTCTZ64 (uint16_t, uint32_t)                                               \
+  TESTCTZ64 (uint16_t, uint16_t)                                               \
+  TESTCTZ64 (uint16_t, uint8_t)                                                \
+  TESTCTZ64 (uint16_t, int64_t)                                                \
+  TESTCTZ64 (uint16_t, int32_t)                                                \
+  TESTCTZ64 (uint16_t, int16_t)                                                \
+  TESTCTZ64 (uint16_t, int8_t)                                                 \
+  TESTCTZ64N (int16_t, uint64_t)                                               \
+  TESTCTZ64N (int16_t, uint32_t)                                               \
+  TESTCTZ64N (int16_t, uint16_t)                                               \
+  TESTCTZ64N (int16_t, uint8_t)                                                \
+  TESTCTZ64N (int16_t, int64_t)                                                \
+  TESTCTZ64N (int16_t, int32_t)                                                \
+  TESTCTZ64N (int16_t, int16_t)                                                \
+  TESTCTZ64N (int16_t, int8_t)                                                 \
+  TESTCTZ64 (uint8_t, uint64_t)                                                \
+  TESTCTZ64 (uint8_t, uint32_t)                                                \
+  TESTCTZ64 (uint8_t, uint16_t)                                                \
+  TESTCTZ64 (uint8_t, uint8_t)                                                 \
+  TESTCTZ64 (uint8_t, int64_t)                                                 \
+  TESTCTZ64 (uint8_t, int32_t)                                                 \
+  TESTCTZ64 (uint8_t, int16_t)                                                 \
+  TESTCTZ64 (uint8_t, int8_t)                                                  \
+  TESTCTZ64N (int8_t, uint64_t)                                                \
+  TESTCTZ64N (int8_t, uint32_t)                                                \
+  TESTCTZ64N (int8_t, uint16_t)                                                \
+  TESTCTZ64N (int8_t, uint8_t)                                                 \
+  TESTCTZ64N (int8_t, int64_t)                                                 \
+  TESTCTZ64N (int8_t, int32_t)                                                 \
+  TESTCTZ64N (int8_t, int16_t)                                                 \
+  TESTCTZ64N (int8_t, int8_t)                                                  \
+  TESTCTZ32 (uint64_t, uint64_t)                                               \
+  TESTCTZ32 (uint64_t, uint32_t)                                               \
+  TESTCTZ32 (uint64_t, uint16_t)                                               \
+  TESTCTZ32 (uint64_t, uint8_t)                                                \
+  TESTCTZ32 (uint64_t, int64_t)                                                \
+  TESTCTZ32 (uint64_t, int32_t)                                                \
+  TESTCTZ32 (uint64_t, int16_t)                                                \
+  TESTCTZ32 (uint64_t, int8_t)                                                 \
+  TESTCTZ32N (int64_t, uint64_t)                                               \
+  TESTCTZ32N (int64_t, uint32_t)                                               \
+  TESTCTZ32N (int64_t, uint16_t)                                               \
+  TESTCTZ32N (int64_t, uint8_t)                                                \
+  TESTCTZ32N (int64_t, int64_t)                                                \
+  TESTCTZ32N (int64_t, int32_t)                                                \
+  TESTCTZ32N (int64_t, int16_t)                                                \
+  TESTCTZ32N (int64_t, int8_t)                                                 \
+  TESTCTZ32 (uint32_t, uint64_t)                                               \
+  TESTCTZ32 (uint32_t, uint32_t)                                               \
+  TESTCTZ32 (uint32_t, uint16_t)                                               \
+  TESTCTZ32 (uint32_t, uint8_t)                                                \
+  TESTCTZ32 (uint32_t, int64_t)                                                \
+  TESTCTZ32 (uint32_t, int32_t)                                                \
+  TESTCTZ32 (uint32_t, int16_t)                                                \
+  TESTCTZ32 (uint32_t, int8_t)                                                 \
+  TESTCTZ32N (int32_t, uint64_t)                                               \
+  TESTCTZ32N (int32_t, uint32_t)                                               \
+  TESTCTZ32N (int32_t, uint16_t)                                               \
+  TESTCTZ32N (int32_t, uint8_t)                                                \
+  TESTCTZ32N (int32_t, int64_t)                                                \
+  TESTCTZ32N (int32_t, int32_t)                                                \
+  TESTCTZ32N (int32_t, int16_t)                                                \
+  TESTCTZ32N (int32_t, int8_t)                                                 \
+  TESTCTZ32 (uint16_t, uint64_t)                                               \
+  TESTCTZ32 (uint16_t, uint32_t)                                               \
+  TESTCTZ32 (uint16_t, uint16_t)                                               \
+  TESTCTZ32 (uint16_t, uint8_t)                                                \
+  TESTCTZ32 (uint16_t, int64_t)                                                \
+  TESTCTZ32 (uint16_t, int32_t)                                                \
+  TESTCTZ32 (uint16_t, int16_t)                                                \
+  TESTCTZ32 (uint16_t, int8_t)                                                 \
+  TESTCTZ32N (int16_t, uint64_t)                                               \
+  TESTCTZ32N (int16_t, uint32_t)                                               \
+  TESTCTZ32N (int16_t, uint16_t)                                               \
+  TESTCTZ32N (int16_t, uint8_t)                                                \
+  TESTCTZ32N (int16_t, int64_t)                                                \
+  TESTCTZ32N (int16_t, int32_t)                                                \
+  TESTCTZ32N (int16_t, int16_t)                                                \
+  TESTCTZ32N (int16_t, int8_t)                                                 \
+  TESTCTZ32 (uint8_t, uint64_t)                                                \
+  TESTCTZ32 (uint8_t, uint32_t)                                                \
+  TESTCTZ32 (uint8_t, uint16_t)                                                \
+  TESTCTZ32 (uint8_t, uint8_t)                                                 \
+  TESTCTZ32 (uint8_t, int64_t)                                                 \
+  TESTCTZ32 (uint8_t, int32_t)                                                 \
+  TESTCTZ32 (uint8_t, int16_t)                                                 \
+  TESTCTZ32 (uint8_t, int8_t)                                                  \
+  TESTCTZ32N (int8_t, uint64_t)                                                \
+  TESTCTZ32N (int8_t, uint32_t)                                                \
+  TESTCTZ32N (int8_t, uint16_t)                                                \
+  TESTCTZ32N (int8_t, uint8_t)                                                 \
+  TESTCTZ32N (int8_t, int64_t)                                                 \
+  TESTCTZ32N (int8_t, int32_t)                                                 \
+  TESTCTZ32N (int8_t, int16_t)                                                 \
+  TESTCTZ32N (int8_t, int8_t)                                                  \
+  TESTFFS64 (uint64_t, uint64_t)                                               \
+  TESTFFS64 (uint64_t, uint32_t)                                               \
+  TESTFFS64 (uint64_t, uint16_t)                                               \
+  TESTFFS64 (uint64_t, uint8_t)                                                \
+  TESTFFS64 (uint64_t, int64_t)                                                \
+  TESTFFS64 (uint64_t, int32_t)                                                \
+  TESTFFS64 (uint64_t, int16_t)                                                \
+  TESTFFS64 (uint64_t, int8_t)                                                 \
+  TESTFFS64N (int64_t, uint64_t)                                               \
+  TESTFFS64N (int64_t, uint32_t)                                               \
+  TESTFFS64N (int64_t, uint16_t)                                               \
+  TESTFFS64N (int64_t, uint8_t)                                                \
+  TESTFFS64N (int64_t, int64_t)                                                \
+  TESTFFS64N (int64_t, int32_t)                                                \
+  TESTFFS64N (int64_t, int16_t)                                                \
+  TESTFFS64N (int64_t, int8_t)                                                 \
+  TESTFFS64 (uint32_t, uint64_t)                                               \
+  TESTFFS64 (uint32_t, uint32_t)                                               \
+  TESTFFS64 (uint32_t, uint16_t)                                               \
+  TESTFFS64 (uint32_t, uint8_t)                                                \
+  TESTFFS64 (uint32_t, int64_t)                                                \
+  TESTFFS64 (uint32_t, int32_t)                                                \
+  TESTFFS64 (uint32_t, int16_t)                                                \
+  TESTFFS64 (uint32_t, int8_t)                                                 \
+  TESTFFS64N (int32_t, uint64_t)                                               \
+  TESTFFS64N (int32_t, uint32_t)                                               \
+  TESTFFS64N (int32_t, uint16_t)                                               \
+  TESTFFS64N (int32_t, uint8_t)                                                \
+  TESTFFS64N (int32_t, int64_t)                                                \
+  TESTFFS64N (int32_t, int32_t)                                                \
+  TESTFFS64N (int32_t, int16_t)                                                \
+  TESTFFS64N (int32_t, int8_t)                                                 \
+  TESTFFS64 (uint16_t, uint64_t)                                               \
+  TESTFFS64 (uint16_t, uint32_t)                                               \
+  TESTFFS64 (uint16_t, uint16_t)                                               \
+  TESTFFS64 (uint16_t, uint8_t)                                                \
+  TESTFFS64 (uint16_t, int64_t)                                                \
+  TESTFFS64 (uint16_t, int32_t)                                                \
+  TESTFFS64 (uint16_t, int16_t)                                                \
+  TESTFFS64 (uint16_t, int8_t)                                                 \
+  TESTFFS64N (int16_t, uint64_t)                                               \
+  TESTFFS64N (int16_t, uint32_t)                                               \
+  TESTFFS64N (int16_t, uint16_t)                                               \
+  TESTFFS64N (int16_t, uint8_t)                                                \
+  TESTFFS64N (int16_t, int64_t)                                                \
+  TESTFFS64N (int16_t, int32_t)                                                \
+  TESTFFS64N (int16_t, int16_t)                                                \
+  TESTFFS64N (int16_t, int8_t)                                                 \
+  TESTFFS64 (uint8_t, uint64_t)                                                \
+  TESTFFS64 (uint8_t, uint32_t)                                                \
+  TESTFFS64 (uint8_t, uint16_t)                                                \
+  TESTFFS64 (uint8_t, uint8_t)                                                 \
+  TESTFFS64 (uint8_t, int64_t)                                                 \
+  TESTFFS64 (uint8_t, int32_t)                                                 \
+  TESTFFS64 (uint8_t, int16_t)                                                 \
+  TESTFFS64 (uint8_t, int8_t)                                                  \
+  TESTFFS64N (int8_t, uint64_t)                                                \
+  TESTFFS64N (int8_t, uint32_t)                                                \
+  TESTFFS64N (int8_t, uint16_t)                                                \
+  TESTFFS64N (int8_t, uint8_t)                                                 \
+  TESTFFS64N (int8_t, int64_t)                                                 \
+  TESTFFS64N (int8_t, int32_t)                                                 \
+  TESTFFS64N (int8_t, int16_t)                                                 \
+  TESTFFS64N (int8_t, int8_t)                                                  \
+  TESTFFS32 (uint64_t, uint64_t)                                               \
+  TESTFFS32 (uint64_t, uint32_t)                                               \
+  TESTFFS32 (uint64_t, uint16_t)                                               \
+  TESTFFS32 (uint64_t, uint8_t)                                                \
+  TESTFFS32 (uint64_t, int64_t)                                                \
+  TESTFFS32 (uint64_t, int32_t)                                                \
+  TESTFFS32 (uint64_t, int16_t)                                                \
+  TESTFFS32 (uint64_t, int8_t)                                                 \
+  TESTFFS32N (int64_t, uint64_t)                                               \
+  TESTFFS32N (int64_t, uint32_t)                                               \
+  TESTFFS32N (int64_t, uint16_t)                                               \
+  TESTFFS32N (int64_t, uint8_t)                                                \
+  TESTFFS32N (int64_t, int64_t)                                                \
+  TESTFFS32N (int64_t, int32_t)                                                \
+  TESTFFS32N (int64_t, int16_t)                                                \
+  TESTFFS32N (int64_t, int8_t)                                                 \
+  TESTFFS32 (uint32_t, uint64_t)                                               \
+  TESTFFS32 (uint32_t, uint32_t)                                               \
+  TESTFFS32 (uint32_t, uint16_t)                                               \
+  TESTFFS32 (uint32_t, uint8_t)                                                \
+  TESTFFS32 (uint32_t, int64_t)                                                \
+  TESTFFS32 (uint32_t, int32_t)                                                \
+  TESTFFS32 (uint32_t, int16_t)                                                \
+  TESTFFS32 (uint32_t, int8_t)                                                 \
+  TESTFFS32N (int32_t, uint64_t)                                               \
+  TESTFFS32N (int32_t, uint32_t)                                               \
+  TESTFFS32N (int32_t, uint16_t)                                               \
+  TESTFFS32N (int32_t, uint8_t)                                                \
+  TESTFFS32N (int32_t, int64_t)                                                \
+  TESTFFS32N (int32_t, int32_t)                                                \
+  TESTFFS32N (int32_t, int16_t)                                                \
+  TESTFFS32N (int32_t, int8_t)                                                 \
+  TESTFFS32 (uint16_t, uint64_t)                                               \
+  TESTFFS32 (uint16_t, uint32_t)                                               \
+  TESTFFS32 (uint16_t, uint16_t)                                               \
+  TESTFFS32 (uint16_t, uint8_t)                                                \
+  TESTFFS32 (uint16_t, int64_t)                                                \
+  TESTFFS32 (uint16_t, int32_t)                                                \
+  TESTFFS32 (uint16_t, int16_t)                                                \
+  TESTFFS32 (uint16_t, int8_t)                                                 \
+  TESTFFS32N (int16_t, uint64_t)                                               \
+  TESTFFS32N (int16_t, uint32_t)                                               \
+  TESTFFS32N (int16_t, uint16_t)                                               \
+  TESTFFS32N (int16_t, uint8_t)                                                \
+  TESTFFS32N (int16_t, int64_t)                                                \
+  TESTFFS32N (int16_t, int32_t)                                                \
+  TESTFFS32N (int16_t, int16_t)                                                \
+  TESTFFS32N (int16_t, int8_t)                                                 \
+  TESTFFS32 (uint8_t, uint64_t)                                                \
+  TESTFFS32 (uint8_t, uint32_t)                                                \
+  TESTFFS32 (uint8_t, uint16_t)                                                \
+  TESTFFS32 (uint8_t, uint8_t)                                                 \
+  TESTFFS32 (uint8_t, int64_t)                                                 \
+  TESTFFS32 (uint8_t, int32_t)                                                 \
+  TESTFFS32 (uint8_t, int16_t)                                                 \
+  TESTFFS32 (uint8_t, int8_t)                                                  \
+  TESTFFS32N (int8_t, uint64_t)                                                \
+  TESTFFS32N (int8_t, uint32_t)                                                \
+  TESTFFS32N (int8_t, uint16_t)                                                \
+  TESTFFS32N (int8_t, uint8_t)                                                 \
+  TESTFFS32N (int8_t, int64_t)                                                 \
+  TESTFFS32N (int8_t, int32_t)                                                 \
+  TESTFFS32N (int8_t, int16_t)                                                 \
+  TESTFFS32N (int8_t, int8_t)
+
+TEST_ALL ()
+
+#define RUN64(TYPEDST, TYPESRC) test64_##TYPEDST##TYPESRC ();
+#define RUN64N(TYPEDST, TYPESRC) test64n_##TYPEDST##TYPESRC ();
+#define RUN32(TYPEDST, TYPESRC) test32_##TYPEDST##TYPESRC ();
+#define RUN32N(TYPEDST, TYPESRC) test32n_##TYPEDST##TYPESRC ();
+#define RUNCTZ64(TYPEDST, TYPESRC) testctz64_##TYPEDST##TYPESRC ();
+#define RUNCTZ64N(TYPEDST, TYPESRC) testctz64n_##TYPEDST##TYPESRC ();
+#define RUNCTZ32(TYPEDST, TYPESRC) testctz32_##TYPEDST##TYPESRC ();
+#define RUNCTZ32N(TYPEDST, TYPESRC) testctz32n_##TYPEDST##TYPESRC ();
+#define RUNFFS64(TYPEDST, TYPESRC) testffs64_##TYPEDST##TYPESRC ();
+#define RUNFFS64N(TYPEDST, TYPESRC) testffs64n_##TYPEDST##TYPESRC ();
+#define RUNFFS32(TYPEDST, TYPESRC) testffs32_##TYPEDST##TYPESRC ();
+#define RUNFFS32N(TYPEDST, TYPESRC) testffs32n_##TYPEDST##TYPESRC ();
+
+#define RUN_ALL()                                                              \
+  RUN64 (uint64_t, uint64_t)                                                   \
+  RUN64 (uint64_t, uint32_t)                                                   \
+  RUN64 (uint64_t, uint16_t)                                                   \
+  RUN64 (uint64_t, uint8_t)                                                    \
+  RUN64 (uint64_t, int64_t)                                                    \
+  RUN64 (uint64_t, int32_t)                                                    \
+  RUN64 (uint64_t, int16_t)                                                    \
+  RUN64 (uint64_t, int8_t)                                                     \
+  RUN64N (int64_t, uint64_t)                                                   \
+  RUN64N (int64_t, uint32_t)                                                   \
+  RUN64N (int64_t, uint16_t)                                                   \
+  RUN64N (int64_t, uint8_t)                                                    \
+  RUN64N (int64_t, int64_t)                                                    \
+  RUN64N (int64_t, int32_t)                                                    \
+  RUN64N (int64_t, int16_t)                                                    \
+  RUN64N (int64_t, int8_t)                                                     \
+  RUN64 (uint32_t, uint64_t)                                                   \
+  RUN64 (uint32_t, uint32_t)                                                   \
+  RUN64 (uint32_t, uint16_t)                                                   \
+  RUN64 (uint32_t, uint8_t)                                                    \
+  RUN64 (uint32_t, int64_t)                                                    \
+  RUN64 (uint32_t, int32_t)                                                    \
+  RUN64 (uint32_t, int16_t)                                                    \
+  RUN64 (uint32_t, int8_t)                                                     \
+  RUN64N (int32_t, uint64_t)                                                   \
+  RUN64N (int32_t, uint32_t)                                                   \
+  RUN64N (int32_t, uint16_t)                                                   \
+  RUN64N (int32_t, uint8_t)                                                    \
+  RUN64N (int32_t, int64_t)                                                    \
+  RUN64N (int32_t, int32_t)                                                    \
+  RUN64N (int32_t, int16_t)                                                    \
+  RUN64N (int32_t, int8_t)                                                     \
+  RUN64 (uint16_t, uint64_t)                                                   \
+  RUN64 (uint16_t, uint32_t)                                                   \
+  RUN64 (uint16_t, uint16_t)                                                   \
+  RUN64 (uint16_t, uint8_t)                                                    \
+  RUN64 (uint16_t, int64_t)                                                    \
+  RUN64 (uint16_t, int32_t)                                                    \
+  RUN64 (uint16_t, int16_t)                                                    \
+  RUN64 (uint16_t, int8_t)                                                     \
+  RUN64N (int16_t, uint64_t)                                                   \
+  RUN64N (int16_t, uint32_t)                                                   \
+  RUN64N (int16_t, uint16_t)                                                   \
+  RUN64N (int16_t, uint8_t)                                                    \
+  RUN64N (int16_t, int64_t)                                                    \
+  RUN64N (int16_t, int32_t)                                                    \
+  RUN64N (int16_t, int16_t)                                                    \
+  RUN64N (int16_t, int8_t)                                                     \
+  RUN64 (uint8_t, uint64_t)                                                    \
+  RUN64 (uint8_t, uint32_t)                                                    \
+  RUN64 (uint8_t, uint16_t)                                                    \
+  RUN64 (uint8_t, uint8_t)                                                     \
+  RUN64 (uint8_t, int64_t)                                                     \
+  RUN64 (uint8_t, int32_t)                                                     \
+  RUN64 (uint8_t, int16_t)                                                     \
+  RUN64 (uint8_t, int8_t)                                                      \
+  RUN64N (int8_t, uint64_t)                                                    \
+  RUN64N (int8_t, uint32_t)                                                    \
+  RUN64N (int8_t, uint16_t)                                                    \
+  RUN64N (int8_t, uint8_t)                                                     \
+  RUN64N (int8_t, int64_t)                                                     \
+  RUN64N (int8_t, int32_t)                                                     \
+  RUN64N (int8_t, int16_t)                                                     \
+  RUN64N (int8_t, int8_t)                                                      \
+  RUN32 (uint64_t, uint64_t)                                                   \
+  RUN32 (uint64_t, uint32_t)                                                   \
+  RUN32 (uint64_t, uint16_t)                                                   \
+  RUN32 (uint64_t, uint8_t)                                                    \
+  RUN32 (uint64_t, int64_t)                                                    \
+  RUN32 (uint64_t, int32_t)                                                    \
+  RUN32 (uint64_t, int16_t)                                                    \
+  RUN32 (uint64_t, int8_t)                                                     \
+  RUN32N (int64_t, uint64_t)                                                   \
+  RUN32N (int64_t, uint32_t)                                                   \
+  RUN32N (int64_t, uint16_t)                                                   \
+  RUN32N (int64_t, uint8_t)                                                    \
+  RUN32N (int64_t, int64_t)                                                    \
+  RUN32N (int64_t, int32_t)                                                    \
+  RUN32N (int64_t, int16_t)                                                    \
+  RUN32N (int64_t, int8_t)                                                     \
+  RUN32 (uint32_t, uint64_t)                                                   \
+  RUN32 (uint32_t, uint32_t)                                                   \
+  RUN32 (uint32_t, uint16_t)                                                   \
+  RUN32 (uint32_t, uint8_t)                                                    \
+  RUN32 (uint32_t, int64_t)                                                    \
+  RUN32 (uint32_t, int32_t)                                                    \
+  RUN32 (uint32_t, int16_t)                                                    \
+  RUN32 (uint32_t, int8_t)                                                     \
+  RUN32N (int32_t, uint64_t)                                                   \
+  RUN32N (int32_t, uint32_t)                                                   \
+  RUN32N (int32_t, uint16_t)                                                   \
+  RUN32N (int32_t, uint8_t)                                                    \
+  RUN32N (int32_t, int64_t)                                                    \
+  RUN32N (int32_t, int32_t)                                                    \
+  RUN32N (int32_t, int16_t)                                                    \
+  RUN32N (int32_t, int8_t)                                                     \
+  RUN32 (uint16_t, uint64_t)                                                   \
+  RUN32 (uint16_t, uint32_t)                                                   \
+  RUN32 (uint16_t, uint16_t)                                                   \
+  RUN32 (uint16_t, uint8_t)                                                    \
+  RUN32 (uint16_t, int64_t)                                                    \
+  RUN32 (uint16_t, int32_t)                                                    \
+  RUN32 (uint16_t, int16_t)                                                    \
+  RUN32 (uint16_t, int8_t)                                                     \
+  RUN32N (int16_t, uint64_t)                                                    \
+  RUN32N (int16_t, uint32_t)                                                    \
+  RUN32N (int16_t, uint16_t)                                                    \
+  RUN32N (int16_t, uint8_t)                                                     \
+  RUN32N (int16_t, int64_t)                                                     \
+  RUN32N (int16_t, int32_t)                                                     \
+  RUN32N (int16_t, int16_t)                                                     \
+  RUN32N (int16_t, int8_t)                                                      \
+  RUN32 (uint8_t, uint64_t)                                                    \
+  RUN32 (uint8_t, uint32_t)                                                    \
+  RUN32 (uint8_t, uint16_t)                                                    \
+  RUN32 (uint8_t, uint8_t)                                                     \
+  RUN32 (uint8_t, int64_t)                                                     \
+  RUN32 (uint8_t, int32_t)                                                     \
+  RUN32 (uint8_t, int16_t)                                                     \
+  RUN32 (uint8_t, int8_t)                                                      \
+  RUN32N (int8_t, uint64_t)                                                     \
+  RUN32N (int8_t, uint32_t)                                                     \
+  RUN32N (int8_t, uint16_t)                                                     \
+  RUN32N (int8_t, uint8_t)                                                      \
+  RUN32N (int8_t, int64_t)                                                      \
+  RUN32N (int8_t, int32_t)                                                      \
+  RUN32N (int8_t, int16_t)                                                      \
+  RUN32N (int8_t, int8_t)                                                       \
+  RUNCTZ64 (uint64_t, uint64_t)                                                \
+  RUNCTZ64 (uint64_t, uint32_t)                                                \
+  RUNCTZ64 (uint64_t, uint16_t)                                                \
+  RUNCTZ64 (uint64_t, uint8_t)                                                 \
+  RUNCTZ64 (uint64_t, int64_t)                                                 \
+  RUNCTZ64 (uint64_t, int32_t)                                                 \
+  RUNCTZ64 (uint64_t, int16_t)                                                 \
+  RUNCTZ64 (uint64_t, int8_t)                                                  \
+  RUNCTZ64N (int64_t, uint64_t)                                                 \
+  RUNCTZ64N (int64_t, uint32_t)                                                 \
+  RUNCTZ64N (int64_t, uint16_t)                                                 \
+  RUNCTZ64N (int64_t, uint8_t)                                                  \
+  RUNCTZ64N (int64_t, int64_t)                                                  \
+  RUNCTZ64N (int64_t, int32_t)                                                  \
+  RUNCTZ64N (int64_t, int16_t)                                                  \
+  RUNCTZ64N (int64_t, int8_t)                                                   \
+  RUNCTZ64 (uint32_t, uint64_t)                                                \
+  RUNCTZ64 (uint32_t, uint32_t)                                                \
+  RUNCTZ64 (uint32_t, uint16_t)                                                \
+  RUNCTZ64 (uint32_t, uint8_t)                                                 \
+  RUNCTZ64 (uint32_t, int64_t)                                                 \
+  RUNCTZ64 (uint32_t, int32_t)                                                 \
+  RUNCTZ64 (uint32_t, int16_t)                                                 \
+  RUNCTZ64 (uint32_t, int8_t)                                                  \
+  RUNCTZ64N (int32_t, uint64_t)                                                 \
+  RUNCTZ64N (int32_t, uint32_t)                                                 \
+  RUNCTZ64N (int32_t, uint16_t)                                                 \
+  RUNCTZ64N (int32_t, uint8_t)                                                  \
+  RUNCTZ64N (int32_t, int64_t)                                                  \
+  RUNCTZ64N (int32_t, int32_t)                                                  \
+  RUNCTZ64N (int32_t, int16_t)                                                  \
+  RUNCTZ64N (int32_t, int8_t)                                                   \
+  RUNCTZ64 (uint16_t, uint64_t)                                                \
+  RUNCTZ64 (uint16_t, uint32_t)                                                \
+  RUNCTZ64 (uint16_t, uint16_t)                                                \
+  RUNCTZ64 (uint16_t, uint8_t)                                                 \
+  RUNCTZ64 (uint16_t, int64_t)                                                 \
+  RUNCTZ64 (uint16_t, int32_t)                                                 \
+  RUNCTZ64 (uint16_t, int16_t)                                                 \
+  RUNCTZ64 (uint16_t, int8_t)                                                  \
+  RUNCTZ64N (int16_t, uint64_t)                                                \
+  RUNCTZ64N (int16_t, uint32_t)                                                \
+  RUNCTZ64N (int16_t, uint16_t)                                                \
+  RUNCTZ64N (int16_t, uint8_t)                                                 \
+  RUNCTZ64N (int16_t, int64_t)                                                 \
+  RUNCTZ64N (int16_t, int32_t)                                                 \
+  RUNCTZ64N (int16_t, int16_t)                                                 \
+  RUNCTZ64N (int16_t, int8_t)                                                  \
+  RUNCTZ64 (uint8_t, uint64_t)                                                 \
+  RUNCTZ64 (uint8_t, uint32_t)                                                 \
+  RUNCTZ64 (uint8_t, uint16_t)                                                 \
+  RUNCTZ64 (uint8_t, uint8_t)                                                  \
+  RUNCTZ64 (uint8_t, int64_t)                                                  \
+  RUNCTZ64 (uint8_t, int32_t)                                                  \
+  RUNCTZ64 (uint8_t, int16_t)                                                  \
+  RUNCTZ64 (uint8_t, int8_t)                                                   \
+  RUNCTZ64N (int8_t, uint64_t)                                                 \
+  RUNCTZ64N (int8_t, uint32_t)                                                 \
+  RUNCTZ64N (int8_t, uint16_t)                                                 \
+  RUNCTZ64N (int8_t, uint8_t)                                                  \
+  RUNCTZ64N (int8_t, int64_t)                                                  \
+  RUNCTZ64N (int8_t, int32_t)                                                  \
+  RUNCTZ64N (int8_t, int16_t)                                                  \
+  RUNCTZ64N (int8_t, int8_t)                                                   \
+  RUNCTZ32 (uint64_t, uint64_t)                                                \
+  RUNCTZ32 (uint64_t, uint32_t)                                                \
+  RUNCTZ32 (uint64_t, uint16_t)                                                \
+  RUNCTZ32 (uint64_t, uint8_t)                                                 \
+  RUNCTZ32 (uint64_t, int64_t)                                                 \
+  RUNCTZ32 (uint64_t, int32_t)                                                 \
+  RUNCTZ32 (uint64_t, int16_t)                                                 \
+  RUNCTZ32 (uint64_t, int8_t)                                                  \
+  RUNCTZ32N (int64_t, uint64_t)                                                \
+  RUNCTZ32N (int64_t, uint32_t)                                                \
+  RUNCTZ32N (int64_t, uint16_t)                                                \
+  RUNCTZ32N (int64_t, uint8_t)                                                 \
+  RUNCTZ32N (int64_t, int64_t)                                                 \
+  RUNCTZ32N (int64_t, int32_t)                                                 \
+  RUNCTZ32N (int64_t, int16_t)                                                 \
+  RUNCTZ32N (int64_t, int8_t)                                                  \
+  RUNCTZ32 (uint32_t, uint64_t)                                                \
+  RUNCTZ32 (uint32_t, uint32_t)                                                \
+  RUNCTZ32 (uint32_t, uint16_t)                                                \
+  RUNCTZ32 (uint32_t, uint8_t)                                                 \
+  RUNCTZ32 (uint32_t, int64_t)                                                 \
+  RUNCTZ32 (uint32_t, int32_t)                                                 \
+  RUNCTZ32 (uint32_t, int16_t)                                                 \
+  RUNCTZ32 (uint32_t, int8_t)                                                  \
+  RUNCTZ32N (int32_t, uint64_t)                                                \
+  RUNCTZ32N (int32_t, uint32_t)                                                \
+  RUNCTZ32N (int32_t, uint16_t)                                                \
+  RUNCTZ32N (int32_t, uint8_t)                                                 \
+  RUNCTZ32N (int32_t, int64_t)                                                 \
+  RUNCTZ32N (int32_t, int32_t)                                                 \
+  RUNCTZ32N (int32_t, int16_t)                                                 \
+  RUNCTZ32N (int32_t, int8_t)                                                  \
+  RUNCTZ32 (uint16_t, uint64_t)                                                \
+  RUNCTZ32 (uint16_t, uint32_t)                                                \
+  RUNCTZ32 (uint16_t, uint16_t)                                                \
+  RUNCTZ32 (uint16_t, uint8_t)                                                 \
+  RUNCTZ32 (uint16_t, int64_t)                                                 \
+  RUNCTZ32 (uint16_t, int32_t)                                                 \
+  RUNCTZ32 (uint16_t, int16_t)                                                 \
+  RUNCTZ32 (uint16_t, int8_t)                                                  \
+  RUNCTZ32N (int16_t, uint64_t)                                                \
+  RUNCTZ32N (int16_t, uint32_t)                                                \
+  RUNCTZ32N (int16_t, uint16_t)                                                \
+  RUNCTZ32N (int16_t, uint8_t)                                                 \
+  RUNCTZ32N (int16_t, int64_t)                                                 \
+  RUNCTZ32N (int16_t, int32_t)                                                 \
+  RUNCTZ32N (int16_t, int16_t)                                                 \
+  RUNCTZ32N (int16_t, int8_t)                                                  \
+  RUNCTZ32 (uint8_t, uint64_t)                                                 \
+  RUNCTZ32 (uint8_t, uint32_t)                                                 \
+  RUNCTZ32 (uint8_t, uint16_t)                                                 \
+  RUNCTZ32 (uint8_t, uint8_t)                                                  \
+  RUNCTZ32 (uint8_t, int64_t)                                                  \
+  RUNCTZ32 (uint8_t, int32_t)                                                  \
+  RUNCTZ32 (uint8_t, int16_t)                                                  \
+  RUNCTZ32 (uint8_t, int8_t)                                                   \
+  RUNCTZ32N (int8_t, uint64_t)                                                 \
+  RUNCTZ32N (int8_t, uint32_t)                                                 \
+  RUNCTZ32N (int8_t, uint16_t)                                                 \
+  RUNCTZ32N (int8_t, uint8_t)                                                  \
+  RUNCTZ32N (int8_t, int64_t)                                                  \
+  RUNCTZ32N (int8_t, int32_t)                                                  \
+  RUNCTZ32N (int8_t, int16_t)                                                  \
+  RUNCTZ32N (int8_t, int8_t)                                                   \
+  RUNFFS64 (uint64_t, uint64_t)                                                \
+  RUNFFS64 (uint64_t, uint32_t)                                                \
+  RUNFFS64 (uint64_t, uint16_t)                                                \
+  RUNFFS64 (uint64_t, uint8_t)                                                 \
+  RUNFFS64 (uint64_t, int64_t)                                                 \
+  RUNFFS64 (uint64_t, int32_t)                                                 \
+  RUNFFS64 (uint64_t, int16_t)                                                 \
+  RUNFFS64 (uint64_t, int8_t)                                                  \
+  RUNFFS64N (int64_t, uint64_t)                                                \
+  RUNFFS64N (int64_t, uint32_t)                                                \
+  RUNFFS64N (int64_t, uint16_t)                                                \
+  RUNFFS64N (int64_t, uint8_t)                                                 \
+  RUNFFS64N (int64_t, int64_t)                                                 \
+  RUNFFS64N (int64_t, int32_t)                                                 \
+  RUNFFS64N (int64_t, int16_t)                                                 \
+  RUNFFS64N (int64_t, int8_t)                                                  \
+  RUNFFS64 (uint32_t, uint64_t)                                                \
+  RUNFFS64 (uint32_t, uint32_t)                                                \
+  RUNFFS64 (uint32_t, uint16_t)                                                \
+  RUNFFS64 (uint32_t, uint8_t)                                                 \
+  RUNFFS64 (uint32_t, int64_t)                                                 \
+  RUNFFS64 (uint32_t, int32_t)                                                 \
+  RUNFFS64 (uint32_t, int16_t)                                                 \
+  RUNFFS64 (uint32_t, int8_t)                                                  \
+  RUNFFS64N (int32_t, uint64_t)                                                \
+  RUNFFS64N (int32_t, uint32_t)                                                \
+  RUNFFS64N (int32_t, uint16_t)                                                \
+  RUNFFS64N (int32_t, uint8_t)                                                 \
+  RUNFFS64N (int32_t, int64_t)                                                 \
+  RUNFFS64N (int32_t, int32_t)                                                 \
+  RUNFFS64N (int32_t, int16_t)                                                 \
+  RUNFFS64N (int32_t, int8_t)                                                  \
+  RUNFFS64 (uint16_t, uint64_t)                                                \
+  RUNFFS64 (uint16_t, uint32_t)                                                \
+  RUNFFS64 (uint16_t, uint16_t)                                                \
+  RUNFFS64 (uint16_t, uint8_t)                                                 \
+  RUNFFS64 (uint16_t, int64_t)                                                 \
+  RUNFFS64 (uint16_t, int32_t)                                                 \
+  RUNFFS64 (uint16_t, int16_t)                                                 \
+  RUNFFS64 (uint16_t, int8_t)                                                  \
+  RUNFFS64N (int16_t, uint64_t)                                                \
+  RUNFFS64N (int16_t, uint32_t)                                                \
+  RUNFFS64N (int16_t, uint16_t)                                                \
+  RUNFFS64N (int16_t, uint8_t)                                                 \
+  RUNFFS64N (int16_t, int64_t)                                                 \
+  RUNFFS64N (int16_t, int32_t)                                                 \
+  RUNFFS64N (int16_t, int16_t)                                                 \
+  RUNFFS64N (int16_t, int8_t)                                                  \
+  RUNFFS64 (uint8_t, uint64_t)                                                 \
+  RUNFFS64 (uint8_t, uint32_t)                                                 \
+  RUNFFS64 (uint8_t, uint16_t)                                                 \
+  RUNFFS64 (uint8_t, uint8_t)                                                  \
+  RUNFFS64 (uint8_t, int64_t)                                                  \
+  RUNFFS64 (uint8_t, int32_t)                                                  \
+  RUNFFS64 (uint8_t, int16_t)                                                  \
+  RUNFFS64 (uint8_t, int8_t)                                                   \
+  RUNFFS64N (int8_t, uint64_t)                                                 \
+  RUNFFS64N (int8_t, uint32_t)                                                 \
+  RUNFFS64N (int8_t, uint16_t)                                                 \
+  RUNFFS64N (int8_t, uint8_t)                                                  \
+  RUNFFS64N (int8_t, int64_t)                                                  \
+  RUNFFS64N (int8_t, int32_t)                                                  \
+  RUNFFS64N (int8_t, int16_t)                                                  \
+  RUNFFS64N (int8_t, int8_t)                                                   \
+  RUNFFS32 (uint64_t, uint64_t)                                                \
+  RUNFFS32 (uint64_t, uint32_t)                                                \
+  RUNFFS32 (uint64_t, uint16_t)                                                \
+  RUNFFS32 (uint64_t, uint8_t)                                                 \
+  RUNFFS32 (uint64_t, int64_t)                                                 \
+  RUNFFS32 (uint64_t, int32_t)                                                 \
+  RUNFFS32 (uint64_t, int16_t)                                                 \
+  RUNFFS32 (uint64_t, int8_t)                                                  \
+  RUNFFS32N (int64_t, uint64_t)                                                \
+  RUNFFS32N (int64_t, uint32_t)                                                \
+  RUNFFS32N (int64_t, uint16_t)                                                \
+  RUNFFS32N (int64_t, uint8_t)                                                 \
+  RUNFFS32N (int64_t, int64_t)                                                 \
+  RUNFFS32N (int64_t, int32_t)                                                 \
+  RUNFFS32N (int64_t, int16_t)                                                 \
+  RUNFFS32N (int64_t, int8_t)                                                  \
+  RUNFFS32 (uint32_t, uint64_t)                                                \
+  RUNFFS32 (uint32_t, uint32_t)                                                \
+  RUNFFS32 (uint32_t, uint16_t)                                                \
+  RUNFFS32 (uint32_t, uint8_t)                                                 \
+  RUNFFS32 (uint32_t, int64_t)                                                 \
+  RUNFFS32 (uint32_t, int32_t)                                                 \
+  RUNFFS32 (uint32_t, int16_t)                                                 \
+  RUNFFS32 (uint32_t, int8_t)                                                  \
+  RUNFFS32N (int32_t, uint64_t)                                                \
+  RUNFFS32N (int32_t, uint32_t)                                                \
+  RUNFFS32N (int32_t, uint16_t)                                                \
+  RUNFFS32N (int32_t, uint8_t)                                                 \
+  RUNFFS32N (int32_t, int64_t)                                                 \
+  RUNFFS32N (int32_t, int32_t)                                                 \
+  RUNFFS32N (int32_t, int16_t)                                                 \
+  RUNFFS32N (int32_t, int8_t)                                                  \
+  RUNFFS32 (uint16_t, uint64_t)                                                \
+  RUNFFS32 (uint16_t, uint32_t)                                                \
+  RUNFFS32 (uint16_t, uint16_t)                                                \
+  RUNFFS32 (uint16_t, uint8_t)                                                 \
+  RUNFFS32 (uint16_t, int64_t)                                                 \
+  RUNFFS32 (uint16_t, int32_t)                                                 \
+  RUNFFS32 (uint16_t, int16_t)                                                 \
+  RUNFFS32 (uint16_t, int8_t)                                                  \
+  RUNFFS32N (int16_t, uint64_t)                                                \
+  RUNFFS32N (int16_t, uint32_t)                                                \
+  RUNFFS32N (int16_t, uint16_t)                                                \
+  RUNFFS32N (int16_t, uint8_t)                                                 \
+  RUNFFS32N (int16_t, int64_t)                                                 \
+  RUNFFS32N (int16_t, int32_t)                                                 \
+  RUNFFS32N (int16_t, int16_t)                                                 \
+  RUNFFS32N (int16_t, int8_t)                                                  \
+  RUNFFS32 (uint8_t, uint64_t)                                                 \
+  RUNFFS32 (uint8_t, uint32_t)                                                 \
+  RUNFFS32 (uint8_t, uint16_t)                                                 \
+  RUNFFS32 (uint8_t, uint8_t)                                                  \
+  RUNFFS32 (uint8_t, int64_t)                                                  \
+  RUNFFS32 (uint8_t, int32_t)                                                  \
+  RUNFFS32 (uint8_t, int16_t)                                                  \
+  RUNFFS32 (uint8_t, int8_t)                                                   \
+  RUNFFS32N (int8_t, uint64_t)                                                 \
+  RUNFFS32N (int8_t, uint32_t)                                                 \
+  RUNFFS32N (int8_t, uint16_t)                                                 \
+  RUNFFS32N (int8_t, uint8_t)                                                  \
+  RUNFFS32N (int8_t, int64_t)                                                  \
+  RUNFFS32N (int8_t, int32_t)                                                  \
+  RUNFFS32N (int8_t, int16_t)                                                  \
+  RUNFFS32N (int8_t, int8_t)
+
+int
+main ()
+{
+  RUN_ALL ()
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 229 "vect" } } */
-- 
2.41.0
 


* Re: Re: [PATCH] RISC-V: Add popcount fallback expander.
       [not found] ` <202310181728104086621@rivai.ai>
@ 2023-10-18  9:30   ` juzhe.zhong
  0 siblings, 0 replies; 10+ messages in thread
From: juzhe.zhong @ 2023-10-18  9:30 UTC (permalink / raw)
  To: Robin Dapp, gcc-patches; +Cc: Robin Dapp

[-- Attachment #1: Type: text/plain, Size: 126209 bytes --]

Could you try the following code:

int x[8];
int y[8];

void foo ()
{
  x[0] = __builtin_popcount (y[0]);
  x[1] = __builtin_popcount (y[1]);
  x[2] = __builtin_popcount (y[2]);
  x[3] = __builtin_popcount (y[3]);
  x[4] = __builtin_popcount (y[4]);
  x[5] = __builtin_popcount (y[5]);
  x[6] = __builtin_popcount (y[6]);
  x[7] = __builtin_popcount (y[7]);
}

I see you didn't extend VI -> V_VLSI, so I guess SLP will fail on popcount.



juzhe.zhong@rivai.ai
 
From: juzhe.zhong@rivai.ai
Date: 2023-10-18 17:28
To: Robin Dapp; gcc-patches; palmer; kito.cheng; jeffreyalaw
CC: Robin Dapp
Subject: Re: [PATCH] RISC-V: Add popcount fallback expander.
Could you try the following code:

int x[8];
int y[8];

void foo ()
{
  x[0] = __builtin_popcount (y[0]);
  x[1] = __builtin_popcount (y[1]);
  x[2] = __builtin_popcount (y[2]);
  x[3] = __builtin_popcount (y[3]);
  x[4] = __builtin_popcount (y[4]);
  x[5] = __builtin_popcount (y[5]);
  x[6] = __builtin_popcount (y[6]);
  x[7] = __builtin_popcount (y[7]);
}

I see you didn't extend VI -> V_VLSI, so I guess SLP will fail on popcount.


juzhe.zhong@rivai.ai
 
From: Robin Dapp
Date: 2023-10-18 17:20
To: gcc-patches; palmer; Kito Cheng; jeffreyalaw; juzhe.zhong@rivai.ai
CC: rdapp.gcc
Subject: [PATCH] RISC-V: Add popcount fallback expander.
Hi,
 
as I didn't manage to get back to the generic vectorizer fallback for
popcount in time (still the generic costing problem) I figured I'd
rather implement the popcount fallback in the riscv backend.
It uses the WWG algorithm from libgcc.
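
For readers unfamiliar with the algorithm, here is a minimal scalar sketch
of the same bit-parallel steps the expander emits per vector element. The
function names popcount64_wwg and popcount32_wwg are made up for this
illustration and are not part of the patch or of libgcc; the masks are the
ones used in expand_popcount, truncated to the element width.

```c
#include <stdint.h>

/* Hypothetical helper (not in the patch): popcount of one 64-bit element.  */
static unsigned
popcount64_wwg (uint64_t x)
{
  const uint64_t m5 = 0x5555555555555555ULL;
  const uint64_t m3 = 0x3333333333333333ULL;
  const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL;
  const uint64_t m1 = 0x0101010101010101ULL;

  /* Each 2-bit field now holds the number of set bits it contained.  */
  x = x - ((x >> 1) & m5);
  /* Add neighbouring 2-bit counts into 4-bit fields.  */
  x = (x & m3) + ((x >> 2) & m3);
  /* Add neighbouring 4-bit counts into bytes.  */
  x = (x + (x >> 4)) & mf;
  /* The multiplication sums all bytes into the top byte.  */
  return (unsigned) ((x * m1) >> 56);
}

/* Hypothetical helper: same steps for a 32-bit element.  The final shift
   is element bitsize - 8, i.e. 24 here.  */
static unsigned
popcount32_wwg (uint32_t x)
{
  x = x - ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x + (x >> 4)) & 0x0F0F0F0Fu;
  return (x * 0x01010101u) >> 24;
}
```

Calling popcount32_wwg (0x11111100) gives 6, which matches the first entry
of the data table in popcount-run-1.c.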
 
rvv.exp is unchanged, vect and dg.exp testsuites are currently running.
 
Regards
Robin
 
gcc/ChangeLog:
 
* config/riscv/autovec.md (popcount<mode>2): New expander.
* config/riscv/riscv-protos.h (expand_popcount): Define.
* config/riscv/riscv-v.cc (expand_popcount): Vectorize popcount
with the WWG algorithm.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/unop/popcount-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount.c: New test.
---
gcc/config/riscv/autovec.md                   |   14 +
gcc/config/riscv/riscv-protos.h               |    1 +
gcc/config/riscv/riscv-v.cc                   |   71 +
.../riscv/rvv/autovec/unop/popcount-1.c       |   20 +
.../riscv/rvv/autovec/unop/popcount-run-1.c   |   49 +
.../riscv/rvv/autovec/unop/popcount.c         | 1464 +++++++++++++++++
6 files changed, 1619 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
 
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index c5b1e52cbf9..dfe836f705d 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1484,6 +1484,20 @@ (define_expand "xorsign<mode>3"
   DONE;
})
+;; -------------------------------------------------------------------------------
+;; - [INT] POPCOUNT.
+;; -------------------------------------------------------------------------------
+
+(define_expand "popcount<mode>2"
+  [(match_operand:VI 0 "register_operand")
+   (match_operand:VI 1 "register_operand")]
+  "TARGET_VECTOR"
+{
+  riscv_vector::expand_popcount (operands);
+  DONE;
+})
+
+
;; -------------------------------------------------------------------------
;; ---- [INT] Highpart multiplication
;; -------------------------------------------------------------------------
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 49bdcdf2f93..4aeccdd961b 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -515,6 +515,7 @@ void expand_fold_extract_last (rtx *);
void expand_cond_unop (unsigned, rtx *);
void expand_cond_binop (unsigned, rtx *);
void expand_cond_ternop (unsigned, rtx *);
+void expand_popcount (rtx *);
/* Rounding mode bitfield for fixed point VXRM.  */
enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 21d86c3f917..8b594b7127e 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4152,4 +4152,75 @@ expand_vec_lfloor (rtx op_0, rtx op_1, machine_mode vec_fp_mode,
   emit_vec_cvt_x_f (op_0, op_1, UNARY_OP_FRM_RDN, vec_fp_mode);
}
+/* Vectorize popcount by the Wilkes-Wheeler-Gill algorithm that libgcc uses as
+   well.  */
+void
+expand_popcount (rtx *ops)
+{
+  rtx dst = ops[0];
+  rtx src = ops[1];
+  machine_mode mode = GET_MODE (dst);
+  scalar_mode imode = GET_MODE_INNER (mode);
+  static const uint64_t m5 = 0x5555555555555555ULL;
+  static const uint64_t m3 = 0x3333333333333333ULL;
+  static const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL;
+  static const uint64_t m1 = 0x0101010101010101ULL;
+
+  rtx x1 = gen_reg_rtx (mode);
+  rtx x2 = gen_reg_rtx (mode);
+  rtx x3 = gen_reg_rtx (mode);
+  rtx x4 = gen_reg_rtx (mode);
+
+  /* x1 = src - ((src >> 1) & 0x555...);  */
+  rtx shift1 = expand_binop (mode, lshr_optab, src, GEN_INT (1), NULL, true,
+      OPTAB_DIRECT);
+
+  rtx and1 = gen_reg_rtx (mode);
+  rtx ops1[] = {and1, shift1, gen_int_mode (m5, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+    ops1);
+
+  x1 = expand_binop (mode, sub_optab, src, and1, NULL, true, OPTAB_DIRECT);
+
+  /* x2 = (x1 & 0x3333333333333333ULL) + ((x1 >> 2) & 0x3333333333333333ULL);
+   */
+  rtx and2 = gen_reg_rtx (mode);
+  rtx ops2[] = {and2, x1, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+    ops2);
+
+  rtx shift2 = expand_binop (mode, lshr_optab, x1, GEN_INT (2), NULL, true,
+      OPTAB_DIRECT);
+
+  rtx and22 = gen_reg_rtx (mode);
+  rtx ops22[] = {and22, shift2, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+    ops22);
+
+  x2 = expand_binop (mode, add_optab, and2, and22, NULL, true, OPTAB_DIRECT);
+
+  /* x3 = (x2 + (x2 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;  */
+  rtx shift3 = expand_binop (mode, lshr_optab, x2, GEN_INT (4), NULL, true,
+      OPTAB_DIRECT);
+
+  rtx plus3
+    = expand_binop (mode, add_optab, x2, shift3, NULL, true, OPTAB_DIRECT);
+
+  rtx ops3[] = {x3, plus3, gen_int_mode (mf, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+    ops3);
+
+  /* dst = (x3 * 0x0101010101010101ULL) >> (element bitsize - 8);  */
+  rtx mul4 = gen_reg_rtx (mode);
+  rtx ops4[] = {mul4, x3, gen_int_mode (m1, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (MULT, mode), riscv_vector::BINARY_OP,
+    ops4);
+
+  x4 = expand_binop (mode, lshr_optab, mul4,
+      GEN_INT (GET_MODE_BITSIZE (imode) - 8), NULL, true,
+      OPTAB_DIRECT);
+
+  emit_move_insn (dst, x4);
+}
+
} // namespace riscv_vector
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
new file mode 100644
index 00000000000..3169ebbff71
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv_zvfh -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-vect-details" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noipa))
+popcount_32 (uint32_t *restrict dst, uint32_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcount (src[i]);
+}
+
+void __attribute__ ((noipa))
+popcount_64 (uint64_t *restrict dst, uint64_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcountll (src[i]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 2 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
new file mode 100644
index 00000000000..38f1633da99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
@@ -0,0 +1,49 @@
+/* { dg-do run { target { riscv_v } } } */
+
+#include "popcount-1.c"
+
+extern void abort (void) __attribute__ ((noreturn));
+
+unsigned int data[] = {
+  0x11111100, 6,
+  0xe0e0f0f0, 14,
+  0x9900aab3, 13,
+  0x00040003, 3,
+  0x000e000c, 5,
+  0x22227777, 16,
+  0x12341234, 10,
+  0x0, 0
+};
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  unsigned int count = sizeof (data) / sizeof (data[0]) / 2;
+
+  uint32_t in32[count];
+  uint32_t out32[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in32[i] = data[i * 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_32 (out32, in32, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out32[i] != data[i * 2 + 1])
+      abort ();
+
+  count /= 2;
+  uint64_t in64[count];
+  uint64_t out64[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in64[i] = ((uint64_t) data[i * 4] << 32) | data[i * 4 + 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_64 (out64, in64, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out64[i] != data[i * 4 + 1] + data[i * 4 + 3])
+      abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
new file mode 100644
index 00000000000..585a522aa81
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
@@ -0,0 +1,1464 @@
+/* { dg-do run { target { riscv_v } } } */
+/* { dg-additional-options { -O2 -fdump-tree-vect-details -fno-vect-cost-model } }  */
+
+#include "stdint-gcc.h"
+#include <assert.h>
+
+#define DEF64(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+                                 int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcountll (src[i]);                                  \
+  }
+
+#define DEF32(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+                                 int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcount (src[i]);                                    \
+  }
+
+#define DEFCTZ64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+                            int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctzll (src[i]);                                       \
+  }
+
+#define DEFCTZ32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+                            int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctz (src[i]);                                         \
+  }
+
+#define DEFFFS64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+                            int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffsll (src[i]);                                       \
+  }
+
+#define DEFFFS32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+                            int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffs (src[i]);                                         \
+  }
+
+#define DEF_ALL()                                                              \
+  DEF64 (uint64_t, uint64_t)                                                   \
+  DEF64 (uint64_t, uint32_t)                                                   \
+  DEF64 (uint64_t, uint16_t)                                                   \
+  DEF64 (uint64_t, uint8_t)                                                    \
+  DEF64 (uint64_t, int64_t)                                                    \
+  DEF64 (uint64_t, int32_t)                                                    \
+  DEF64 (uint64_t, int16_t)                                                    \
+  DEF64 (uint64_t, int8_t)                                                     \
+  DEF64 (int64_t, uint64_t)                                                    \
+  DEF64 (int64_t, uint32_t)                                                    \
+  DEF64 (int64_t, uint16_t)                                                    \
+  DEF64 (int64_t, uint8_t)                                                     \
+  DEF64 (int64_t, int64_t)                                                     \
+  DEF64 (int64_t, int32_t)                                                     \
+  DEF64 (int64_t, int16_t)                                                     \
+  DEF64 (int64_t, int8_t)                                                      \
+  DEF64 (uint32_t, uint64_t)                                                   \
+  DEF64 (uint32_t, uint32_t)                                                   \
+  DEF64 (uint32_t, uint16_t)                                                   \
+  DEF64 (uint32_t, uint8_t)                                                    \
+  DEF64 (uint32_t, int64_t)                                                    \
+  DEF64 (uint32_t, int32_t)                                                    \
+  DEF64 (uint32_t, int16_t)                                                    \
+  DEF64 (uint32_t, int8_t)                                                     \
+  DEF64 (int32_t, uint64_t)                                                    \
+  DEF64 (int32_t, uint32_t)                                                    \
+  DEF64 (int32_t, uint16_t)                                                    \
+  DEF64 (int32_t, uint8_t)                                                     \
+  DEF64 (int32_t, int64_t)                                                     \
+  DEF64 (int32_t, int32_t)                                                     \
+  DEF64 (int32_t, int16_t)                                                     \
+  DEF64 (int32_t, int8_t)                                                      \
+  DEF64 (uint16_t, uint64_t)                                                   \
+  DEF64 (uint16_t, uint32_t)                                                   \
+  DEF64 (uint16_t, uint16_t)                                                   \
+  DEF64 (uint16_t, uint8_t)                                                    \
+  DEF64 (uint16_t, int64_t)                                                    \
+  DEF64 (uint16_t, int32_t)                                                    \
+  DEF64 (uint16_t, int16_t)                                                    \
+  DEF64 (uint16_t, int8_t)                                                     \
+  DEF64 (int16_t, uint64_t)                                                    \
+  DEF64 (int16_t, uint32_t)                                                    \
+  DEF64 (int16_t, uint16_t)                                                    \
+  DEF64 (int16_t, uint8_t)                                                     \
+  DEF64 (int16_t, int64_t)                                                     \
+  DEF64 (int16_t, int32_t)                                                     \
+  DEF64 (int16_t, int16_t)                                                     \
+  DEF64 (int16_t, int8_t)                                                      \
+  DEF64 (uint8_t, uint64_t)                                                    \
+  DEF64 (uint8_t, uint32_t)                                                    \
+  DEF64 (uint8_t, uint16_t)                                                    \
+  DEF64 (uint8_t, uint8_t)                                                     \
+  DEF64 (uint8_t, int64_t)                                                     \
+  DEF64 (uint8_t, int32_t)                                                     \
+  DEF64 (uint8_t, int16_t)                                                     \
+  DEF64 (uint8_t, int8_t)                                                      \
+  DEF64 (int8_t, uint64_t)                                                     \
+  DEF64 (int8_t, uint32_t)                                                     \
+  DEF64 (int8_t, uint16_t)                                                     \
+  DEF64 (int8_t, uint8_t)                                                      \
+  DEF64 (int8_t, int64_t)                                                      \
+  DEF64 (int8_t, int32_t)                                                      \
+  DEF64 (int8_t, int16_t)                                                      \
+  DEF64 (int8_t, int8_t)                                                       \
+  DEF32 (uint64_t, uint64_t)                                                   \
+  DEF32 (uint64_t, uint32_t)                                                   \
+  DEF32 (uint64_t, uint16_t)                                                   \
+  DEF32 (uint64_t, uint8_t)                                                    \
+  DEF32 (uint64_t, int64_t)                                                    \
+  DEF32 (uint64_t, int32_t)                                                    \
+  DEF32 (uint64_t, int16_t)                                                    \
+  DEF32 (uint64_t, int8_t)                                                     \
+  DEF32 (int64_t, uint64_t)                                                    \
+  DEF32 (int64_t, uint32_t)                                                    \
+  DEF32 (int64_t, uint16_t)                                                    \
+  DEF32 (int64_t, uint8_t)                                                     \
+  DEF32 (int64_t, int64_t)                                                     \
+  DEF32 (int64_t, int32_t)                                                     \
+  DEF32 (int64_t, int16_t)                                                     \
+  DEF32 (int64_t, int8_t)                                                      \
+  DEF32 (uint32_t, uint64_t)                                                   \
+  DEF32 (uint32_t, uint32_t)                                                   \
+  DEF32 (uint32_t, uint16_t)                                                   \
+  DEF32 (uint32_t, uint8_t)                                                    \
+  DEF32 (uint32_t, int64_t)                                                    \
+  DEF32 (uint32_t, int32_t)                                                    \
+  DEF32 (uint32_t, int16_t)                                                    \
+  DEF32 (uint32_t, int8_t)                                                     \
+  DEF32 (int32_t, uint64_t)                                                    \
+  DEF32 (int32_t, uint32_t)                                                    \
+  DEF32 (int32_t, uint16_t)                                                    \
+  DEF32 (int32_t, uint8_t)                                                     \
+  DEF32 (int32_t, int64_t)                                                     \
+  DEF32 (int32_t, int32_t)                                                     \
+  DEF32 (int32_t, int16_t)                                                     \
+  DEF32 (int32_t, int8_t)                                                      \
+  DEF32 (uint16_t, uint64_t)                                                   \
+  DEF32 (uint16_t, uint32_t)                                                   \
+  DEF32 (uint16_t, uint16_t)                                                   \
+  DEF32 (uint16_t, uint8_t)                                                    \
+  DEF32 (uint16_t, int64_t)                                                    \
+  DEF32 (uint16_t, int32_t)                                                    \
+  DEF32 (uint16_t, int16_t)                                                    \
+  DEF32 (uint16_t, int8_t)                                                     \
+  DEF32 (int16_t, uint64_t)                                                    \
+  DEF32 (int16_t, uint32_t)                                                    \
+  DEF32 (int16_t, uint16_t)                                                    \
+  DEF32 (int16_t, uint8_t)                                                     \
+  DEF32 (int16_t, int64_t)                                                     \
+  DEF32 (int16_t, int32_t)                                                     \
+  DEF32 (int16_t, int16_t)                                                     \
+  DEF32 (int16_t, int8_t)                                                      \
+  DEF32 (uint8_t, uint64_t)                                                    \
+  DEF32 (uint8_t, uint32_t)                                                    \
+  DEF32 (uint8_t, uint16_t)                                                    \
+  DEF32 (uint8_t, uint8_t)                                                     \
+  DEF32 (uint8_t, int64_t)                                                     \
+  DEF32 (uint8_t, int32_t)                                                     \
+  DEF32 (uint8_t, int16_t)                                                     \
+  DEF32 (uint8_t, int8_t)                                                      \
+  DEF32 (int8_t, uint64_t)                                                     \
+  DEF32 (int8_t, uint32_t)                                                     \
+  DEF32 (int8_t, uint16_t)                                                     \
+  DEF32 (int8_t, uint8_t)                                                      \
+  DEF32 (int8_t, int64_t)                                                      \
+  DEF32 (int8_t, int32_t)                                                      \
+  DEF32 (int8_t, int16_t)                                                      \
+  DEF32 (int8_t, int8_t)                                                       \
+  DEFCTZ64 (uint64_t, uint64_t)                                                \
+  DEFCTZ64 (uint64_t, uint32_t)                                                \
+  DEFCTZ64 (uint64_t, uint16_t)                                                \
+  DEFCTZ64 (uint64_t, uint8_t)                                                 \
+  DEFCTZ64 (uint64_t, int64_t)                                                 \
+  DEFCTZ64 (uint64_t, int32_t)                                                 \
+  DEFCTZ64 (uint64_t, int16_t)                                                 \
+  DEFCTZ64 (uint64_t, int8_t)                                                  \
+  DEFCTZ64 (int64_t, uint64_t)                                                 \
+  DEFCTZ64 (int64_t, uint32_t)                                                 \
+  DEFCTZ64 (int64_t, uint16_t)                                                 \
+  DEFCTZ64 (int64_t, uint8_t)                                                  \
+  DEFCTZ64 (int64_t, int64_t)                                                  \
+  DEFCTZ64 (int64_t, int32_t)                                                  \
+  DEFCTZ64 (int64_t, int16_t)                                                  \
+  DEFCTZ64 (int64_t, int8_t)                                                   \
+  DEFCTZ64 (uint32_t, uint64_t)                                                \
+  DEFCTZ64 (uint32_t, uint32_t)                                                \
+  DEFCTZ64 (uint32_t, uint16_t)                                                \
+  DEFCTZ64 (uint32_t, uint8_t)                                                 \
+  DEFCTZ64 (uint32_t, int64_t)                                                 \
+  DEFCTZ64 (uint32_t, int32_t)                                                 \
+  DEFCTZ64 (uint32_t, int16_t)                                                 \
+  DEFCTZ64 (uint32_t, int8_t)                                                  \
+  DEFCTZ64 (int32_t, uint64_t)                                                 \
+  DEFCTZ64 (int32_t, uint32_t)                                                 \
+  DEFCTZ64 (int32_t, uint16_t)                                                 \
+  DEFCTZ64 (int32_t, uint8_t)                                                  \
+  DEFCTZ64 (int32_t, int64_t)                                                  \
+  DEFCTZ64 (int32_t, int32_t)                                                  \
+  DEFCTZ64 (int32_t, int16_t)                                                  \
+  DEFCTZ64 (int32_t, int8_t)                                                   \
+  DEFCTZ64 (uint16_t, uint64_t)                                                \
+  DEFCTZ64 (uint16_t, uint32_t)                                                \
+  DEFCTZ64 (uint16_t, uint16_t)                                                \
+  DEFCTZ64 (uint16_t, uint8_t)                                                 \
+  DEFCTZ64 (uint16_t, int64_t)                                                 \
+  DEFCTZ64 (uint16_t, int32_t)                                                 \
+  DEFCTZ64 (uint16_t, int16_t)                                                 \
+  DEFCTZ64 (uint16_t, int8_t)                                                  \
+  DEFCTZ64 (int16_t, uint64_t)                                                 \
+  DEFCTZ64 (int16_t, uint32_t)                                                 \
+  DEFCTZ64 (int16_t, uint16_t)                                                 \
+  DEFCTZ64 (int16_t, uint8_t)                                                  \
+  DEFCTZ64 (int16_t, int64_t)                                                  \
+  DEFCTZ64 (int16_t, int32_t)                                                  \
+  DEFCTZ64 (int16_t, int16_t)                                                  \
+  DEFCTZ64 (int16_t, int8_t)                                                   \
+  DEFCTZ64 (uint8_t, uint64_t)                                                 \
+  DEFCTZ64 (uint8_t, uint32_t)                                                 \
+  DEFCTZ64 (uint8_t, uint16_t)                                                 \
+  DEFCTZ64 (uint8_t, uint8_t)                                                  \
+  DEFCTZ64 (uint8_t, int64_t)                                                  \
+  DEFCTZ64 (uint8_t, int32_t)                                                  \
+  DEFCTZ64 (uint8_t, int16_t)                                                  \
+  DEFCTZ64 (uint8_t, int8_t)                                                   \
+  DEFCTZ64 (int8_t, uint64_t)                                                  \
+  DEFCTZ64 (int8_t, uint32_t)                                                  \
+  DEFCTZ64 (int8_t, uint16_t)                                                  \
+  DEFCTZ64 (int8_t, uint8_t)                                                   \
+  DEFCTZ64 (int8_t, int64_t)                                                   \
+  DEFCTZ64 (int8_t, int32_t)                                                   \
+  DEFCTZ64 (int8_t, int16_t)                                                   \
+  DEFCTZ64 (int8_t, int8_t)                                                    \
+  DEFCTZ32 (uint64_t, uint64_t)                                                \
+  DEFCTZ32 (uint64_t, uint32_t)                                                \
+  DEFCTZ32 (uint64_t, uint16_t)                                                \
+  DEFCTZ32 (uint64_t, uint8_t)                                                 \
+  DEFCTZ32 (uint64_t, int64_t)                                                 \
+  DEFCTZ32 (uint64_t, int32_t)                                                 \
+  DEFCTZ32 (uint64_t, int16_t)                                                 \
+  DEFCTZ32 (uint64_t, int8_t)                                                  \
+  DEFCTZ32 (int64_t, uint64_t)                                                 \
+  DEFCTZ32 (int64_t, uint32_t)                                                 \
+  DEFCTZ32 (int64_t, uint16_t)                                                 \
+  DEFCTZ32 (int64_t, uint8_t)                                                  \
+  DEFCTZ32 (int64_t, int64_t)                                                  \
+  DEFCTZ32 (int64_t, int32_t)                                                  \
+  DEFCTZ32 (int64_t, int16_t)                                                  \
+  DEFCTZ32 (int64_t, int8_t)                                                   \
+  DEFCTZ32 (uint32_t, uint64_t)                                                \
+  DEFCTZ32 (uint32_t, uint32_t)                                                \
+  DEFCTZ32 (uint32_t, uint16_t)                                                \
+  DEFCTZ32 (uint32_t, uint8_t)                                                 \
+  DEFCTZ32 (uint32_t, int64_t)                                                 \
+  DEFCTZ32 (uint32_t, int32_t)                                                 \
+  DEFCTZ32 (uint32_t, int16_t)                                                 \
+  DEFCTZ32 (uint32_t, int8_t)                                                  \
+  DEFCTZ32 (int32_t, uint64_t)                                                 \
+  DEFCTZ32 (int32_t, uint32_t)                                                 \
+  DEFCTZ32 (int32_t, uint16_t)                                                 \
+  DEFCTZ32 (int32_t, uint8_t)                                                  \
+  DEFCTZ32 (int32_t, int64_t)                                                  \
+  DEFCTZ32 (int32_t, int32_t)                                                  \
+  DEFCTZ32 (int32_t, int16_t)                                                  \
+  DEFCTZ32 (int32_t, int8_t)                                                   \
+  DEFCTZ32 (uint16_t, uint64_t)                                                \
+  DEFCTZ32 (uint16_t, uint32_t)                                                \
+  DEFCTZ32 (uint16_t, uint16_t)                                                \
+  DEFCTZ32 (uint16_t, uint8_t)                                                 \
+  DEFCTZ32 (uint16_t, int64_t)                                                 \
+  DEFCTZ32 (uint16_t, int32_t)                                                 \
+  DEFCTZ32 (uint16_t, int16_t)                                                 \
+  DEFCTZ32 (uint16_t, int8_t)                                                  \
+  DEFCTZ32 (int16_t, uint64_t)                                                 \
+  DEFCTZ32 (int16_t, uint32_t)                                                 \
+  DEFCTZ32 (int16_t, uint16_t)                                                 \
+  DEFCTZ32 (int16_t, uint8_t)                                                  \
+  DEFCTZ32 (int16_t, int64_t)                                                  \
+  DEFCTZ32 (int16_t, int32_t)                                                  \
+  DEFCTZ32 (int16_t, int16_t)                                                  \
+  DEFCTZ32 (int16_t, int8_t)                                                   \
+  DEFCTZ32 (uint8_t, uint64_t)                                                 \
+  DEFCTZ32 (uint8_t, uint32_t)                                                 \
+  DEFCTZ32 (uint8_t, uint16_t)                                                 \
+  DEFCTZ32 (uint8_t, uint8_t)                                                  \
+  DEFCTZ32 (uint8_t, int64_t)                                                  \
+  DEFCTZ32 (uint8_t, int32_t)                                                  \
+  DEFCTZ32 (uint8_t, int16_t)                                                  \
+  DEFCTZ32 (uint8_t, int8_t)                                                   \
+  DEFCTZ32 (int8_t, uint64_t)                                                  \
+  DEFCTZ32 (int8_t, uint32_t)                                                  \
+  DEFCTZ32 (int8_t, uint16_t)                                                  \
+  DEFCTZ32 (int8_t, uint8_t)                                                   \
+  DEFCTZ32 (int8_t, int64_t)                                                   \
+  DEFCTZ32 (int8_t, int32_t)                                                   \
+  DEFCTZ32 (int8_t, int16_t)                                                   \
+  DEFCTZ32 (int8_t, int8_t)                                                    \
+  DEFFFS64 (uint64_t, uint64_t)                                                \
+  DEFFFS64 (uint64_t, uint32_t)                                                \
+  DEFFFS64 (uint64_t, uint16_t)                                                \
+  DEFFFS64 (uint64_t, uint8_t)                                                 \
+  DEFFFS64 (uint64_t, int64_t)                                                 \
+  DEFFFS64 (uint64_t, int32_t)                                                 \
+  DEFFFS64 (uint64_t, int16_t)                                                 \
+  DEFFFS64 (uint64_t, int8_t)                                                  \
+  DEFFFS64 (int64_t, uint64_t)                                                 \
+  DEFFFS64 (int64_t, uint32_t)                                                 \
+  DEFFFS64 (int64_t, uint16_t)                                                 \
+  DEFFFS64 (int64_t, uint8_t)                                                  \
+  DEFFFS64 (int64_t, int64_t)                                                  \
+  DEFFFS64 (int64_t, int32_t)                                                  \
+  DEFFFS64 (int64_t, int16_t)                                                  \
+  DEFFFS64 (int64_t, int8_t)                                                   \
+  DEFFFS64 (uint32_t, uint64_t)                                                \
+  DEFFFS64 (uint32_t, uint32_t)                                                \
+  DEFFFS64 (uint32_t, uint16_t)                                                \
+  DEFFFS64 (uint32_t, uint8_t)                                                 \
+  DEFFFS64 (uint32_t, int64_t)                                                 \
+  DEFFFS64 (uint32_t, int32_t)                                                 \
+  DEFFFS64 (uint32_t, int16_t)                                                 \
+  DEFFFS64 (uint32_t, int8_t)                                                  \
+  DEFFFS64 (int32_t, uint64_t)                                                 \
+  DEFFFS64 (int32_t, uint32_t)                                                 \
+  DEFFFS64 (int32_t, uint16_t)                                                 \
+  DEFFFS64 (int32_t, uint8_t)                                                  \
+  DEFFFS64 (int32_t, int64_t)                                                  \
+  DEFFFS64 (int32_t, int32_t)                                                  \
+  DEFFFS64 (int32_t, int16_t)                                                  \
+  DEFFFS64 (int32_t, int8_t)                                                   \
+  DEFFFS64 (uint16_t, uint64_t)                                                \
+  DEFFFS64 (uint16_t, uint32_t)                                                \
+  DEFFFS64 (uint16_t, uint16_t)                                                \
+  DEFFFS64 (uint16_t, uint8_t)                                                 \
+  DEFFFS64 (uint16_t, int64_t)                                                 \
+  DEFFFS64 (uint16_t, int32_t)                                                 \
+  DEFFFS64 (uint16_t, int16_t)                                                 \
+  DEFFFS64 (uint16_t, int8_t)                                                  \
+  DEFFFS64 (int16_t, uint64_t)                                                 \
+  DEFFFS64 (int16_t, uint32_t)                                                 \
+  DEFFFS64 (int16_t, uint16_t)                                                 \
+  DEFFFS64 (int16_t, uint8_t)                                                  \
+  DEFFFS64 (int16_t, int64_t)                                                  \
+  DEFFFS64 (int16_t, int32_t)                                                  \
+  DEFFFS64 (int16_t, int16_t)                                                  \
+  DEFFFS64 (int16_t, int8_t)                                                   \
+  DEFFFS64 (uint8_t, uint64_t)                                                 \
+  DEFFFS64 (uint8_t, uint32_t)                                                 \
+  DEFFFS64 (uint8_t, uint16_t)                                                 \
+  DEFFFS64 (uint8_t, uint8_t)                                                  \
+  DEFFFS64 (uint8_t, int64_t)                                                  \
+  DEFFFS64 (uint8_t, int32_t)                                                  \
+  DEFFFS64 (uint8_t, int16_t)                                                  \
+  DEFFFS64 (uint8_t, int8_t)                                                   \
+  DEFFFS64 (int8_t, uint64_t)                                                  \
+  DEFFFS64 (int8_t, uint32_t)                                                  \
+  DEFFFS64 (int8_t, uint16_t)                                                  \
+  DEFFFS64 (int8_t, uint8_t)                                                   \
+  DEFFFS64 (int8_t, int64_t)                                                   \
+  DEFFFS64 (int8_t, int32_t)                                                   \
+  DEFFFS64 (int8_t, int16_t)                                                   \
+  DEFFFS64 (int8_t, int8_t)                                                    \
+  DEFFFS32 (uint64_t, uint64_t)                                                \
+  DEFFFS32 (uint64_t, uint32_t)                                                \
+  DEFFFS32 (uint64_t, uint16_t)                                                \
+  DEFFFS32 (uint64_t, uint8_t)                                                 \
+  DEFFFS32 (uint64_t, int64_t)                                                 \
+  DEFFFS32 (uint64_t, int32_t)                                                 \
+  DEFFFS32 (uint64_t, int16_t)                                                 \
+  DEFFFS32 (uint64_t, int8_t)                                                  \
+  DEFFFS32 (int64_t, uint64_t)                                                 \
+  DEFFFS32 (int64_t, uint32_t)                                                 \
+  DEFFFS32 (int64_t, uint16_t)                                                 \
+  DEFFFS32 (int64_t, uint8_t)                                                  \
+  DEFFFS32 (int64_t, int64_t)                                                  \
+  DEFFFS32 (int64_t, int32_t)                                                  \
+  DEFFFS32 (int64_t, int16_t)                                                  \
+  DEFFFS32 (int64_t, int8_t)                                                   \
+  DEFFFS32 (uint32_t, uint64_t)                                                \
+  DEFFFS32 (uint32_t, uint32_t)                                                \
+  DEFFFS32 (uint32_t, uint16_t)                                                \
+  DEFFFS32 (uint32_t, uint8_t)                                                 \
+  DEFFFS32 (uint32_t, int64_t)                                                 \
+  DEFFFS32 (uint32_t, int32_t)                                                 \
+  DEFFFS32 (uint32_t, int16_t)                                                 \
+  DEFFFS32 (uint32_t, int8_t)                                                  \
+  DEFFFS32 (int32_t, uint64_t)                                                 \
+  DEFFFS32 (int32_t, uint32_t)                                                 \
+  DEFFFS32 (int32_t, uint16_t)                                                 \
+  DEFFFS32 (int32_t, uint8_t)                                                  \
+  DEFFFS32 (int32_t, int64_t)                                                  \
+  DEFFFS32 (int32_t, int32_t)                                                  \
+  DEFFFS32 (int32_t, int16_t)                                                  \
+  DEFFFS32 (int32_t, int8_t)                                                   \
+  DEFFFS32 (uint16_t, uint64_t)                                                \
+  DEFFFS32 (uint16_t, uint32_t)                                                \
+  DEFFFS32 (uint16_t, uint16_t)                                                \
+  DEFFFS32 (uint16_t, uint8_t)                                                 \
+  DEFFFS32 (uint16_t, int64_t)                                                 \
+  DEFFFS32 (uint16_t, int32_t)                                                 \
+  DEFFFS32 (uint16_t, int16_t)                                                 \
+  DEFFFS32 (uint16_t, int8_t)                                                  \
+  DEFFFS32 (int16_t, uint64_t)                                                 \
+  DEFFFS32 (int16_t, uint32_t)                                                 \
+  DEFFFS32 (int16_t, uint16_t)                                                 \
+  DEFFFS32 (int16_t, uint8_t)                                                  \
+  DEFFFS32 (int16_t, int64_t)                                                  \
+  DEFFFS32 (int16_t, int32_t)                                                  \
+  DEFFFS32 (int16_t, int16_t)                                                  \
+  DEFFFS32 (int16_t, int8_t)                                                   \
+  DEFFFS32 (uint8_t, uint64_t)                                                 \
+  DEFFFS32 (uint8_t, uint32_t)                                                 \
+  DEFFFS32 (uint8_t, uint16_t)                                                 \
+  DEFFFS32 (uint8_t, uint8_t)                                                  \
+  DEFFFS32 (uint8_t, int64_t)                                                  \
+  DEFFFS32 (uint8_t, int32_t)                                                  \
+  DEFFFS32 (uint8_t, int16_t)                                                  \
+  DEFFFS32 (uint8_t, int8_t)                                                   \
+  DEFFFS32 (int8_t, uint64_t)                                                  \
+  DEFFFS32 (int8_t, uint32_t)                                                  \
+  DEFFFS32 (int8_t, uint16_t)                                                  \
+  DEFFFS32 (int8_t, uint8_t)                                                   \
+  DEFFFS32 (int8_t, int64_t)                                                   \
+  DEFFFS32 (int8_t, int32_t)                                                   \
+  DEFFFS32 (int8_t, int16_t)                                                   \
+  DEFFFS32 (int8_t, int8_t)
+
+DEF_ALL ()
+
+#define SZ 512
+
+#define TEST64(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test64_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST64N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test64n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST32(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test32_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TEST32N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test32n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TESTCTZ64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTCTZ32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TESTFFS32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TEST_ALL()                                                             \
+  TEST64 (uint64_t, uint64_t)                                                  \
+  TEST64 (uint64_t, uint32_t)                                                  \
+  TEST64 (uint64_t, uint16_t)                                                  \
+  TEST64 (uint64_t, uint8_t)                                                   \
+  TEST64 (uint64_t, int64_t)                                                   \
+  TEST64 (uint64_t, int32_t)                                                   \
+  TEST64 (uint64_t, int16_t)                                                   \
+  TEST64 (uint64_t, int8_t)                                                    \
+  TEST64N (int64_t, uint64_t)                                                  \
+  TEST64N (int64_t, uint32_t)                                                  \
+  TEST64N (int64_t, uint16_t)                                                  \
+  TEST64N (int64_t, uint8_t)                                                   \
+  TEST64N (int64_t, int64_t)                                                   \
+  TEST64N (int64_t, int32_t)                                                   \
+  TEST64N (int64_t, int16_t)                                                   \
+  TEST64N (int64_t, int8_t)                                                    \
+  TEST64 (uint32_t, uint64_t)                                                  \
+  TEST64 (uint32_t, uint32_t)                                                  \
+  TEST64 (uint32_t, uint16_t)                                                  \
+  TEST64 (uint32_t, uint8_t)                                                   \
+  TEST64 (uint32_t, int64_t)                                                   \
+  TEST64 (uint32_t, int32_t)                                                   \
+  TEST64 (uint32_t, int16_t)                                                   \
+  TEST64 (uint32_t, int8_t)                                                    \
+  TEST64N (int32_t, uint64_t)                                                  \
+  TEST64N (int32_t, uint32_t)                                                  \
+  TEST64N (int32_t, uint16_t)                                                  \
+  TEST64N (int32_t, uint8_t)                                                   \
+  TEST64N (int32_t, int64_t)                                                   \
+  TEST64N (int32_t, int32_t)                                                   \
+  TEST64N (int32_t, int16_t)                                                   \
+  TEST64N (int32_t, int8_t)                                                    \
+  TEST64 (uint16_t, uint64_t)                                                  \
+  TEST64 (uint16_t, uint32_t)                                                  \
+  TEST64 (uint16_t, uint16_t)                                                  \
+  TEST64 (uint16_t, uint8_t)                                                   \
+  TEST64 (uint16_t, int64_t)                                                   \
+  TEST64 (uint16_t, int32_t)                                                   \
+  TEST64 (uint16_t, int16_t)                                                   \
+  TEST64 (uint16_t, int8_t)                                                    \
+  TEST64N (int16_t, uint64_t)                                                  \
+  TEST64N (int16_t, uint32_t)                                                  \
+  TEST64N (int16_t, uint16_t)                                                  \
+  TEST64N (int16_t, uint8_t)                                                   \
+  TEST64N (int16_t, int64_t)                                                   \
+  TEST64N (int16_t, int32_t)                                                   \
+  TEST64N (int16_t, int16_t)                                                   \
+  TEST64N (int16_t, int8_t)                                                    \
+  TEST64 (uint8_t, uint64_t)                                                   \
+  TEST64 (uint8_t, uint32_t)                                                   \
+  TEST64 (uint8_t, uint16_t)                                                   \
+  TEST64 (uint8_t, uint8_t)                                                    \
+  TEST64 (uint8_t, int64_t)                                                    \
+  TEST64 (uint8_t, int32_t)                                                    \
+  TEST64 (uint8_t, int16_t)                                                    \
+  TEST64 (uint8_t, int8_t)                                                     \
+  TEST64N (int8_t, uint64_t)                                                   \
+  TEST64N (int8_t, uint32_t)                                                   \
+  TEST64N (int8_t, uint16_t)                                                   \
+  TEST64N (int8_t, uint8_t)                                                    \
+  TEST64N (int8_t, int64_t)                                                    \
+  TEST64N (int8_t, int32_t)                                                    \
+  TEST64N (int8_t, int16_t)                                                    \
+  TEST64N (int8_t, int8_t)                                                     \
+  TEST32 (uint64_t, uint64_t)                                                  \
+  TEST32 (uint64_t, uint32_t)                                                  \
+  TEST32 (uint64_t, uint16_t)                                                  \
+  TEST32 (uint64_t, uint8_t)                                                   \
+  TEST32 (uint64_t, int64_t)                                                   \
+  TEST32 (uint64_t, int32_t)                                                   \
+  TEST32 (uint64_t, int16_t)                                                   \
+  TEST32 (uint64_t, int8_t)                                                    \
+  TEST32N (int64_t, uint64_t)                                                  \
+  TEST32N (int64_t, uint32_t)                                                  \
+  TEST32N (int64_t, uint16_t)                                                  \
+  TEST32N (int64_t, uint8_t)                                                   \
+  TEST32N (int64_t, int64_t)                                                   \
+  TEST32N (int64_t, int32_t)                                                   \
+  TEST32N (int64_t, int16_t)                                                   \
+  TEST32N (int64_t, int8_t)                                                    \
+  TEST32 (uint32_t, uint64_t)                                                  \
+  TEST32 (uint32_t, uint32_t)                                                  \
+  TEST32 (uint32_t, uint16_t)                                                  \
+  TEST32 (uint32_t, uint8_t)                                                   \
+  TEST32 (uint32_t, int64_t)                                                   \
+  TEST32 (uint32_t, int32_t)                                                   \
+  TEST32 (uint32_t, int16_t)                                                   \
+  TEST32 (uint32_t, int8_t)                                                    \
+  TEST32N (int32_t, uint64_t)                                                  \
+  TEST32N (int32_t, uint32_t)                                                  \
+  TEST32N (int32_t, uint16_t)                                                  \
+  TEST32N (int32_t, uint8_t)                                                   \
+  TEST32N (int32_t, int64_t)                                                   \
+  TEST32N (int32_t, int32_t)                                                   \
+  TEST32N (int32_t, int16_t)                                                   \
+  TEST32N (int32_t, int8_t)                                                    \
+  TEST32 (uint16_t, uint64_t)                                                  \
+  TEST32 (uint16_t, uint32_t)                                                  \
+  TEST32 (uint16_t, uint16_t)                                                  \
+  TEST32 (uint16_t, uint8_t)                                                   \
+  TEST32 (uint16_t, int64_t)                                                   \
+  TEST32 (uint16_t, int32_t)                                                   \
+  TEST32 (uint16_t, int16_t)                                                   \
+  TEST32 (uint16_t, int8_t)                                                    \
+  TEST32N (int16_t, uint64_t)                                                  \
+  TEST32N (int16_t, uint32_t)                                                  \
+  TEST32N (int16_t, uint16_t)                                                  \
+  TEST32N (int16_t, uint8_t)                                                   \
+  TEST32N (int16_t, int64_t)                                                   \
+  TEST32N (int16_t, int32_t)                                                   \
+  TEST32N (int16_t, int16_t)                                                   \
+  TEST32N (int16_t, int8_t)                                                    \
+  TEST32 (uint8_t, uint64_t)                                                   \
+  TEST32 (uint8_t, uint32_t)                                                   \
+  TEST32 (uint8_t, uint16_t)                                                   \
+  TEST32 (uint8_t, uint8_t)                                                    \
+  TEST32 (uint8_t, int64_t)                                                    \
+  TEST32 (uint8_t, int32_t)                                                    \
+  TEST32 (uint8_t, int16_t)                                                    \
+  TEST32 (uint8_t, int8_t)                                                     \
+  TEST32N (int8_t, uint64_t)                                                   \
+  TEST32N (int8_t, uint32_t)                                                   \
+  TEST32N (int8_t, uint16_t)                                                   \
+  TEST32N (int8_t, uint8_t)                                                    \
+  TEST32N (int8_t, int64_t)                                                    \
+  TEST32N (int8_t, int32_t)                                                    \
+  TEST32N (int8_t, int16_t)                                                    \
+  TEST32N (int8_t, int8_t)                                                     \
+  TESTCTZ64 (uint64_t, uint64_t)                                               \
+  TESTCTZ64 (uint64_t, uint32_t)                                               \
+  TESTCTZ64 (uint64_t, uint16_t)                                               \
+  TESTCTZ64 (uint64_t, uint8_t)                                                \
+  TESTCTZ64 (uint64_t, int64_t)                                                \
+  TESTCTZ64 (uint64_t, int32_t)                                                \
+  TESTCTZ64 (uint64_t, int16_t)                                                \
+  TESTCTZ64 (uint64_t, int8_t)                                                 \
+  TESTCTZ64N (int64_t, uint64_t)                                               \
+  TESTCTZ64N (int64_t, uint32_t)                                               \
+  TESTCTZ64N (int64_t, uint16_t)                                               \
+  TESTCTZ64N (int64_t, uint8_t)                                                \
+  TESTCTZ64N (int64_t, int64_t)                                                \
+  TESTCTZ64N (int64_t, int32_t)                                                \
+  TESTCTZ64N (int64_t, int16_t)                                                \
+  TESTCTZ64N (int64_t, int8_t)                                                 \
+  TESTCTZ64 (uint32_t, uint64_t)                                               \
+  TESTCTZ64 (uint32_t, uint32_t)                                               \
+  TESTCTZ64 (uint32_t, uint16_t)                                               \
+  TESTCTZ64 (uint32_t, uint8_t)                                                \
+  TESTCTZ64 (uint32_t, int64_t)                                                \
+  TESTCTZ64 (uint32_t, int32_t)                                                \
+  TESTCTZ64 (uint32_t, int16_t)                                                \
+  TESTCTZ64 (uint32_t, int8_t)                                                 \
+  TESTCTZ64N (int32_t, uint64_t)                                               \
+  TESTCTZ64N (int32_t, uint32_t)                                               \
+  TESTCTZ64N (int32_t, uint16_t)                                               \
+  TESTCTZ64N (int32_t, uint8_t)                                                \
+  TESTCTZ64N (int32_t, int64_t)                                                \
+  TESTCTZ64N (int32_t, int32_t)                                                \
+  TESTCTZ64N (int32_t, int16_t)                                                \
+  TESTCTZ64N (int32_t, int8_t)                                                 \
+  TESTCTZ64 (uint16_t, uint64_t)                                               \
+  TESTCTZ64 (uint16_t, uint32_t)                                               \
+  TESTCTZ64 (uint16_t, uint16_t)                                               \
+  TESTCTZ64 (uint16_t, uint8_t)                                                \
+  TESTCTZ64 (uint16_t, int64_t)                                                \
+  TESTCTZ64 (uint16_t, int32_t)                                                \
+  TESTCTZ64 (uint16_t, int16_t)                                                \
+  TESTCTZ64 (uint16_t, int8_t)                                                 \
+  TESTCTZ64N (int16_t, uint64_t)                                               \
+  TESTCTZ64N (int16_t, uint32_t)                                               \
+  TESTCTZ64N (int16_t, uint16_t)                                               \
+  TESTCTZ64N (int16_t, uint8_t)                                                \
+  TESTCTZ64N (int16_t, int64_t)                                                \
+  TESTCTZ64N (int16_t, int32_t)                                                \
+  TESTCTZ64N (int16_t, int16_t)                                                \
+  TESTCTZ64N (int16_t, int8_t)                                                 \
+  TESTCTZ64 (uint8_t, uint64_t)                                                \
+  TESTCTZ64 (uint8_t, uint32_t)                                                \
+  TESTCTZ64 (uint8_t, uint16_t)                                                \
+  TESTCTZ64 (uint8_t, uint8_t)                                                 \
+  TESTCTZ64 (uint8_t, int64_t)                                                 \
+  TESTCTZ64 (uint8_t, int32_t)                                                 \
+  TESTCTZ64 (uint8_t, int16_t)                                                 \
+  TESTCTZ64 (uint8_t, int8_t)                                                  \
+  TESTCTZ64N (int8_t, uint64_t)                                                \
+  TESTCTZ64N (int8_t, uint32_t)                                                \
+  TESTCTZ64N (int8_t, uint16_t)                                                \
+  TESTCTZ64N (int8_t, uint8_t)                                                 \
+  TESTCTZ64N (int8_t, int64_t)                                                 \
+  TESTCTZ64N (int8_t, int32_t)                                                 \
+  TESTCTZ64N (int8_t, int16_t)                                                 \
+  TESTCTZ64N (int8_t, int8_t)                                                  \
+  TESTCTZ32 (uint64_t, uint64_t)                                               \
+  TESTCTZ32 (uint64_t, uint32_t)                                               \
+  TESTCTZ32 (uint64_t, uint16_t)                                               \
+  TESTCTZ32 (uint64_t, uint8_t)                                                \
+  TESTCTZ32 (uint64_t, int64_t)                                                \
+  TESTCTZ32 (uint64_t, int32_t)                                                \
+  TESTCTZ32 (uint64_t, int16_t)                                                \
+  TESTCTZ32 (uint64_t, int8_t)                                                 \
+  TESTCTZ32N (int64_t, uint64_t)                                               \
+  TESTCTZ32N (int64_t, uint32_t)                                               \
+  TESTCTZ32N (int64_t, uint16_t)                                               \
+  TESTCTZ32N (int64_t, uint8_t)                                                \
+  TESTCTZ32N (int64_t, int64_t)                                                \
+  TESTCTZ32N (int64_t, int32_t)                                                \
+  TESTCTZ32N (int64_t, int16_t)                                                \
+  TESTCTZ32N (int64_t, int8_t)                                                 \
+  TESTCTZ32 (uint32_t, uint64_t)                                               \
+  TESTCTZ32 (uint32_t, uint32_t)                                               \
+  TESTCTZ32 (uint32_t, uint16_t)                                               \
+  TESTCTZ32 (uint32_t, uint8_t)                                                \
+  TESTCTZ32 (uint32_t, int64_t)                                                \
+  TESTCTZ32 (uint32_t, int32_t)                                                \
+  TESTCTZ32 (uint32_t, int16_t)                                                \
+  TESTCTZ32 (uint32_t, int8_t)                                                 \
+  TESTCTZ32N (int32_t, uint64_t)                                               \
+  TESTCTZ32N (int32_t, uint32_t)                                               \
+  TESTCTZ32N (int32_t, uint16_t)                                               \
+  TESTCTZ32N (int32_t, uint8_t)                                                \
+  TESTCTZ32N (int32_t, int64_t)                                                \
+  TESTCTZ32N (int32_t, int32_t)                                                \
+  TESTCTZ32N (int32_t, int16_t)                                                \
+  TESTCTZ32N (int32_t, int8_t)                                                 \
+  TESTCTZ32 (uint16_t, uint64_t)                                               \
+  TESTCTZ32 (uint16_t, uint32_t)                                               \
+  TESTCTZ32 (uint16_t, uint16_t)                                               \
+  TESTCTZ32 (uint16_t, uint8_t)                                                \
+  TESTCTZ32 (uint16_t, int64_t)                                                \
+  TESTCTZ32 (uint16_t, int32_t)                                                \
+  TESTCTZ32 (uint16_t, int16_t)                                                \
+  TESTCTZ32 (uint16_t, int8_t)                                                 \
+  TESTCTZ32N (int16_t, uint64_t)                                               \
+  TESTCTZ32N (int16_t, uint32_t)                                               \
+  TESTCTZ32N (int16_t, uint16_t)                                               \
+  TESTCTZ32N (int16_t, uint8_t)                                                \
+  TESTCTZ32N (int16_t, int64_t)                                                \
+  TESTCTZ32N (int16_t, int32_t)                                                \
+  TESTCTZ32N (int16_t, int16_t)                                                \
+  TESTCTZ32N (int16_t, int8_t)                                                 \
+  TESTCTZ32 (uint8_t, uint64_t)                                                \
+  TESTCTZ32 (uint8_t, uint32_t)                                                \
+  TESTCTZ32 (uint8_t, uint16_t)                                                \
+  TESTCTZ32 (uint8_t, uint8_t)                                                 \
+  TESTCTZ32 (uint8_t, int64_t)                                                 \
+  TESTCTZ32 (uint8_t, int32_t)                                                 \
+  TESTCTZ32 (uint8_t, int16_t)                                                 \
+  TESTCTZ32 (uint8_t, int8_t)                                                  \
+  TESTCTZ32N (int8_t, uint64_t)                                                \
+  TESTCTZ32N (int8_t, uint32_t)                                                \
+  TESTCTZ32N (int8_t, uint16_t)                                                \
+  TESTCTZ32N (int8_t, uint8_t)                                                 \
+  TESTCTZ32N (int8_t, int64_t)                                                 \
+  TESTCTZ32N (int8_t, int32_t)                                                 \
+  TESTCTZ32N (int8_t, int16_t)                                                 \
+  TESTCTZ32N (int8_t, int8_t)                                                  \
+  TESTFFS64 (uint64_t, uint64_t)                                               \
+  TESTFFS64 (uint64_t, uint32_t)                                               \
+  TESTFFS64 (uint64_t, uint16_t)                                               \
+  TESTFFS64 (uint64_t, uint8_t)                                                \
+  TESTFFS64 (uint64_t, int64_t)                                                \
+  TESTFFS64 (uint64_t, int32_t)                                                \
+  TESTFFS64 (uint64_t, int16_t)                                                \
+  TESTFFS64 (uint64_t, int8_t)                                                 \
+  TESTFFS64N (int64_t, uint64_t)                                               \
+  TESTFFS64N (int64_t, uint32_t)                                               \
+  TESTFFS64N (int64_t, uint16_t)                                               \
+  TESTFFS64N (int64_t, uint8_t)                                                \
+  TESTFFS64N (int64_t, int64_t)                                                \
+  TESTFFS64N (int64_t, int32_t)                                                \
+  TESTFFS64N (int64_t, int16_t)                                                \
+  TESTFFS64N (int64_t, int8_t)                                                 \
+  TESTFFS64 (uint32_t, uint64_t)                                               \
+  TESTFFS64 (uint32_t, uint32_t)                                               \
+  TESTFFS64 (uint32_t, uint16_t)                                               \
+  TESTFFS64 (uint32_t, uint8_t)                                                \
+  TESTFFS64 (uint32_t, int64_t)                                                \
+  TESTFFS64 (uint32_t, int32_t)                                                \
+  TESTFFS64 (uint32_t, int16_t)                                                \
+  TESTFFS64 (uint32_t, int8_t)                                                 \
+  TESTFFS64N (int32_t, uint64_t)                                               \
+  TESTFFS64N (int32_t, uint32_t)                                               \
+  TESTFFS64N (int32_t, uint16_t)                                               \
+  TESTFFS64N (int32_t, uint8_t)                                                \
+  TESTFFS64N (int32_t, int64_t)                                                \
+  TESTFFS64N (int32_t, int32_t)                                                \
+  TESTFFS64N (int32_t, int16_t)                                                \
+  TESTFFS64N (int32_t, int8_t)                                                 \
+  TESTFFS64 (uint16_t, uint64_t)                                               \
+  TESTFFS64 (uint16_t, uint32_t)                                               \
+  TESTFFS64 (uint16_t, uint16_t)                                               \
+  TESTFFS64 (uint16_t, uint8_t)                                                \
+  TESTFFS64 (uint16_t, int64_t)                                                \
+  TESTFFS64 (uint16_t, int32_t)                                                \
+  TESTFFS64 (uint16_t, int16_t)                                                \
+  TESTFFS64 (uint16_t, int8_t)                                                 \
+  TESTFFS64N (int16_t, uint64_t)                                               \
+  TESTFFS64N (int16_t, uint32_t)                                               \
+  TESTFFS64N (int16_t, uint16_t)                                               \
+  TESTFFS64N (int16_t, uint8_t)                                                \
+  TESTFFS64N (int16_t, int64_t)                                                \
+  TESTFFS64N (int16_t, int32_t)                                                \
+  TESTFFS64N (int16_t, int16_t)                                                \
+  TESTFFS64N (int16_t, int8_t)                                                 \
+  TESTFFS64 (uint8_t, uint64_t)                                                \
+  TESTFFS64 (uint8_t, uint32_t)                                                \
+  TESTFFS64 (uint8_t, uint16_t)                                                \
+  TESTFFS64 (uint8_t, uint8_t)                                                 \
+  TESTFFS64 (uint8_t, int64_t)                                                 \
+  TESTFFS64 (uint8_t, int32_t)                                                 \
+  TESTFFS64 (uint8_t, int16_t)                                                 \
+  TESTFFS64 (uint8_t, int8_t)                                                  \
+  TESTFFS64N (int8_t, uint64_t)                                                \
+  TESTFFS64N (int8_t, uint32_t)                                                \
+  TESTFFS64N (int8_t, uint16_t)                                                \
+  TESTFFS64N (int8_t, uint8_t)                                                 \
+  TESTFFS64N (int8_t, int64_t)                                                 \
+  TESTFFS64N (int8_t, int32_t)                                                 \
+  TESTFFS64N (int8_t, int16_t)                                                 \
+  TESTFFS64N (int8_t, int8_t)                                                  \
+  TESTFFS32 (uint64_t, uint64_t)                                               \
+  TESTFFS32 (uint64_t, uint32_t)                                               \
+  TESTFFS32 (uint64_t, uint16_t)                                               \
+  TESTFFS32 (uint64_t, uint8_t)                                                \
+  TESTFFS32 (uint64_t, int64_t)                                                \
+  TESTFFS32 (uint64_t, int32_t)                                                \
+  TESTFFS32 (uint64_t, int16_t)                                                \
+  TESTFFS32 (uint64_t, int8_t)                                                 \
+  TESTFFS32N (int64_t, uint64_t)                                               \
+  TESTFFS32N (int64_t, uint32_t)                                               \
+  TESTFFS32N (int64_t, uint16_t)                                               \
+  TESTFFS32N (int64_t, uint8_t)                                                \
+  TESTFFS32N (int64_t, int64_t)                                                \
+  TESTFFS32N (int64_t, int32_t)                                                \
+  TESTFFS32N (int64_t, int16_t)                                                \
+  TESTFFS32N (int64_t, int8_t)                                                 \
+  TESTFFS32 (uint32_t, uint64_t)                                               \
+  TESTFFS32 (uint32_t, uint32_t)                                               \
+  TESTFFS32 (uint32_t, uint16_t)                                               \
+  TESTFFS32 (uint32_t, uint8_t)                                                \
+  TESTFFS32 (uint32_t, int64_t)                                                \
+  TESTFFS32 (uint32_t, int32_t)                                                \
+  TESTFFS32 (uint32_t, int16_t)                                                \
+  TESTFFS32 (uint32_t, int8_t)                                                 \
+  TESTFFS32N (int32_t, uint64_t)                                               \
+  TESTFFS32N (int32_t, uint32_t)                                               \
+  TESTFFS32N (int32_t, uint16_t)                                               \
+  TESTFFS32N (int32_t, uint8_t)                                                \
+  TESTFFS32N (int32_t, int64_t)                                                \
+  TESTFFS32N (int32_t, int32_t)                                                \
+  TESTFFS32N (int32_t, int16_t)                                                \
+  TESTFFS32N (int32_t, int8_t)                                                 \
+  TESTFFS32 (uint16_t, uint64_t)                                               \
+  TESTFFS32 (uint16_t, uint32_t)                                               \
+  TESTFFS32 (uint16_t, uint16_t)                                               \
+  TESTFFS32 (uint16_t, uint8_t)                                                \
+  TESTFFS32 (uint16_t, int64_t)                                                \
+  TESTFFS32 (uint16_t, int32_t)                                                \
+  TESTFFS32 (uint16_t, int16_t)                                                \
+  TESTFFS32 (uint16_t, int8_t)                                                 \
+  TESTFFS32N (int16_t, uint64_t)                                               \
+  TESTFFS32N (int16_t, uint32_t)                                               \
+  TESTFFS32N (int16_t, uint16_t)                                               \
+  TESTFFS32N (int16_t, uint8_t)                                                \
+  TESTFFS32N (int16_t, int64_t)                                                \
+  TESTFFS32N (int16_t, int32_t)                                                \
+  TESTFFS32N (int16_t, int16_t)                                                \
+  TESTFFS32N (int16_t, int8_t)                                                 \
+  TESTFFS32 (uint8_t, uint64_t)                                                \
+  TESTFFS32 (uint8_t, uint32_t)                                                \
+  TESTFFS32 (uint8_t, uint16_t)                                                \
+  TESTFFS32 (uint8_t, uint8_t)                                                 \
+  TESTFFS32 (uint8_t, int64_t)                                                 \
+  TESTFFS32 (uint8_t, int32_t)                                                 \
+  TESTFFS32 (uint8_t, int16_t)                                                 \
+  TESTFFS32 (uint8_t, int8_t)                                                  \
+  TESTFFS32N (int8_t, uint64_t)                                                \
+  TESTFFS32N (int8_t, uint32_t)                                                \
+  TESTFFS32N (int8_t, uint16_t)                                                \
+  TESTFFS32N (int8_t, uint8_t)                                                 \
+  TESTFFS32N (int8_t, int64_t)                                                 \
+  TESTFFS32N (int8_t, int32_t)                                                 \
+  TESTFFS32N (int8_t, int16_t)                                                 \
+  TESTFFS32N (int8_t, int8_t)
+
+TEST_ALL ()
+
+#define RUN64(TYPEDST, TYPESRC) test64_##TYPEDST##TYPESRC ();
+#define RUN64N(TYPEDST, TYPESRC) test64n_##TYPEDST##TYPESRC ();
+#define RUN32(TYPEDST, TYPESRC) test32_##TYPEDST##TYPESRC ();
+#define RUN32N(TYPEDST, TYPESRC) test32n_##TYPEDST##TYPESRC ();
+#define RUNCTZ64(TYPEDST, TYPESRC) testctz64_##TYPEDST##TYPESRC ();
+#define RUNCTZ64N(TYPEDST, TYPESRC) testctz64n_##TYPEDST##TYPESRC ();
+#define RUNCTZ32(TYPEDST, TYPESRC) testctz32_##TYPEDST##TYPESRC ();
+#define RUNCTZ32N(TYPEDST, TYPESRC) testctz32n_##TYPEDST##TYPESRC ();
+#define RUNFFS64(TYPEDST, TYPESRC) testffs64_##TYPEDST##TYPESRC ();
+#define RUNFFS64N(TYPEDST, TYPESRC) testffs64n_##TYPEDST##TYPESRC ();
+#define RUNFFS32(TYPEDST, TYPESRC) testffs32_##TYPEDST##TYPESRC ();
+#define RUNFFS32N(TYPEDST, TYPESRC) testffs32n_##TYPEDST##TYPESRC ();
+
+#define RUN_ALL()                                                              \
+  RUN64 (uint64_t, uint64_t)                                                   \
+  RUN64 (uint64_t, uint32_t)                                                   \
+  RUN64 (uint64_t, uint16_t)                                                   \
+  RUN64 (uint64_t, uint8_t)                                                    \
+  RUN64 (uint64_t, int64_t)                                                    \
+  RUN64 (uint64_t, int32_t)                                                    \
+  RUN64 (uint64_t, int16_t)                                                    \
+  RUN64 (uint64_t, int8_t)                                                     \
+  RUN64N (int64_t, uint64_t)                                                    \
+  RUN64N (int64_t, uint32_t)                                                    \
+  RUN64N (int64_t, uint16_t)                                                    \
+  RUN64N (int64_t, uint8_t)                                                     \
+  RUN64N (int64_t, int64_t)                                                     \
+  RUN64N (int64_t, int32_t)                                                     \
+  RUN64N (int64_t, int16_t)                                                     \
+  RUN64N (int64_t, int8_t)                                                      \
+  RUN64 (uint32_t, uint64_t)                                                   \
+  RUN64 (uint32_t, uint32_t)                                                   \
+  RUN64 (uint32_t, uint16_t)                                                   \
+  RUN64 (uint32_t, uint8_t)                                                    \
+  RUN64 (uint32_t, int64_t)                                                    \
+  RUN64 (uint32_t, int32_t)                                                    \
+  RUN64 (uint32_t, int16_t)                                                    \
+  RUN64 (uint32_t, int8_t)                                                     \
+  RUN64N (int32_t, uint64_t)                                                    \
+  RUN64N (int32_t, uint32_t)                                                    \
+  RUN64N (int32_t, uint16_t)                                                    \
+  RUN64N (int32_t, uint8_t)                                                     \
+  RUN64N (int32_t, int64_t)                                                     \
+  RUN64N (int32_t, int32_t)                                                     \
+  RUN64N (int32_t, int16_t)                                                     \
+  RUN64N (int32_t, int8_t)                                                      \
+  RUN64 (uint16_t, uint64_t)                                                   \
+  RUN64 (uint16_t, uint32_t)                                                   \
+  RUN64 (uint16_t, uint16_t)                                                   \
+  RUN64 (uint16_t, uint8_t)                                                    \
+  RUN64 (uint16_t, int64_t)                                                    \
+  RUN64 (uint16_t, int32_t)                                                    \
+  RUN64 (uint16_t, int16_t)                                                    \
+  RUN64 (uint16_t, int8_t)                                                     \
+  RUN64N (int16_t, uint64_t)                                                    \
+  RUN64N (int16_t, uint32_t)                                                    \
+  RUN64N (int16_t, uint16_t)                                                    \
+  RUN64N (int16_t, uint8_t)                                                     \
+  RUN64N (int16_t, int64_t)                                                     \
+  RUN64N (int16_t, int32_t)                                                     \
+  RUN64N (int16_t, int16_t)                                                     \
+  RUN64N (int16_t, int8_t)                                                      \
+  RUN64 (uint8_t, uint64_t)                                                    \
+  RUN64 (uint8_t, uint32_t)                                                    \
+  RUN64 (uint8_t, uint16_t)                                                    \
+  RUN64 (uint8_t, uint8_t)                                                     \
+  RUN64 (uint8_t, int64_t)                                                     \
+  RUN64 (uint8_t, int32_t)                                                     \
+  RUN64 (uint8_t, int16_t)                                                     \
+  RUN64 (uint8_t, int8_t)                                                      \
+  RUN64N (int8_t, uint64_t)                                                     \
+  RUN64N (int8_t, uint32_t)                                                     \
+  RUN64N (int8_t, uint16_t)                                                     \
+  RUN64N (int8_t, uint8_t)                                                      \
+  RUN64N (int8_t, int64_t)                                                      \
+  RUN64N (int8_t, int32_t)                                                      \
+  RUN64N (int8_t, int16_t)                                                      \
+  RUN64N (int8_t, int8_t)                                                       \
+  RUN32 (uint64_t, uint64_t)                                                   \
+  RUN32 (uint64_t, uint32_t)                                                   \
+  RUN32 (uint64_t, uint16_t)                                                   \
+  RUN32 (uint64_t, uint8_t)                                                    \
+  RUN32 (uint64_t, int64_t)                                                    \
+  RUN32 (uint64_t, int32_t)                                                    \
+  RUN32 (uint64_t, int16_t)                                                    \
+  RUN32 (uint64_t, int8_t)                                                     \
+  RUN32N (int64_t, uint64_t)                                                    \
+  RUN32N (int64_t, uint32_t)                                                    \
+  RUN32N (int64_t, uint16_t)                                                    \
+  RUN32N (int64_t, uint8_t)                                                     \
+  RUN32N (int64_t, int64_t)                                                     \
+  RUN32N (int64_t, int32_t)                                                     \
+  RUN32N (int64_t, int16_t)                                                     \
+  RUN32N (int64_t, int8_t)                                                      \
+  RUN32 (uint32_t, uint64_t)                                                   \
+  RUN32 (uint32_t, uint32_t)                                                   \
+  RUN32 (uint32_t, uint16_t)                                                   \
+  RUN32 (uint32_t, uint8_t)                                                    \
+  RUN32 (uint32_t, int64_t)                                                    \
+  RUN32 (uint32_t, int32_t)                                                    \
+  RUN32 (uint32_t, int16_t)                                                    \
+  RUN32 (uint32_t, int8_t)                                                     \
+  RUN32N (int32_t, uint64_t)                                                    \
+  RUN32N (int32_t, uint32_t)                                                    \
+  RUN32N (int32_t, uint16_t)                                                    \
+  RUN32N (int32_t, uint8_t)                                                     \
+  RUN32N (int32_t, int64_t)                                                     \
+  RUN32N (int32_t, int32_t)                                                     \
+  RUN32N (int32_t, int16_t)                                                     \
+  RUN32N (int32_t, int8_t)                                                      \
+  RUN32 (uint16_t, uint64_t)                                                   \
+  RUN32 (uint16_t, uint32_t)                                                   \
+  RUN32 (uint16_t, uint16_t)                                                   \
+  RUN32 (uint16_t, uint8_t)                                                    \
+  RUN32 (uint16_t, int64_t)                                                    \
+  RUN32 (uint16_t, int32_t)                                                    \
+  RUN32 (uint16_t, int16_t)                                                    \
+  RUN32 (uint16_t, int8_t)                                                     \
+  RUN32N (int16_t, uint64_t)                                                    \
+  RUN32N (int16_t, uint32_t)                                                    \
+  RUN32N (int16_t, uint16_t)                                                    \
+  RUN32N (int16_t, uint8_t)                                                     \
+  RUN32N (int16_t, int64_t)                                                     \
+  RUN32N (int16_t, int32_t)                                                     \
+  RUN32N (int16_t, int16_t)                                                     \
+  RUN32N (int16_t, int8_t)                                                      \
+  RUN32 (uint8_t, uint64_t)                                                    \
+  RUN32 (uint8_t, uint32_t)                                                    \
+  RUN32 (uint8_t, uint16_t)                                                    \
+  RUN32 (uint8_t, uint8_t)                                                     \
+  RUN32 (uint8_t, int64_t)                                                     \
+  RUN32 (uint8_t, int32_t)                                                     \
+  RUN32 (uint8_t, int16_t)                                                     \
+  RUN32 (uint8_t, int8_t)                                                      \
+  RUN32N (int8_t, uint64_t)                                                     \
+  RUN32N (int8_t, uint32_t)                                                     \
+  RUN32N (int8_t, uint16_t)                                                     \
+  RUN32N (int8_t, uint8_t)                                                      \
+  RUN32N (int8_t, int64_t)                                                      \
+  RUN32N (int8_t, int32_t)                                                      \
+  RUN32N (int8_t, int16_t)                                                      \
+  RUN32N (int8_t, int8_t)                                                       \
+  RUNCTZ64 (uint64_t, uint64_t)                                                \
+  RUNCTZ64 (uint64_t, uint32_t)                                                \
+  RUNCTZ64 (uint64_t, uint16_t)                                                \
+  RUNCTZ64 (uint64_t, uint8_t)                                                 \
+  RUNCTZ64 (uint64_t, int64_t)                                                 \
+  RUNCTZ64 (uint64_t, int32_t)                                                 \
+  RUNCTZ64 (uint64_t, int16_t)                                                 \
+  RUNCTZ64 (uint64_t, int8_t)                                                  \
+  RUNCTZ64N (int64_t, uint64_t)                                                \
+  RUNCTZ64N (int64_t, uint32_t)                                                \
+  RUNCTZ64N (int64_t, uint16_t)                                                \
+  RUNCTZ64N (int64_t, uint8_t)                                                 \
+  RUNCTZ64N (int64_t, int64_t)                                                 \
+  RUNCTZ64N (int64_t, int32_t)                                                 \
+  RUNCTZ64N (int64_t, int16_t)                                                 \
+  RUNCTZ64N (int64_t, int8_t)                                                  \
+  RUNCTZ64 (uint32_t, uint64_t)                                                \
+  RUNCTZ64 (uint32_t, uint32_t)                                                \
+  RUNCTZ64 (uint32_t, uint16_t)                                                \
+  RUNCTZ64 (uint32_t, uint8_t)                                                 \
+  RUNCTZ64 (uint32_t, int64_t)                                                 \
+  RUNCTZ64 (uint32_t, int32_t)                                                 \
+  RUNCTZ64 (uint32_t, int16_t)                                                 \
+  RUNCTZ64 (uint32_t, int8_t)                                                  \
+  RUNCTZ64N (int32_t, uint64_t)                                                \
+  RUNCTZ64N (int32_t, uint32_t)                                                \
+  RUNCTZ64N (int32_t, uint16_t)                                                \
+  RUNCTZ64N (int32_t, uint8_t)                                                 \
+  RUNCTZ64N (int32_t, int64_t)                                                 \
+  RUNCTZ64N (int32_t, int32_t)                                                 \
+  RUNCTZ64N (int32_t, int16_t)                                                 \
+  RUNCTZ64N (int32_t, int8_t)                                                  \
+  RUNCTZ64 (uint16_t, uint64_t)                                                \
+  RUNCTZ64 (uint16_t, uint32_t)                                                \
+  RUNCTZ64 (uint16_t, uint16_t)                                                \
+  RUNCTZ64 (uint16_t, uint8_t)                                                 \
+  RUNCTZ64 (uint16_t, int64_t)                                                 \
+  RUNCTZ64 (uint16_t, int32_t)                                                 \
+  RUNCTZ64 (uint16_t, int16_t)                                                 \
+  RUNCTZ64 (uint16_t, int8_t)                                                  \
+  RUNCTZ64N (int16_t, uint64_t)                                                \
+  RUNCTZ64N (int16_t, uint32_t)                                                \
+  RUNCTZ64N (int16_t, uint16_t)                                                \
+  RUNCTZ64N (int16_t, uint8_t)                                                 \
+  RUNCTZ64N (int16_t, int64_t)                                                 \
+  RUNCTZ64N (int16_t, int32_t)                                                 \
+  RUNCTZ64N (int16_t, int16_t)                                                 \
+  RUNCTZ64N (int16_t, int8_t)                                                  \
+  RUNCTZ64 (uint8_t, uint64_t)                                                 \
+  RUNCTZ64 (uint8_t, uint32_t)                                                 \
+  RUNCTZ64 (uint8_t, uint16_t)                                                 \
+  RUNCTZ64 (uint8_t, uint8_t)                                                  \
+  RUNCTZ64 (uint8_t, int64_t)                                                  \
+  RUNCTZ64 (uint8_t, int32_t)                                                  \
+  RUNCTZ64 (uint8_t, int16_t)                                                  \
+  RUNCTZ64 (uint8_t, int8_t)                                                   \
+  RUNCTZ64N (int8_t, uint64_t)                                                 \
+  RUNCTZ64N (int8_t, uint32_t)                                                 \
+  RUNCTZ64N (int8_t, uint16_t)                                                 \
+  RUNCTZ64N (int8_t, uint8_t)                                                  \
+  RUNCTZ64N (int8_t, int64_t)                                                  \
+  RUNCTZ64N (int8_t, int32_t)                                                  \
+  RUNCTZ64N (int8_t, int16_t)                                                  \
+  RUNCTZ64N (int8_t, int8_t)                                                   \
+  RUNCTZ32 (uint64_t, uint64_t)                                                \
+  RUNCTZ32 (uint64_t, uint32_t)                                                \
+  RUNCTZ32 (uint64_t, uint16_t)                                                \
+  RUNCTZ32 (uint64_t, uint8_t)                                                 \
+  RUNCTZ32 (uint64_t, int64_t)                                                 \
+  RUNCTZ32 (uint64_t, int32_t)                                                 \
+  RUNCTZ32 (uint64_t, int16_t)                                                 \
+  RUNCTZ32 (uint64_t, int8_t)                                                  \
+  RUNCTZ32N (int64_t, uint64_t)                                                \
+  RUNCTZ32N (int64_t, uint32_t)                                                \
+  RUNCTZ32N (int64_t, uint16_t)                                                \
+  RUNCTZ32N (int64_t, uint8_t)                                                 \
+  RUNCTZ32N (int64_t, int64_t)                                                 \
+  RUNCTZ32N (int64_t, int32_t)                                                 \
+  RUNCTZ32N (int64_t, int16_t)                                                 \
+  RUNCTZ32N (int64_t, int8_t)                                                  \
+  RUNCTZ32 (uint32_t, uint64_t)                                                \
+  RUNCTZ32 (uint32_t, uint32_t)                                                \
+  RUNCTZ32 (uint32_t, uint16_t)                                                \
+  RUNCTZ32 (uint32_t, uint8_t)                                                 \
+  RUNCTZ32 (uint32_t, int64_t)                                                 \
+  RUNCTZ32 (uint32_t, int32_t)                                                 \
+  RUNCTZ32 (uint32_t, int16_t)                                                 \
+  RUNCTZ32 (uint32_t, int8_t)                                                  \
+  RUNCTZ32N (int32_t, uint64_t)                                                \
+  RUNCTZ32N (int32_t, uint32_t)                                                \
+  RUNCTZ32N (int32_t, uint16_t)                                                \
+  RUNCTZ32N (int32_t, uint8_t)                                                 \
+  RUNCTZ32N (int32_t, int64_t)                                                 \
+  RUNCTZ32N (int32_t, int32_t)                                                 \
+  RUNCTZ32N (int32_t, int16_t)                                                 \
+  RUNCTZ32N (int32_t, int8_t)                                                  \
+  RUNCTZ32 (uint16_t, uint64_t)                                                \
+  RUNCTZ32 (uint16_t, uint32_t)                                                \
+  RUNCTZ32 (uint16_t, uint16_t)                                                \
+  RUNCTZ32 (uint16_t, uint8_t)                                                 \
+  RUNCTZ32 (uint16_t, int64_t)                                                 \
+  RUNCTZ32 (uint16_t, int32_t)                                                 \
+  RUNCTZ32 (uint16_t, int16_t)                                                 \
+  RUNCTZ32 (uint16_t, int8_t)                                                  \
+  RUNCTZ32N (int16_t, uint64_t)                                                \
+  RUNCTZ32N (int16_t, uint32_t)                                                \
+  RUNCTZ32N (int16_t, uint16_t)                                                \
+  RUNCTZ32N (int16_t, uint8_t)                                                 \
+  RUNCTZ32N (int16_t, int64_t)                                                 \
+  RUNCTZ32N (int16_t, int32_t)                                                 \
+  RUNCTZ32N (int16_t, int16_t)                                                 \
+  RUNCTZ32N (int16_t, int8_t)                                                  \
+  RUNCTZ32 (uint8_t, uint64_t)                                                 \
+  RUNCTZ32 (uint8_t, uint32_t)                                                 \
+  RUNCTZ32 (uint8_t, uint16_t)                                                 \
+  RUNCTZ32 (uint8_t, uint8_t)                                                  \
+  RUNCTZ32 (uint8_t, int64_t)                                                  \
+  RUNCTZ32 (uint8_t, int32_t)                                                  \
+  RUNCTZ32 (uint8_t, int16_t)                                                  \
+  RUNCTZ32 (uint8_t, int8_t)                                                   \
+  RUNCTZ32N (int8_t, uint64_t)                                                 \
+  RUNCTZ32N (int8_t, uint32_t)                                                 \
+  RUNCTZ32N (int8_t, uint16_t)                                                 \
+  RUNCTZ32N (int8_t, uint8_t)                                                  \
+  RUNCTZ32N (int8_t, int64_t)                                                  \
+  RUNCTZ32N (int8_t, int32_t)                                                  \
+  RUNCTZ32N (int8_t, int16_t)                                                  \
+  RUNCTZ32N (int8_t, int8_t)                                                   \
+  RUNFFS64 (uint64_t, uint64_t)                                                \
+  RUNFFS64 (uint64_t, uint32_t)                                                \
+  RUNFFS64 (uint64_t, uint16_t)                                                \
+  RUNFFS64 (uint64_t, uint8_t)                                                 \
+  RUNFFS64 (uint64_t, int64_t)                                                 \
+  RUNFFS64 (uint64_t, int32_t)                                                 \
+  RUNFFS64 (uint64_t, int16_t)                                                 \
+  RUNFFS64 (uint64_t, int8_t)                                                  \
+  RUNFFS64N (int64_t, uint64_t)                                                \
+  RUNFFS64N (int64_t, uint32_t)                                                \
+  RUNFFS64N (int64_t, uint16_t)                                                \
+  RUNFFS64N (int64_t, uint8_t)                                                 \
+  RUNFFS64N (int64_t, int64_t)                                                 \
+  RUNFFS64N (int64_t, int32_t)                                                 \
+  RUNFFS64N (int64_t, int16_t)                                                 \
+  RUNFFS64N (int64_t, int8_t)                                                  \
+  RUNFFS64 (uint32_t, uint64_t)                                                \
+  RUNFFS64 (uint32_t, uint32_t)                                                \
+  RUNFFS64 (uint32_t, uint16_t)                                                \
+  RUNFFS64 (uint32_t, uint8_t)                                                 \
+  RUNFFS64 (uint32_t, int64_t)                                                 \
+  RUNFFS64 (uint32_t, int32_t)                                                 \
+  RUNFFS64 (uint32_t, int16_t)                                                 \
+  RUNFFS64 (uint32_t, int8_t)                                                  \
+  RUNFFS64N (int32_t, uint64_t)                                                \
+  RUNFFS64N (int32_t, uint32_t)                                                \
+  RUNFFS64N (int32_t, uint16_t)                                                \
+  RUNFFS64N (int32_t, uint8_t)                                                 \
+  RUNFFS64N (int32_t, int64_t)                                                 \
+  RUNFFS64N (int32_t, int32_t)                                                 \
+  RUNFFS64N (int32_t, int16_t)                                                 \
+  RUNFFS64N (int32_t, int8_t)                                                  \
+  RUNFFS64 (uint16_t, uint64_t)                                                \
+  RUNFFS64 (uint16_t, uint32_t)                                                \
+  RUNFFS64 (uint16_t, uint16_t)                                                \
+  RUNFFS64 (uint16_t, uint8_t)                                                 \
+  RUNFFS64 (uint16_t, int64_t)                                                 \
+  RUNFFS64 (uint16_t, int32_t)                                                 \
+  RUNFFS64 (uint16_t, int16_t)                                                 \
+  RUNFFS64 (uint16_t, int8_t)                                                  \
+  RUNFFS64N (int16_t, uint64_t)                                                \
+  RUNFFS64N (int16_t, uint32_t)                                                \
+  RUNFFS64N (int16_t, uint16_t)                                                \
+  RUNFFS64N (int16_t, uint8_t)                                                 \
+  RUNFFS64N (int16_t, int64_t)                                                 \
+  RUNFFS64N (int16_t, int32_t)                                                 \
+  RUNFFS64N (int16_t, int16_t)                                                 \
+  RUNFFS64N (int16_t, int8_t)                                                  \
+  RUNFFS64 (uint8_t, uint64_t)                                                 \
+  RUNFFS64 (uint8_t, uint32_t)                                                 \
+  RUNFFS64 (uint8_t, uint16_t)                                                 \
+  RUNFFS64 (uint8_t, uint8_t)                                                  \
+  RUNFFS64 (uint8_t, int64_t)                                                  \
+  RUNFFS64 (uint8_t, int32_t)                                                  \
+  RUNFFS64 (uint8_t, int16_t)                                                  \
+  RUNFFS64 (uint8_t, int8_t)                                                   \
+  RUNFFS64N (int8_t, uint64_t)                                                 \
+  RUNFFS64N (int8_t, uint32_t)                                                 \
+  RUNFFS64N (int8_t, uint16_t)                                                 \
+  RUNFFS64N (int8_t, uint8_t)                                                  \
+  RUNFFS64N (int8_t, int64_t)                                                  \
+  RUNFFS64N (int8_t, int32_t)                                                  \
+  RUNFFS64N (int8_t, int16_t)                                                  \
+  RUNFFS64N (int8_t, int8_t)                                                   \
+  RUNFFS32 (uint64_t, uint64_t)                                                \
+  RUNFFS32 (uint64_t, uint32_t)                                                \
+  RUNFFS32 (uint64_t, uint16_t)                                                \
+  RUNFFS32 (uint64_t, uint8_t)                                                 \
+  RUNFFS32 (uint64_t, int64_t)                                                 \
+  RUNFFS32 (uint64_t, int32_t)                                                 \
+  RUNFFS32 (uint64_t, int16_t)                                                 \
+  RUNFFS32 (uint64_t, int8_t)                                                  \
+  RUNFFS32N (int64_t, uint64_t)                                                \
+  RUNFFS32N (int64_t, uint32_t)                                                \
+  RUNFFS32N (int64_t, uint16_t)                                                \
+  RUNFFS32N (int64_t, uint8_t)                                                 \
+  RUNFFS32N (int64_t, int64_t)                                                 \
+  RUNFFS32N (int64_t, int32_t)                                                 \
+  RUNFFS32N (int64_t, int16_t)                                                 \
+  RUNFFS32N (int64_t, int8_t)                                                  \
+  RUNFFS32 (uint32_t, uint64_t)                                                \
+  RUNFFS32 (uint32_t, uint32_t)                                                \
+  RUNFFS32 (uint32_t, uint16_t)                                                \
+  RUNFFS32 (uint32_t, uint8_t)                                                 \
+  RUNFFS32 (uint32_t, int64_t)                                                 \
+  RUNFFS32 (uint32_t, int32_t)                                                 \
+  RUNFFS32 (uint32_t, int16_t)                                                 \
+  RUNFFS32 (uint32_t, int8_t)                                                  \
+  RUNFFS32N (int32_t, uint64_t)                                                \
+  RUNFFS32N (int32_t, uint32_t)                                                \
+  RUNFFS32N (int32_t, uint16_t)                                                \
+  RUNFFS32N (int32_t, uint8_t)                                                 \
+  RUNFFS32N (int32_t, int64_t)                                                 \
+  RUNFFS32N (int32_t, int32_t)                                                 \
+  RUNFFS32N (int32_t, int16_t)                                                 \
+  RUNFFS32N (int32_t, int8_t)                                                  \
+  RUNFFS32 (uint16_t, uint64_t)                                                \
+  RUNFFS32 (uint16_t, uint32_t)                                                \
+  RUNFFS32 (uint16_t, uint16_t)                                                \
+  RUNFFS32 (uint16_t, uint8_t)                                                 \
+  RUNFFS32 (uint16_t, int64_t)                                                 \
+  RUNFFS32 (uint16_t, int32_t)                                                 \
+  RUNFFS32 (uint16_t, int16_t)                                                 \
+  RUNFFS32 (uint16_t, int8_t)                                                  \
+  RUNFFS32N (int16_t, uint64_t)                                                \
+  RUNFFS32N (int16_t, uint32_t)                                                \
+  RUNFFS32N (int16_t, uint16_t)                                                \
+  RUNFFS32N (int16_t, uint8_t)                                                 \
+  RUNFFS32N (int16_t, int64_t)                                                 \
+  RUNFFS32N (int16_t, int32_t)                                                 \
+  RUNFFS32N (int16_t, int16_t)                                                 \
+  RUNFFS32N (int16_t, int8_t)                                                  \
+  RUNFFS32 (uint8_t, uint64_t)                                                 \
+  RUNFFS32 (uint8_t, uint32_t)                                                 \
+  RUNFFS32 (uint8_t, uint16_t)                                                 \
+  RUNFFS32 (uint8_t, uint8_t)                                                  \
+  RUNFFS32 (uint8_t, int64_t)                                                  \
+  RUNFFS32 (uint8_t, int32_t)                                                  \
+  RUNFFS32 (uint8_t, int16_t)                                                  \
+  RUNFFS32 (uint8_t, int8_t)                                                   \
+  RUNFFS32N (int8_t, uint64_t)                                                 \
+  RUNFFS32N (int8_t, uint32_t)                                                 \
+  RUNFFS32N (int8_t, uint16_t)                                                 \
+  RUNFFS32N (int8_t, uint8_t)                                                  \
+  RUNFFS32N (int8_t, int64_t)                                                  \
+  RUNFFS32N (int8_t, int32_t)                                                  \
+  RUNFFS32N (int8_t, int16_t)                                                  \
+  RUNFFS32N (int8_t, int8_t)
+
+int
+main ()
+{
+  RUN_ALL ()
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 229 "vect" } } */
-- 
2.41.0
 


* Re: [PATCH] RISC-V: Add popcount fallback expander.
  2023-10-18  9:28 ` juzhe.zhong
@ 2023-10-18  9:32   ` Robin Dapp
  2023-10-18 11:43   ` Robin Dapp
  1 sibling, 0 replies; 10+ messages in thread
From: Robin Dapp @ 2023-10-18  9:32 UTC (permalink / raw)
  To: juzhe.zhong, gcc-patches, palmer, kito.cheng, jeffreyalaw; +Cc: rdapp.gcc

> I saw you didn't extend VI -> V_VLSI.  I guess SLP will fail on
> popcount.
Hehe, right, I just copied and pasted the expander from my old
patch.  Will adjust it and add the test.

Regards
 Robin


* Re: [PATCH] RISC-V: Add popcount fallback expander.
  2023-10-18  9:28 ` juzhe.zhong
  2023-10-18  9:32   ` Robin Dapp
@ 2023-10-18 11:43   ` Robin Dapp
  2023-10-18 11:48     ` juzhe.zhong
                       ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: Robin Dapp @ 2023-10-18 11:43 UTC (permalink / raw)
  To: juzhe.zhong, gcc-patches, palmer, kito.cheng, jeffreyalaw; +Cc: rdapp.gcc

> I saw you didn't extend VI -> V_VLSI.  I guess SLP will fail on popcount.

Added VLS modes and your test in v2.

Testsuite looks unchanged on my side (vect, dg, rvv).

Regards
 Robin

Subject: [PATCH v2] RISC-V: Add popcount fallback expander.

I didn't manage to get back to the generic vectorizer fallback for
popcount so I figured I'd rather create a popcount fallback in the
riscv backend.  It uses the WWG algorithm from libgcc.
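
For readers who want to follow the expander step by step, here is a
scalar sketch of the same sequence of operations.  It is only an
illustration and not part of the patch: the function names
wwg_popcount32, wwg_popcount64 and wwg_selfcheck are made up for this
example, and the sample values are the ones from popcount-run-1.c.
The final shift is (element bits - 8), which matches the
GET_MODE_BITSIZE (imode) - 8 used in expand_popcount.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the per-element steps for 32-bit elements.  */
unsigned
wwg_popcount32 (uint32_t x)
{
  /* Count bits in each 2-bit field.  */
  x = x - ((x >> 1) & 0x55555555u);
  /* Sum neighbouring 2-bit fields into 4-bit fields.  */
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  /* Sum neighbouring 4-bit fields into bytes.  */
  x = (x + (x >> 4)) & 0x0f0f0f0fu;
  /* The multiply accumulates all byte counts in the top byte.  */
  return (uint32_t) (x * 0x01010101u) >> 24;
}

/* Same steps for 64-bit elements; only masks and final shift differ.  */
unsigned
wwg_popcount64 (uint64_t x)
{
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
  return (unsigned) ((x * 0x0101010101010101ULL) >> 56);
}

/* Check against the (value, expected count) pairs of popcount-run-1.c.  */
void
wwg_selfcheck (void)
{
  static const uint32_t data[] = {
    0x11111100, 6,  0xe0e0f0f0, 14, 0x9900aab3, 13, 0x00040003, 3,
    0x000e000c, 5,  0x22227777, 16, 0x12341234, 10, 0x0, 0
  };
  unsigned n = sizeof (data) / sizeof (data[0]) / 2;

  for (unsigned i = 0; i < n; ++i)
    assert (wwg_popcount32 (data[i * 2]) == data[i * 2 + 1]);

  /* Pack two 32-bit values into one 64-bit value, as the run test does.  */
  for (unsigned i = 0; i < n / 2; ++i)
    {
      uint64_t v = ((uint64_t) data[i * 4] << 32) | data[i * 4 + 2];
      assert (wwg_popcount64 (v) == data[i * 4 + 1] + data[i * 4 + 3]);
    }
}
```

Calling wwg_selfcheck () from a small main is enough to exercise both
widths.  The vector expander performs the same arithmetic on every
element, with the masks broadcast as scalar operands.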

gcc/ChangeLog:

	* config/riscv/autovec.md (popcount<mode>2): New expander.
	* config/riscv/riscv-protos.h (expand_popcount): Declare.
	* config/riscv/riscv-v.cc (expand_popcount): Vectorize popcount
	with the WWG algorithm.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/unop/popcount-1.c: New test.
	* gcc.target/riscv/rvv/autovec/unop/popcount-2.c: New test.
	* gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c: New test.
	* gcc.target/riscv/rvv/autovec/unop/popcount.c: New test.
---
 gcc/config/riscv/autovec.md                   |   14 +
 gcc/config/riscv/riscv-protos.h               |    1 +
 gcc/config/riscv/riscv-v.cc                   |   71 +
 .../riscv/rvv/autovec/unop/popcount-1.c       |   20 +
 .../riscv/rvv/autovec/unop/popcount-2.c       |   19 +
 .../riscv/rvv/autovec/unop/popcount-run-1.c   |   49 +
 .../riscv/rvv/autovec/unop/popcount.c         | 1464 +++++++++++++++++
 7 files changed, 1638 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index c5b1e52cbf9..80910ba3cc2 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1484,6 +1484,20 @@ (define_expand "xorsign<mode>3"
   DONE;
 })
 
+;; -------------------------------------------------------------------------------
+;; - [INT] POPCOUNT.
+;; -------------------------------------------------------------------------------
+
+(define_expand "popcount<mode>2"
+  [(match_operand:V_VLSI 0 "register_operand")
+   (match_operand:V_VLSI 1 "register_operand")]
+  "TARGET_VECTOR"
+{
+  riscv_vector::expand_popcount (operands);
+  DONE;
+})
+
+
 ;; -------------------------------------------------------------------------
 ;; ---- [INT] Highpart multiplication
 ;; -------------------------------------------------------------------------
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 49bdcdf2f93..4aeccdd961b 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -515,6 +515,7 @@ void expand_fold_extract_last (rtx *);
 void expand_cond_unop (unsigned, rtx *);
 void expand_cond_binop (unsigned, rtx *);
 void expand_cond_ternop (unsigned, rtx *);
+void expand_popcount (rtx *);
 
 /* Rounding mode bitfield for fixed point VXRM.  */
 enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 21d86c3f917..8b594b7127e 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4152,4 +4152,75 @@ expand_vec_lfloor (rtx op_0, rtx op_1, machine_mode vec_fp_mode,
   emit_vec_cvt_x_f (op_0, op_1, UNARY_OP_FRM_RDN, vec_fp_mode);
 }
 
+/* Vectorize popcount by the Wilkes-Wheeler-Gill algorithm that libgcc uses as
+   well.  */
+void
+expand_popcount (rtx *ops)
+{
+  rtx dst = ops[0];
+  rtx src = ops[1];
+  machine_mode mode = GET_MODE (dst);
+  scalar_mode imode = GET_MODE_INNER (mode);
+  static const uint64_t m5 = 0x5555555555555555ULL;
+  static const uint64_t m3 = 0x3333333333333333ULL;
+  static const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL;
+  static const uint64_t m1 = 0x0101010101010101ULL;
+
+  rtx x1 = gen_reg_rtx (mode);
+  rtx x2 = gen_reg_rtx (mode);
+  rtx x3 = gen_reg_rtx (mode);
+  rtx x4 = gen_reg_rtx (mode);
+
+  /* x1 = src - ((src >> 1) & 0x555...);  */
+  rtx shift1 = expand_binop (mode, lshr_optab, src, GEN_INT (1), NULL, true,
+			     OPTAB_DIRECT);
+
+  rtx and1 = gen_reg_rtx (mode);
+  rtx ops1[] = {and1, shift1, gen_int_mode (m5, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+		   ops1);
+
+  x1 = expand_binop (mode, sub_optab, src, and1, NULL, true, OPTAB_DIRECT);
+
+  /* x2 = (x1 & 0x3333333333333333ULL) + ((x1 >> 2) & 0x3333333333333333ULL);
+   */
+  rtx and2 = gen_reg_rtx (mode);
+  rtx ops2[] = {and2, x1, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+		   ops2);
+
+  rtx shift2 = expand_binop (mode, lshr_optab, x1, GEN_INT (2), NULL, true,
+			     OPTAB_DIRECT);
+
+  rtx and22 = gen_reg_rtx (mode);
+  rtx ops22[] = {and22, shift2, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+		   ops22);
+
+  x2 = expand_binop (mode, add_optab, and2, and22, NULL, true, OPTAB_DIRECT);
+
+  /* x3 = (x2 + (x2 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;  */
+  rtx shift3 = expand_binop (mode, lshr_optab, x2, GEN_INT (4), NULL, true,
+			     OPTAB_DIRECT);
+
+  rtx plus3
+    = expand_binop (mode, add_optab, x2, shift3, NULL, true, OPTAB_DIRECT);
+
+  rtx ops3[] = {x3, plus3, gen_int_mode (mf, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+		   ops3);
+
+  /* dst = (x3 * 0x0101010101010101ULL) >> (element bits - 8);  */
+  rtx mul4 = gen_reg_rtx (mode);
+  rtx ops4[] = {mul4, x3, gen_int_mode (m1, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (MULT, mode), riscv_vector::BINARY_OP,
+		   ops4);
+
+  x4 = expand_binop (mode, lshr_optab, mul4,
+		     GEN_INT (GET_MODE_BITSIZE (imode) - 8), NULL, true,
+		     OPTAB_DIRECT);
+
+  emit_move_insn (dst, x4);
+}
+
 } // namespace riscv_vector
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
new file mode 100644
index 00000000000..3169ebbff71
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv_zvfh -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-vect-details" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noipa))
+popcount_32 (uint32_t *restrict dst, uint32_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcount (src[i]);
+}
+
+void __attribute__ ((noipa))
+popcount_64 (uint64_t *restrict dst, uint64_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcountll (src[i]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 2 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
new file mode 100644
index 00000000000..9c0970afdfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-slp-details" } */
+
+int x[8];
+int y[8];
+
+void foo ()
+{
+  x[0] = __builtin_popcount (y[0]);
+  x[1] = __builtin_popcount (y[1]);
+  x[2] = __builtin_popcount (y[2]);
+  x[3] = __builtin_popcount (y[3]);
+  x[4] = __builtin_popcount (y[4]);
+  x[5] = __builtin_popcount (y[5]);
+  x[6] = __builtin_popcount (y[6]);
+  x[7] = __builtin_popcount (y[7]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
new file mode 100644
index 00000000000..38f1633da99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
@@ -0,0 +1,49 @@
+/* { dg-do run { target { riscv_v } } } */
+
+#include "popcount-1.c"
+
+extern void abort (void) __attribute__ ((noreturn));
+
+unsigned int data[] = {
+  0x11111100, 6,
+  0xe0e0f0f0, 14,
+  0x9900aab3, 13,
+  0x00040003, 3,
+  0x000e000c, 5,
+  0x22227777, 16,
+  0x12341234, 10,
+  0x0, 0
+};
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  unsigned int count = sizeof (data) / sizeof (data[0]) / 2;
+
+  uint32_t in32[count];
+  uint32_t out32[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in32[i] = data[i * 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_32 (out32, in32, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out32[i] != data[i * 2 + 1])
+      abort ();
+
+  count /= 2;
+  uint64_t in64[count];
+  uint64_t out64[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in64[i] = ((uint64_t) data[i * 4] << 32) | data[i * 4 + 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_64 (out64, in64, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out64[i] != data[i * 4 + 1] + data[i * 4 + 3])
+      abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
new file mode 100644
index 00000000000..585a522aa81
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
@@ -0,0 +1,1464 @@
+/* { dg-do run { target { riscv_v } } } */
+/* { dg-additional-options { -O2 -fdump-tree-vect-details -fno-vect-cost-model } }  */
+
+#include "stdint-gcc.h"
+#include <assert.h>
+
+#define DEF64(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+				 int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcountll (src[i]);                                  \
+  }
+
+#define DEF32(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+				 int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcount (src[i]);                                    \
+  }
+
+#define DEFCTZ64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+			    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctzll (src[i]);                                       \
+  }
+
+#define DEFCTZ32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+			    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctz (src[i]);                                         \
+  }
+
+#define DEFFFS64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+			    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffsll (src[i]);                                       \
+  }
+
+#define DEFFFS32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+			    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffs (src[i]);                                         \
+  }
+
+#define DEF_ALL()                                                              \
+  DEF64 (uint64_t, uint64_t)                                                   \
+  DEF64 (uint64_t, uint32_t)                                                   \
+  DEF64 (uint64_t, uint16_t)                                                   \
+  DEF64 (uint64_t, uint8_t)                                                    \
+  DEF64 (uint64_t, int64_t)                                                    \
+  DEF64 (uint64_t, int32_t)                                                    \
+  DEF64 (uint64_t, int16_t)                                                    \
+  DEF64 (uint64_t, int8_t)                                                     \
+  DEF64 (int64_t, uint64_t)                                                    \
+  DEF64 (int64_t, uint32_t)                                                    \
+  DEF64 (int64_t, uint16_t)                                                    \
+  DEF64 (int64_t, uint8_t)                                                     \
+  DEF64 (int64_t, int64_t)                                                     \
+  DEF64 (int64_t, int32_t)                                                     \
+  DEF64 (int64_t, int16_t)                                                     \
+  DEF64 (int64_t, int8_t)                                                      \
+  DEF64 (uint32_t, uint64_t)                                                   \
+  DEF64 (uint32_t, uint32_t)                                                   \
+  DEF64 (uint32_t, uint16_t)                                                   \
+  DEF64 (uint32_t, uint8_t)                                                    \
+  DEF64 (uint32_t, int64_t)                                                    \
+  DEF64 (uint32_t, int32_t)                                                    \
+  DEF64 (uint32_t, int16_t)                                                    \
+  DEF64 (uint32_t, int8_t)                                                     \
+  DEF64 (int32_t, uint64_t)                                                    \
+  DEF64 (int32_t, uint32_t)                                                    \
+  DEF64 (int32_t, uint16_t)                                                    \
+  DEF64 (int32_t, uint8_t)                                                     \
+  DEF64 (int32_t, int64_t)                                                     \
+  DEF64 (int32_t, int32_t)                                                     \
+  DEF64 (int32_t, int16_t)                                                     \
+  DEF64 (int32_t, int8_t)                                                      \
+  DEF64 (uint16_t, uint64_t)                                                   \
+  DEF64 (uint16_t, uint32_t)                                                   \
+  DEF64 (uint16_t, uint16_t)                                                   \
+  DEF64 (uint16_t, uint8_t)                                                    \
+  DEF64 (uint16_t, int64_t)                                                    \
+  DEF64 (uint16_t, int32_t)                                                    \
+  DEF64 (uint16_t, int16_t)                                                    \
+  DEF64 (uint16_t, int8_t)                                                     \
+  DEF64 (int16_t, uint64_t)                                                    \
+  DEF64 (int16_t, uint32_t)                                                    \
+  DEF64 (int16_t, uint16_t)                                                    \
+  DEF64 (int16_t, uint8_t)                                                     \
+  DEF64 (int16_t, int64_t)                                                     \
+  DEF64 (int16_t, int32_t)                                                     \
+  DEF64 (int16_t, int16_t)                                                     \
+  DEF64 (int16_t, int8_t)                                                      \
+  DEF64 (uint8_t, uint64_t)                                                    \
+  DEF64 (uint8_t, uint32_t)                                                    \
+  DEF64 (uint8_t, uint16_t)                                                    \
+  DEF64 (uint8_t, uint8_t)                                                     \
+  DEF64 (uint8_t, int64_t)                                                     \
+  DEF64 (uint8_t, int32_t)                                                     \
+  DEF64 (uint8_t, int16_t)                                                     \
+  DEF64 (uint8_t, int8_t)                                                      \
+  DEF64 (int8_t, uint64_t)                                                     \
+  DEF64 (int8_t, uint32_t)                                                     \
+  DEF64 (int8_t, uint16_t)                                                     \
+  DEF64 (int8_t, uint8_t)                                                      \
+  DEF64 (int8_t, int64_t)                                                      \
+  DEF64 (int8_t, int32_t)                                                      \
+  DEF64 (int8_t, int16_t)                                                      \
+  DEF64 (int8_t, int8_t)                                                       \
+  DEF32 (uint64_t, uint64_t)                                                   \
+  DEF32 (uint64_t, uint32_t)                                                   \
+  DEF32 (uint64_t, uint16_t)                                                   \
+  DEF32 (uint64_t, uint8_t)                                                    \
+  DEF32 (uint64_t, int64_t)                                                    \
+  DEF32 (uint64_t, int32_t)                                                    \
+  DEF32 (uint64_t, int16_t)                                                    \
+  DEF32 (uint64_t, int8_t)                                                     \
+  DEF32 (int64_t, uint64_t)                                                    \
+  DEF32 (int64_t, uint32_t)                                                    \
+  DEF32 (int64_t, uint16_t)                                                    \
+  DEF32 (int64_t, uint8_t)                                                     \
+  DEF32 (int64_t, int64_t)                                                     \
+  DEF32 (int64_t, int32_t)                                                     \
+  DEF32 (int64_t, int16_t)                                                     \
+  DEF32 (int64_t, int8_t)                                                      \
+  DEF32 (uint32_t, uint64_t)                                                   \
+  DEF32 (uint32_t, uint32_t)                                                   \
+  DEF32 (uint32_t, uint16_t)                                                   \
+  DEF32 (uint32_t, uint8_t)                                                    \
+  DEF32 (uint32_t, int64_t)                                                    \
+  DEF32 (uint32_t, int32_t)                                                    \
+  DEF32 (uint32_t, int16_t)                                                    \
+  DEF32 (uint32_t, int8_t)                                                     \
+  DEF32 (int32_t, uint64_t)                                                    \
+  DEF32 (int32_t, uint32_t)                                                    \
+  DEF32 (int32_t, uint16_t)                                                    \
+  DEF32 (int32_t, uint8_t)                                                     \
+  DEF32 (int32_t, int64_t)                                                     \
+  DEF32 (int32_t, int32_t)                                                     \
+  DEF32 (int32_t, int16_t)                                                     \
+  DEF32 (int32_t, int8_t)                                                      \
+  DEF32 (uint16_t, uint64_t)                                                   \
+  DEF32 (uint16_t, uint32_t)                                                   \
+  DEF32 (uint16_t, uint16_t)                                                   \
+  DEF32 (uint16_t, uint8_t)                                                    \
+  DEF32 (uint16_t, int64_t)                                                    \
+  DEF32 (uint16_t, int32_t)                                                    \
+  DEF32 (uint16_t, int16_t)                                                    \
+  DEF32 (uint16_t, int8_t)                                                     \
+  DEF32 (int16_t, uint64_t)                                                    \
+  DEF32 (int16_t, uint32_t)                                                    \
+  DEF32 (int16_t, uint16_t)                                                    \
+  DEF32 (int16_t, uint8_t)                                                     \
+  DEF32 (int16_t, int64_t)                                                     \
+  DEF32 (int16_t, int32_t)                                                     \
+  DEF32 (int16_t, int16_t)                                                     \
+  DEF32 (int16_t, int8_t)                                                      \
+  DEF32 (uint8_t, uint64_t)                                                    \
+  DEF32 (uint8_t, uint32_t)                                                    \
+  DEF32 (uint8_t, uint16_t)                                                    \
+  DEF32 (uint8_t, uint8_t)                                                     \
+  DEF32 (uint8_t, int64_t)                                                     \
+  DEF32 (uint8_t, int32_t)                                                     \
+  DEF32 (uint8_t, int16_t)                                                     \
+  DEF32 (uint8_t, int8_t)                                                      \
+  DEF32 (int8_t, uint64_t)                                                     \
+  DEF32 (int8_t, uint32_t)                                                     \
+  DEF32 (int8_t, uint16_t)                                                     \
+  DEF32 (int8_t, uint8_t)                                                      \
+  DEF32 (int8_t, int64_t)                                                      \
+  DEF32 (int8_t, int32_t)                                                      \
+  DEF32 (int8_t, int16_t)                                                      \
+  DEF32 (int8_t, int8_t)                                                       \
+  DEFCTZ64 (uint64_t, uint64_t)                                                \
+  DEFCTZ64 (uint64_t, uint32_t)                                                \
+  DEFCTZ64 (uint64_t, uint16_t)                                                \
+  DEFCTZ64 (uint64_t, uint8_t)                                                 \
+  DEFCTZ64 (uint64_t, int64_t)                                                 \
+  DEFCTZ64 (uint64_t, int32_t)                                                 \
+  DEFCTZ64 (uint64_t, int16_t)                                                 \
+  DEFCTZ64 (uint64_t, int8_t)                                                  \
+  DEFCTZ64 (int64_t, uint64_t)                                                 \
+  DEFCTZ64 (int64_t, uint32_t)                                                 \
+  DEFCTZ64 (int64_t, uint16_t)                                                 \
+  DEFCTZ64 (int64_t, uint8_t)                                                  \
+  DEFCTZ64 (int64_t, int64_t)                                                  \
+  DEFCTZ64 (int64_t, int32_t)                                                  \
+  DEFCTZ64 (int64_t, int16_t)                                                  \
+  DEFCTZ64 (int64_t, int8_t)                                                   \
+  DEFCTZ64 (uint32_t, uint64_t)                                                \
+  DEFCTZ64 (uint32_t, uint32_t)                                                \
+  DEFCTZ64 (uint32_t, uint16_t)                                                \
+  DEFCTZ64 (uint32_t, uint8_t)                                                 \
+  DEFCTZ64 (uint32_t, int64_t)                                                 \
+  DEFCTZ64 (uint32_t, int32_t)                                                 \
+  DEFCTZ64 (uint32_t, int16_t)                                                 \
+  DEFCTZ64 (uint32_t, int8_t)                                                  \
+  DEFCTZ64 (int32_t, uint64_t)                                                 \
+  DEFCTZ64 (int32_t, uint32_t)                                                 \
+  DEFCTZ64 (int32_t, uint16_t)                                                 \
+  DEFCTZ64 (int32_t, uint8_t)                                                  \
+  DEFCTZ64 (int32_t, int64_t)                                                  \
+  DEFCTZ64 (int32_t, int32_t)                                                  \
+  DEFCTZ64 (int32_t, int16_t)                                                  \
+  DEFCTZ64 (int32_t, int8_t)                                                   \
+  DEFCTZ64 (uint16_t, uint64_t)                                                \
+  DEFCTZ64 (uint16_t, uint32_t)                                                \
+  DEFCTZ64 (uint16_t, uint16_t)                                                \
+  DEFCTZ64 (uint16_t, uint8_t)                                                 \
+  DEFCTZ64 (uint16_t, int64_t)                                                 \
+  DEFCTZ64 (uint16_t, int32_t)                                                 \
+  DEFCTZ64 (uint16_t, int16_t)                                                 \
+  DEFCTZ64 (uint16_t, int8_t)                                                  \
+  DEFCTZ64 (int16_t, uint64_t)                                                 \
+  DEFCTZ64 (int16_t, uint32_t)                                                 \
+  DEFCTZ64 (int16_t, uint16_t)                                                 \
+  DEFCTZ64 (int16_t, uint8_t)                                                  \
+  DEFCTZ64 (int16_t, int64_t)                                                  \
+  DEFCTZ64 (int16_t, int32_t)                                                  \
+  DEFCTZ64 (int16_t, int16_t)                                                  \
+  DEFCTZ64 (int16_t, int8_t)                                                   \
+  DEFCTZ64 (uint8_t, uint64_t)                                                 \
+  DEFCTZ64 (uint8_t, uint32_t)                                                 \
+  DEFCTZ64 (uint8_t, uint16_t)                                                 \
+  DEFCTZ64 (uint8_t, uint8_t)                                                  \
+  DEFCTZ64 (uint8_t, int64_t)                                                  \
+  DEFCTZ64 (uint8_t, int32_t)                                                  \
+  DEFCTZ64 (uint8_t, int16_t)                                                  \
+  DEFCTZ64 (uint8_t, int8_t)                                                   \
+  DEFCTZ64 (int8_t, uint64_t)                                                  \
+  DEFCTZ64 (int8_t, uint32_t)                                                  \
+  DEFCTZ64 (int8_t, uint16_t)                                                  \
+  DEFCTZ64 (int8_t, uint8_t)                                                   \
+  DEFCTZ64 (int8_t, int64_t)                                                   \
+  DEFCTZ64 (int8_t, int32_t)                                                   \
+  DEFCTZ64 (int8_t, int16_t)                                                   \
+  DEFCTZ64 (int8_t, int8_t)                                                    \
+  DEFCTZ32 (uint64_t, uint64_t)                                                \
+  DEFCTZ32 (uint64_t, uint32_t)                                                \
+  DEFCTZ32 (uint64_t, uint16_t)                                                \
+  DEFCTZ32 (uint64_t, uint8_t)                                                 \
+  DEFCTZ32 (uint64_t, int64_t)                                                 \
+  DEFCTZ32 (uint64_t, int32_t)                                                 \
+  DEFCTZ32 (uint64_t, int16_t)                                                 \
+  DEFCTZ32 (uint64_t, int8_t)                                                  \
+  DEFCTZ32 (int64_t, uint64_t)                                                 \
+  DEFCTZ32 (int64_t, uint32_t)                                                 \
+  DEFCTZ32 (int64_t, uint16_t)                                                 \
+  DEFCTZ32 (int64_t, uint8_t)                                                  \
+  DEFCTZ32 (int64_t, int64_t)                                                  \
+  DEFCTZ32 (int64_t, int32_t)                                                  \
+  DEFCTZ32 (int64_t, int16_t)                                                  \
+  DEFCTZ32 (int64_t, int8_t)                                                   \
+  DEFCTZ32 (uint32_t, uint64_t)                                                \
+  DEFCTZ32 (uint32_t, uint32_t)                                                \
+  DEFCTZ32 (uint32_t, uint16_t)                                                \
+  DEFCTZ32 (uint32_t, uint8_t)                                                 \
+  DEFCTZ32 (uint32_t, int64_t)                                                 \
+  DEFCTZ32 (uint32_t, int32_t)                                                 \
+  DEFCTZ32 (uint32_t, int16_t)                                                 \
+  DEFCTZ32 (uint32_t, int8_t)                                                  \
+  DEFCTZ32 (int32_t, uint64_t)                                                 \
+  DEFCTZ32 (int32_t, uint32_t)                                                 \
+  DEFCTZ32 (int32_t, uint16_t)                                                 \
+  DEFCTZ32 (int32_t, uint8_t)                                                  \
+  DEFCTZ32 (int32_t, int64_t)                                                  \
+  DEFCTZ32 (int32_t, int32_t)                                                  \
+  DEFCTZ32 (int32_t, int16_t)                                                  \
+  DEFCTZ32 (int32_t, int8_t)                                                   \
+  DEFCTZ32 (uint16_t, uint64_t)                                                \
+  DEFCTZ32 (uint16_t, uint32_t)                                                \
+  DEFCTZ32 (uint16_t, uint16_t)                                                \
+  DEFCTZ32 (uint16_t, uint8_t)                                                 \
+  DEFCTZ32 (uint16_t, int64_t)                                                 \
+  DEFCTZ32 (uint16_t, int32_t)                                                 \
+  DEFCTZ32 (uint16_t, int16_t)                                                 \
+  DEFCTZ32 (uint16_t, int8_t)                                                  \
+  DEFCTZ32 (int16_t, uint64_t)                                                 \
+  DEFCTZ32 (int16_t, uint32_t)                                                 \
+  DEFCTZ32 (int16_t, uint16_t)                                                 \
+  DEFCTZ32 (int16_t, uint8_t)                                                  \
+  DEFCTZ32 (int16_t, int64_t)                                                  \
+  DEFCTZ32 (int16_t, int32_t)                                                  \
+  DEFCTZ32 (int16_t, int16_t)                                                  \
+  DEFCTZ32 (int16_t, int8_t)                                                   \
+  DEFCTZ32 (uint8_t, uint64_t)                                                 \
+  DEFCTZ32 (uint8_t, uint32_t)                                                 \
+  DEFCTZ32 (uint8_t, uint16_t)                                                 \
+  DEFCTZ32 (uint8_t, uint8_t)                                                  \
+  DEFCTZ32 (uint8_t, int64_t)                                                  \
+  DEFCTZ32 (uint8_t, int32_t)                                                  \
+  DEFCTZ32 (uint8_t, int16_t)                                                  \
+  DEFCTZ32 (uint8_t, int8_t)                                                   \
+  DEFCTZ32 (int8_t, uint64_t)                                                  \
+  DEFCTZ32 (int8_t, uint32_t)                                                  \
+  DEFCTZ32 (int8_t, uint16_t)                                                  \
+  DEFCTZ32 (int8_t, uint8_t)                                                   \
+  DEFCTZ32 (int8_t, int64_t)                                                   \
+  DEFCTZ32 (int8_t, int32_t)                                                   \
+  DEFCTZ32 (int8_t, int16_t)                                                   \
+  DEFCTZ32 (int8_t, int8_t)                                                    \
+  DEFFFS64 (uint64_t, uint64_t)                                                \
+  DEFFFS64 (uint64_t, uint32_t)                                                \
+  DEFFFS64 (uint64_t, uint16_t)                                                \
+  DEFFFS64 (uint64_t, uint8_t)                                                 \
+  DEFFFS64 (uint64_t, int64_t)                                                 \
+  DEFFFS64 (uint64_t, int32_t)                                                 \
+  DEFFFS64 (uint64_t, int16_t)                                                 \
+  DEFFFS64 (uint64_t, int8_t)                                                  \
+  DEFFFS64 (int64_t, uint64_t)                                                 \
+  DEFFFS64 (int64_t, uint32_t)                                                 \
+  DEFFFS64 (int64_t, uint16_t)                                                 \
+  DEFFFS64 (int64_t, uint8_t)                                                  \
+  DEFFFS64 (int64_t, int64_t)                                                  \
+  DEFFFS64 (int64_t, int32_t)                                                  \
+  DEFFFS64 (int64_t, int16_t)                                                  \
+  DEFFFS64 (int64_t, int8_t)                                                   \
+  DEFFFS64 (uint32_t, uint64_t)                                                \
+  DEFFFS64 (uint32_t, uint32_t)                                                \
+  DEFFFS64 (uint32_t, uint16_t)                                                \
+  DEFFFS64 (uint32_t, uint8_t)                                                 \
+  DEFFFS64 (uint32_t, int64_t)                                                 \
+  DEFFFS64 (uint32_t, int32_t)                                                 \
+  DEFFFS64 (uint32_t, int16_t)                                                 \
+  DEFFFS64 (uint32_t, int8_t)                                                  \
+  DEFFFS64 (int32_t, uint64_t)                                                 \
+  DEFFFS64 (int32_t, uint32_t)                                                 \
+  DEFFFS64 (int32_t, uint16_t)                                                 \
+  DEFFFS64 (int32_t, uint8_t)                                                  \
+  DEFFFS64 (int32_t, int64_t)                                                  \
+  DEFFFS64 (int32_t, int32_t)                                                  \
+  DEFFFS64 (int32_t, int16_t)                                                  \
+  DEFFFS64 (int32_t, int8_t)                                                   \
+  DEFFFS64 (uint16_t, uint64_t)                                                \
+  DEFFFS64 (uint16_t, uint32_t)                                                \
+  DEFFFS64 (uint16_t, uint16_t)                                                \
+  DEFFFS64 (uint16_t, uint8_t)                                                 \
+  DEFFFS64 (uint16_t, int64_t)                                                 \
+  DEFFFS64 (uint16_t, int32_t)                                                 \
+  DEFFFS64 (uint16_t, int16_t)                                                 \
+  DEFFFS64 (uint16_t, int8_t)                                                  \
+  DEFFFS64 (int16_t, uint64_t)                                                 \
+  DEFFFS64 (int16_t, uint32_t)                                                 \
+  DEFFFS64 (int16_t, uint16_t)                                                 \
+  DEFFFS64 (int16_t, uint8_t)                                                  \
+  DEFFFS64 (int16_t, int64_t)                                                  \
+  DEFFFS64 (int16_t, int32_t)                                                  \
+  DEFFFS64 (int16_t, int16_t)                                                  \
+  DEFFFS64 (int16_t, int8_t)                                                   \
+  DEFFFS64 (uint8_t, uint64_t)                                                 \
+  DEFFFS64 (uint8_t, uint32_t)                                                 \
+  DEFFFS64 (uint8_t, uint16_t)                                                 \
+  DEFFFS64 (uint8_t, uint8_t)                                                  \
+  DEFFFS64 (uint8_t, int64_t)                                                  \
+  DEFFFS64 (uint8_t, int32_t)                                                  \
+  DEFFFS64 (uint8_t, int16_t)                                                  \
+  DEFFFS64 (uint8_t, int8_t)                                                   \
+  DEFFFS64 (int8_t, uint64_t)                                                  \
+  DEFFFS64 (int8_t, uint32_t)                                                  \
+  DEFFFS64 (int8_t, uint16_t)                                                  \
+  DEFFFS64 (int8_t, uint8_t)                                                   \
+  DEFFFS64 (int8_t, int64_t)                                                   \
+  DEFFFS64 (int8_t, int32_t)                                                   \
+  DEFFFS64 (int8_t, int16_t)                                                   \
+  DEFFFS64 (int8_t, int8_t)                                                    \
+  DEFFFS32 (uint64_t, uint64_t)                                                \
+  DEFFFS32 (uint64_t, uint32_t)                                                \
+  DEFFFS32 (uint64_t, uint16_t)                                                \
+  DEFFFS32 (uint64_t, uint8_t)                                                 \
+  DEFFFS32 (uint64_t, int64_t)                                                 \
+  DEFFFS32 (uint64_t, int32_t)                                                 \
+  DEFFFS32 (uint64_t, int16_t)                                                 \
+  DEFFFS32 (uint64_t, int8_t)                                                  \
+  DEFFFS32 (int64_t, uint64_t)                                                 \
+  DEFFFS32 (int64_t, uint32_t)                                                 \
+  DEFFFS32 (int64_t, uint16_t)                                                 \
+  DEFFFS32 (int64_t, uint8_t)                                                  \
+  DEFFFS32 (int64_t, int64_t)                                                  \
+  DEFFFS32 (int64_t, int32_t)                                                  \
+  DEFFFS32 (int64_t, int16_t)                                                  \
+  DEFFFS32 (int64_t, int8_t)                                                   \
+  DEFFFS32 (uint32_t, uint64_t)                                                \
+  DEFFFS32 (uint32_t, uint32_t)                                                \
+  DEFFFS32 (uint32_t, uint16_t)                                                \
+  DEFFFS32 (uint32_t, uint8_t)                                                 \
+  DEFFFS32 (uint32_t, int64_t)                                                 \
+  DEFFFS32 (uint32_t, int32_t)                                                 \
+  DEFFFS32 (uint32_t, int16_t)                                                 \
+  DEFFFS32 (uint32_t, int8_t)                                                  \
+  DEFFFS32 (int32_t, uint64_t)                                                 \
+  DEFFFS32 (int32_t, uint32_t)                                                 \
+  DEFFFS32 (int32_t, uint16_t)                                                 \
+  DEFFFS32 (int32_t, uint8_t)                                                  \
+  DEFFFS32 (int32_t, int64_t)                                                  \
+  DEFFFS32 (int32_t, int32_t)                                                  \
+  DEFFFS32 (int32_t, int16_t)                                                  \
+  DEFFFS32 (int32_t, int8_t)                                                   \
+  DEFFFS32 (uint16_t, uint64_t)                                                \
+  DEFFFS32 (uint16_t, uint32_t)                                                \
+  DEFFFS32 (uint16_t, uint16_t)                                                \
+  DEFFFS32 (uint16_t, uint8_t)                                                 \
+  DEFFFS32 (uint16_t, int64_t)                                                 \
+  DEFFFS32 (uint16_t, int32_t)                                                 \
+  DEFFFS32 (uint16_t, int16_t)                                                 \
+  DEFFFS32 (uint16_t, int8_t)                                                  \
+  DEFFFS32 (int16_t, uint64_t)                                                 \
+  DEFFFS32 (int16_t, uint32_t)                                                 \
+  DEFFFS32 (int16_t, uint16_t)                                                 \
+  DEFFFS32 (int16_t, uint8_t)                                                  \
+  DEFFFS32 (int16_t, int64_t)                                                  \
+  DEFFFS32 (int16_t, int32_t)                                                  \
+  DEFFFS32 (int16_t, int16_t)                                                  \
+  DEFFFS32 (int16_t, int8_t)                                                   \
+  DEFFFS32 (uint8_t, uint64_t)                                                 \
+  DEFFFS32 (uint8_t, uint32_t)                                                 \
+  DEFFFS32 (uint8_t, uint16_t)                                                 \
+  DEFFFS32 (uint8_t, uint8_t)                                                  \
+  DEFFFS32 (uint8_t, int64_t)                                                  \
+  DEFFFS32 (uint8_t, int32_t)                                                  \
+  DEFFFS32 (uint8_t, int16_t)                                                  \
+  DEFFFS32 (uint8_t, int8_t)                                                   \
+  DEFFFS32 (int8_t, uint64_t)                                                  \
+  DEFFFS32 (int8_t, uint32_t)                                                  \
+  DEFFFS32 (int8_t, uint16_t)                                                  \
+  DEFFFS32 (int8_t, uint8_t)                                                   \
+  DEFFFS32 (int8_t, int64_t)                                                   \
+  DEFFFS32 (int8_t, int32_t)                                                   \
+  DEFFFS32 (int8_t, int16_t)                                                   \
+  DEFFFS32 (int8_t, int8_t)
+
+DEF_ALL ()
+
+#define SZ 512
+
+#define TEST64(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test64_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST64N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test64n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST32(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test32_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TEST32N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test32n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TESTCTZ64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTCTZ32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TESTFFS32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TEST_ALL()                                                             \
+  TEST64 (uint64_t, uint64_t)                                                  \
+  TEST64 (uint64_t, uint32_t)                                                  \
+  TEST64 (uint64_t, uint16_t)                                                  \
+  TEST64 (uint64_t, uint8_t)                                                   \
+  TEST64 (uint64_t, int64_t)                                                   \
+  TEST64 (uint64_t, int32_t)                                                   \
+  TEST64 (uint64_t, int16_t)                                                   \
+  TEST64 (uint64_t, int8_t)                                                    \
+  TEST64N (int64_t, uint64_t)                                                  \
+  TEST64N (int64_t, uint32_t)                                                  \
+  TEST64N (int64_t, uint16_t)                                                  \
+  TEST64N (int64_t, uint8_t)                                                   \
+  TEST64N (int64_t, int64_t)                                                   \
+  TEST64N (int64_t, int32_t)                                                   \
+  TEST64N (int64_t, int16_t)                                                   \
+  TEST64N (int64_t, int8_t)                                                    \
+  TEST64 (uint32_t, uint64_t)                                                  \
+  TEST64 (uint32_t, uint32_t)                                                  \
+  TEST64 (uint32_t, uint16_t)                                                  \
+  TEST64 (uint32_t, uint8_t)                                                   \
+  TEST64 (uint32_t, int64_t)                                                   \
+  TEST64 (uint32_t, int32_t)                                                   \
+  TEST64 (uint32_t, int16_t)                                                   \
+  TEST64 (uint32_t, int8_t)                                                    \
+  TEST64N (int32_t, uint64_t)                                                  \
+  TEST64N (int32_t, uint32_t)                                                  \
+  TEST64N (int32_t, uint16_t)                                                  \
+  TEST64N (int32_t, uint8_t)                                                   \
+  TEST64N (int32_t, int64_t)                                                   \
+  TEST64N (int32_t, int32_t)                                                   \
+  TEST64N (int32_t, int16_t)                                                   \
+  TEST64N (int32_t, int8_t)                                                    \
+  TEST64 (uint16_t, uint64_t)                                                  \
+  TEST64 (uint16_t, uint32_t)                                                  \
+  TEST64 (uint16_t, uint16_t)                                                  \
+  TEST64 (uint16_t, uint8_t)                                                   \
+  TEST64 (uint16_t, int64_t)                                                   \
+  TEST64 (uint16_t, int32_t)                                                   \
+  TEST64 (uint16_t, int16_t)                                                   \
+  TEST64 (uint16_t, int8_t)                                                    \
+  TEST64N (int16_t, uint64_t)                                                  \
+  TEST64N (int16_t, uint32_t)                                                  \
+  TEST64N (int16_t, uint16_t)                                                  \
+  TEST64N (int16_t, uint8_t)                                                   \
+  TEST64N (int16_t, int64_t)                                                   \
+  TEST64N (int16_t, int32_t)                                                   \
+  TEST64N (int16_t, int16_t)                                                   \
+  TEST64N (int16_t, int8_t)                                                    \
+  TEST64 (uint8_t, uint64_t)                                                   \
+  TEST64 (uint8_t, uint32_t)                                                   \
+  TEST64 (uint8_t, uint16_t)                                                   \
+  TEST64 (uint8_t, uint8_t)                                                    \
+  TEST64 (uint8_t, int64_t)                                                    \
+  TEST64 (uint8_t, int32_t)                                                    \
+  TEST64 (uint8_t, int16_t)                                                    \
+  TEST64 (uint8_t, int8_t)                                                     \
+  TEST64N (int8_t, uint64_t)                                                   \
+  TEST64N (int8_t, uint32_t)                                                   \
+  TEST64N (int8_t, uint16_t)                                                   \
+  TEST64N (int8_t, uint8_t)                                                    \
+  TEST64N (int8_t, int64_t)                                                    \
+  TEST64N (int8_t, int32_t)                                                    \
+  TEST64N (int8_t, int16_t)                                                    \
+  TEST64N (int8_t, int8_t)                                                     \
+  TEST32 (uint64_t, uint64_t)                                                  \
+  TEST32 (uint64_t, uint32_t)                                                  \
+  TEST32 (uint64_t, uint16_t)                                                  \
+  TEST32 (uint64_t, uint8_t)                                                   \
+  TEST32 (uint64_t, int64_t)                                                   \
+  TEST32 (uint64_t, int32_t)                                                   \
+  TEST32 (uint64_t, int16_t)                                                   \
+  TEST32 (uint64_t, int8_t)                                                    \
+  TEST32N (int64_t, uint64_t)                                                  \
+  TEST32N (int64_t, uint32_t)                                                  \
+  TEST32N (int64_t, uint16_t)                                                  \
+  TEST32N (int64_t, uint8_t)                                                   \
+  TEST32N (int64_t, int64_t)                                                   \
+  TEST32N (int64_t, int32_t)                                                   \
+  TEST32N (int64_t, int16_t)                                                   \
+  TEST32N (int64_t, int8_t)                                                    \
+  TEST32 (uint32_t, uint64_t)                                                  \
+  TEST32 (uint32_t, uint32_t)                                                  \
+  TEST32 (uint32_t, uint16_t)                                                  \
+  TEST32 (uint32_t, uint8_t)                                                   \
+  TEST32 (uint32_t, int64_t)                                                   \
+  TEST32 (uint32_t, int32_t)                                                   \
+  TEST32 (uint32_t, int16_t)                                                   \
+  TEST32 (uint32_t, int8_t)                                                    \
+  TEST32N (int32_t, uint64_t)                                                  \
+  TEST32N (int32_t, uint32_t)                                                  \
+  TEST32N (int32_t, uint16_t)                                                  \
+  TEST32N (int32_t, uint8_t)                                                   \
+  TEST32N (int32_t, int64_t)                                                   \
+  TEST32N (int32_t, int32_t)                                                   \
+  TEST32N (int32_t, int16_t)                                                   \
+  TEST32N (int32_t, int8_t)                                                    \
+  TEST32 (uint16_t, uint64_t)                                                  \
+  TEST32 (uint16_t, uint32_t)                                                  \
+  TEST32 (uint16_t, uint16_t)                                                  \
+  TEST32 (uint16_t, uint8_t)                                                   \
+  TEST32 (uint16_t, int64_t)                                                   \
+  TEST32 (uint16_t, int32_t)                                                   \
+  TEST32 (uint16_t, int16_t)                                                   \
+  TEST32 (uint16_t, int8_t)                                                    \
+  TEST32N (int16_t, uint64_t)                                                  \
+  TEST32N (int16_t, uint32_t)                                                  \
+  TEST32N (int16_t, uint16_t)                                                  \
+  TEST32N (int16_t, uint8_t)                                                   \
+  TEST32N (int16_t, int64_t)                                                   \
+  TEST32N (int16_t, int32_t)                                                   \
+  TEST32N (int16_t, int16_t)                                                   \
+  TEST32N (int16_t, int8_t)                                                    \
+  TEST32 (uint8_t, uint64_t)                                                   \
+  TEST32 (uint8_t, uint32_t)                                                   \
+  TEST32 (uint8_t, uint16_t)                                                   \
+  TEST32 (uint8_t, uint8_t)                                                    \
+  TEST32 (uint8_t, int64_t)                                                    \
+  TEST32 (uint8_t, int32_t)                                                    \
+  TEST32 (uint8_t, int16_t)                                                    \
+  TEST32 (uint8_t, int8_t)                                                     \
+  TEST32N (int8_t, uint64_t)                                                   \
+  TEST32N (int8_t, uint32_t)                                                   \
+  TEST32N (int8_t, uint16_t)                                                   \
+  TEST32N (int8_t, uint8_t)                                                    \
+  TEST32N (int8_t, int64_t)                                                    \
+  TEST32N (int8_t, int32_t)                                                    \
+  TEST32N (int8_t, int16_t)                                                    \
+  TEST32N (int8_t, int8_t)                                                     \
+  TESTCTZ64 (uint64_t, uint64_t)                                               \
+  TESTCTZ64 (uint64_t, uint32_t)                                               \
+  TESTCTZ64 (uint64_t, uint16_t)                                               \
+  TESTCTZ64 (uint64_t, uint8_t)                                                \
+  TESTCTZ64 (uint64_t, int64_t)                                                \
+  TESTCTZ64 (uint64_t, int32_t)                                                \
+  TESTCTZ64 (uint64_t, int16_t)                                                \
+  TESTCTZ64 (uint64_t, int8_t)                                                 \
+  TESTCTZ64N (int64_t, uint64_t)                                               \
+  TESTCTZ64N (int64_t, uint32_t)                                               \
+  TESTCTZ64N (int64_t, uint16_t)                                               \
+  TESTCTZ64N (int64_t, uint8_t)                                                \
+  TESTCTZ64N (int64_t, int64_t)                                                \
+  TESTCTZ64N (int64_t, int32_t)                                                \
+  TESTCTZ64N (int64_t, int16_t)                                                \
+  TESTCTZ64N (int64_t, int8_t)                                                 \
+  TESTCTZ64 (uint32_t, uint64_t)                                               \
+  TESTCTZ64 (uint32_t, uint32_t)                                               \
+  TESTCTZ64 (uint32_t, uint16_t)                                               \
+  TESTCTZ64 (uint32_t, uint8_t)                                                \
+  TESTCTZ64 (uint32_t, int64_t)                                                \
+  TESTCTZ64 (uint32_t, int32_t)                                                \
+  TESTCTZ64 (uint32_t, int16_t)                                                \
+  TESTCTZ64 (uint32_t, int8_t)                                                 \
+  TESTCTZ64N (int32_t, uint64_t)                                               \
+  TESTCTZ64N (int32_t, uint32_t)                                               \
+  TESTCTZ64N (int32_t, uint16_t)                                               \
+  TESTCTZ64N (int32_t, uint8_t)                                                \
+  TESTCTZ64N (int32_t, int64_t)                                                \
+  TESTCTZ64N (int32_t, int32_t)                                                \
+  TESTCTZ64N (int32_t, int16_t)                                                \
+  TESTCTZ64N (int32_t, int8_t)                                                 \
+  TESTCTZ64 (uint16_t, uint64_t)                                               \
+  TESTCTZ64 (uint16_t, uint32_t)                                               \
+  TESTCTZ64 (uint16_t, uint16_t)                                               \
+  TESTCTZ64 (uint16_t, uint8_t)                                                \
+  TESTCTZ64 (uint16_t, int64_t)                                                \
+  TESTCTZ64 (uint16_t, int32_t)                                                \
+  TESTCTZ64 (uint16_t, int16_t)                                                \
+  TESTCTZ64 (uint16_t, int8_t)                                                 \
+  TESTCTZ64N (int16_t, uint64_t)                                               \
+  TESTCTZ64N (int16_t, uint32_t)                                               \
+  TESTCTZ64N (int16_t, uint16_t)                                               \
+  TESTCTZ64N (int16_t, uint8_t)                                                \
+  TESTCTZ64N (int16_t, int64_t)                                                \
+  TESTCTZ64N (int16_t, int32_t)                                                \
+  TESTCTZ64N (int16_t, int16_t)                                                \
+  TESTCTZ64N (int16_t, int8_t)                                                 \
+  TESTCTZ64 (uint8_t, uint64_t)                                                \
+  TESTCTZ64 (uint8_t, uint32_t)                                                \
+  TESTCTZ64 (uint8_t, uint16_t)                                                \
+  TESTCTZ64 (uint8_t, uint8_t)                                                 \
+  TESTCTZ64 (uint8_t, int64_t)                                                 \
+  TESTCTZ64 (uint8_t, int32_t)                                                 \
+  TESTCTZ64 (uint8_t, int16_t)                                                 \
+  TESTCTZ64 (uint8_t, int8_t)                                                  \
+  TESTCTZ64N (int8_t, uint64_t)                                                \
+  TESTCTZ64N (int8_t, uint32_t)                                                \
+  TESTCTZ64N (int8_t, uint16_t)                                                \
+  TESTCTZ64N (int8_t, uint8_t)                                                 \
+  TESTCTZ64N (int8_t, int64_t)                                                 \
+  TESTCTZ64N (int8_t, int32_t)                                                 \
+  TESTCTZ64N (int8_t, int16_t)                                                 \
+  TESTCTZ64N (int8_t, int8_t)                                                  \
+  TESTCTZ32 (uint64_t, uint64_t)                                               \
+  TESTCTZ32 (uint64_t, uint32_t)                                               \
+  TESTCTZ32 (uint64_t, uint16_t)                                               \
+  TESTCTZ32 (uint64_t, uint8_t)                                                \
+  TESTCTZ32 (uint64_t, int64_t)                                                \
+  TESTCTZ32 (uint64_t, int32_t)                                                \
+  TESTCTZ32 (uint64_t, int16_t)                                                \
+  TESTCTZ32 (uint64_t, int8_t)                                                 \
+  TESTCTZ32N (int64_t, uint64_t)                                               \
+  TESTCTZ32N (int64_t, uint32_t)                                               \
+  TESTCTZ32N (int64_t, uint16_t)                                               \
+  TESTCTZ32N (int64_t, uint8_t)                                                \
+  TESTCTZ32N (int64_t, int64_t)                                                \
+  TESTCTZ32N (int64_t, int32_t)                                                \
+  TESTCTZ32N (int64_t, int16_t)                                                \
+  TESTCTZ32N (int64_t, int8_t)                                                 \
+  TESTCTZ32 (uint32_t, uint64_t)                                               \
+  TESTCTZ32 (uint32_t, uint32_t)                                               \
+  TESTCTZ32 (uint32_t, uint16_t)                                               \
+  TESTCTZ32 (uint32_t, uint8_t)                                                \
+  TESTCTZ32 (uint32_t, int64_t)                                                \
+  TESTCTZ32 (uint32_t, int32_t)                                                \
+  TESTCTZ32 (uint32_t, int16_t)                                                \
+  TESTCTZ32 (uint32_t, int8_t)                                                 \
+  TESTCTZ32N (int32_t, uint64_t)                                               \
+  TESTCTZ32N (int32_t, uint32_t)                                               \
+  TESTCTZ32N (int32_t, uint16_t)                                               \
+  TESTCTZ32N (int32_t, uint8_t)                                                \
+  TESTCTZ32N (int32_t, int64_t)                                                \
+  TESTCTZ32N (int32_t, int32_t)                                                \
+  TESTCTZ32N (int32_t, int16_t)                                                \
+  TESTCTZ32N (int32_t, int8_t)                                                 \
+  TESTCTZ32 (uint16_t, uint64_t)                                               \
+  TESTCTZ32 (uint16_t, uint32_t)                                               \
+  TESTCTZ32 (uint16_t, uint16_t)                                               \
+  TESTCTZ32 (uint16_t, uint8_t)                                                \
+  TESTCTZ32 (uint16_t, int64_t)                                                \
+  TESTCTZ32 (uint16_t, int32_t)                                                \
+  TESTCTZ32 (uint16_t, int16_t)                                                \
+  TESTCTZ32 (uint16_t, int8_t)                                                 \
+  TESTCTZ32N (int16_t, uint64_t)                                               \
+  TESTCTZ32N (int16_t, uint32_t)                                               \
+  TESTCTZ32N (int16_t, uint16_t)                                               \
+  TESTCTZ32N (int16_t, uint8_t)                                                \
+  TESTCTZ32N (int16_t, int64_t)                                                \
+  TESTCTZ32N (int16_t, int32_t)                                                \
+  TESTCTZ32N (int16_t, int16_t)                                                \
+  TESTCTZ32N (int16_t, int8_t)                                                 \
+  TESTCTZ32 (uint8_t, uint64_t)                                                \
+  TESTCTZ32 (uint8_t, uint32_t)                                                \
+  TESTCTZ32 (uint8_t, uint16_t)                                                \
+  TESTCTZ32 (uint8_t, uint8_t)                                                 \
+  TESTCTZ32 (uint8_t, int64_t)                                                 \
+  TESTCTZ32 (uint8_t, int32_t)                                                 \
+  TESTCTZ32 (uint8_t, int16_t)                                                 \
+  TESTCTZ32 (uint8_t, int8_t)                                                  \
+  TESTCTZ32N (int8_t, uint64_t)                                                \
+  TESTCTZ32N (int8_t, uint32_t)                                                \
+  TESTCTZ32N (int8_t, uint16_t)                                                \
+  TESTCTZ32N (int8_t, uint8_t)                                                 \
+  TESTCTZ32N (int8_t, int64_t)                                                 \
+  TESTCTZ32N (int8_t, int32_t)                                                 \
+  TESTCTZ32N (int8_t, int16_t)                                                 \
+  TESTCTZ32N (int8_t, int8_t)                                                  \
+  TESTFFS64 (uint64_t, uint64_t)                                               \
+  TESTFFS64 (uint64_t, uint32_t)                                               \
+  TESTFFS64 (uint64_t, uint16_t)                                               \
+  TESTFFS64 (uint64_t, uint8_t)                                                \
+  TESTFFS64 (uint64_t, int64_t)                                                \
+  TESTFFS64 (uint64_t, int32_t)                                                \
+  TESTFFS64 (uint64_t, int16_t)                                                \
+  TESTFFS64 (uint64_t, int8_t)                                                 \
+  TESTFFS64N (int64_t, uint64_t)                                               \
+  TESTFFS64N (int64_t, uint32_t)                                               \
+  TESTFFS64N (int64_t, uint16_t)                                               \
+  TESTFFS64N (int64_t, uint8_t)                                                \
+  TESTFFS64N (int64_t, int64_t)                                                \
+  TESTFFS64N (int64_t, int32_t)                                                \
+  TESTFFS64N (int64_t, int16_t)                                                \
+  TESTFFS64N (int64_t, int8_t)                                                 \
+  TESTFFS64 (uint32_t, uint64_t)                                               \
+  TESTFFS64 (uint32_t, uint32_t)                                               \
+  TESTFFS64 (uint32_t, uint16_t)                                               \
+  TESTFFS64 (uint32_t, uint8_t)                                                \
+  TESTFFS64 (uint32_t, int64_t)                                                \
+  TESTFFS64 (uint32_t, int32_t)                                                \
+  TESTFFS64 (uint32_t, int16_t)                                                \
+  TESTFFS64 (uint32_t, int8_t)                                                 \
+  TESTFFS64N (int32_t, uint64_t)                                               \
+  TESTFFS64N (int32_t, uint32_t)                                               \
+  TESTFFS64N (int32_t, uint16_t)                                               \
+  TESTFFS64N (int32_t, uint8_t)                                                \
+  TESTFFS64N (int32_t, int64_t)                                                \
+  TESTFFS64N (int32_t, int32_t)                                                \
+  TESTFFS64N (int32_t, int16_t)                                                \
+  TESTFFS64N (int32_t, int8_t)                                                 \
+  TESTFFS64 (uint16_t, uint64_t)                                               \
+  TESTFFS64 (uint16_t, uint32_t)                                               \
+  TESTFFS64 (uint16_t, uint16_t)                                               \
+  TESTFFS64 (uint16_t, uint8_t)                                                \
+  TESTFFS64 (uint16_t, int64_t)                                                \
+  TESTFFS64 (uint16_t, int32_t)                                                \
+  TESTFFS64 (uint16_t, int16_t)                                                \
+  TESTFFS64 (uint16_t, int8_t)                                                 \
+  TESTFFS64N (int16_t, uint64_t)                                               \
+  TESTFFS64N (int16_t, uint32_t)                                               \
+  TESTFFS64N (int16_t, uint16_t)                                               \
+  TESTFFS64N (int16_t, uint8_t)                                                \
+  TESTFFS64N (int16_t, int64_t)                                                \
+  TESTFFS64N (int16_t, int32_t)                                                \
+  TESTFFS64N (int16_t, int16_t)                                                \
+  TESTFFS64N (int16_t, int8_t)                                                 \
+  TESTFFS64 (uint8_t, uint64_t)                                                \
+  TESTFFS64 (uint8_t, uint32_t)                                                \
+  TESTFFS64 (uint8_t, uint16_t)                                                \
+  TESTFFS64 (uint8_t, uint8_t)                                                 \
+  TESTFFS64 (uint8_t, int64_t)                                                 \
+  TESTFFS64 (uint8_t, int32_t)                                                 \
+  TESTFFS64 (uint8_t, int16_t)                                                 \
+  TESTFFS64 (uint8_t, int8_t)                                                  \
+  TESTFFS64N (int8_t, uint64_t)                                                \
+  TESTFFS64N (int8_t, uint32_t)                                                \
+  TESTFFS64N (int8_t, uint16_t)                                                \
+  TESTFFS64N (int8_t, uint8_t)                                                 \
+  TESTFFS64N (int8_t, int64_t)                                                 \
+  TESTFFS64N (int8_t, int32_t)                                                 \
+  TESTFFS64N (int8_t, int16_t)                                                 \
+  TESTFFS64N (int8_t, int8_t)                                                  \
+  TESTFFS32 (uint64_t, uint64_t)                                               \
+  TESTFFS32 (uint64_t, uint32_t)                                               \
+  TESTFFS32 (uint64_t, uint16_t)                                               \
+  TESTFFS32 (uint64_t, uint8_t)                                                \
+  TESTFFS32 (uint64_t, int64_t)                                                \
+  TESTFFS32 (uint64_t, int32_t)                                                \
+  TESTFFS32 (uint64_t, int16_t)                                                \
+  TESTFFS32 (uint64_t, int8_t)                                                 \
+  TESTFFS32N (int64_t, uint64_t)                                               \
+  TESTFFS32N (int64_t, uint32_t)                                               \
+  TESTFFS32N (int64_t, uint16_t)                                               \
+  TESTFFS32N (int64_t, uint8_t)                                                \
+  TESTFFS32N (int64_t, int64_t)                                                \
+  TESTFFS32N (int64_t, int32_t)                                                \
+  TESTFFS32N (int64_t, int16_t)                                                \
+  TESTFFS32N (int64_t, int8_t)                                                 \
+  TESTFFS32 (uint32_t, uint64_t)                                               \
+  TESTFFS32 (uint32_t, uint32_t)                                               \
+  TESTFFS32 (uint32_t, uint16_t)                                               \
+  TESTFFS32 (uint32_t, uint8_t)                                                \
+  TESTFFS32 (uint32_t, int64_t)                                                \
+  TESTFFS32 (uint32_t, int32_t)                                                \
+  TESTFFS32 (uint32_t, int16_t)                                                \
+  TESTFFS32 (uint32_t, int8_t)                                                 \
+  TESTFFS32N (int32_t, uint64_t)                                               \
+  TESTFFS32N (int32_t, uint32_t)                                               \
+  TESTFFS32N (int32_t, uint16_t)                                               \
+  TESTFFS32N (int32_t, uint8_t)                                                \
+  TESTFFS32N (int32_t, int64_t)                                                \
+  TESTFFS32N (int32_t, int32_t)                                                \
+  TESTFFS32N (int32_t, int16_t)                                                \
+  TESTFFS32N (int32_t, int8_t)                                                 \
+  TESTFFS32 (uint16_t, uint64_t)                                               \
+  TESTFFS32 (uint16_t, uint32_t)                                               \
+  TESTFFS32 (uint16_t, uint16_t)                                               \
+  TESTFFS32 (uint16_t, uint8_t)                                                \
+  TESTFFS32 (uint16_t, int64_t)                                                \
+  TESTFFS32 (uint16_t, int32_t)                                                \
+  TESTFFS32 (uint16_t, int16_t)                                                \
+  TESTFFS32 (uint16_t, int8_t)                                                 \
+  TESTFFS32N (int16_t, uint64_t)                                               \
+  TESTFFS32N (int16_t, uint32_t)                                               \
+  TESTFFS32N (int16_t, uint16_t)                                               \
+  TESTFFS32N (int16_t, uint8_t)                                                \
+  TESTFFS32N (int16_t, int64_t)                                                \
+  TESTFFS32N (int16_t, int32_t)                                                \
+  TESTFFS32N (int16_t, int16_t)                                                \
+  TESTFFS32N (int16_t, int8_t)                                                 \
+  TESTFFS32 (uint8_t, uint64_t)                                                \
+  TESTFFS32 (uint8_t, uint32_t)                                                \
+  TESTFFS32 (uint8_t, uint16_t)                                                \
+  TESTFFS32 (uint8_t, uint8_t)                                                 \
+  TESTFFS32 (uint8_t, int64_t)                                                 \
+  TESTFFS32 (uint8_t, int32_t)                                                 \
+  TESTFFS32 (uint8_t, int16_t)                                                 \
+  TESTFFS32 (uint8_t, int8_t)                                                  \
+  TESTFFS32N (int8_t, uint64_t)                                                \
+  TESTFFS32N (int8_t, uint32_t)                                                \
+  TESTFFS32N (int8_t, uint16_t)                                                \
+  TESTFFS32N (int8_t, uint8_t)                                                 \
+  TESTFFS32N (int8_t, int64_t)                                                 \
+  TESTFFS32N (int8_t, int32_t)                                                 \
+  TESTFFS32N (int8_t, int16_t)                                                 \
+  TESTFFS32N (int8_t, int8_t)
+
+TEST_ALL ()
+
+#define RUN64(TYPEDST, TYPESRC) test64_##TYPEDST##TYPESRC ();
+#define RUN64N(TYPEDST, TYPESRC) test64n_##TYPEDST##TYPESRC ();
+#define RUN32(TYPEDST, TYPESRC) test32_##TYPEDST##TYPESRC ();
+#define RUN32N(TYPEDST, TYPESRC) test32n_##TYPEDST##TYPESRC ();
+#define RUNCTZ64(TYPEDST, TYPESRC) testctz64_##TYPEDST##TYPESRC ();
+#define RUNCTZ64N(TYPEDST, TYPESRC) testctz64n_##TYPEDST##TYPESRC ();
+#define RUNCTZ32(TYPEDST, TYPESRC) testctz32_##TYPEDST##TYPESRC ();
+#define RUNCTZ32N(TYPEDST, TYPESRC) testctz32n_##TYPEDST##TYPESRC ();
+#define RUNFFS64(TYPEDST, TYPESRC) testffs64_##TYPEDST##TYPESRC ();
+#define RUNFFS64N(TYPEDST, TYPESRC) testffs64n_##TYPEDST##TYPESRC ();
+#define RUNFFS32(TYPEDST, TYPESRC) testffs32_##TYPEDST##TYPESRC ();
+#define RUNFFS32N(TYPEDST, TYPESRC) testffs32n_##TYPEDST##TYPESRC ();
+
+#define RUN_ALL()                                                              \
+  RUN64 (uint64_t, uint64_t)                                                   \
+  RUN64 (uint64_t, uint32_t)                                                   \
+  RUN64 (uint64_t, uint16_t)                                                   \
+  RUN64 (uint64_t, uint8_t)                                                    \
+  RUN64 (uint64_t, int64_t)                                                    \
+  RUN64 (uint64_t, int32_t)                                                    \
+  RUN64 (uint64_t, int16_t)                                                    \
+  RUN64 (uint64_t, int8_t)                                                     \
+  RUN64N (int64_t, uint64_t)                                                    \
+  RUN64N (int64_t, uint32_t)                                                    \
+  RUN64N (int64_t, uint16_t)                                                    \
+  RUN64N (int64_t, uint8_t)                                                     \
+  RUN64N (int64_t, int64_t)                                                     \
+  RUN64N (int64_t, int32_t)                                                     \
+  RUN64N (int64_t, int16_t)                                                     \
+  RUN64N (int64_t, int8_t)                                                      \
+  RUN64 (uint32_t, uint64_t)                                                   \
+  RUN64 (uint32_t, uint32_t)                                                   \
+  RUN64 (uint32_t, uint16_t)                                                   \
+  RUN64 (uint32_t, uint8_t)                                                    \
+  RUN64 (uint32_t, int64_t)                                                    \
+  RUN64 (uint32_t, int32_t)                                                    \
+  RUN64 (uint32_t, int16_t)                                                    \
+  RUN64 (uint32_t, int8_t)                                                     \
+  RUN64N (int32_t, uint64_t)                                                    \
+  RUN64N (int32_t, uint32_t)                                                    \
+  RUN64N (int32_t, uint16_t)                                                    \
+  RUN64N (int32_t, uint8_t)                                                     \
+  RUN64N (int32_t, int64_t)                                                     \
+  RUN64N (int32_t, int32_t)                                                     \
+  RUN64N (int32_t, int16_t)                                                     \
+  RUN64N (int32_t, int8_t)                                                      \
+  RUN64 (uint16_t, uint64_t)                                                   \
+  RUN64 (uint16_t, uint32_t)                                                   \
+  RUN64 (uint16_t, uint16_t)                                                   \
+  RUN64 (uint16_t, uint8_t)                                                    \
+  RUN64 (uint16_t, int64_t)                                                    \
+  RUN64 (uint16_t, int32_t)                                                    \
+  RUN64 (uint16_t, int16_t)                                                    \
+  RUN64 (uint16_t, int8_t)                                                     \
+  RUN64N (int16_t, uint64_t)                                                    \
+  RUN64N (int16_t, uint32_t)                                                    \
+  RUN64N (int16_t, uint16_t)                                                    \
+  RUN64N (int16_t, uint8_t)                                                     \
+  RUN64N (int16_t, int64_t)                                                     \
+  RUN64N (int16_t, int32_t)                                                     \
+  RUN64N (int16_t, int16_t)                                                     \
+  RUN64N (int16_t, int8_t)                                                      \
+  RUN64 (uint8_t, uint64_t)                                                    \
+  RUN64 (uint8_t, uint32_t)                                                    \
+  RUN64 (uint8_t, uint16_t)                                                    \
+  RUN64 (uint8_t, uint8_t)                                                     \
+  RUN64 (uint8_t, int64_t)                                                     \
+  RUN64 (uint8_t, int32_t)                                                     \
+  RUN64 (uint8_t, int16_t)                                                     \
+  RUN64 (uint8_t, int8_t)                                                      \
+  RUN64N (int8_t, uint64_t)                                                     \
+  RUN64N (int8_t, uint32_t)                                                     \
+  RUN64N (int8_t, uint16_t)                                                     \
+  RUN64N (int8_t, uint8_t)                                                      \
+  RUN64N (int8_t, int64_t)                                                      \
+  RUN64N (int8_t, int32_t)                                                      \
+  RUN64N (int8_t, int16_t)                                                      \
+  RUN64N (int8_t, int8_t)                                                       \
+  RUN32 (uint64_t, uint64_t)                                                   \
+  RUN32 (uint64_t, uint32_t)                                                   \
+  RUN32 (uint64_t, uint16_t)                                                   \
+  RUN32 (uint64_t, uint8_t)                                                    \
+  RUN32 (uint64_t, int64_t)                                                    \
+  RUN32 (uint64_t, int32_t)                                                    \
+  RUN32 (uint64_t, int16_t)                                                    \
+  RUN32 (uint64_t, int8_t)                                                     \
+  RUN32N (int64_t, uint64_t)                                                    \
+  RUN32N (int64_t, uint32_t)                                                    \
+  RUN32N (int64_t, uint16_t)                                                    \
+  RUN32N (int64_t, uint8_t)                                                     \
+  RUN32N (int64_t, int64_t)                                                     \
+  RUN32N (int64_t, int32_t)                                                     \
+  RUN32N (int64_t, int16_t)                                                     \
+  RUN32N (int64_t, int8_t)                                                      \
+  RUN32 (uint32_t, uint64_t)                                                   \
+  RUN32 (uint32_t, uint32_t)                                                   \
+  RUN32 (uint32_t, uint16_t)                                                   \
+  RUN32 (uint32_t, uint8_t)                                                    \
+  RUN32 (uint32_t, int64_t)                                                    \
+  RUN32 (uint32_t, int32_t)                                                    \
+  RUN32 (uint32_t, int16_t)                                                    \
+  RUN32 (uint32_t, int8_t)                                                     \
+  RUN32N (int32_t, uint64_t)                                                    \
+  RUN32N (int32_t, uint32_t)                                                    \
+  RUN32N (int32_t, uint16_t)                                                    \
+  RUN32N (int32_t, uint8_t)                                                     \
+  RUN32N (int32_t, int64_t)                                                     \
+  RUN32N (int32_t, int32_t)                                                     \
+  RUN32N (int32_t, int16_t)                                                     \
+  RUN32N (int32_t, int8_t)                                                      \
+  RUN32 (uint16_t, uint64_t)                                                   \
+  RUN32 (uint16_t, uint32_t)                                                   \
+  RUN32 (uint16_t, uint16_t)                                                   \
+  RUN32 (uint16_t, uint8_t)                                                    \
+  RUN32 (uint16_t, int64_t)                                                    \
+  RUN32 (uint16_t, int32_t)                                                    \
+  RUN32 (uint16_t, int16_t)                                                    \
+  RUN32 (uint16_t, int8_t)                                                     \
+  RUN32N (int16_t, uint64_t)                                                    \
+  RUN32N (int16_t, uint32_t)                                                    \
+  RUN32N (int16_t, uint16_t)                                                    \
+  RUN32N (int16_t, uint8_t)                                                     \
+  RUN32N (int16_t, int64_t)                                                     \
+  RUN32N (int16_t, int32_t)                                                     \
+  RUN32N (int16_t, int16_t)                                                     \
+  RUN32N (int16_t, int8_t)                                                      \
+  RUN32 (uint8_t, uint64_t)                                                    \
+  RUN32 (uint8_t, uint32_t)                                                    \
+  RUN32 (uint8_t, uint16_t)                                                    \
+  RUN32 (uint8_t, uint8_t)                                                     \
+  RUN32 (uint8_t, int64_t)                                                     \
+  RUN32 (uint8_t, int32_t)                                                     \
+  RUN32 (uint8_t, int16_t)                                                     \
+  RUN32 (uint8_t, int8_t)                                                      \
+  RUN32N (int8_t, uint64_t)                                                     \
+  RUN32N (int8_t, uint32_t)                                                     \
+  RUN32N (int8_t, uint16_t)                                                     \
+  RUN32N (int8_t, uint8_t)                                                      \
+  RUN32N (int8_t, int64_t)                                                      \
+  RUN32N (int8_t, int32_t)                                                      \
+  RUN32N (int8_t, int16_t)                                                      \
+  RUN32N (int8_t, int8_t)                                                       \
+  RUNCTZ64 (uint64_t, uint64_t)                                                \
+  RUNCTZ64 (uint64_t, uint32_t)                                                \
+  RUNCTZ64 (uint64_t, uint16_t)                                                \
+  RUNCTZ64 (uint64_t, uint8_t)                                                 \
+  RUNCTZ64 (uint64_t, int64_t)                                                 \
+  RUNCTZ64 (uint64_t, int32_t)                                                 \
+  RUNCTZ64 (uint64_t, int16_t)                                                 \
+  RUNCTZ64 (uint64_t, int8_t)                                                  \
+  RUNCTZ64N (int64_t, uint64_t)                                                 \
+  RUNCTZ64N (int64_t, uint32_t)                                                 \
+  RUNCTZ64N (int64_t, uint16_t)                                                 \
+  RUNCTZ64N (int64_t, uint8_t)                                                  \
+  RUNCTZ64N (int64_t, int64_t)                                                  \
+  RUNCTZ64N (int64_t, int32_t)                                                  \
+  RUNCTZ64N (int64_t, int16_t)                                                  \
+  RUNCTZ64N (int64_t, int8_t)                                                   \
+  RUNCTZ64 (uint32_t, uint64_t)                                                \
+  RUNCTZ64 (uint32_t, uint32_t)                                                \
+  RUNCTZ64 (uint32_t, uint16_t)                                                \
+  RUNCTZ64 (uint32_t, uint8_t)                                                 \
+  RUNCTZ64 (uint32_t, int64_t)                                                 \
+  RUNCTZ64 (uint32_t, int32_t)                                                 \
+  RUNCTZ64 (uint32_t, int16_t)                                                 \
+  RUNCTZ64 (uint32_t, int8_t)                                                  \
+  RUNCTZ64N (int32_t, uint64_t)                                                 \
+  RUNCTZ64N (int32_t, uint32_t)                                                 \
+  RUNCTZ64N (int32_t, uint16_t)                                                 \
+  RUNCTZ64N (int32_t, uint8_t)                                                  \
+  RUNCTZ64N (int32_t, int64_t)                                                  \
+  RUNCTZ64N (int32_t, int32_t)                                                  \
+  RUNCTZ64N (int32_t, int16_t)                                                  \
+  RUNCTZ64N (int32_t, int8_t)                                                   \
+  RUNCTZ64 (uint16_t, uint64_t)                                                \
+  RUNCTZ64 (uint16_t, uint32_t)                                                \
+  RUNCTZ64 (uint16_t, uint16_t)                                                \
+  RUNCTZ64 (uint16_t, uint8_t)                                                 \
+  RUNCTZ64 (uint16_t, int64_t)                                                 \
+  RUNCTZ64 (uint16_t, int32_t)                                                 \
+  RUNCTZ64 (uint16_t, int16_t)                                                 \
+  RUNCTZ64 (uint16_t, int8_t)                                                  \
+  RUNCTZ64N (int16_t, uint64_t)                                                \
+  RUNCTZ64N (int16_t, uint32_t)                                                \
+  RUNCTZ64N (int16_t, uint16_t)                                                \
+  RUNCTZ64N (int16_t, uint8_t)                                                 \
+  RUNCTZ64N (int16_t, int64_t)                                                 \
+  RUNCTZ64N (int16_t, int32_t)                                                 \
+  RUNCTZ64N (int16_t, int16_t)                                                 \
+  RUNCTZ64N (int16_t, int8_t)                                                  \
+  RUNCTZ64 (uint8_t, uint64_t)                                                 \
+  RUNCTZ64 (uint8_t, uint32_t)                                                 \
+  RUNCTZ64 (uint8_t, uint16_t)                                                 \
+  RUNCTZ64 (uint8_t, uint8_t)                                                  \
+  RUNCTZ64 (uint8_t, int64_t)                                                  \
+  RUNCTZ64 (uint8_t, int32_t)                                                  \
+  RUNCTZ64 (uint8_t, int16_t)                                                  \
+  RUNCTZ64 (uint8_t, int8_t)                                                   \
+  RUNCTZ64N (int8_t, uint64_t)                                                 \
+  RUNCTZ64N (int8_t, uint32_t)                                                 \
+  RUNCTZ64N (int8_t, uint16_t)                                                 \
+  RUNCTZ64N (int8_t, uint8_t)                                                  \
+  RUNCTZ64N (int8_t, int64_t)                                                  \
+  RUNCTZ64N (int8_t, int32_t)                                                  \
+  RUNCTZ64N (int8_t, int16_t)                                                  \
+  RUNCTZ64N (int8_t, int8_t)                                                   \
+  RUNCTZ32 (uint64_t, uint64_t)                                                \
+  RUNCTZ32 (uint64_t, uint32_t)                                                \
+  RUNCTZ32 (uint64_t, uint16_t)                                                \
+  RUNCTZ32 (uint64_t, uint8_t)                                                 \
+  RUNCTZ32 (uint64_t, int64_t)                                                 \
+  RUNCTZ32 (uint64_t, int32_t)                                                 \
+  RUNCTZ32 (uint64_t, int16_t)                                                 \
+  RUNCTZ32 (uint64_t, int8_t)                                                  \
+  RUNCTZ32N (int64_t, uint64_t)                                                \
+  RUNCTZ32N (int64_t, uint32_t)                                                \
+  RUNCTZ32N (int64_t, uint16_t)                                                \
+  RUNCTZ32N (int64_t, uint8_t)                                                 \
+  RUNCTZ32N (int64_t, int64_t)                                                 \
+  RUNCTZ32N (int64_t, int32_t)                                                 \
+  RUNCTZ32N (int64_t, int16_t)                                                 \
+  RUNCTZ32N (int64_t, int8_t)                                                  \
+  RUNCTZ32 (uint32_t, uint64_t)                                                \
+  RUNCTZ32 (uint32_t, uint32_t)                                                \
+  RUNCTZ32 (uint32_t, uint16_t)                                                \
+  RUNCTZ32 (uint32_t, uint8_t)                                                 \
+  RUNCTZ32 (uint32_t, int64_t)                                                 \
+  RUNCTZ32 (uint32_t, int32_t)                                                 \
+  RUNCTZ32 (uint32_t, int16_t)                                                 \
+  RUNCTZ32 (uint32_t, int8_t)                                                  \
+  RUNCTZ32N (int32_t, uint64_t)                                                \
+  RUNCTZ32N (int32_t, uint32_t)                                                \
+  RUNCTZ32N (int32_t, uint16_t)                                                \
+  RUNCTZ32N (int32_t, uint8_t)                                                 \
+  RUNCTZ32N (int32_t, int64_t)                                                 \
+  RUNCTZ32N (int32_t, int32_t)                                                 \
+  RUNCTZ32N (int32_t, int16_t)                                                 \
+  RUNCTZ32N (int32_t, int8_t)                                                  \
+  RUNCTZ32 (uint16_t, uint64_t)                                                \
+  RUNCTZ32 (uint16_t, uint32_t)                                                \
+  RUNCTZ32 (uint16_t, uint16_t)                                                \
+  RUNCTZ32 (uint16_t, uint8_t)                                                 \
+  RUNCTZ32 (uint16_t, int64_t)                                                 \
+  RUNCTZ32 (uint16_t, int32_t)                                                 \
+  RUNCTZ32 (uint16_t, int16_t)                                                 \
+  RUNCTZ32 (uint16_t, int8_t)                                                  \
+  RUNCTZ32N (int16_t, uint64_t)                                                \
+  RUNCTZ32N (int16_t, uint32_t)                                                \
+  RUNCTZ32N (int16_t, uint16_t)                                                \
+  RUNCTZ32N (int16_t, uint8_t)                                                 \
+  RUNCTZ32N (int16_t, int64_t)                                                 \
+  RUNCTZ32N (int16_t, int32_t)                                                 \
+  RUNCTZ32N (int16_t, int16_t)                                                 \
+  RUNCTZ32N (int16_t, int8_t)                                                  \
+  RUNCTZ32 (uint8_t, uint64_t)                                                 \
+  RUNCTZ32 (uint8_t, uint32_t)                                                 \
+  RUNCTZ32 (uint8_t, uint16_t)                                                 \
+  RUNCTZ32 (uint8_t, uint8_t)                                                  \
+  RUNCTZ32 (uint8_t, int64_t)                                                  \
+  RUNCTZ32 (uint8_t, int32_t)                                                  \
+  RUNCTZ32 (uint8_t, int16_t)                                                  \
+  RUNCTZ32 (uint8_t, int8_t)                                                   \
+  RUNCTZ32N (int8_t, uint64_t)                                                 \
+  RUNCTZ32N (int8_t, uint32_t)                                                 \
+  RUNCTZ32N (int8_t, uint16_t)                                                 \
+  RUNCTZ32N (int8_t, uint8_t)                                                  \
+  RUNCTZ32N (int8_t, int64_t)                                                  \
+  RUNCTZ32N (int8_t, int32_t)                                                  \
+  RUNCTZ32N (int8_t, int16_t)                                                  \
+  RUNCTZ32N (int8_t, int8_t)                                                   \
+  RUNFFS64 (uint64_t, uint64_t)                                                \
+  RUNFFS64 (uint64_t, uint32_t)                                                \
+  RUNFFS64 (uint64_t, uint16_t)                                                \
+  RUNFFS64 (uint64_t, uint8_t)                                                 \
+  RUNFFS64 (uint64_t, int64_t)                                                 \
+  RUNFFS64 (uint64_t, int32_t)                                                 \
+  RUNFFS64 (uint64_t, int16_t)                                                 \
+  RUNFFS64 (uint64_t, int8_t)                                                  \
+  RUNFFS64N (int64_t, uint64_t)                                                \
+  RUNFFS64N (int64_t, uint32_t)                                                \
+  RUNFFS64N (int64_t, uint16_t)                                                \
+  RUNFFS64N (int64_t, uint8_t)                                                 \
+  RUNFFS64N (int64_t, int64_t)                                                 \
+  RUNFFS64N (int64_t, int32_t)                                                 \
+  RUNFFS64N (int64_t, int16_t)                                                 \
+  RUNFFS64N (int64_t, int8_t)                                                  \
+  RUNFFS64 (uint32_t, uint64_t)                                                \
+  RUNFFS64 (uint32_t, uint32_t)                                                \
+  RUNFFS64 (uint32_t, uint16_t)                                                \
+  RUNFFS64 (uint32_t, uint8_t)                                                 \
+  RUNFFS64 (uint32_t, int64_t)                                                 \
+  RUNFFS64 (uint32_t, int32_t)                                                 \
+  RUNFFS64 (uint32_t, int16_t)                                                 \
+  RUNFFS64 (uint32_t, int8_t)                                                  \
+  RUNFFS64N (int32_t, uint64_t)                                                \
+  RUNFFS64N (int32_t, uint32_t)                                                \
+  RUNFFS64N (int32_t, uint16_t)                                                \
+  RUNFFS64N (int32_t, uint8_t)                                                 \
+  RUNFFS64N (int32_t, int64_t)                                                 \
+  RUNFFS64N (int32_t, int32_t)                                                 \
+  RUNFFS64N (int32_t, int16_t)                                                 \
+  RUNFFS64N (int32_t, int8_t)                                                  \
+  RUNFFS64 (uint16_t, uint64_t)                                                \
+  RUNFFS64 (uint16_t, uint32_t)                                                \
+  RUNFFS64 (uint16_t, uint16_t)                                                \
+  RUNFFS64 (uint16_t, uint8_t)                                                 \
+  RUNFFS64 (uint16_t, int64_t)                                                 \
+  RUNFFS64 (uint16_t, int32_t)                                                 \
+  RUNFFS64 (uint16_t, int16_t)                                                 \
+  RUNFFS64 (uint16_t, int8_t)                                                  \
+  RUNFFS64N (int16_t, uint64_t)                                                \
+  RUNFFS64N (int16_t, uint32_t)                                                \
+  RUNFFS64N (int16_t, uint16_t)                                                \
+  RUNFFS64N (int16_t, uint8_t)                                                 \
+  RUNFFS64N (int16_t, int64_t)                                                 \
+  RUNFFS64N (int16_t, int32_t)                                                 \
+  RUNFFS64N (int16_t, int16_t)                                                 \
+  RUNFFS64N (int16_t, int8_t)                                                  \
+  RUNFFS64 (uint8_t, uint64_t)                                                 \
+  RUNFFS64 (uint8_t, uint32_t)                                                 \
+  RUNFFS64 (uint8_t, uint16_t)                                                 \
+  RUNFFS64 (uint8_t, uint8_t)                                                  \
+  RUNFFS64 (uint8_t, int64_t)                                                  \
+  RUNFFS64 (uint8_t, int32_t)                                                  \
+  RUNFFS64 (uint8_t, int16_t)                                                  \
+  RUNFFS64 (uint8_t, int8_t)                                                   \
+  RUNFFS64N (int8_t, uint64_t)                                                 \
+  RUNFFS64N (int8_t, uint32_t)                                                 \
+  RUNFFS64N (int8_t, uint16_t)                                                 \
+  RUNFFS64N (int8_t, uint8_t)                                                  \
+  RUNFFS64N (int8_t, int64_t)                                                  \
+  RUNFFS64N (int8_t, int32_t)                                                  \
+  RUNFFS64N (int8_t, int16_t)                                                  \
+  RUNFFS64N (int8_t, int8_t)                                                   \
+  RUNFFS32 (uint64_t, uint64_t)                                                \
+  RUNFFS32 (uint64_t, uint32_t)                                                \
+  RUNFFS32 (uint64_t, uint16_t)                                                \
+  RUNFFS32 (uint64_t, uint8_t)                                                 \
+  RUNFFS32 (uint64_t, int64_t)                                                 \
+  RUNFFS32 (uint64_t, int32_t)                                                 \
+  RUNFFS32 (uint64_t, int16_t)                                                 \
+  RUNFFS32 (uint64_t, int8_t)                                                  \
+  RUNFFS32N (int64_t, uint64_t)                                                \
+  RUNFFS32N (int64_t, uint32_t)                                                \
+  RUNFFS32N (int64_t, uint16_t)                                                \
+  RUNFFS32N (int64_t, uint8_t)                                                 \
+  RUNFFS32N (int64_t, int64_t)                                                 \
+  RUNFFS32N (int64_t, int32_t)                                                 \
+  RUNFFS32N (int64_t, int16_t)                                                 \
+  RUNFFS32N (int64_t, int8_t)                                                  \
+  RUNFFS32 (uint32_t, uint64_t)                                                \
+  RUNFFS32 (uint32_t, uint32_t)                                                \
+  RUNFFS32 (uint32_t, uint16_t)                                                \
+  RUNFFS32 (uint32_t, uint8_t)                                                 \
+  RUNFFS32 (uint32_t, int64_t)                                                 \
+  RUNFFS32 (uint32_t, int32_t)                                                 \
+  RUNFFS32 (uint32_t, int16_t)                                                 \
+  RUNFFS32 (uint32_t, int8_t)                                                  \
+  RUNFFS32N (int32_t, uint64_t)                                                \
+  RUNFFS32N (int32_t, uint32_t)                                                \
+  RUNFFS32N (int32_t, uint16_t)                                                \
+  RUNFFS32N (int32_t, uint8_t)                                                 \
+  RUNFFS32N (int32_t, int64_t)                                                 \
+  RUNFFS32N (int32_t, int32_t)                                                 \
+  RUNFFS32N (int32_t, int16_t)                                                 \
+  RUNFFS32N (int32_t, int8_t)                                                  \
+  RUNFFS32 (uint16_t, uint64_t)                                                \
+  RUNFFS32 (uint16_t, uint32_t)                                                \
+  RUNFFS32 (uint16_t, uint16_t)                                                \
+  RUNFFS32 (uint16_t, uint8_t)                                                 \
+  RUNFFS32 (uint16_t, int64_t)                                                 \
+  RUNFFS32 (uint16_t, int32_t)                                                 \
+  RUNFFS32 (uint16_t, int16_t)                                                 \
+  RUNFFS32 (uint16_t, int8_t)                                                  \
+  RUNFFS32N (int16_t, uint64_t)                                                \
+  RUNFFS32N (int16_t, uint32_t)                                                \
+  RUNFFS32N (int16_t, uint16_t)                                                \
+  RUNFFS32N (int16_t, uint8_t)                                                 \
+  RUNFFS32N (int16_t, int64_t)                                                 \
+  RUNFFS32N (int16_t, int32_t)                                                 \
+  RUNFFS32N (int16_t, int16_t)                                                 \
+  RUNFFS32N (int16_t, int8_t)                                                  \
+  RUNFFS32 (uint8_t, uint64_t)                                                 \
+  RUNFFS32 (uint8_t, uint32_t)                                                 \
+  RUNFFS32 (uint8_t, uint16_t)                                                 \
+  RUNFFS32 (uint8_t, uint8_t)                                                  \
+  RUNFFS32 (uint8_t, int64_t)                                                  \
+  RUNFFS32 (uint8_t, int32_t)                                                  \
+  RUNFFS32 (uint8_t, int16_t)                                                  \
+  RUNFFS32 (uint8_t, int8_t)                                                   \
+  RUNFFS32N (int8_t, uint64_t)                                                 \
+  RUNFFS32N (int8_t, uint32_t)                                                 \
+  RUNFFS32N (int8_t, uint16_t)                                                 \
+  RUNFFS32N (int8_t, uint8_t)                                                  \
+  RUNFFS32N (int8_t, int64_t)                                                  \
+  RUNFFS32N (int8_t, int32_t)                                                  \
+  RUNFFS32N (int8_t, int16_t)                                                  \
+  RUNFFS32N (int8_t, int8_t)
+
+int
+main ()
+{
+  RUN_ALL ()
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 229 "vect" } } */
-- 
2.41.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: [PATCH] RISC-V: Add popcount fallback expander.
  2023-10-18 11:43   ` Robin Dapp
@ 2023-10-18 11:48     ` juzhe.zhong
  2023-10-18 12:22     ` juzhe.zhong
  2023-10-18 13:51     ` Robin Dapp
  2 siblings, 0 replies; 10+ messages in thread
From: juzhe.zhong @ 2023-10-18 11:48 UTC (permalink / raw)
  To: Robin Dapp, gcc-patches, palmer, kito.cheng, jeffreyalaw; +Cc: Robin Dapp

[-- Attachment #1: Type: text/plain, Size: 126318 bytes --]

LGTM



juzhe.zhong@rivai.ai
 
From: Robin Dapp
Date: 2023-10-18 19:43
To: juzhe.zhong@rivai.ai; gcc-patches; palmer; kito.cheng; jeffreyalaw
CC: rdapp.gcc
Subject: Re: [PATCH] RISC-V: Add popcount fallback expander.
> I saw you didn't extend VI -> V_VLSI. I guess SLP on popcount will fail.
 
Added VLS modes and your test in v2.
 
Testsuite looks unchanged on my side (vect, dg, rvv).
 
Regards
Robin
 
Subject: [PATCH v2] RISC-V: Add popcount fallback expander.
 
I didn't manage to get back to the generic vectorizer fallback for
popcount so I figured I'd rather create a popcount fallback in the
riscv backend.  It uses the WWG algorithm from libgcc.
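
For readers unfamiliar with the WWG (Wilkes-Wheeler-Gill) algorithm, here is
a minimal scalar sketch of the four steps that the expander emits per vector
element.  The function names are hypothetical and this is not libgcc's
source; it only assumes that the masks are truncated to the element width
and that the final shift is the element width minus 8, as the expander does
with gen_int_mode and GET_MODE_BITSIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the 64-bit case.  */
static inline uint64_t
popcount64_wwg (uint64_t x)
{
  /* Each 2-bit field now holds the number of set bits it contained.  */
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  /* Add neighbouring 2-bit counts into 4-bit fields.  */
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  /* Add neighbouring 4-bit counts into bytes.  */
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  /* The multiply sums all bytes into the top byte; shift it down.  */
  return (x * 0x0101010101010101ULL) >> 56;
}

/* Same steps for 32-bit elements: truncated masks, shift by 32 - 8.  */
static inline uint32_t
popcount32_wwg (uint32_t x)
{
  x = x - ((x >> 1) & 0x55555555U);
  x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);
  x = (x + (x >> 4)) & 0x0F0F0F0FU;
  return (x * 0x01010101U) >> 24;
}

/* Spot checks using input/result pairs from popcount-run-1.c.  */
static inline void
wwg_self_check (void)
{
  assert (popcount32_wwg (0x11111100U) == 6);
  assert (popcount32_wwg (0xe0e0f0f0U) == 14);
  assert (popcount64_wwg (0) == 0);
}
```

The multiply in the last step cannot overflow a byte because the largest
possible count, 64, fits into eight bits.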
 
gcc/ChangeLog:
 
* config/riscv/autovec.md (popcount<mode>2): New expander.
* config/riscv/riscv-protos.h (expand_popcount): Define.
* config/riscv/riscv-v.cc (expand_popcount): Vectorize popcount
with the WWG algorithm.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/unop/popcount-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-2.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount.c: New test.
---
gcc/config/riscv/autovec.md                   |   14 +
gcc/config/riscv/riscv-protos.h               |    1 +
gcc/config/riscv/riscv-v.cc                   |   71 +
.../riscv/rvv/autovec/unop/popcount-1.c       |   20 +
.../riscv/rvv/autovec/unop/popcount-2.c       |   19 +
.../riscv/rvv/autovec/unop/popcount-run-1.c   |   49 +
.../riscv/rvv/autovec/unop/popcount.c         | 1464 +++++++++++++++++
7 files changed, 1638 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
 
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index c5b1e52cbf9..80910ba3cc2 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1484,6 +1484,20 @@ (define_expand "xorsign<mode>3"
   DONE;
})
+;; -------------------------------------------------------------------------------
+;; - [INT] POPCOUNT.
+;; -------------------------------------------------------------------------------
+
+(define_expand "popcount<mode>2"
+  [(match_operand:V_VLSI 0 "register_operand")
+   (match_operand:V_VLSI 1 "register_operand")]
+  "TARGET_VECTOR"
+{
+  riscv_vector::expand_popcount (operands);
+  DONE;
+})
+
+
;; -------------------------------------------------------------------------
;; ---- [INT] Highpart multiplication
;; -------------------------------------------------------------------------
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 49bdcdf2f93..4aeccdd961b 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -515,6 +515,7 @@ void expand_fold_extract_last (rtx *);
void expand_cond_unop (unsigned, rtx *);
void expand_cond_binop (unsigned, rtx *);
void expand_cond_ternop (unsigned, rtx *);
+void expand_popcount (rtx *);
/* Rounding mode bitfield for fixed point VXRM.  */
enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 21d86c3f917..8b594b7127e 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4152,4 +4152,75 @@ expand_vec_lfloor (rtx op_0, rtx op_1, machine_mode vec_fp_mode,
   emit_vec_cvt_x_f (op_0, op_1, UNARY_OP_FRM_RDN, vec_fp_mode);
}
+/* Vectorize popcount by the Wilkes-Wheeler-Gill algorithm that libgcc uses as
+   well.  */
+void
+expand_popcount (rtx *ops)
+{
+  rtx dst = ops[0];
+  rtx src = ops[1];
+  machine_mode mode = GET_MODE (dst);
+  scalar_mode imode = GET_MODE_INNER (mode);
+  static const uint64_t m5 = 0x5555555555555555ULL;
+  static const uint64_t m3 = 0x3333333333333333ULL;
+  static const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL;
+  static const uint64_t m1 = 0x0101010101010101ULL;
+
+  rtx x1 = gen_reg_rtx (mode);
+  rtx x2 = gen_reg_rtx (mode);
+  rtx x3 = gen_reg_rtx (mode);
+  rtx x4 = gen_reg_rtx (mode);
+
+  /* x1 = src - ((src >> 1) & 0x555...);  */
+  rtx shift1 = expand_binop (mode, lshr_optab, src, GEN_INT (1), NULL, true,
+      OPTAB_DIRECT);
+
+  rtx and1 = gen_reg_rtx (mode);
+  rtx ops1[] = {and1, shift1, gen_int_mode (m5, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+    ops1);
+
+  x1 = expand_binop (mode, sub_optab, src, and1, NULL, true, OPTAB_DIRECT);
+
+  /* x2 = (x1 & 0x3333333333333333ULL) + ((x1 >> 2) & 0x3333333333333333ULL);
+   */
+  rtx and2 = gen_reg_rtx (mode);
+  rtx ops2[] = {and2, x1, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+    ops2);
+
+  rtx shift2 = expand_binop (mode, lshr_optab, x1, GEN_INT (2), NULL, true,
+      OPTAB_DIRECT);
+
+  rtx and22 = gen_reg_rtx (mode);
+  rtx ops22[] = {and22, shift2, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+    ops22);
+
+  x2 = expand_binop (mode, add_optab, and2, and22, NULL, true, OPTAB_DIRECT);
+
+  /* x3 = (x2 + (x2 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;  */
+  rtx shift3 = expand_binop (mode, lshr_optab, x2, GEN_INT (4), NULL, true,
+      OPTAB_DIRECT);
+
+  rtx plus3
+    = expand_binop (mode, add_optab, x2, shift3, NULL, true, OPTAB_DIRECT);
+
+  rtx ops3[] = {x3, plus3, gen_int_mode (mf, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+    ops3);
+
+  /* dest = (x3 * 0x0101010101010101ULL) >> (element bitsize - 8);  */
+  rtx mul4 = gen_reg_rtx (mode);
+  rtx ops4[] = {mul4, x3, gen_int_mode (m1, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (MULT, mode), riscv_vector::BINARY_OP,
+    ops4);
+
+  x4 = expand_binop (mode, lshr_optab, mul4,
+      GEN_INT (GET_MODE_BITSIZE (imode) - 8), NULL, true,
+      OPTAB_DIRECT);
+
+  emit_move_insn (dst, x4);
+}
+
} // namespace riscv_vector
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
new file mode 100644
index 00000000000..3169ebbff71
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv_zvfh -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-vect-details" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noipa))
+popcount_32 (uint32_t *restrict dst, uint32_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcount (src[i]);
+}
+
+void __attribute__ ((noipa))
+popcount_64 (uint64_t *restrict dst, uint64_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcountll (src[i]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 2 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
new file mode 100644
index 00000000000..9c0970afdfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-slp-details" } */
+
+int x[8];
+int y[8];
+
+void foo ()
+{
+  x[0] = __builtin_popcount (y[0]);
+  x[1] = __builtin_popcount (y[1]);
+  x[2] = __builtin_popcount (y[2]);
+  x[3] = __builtin_popcount (y[3]);
+  x[4] = __builtin_popcount (y[4]);
+  x[5] = __builtin_popcount (y[5]);
+  x[6] = __builtin_popcount (y[6]);
+  x[7] = __builtin_popcount (y[7]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
new file mode 100644
index 00000000000..38f1633da99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
@@ -0,0 +1,49 @@
+/* { dg-do run { target { riscv_v } } } */
+
+#include "popcount-1.c"
+
+extern void abort (void) __attribute__ ((noreturn));
+
+unsigned int data[] = {
+  0x11111100, 6,
+  0xe0e0f0f0, 14,
+  0x9900aab3, 13,
+  0x00040003, 3,
+  0x000e000c, 5,
+  0x22227777, 16,
+  0x12341234, 10,
+  0x0, 0
+};
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  unsigned int count = sizeof (data) / sizeof (data[0]) / 2;
+
+  uint32_t in32[count];
+  uint32_t out32[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in32[i] = data[i * 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_32 (out32, in32, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out32[i] != data[i * 2 + 1])
+      abort ();
+
+  count /= 2;
+  uint64_t in64[count];
+  uint64_t out64[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in64[i] = ((uint64_t) data[i * 4] << 32) | data[i * 4 + 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_64 (out64, in64, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out64[i] != data[i * 4 + 1] + data[i * 4 + 3])
+      abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
new file mode 100644
index 00000000000..585a522aa81
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
@@ -0,0 +1,1464 @@
+/* { dg-do run { target { riscv_v } } } */
+/* { dg-additional-options { -O2 -fdump-tree-vect-details -fno-vect-cost-model } }  */
+
+#include "stdint-gcc.h"
+#include <assert.h>
+
+#define DEF64(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+ int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcountll (src[i]);                                  \
+  }
+
+#define DEF32(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+ int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcount (src[i]);                                    \
+  }
+
+#define DEFCTZ64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+     int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctzll (src[i]);                                       \
+  }
+
+#define DEFCTZ32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+     int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctz (src[i]);                                         \
+  }
+
+#define DEFFFS64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+     int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffsll (src[i]);                                       \
+  }
+
+#define DEFFFS32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+     int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffs (src[i]);                                         \
+  }
+
+#define DEF_ALL()                                                              \
+  DEF64 (uint64_t, uint64_t)                                                   \
+  DEF64 (uint64_t, uint32_t)                                                   \
+  DEF64 (uint64_t, uint16_t)                                                   \
+  DEF64 (uint64_t, uint8_t)                                                    \
+  DEF64 (uint64_t, int64_t)                                                    \
+  DEF64 (uint64_t, int32_t)                                                    \
+  DEF64 (uint64_t, int16_t)                                                    \
+  DEF64 (uint64_t, int8_t)                                                     \
+  DEF64 (int64_t, uint64_t)                                                    \
+  DEF64 (int64_t, uint32_t)                                                    \
+  DEF64 (int64_t, uint16_t)                                                    \
+  DEF64 (int64_t, uint8_t)                                                     \
+  DEF64 (int64_t, int64_t)                                                     \
+  DEF64 (int64_t, int32_t)                                                     \
+  DEF64 (int64_t, int16_t)                                                     \
+  DEF64 (int64_t, int8_t)                                                      \
+  DEF64 (uint32_t, uint64_t)                                                   \
+  DEF64 (uint32_t, uint32_t)                                                   \
+  DEF64 (uint32_t, uint16_t)                                                   \
+  DEF64 (uint32_t, uint8_t)                                                    \
+  DEF64 (uint32_t, int64_t)                                                    \
+  DEF64 (uint32_t, int32_t)                                                    \
+  DEF64 (uint32_t, int16_t)                                                    \
+  DEF64 (uint32_t, int8_t)                                                     \
+  DEF64 (int32_t, uint64_t)                                                    \
+  DEF64 (int32_t, uint32_t)                                                    \
+  DEF64 (int32_t, uint16_t)                                                    \
+  DEF64 (int32_t, uint8_t)                                                     \
+  DEF64 (int32_t, int64_t)                                                     \
+  DEF64 (int32_t, int32_t)                                                     \
+  DEF64 (int32_t, int16_t)                                                     \
+  DEF64 (int32_t, int8_t)                                                      \
+  DEF64 (uint16_t, uint64_t)                                                   \
+  DEF64 (uint16_t, uint32_t)                                                   \
+  DEF64 (uint16_t, uint16_t)                                                   \
+  DEF64 (uint16_t, uint8_t)                                                    \
+  DEF64 (uint16_t, int64_t)                                                    \
+  DEF64 (uint16_t, int32_t)                                                    \
+  DEF64 (uint16_t, int16_t)                                                    \
+  DEF64 (uint16_t, int8_t)                                                     \
+  DEF64 (int16_t, uint64_t)                                                    \
+  DEF64 (int16_t, uint32_t)                                                    \
+  DEF64 (int16_t, uint16_t)                                                    \
+  DEF64 (int16_t, uint8_t)                                                     \
+  DEF64 (int16_t, int64_t)                                                     \
+  DEF64 (int16_t, int32_t)                                                     \
+  DEF64 (int16_t, int16_t)                                                     \
+  DEF64 (int16_t, int8_t)                                                      \
+  DEF64 (uint8_t, uint64_t)                                                    \
+  DEF64 (uint8_t, uint32_t)                                                    \
+  DEF64 (uint8_t, uint16_t)                                                    \
+  DEF64 (uint8_t, uint8_t)                                                     \
+  DEF64 (uint8_t, int64_t)                                                     \
+  DEF64 (uint8_t, int32_t)                                                     \
+  DEF64 (uint8_t, int16_t)                                                     \
+  DEF64 (uint8_t, int8_t)                                                      \
+  DEF64 (int8_t, uint64_t)                                                     \
+  DEF64 (int8_t, uint32_t)                                                     \
+  DEF64 (int8_t, uint16_t)                                                     \
+  DEF64 (int8_t, uint8_t)                                                      \
+  DEF64 (int8_t, int64_t)                                                      \
+  DEF64 (int8_t, int32_t)                                                      \
+  DEF64 (int8_t, int16_t)                                                      \
+  DEF64 (int8_t, int8_t)                                                       \
+  DEF32 (uint64_t, uint64_t)                                                   \
+  DEF32 (uint64_t, uint32_t)                                                   \
+  DEF32 (uint64_t, uint16_t)                                                   \
+  DEF32 (uint64_t, uint8_t)                                                    \
+  DEF32 (uint64_t, int64_t)                                                    \
+  DEF32 (uint64_t, int32_t)                                                    \
+  DEF32 (uint64_t, int16_t)                                                    \
+  DEF32 (uint64_t, int8_t)                                                     \
+  DEF32 (int64_t, uint64_t)                                                    \
+  DEF32 (int64_t, uint32_t)                                                    \
+  DEF32 (int64_t, uint16_t)                                                    \
+  DEF32 (int64_t, uint8_t)                                                     \
+  DEF32 (int64_t, int64_t)                                                     \
+  DEF32 (int64_t, int32_t)                                                     \
+  DEF32 (int64_t, int16_t)                                                     \
+  DEF32 (int64_t, int8_t)                                                      \
+  DEF32 (uint32_t, uint64_t)                                                   \
+  DEF32 (uint32_t, uint32_t)                                                   \
+  DEF32 (uint32_t, uint16_t)                                                   \
+  DEF32 (uint32_t, uint8_t)                                                    \
+  DEF32 (uint32_t, int64_t)                                                    \
+  DEF32 (uint32_t, int32_t)                                                    \
+  DEF32 (uint32_t, int16_t)                                                    \
+  DEF32 (uint32_t, int8_t)                                                     \
+  DEF32 (int32_t, uint64_t)                                                    \
+  DEF32 (int32_t, uint32_t)                                                    \
+  DEF32 (int32_t, uint16_t)                                                    \
+  DEF32 (int32_t, uint8_t)                                                     \
+  DEF32 (int32_t, int64_t)                                                     \
+  DEF32 (int32_t, int32_t)                                                     \
+  DEF32 (int32_t, int16_t)                                                     \
+  DEF32 (int32_t, int8_t)                                                      \
+  DEF32 (uint16_t, uint64_t)                                                   \
+  DEF32 (uint16_t, uint32_t)                                                   \
+  DEF32 (uint16_t, uint16_t)                                                   \
+  DEF32 (uint16_t, uint8_t)                                                    \
+  DEF32 (uint16_t, int64_t)                                                    \
+  DEF32 (uint16_t, int32_t)                                                    \
+  DEF32 (uint16_t, int16_t)                                                    \
+  DEF32 (uint16_t, int8_t)                                                     \
+  DEF32 (int16_t, uint64_t)                                                    \
+  DEF32 (int16_t, uint32_t)                                                    \
+  DEF32 (int16_t, uint16_t)                                                    \
+  DEF32 (int16_t, uint8_t)                                                     \
+  DEF32 (int16_t, int64_t)                                                     \
+  DEF32 (int16_t, int32_t)                                                     \
+  DEF32 (int16_t, int16_t)                                                     \
+  DEF32 (int16_t, int8_t)                                                      \
+  DEF32 (uint8_t, uint64_t)                                                    \
+  DEF32 (uint8_t, uint32_t)                                                    \
+  DEF32 (uint8_t, uint16_t)                                                    \
+  DEF32 (uint8_t, uint8_t)                                                     \
+  DEF32 (uint8_t, int64_t)                                                     \
+  DEF32 (uint8_t, int32_t)                                                     \
+  DEF32 (uint8_t, int16_t)                                                     \
+  DEF32 (uint8_t, int8_t)                                                      \
+  DEF32 (int8_t, uint64_t)                                                     \
+  DEF32 (int8_t, uint32_t)                                                     \
+  DEF32 (int8_t, uint16_t)                                                     \
+  DEF32 (int8_t, uint8_t)                                                      \
+  DEF32 (int8_t, int64_t)                                                      \
+  DEF32 (int8_t, int32_t)                                                      \
+  DEF32 (int8_t, int16_t)                                                      \
+  DEF32 (int8_t, int8_t)                                                       \
+  DEFCTZ64 (uint64_t, uint64_t)                                                \
+  DEFCTZ64 (uint64_t, uint32_t)                                                \
+  DEFCTZ64 (uint64_t, uint16_t)                                                \
+  DEFCTZ64 (uint64_t, uint8_t)                                                 \
+  DEFCTZ64 (uint64_t, int64_t)                                                 \
+  DEFCTZ64 (uint64_t, int32_t)                                                 \
+  DEFCTZ64 (uint64_t, int16_t)                                                 \
+  DEFCTZ64 (uint64_t, int8_t)                                                  \
+  DEFCTZ64 (int64_t, uint64_t)                                                 \
+  DEFCTZ64 (int64_t, uint32_t)                                                 \
+  DEFCTZ64 (int64_t, uint16_t)                                                 \
+  DEFCTZ64 (int64_t, uint8_t)                                                  \
+  DEFCTZ64 (int64_t, int64_t)                                                  \
+  DEFCTZ64 (int64_t, int32_t)                                                  \
+  DEFCTZ64 (int64_t, int16_t)                                                  \
+  DEFCTZ64 (int64_t, int8_t)                                                   \
+  DEFCTZ64 (uint32_t, uint64_t)                                                \
+  DEFCTZ64 (uint32_t, uint32_t)                                                \
+  DEFCTZ64 (uint32_t, uint16_t)                                                \
+  DEFCTZ64 (uint32_t, uint8_t)                                                 \
+  DEFCTZ64 (uint32_t, int64_t)                                                 \
+  DEFCTZ64 (uint32_t, int32_t)                                                 \
+  DEFCTZ64 (uint32_t, int16_t)                                                 \
+  DEFCTZ64 (uint32_t, int8_t)                                                  \
+  DEFCTZ64 (int32_t, uint64_t)                                                 \
+  DEFCTZ64 (int32_t, uint32_t)                                                 \
+  DEFCTZ64 (int32_t, uint16_t)                                                 \
+  DEFCTZ64 (int32_t, uint8_t)                                                  \
+  DEFCTZ64 (int32_t, int64_t)                                                  \
+  DEFCTZ64 (int32_t, int32_t)                                                  \
+  DEFCTZ64 (int32_t, int16_t)                                                  \
+  DEFCTZ64 (int32_t, int8_t)                                                   \
+  DEFCTZ64 (uint16_t, uint64_t)                                                \
+  DEFCTZ64 (uint16_t, uint32_t)                                                \
+  DEFCTZ64 (uint16_t, uint16_t)                                                \
+  DEFCTZ64 (uint16_t, uint8_t)                                                 \
+  DEFCTZ64 (uint16_t, int64_t)                                                 \
+  DEFCTZ64 (uint16_t, int32_t)                                                 \
+  DEFCTZ64 (uint16_t, int16_t)                                                 \
+  DEFCTZ64 (uint16_t, int8_t)                                                  \
+  DEFCTZ64 (int16_t, uint64_t)                                                 \
+  DEFCTZ64 (int16_t, uint32_t)                                                 \
+  DEFCTZ64 (int16_t, uint16_t)                                                 \
+  DEFCTZ64 (int16_t, uint8_t)                                                  \
+  DEFCTZ64 (int16_t, int64_t)                                                  \
+  DEFCTZ64 (int16_t, int32_t)                                                  \
+  DEFCTZ64 (int16_t, int16_t)                                                  \
+  DEFCTZ64 (int16_t, int8_t)                                                   \
+  DEFCTZ64 (uint8_t, uint64_t)                                                 \
+  DEFCTZ64 (uint8_t, uint32_t)                                                 \
+  DEFCTZ64 (uint8_t, uint16_t)                                                 \
+  DEFCTZ64 (uint8_t, uint8_t)                                                  \
+  DEFCTZ64 (uint8_t, int64_t)                                                  \
+  DEFCTZ64 (uint8_t, int32_t)                                                  \
+  DEFCTZ64 (uint8_t, int16_t)                                                  \
+  DEFCTZ64 (uint8_t, int8_t)                                                   \
+  DEFCTZ64 (int8_t, uint64_t)                                                  \
+  DEFCTZ64 (int8_t, uint32_t)                                                  \
+  DEFCTZ64 (int8_t, uint16_t)                                                  \
+  DEFCTZ64 (int8_t, uint8_t)                                                   \
+  DEFCTZ64 (int8_t, int64_t)                                                   \
+  DEFCTZ64 (int8_t, int32_t)                                                   \
+  DEFCTZ64 (int8_t, int16_t)                                                   \
+  DEFCTZ64 (int8_t, int8_t)                                                    \
+  DEFCTZ32 (uint64_t, uint64_t)                                                \
+  DEFCTZ32 (uint64_t, uint32_t)                                                \
+  DEFCTZ32 (uint64_t, uint16_t)                                                \
+  DEFCTZ32 (uint64_t, uint8_t)                                                 \
+  DEFCTZ32 (uint64_t, int64_t)                                                 \
+  DEFCTZ32 (uint64_t, int32_t)                                                 \
+  DEFCTZ32 (uint64_t, int16_t)                                                 \
+  DEFCTZ32 (uint64_t, int8_t)                                                  \
+  DEFCTZ32 (int64_t, uint64_t)                                                 \
+  DEFCTZ32 (int64_t, uint32_t)                                                 \
+  DEFCTZ32 (int64_t, uint16_t)                                                 \
+  DEFCTZ32 (int64_t, uint8_t)                                                  \
+  DEFCTZ32 (int64_t, int64_t)                                                  \
+  DEFCTZ32 (int64_t, int32_t)                                                  \
+  DEFCTZ32 (int64_t, int16_t)                                                  \
+  DEFCTZ32 (int64_t, int8_t)                                                   \
+  DEFCTZ32 (uint32_t, uint64_t)                                                \
+  DEFCTZ32 (uint32_t, uint32_t)                                                \
+  DEFCTZ32 (uint32_t, uint16_t)                                                \
+  DEFCTZ32 (uint32_t, uint8_t)                                                 \
+  DEFCTZ32 (uint32_t, int64_t)                                                 \
+  DEFCTZ32 (uint32_t, int32_t)                                                 \
+  DEFCTZ32 (uint32_t, int16_t)                                                 \
+  DEFCTZ32 (uint32_t, int8_t)                                                  \
+  DEFCTZ32 (int32_t, uint64_t)                                                 \
+  DEFCTZ32 (int32_t, uint32_t)                                                 \
+  DEFCTZ32 (int32_t, uint16_t)                                                 \
+  DEFCTZ32 (int32_t, uint8_t)                                                  \
+  DEFCTZ32 (int32_t, int64_t)                                                  \
+  DEFCTZ32 (int32_t, int32_t)                                                  \
+  DEFCTZ32 (int32_t, int16_t)                                                  \
+  DEFCTZ32 (int32_t, int8_t)                                                   \
+  DEFCTZ32 (uint16_t, uint64_t)                                                \
+  DEFCTZ32 (uint16_t, uint32_t)                                                \
+  DEFCTZ32 (uint16_t, uint16_t)                                                \
+  DEFCTZ32 (uint16_t, uint8_t)                                                 \
+  DEFCTZ32 (uint16_t, int64_t)                                                 \
+  DEFCTZ32 (uint16_t, int32_t)                                                 \
+  DEFCTZ32 (uint16_t, int16_t)                                                 \
+  DEFCTZ32 (uint16_t, int8_t)                                                  \
+  DEFCTZ32 (int16_t, uint64_t)                                                 \
+  DEFCTZ32 (int16_t, uint32_t)                                                 \
+  DEFCTZ32 (int16_t, uint16_t)                                                 \
+  DEFCTZ32 (int16_t, uint8_t)                                                  \
+  DEFCTZ32 (int16_t, int64_t)                                                  \
+  DEFCTZ32 (int16_t, int32_t)                                                  \
+  DEFCTZ32 (int16_t, int16_t)                                                  \
+  DEFCTZ32 (int16_t, int8_t)                                                   \
+  DEFCTZ32 (uint8_t, uint64_t)                                                 \
+  DEFCTZ32 (uint8_t, uint32_t)                                                 \
+  DEFCTZ32 (uint8_t, uint16_t)                                                 \
+  DEFCTZ32 (uint8_t, uint8_t)                                                  \
+  DEFCTZ32 (uint8_t, int64_t)                                                  \
+  DEFCTZ32 (uint8_t, int32_t)                                                  \
+  DEFCTZ32 (uint8_t, int16_t)                                                  \
+  DEFCTZ32 (uint8_t, int8_t)                                                   \
+  DEFCTZ32 (int8_t, uint64_t)                                                  \
+  DEFCTZ32 (int8_t, uint32_t)                                                  \
+  DEFCTZ32 (int8_t, uint16_t)                                                  \
+  DEFCTZ32 (int8_t, uint8_t)                                                   \
+  DEFCTZ32 (int8_t, int64_t)                                                   \
+  DEFCTZ32 (int8_t, int32_t)                                                   \
+  DEFCTZ32 (int8_t, int16_t)                                                   \
+  DEFCTZ32 (int8_t, int8_t)                                                    \
+  DEFFFS64 (uint64_t, uint64_t)                                                \
+  DEFFFS64 (uint64_t, uint32_t)                                                \
+  DEFFFS64 (uint64_t, uint16_t)                                                \
+  DEFFFS64 (uint64_t, uint8_t)                                                 \
+  DEFFFS64 (uint64_t, int64_t)                                                 \
+  DEFFFS64 (uint64_t, int32_t)                                                 \
+  DEFFFS64 (uint64_t, int16_t)                                                 \
+  DEFFFS64 (uint64_t, int8_t)                                                  \
+  DEFFFS64 (int64_t, uint64_t)                                                 \
+  DEFFFS64 (int64_t, uint32_t)                                                 \
+  DEFFFS64 (int64_t, uint16_t)                                                 \
+  DEFFFS64 (int64_t, uint8_t)                                                  \
+  DEFFFS64 (int64_t, int64_t)                                                  \
+  DEFFFS64 (int64_t, int32_t)                                                  \
+  DEFFFS64 (int64_t, int16_t)                                                  \
+  DEFFFS64 (int64_t, int8_t)                                                   \
+  DEFFFS64 (uint32_t, uint64_t)                                                \
+  DEFFFS64 (uint32_t, uint32_t)                                                \
+  DEFFFS64 (uint32_t, uint16_t)                                                \
+  DEFFFS64 (uint32_t, uint8_t)                                                 \
+  DEFFFS64 (uint32_t, int64_t)                                                 \
+  DEFFFS64 (uint32_t, int32_t)                                                 \
+  DEFFFS64 (uint32_t, int16_t)                                                 \
+  DEFFFS64 (uint32_t, int8_t)                                                  \
+  DEFFFS64 (int32_t, uint64_t)                                                 \
+  DEFFFS64 (int32_t, uint32_t)                                                 \
+  DEFFFS64 (int32_t, uint16_t)                                                 \
+  DEFFFS64 (int32_t, uint8_t)                                                  \
+  DEFFFS64 (int32_t, int64_t)                                                  \
+  DEFFFS64 (int32_t, int32_t)                                                  \
+  DEFFFS64 (int32_t, int16_t)                                                  \
+  DEFFFS64 (int32_t, int8_t)                                                   \
+  DEFFFS64 (uint16_t, uint64_t)                                                \
+  DEFFFS64 (uint16_t, uint32_t)                                                \
+  DEFFFS64 (uint16_t, uint16_t)                                                \
+  DEFFFS64 (uint16_t, uint8_t)                                                 \
+  DEFFFS64 (uint16_t, int64_t)                                                 \
+  DEFFFS64 (uint16_t, int32_t)                                                 \
+  DEFFFS64 (uint16_t, int16_t)                                                 \
+  DEFFFS64 (uint16_t, int8_t)                                                  \
+  DEFFFS64 (int16_t, uint64_t)                                                 \
+  DEFFFS64 (int16_t, uint32_t)                                                 \
+  DEFFFS64 (int16_t, uint16_t)                                                 \
+  DEFFFS64 (int16_t, uint8_t)                                                  \
+  DEFFFS64 (int16_t, int64_t)                                                  \
+  DEFFFS64 (int16_t, int32_t)                                                  \
+  DEFFFS64 (int16_t, int16_t)                                                  \
+  DEFFFS64 (int16_t, int8_t)                                                   \
+  DEFFFS64 (uint8_t, uint64_t)                                                 \
+  DEFFFS64 (uint8_t, uint32_t)                                                 \
+  DEFFFS64 (uint8_t, uint16_t)                                                 \
+  DEFFFS64 (uint8_t, uint8_t)                                                  \
+  DEFFFS64 (uint8_t, int64_t)                                                  \
+  DEFFFS64 (uint8_t, int32_t)                                                  \
+  DEFFFS64 (uint8_t, int16_t)                                                  \
+  DEFFFS64 (uint8_t, int8_t)                                                   \
+  DEFFFS64 (int8_t, uint64_t)                                                  \
+  DEFFFS64 (int8_t, uint32_t)                                                  \
+  DEFFFS64 (int8_t, uint16_t)                                                  \
+  DEFFFS64 (int8_t, uint8_t)                                                   \
+  DEFFFS64 (int8_t, int64_t)                                                   \
+  DEFFFS64 (int8_t, int32_t)                                                   \
+  DEFFFS64 (int8_t, int16_t)                                                   \
+  DEFFFS64 (int8_t, int8_t)                                                    \
+  DEFFFS32 (uint64_t, uint64_t)                                                \
+  DEFFFS32 (uint64_t, uint32_t)                                                \
+  DEFFFS32 (uint64_t, uint16_t)                                                \
+  DEFFFS32 (uint64_t, uint8_t)                                                 \
+  DEFFFS32 (uint64_t, int64_t)                                                 \
+  DEFFFS32 (uint64_t, int32_t)                                                 \
+  DEFFFS32 (uint64_t, int16_t)                                                 \
+  DEFFFS32 (uint64_t, int8_t)                                                  \
+  DEFFFS32 (int64_t, uint64_t)                                                 \
+  DEFFFS32 (int64_t, uint32_t)                                                 \
+  DEFFFS32 (int64_t, uint16_t)                                                 \
+  DEFFFS32 (int64_t, uint8_t)                                                  \
+  DEFFFS32 (int64_t, int64_t)                                                  \
+  DEFFFS32 (int64_t, int32_t)                                                  \
+  DEFFFS32 (int64_t, int16_t)                                                  \
+  DEFFFS32 (int64_t, int8_t)                                                   \
+  DEFFFS32 (uint32_t, uint64_t)                                                \
+  DEFFFS32 (uint32_t, uint32_t)                                                \
+  DEFFFS32 (uint32_t, uint16_t)                                                \
+  DEFFFS32 (uint32_t, uint8_t)                                                 \
+  DEFFFS32 (uint32_t, int64_t)                                                 \
+  DEFFFS32 (uint32_t, int32_t)                                                 \
+  DEFFFS32 (uint32_t, int16_t)                                                 \
+  DEFFFS32 (uint32_t, int8_t)                                                  \
+  DEFFFS32 (int32_t, uint64_t)                                                 \
+  DEFFFS32 (int32_t, uint32_t)                                                 \
+  DEFFFS32 (int32_t, uint16_t)                                                 \
+  DEFFFS32 (int32_t, uint8_t)                                                  \
+  DEFFFS32 (int32_t, int64_t)                                                  \
+  DEFFFS32 (int32_t, int32_t)                                                  \
+  DEFFFS32 (int32_t, int16_t)                                                  \
+  DEFFFS32 (int32_t, int8_t)                                                   \
+  DEFFFS32 (uint16_t, uint64_t)                                                \
+  DEFFFS32 (uint16_t, uint32_t)                                                \
+  DEFFFS32 (uint16_t, uint16_t)                                                \
+  DEFFFS32 (uint16_t, uint8_t)                                                 \
+  DEFFFS32 (uint16_t, int64_t)                                                 \
+  DEFFFS32 (uint16_t, int32_t)                                                 \
+  DEFFFS32 (uint16_t, int16_t)                                                 \
+  DEFFFS32 (uint16_t, int8_t)                                                  \
+  DEFFFS32 (int16_t, uint64_t)                                                 \
+  DEFFFS32 (int16_t, uint32_t)                                                 \
+  DEFFFS32 (int16_t, uint16_t)                                                 \
+  DEFFFS32 (int16_t, uint8_t)                                                  \
+  DEFFFS32 (int16_t, int64_t)                                                  \
+  DEFFFS32 (int16_t, int32_t)                                                  \
+  DEFFFS32 (int16_t, int16_t)                                                  \
+  DEFFFS32 (int16_t, int8_t)                                                   \
+  DEFFFS32 (uint8_t, uint64_t)                                                 \
+  DEFFFS32 (uint8_t, uint32_t)                                                 \
+  DEFFFS32 (uint8_t, uint16_t)                                                 \
+  DEFFFS32 (uint8_t, uint8_t)                                                  \
+  DEFFFS32 (uint8_t, int64_t)                                                  \
+  DEFFFS32 (uint8_t, int32_t)                                                  \
+  DEFFFS32 (uint8_t, int16_t)                                                  \
+  DEFFFS32 (uint8_t, int8_t)                                                   \
+  DEFFFS32 (int8_t, uint64_t)                                                  \
+  DEFFFS32 (int8_t, uint32_t)                                                  \
+  DEFFFS32 (int8_t, uint16_t)                                                  \
+  DEFFFS32 (int8_t, uint8_t)                                                   \
+  DEFFFS32 (int8_t, int64_t)                                                   \
+  DEFFFS32 (int8_t, int32_t)                                                   \
+  DEFFFS32 (int8_t, int16_t)                                                   \
+  DEFFFS32 (int8_t, int8_t)
+
+DEF_ALL ()
+
+#define SZ 512
+
+#define TEST64(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test64_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST64N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test64n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST32(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test32_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TEST32N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test32n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TESTCTZ64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTCTZ32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        if (src[i] != 0)                                                       \
+          assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567890;                                              \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567890;                                             \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * 1234567;                                                 \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TESTFFS32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        int ia = i + 1;                                                        \
+        src[i] = ia * -1234567;                                                \
+        dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+        assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TEST_ALL()                                                             \
+  TEST64 (uint64_t, uint64_t)                                                  \
+  TEST64 (uint64_t, uint32_t)                                                  \
+  TEST64 (uint64_t, uint16_t)                                                  \
+  TEST64 (uint64_t, uint8_t)                                                   \
+  TEST64 (uint64_t, int64_t)                                                   \
+  TEST64 (uint64_t, int32_t)                                                   \
+  TEST64 (uint64_t, int16_t)                                                   \
+  TEST64 (uint64_t, int8_t)                                                    \
+  TEST64N (int64_t, uint64_t)                                                  \
+  TEST64N (int64_t, uint32_t)                                                  \
+  TEST64N (int64_t, uint16_t)                                                  \
+  TEST64N (int64_t, uint8_t)                                                   \
+  TEST64N (int64_t, int64_t)                                                   \
+  TEST64N (int64_t, int32_t)                                                   \
+  TEST64N (int64_t, int16_t)                                                   \
+  TEST64N (int64_t, int8_t)                                                    \
+  TEST64 (uint32_t, uint64_t)                                                  \
+  TEST64 (uint32_t, uint32_t)                                                  \
+  TEST64 (uint32_t, uint16_t)                                                  \
+  TEST64 (uint32_t, uint8_t)                                                   \
+  TEST64 (uint32_t, int64_t)                                                   \
+  TEST64 (uint32_t, int32_t)                                                   \
+  TEST64 (uint32_t, int16_t)                                                   \
+  TEST64 (uint32_t, int8_t)                                                    \
+  TEST64N (int32_t, uint64_t)                                                  \
+  TEST64N (int32_t, uint32_t)                                                  \
+  TEST64N (int32_t, uint16_t)                                                  \
+  TEST64N (int32_t, uint8_t)                                                   \
+  TEST64N (int32_t, int64_t)                                                   \
+  TEST64N (int32_t, int32_t)                                                   \
+  TEST64N (int32_t, int16_t)                                                   \
+  TEST64N (int32_t, int8_t)                                                    \
+  TEST64 (uint16_t, uint64_t)                                                  \
+  TEST64 (uint16_t, uint32_t)                                                  \
+  TEST64 (uint16_t, uint16_t)                                                  \
+  TEST64 (uint16_t, uint8_t)                                                   \
+  TEST64 (uint16_t, int64_t)                                                   \
+  TEST64 (uint16_t, int32_t)                                                   \
+  TEST64 (uint16_t, int16_t)                                                   \
+  TEST64 (uint16_t, int8_t)                                                    \
+  TEST64N (int16_t, uint64_t)                                                  \
+  TEST64N (int16_t, uint32_t)                                                  \
+  TEST64N (int16_t, uint16_t)                                                  \
+  TEST64N (int16_t, uint8_t)                                                   \
+  TEST64N (int16_t, int64_t)                                                   \
+  TEST64N (int16_t, int32_t)                                                   \
+  TEST64N (int16_t, int16_t)                                                   \
+  TEST64N (int16_t, int8_t)                                                    \
+  TEST64 (uint8_t, uint64_t)                                                   \
+  TEST64 (uint8_t, uint32_t)                                                   \
+  TEST64 (uint8_t, uint16_t)                                                   \
+  TEST64 (uint8_t, uint8_t)                                                    \
+  TEST64 (uint8_t, int64_t)                                                    \
+  TEST64 (uint8_t, int32_t)                                                    \
+  TEST64 (uint8_t, int16_t)                                                    \
+  TEST64 (uint8_t, int8_t)                                                     \
+  TEST64N (int8_t, uint64_t)                                                   \
+  TEST64N (int8_t, uint32_t)                                                   \
+  TEST64N (int8_t, uint16_t)                                                   \
+  TEST64N (int8_t, uint8_t)                                                    \
+  TEST64N (int8_t, int64_t)                                                    \
+  TEST64N (int8_t, int32_t)                                                    \
+  TEST64N (int8_t, int16_t)                                                    \
+  TEST64N (int8_t, int8_t)                                                     \
+  TEST32 (uint64_t, uint64_t)                                                  \
+  TEST32 (uint64_t, uint32_t)                                                  \
+  TEST32 (uint64_t, uint16_t)                                                  \
+  TEST32 (uint64_t, uint8_t)                                                   \
+  TEST32 (uint64_t, int64_t)                                                   \
+  TEST32 (uint64_t, int32_t)                                                   \
+  TEST32 (uint64_t, int16_t)                                                   \
+  TEST32 (uint64_t, int8_t)                                                    \
+  TEST32N (int64_t, uint64_t)                                                  \
+  TEST32N (int64_t, uint32_t)                                                  \
+  TEST32N (int64_t, uint16_t)                                                  \
+  TEST32N (int64_t, uint8_t)                                                   \
+  TEST32N (int64_t, int64_t)                                                   \
+  TEST32N (int64_t, int32_t)                                                   \
+  TEST32N (int64_t, int16_t)                                                   \
+  TEST32N (int64_t, int8_t)                                                    \
+  TEST32 (uint32_t, uint64_t)                                                  \
+  TEST32 (uint32_t, uint32_t)                                                  \
+  TEST32 (uint32_t, uint16_t)                                                  \
+  TEST32 (uint32_t, uint8_t)                                                   \
+  TEST32 (uint32_t, int64_t)                                                   \
+  TEST32 (uint32_t, int32_t)                                                   \
+  TEST32 (uint32_t, int16_t)                                                   \
+  TEST32 (uint32_t, int8_t)                                                    \
+  TEST32N (int32_t, uint64_t)                                                  \
+  TEST32N (int32_t, uint32_t)                                                  \
+  TEST32N (int32_t, uint16_t)                                                  \
+  TEST32N (int32_t, uint8_t)                                                   \
+  TEST32N (int32_t, int64_t)                                                   \
+  TEST32N (int32_t, int32_t)                                                   \
+  TEST32N (int32_t, int16_t)                                                   \
+  TEST32N (int32_t, int8_t)                                                    \
+  TEST32 (uint16_t, uint64_t)                                                  \
+  TEST32 (uint16_t, uint32_t)                                                  \
+  TEST32 (uint16_t, uint16_t)                                                  \
+  TEST32 (uint16_t, uint8_t)                                                   \
+  TEST32 (uint16_t, int64_t)                                                   \
+  TEST32 (uint16_t, int32_t)                                                   \
+  TEST32 (uint16_t, int16_t)                                                   \
+  TEST32 (uint16_t, int8_t)                                                    \
+  TEST32N (int16_t, uint64_t)                                                  \
+  TEST32N (int16_t, uint32_t)                                                  \
+  TEST32N (int16_t, uint16_t)                                                  \
+  TEST32N (int16_t, uint8_t)                                                   \
+  TEST32N (int16_t, int64_t)                                                   \
+  TEST32N (int16_t, int32_t)                                                   \
+  TEST32N (int16_t, int16_t)                                                   \
+  TEST32N (int16_t, int8_t)                                                    \
+  TEST32 (uint8_t, uint64_t)                                                   \
+  TEST32 (uint8_t, uint32_t)                                                   \
+  TEST32 (uint8_t, uint16_t)                                                   \
+  TEST32 (uint8_t, uint8_t)                                                    \
+  TEST32 (uint8_t, int64_t)                                                    \
+  TEST32 (uint8_t, int32_t)                                                    \
+  TEST32 (uint8_t, int16_t)                                                    \
+  TEST32 (uint8_t, int8_t)                                                     \
+  TEST32N (int8_t, uint64_t)                                                   \
+  TEST32N (int8_t, uint32_t)                                                   \
+  TEST32N (int8_t, uint16_t)                                                   \
+  TEST32N (int8_t, uint8_t)                                                    \
+  TEST32N (int8_t, int64_t)                                                    \
+  TEST32N (int8_t, int32_t)                                                    \
+  TEST32N (int8_t, int16_t)                                                    \
+  TEST32N (int8_t, int8_t)                                                     \
+  TESTCTZ64 (uint64_t, uint64_t)                                               \
+  TESTCTZ64 (uint64_t, uint32_t)                                               \
+  TESTCTZ64 (uint64_t, uint16_t)                                               \
+  TESTCTZ64 (uint64_t, uint8_t)                                                \
+  TESTCTZ64 (uint64_t, int64_t)                                                \
+  TESTCTZ64 (uint64_t, int32_t)                                                \
+  TESTCTZ64 (uint64_t, int16_t)                                                \
+  TESTCTZ64 (uint64_t, int8_t)                                                 \
+  TESTCTZ64N (int64_t, uint64_t)                                               \
+  TESTCTZ64N (int64_t, uint32_t)                                               \
+  TESTCTZ64N (int64_t, uint16_t)                                               \
+  TESTCTZ64N (int64_t, uint8_t)                                                \
+  TESTCTZ64N (int64_t, int64_t)                                                \
+  TESTCTZ64N (int64_t, int32_t)                                                \
+  TESTCTZ64N (int64_t, int16_t)                                                \
+  TESTCTZ64N (int64_t, int8_t)                                                 \
+  TESTCTZ64 (uint32_t, uint64_t)                                               \
+  TESTCTZ64 (uint32_t, uint32_t)                                               \
+  TESTCTZ64 (uint32_t, uint16_t)                                               \
+  TESTCTZ64 (uint32_t, uint8_t)                                                \
+  TESTCTZ64 (uint32_t, int64_t)                                                \
+  TESTCTZ64 (uint32_t, int32_t)                                                \
+  TESTCTZ64 (uint32_t, int16_t)                                                \
+  TESTCTZ64 (uint32_t, int8_t)                                                 \
+  TESTCTZ64N (int32_t, uint64_t)                                               \
+  TESTCTZ64N (int32_t, uint32_t)                                               \
+  TESTCTZ64N (int32_t, uint16_t)                                               \
+  TESTCTZ64N (int32_t, uint8_t)                                                \
+  TESTCTZ64N (int32_t, int64_t)                                                \
+  TESTCTZ64N (int32_t, int32_t)                                                \
+  TESTCTZ64N (int32_t, int16_t)                                                \
+  TESTCTZ64N (int32_t, int8_t)                                                 \
+  TESTCTZ64 (uint16_t, uint64_t)                                               \
+  TESTCTZ64 (uint16_t, uint32_t)                                               \
+  TESTCTZ64 (uint16_t, uint16_t)                                               \
+  TESTCTZ64 (uint16_t, uint8_t)                                                \
+  TESTCTZ64 (uint16_t, int64_t)                                                \
+  TESTCTZ64 (uint16_t, int32_t)                                                \
+  TESTCTZ64 (uint16_t, int16_t)                                                \
+  TESTCTZ64 (uint16_t, int8_t)                                                 \
+  TESTCTZ64N (int16_t, uint64_t)                                               \
+  TESTCTZ64N (int16_t, uint32_t)                                               \
+  TESTCTZ64N (int16_t, uint16_t)                                               \
+  TESTCTZ64N (int16_t, uint8_t)                                                \
+  TESTCTZ64N (int16_t, int64_t)                                                \
+  TESTCTZ64N (int16_t, int32_t)                                                \
+  TESTCTZ64N (int16_t, int16_t)                                                \
+  TESTCTZ64N (int16_t, int8_t)                                                 \
+  TESTCTZ64 (uint8_t, uint64_t)                                                \
+  TESTCTZ64 (uint8_t, uint32_t)                                                \
+  TESTCTZ64 (uint8_t, uint16_t)                                                \
+  TESTCTZ64 (uint8_t, uint8_t)                                                 \
+  TESTCTZ64 (uint8_t, int64_t)                                                 \
+  TESTCTZ64 (uint8_t, int32_t)                                                 \
+  TESTCTZ64 (uint8_t, int16_t)                                                 \
+  TESTCTZ64 (uint8_t, int8_t)                                                  \
+  TESTCTZ64N (int8_t, uint64_t)                                                \
+  TESTCTZ64N (int8_t, uint32_t)                                                \
+  TESTCTZ64N (int8_t, uint16_t)                                                \
+  TESTCTZ64N (int8_t, uint8_t)                                                 \
+  TESTCTZ64N (int8_t, int64_t)                                                 \
+  TESTCTZ64N (int8_t, int32_t)                                                 \
+  TESTCTZ64N (int8_t, int16_t)                                                 \
+  TESTCTZ64N (int8_t, int8_t)                                                  \
+  TESTCTZ32 (uint64_t, uint64_t)                                               \
+  TESTCTZ32 (uint64_t, uint32_t)                                               \
+  TESTCTZ32 (uint64_t, uint16_t)                                               \
+  TESTCTZ32 (uint64_t, uint8_t)                                                \
+  TESTCTZ32 (uint64_t, int64_t)                                                \
+  TESTCTZ32 (uint64_t, int32_t)                                                \
+  TESTCTZ32 (uint64_t, int16_t)                                                \
+  TESTCTZ32 (uint64_t, int8_t)                                                 \
+  TESTCTZ32N (int64_t, uint64_t)                                               \
+  TESTCTZ32N (int64_t, uint32_t)                                               \
+  TESTCTZ32N (int64_t, uint16_t)                                               \
+  TESTCTZ32N (int64_t, uint8_t)                                                \
+  TESTCTZ32N (int64_t, int64_t)                                                \
+  TESTCTZ32N (int64_t, int32_t)                                                \
+  TESTCTZ32N (int64_t, int16_t)                                                \
+  TESTCTZ32N (int64_t, int8_t)                                                 \
+  TESTCTZ32 (uint32_t, uint64_t)                                               \
+  TESTCTZ32 (uint32_t, uint32_t)                                               \
+  TESTCTZ32 (uint32_t, uint16_t)                                               \
+  TESTCTZ32 (uint32_t, uint8_t)                                                \
+  TESTCTZ32 (uint32_t, int64_t)                                                \
+  TESTCTZ32 (uint32_t, int32_t)                                                \
+  TESTCTZ32 (uint32_t, int16_t)                                                \
+  TESTCTZ32 (uint32_t, int8_t)                                                 \
+  TESTCTZ32N (int32_t, uint64_t)                                               \
+  TESTCTZ32N (int32_t, uint32_t)                                               \
+  TESTCTZ32N (int32_t, uint16_t)                                               \
+  TESTCTZ32N (int32_t, uint8_t)                                                \
+  TESTCTZ32N (int32_t, int64_t)                                                \
+  TESTCTZ32N (int32_t, int32_t)                                                \
+  TESTCTZ32N (int32_t, int16_t)                                                \
+  TESTCTZ32N (int32_t, int8_t)                                                 \
+  TESTCTZ32 (uint16_t, uint64_t)                                               \
+  TESTCTZ32 (uint16_t, uint32_t)                                               \
+  TESTCTZ32 (uint16_t, uint16_t)                                               \
+  TESTCTZ32 (uint16_t, uint8_t)                                                \
+  TESTCTZ32 (uint16_t, int64_t)                                                \
+  TESTCTZ32 (uint16_t, int32_t)                                                \
+  TESTCTZ32 (uint16_t, int16_t)                                                \
+  TESTCTZ32 (uint16_t, int8_t)                                                 \
+  TESTCTZ32N (int16_t, uint64_t)                                               \
+  TESTCTZ32N (int16_t, uint32_t)                                               \
+  TESTCTZ32N (int16_t, uint16_t)                                               \
+  TESTCTZ32N (int16_t, uint8_t)                                                \
+  TESTCTZ32N (int16_t, int64_t)                                                \
+  TESTCTZ32N (int16_t, int32_t)                                                \
+  TESTCTZ32N (int16_t, int16_t)                                                \
+  TESTCTZ32N (int16_t, int8_t)                                                 \
+  TESTCTZ32 (uint8_t, uint64_t)                                                \
+  TESTCTZ32 (uint8_t, uint32_t)                                                \
+  TESTCTZ32 (uint8_t, uint16_t)                                                \
+  TESTCTZ32 (uint8_t, uint8_t)                                                 \
+  TESTCTZ32 (uint8_t, int64_t)                                                 \
+  TESTCTZ32 (uint8_t, int32_t)                                                 \
+  TESTCTZ32 (uint8_t, int16_t)                                                 \
+  TESTCTZ32 (uint8_t, int8_t)                                                  \
+  TESTCTZ32N (int8_t, uint64_t)                                                \
+  TESTCTZ32N (int8_t, uint32_t)                                                \
+  TESTCTZ32N (int8_t, uint16_t)                                                \
+  TESTCTZ32N (int8_t, uint8_t)                                                 \
+  TESTCTZ32N (int8_t, int64_t)                                                 \
+  TESTCTZ32N (int8_t, int32_t)                                                 \
+  TESTCTZ32N (int8_t, int16_t)                                                 \
+  TESTCTZ32N (int8_t, int8_t)                                                  \
+  TESTFFS64 (uint64_t, uint64_t)                                               \
+  TESTFFS64 (uint64_t, uint32_t)                                               \
+  TESTFFS64 (uint64_t, uint16_t)                                               \
+  TESTFFS64 (uint64_t, uint8_t)                                                \
+  TESTFFS64 (uint64_t, int64_t)                                                \
+  TESTFFS64 (uint64_t, int32_t)                                                \
+  TESTFFS64 (uint64_t, int16_t)                                                \
+  TESTFFS64 (uint64_t, int8_t)                                                 \
+  TESTFFS64N (int64_t, uint64_t)                                               \
+  TESTFFS64N (int64_t, uint32_t)                                               \
+  TESTFFS64N (int64_t, uint16_t)                                               \
+  TESTFFS64N (int64_t, uint8_t)                                                \
+  TESTFFS64N (int64_t, int64_t)                                                \
+  TESTFFS64N (int64_t, int32_t)                                                \
+  TESTFFS64N (int64_t, int16_t)                                                \
+  TESTFFS64N (int64_t, int8_t)                                                 \
+  TESTFFS64 (uint32_t, uint64_t)                                               \
+  TESTFFS64 (uint32_t, uint32_t)                                               \
+  TESTFFS64 (uint32_t, uint16_t)                                               \
+  TESTFFS64 (uint32_t, uint8_t)                                                \
+  TESTFFS64 (uint32_t, int64_t)                                                \
+  TESTFFS64 (uint32_t, int32_t)                                                \
+  TESTFFS64 (uint32_t, int16_t)                                                \
+  TESTFFS64 (uint32_t, int8_t)                                                 \
+  TESTFFS64N (int32_t, uint64_t)                                               \
+  TESTFFS64N (int32_t, uint32_t)                                               \
+  TESTFFS64N (int32_t, uint16_t)                                               \
+  TESTFFS64N (int32_t, uint8_t)                                                \
+  TESTFFS64N (int32_t, int64_t)                                                \
+  TESTFFS64N (int32_t, int32_t)                                                \
+  TESTFFS64N (int32_t, int16_t)                                                \
+  TESTFFS64N (int32_t, int8_t)                                                 \
+  TESTFFS64 (uint16_t, uint64_t)                                               \
+  TESTFFS64 (uint16_t, uint32_t)                                               \
+  TESTFFS64 (uint16_t, uint16_t)                                               \
+  TESTFFS64 (uint16_t, uint8_t)                                                \
+  TESTFFS64 (uint16_t, int64_t)                                                \
+  TESTFFS64 (uint16_t, int32_t)                                                \
+  TESTFFS64 (uint16_t, int16_t)                                                \
+  TESTFFS64 (uint16_t, int8_t)                                                 \
+  TESTFFS64N (int16_t, uint64_t)                                               \
+  TESTFFS64N (int16_t, uint32_t)                                               \
+  TESTFFS64N (int16_t, uint16_t)                                               \
+  TESTFFS64N (int16_t, uint8_t)                                                \
+  TESTFFS64N (int16_t, int64_t)                                                \
+  TESTFFS64N (int16_t, int32_t)                                                \
+  TESTFFS64N (int16_t, int16_t)                                                \
+  TESTFFS64N (int16_t, int8_t)                                                 \
+  TESTFFS64 (uint8_t, uint64_t)                                                \
+  TESTFFS64 (uint8_t, uint32_t)                                                \
+  TESTFFS64 (uint8_t, uint16_t)                                                \
+  TESTFFS64 (uint8_t, uint8_t)                                                 \
+  TESTFFS64 (uint8_t, int64_t)                                                 \
+  TESTFFS64 (uint8_t, int32_t)                                                 \
+  TESTFFS64 (uint8_t, int16_t)                                                 \
+  TESTFFS64 (uint8_t, int8_t)                                                  \
+  TESTFFS64N (int8_t, uint64_t)                                                \
+  TESTFFS64N (int8_t, uint32_t)                                                \
+  TESTFFS64N (int8_t, uint16_t)                                                \
+  TESTFFS64N (int8_t, uint8_t)                                                 \
+  TESTFFS64N (int8_t, int64_t)                                                 \
+  TESTFFS64N (int8_t, int32_t)                                                 \
+  TESTFFS64N (int8_t, int16_t)                                                 \
+  TESTFFS64N (int8_t, int8_t)                                                  \
+  TESTFFS32 (uint64_t, uint64_t)                                               \
+  TESTFFS32 (uint64_t, uint32_t)                                               \
+  TESTFFS32 (uint64_t, uint16_t)                                               \
+  TESTFFS32 (uint64_t, uint8_t)                                                \
+  TESTFFS32 (uint64_t, int64_t)                                                \
+  TESTFFS32 (uint64_t, int32_t)                                                \
+  TESTFFS32 (uint64_t, int16_t)                                                \
+  TESTFFS32 (uint64_t, int8_t)                                                 \
+  TESTFFS32N (int64_t, uint64_t)                                               \
+  TESTFFS32N (int64_t, uint32_t)                                               \
+  TESTFFS32N (int64_t, uint16_t)                                               \
+  TESTFFS32N (int64_t, uint8_t)                                                \
+  TESTFFS32N (int64_t, int64_t)                                                \
+  TESTFFS32N (int64_t, int32_t)                                                \
+  TESTFFS32N (int64_t, int16_t)                                                \
+  TESTFFS32N (int64_t, int8_t)                                                 \
+  TESTFFS32 (uint32_t, uint64_t)                                               \
+  TESTFFS32 (uint32_t, uint32_t)                                               \
+  TESTFFS32 (uint32_t, uint16_t)                                               \
+  TESTFFS32 (uint32_t, uint8_t)                                                \
+  TESTFFS32 (uint32_t, int64_t)                                                \
+  TESTFFS32 (uint32_t, int32_t)                                                \
+  TESTFFS32 (uint32_t, int16_t)                                                \
+  TESTFFS32 (uint32_t, int8_t)                                                 \
+  TESTFFS32N (int32_t, uint64_t)                                               \
+  TESTFFS32N (int32_t, uint32_t)                                               \
+  TESTFFS32N (int32_t, uint16_t)                                               \
+  TESTFFS32N (int32_t, uint8_t)                                                \
+  TESTFFS32N (int32_t, int64_t)                                                \
+  TESTFFS32N (int32_t, int32_t)                                                \
+  TESTFFS32N (int32_t, int16_t)                                                \
+  TESTFFS32N (int32_t, int8_t)                                                 \
+  TESTFFS32 (uint16_t, uint64_t)                                               \
+  TESTFFS32 (uint16_t, uint32_t)                                               \
+  TESTFFS32 (uint16_t, uint16_t)                                               \
+  TESTFFS32 (uint16_t, uint8_t)                                                \
+  TESTFFS32 (uint16_t, int64_t)                                                \
+  TESTFFS32 (uint16_t, int32_t)                                                \
+  TESTFFS32 (uint16_t, int16_t)                                                \
+  TESTFFS32 (uint16_t, int8_t)                                                 \
+  TESTFFS32N (int16_t, uint64_t)                                               \
+  TESTFFS32N (int16_t, uint32_t)                                               \
+  TESTFFS32N (int16_t, uint16_t)                                               \
+  TESTFFS32N (int16_t, uint8_t)                                                \
+  TESTFFS32N (int16_t, int64_t)                                                \
+  TESTFFS32N (int16_t, int32_t)                                                \
+  TESTFFS32N (int16_t, int16_t)                                                \
+  TESTFFS32N (int16_t, int8_t)                                                 \
+  TESTFFS32 (uint8_t, uint64_t)                                                \
+  TESTFFS32 (uint8_t, uint32_t)                                                \
+  TESTFFS32 (uint8_t, uint16_t)                                                \
+  TESTFFS32 (uint8_t, uint8_t)                                                 \
+  TESTFFS32 (uint8_t, int64_t)                                                 \
+  TESTFFS32 (uint8_t, int32_t)                                                 \
+  TESTFFS32 (uint8_t, int16_t)                                                 \
+  TESTFFS32 (uint8_t, int8_t)                                                  \
+  TESTFFS32N (int8_t, uint64_t)                                                \
+  TESTFFS32N (int8_t, uint32_t)                                                \
+  TESTFFS32N (int8_t, uint16_t)                                                \
+  TESTFFS32N (int8_t, uint8_t)                                                 \
+  TESTFFS32N (int8_t, int64_t)                                                 \
+  TESTFFS32N (int8_t, int32_t)                                                 \
+  TESTFFS32N (int8_t, int16_t)                                                 \
+  TESTFFS32N (int8_t, int8_t)
+
+TEST_ALL ()
+
+#define RUN64(TYPEDST, TYPESRC) test64_##TYPEDST##TYPESRC ();
+#define RUN64N(TYPEDST, TYPESRC) test64n_##TYPEDST##TYPESRC ();
+#define RUN32(TYPEDST, TYPESRC) test32_##TYPEDST##TYPESRC ();
+#define RUN32N(TYPEDST, TYPESRC) test32n_##TYPEDST##TYPESRC ();
+#define RUNCTZ64(TYPEDST, TYPESRC) testctz64_##TYPEDST##TYPESRC ();
+#define RUNCTZ64N(TYPEDST, TYPESRC) testctz64n_##TYPEDST##TYPESRC ();
+#define RUNCTZ32(TYPEDST, TYPESRC) testctz32_##TYPEDST##TYPESRC ();
+#define RUNCTZ32N(TYPEDST, TYPESRC) testctz32n_##TYPEDST##TYPESRC ();
+#define RUNFFS64(TYPEDST, TYPESRC) testffs64_##TYPEDST##TYPESRC ();
+#define RUNFFS64N(TYPEDST, TYPESRC) testffs64n_##TYPEDST##TYPESRC ();
+#define RUNFFS32(TYPEDST, TYPESRC) testffs32_##TYPEDST##TYPESRC ();
+#define RUNFFS32N(TYPEDST, TYPESRC) testffs32n_##TYPEDST##TYPESRC ();
+
+#define RUN_ALL()                                                              \
+  RUN64 (uint64_t, uint64_t)                                                   \
+  RUN64 (uint64_t, uint32_t)                                                   \
+  RUN64 (uint64_t, uint16_t)                                                   \
+  RUN64 (uint64_t, uint8_t)                                                    \
+  RUN64 (uint64_t, int64_t)                                                    \
+  RUN64 (uint64_t, int32_t)                                                    \
+  RUN64 (uint64_t, int16_t)                                                    \
+  RUN64 (uint64_t, int8_t)                                                     \
+  RUN64N (int64_t, uint64_t)                                                    \
+  RUN64N (int64_t, uint32_t)                                                    \
+  RUN64N (int64_t, uint16_t)                                                    \
+  RUN64N (int64_t, uint8_t)                                                     \
+  RUN64N (int64_t, int64_t)                                                     \
+  RUN64N (int64_t, int32_t)                                                     \
+  RUN64N (int64_t, int16_t)                                                     \
+  RUN64N (int64_t, int8_t)                                                      \
+  RUN64 (uint32_t, uint64_t)                                                   \
+  RUN64 (uint32_t, uint32_t)                                                   \
+  RUN64 (uint32_t, uint16_t)                                                   \
+  RUN64 (uint32_t, uint8_t)                                                    \
+  RUN64 (uint32_t, int64_t)                                                    \
+  RUN64 (uint32_t, int32_t)                                                    \
+  RUN64 (uint32_t, int16_t)                                                    \
+  RUN64 (uint32_t, int8_t)                                                     \
+  RUN64N (int32_t, uint64_t)                                                    \
+  RUN64N (int32_t, uint32_t)                                                    \
+  RUN64N (int32_t, uint16_t)                                                    \
+  RUN64N (int32_t, uint8_t)                                                     \
+  RUN64N (int32_t, int64_t)                                                     \
+  RUN64N (int32_t, int32_t)                                                     \
+  RUN64N (int32_t, int16_t)                                                     \
+  RUN64N (int32_t, int8_t)                                                      \
+  RUN64 (uint16_t, uint64_t)                                                   \
+  RUN64 (uint16_t, uint32_t)                                                   \
+  RUN64 (uint16_t, uint16_t)                                                   \
+  RUN64 (uint16_t, uint8_t)                                                    \
+  RUN64 (uint16_t, int64_t)                                                    \
+  RUN64 (uint16_t, int32_t)                                                    \
+  RUN64 (uint16_t, int16_t)                                                    \
+  RUN64 (uint16_t, int8_t)                                                     \
+  RUN64N (int16_t, uint64_t)                                                    \
+  RUN64N (int16_t, uint32_t)                                                    \
+  RUN64N (int16_t, uint16_t)                                                    \
+  RUN64N (int16_t, uint8_t)                                                     \
+  RUN64N (int16_t, int64_t)                                                     \
+  RUN64N (int16_t, int32_t)                                                     \
+  RUN64N (int16_t, int16_t)                                                     \
+  RUN64N (int16_t, int8_t)                                                      \
+  RUN64 (uint8_t, uint64_t)                                                    \
+  RUN64 (uint8_t, uint32_t)                                                    \
+  RUN64 (uint8_t, uint16_t)                                                    \
+  RUN64 (uint8_t, uint8_t)                                                     \
+  RUN64 (uint8_t, int64_t)                                                     \
+  RUN64 (uint8_t, int32_t)                                                     \
+  RUN64 (uint8_t, int16_t)                                                     \
+  RUN64 (uint8_t, int8_t)                                                      \
+  RUN64N (int8_t, uint64_t)                                                     \
+  RUN64N (int8_t, uint32_t)                                                     \
+  RUN64N (int8_t, uint16_t)                                                     \
+  RUN64N (int8_t, uint8_t)                                                      \
+  RUN64N (int8_t, int64_t)                                                      \
+  RUN64N (int8_t, int32_t)                                                      \
+  RUN64N (int8_t, int16_t)                                                      \
+  RUN64N (int8_t, int8_t)                                                       \
+  RUN32 (uint64_t, uint64_t)                                                   \
+  RUN32 (uint64_t, uint32_t)                                                   \
+  RUN32 (uint64_t, uint16_t)                                                   \
+  RUN32 (uint64_t, uint8_t)                                                    \
+  RUN32 (uint64_t, int64_t)                                                    \
+  RUN32 (uint64_t, int32_t)                                                    \
+  RUN32 (uint64_t, int16_t)                                                    \
+  RUN32 (uint64_t, int8_t)                                                     \
+  RUN32N (int64_t, uint64_t)                                                    \
+  RUN32N (int64_t, uint32_t)                                                    \
+  RUN32N (int64_t, uint16_t)                                                    \
+  RUN32N (int64_t, uint8_t)                                                     \
+  RUN32N (int64_t, int64_t)                                                     \
+  RUN32N (int64_t, int32_t)                                                     \
+  RUN32N (int64_t, int16_t)                                                     \
+  RUN32N (int64_t, int8_t)                                                      \
+  RUN32 (uint32_t, uint64_t)                                                   \
+  RUN32 (uint32_t, uint32_t)                                                   \
+  RUN32 (uint32_t, uint16_t)                                                   \
+  RUN32 (uint32_t, uint8_t)                                                    \
+  RUN32 (uint32_t, int64_t)                                                    \
+  RUN32 (uint32_t, int32_t)                                                    \
+  RUN32 (uint32_t, int16_t)                                                    \
+  RUN32 (uint32_t, int8_t)                                                     \
+  RUN32N (int32_t, uint64_t)                                                    \
+  RUN32N (int32_t, uint32_t)                                                    \
+  RUN32N (int32_t, uint16_t)                                                    \
+  RUN32N (int32_t, uint8_t)                                                     \
+  RUN32N (int32_t, int64_t)                                                     \
+  RUN32N (int32_t, int32_t)                                                     \
+  RUN32N (int32_t, int16_t)                                                     \
+  RUN32N (int32_t, int8_t)                                                      \
+  RUN32 (uint16_t, uint64_t)                                                   \
+  RUN32 (uint16_t, uint32_t)                                                   \
+  RUN32 (uint16_t, uint16_t)                                                   \
+  RUN32 (uint16_t, uint8_t)                                                    \
+  RUN32 (uint16_t, int64_t)                                                    \
+  RUN32 (uint16_t, int32_t)                                                    \
+  RUN32 (uint16_t, int16_t)                                                    \
+  RUN32 (uint16_t, int8_t)                                                     \
+  RUN32N (int16_t, uint64_t)                                                    \
+  RUN32N (int16_t, uint32_t)                                                    \
+  RUN32N (int16_t, uint16_t)                                                    \
+  RUN32N (int16_t, uint8_t)                                                     \
+  RUN32N (int16_t, int64_t)                                                     \
+  RUN32N (int16_t, int32_t)                                                     \
+  RUN32N (int16_t, int16_t)                                                     \
+  RUN32N (int16_t, int8_t)                                                      \
+  RUN32 (uint8_t, uint64_t)                                                    \
+  RUN32 (uint8_t, uint32_t)                                                    \
+  RUN32 (uint8_t, uint16_t)                                                    \
+  RUN32 (uint8_t, uint8_t)                                                     \
+  RUN32 (uint8_t, int64_t)                                                     \
+  RUN32 (uint8_t, int32_t)                                                     \
+  RUN32 (uint8_t, int16_t)                                                     \
+  RUN32 (uint8_t, int8_t)                                                      \
+  RUN32N (int8_t, uint64_t)                                                     \
+  RUN32N (int8_t, uint32_t)                                                     \
+  RUN32N (int8_t, uint16_t)                                                     \
+  RUN32N (int8_t, uint8_t)                                                      \
+  RUN32N (int8_t, int64_t)                                                      \
+  RUN32N (int8_t, int32_t)                                                      \
+  RUN32N (int8_t, int16_t)                                                      \
+  RUN32N (int8_t, int8_t)                                                       \
+  RUNCTZ64 (uint64_t, uint64_t)                                                \
+  RUNCTZ64 (uint64_t, uint32_t)                                                \
+  RUNCTZ64 (uint64_t, uint16_t)                                                \
+  RUNCTZ64 (uint64_t, uint8_t)                                                 \
+  RUNCTZ64 (uint64_t, int64_t)                                                 \
+  RUNCTZ64 (uint64_t, int32_t)                                                 \
+  RUNCTZ64 (uint64_t, int16_t)                                                 \
+  RUNCTZ64 (uint64_t, int8_t)                                                  \
+  RUNCTZ64N (int64_t, uint64_t)                                                 \
+  RUNCTZ64N (int64_t, uint32_t)                                                 \
+  RUNCTZ64N (int64_t, uint16_t)                                                 \
+  RUNCTZ64N (int64_t, uint8_t)                                                  \
+  RUNCTZ64N (int64_t, int64_t)                                                  \
+  RUNCTZ64N (int64_t, int32_t)                                                  \
+  RUNCTZ64N (int64_t, int16_t)                                                  \
+  RUNCTZ64N (int64_t, int8_t)                                                   \
+  RUNCTZ64 (uint32_t, uint64_t)                                                \
+  RUNCTZ64 (uint32_t, uint32_t)                                                \
+  RUNCTZ64 (uint32_t, uint16_t)                                                \
+  RUNCTZ64 (uint32_t, uint8_t)                                                 \
+  RUNCTZ64 (uint32_t, int64_t)                                                 \
+  RUNCTZ64 (uint32_t, int32_t)                                                 \
+  RUNCTZ64 (uint32_t, int16_t)                                                 \
+  RUNCTZ64 (uint32_t, int8_t)                                                  \
+  RUNCTZ64N (int32_t, uint64_t)                                                 \
+  RUNCTZ64N (int32_t, uint32_t)                                                 \
+  RUNCTZ64N (int32_t, uint16_t)                                                 \
+  RUNCTZ64N (int32_t, uint8_t)                                                  \
+  RUNCTZ64N (int32_t, int64_t)                                                  \
+  RUNCTZ64N (int32_t, int32_t)                                                  \
+  RUNCTZ64N (int32_t, int16_t)                                                  \
+  RUNCTZ64N (int32_t, int8_t)                                                   \
+  RUNCTZ64 (uint16_t, uint64_t)                                                \
+  RUNCTZ64 (uint16_t, uint32_t)                                                \
+  RUNCTZ64 (uint16_t, uint16_t)                                                \
+  RUNCTZ64 (uint16_t, uint8_t)                                                 \
+  RUNCTZ64 (uint16_t, int64_t)                                                 \
+  RUNCTZ64 (uint16_t, int32_t)                                                 \
+  RUNCTZ64 (uint16_t, int16_t)                                                 \
+  RUNCTZ64 (uint16_t, int8_t)                                                  \
+  RUNCTZ64N (int16_t, uint64_t)                                                \
+  RUNCTZ64N (int16_t, uint32_t)                                                \
+  RUNCTZ64N (int16_t, uint16_t)                                                \
+  RUNCTZ64N (int16_t, uint8_t)                                                 \
+  RUNCTZ64N (int16_t, int64_t)                                                 \
+  RUNCTZ64N (int16_t, int32_t)                                                 \
+  RUNCTZ64N (int16_t, int16_t)                                                 \
+  RUNCTZ64N (int16_t, int8_t)                                                  \
+  RUNCTZ64 (uint8_t, uint64_t)                                                 \
+  RUNCTZ64 (uint8_t, uint32_t)                                                 \
+  RUNCTZ64 (uint8_t, uint16_t)                                                 \
+  RUNCTZ64 (uint8_t, uint8_t)                                                  \
+  RUNCTZ64 (uint8_t, int64_t)                                                  \
+  RUNCTZ64 (uint8_t, int32_t)                                                  \
+  RUNCTZ64 (uint8_t, int16_t)                                                  \
+  RUNCTZ64 (uint8_t, int8_t)                                                   \
+  RUNCTZ64N (int8_t, uint64_t)                                                 \
+  RUNCTZ64N (int8_t, uint32_t)                                                 \
+  RUNCTZ64N (int8_t, uint16_t)                                                 \
+  RUNCTZ64N (int8_t, uint8_t)                                                  \
+  RUNCTZ64N (int8_t, int64_t)                                                  \
+  RUNCTZ64N (int8_t, int32_t)                                                  \
+  RUNCTZ64N (int8_t, int16_t)                                                  \
+  RUNCTZ64N (int8_t, int8_t)                                                   \
+  RUNCTZ32 (uint64_t, uint64_t)                                                \
+  RUNCTZ32 (uint64_t, uint32_t)                                                \
+  RUNCTZ32 (uint64_t, uint16_t)                                                \
+  RUNCTZ32 (uint64_t, uint8_t)                                                 \
+  RUNCTZ32 (uint64_t, int64_t)                                                 \
+  RUNCTZ32 (uint64_t, int32_t)                                                 \
+  RUNCTZ32 (uint64_t, int16_t)                                                 \
+  RUNCTZ32 (uint64_t, int8_t)                                                  \
+  RUNCTZ32N (int64_t, uint64_t)                                                \
+  RUNCTZ32N (int64_t, uint32_t)                                                \
+  RUNCTZ32N (int64_t, uint16_t)                                                \
+  RUNCTZ32N (int64_t, uint8_t)                                                 \
+  RUNCTZ32N (int64_t, int64_t)                                                 \
+  RUNCTZ32N (int64_t, int32_t)                                                 \
+  RUNCTZ32N (int64_t, int16_t)                                                 \
+  RUNCTZ32N (int64_t, int8_t)                                                  \
+  RUNCTZ32 (uint32_t, uint64_t)                                                \
+  RUNCTZ32 (uint32_t, uint32_t)                                                \
+  RUNCTZ32 (uint32_t, uint16_t)                                                \
+  RUNCTZ32 (uint32_t, uint8_t)                                                 \
+  RUNCTZ32 (uint32_t, int64_t)                                                 \
+  RUNCTZ32 (uint32_t, int32_t)                                                 \
+  RUNCTZ32 (uint32_t, int16_t)                                                 \
+  RUNCTZ32 (uint32_t, int8_t)                                                  \
+  RUNCTZ32N (int32_t, uint64_t)                                                \
+  RUNCTZ32N (int32_t, uint32_t)                                                \
+  RUNCTZ32N (int32_t, uint16_t)                                                \
+  RUNCTZ32N (int32_t, uint8_t)                                                 \
+  RUNCTZ32N (int32_t, int64_t)                                                 \
+  RUNCTZ32N (int32_t, int32_t)                                                 \
+  RUNCTZ32N (int32_t, int16_t)                                                 \
+  RUNCTZ32N (int32_t, int8_t)                                                  \
+  RUNCTZ32 (uint16_t, uint64_t)                                                \
+  RUNCTZ32 (uint16_t, uint32_t)                                                \
+  RUNCTZ32 (uint16_t, uint16_t)                                                \
+  RUNCTZ32 (uint16_t, uint8_t)                                                 \
+  RUNCTZ32 (uint16_t, int64_t)                                                 \
+  RUNCTZ32 (uint16_t, int32_t)                                                 \
+  RUNCTZ32 (uint16_t, int16_t)                                                 \
+  RUNCTZ32 (uint16_t, int8_t)                                                  \
+  RUNCTZ32N (int16_t, uint64_t)                                                \
+  RUNCTZ32N (int16_t, uint32_t)                                                \
+  RUNCTZ32N (int16_t, uint16_t)                                                \
+  RUNCTZ32N (int16_t, uint8_t)                                                 \
+  RUNCTZ32N (int16_t, int64_t)                                                 \
+  RUNCTZ32N (int16_t, int32_t)                                                 \
+  RUNCTZ32N (int16_t, int16_t)                                                 \
+  RUNCTZ32N (int16_t, int8_t)                                                  \
+  RUNCTZ32 (uint8_t, uint64_t)                                                 \
+  RUNCTZ32 (uint8_t, uint32_t)                                                 \
+  RUNCTZ32 (uint8_t, uint16_t)                                                 \
+  RUNCTZ32 (uint8_t, uint8_t)                                                  \
+  RUNCTZ32 (uint8_t, int64_t)                                                  \
+  RUNCTZ32 (uint8_t, int32_t)                                                  \
+  RUNCTZ32 (uint8_t, int16_t)                                                  \
+  RUNCTZ32 (uint8_t, int8_t)                                                   \
+  RUNCTZ32N (int8_t, uint64_t)                                                 \
+  RUNCTZ32N (int8_t, uint32_t)                                                 \
+  RUNCTZ32N (int8_t, uint16_t)                                                 \
+  RUNCTZ32N (int8_t, uint8_t)                                                  \
+  RUNCTZ32N (int8_t, int64_t)                                                  \
+  RUNCTZ32N (int8_t, int32_t)                                                  \
+  RUNCTZ32N (int8_t, int16_t)                                                  \
+  RUNCTZ32N (int8_t, int8_t)                                                   \
+  RUNFFS64 (uint64_t, uint64_t)                                                \
+  RUNFFS64 (uint64_t, uint32_t)                                                \
+  RUNFFS64 (uint64_t, uint16_t)                                                \
+  RUNFFS64 (uint64_t, uint8_t)                                                 \
+  RUNFFS64 (uint64_t, int64_t)                                                 \
+  RUNFFS64 (uint64_t, int32_t)                                                 \
+  RUNFFS64 (uint64_t, int16_t)                                                 \
+  RUNFFS64 (uint64_t, int8_t)                                                  \
+  RUNFFS64N (int64_t, uint64_t)                                                \
+  RUNFFS64N (int64_t, uint32_t)                                                \
+  RUNFFS64N (int64_t, uint16_t)                                                \
+  RUNFFS64N (int64_t, uint8_t)                                                 \
+  RUNFFS64N (int64_t, int64_t)                                                 \
+  RUNFFS64N (int64_t, int32_t)                                                 \
+  RUNFFS64N (int64_t, int16_t)                                                 \
+  RUNFFS64N (int64_t, int8_t)                                                  \
+  RUNFFS64 (uint32_t, uint64_t)                                                \
+  RUNFFS64 (uint32_t, uint32_t)                                                \
+  RUNFFS64 (uint32_t, uint16_t)                                                \
+  RUNFFS64 (uint32_t, uint8_t)                                                 \
+  RUNFFS64 (uint32_t, int64_t)                                                 \
+  RUNFFS64 (uint32_t, int32_t)                                                 \
+  RUNFFS64 (uint32_t, int16_t)                                                 \
+  RUNFFS64 (uint32_t, int8_t)                                                  \
+  RUNFFS64N (int32_t, uint64_t)                                                \
+  RUNFFS64N (int32_t, uint32_t)                                                \
+  RUNFFS64N (int32_t, uint16_t)                                                \
+  RUNFFS64N (int32_t, uint8_t)                                                 \
+  RUNFFS64N (int32_t, int64_t)                                                 \
+  RUNFFS64N (int32_t, int32_t)                                                 \
+  RUNFFS64N (int32_t, int16_t)                                                 \
+  RUNFFS64N (int32_t, int8_t)                                                  \
+  RUNFFS64 (uint16_t, uint64_t)                                                \
+  RUNFFS64 (uint16_t, uint32_t)                                                \
+  RUNFFS64 (uint16_t, uint16_t)                                                \
+  RUNFFS64 (uint16_t, uint8_t)                                                 \
+  RUNFFS64 (uint16_t, int64_t)                                                 \
+  RUNFFS64 (uint16_t, int32_t)                                                 \
+  RUNFFS64 (uint16_t, int16_t)                                                 \
+  RUNFFS64 (uint16_t, int8_t)                                                  \
+  RUNFFS64N (int16_t, uint64_t)                                                \
+  RUNFFS64N (int16_t, uint32_t)                                                \
+  RUNFFS64N (int16_t, uint16_t)                                                \
+  RUNFFS64N (int16_t, uint8_t)                                                 \
+  RUNFFS64N (int16_t, int64_t)                                                 \
+  RUNFFS64N (int16_t, int32_t)                                                 \
+  RUNFFS64N (int16_t, int16_t)                                                 \
+  RUNFFS64N (int16_t, int8_t)                                                  \
+  RUNFFS64 (uint8_t, uint64_t)                                                 \
+  RUNFFS64 (uint8_t, uint32_t)                                                 \
+  RUNFFS64 (uint8_t, uint16_t)                                                 \
+  RUNFFS64 (uint8_t, uint8_t)                                                  \
+  RUNFFS64 (uint8_t, int64_t)                                                  \
+  RUNFFS64 (uint8_t, int32_t)                                                  \
+  RUNFFS64 (uint8_t, int16_t)                                                  \
+  RUNFFS64 (uint8_t, int8_t)                                                   \
+  RUNFFS64N (int8_t, uint64_t)                                                 \
+  RUNFFS64N (int8_t, uint32_t)                                                 \
+  RUNFFS64N (int8_t, uint16_t)                                                 \
+  RUNFFS64N (int8_t, uint8_t)                                                  \
+  RUNFFS64N (int8_t, int64_t)                                                  \
+  RUNFFS64N (int8_t, int32_t)                                                  \
+  RUNFFS64N (int8_t, int16_t)                                                  \
+  RUNFFS64N (int8_t, int8_t)                                                   \
+  RUNFFS32 (uint64_t, uint64_t)                                                \
+  RUNFFS32 (uint64_t, uint32_t)                                                \
+  RUNFFS32 (uint64_t, uint16_t)                                                \
+  RUNFFS32 (uint64_t, uint8_t)                                                 \
+  RUNFFS32 (uint64_t, int64_t)                                                 \
+  RUNFFS32 (uint64_t, int32_t)                                                 \
+  RUNFFS32 (uint64_t, int16_t)                                                 \
+  RUNFFS32 (uint64_t, int8_t)                                                  \
+  RUNFFS32N (int64_t, uint64_t)                                                \
+  RUNFFS32N (int64_t, uint32_t)                                                \
+  RUNFFS32N (int64_t, uint16_t)                                                \
+  RUNFFS32N (int64_t, uint8_t)                                                 \
+  RUNFFS32N (int64_t, int64_t)                                                 \
+  RUNFFS32N (int64_t, int32_t)                                                 \
+  RUNFFS32N (int64_t, int16_t)                                                 \
+  RUNFFS32N (int64_t, int8_t)                                                  \
+  RUNFFS32 (uint32_t, uint64_t)                                                \
+  RUNFFS32 (uint32_t, uint32_t)                                                \
+  RUNFFS32 (uint32_t, uint16_t)                                                \
+  RUNFFS32 (uint32_t, uint8_t)                                                 \
+  RUNFFS32 (uint32_t, int64_t)                                                 \
+  RUNFFS32 (uint32_t, int32_t)                                                 \
+  RUNFFS32 (uint32_t, int16_t)                                                 \
+  RUNFFS32 (uint32_t, int8_t)                                                  \
+  RUNFFS32N (int32_t, uint64_t)                                                \
+  RUNFFS32N (int32_t, uint32_t)                                                \
+  RUNFFS32N (int32_t, uint16_t)                                                \
+  RUNFFS32N (int32_t, uint8_t)                                                 \
+  RUNFFS32N (int32_t, int64_t)                                                 \
+  RUNFFS32N (int32_t, int32_t)                                                 \
+  RUNFFS32N (int32_t, int16_t)                                                 \
+  RUNFFS32N (int32_t, int8_t)                                                  \
+  RUNFFS32 (uint16_t, uint64_t)                                                \
+  RUNFFS32 (uint16_t, uint32_t)                                                \
+  RUNFFS32 (uint16_t, uint16_t)                                                \
+  RUNFFS32 (uint16_t, uint8_t)                                                 \
+  RUNFFS32 (uint16_t, int64_t)                                                 \
+  RUNFFS32 (uint16_t, int32_t)                                                 \
+  RUNFFS32 (uint16_t, int16_t)                                                 \
+  RUNFFS32 (uint16_t, int8_t)                                                  \
+  RUNFFS32N (int16_t, uint64_t)                                                \
+  RUNFFS32N (int16_t, uint32_t)                                                \
+  RUNFFS32N (int16_t, uint16_t)                                                \
+  RUNFFS32N (int16_t, uint8_t)                                                 \
+  RUNFFS32N (int16_t, int64_t)                                                 \
+  RUNFFS32N (int16_t, int32_t)                                                 \
+  RUNFFS32N (int16_t, int16_t)                                                 \
+  RUNFFS32N (int16_t, int8_t)                                                  \
+  RUNFFS32 (uint8_t, uint64_t)                                                 \
+  RUNFFS32 (uint8_t, uint32_t)                                                 \
+  RUNFFS32 (uint8_t, uint16_t)                                                 \
+  RUNFFS32 (uint8_t, uint8_t)                                                  \
+  RUNFFS32 (uint8_t, int64_t)                                                  \
+  RUNFFS32 (uint8_t, int32_t)                                                  \
+  RUNFFS32 (uint8_t, int16_t)                                                  \
+  RUNFFS32 (uint8_t, int8_t)                                                   \
+  RUNFFS32N (int8_t, uint64_t)                                                 \
+  RUNFFS32N (int8_t, uint32_t)                                                 \
+  RUNFFS32N (int8_t, uint16_t)                                                 \
+  RUNFFS32N (int8_t, uint8_t)                                                  \
+  RUNFFS32N (int8_t, int64_t)                                                  \
+  RUNFFS32N (int8_t, int32_t)                                                  \
+  RUNFFS32N (int8_t, int16_t)                                                  \
+  RUNFFS32N (int8_t, int8_t)
+
+int
+main ()
+{
+  RUN_ALL ()
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 229 "vect" } } */
-- 
2.41.0
 
 


* Re: Re: [PATCH] RISC-V: Add popcount fallback expander.
  2023-10-18 11:43   ` Robin Dapp
  2023-10-18 11:48     ` juzhe.zhong
@ 2023-10-18 12:22     ` juzhe.zhong
  2023-10-18 13:51     ` Robin Dapp
  2 siblings, 0 replies; 10+ messages in thread
From: juzhe.zhong @ 2023-10-18 12:22 UTC (permalink / raw)
  To: Robin Dapp, gcc-patches

LGTM

juzhe.zhong@rivai.ai
 
From: Robin Dapp
Date: 2023-10-18 19:43
To: juzhe.zhong@rivai.ai; gcc-patches; palmer; kito.cheng; jeffreyalaw
CC: rdapp.gcc
Subject: Re: [PATCH] RISC-V: Add popcount fallback expander.
> I saw you didn't extend VI -> V_VLSI. I guess it will fail SLP on popcount.
 
Added VLS modes and your test in v2.
 
Testsuite looks unchanged on my side (vect, dg, rvv).
 
Regards
Robin
 
Subject: [PATCH v2] RISC-V: Add popcount fallback expander.
 
I didn't manage to get back to the generic vectorizer fallback for
popcount so I figured I'd rather create a popcount fallback in the
riscv backend.  It uses the WWG algorithm from libgcc.
 
gcc/ChangeLog:
 
* config/riscv/autovec.md (popcount<mode>2): New expander.
* config/riscv/riscv-protos.h (expand_popcount): Define.
* config/riscv/riscv-v.cc (expand_popcount): Vectorize popcount
with the WWG algorithm.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/unop/popcount-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-2.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount.c: New test.
---
gcc/config/riscv/autovec.md                   |   14 +
gcc/config/riscv/riscv-protos.h               |    1 +
gcc/config/riscv/riscv-v.cc                   |   71 +
.../riscv/rvv/autovec/unop/popcount-1.c       |   20 +
.../riscv/rvv/autovec/unop/popcount-2.c       |   19 +
.../riscv/rvv/autovec/unop/popcount-run-1.c   |   49 +
.../riscv/rvv/autovec/unop/popcount.c         | 1464 +++++++++++++++++
7 files changed, 1638 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
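
For readers following the expander in riscv-v.cc, here is a minimal scalar
sketch of the steps it emits per vector element. It is not part of the patch.
The function names (wwg_popcount64, wwg_popcount32, wwg_selfcheck) are
hypothetical, and the 32-bit variant assumes that the 64-bit masks are simply
truncated to the element width and that the final shift is the element width
minus 8, which is how the gen_int_mode and GET_MODE_BITSIZE calls in the patch
read.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the four steps for one 64-bit element.  */
unsigned
wwg_popcount64 (uint64_t x)
{
  const uint64_t m5 = 0x5555555555555555ULL;
  const uint64_t m3 = 0x3333333333333333ULL;
  const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL;
  const uint64_t m1 = 0x0101010101010101ULL;

  /* Each 2-bit field now holds the bit count of that field.  */
  x = x - ((x >> 1) & m5);
  /* Each 4-bit field now holds the bit count of that field.  */
  x = (x & m3) + ((x >> 2) & m3);
  /* Each byte now holds the bit count of that byte.  */
  x = (x + (x >> 4)) & mf;
  /* The multiply sums all bytes into the top byte; shift it down.  */
  return (unsigned) ((x * m1) >> 56);
}

/* Same steps for a 32-bit element (assumption: truncated masks and a
   shift of 32 - 8 = 24).  */
unsigned
wwg_popcount32 (uint32_t x)
{
  x = x - ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x + (x >> 4)) & 0x0F0F0F0Fu;
  return (x * 0x01010101u) >> 24;
}

/* Compare against the GCC builtins on a few inputs taken from
   popcount-run-1.c.  Returns 1 if no assertion fired.  */
int
wwg_selfcheck (void)
{
  static const uint32_t in[] = { 0x11111100u, 0xe0e0f0f0u, 0x22227777u, 0u };
  for (unsigned i = 0; i < sizeof (in) / sizeof (in[0]); ++i)
    {
      uint64_t wide = ((uint64_t) in[i] << 32) | in[i];
      assert (wwg_popcount32 (in[i]) == (unsigned) __builtin_popcount (in[i]));
      assert (wwg_popcount64 (wide) == (unsigned) __builtin_popcountll (wide));
    }
  return 1;
}
```

The multiply in the last step relies on unsigned wraparound, so the element
type has to be unsigned in a scalar model like this one.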
 
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index c5b1e52cbf9..80910ba3cc2 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1484,6 +1484,20 @@ (define_expand "xorsign<mode>3"
   DONE;
})
+;; -------------------------------------------------------------------------------
+;; - [INT] POPCOUNT.
+;; -------------------------------------------------------------------------------
+
+(define_expand "popcount<mode>2"
+  [(match_operand:V_VLSI 0 "register_operand")
+   (match_operand:V_VLSI 1 "register_operand")]
+  "TARGET_VECTOR"
+{
+  riscv_vector::expand_popcount (operands);
+  DONE;
+})
+
+
;; -------------------------------------------------------------------------
;; ---- [INT] Highpart multiplication
;; -------------------------------------------------------------------------
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 49bdcdf2f93..4aeccdd961b 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -515,6 +515,7 @@ void expand_fold_extract_last (rtx *);
void expand_cond_unop (unsigned, rtx *);
void expand_cond_binop (unsigned, rtx *);
void expand_cond_ternop (unsigned, rtx *);
+void expand_popcount (rtx *);
/* Rounding mode bitfield for fixed point VXRM.  */
enum fixed_point_rounding_mode
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 21d86c3f917..8b594b7127e 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4152,4 +4152,75 @@ expand_vec_lfloor (rtx op_0, rtx op_1, machine_mode vec_fp_mode,
   emit_vec_cvt_x_f (op_0, op_1, UNARY_OP_FRM_RDN, vec_fp_mode);
}
+/* Vectorize popcount by the Wilkes-Wheeler-Gill algorithm that libgcc uses as
+   well.  */
+void
+expand_popcount (rtx *ops)
+{
+  rtx dst = ops[0];
+  rtx src = ops[1];
+  machine_mode mode = GET_MODE (dst);
+  scalar_mode imode = GET_MODE_INNER (mode);
+  static const uint64_t m5 = 0x5555555555555555ULL;
+  static const uint64_t m3 = 0x3333333333333333ULL;
+  static const uint64_t mf = 0x0F0F0F0F0F0F0F0FULL;
+  static const uint64_t m1 = 0x0101010101010101ULL;
+
+  rtx x1 = gen_reg_rtx (mode);
+  rtx x2 = gen_reg_rtx (mode);
+  rtx x3 = gen_reg_rtx (mode);
+  rtx x4 = gen_reg_rtx (mode);
+
+  /* x1 = src - ((src >> 1) & 0x555...);  */
+  rtx shift1 = expand_binop (mode, lshr_optab, src, GEN_INT (1), NULL, true,
+	     OPTAB_DIRECT);
+
+  rtx and1 = gen_reg_rtx (mode);
+  rtx ops1[] = {and1, shift1, gen_int_mode (m5, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+	   ops1);
+
+  x1 = expand_binop (mode, sub_optab, src, and1, NULL, true, OPTAB_DIRECT);
+
+  /* x2 = (x1 & 0x3333333333333333ULL) + ((x1 >> 2) & 0x3333333333333333ULL);
+   */
+  rtx and2 = gen_reg_rtx (mode);
+  rtx ops2[] = {and2, x1, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+	   ops2);
+
+  rtx shift2 = expand_binop (mode, lshr_optab, x1, GEN_INT (2), NULL, true,
+	     OPTAB_DIRECT);
+
+  rtx and22 = gen_reg_rtx (mode);
+  rtx ops22[] = {and22, shift2, gen_int_mode (m3, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+	   ops22);
+
+  x2 = expand_binop (mode, add_optab, and2, and22, NULL, true, OPTAB_DIRECT);
+
+  /* x3 = (x2 + (x2 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;  */
+  rtx shift3 = expand_binop (mode, lshr_optab, x2, GEN_INT (4), NULL, true,
+	     OPTAB_DIRECT);
+
+  rtx plus3
+    = expand_binop (mode, add_optab, x2, shift3, NULL, true, OPTAB_DIRECT);
+
+  rtx ops3[] = {x3, plus3, gen_int_mode (mf, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (AND, mode), riscv_vector::BINARY_OP,
+	   ops3);
+
+  /* dest = (x3 * 0x0101010101010101ULL) >> (element bits - 8);  */
+  rtx mul4 = gen_reg_rtx (mode);
+  rtx ops4[] = {mul4, x3, gen_int_mode (m1, imode)};
+  emit_vlmax_insn (code_for_pred_scalar (MULT, mode), riscv_vector::BINARY_OP,
+	   ops4);
+
+  x4 = expand_binop (mode, lshr_optab, mul4,
+	     GEN_INT (GET_MODE_BITSIZE (imode) - 8), NULL, true,
+	     OPTAB_DIRECT);
+
+  emit_move_insn (dst, x4);
+}
+
} // namespace riscv_vector
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
new file mode 100644
index 00000000000..3169ebbff71
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv_zvfh -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-vect-details" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noipa))
+popcount_32 (uint32_t *restrict dst, uint32_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcount (src[i]);
+}
+
+void __attribute__ ((noipa))
+popcount_64 (uint64_t *restrict dst, uint64_t *restrict src, int size)
+{
+  for (int i = 0; i < size; ++i)
+    dst[i] = __builtin_popcountll (src[i]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 2 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
new file mode 100644
index 00000000000..9c0970afdfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-2.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv -mabi=lp64d --param=riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-slp-details" } */
+
+int x[8];
+int y[8];
+
+void foo ()
+{
+  x[0] = __builtin_popcount (y[0]);
+  x[1] = __builtin_popcount (y[1]);
+  x[2] = __builtin_popcount (y[2]);
+  x[3] = __builtin_popcount (y[3]);
+  x[4] = __builtin_popcount (y[4]);
+  x[5] = __builtin_popcount (y[5]);
+  x[6] = __builtin_popcount (y[6]);
+  x[7] = __builtin_popcount (y[7]);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
new file mode 100644
index 00000000000..38f1633da99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c
@@ -0,0 +1,49 @@
+/* { dg-do run { target { riscv_v } } } */
+
+#include "popcount-1.c"
+
+extern void abort (void) __attribute__ ((noreturn));
+
+unsigned int data[] = {
+  0x11111100, 6,
+  0xe0e0f0f0, 14,
+  0x9900aab3, 13,
+  0x00040003, 3,
+  0x000e000c, 5,
+  0x22227777, 16,
+  0x12341234, 10,
+  0x0, 0
+};
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  unsigned int count = sizeof (data) / sizeof (data[0]) / 2;
+
+  uint32_t in32[count];
+  uint32_t out32[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in32[i] = data[i * 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_32 (out32, in32, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out32[i] != data[i * 2 + 1])
+      abort ();
+
+  count /= 2;
+  uint64_t in64[count];
+  uint64_t out64[count];
+  for (unsigned int i = 0; i < count; ++i)
+    {
+      in64[i] = ((uint64_t) data[i * 4] << 32) | data[i * 4 + 2];
+      asm volatile ("" ::: "memory");
+    }
+  popcount_64 (out64, in64, count);
+  for (unsigned int i = 0; i < count; ++i)
+    if (out64[i] != data[i * 4 + 1] + data[i * 4 + 3])
+      abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
new file mode 100644
index 00000000000..585a522aa81
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount.c
@@ -0,0 +1,1464 @@
+/* { dg-do run { target { riscv_v } } } */
+/* { dg-additional-options { -O2 -fdump-tree-vect-details -fno-vect-cost-model } }  */
+
+#include "stdint-gcc.h"
+#include <assert.h>
+
+#define DEF64(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+	int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcountll (src[i]);                                  \
+  }
+
+#define DEF32(TYPEDST, TYPESRC)                                                \
+  void __attribute__ ((noipa))                                                 \
+  popcount32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src, \
+	int size)                                     \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_popcount (src[i]);                                    \
+  }
+
+#define DEFCTZ64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+	    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctzll (src[i]);                                       \
+  }
+
+#define DEFCTZ32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ctz32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+	    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ctz (src[i]);                                         \
+  }
+
+#define DEFFFS64(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs64_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+	    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffsll (src[i]);                                       \
+  }
+
+#define DEFFFS32(TYPEDST, TYPESRC)                                             \
+  void __attribute__ ((noipa))                                                 \
+  ffs32_##TYPEDST##TYPESRC (TYPEDST *restrict dst, TYPESRC *restrict src,      \
+	    int size)                                          \
+  {                                                                            \
+    for (int i = 0; i < size; ++i)                                             \
+      dst[i] = __builtin_ffs (src[i]);                                         \
+  }
+
+#define DEF_ALL()                                                              \
+  DEF64 (uint64_t, uint64_t)                                                   \
+  DEF64 (uint64_t, uint32_t)                                                   \
+  DEF64 (uint64_t, uint16_t)                                                   \
+  DEF64 (uint64_t, uint8_t)                                                    \
+  DEF64 (uint64_t, int64_t)                                                    \
+  DEF64 (uint64_t, int32_t)                                                    \
+  DEF64 (uint64_t, int16_t)                                                    \
+  DEF64 (uint64_t, int8_t)                                                     \
+  DEF64 (int64_t, uint64_t)                                                    \
+  DEF64 (int64_t, uint32_t)                                                    \
+  DEF64 (int64_t, uint16_t)                                                    \
+  DEF64 (int64_t, uint8_t)                                                     \
+  DEF64 (int64_t, int64_t)                                                     \
+  DEF64 (int64_t, int32_t)                                                     \
+  DEF64 (int64_t, int16_t)                                                     \
+  DEF64 (int64_t, int8_t)                                                      \
+  DEF64 (uint32_t, uint64_t)                                                   \
+  DEF64 (uint32_t, uint32_t)                                                   \
+  DEF64 (uint32_t, uint16_t)                                                   \
+  DEF64 (uint32_t, uint8_t)                                                    \
+  DEF64 (uint32_t, int64_t)                                                    \
+  DEF64 (uint32_t, int32_t)                                                    \
+  DEF64 (uint32_t, int16_t)                                                    \
+  DEF64 (uint32_t, int8_t)                                                     \
+  DEF64 (int32_t, uint64_t)                                                    \
+  DEF64 (int32_t, uint32_t)                                                    \
+  DEF64 (int32_t, uint16_t)                                                    \
+  DEF64 (int32_t, uint8_t)                                                     \
+  DEF64 (int32_t, int64_t)                                                     \
+  DEF64 (int32_t, int32_t)                                                     \
+  DEF64 (int32_t, int16_t)                                                     \
+  DEF64 (int32_t, int8_t)                                                      \
+  DEF64 (uint16_t, uint64_t)                                                   \
+  DEF64 (uint16_t, uint32_t)                                                   \
+  DEF64 (uint16_t, uint16_t)                                                   \
+  DEF64 (uint16_t, uint8_t)                                                    \
+  DEF64 (uint16_t, int64_t)                                                    \
+  DEF64 (uint16_t, int32_t)                                                    \
+  DEF64 (uint16_t, int16_t)                                                    \
+  DEF64 (uint16_t, int8_t)                                                     \
+  DEF64 (int16_t, uint64_t)                                                    \
+  DEF64 (int16_t, uint32_t)                                                    \
+  DEF64 (int16_t, uint16_t)                                                    \
+  DEF64 (int16_t, uint8_t)                                                     \
+  DEF64 (int16_t, int64_t)                                                     \
+  DEF64 (int16_t, int32_t)                                                     \
+  DEF64 (int16_t, int16_t)                                                     \
+  DEF64 (int16_t, int8_t)                                                      \
+  DEF64 (uint8_t, uint64_t)                                                    \
+  DEF64 (uint8_t, uint32_t)                                                    \
+  DEF64 (uint8_t, uint16_t)                                                    \
+  DEF64 (uint8_t, uint8_t)                                                     \
+  DEF64 (uint8_t, int64_t)                                                     \
+  DEF64 (uint8_t, int32_t)                                                     \
+  DEF64 (uint8_t, int16_t)                                                     \
+  DEF64 (uint8_t, int8_t)                                                      \
+  DEF64 (int8_t, uint64_t)                                                     \
+  DEF64 (int8_t, uint32_t)                                                     \
+  DEF64 (int8_t, uint16_t)                                                     \
+  DEF64 (int8_t, uint8_t)                                                      \
+  DEF64 (int8_t, int64_t)                                                      \
+  DEF64 (int8_t, int32_t)                                                      \
+  DEF64 (int8_t, int16_t)                                                      \
+  DEF64 (int8_t, int8_t)                                                       \
+  DEF32 (uint64_t, uint64_t)                                                   \
+  DEF32 (uint64_t, uint32_t)                                                   \
+  DEF32 (uint64_t, uint16_t)                                                   \
+  DEF32 (uint64_t, uint8_t)                                                    \
+  DEF32 (uint64_t, int64_t)                                                    \
+  DEF32 (uint64_t, int32_t)                                                    \
+  DEF32 (uint64_t, int16_t)                                                    \
+  DEF32 (uint64_t, int8_t)                                                     \
+  DEF32 (int64_t, uint64_t)                                                    \
+  DEF32 (int64_t, uint32_t)                                                    \
+  DEF32 (int64_t, uint16_t)                                                    \
+  DEF32 (int64_t, uint8_t)                                                     \
+  DEF32 (int64_t, int64_t)                                                     \
+  DEF32 (int64_t, int32_t)                                                     \
+  DEF32 (int64_t, int16_t)                                                     \
+  DEF32 (int64_t, int8_t)                                                      \
+  DEF32 (uint32_t, uint64_t)                                                   \
+  DEF32 (uint32_t, uint32_t)                                                   \
+  DEF32 (uint32_t, uint16_t)                                                   \
+  DEF32 (uint32_t, uint8_t)                                                    \
+  DEF32 (uint32_t, int64_t)                                                    \
+  DEF32 (uint32_t, int32_t)                                                    \
+  DEF32 (uint32_t, int16_t)                                                    \
+  DEF32 (uint32_t, int8_t)                                                     \
+  DEF32 (int32_t, uint64_t)                                                    \
+  DEF32 (int32_t, uint32_t)                                                    \
+  DEF32 (int32_t, uint16_t)                                                    \
+  DEF32 (int32_t, uint8_t)                                                     \
+  DEF32 (int32_t, int64_t)                                                     \
+  DEF32 (int32_t, int32_t)                                                     \
+  DEF32 (int32_t, int16_t)                                                     \
+  DEF32 (int32_t, int8_t)                                                      \
+  DEF32 (uint16_t, uint64_t)                                                   \
+  DEF32 (uint16_t, uint32_t)                                                   \
+  DEF32 (uint16_t, uint16_t)                                                   \
+  DEF32 (uint16_t, uint8_t)                                                    \
+  DEF32 (uint16_t, int64_t)                                                    \
+  DEF32 (uint16_t, int32_t)                                                    \
+  DEF32 (uint16_t, int16_t)                                                    \
+  DEF32 (uint16_t, int8_t)                                                     \
+  DEF32 (int16_t, uint64_t)                                                    \
+  DEF32 (int16_t, uint32_t)                                                    \
+  DEF32 (int16_t, uint16_t)                                                    \
+  DEF32 (int16_t, uint8_t)                                                     \
+  DEF32 (int16_t, int64_t)                                                     \
+  DEF32 (int16_t, int32_t)                                                     \
+  DEF32 (int16_t, int16_t)                                                     \
+  DEF32 (int16_t, int8_t)                                                      \
+  DEF32 (uint8_t, uint64_t)                                                    \
+  DEF32 (uint8_t, uint32_t)                                                    \
+  DEF32 (uint8_t, uint16_t)                                                    \
+  DEF32 (uint8_t, uint8_t)                                                     \
+  DEF32 (uint8_t, int64_t)                                                     \
+  DEF32 (uint8_t, int32_t)                                                     \
+  DEF32 (uint8_t, int16_t)                                                     \
+  DEF32 (uint8_t, int8_t)                                                      \
+  DEF32 (int8_t, uint64_t)                                                     \
+  DEF32 (int8_t, uint32_t)                                                     \
+  DEF32 (int8_t, uint16_t)                                                     \
+  DEF32 (int8_t, uint8_t)                                                      \
+  DEF32 (int8_t, int64_t)                                                      \
+  DEF32 (int8_t, int32_t)                                                      \
+  DEF32 (int8_t, int16_t)                                                      \
+  DEF32 (int8_t, int8_t)                                                       \
+  DEFCTZ64 (uint64_t, uint64_t)                                                \
+  DEFCTZ64 (uint64_t, uint32_t)                                                \
+  DEFCTZ64 (uint64_t, uint16_t)                                                \
+  DEFCTZ64 (uint64_t, uint8_t)                                                 \
+  DEFCTZ64 (uint64_t, int64_t)                                                 \
+  DEFCTZ64 (uint64_t, int32_t)                                                 \
+  DEFCTZ64 (uint64_t, int16_t)                                                 \
+  DEFCTZ64 (uint64_t, int8_t)                                                  \
+  DEFCTZ64 (int64_t, uint64_t)                                                 \
+  DEFCTZ64 (int64_t, uint32_t)                                                 \
+  DEFCTZ64 (int64_t, uint16_t)                                                 \
+  DEFCTZ64 (int64_t, uint8_t)                                                  \
+  DEFCTZ64 (int64_t, int64_t)                                                  \
+  DEFCTZ64 (int64_t, int32_t)                                                  \
+  DEFCTZ64 (int64_t, int16_t)                                                  \
+  DEFCTZ64 (int64_t, int8_t)                                                   \
+  DEFCTZ64 (uint32_t, uint64_t)                                                \
+  DEFCTZ64 (uint32_t, uint32_t)                                                \
+  DEFCTZ64 (uint32_t, uint16_t)                                                \
+  DEFCTZ64 (uint32_t, uint8_t)                                                 \
+  DEFCTZ64 (uint32_t, int64_t)                                                 \
+  DEFCTZ64 (uint32_t, int32_t)                                                 \
+  DEFCTZ64 (uint32_t, int16_t)                                                 \
+  DEFCTZ64 (uint32_t, int8_t)                                                  \
+  DEFCTZ64 (int32_t, uint64_t)                                                 \
+  DEFCTZ64 (int32_t, uint32_t)                                                 \
+  DEFCTZ64 (int32_t, uint16_t)                                                 \
+  DEFCTZ64 (int32_t, uint8_t)                                                  \
+  DEFCTZ64 (int32_t, int64_t)                                                  \
+  DEFCTZ64 (int32_t, int32_t)                                                  \
+  DEFCTZ64 (int32_t, int16_t)                                                  \
+  DEFCTZ64 (int32_t, int8_t)                                                   \
+  DEFCTZ64 (uint16_t, uint64_t)                                                \
+  DEFCTZ64 (uint16_t, uint32_t)                                                \
+  DEFCTZ64 (uint16_t, uint16_t)                                                \
+  DEFCTZ64 (uint16_t, uint8_t)                                                 \
+  DEFCTZ64 (uint16_t, int64_t)                                                 \
+  DEFCTZ64 (uint16_t, int32_t)                                                 \
+  DEFCTZ64 (uint16_t, int16_t)                                                 \
+  DEFCTZ64 (uint16_t, int8_t)                                                  \
+  DEFCTZ64 (int16_t, uint64_t)                                                 \
+  DEFCTZ64 (int16_t, uint32_t)                                                 \
+  DEFCTZ64 (int16_t, uint16_t)                                                 \
+  DEFCTZ64 (int16_t, uint8_t)                                                  \
+  DEFCTZ64 (int16_t, int64_t)                                                  \
+  DEFCTZ64 (int16_t, int32_t)                                                  \
+  DEFCTZ64 (int16_t, int16_t)                                                  \
+  DEFCTZ64 (int16_t, int8_t)                                                   \
+  DEFCTZ64 (uint8_t, uint64_t)                                                 \
+  DEFCTZ64 (uint8_t, uint32_t)                                                 \
+  DEFCTZ64 (uint8_t, uint16_t)                                                 \
+  DEFCTZ64 (uint8_t, uint8_t)                                                  \
+  DEFCTZ64 (uint8_t, int64_t)                                                  \
+  DEFCTZ64 (uint8_t, int32_t)                                                  \
+  DEFCTZ64 (uint8_t, int16_t)                                                  \
+  DEFCTZ64 (uint8_t, int8_t)                                                   \
+  DEFCTZ64 (int8_t, uint64_t)                                                  \
+  DEFCTZ64 (int8_t, uint32_t)                                                  \
+  DEFCTZ64 (int8_t, uint16_t)                                                  \
+  DEFCTZ64 (int8_t, uint8_t)                                                   \
+  DEFCTZ64 (int8_t, int64_t)                                                   \
+  DEFCTZ64 (int8_t, int32_t)                                                   \
+  DEFCTZ64 (int8_t, int16_t)                                                   \
+  DEFCTZ64 (int8_t, int8_t)                                                    \
+  DEFCTZ32 (uint64_t, uint64_t)                                                \
+  DEFCTZ32 (uint64_t, uint32_t)                                                \
+  DEFCTZ32 (uint64_t, uint16_t)                                                \
+  DEFCTZ32 (uint64_t, uint8_t)                                                 \
+  DEFCTZ32 (uint64_t, int64_t)                                                 \
+  DEFCTZ32 (uint64_t, int32_t)                                                 \
+  DEFCTZ32 (uint64_t, int16_t)                                                 \
+  DEFCTZ32 (uint64_t, int8_t)                                                  \
+  DEFCTZ32 (int64_t, uint64_t)                                                 \
+  DEFCTZ32 (int64_t, uint32_t)                                                 \
+  DEFCTZ32 (int64_t, uint16_t)                                                 \
+  DEFCTZ32 (int64_t, uint8_t)                                                  \
+  DEFCTZ32 (int64_t, int64_t)                                                  \
+  DEFCTZ32 (int64_t, int32_t)                                                  \
+  DEFCTZ32 (int64_t, int16_t)                                                  \
+  DEFCTZ32 (int64_t, int8_t)                                                   \
+  DEFCTZ32 (uint32_t, uint64_t)                                                \
+  DEFCTZ32 (uint32_t, uint32_t)                                                \
+  DEFCTZ32 (uint32_t, uint16_t)                                                \
+  DEFCTZ32 (uint32_t, uint8_t)                                                 \
+  DEFCTZ32 (uint32_t, int64_t)                                                 \
+  DEFCTZ32 (uint32_t, int32_t)                                                 \
+  DEFCTZ32 (uint32_t, int16_t)                                                 \
+  DEFCTZ32 (uint32_t, int8_t)                                                  \
+  DEFCTZ32 (int32_t, uint64_t)                                                 \
+  DEFCTZ32 (int32_t, uint32_t)                                                 \
+  DEFCTZ32 (int32_t, uint16_t)                                                 \
+  DEFCTZ32 (int32_t, uint8_t)                                                  \
+  DEFCTZ32 (int32_t, int64_t)                                                  \
+  DEFCTZ32 (int32_t, int32_t)                                                  \
+  DEFCTZ32 (int32_t, int16_t)                                                  \
+  DEFCTZ32 (int32_t, int8_t)                                                   \
+  DEFCTZ32 (uint16_t, uint64_t)                                                \
+  DEFCTZ32 (uint16_t, uint32_t)                                                \
+  DEFCTZ32 (uint16_t, uint16_t)                                                \
+  DEFCTZ32 (uint16_t, uint8_t)                                                 \
+  DEFCTZ32 (uint16_t, int64_t)                                                 \
+  DEFCTZ32 (uint16_t, int32_t)                                                 \
+  DEFCTZ32 (uint16_t, int16_t)                                                 \
+  DEFCTZ32 (uint16_t, int8_t)                                                  \
+  DEFCTZ32 (int16_t, uint64_t)                                                 \
+  DEFCTZ32 (int16_t, uint32_t)                                                 \
+  DEFCTZ32 (int16_t, uint16_t)                                                 \
+  DEFCTZ32 (int16_t, uint8_t)                                                  \
+  DEFCTZ32 (int16_t, int64_t)                                                  \
+  DEFCTZ32 (int16_t, int32_t)                                                  \
+  DEFCTZ32 (int16_t, int16_t)                                                  \
+  DEFCTZ32 (int16_t, int8_t)                                                   \
+  DEFCTZ32 (uint8_t, uint64_t)                                                 \
+  DEFCTZ32 (uint8_t, uint32_t)                                                 \
+  DEFCTZ32 (uint8_t, uint16_t)                                                 \
+  DEFCTZ32 (uint8_t, uint8_t)                                                  \
+  DEFCTZ32 (uint8_t, int64_t)                                                  \
+  DEFCTZ32 (uint8_t, int32_t)                                                  \
+  DEFCTZ32 (uint8_t, int16_t)                                                  \
+  DEFCTZ32 (uint8_t, int8_t)                                                   \
+  DEFCTZ32 (int8_t, uint64_t)                                                  \
+  DEFCTZ32 (int8_t, uint32_t)                                                  \
+  DEFCTZ32 (int8_t, uint16_t)                                                  \
+  DEFCTZ32 (int8_t, uint8_t)                                                   \
+  DEFCTZ32 (int8_t, int64_t)                                                   \
+  DEFCTZ32 (int8_t, int32_t)                                                   \
+  DEFCTZ32 (int8_t, int16_t)                                                   \
+  DEFCTZ32 (int8_t, int8_t)                                                    \
+  DEFFFS64 (uint64_t, uint64_t)                                                \
+  DEFFFS64 (uint64_t, uint32_t)                                                \
+  DEFFFS64 (uint64_t, uint16_t)                                                \
+  DEFFFS64 (uint64_t, uint8_t)                                                 \
+  DEFFFS64 (uint64_t, int64_t)                                                 \
+  DEFFFS64 (uint64_t, int32_t)                                                 \
+  DEFFFS64 (uint64_t, int16_t)                                                 \
+  DEFFFS64 (uint64_t, int8_t)                                                  \
+  DEFFFS64 (int64_t, uint64_t)                                                 \
+  DEFFFS64 (int64_t, uint32_t)                                                 \
+  DEFFFS64 (int64_t, uint16_t)                                                 \
+  DEFFFS64 (int64_t, uint8_t)                                                  \
+  DEFFFS64 (int64_t, int64_t)                                                  \
+  DEFFFS64 (int64_t, int32_t)                                                  \
+  DEFFFS64 (int64_t, int16_t)                                                  \
+  DEFFFS64 (int64_t, int8_t)                                                   \
+  DEFFFS64 (uint32_t, uint64_t)                                                \
+  DEFFFS64 (uint32_t, uint32_t)                                                \
+  DEFFFS64 (uint32_t, uint16_t)                                                \
+  DEFFFS64 (uint32_t, uint8_t)                                                 \
+  DEFFFS64 (uint32_t, int64_t)                                                 \
+  DEFFFS64 (uint32_t, int32_t)                                                 \
+  DEFFFS64 (uint32_t, int16_t)                                                 \
+  DEFFFS64 (uint32_t, int8_t)                                                  \
+  DEFFFS64 (int32_t, uint64_t)                                                 \
+  DEFFFS64 (int32_t, uint32_t)                                                 \
+  DEFFFS64 (int32_t, uint16_t)                                                 \
+  DEFFFS64 (int32_t, uint8_t)                                                  \
+  DEFFFS64 (int32_t, int64_t)                                                  \
+  DEFFFS64 (int32_t, int32_t)                                                  \
+  DEFFFS64 (int32_t, int16_t)                                                  \
+  DEFFFS64 (int32_t, int8_t)                                                   \
+  DEFFFS64 (uint16_t, uint64_t)                                                \
+  DEFFFS64 (uint16_t, uint32_t)                                                \
+  DEFFFS64 (uint16_t, uint16_t)                                                \
+  DEFFFS64 (uint16_t, uint8_t)                                                 \
+  DEFFFS64 (uint16_t, int64_t)                                                 \
+  DEFFFS64 (uint16_t, int32_t)                                                 \
+  DEFFFS64 (uint16_t, int16_t)                                                 \
+  DEFFFS64 (uint16_t, int8_t)                                                  \
+  DEFFFS64 (int16_t, uint64_t)                                                 \
+  DEFFFS64 (int16_t, uint32_t)                                                 \
+  DEFFFS64 (int16_t, uint16_t)                                                 \
+  DEFFFS64 (int16_t, uint8_t)                                                  \
+  DEFFFS64 (int16_t, int64_t)                                                  \
+  DEFFFS64 (int16_t, int32_t)                                                  \
+  DEFFFS64 (int16_t, int16_t)                                                  \
+  DEFFFS64 (int16_t, int8_t)                                                   \
+  DEFFFS64 (uint8_t, uint64_t)                                                 \
+  DEFFFS64 (uint8_t, uint32_t)                                                 \
+  DEFFFS64 (uint8_t, uint16_t)                                                 \
+  DEFFFS64 (uint8_t, uint8_t)                                                  \
+  DEFFFS64 (uint8_t, int64_t)                                                  \
+  DEFFFS64 (uint8_t, int32_t)                                                  \
+  DEFFFS64 (uint8_t, int16_t)                                                  \
+  DEFFFS64 (uint8_t, int8_t)                                                   \
+  DEFFFS64 (int8_t, uint64_t)                                                  \
+  DEFFFS64 (int8_t, uint32_t)                                                  \
+  DEFFFS64 (int8_t, uint16_t)                                                  \
+  DEFFFS64 (int8_t, uint8_t)                                                   \
+  DEFFFS64 (int8_t, int64_t)                                                   \
+  DEFFFS64 (int8_t, int32_t)                                                   \
+  DEFFFS64 (int8_t, int16_t)                                                   \
+  DEFFFS64 (int8_t, int8_t)                                                    \
+  DEFFFS32 (uint64_t, uint64_t)                                                \
+  DEFFFS32 (uint64_t, uint32_t)                                                \
+  DEFFFS32 (uint64_t, uint16_t)                                                \
+  DEFFFS32 (uint64_t, uint8_t)                                                 \
+  DEFFFS32 (uint64_t, int64_t)                                                 \
+  DEFFFS32 (uint64_t, int32_t)                                                 \
+  DEFFFS32 (uint64_t, int16_t)                                                 \
+  DEFFFS32 (uint64_t, int8_t)                                                  \
+  DEFFFS32 (int64_t, uint64_t)                                                 \
+  DEFFFS32 (int64_t, uint32_t)                                                 \
+  DEFFFS32 (int64_t, uint16_t)                                                 \
+  DEFFFS32 (int64_t, uint8_t)                                                  \
+  DEFFFS32 (int64_t, int64_t)                                                  \
+  DEFFFS32 (int64_t, int32_t)                                                  \
+  DEFFFS32 (int64_t, int16_t)                                                  \
+  DEFFFS32 (int64_t, int8_t)                                                   \
+  DEFFFS32 (uint32_t, uint64_t)                                                \
+  DEFFFS32 (uint32_t, uint32_t)                                                \
+  DEFFFS32 (uint32_t, uint16_t)                                                \
+  DEFFFS32 (uint32_t, uint8_t)                                                 \
+  DEFFFS32 (uint32_t, int64_t)                                                 \
+  DEFFFS32 (uint32_t, int32_t)                                                 \
+  DEFFFS32 (uint32_t, int16_t)                                                 \
+  DEFFFS32 (uint32_t, int8_t)                                                  \
+  DEFFFS32 (int32_t, uint64_t)                                                 \
+  DEFFFS32 (int32_t, uint32_t)                                                 \
+  DEFFFS32 (int32_t, uint16_t)                                                 \
+  DEFFFS32 (int32_t, uint8_t)                                                  \
+  DEFFFS32 (int32_t, int64_t)                                                  \
+  DEFFFS32 (int32_t, int32_t)                                                  \
+  DEFFFS32 (int32_t, int16_t)                                                  \
+  DEFFFS32 (int32_t, int8_t)                                                   \
+  DEFFFS32 (uint16_t, uint64_t)                                                \
+  DEFFFS32 (uint16_t, uint32_t)                                                \
+  DEFFFS32 (uint16_t, uint16_t)                                                \
+  DEFFFS32 (uint16_t, uint8_t)                                                 \
+  DEFFFS32 (uint16_t, int64_t)                                                 \
+  DEFFFS32 (uint16_t, int32_t)                                                 \
+  DEFFFS32 (uint16_t, int16_t)                                                 \
+  DEFFFS32 (uint16_t, int8_t)                                                  \
+  DEFFFS32 (int16_t, uint64_t)                                                 \
+  DEFFFS32 (int16_t, uint32_t)                                                 \
+  DEFFFS32 (int16_t, uint16_t)                                                 \
+  DEFFFS32 (int16_t, uint8_t)                                                  \
+  DEFFFS32 (int16_t, int64_t)                                                  \
+  DEFFFS32 (int16_t, int32_t)                                                  \
+  DEFFFS32 (int16_t, int16_t)                                                  \
+  DEFFFS32 (int16_t, int8_t)                                                   \
+  DEFFFS32 (uint8_t, uint64_t)                                                 \
+  DEFFFS32 (uint8_t, uint32_t)                                                 \
+  DEFFFS32 (uint8_t, uint16_t)                                                 \
+  DEFFFS32 (uint8_t, uint8_t)                                                  \
+  DEFFFS32 (uint8_t, int64_t)                                                  \
+  DEFFFS32 (uint8_t, int32_t)                                                  \
+  DEFFFS32 (uint8_t, int16_t)                                                  \
+  DEFFFS32 (uint8_t, int8_t)                                                   \
+  DEFFFS32 (int8_t, uint64_t)                                                  \
+  DEFFFS32 (int8_t, uint32_t)                                                  \
+  DEFFFS32 (int8_t, uint16_t)                                                  \
+  DEFFFS32 (int8_t, uint8_t)                                                   \
+  DEFFFS32 (int8_t, int64_t)                                                   \
+  DEFFFS32 (int8_t, int32_t)                                                   \
+  DEFFFS32 (int8_t, int16_t)                                                   \
+  DEFFFS32 (int8_t, int8_t)
+
+DEF_ALL ()
+
+#define SZ 512
+
+#define TEST64(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test64_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST64N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test64n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount64_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcountll (src[i]));                      \
+      }                                                                        \
+  }
+
+#define TEST32(TYPEDST, TYPESRC)                                               \
+  void __attribute__ ((optimize ("0"))) test32_##TYPEDST##TYPESRC ()           \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TEST32N(TYPEDST, TYPESRC)                                              \
+  void __attribute__ ((optimize ("0"))) test32n_##TYPEDST##TYPESRC ()          \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    popcount32_##TYPEDST##TYPESRC (dst, src, SZ);                              \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_popcount (src[i]));                        \
+      }                                                                        \
+  }
+
+#define TESTCTZ64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctzll (src[i]));                         \
+      }                                                                        \
+  }
+
+#define TESTCTZ32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testctz32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTCTZ32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testctz32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ctz32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	if (src[i] != 0)                                                       \
+	  assert (dst[i] == __builtin_ctz (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs64_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567890;                                              \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS64N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs64n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567890;                                             \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs64_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffsll (src[i]));                           \
+      }                                                                        \
+  }
+
+#define TESTFFS32(TYPEDST, TYPESRC)                                            \
+  void __attribute__ ((optimize ("0"))) testffs32_##TYPEDST##TYPESRC ()        \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * 1234567;                                                 \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TESTFFS32N(TYPEDST, TYPESRC)                                           \
+  void __attribute__ ((optimize ("0"))) testffs32n_##TYPEDST##TYPESRC ()       \
+  {                                                                            \
+    TYPESRC src[SZ];                                                           \
+    TYPEDST dst[SZ];                                                           \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	int ia = i + 1;                                                        \
+	src[i] = ia * -1234567;                                                \
+	dst[i] = 0;                                                            \
+      }                                                                        \
+    ffs32_##TYPEDST##TYPESRC (dst, src, SZ);                                   \
+    for (int i = 0; i < SZ; i++)                                               \
+      {                                                                        \
+	assert (dst[i] == __builtin_ffs (src[i]));                             \
+      }                                                                        \
+  }
+
+#define TEST_ALL()                                                             \
+  TEST64 (uint64_t, uint64_t)                                                  \
+  TEST64 (uint64_t, uint32_t)                                                  \
+  TEST64 (uint64_t, uint16_t)                                                  \
+  TEST64 (uint64_t, uint8_t)                                                   \
+  TEST64 (uint64_t, int64_t)                                                   \
+  TEST64 (uint64_t, int32_t)                                                   \
+  TEST64 (uint64_t, int16_t)                                                   \
+  TEST64 (uint64_t, int8_t)                                                    \
+  TEST64N (int64_t, uint64_t)                                                  \
+  TEST64N (int64_t, uint32_t)                                                  \
+  TEST64N (int64_t, uint16_t)                                                  \
+  TEST64N (int64_t, uint8_t)                                                   \
+  TEST64N (int64_t, int64_t)                                                   \
+  TEST64N (int64_t, int32_t)                                                   \
+  TEST64N (int64_t, int16_t)                                                   \
+  TEST64N (int64_t, int8_t)                                                    \
+  TEST64 (uint32_t, uint64_t)                                                  \
+  TEST64 (uint32_t, uint32_t)                                                  \
+  TEST64 (uint32_t, uint16_t)                                                  \
+  TEST64 (uint32_t, uint8_t)                                                   \
+  TEST64 (uint32_t, int64_t)                                                   \
+  TEST64 (uint32_t, int32_t)                                                   \
+  TEST64 (uint32_t, int16_t)                                                   \
+  TEST64 (uint32_t, int8_t)                                                    \
+  TEST64N (int32_t, uint64_t)                                                  \
+  TEST64N (int32_t, uint32_t)                                                  \
+  TEST64N (int32_t, uint16_t)                                                  \
+  TEST64N (int32_t, uint8_t)                                                   \
+  TEST64N (int32_t, int64_t)                                                   \
+  TEST64N (int32_t, int32_t)                                                   \
+  TEST64N (int32_t, int16_t)                                                   \
+  TEST64N (int32_t, int8_t)                                                    \
+  TEST64 (uint16_t, uint64_t)                                                  \
+  TEST64 (uint16_t, uint32_t)                                                  \
+  TEST64 (uint16_t, uint16_t)                                                  \
+  TEST64 (uint16_t, uint8_t)                                                   \
+  TEST64 (uint16_t, int64_t)                                                   \
+  TEST64 (uint16_t, int32_t)                                                   \
+  TEST64 (uint16_t, int16_t)                                                   \
+  TEST64 (uint16_t, int8_t)                                                    \
+  TEST64N (int16_t, uint64_t)                                                  \
+  TEST64N (int16_t, uint32_t)                                                  \
+  TEST64N (int16_t, uint16_t)                                                  \
+  TEST64N (int16_t, uint8_t)                                                   \
+  TEST64N (int16_t, int64_t)                                                   \
+  TEST64N (int16_t, int32_t)                                                   \
+  TEST64N (int16_t, int16_t)                                                   \
+  TEST64N (int16_t, int8_t)                                                    \
+  TEST64 (uint8_t, uint64_t)                                                   \
+  TEST64 (uint8_t, uint32_t)                                                   \
+  TEST64 (uint8_t, uint16_t)                                                   \
+  TEST64 (uint8_t, uint8_t)                                                    \
+  TEST64 (uint8_t, int64_t)                                                    \
+  TEST64 (uint8_t, int32_t)                                                    \
+  TEST64 (uint8_t, int16_t)                                                    \
+  TEST64 (uint8_t, int8_t)                                                     \
+  TEST64N (int8_t, uint64_t)                                                   \
+  TEST64N (int8_t, uint32_t)                                                   \
+  TEST64N (int8_t, uint16_t)                                                   \
+  TEST64N (int8_t, uint8_t)                                                    \
+  TEST64N (int8_t, int64_t)                                                    \
+  TEST64N (int8_t, int32_t)                                                    \
+  TEST64N (int8_t, int16_t)                                                    \
+  TEST64N (int8_t, int8_t)                                                     \
+  TEST32 (uint64_t, uint64_t)                                                  \
+  TEST32 (uint64_t, uint32_t)                                                  \
+  TEST32 (uint64_t, uint16_t)                                                  \
+  TEST32 (uint64_t, uint8_t)                                                   \
+  TEST32 (uint64_t, int64_t)                                                   \
+  TEST32 (uint64_t, int32_t)                                                   \
+  TEST32 (uint64_t, int16_t)                                                   \
+  TEST32 (uint64_t, int8_t)                                                    \
+  TEST32N (int64_t, uint64_t)                                                  \
+  TEST32N (int64_t, uint32_t)                                                  \
+  TEST32N (int64_t, uint16_t)                                                  \
+  TEST32N (int64_t, uint8_t)                                                   \
+  TEST32N (int64_t, int64_t)                                                   \
+  TEST32N (int64_t, int32_t)                                                   \
+  TEST32N (int64_t, int16_t)                                                   \
+  TEST32N (int64_t, int8_t)                                                    \
+  TEST32 (uint32_t, uint64_t)                                                  \
+  TEST32 (uint32_t, uint32_t)                                                  \
+  TEST32 (uint32_t, uint16_t)                                                  \
+  TEST32 (uint32_t, uint8_t)                                                   \
+  TEST32 (uint32_t, int64_t)                                                   \
+  TEST32 (uint32_t, int32_t)                                                   \
+  TEST32 (uint32_t, int16_t)                                                   \
+  TEST32 (uint32_t, int8_t)                                                    \
+  TEST32N (int32_t, uint64_t)                                                  \
+  TEST32N (int32_t, uint32_t)                                                  \
+  TEST32N (int32_t, uint16_t)                                                  \
+  TEST32N (int32_t, uint8_t)                                                   \
+  TEST32N (int32_t, int64_t)                                                   \
+  TEST32N (int32_t, int32_t)                                                   \
+  TEST32N (int32_t, int16_t)                                                   \
+  TEST32N (int32_t, int8_t)                                                    \
+  TEST32 (uint16_t, uint64_t)                                                  \
+  TEST32 (uint16_t, uint32_t)                                                  \
+  TEST32 (uint16_t, uint16_t)                                                  \
+  TEST32 (uint16_t, uint8_t)                                                   \
+  TEST32 (uint16_t, int64_t)                                                   \
+  TEST32 (uint16_t, int32_t)                                                   \
+  TEST32 (uint16_t, int16_t)                                                   \
+  TEST32 (uint16_t, int8_t)                                                    \
+  TEST32N (int16_t, uint64_t)                                                  \
+  TEST32N (int16_t, uint32_t)                                                  \
+  TEST32N (int16_t, uint16_t)                                                  \
+  TEST32N (int16_t, uint8_t)                                                   \
+  TEST32N (int16_t, int64_t)                                                   \
+  TEST32N (int16_t, int32_t)                                                   \
+  TEST32N (int16_t, int16_t)                                                   \
+  TEST32N (int16_t, int8_t)                                                    \
+  TEST32 (uint8_t, uint64_t)                                                   \
+  TEST32 (uint8_t, uint32_t)                                                   \
+  TEST32 (uint8_t, uint16_t)                                                   \
+  TEST32 (uint8_t, uint8_t)                                                    \
+  TEST32 (uint8_t, int64_t)                                                    \
+  TEST32 (uint8_t, int32_t)                                                    \
+  TEST32 (uint8_t, int16_t)                                                    \
+  TEST32 (uint8_t, int8_t)                                                     \
+  TEST32N (int8_t, uint64_t)                                                   \
+  TEST32N (int8_t, uint32_t)                                                   \
+  TEST32N (int8_t, uint16_t)                                                   \
+  TEST32N (int8_t, uint8_t)                                                    \
+  TEST32N (int8_t, int64_t)                                                    \
+  TEST32N (int8_t, int32_t)                                                    \
+  TEST32N (int8_t, int16_t)                                                    \
+  TEST32N (int8_t, int8_t)                                                     \
+  TESTCTZ64 (uint64_t, uint64_t)                                               \
+  TESTCTZ64 (uint64_t, uint32_t)                                               \
+  TESTCTZ64 (uint64_t, uint16_t)                                               \
+  TESTCTZ64 (uint64_t, uint8_t)                                                \
+  TESTCTZ64 (uint64_t, int64_t)                                                \
+  TESTCTZ64 (uint64_t, int32_t)                                                \
+  TESTCTZ64 (uint64_t, int16_t)                                                \
+  TESTCTZ64 (uint64_t, int8_t)                                                 \
+  TESTCTZ64N (int64_t, uint64_t)                                               \
+  TESTCTZ64N (int64_t, uint32_t)                                               \
+  TESTCTZ64N (int64_t, uint16_t)                                               \
+  TESTCTZ64N (int64_t, uint8_t)                                                \
+  TESTCTZ64N (int64_t, int64_t)                                                \
+  TESTCTZ64N (int64_t, int32_t)                                                \
+  TESTCTZ64N (int64_t, int16_t)                                                \
+  TESTCTZ64N (int64_t, int8_t)                                                 \
+  TESTCTZ64 (uint32_t, uint64_t)                                               \
+  TESTCTZ64 (uint32_t, uint32_t)                                               \
+  TESTCTZ64 (uint32_t, uint16_t)                                               \
+  TESTCTZ64 (uint32_t, uint8_t)                                                \
+  TESTCTZ64 (uint32_t, int64_t)                                                \
+  TESTCTZ64 (uint32_t, int32_t)                                                \
+  TESTCTZ64 (uint32_t, int16_t)                                                \
+  TESTCTZ64 (uint32_t, int8_t)                                                 \
+  TESTCTZ64N (int32_t, uint64_t)                                               \
+  TESTCTZ64N (int32_t, uint32_t)                                               \
+  TESTCTZ64N (int32_t, uint16_t)                                               \
+  TESTCTZ64N (int32_t, uint8_t)                                                \
+  TESTCTZ64N (int32_t, int64_t)                                                \
+  TESTCTZ64N (int32_t, int32_t)                                                \
+  TESTCTZ64N (int32_t, int16_t)                                                \
+  TESTCTZ64N (int32_t, int8_t)                                                 \
+  TESTCTZ64 (uint16_t, uint64_t)                                               \
+  TESTCTZ64 (uint16_t, uint32_t)                                               \
+  TESTCTZ64 (uint16_t, uint16_t)                                               \
+  TESTCTZ64 (uint16_t, uint8_t)                                                \
+  TESTCTZ64 (uint16_t, int64_t)                                                \
+  TESTCTZ64 (uint16_t, int32_t)                                                \
+  TESTCTZ64 (uint16_t, int16_t)                                                \
+  TESTCTZ64 (uint16_t, int8_t)                                                 \
+  TESTCTZ64N (int16_t, uint64_t)                                               \
+  TESTCTZ64N (int16_t, uint32_t)                                               \
+  TESTCTZ64N (int16_t, uint16_t)                                               \
+  TESTCTZ64N (int16_t, uint8_t)                                                \
+  TESTCTZ64N (int16_t, int64_t)                                                \
+  TESTCTZ64N (int16_t, int32_t)                                                \
+  TESTCTZ64N (int16_t, int16_t)                                                \
+  TESTCTZ64N (int16_t, int8_t)                                                 \
+  TESTCTZ64 (uint8_t, uint64_t)                                                \
+  TESTCTZ64 (uint8_t, uint32_t)                                                \
+  TESTCTZ64 (uint8_t, uint16_t)                                                \
+  TESTCTZ64 (uint8_t, uint8_t)                                                 \
+  TESTCTZ64 (uint8_t, int64_t)                                                 \
+  TESTCTZ64 (uint8_t, int32_t)                                                 \
+  TESTCTZ64 (uint8_t, int16_t)                                                 \
+  TESTCTZ64 (uint8_t, int8_t)                                                  \
+  TESTCTZ64N (int8_t, uint64_t)                                                \
+  TESTCTZ64N (int8_t, uint32_t)                                                \
+  TESTCTZ64N (int8_t, uint16_t)                                                \
+  TESTCTZ64N (int8_t, uint8_t)                                                 \
+  TESTCTZ64N (int8_t, int64_t)                                                 \
+  TESTCTZ64N (int8_t, int32_t)                                                 \
+  TESTCTZ64N (int8_t, int16_t)                                                 \
+  TESTCTZ64N (int8_t, int8_t)                                                  \
+  TESTCTZ32 (uint64_t, uint64_t)                                               \
+  TESTCTZ32 (uint64_t, uint32_t)                                               \
+  TESTCTZ32 (uint64_t, uint16_t)                                               \
+  TESTCTZ32 (uint64_t, uint8_t)                                                \
+  TESTCTZ32 (uint64_t, int64_t)                                                \
+  TESTCTZ32 (uint64_t, int32_t)                                                \
+  TESTCTZ32 (uint64_t, int16_t)                                                \
+  TESTCTZ32 (uint64_t, int8_t)                                                 \
+  TESTCTZ32N (int64_t, uint64_t)                                               \
+  TESTCTZ32N (int64_t, uint32_t)                                               \
+  TESTCTZ32N (int64_t, uint16_t)                                               \
+  TESTCTZ32N (int64_t, uint8_t)                                                \
+  TESTCTZ32N (int64_t, int64_t)                                                \
+  TESTCTZ32N (int64_t, int32_t)                                                \
+  TESTCTZ32N (int64_t, int16_t)                                                \
+  TESTCTZ32N (int64_t, int8_t)                                                 \
+  TESTCTZ32 (uint32_t, uint64_t)                                               \
+  TESTCTZ32 (uint32_t, uint32_t)                                               \
+  TESTCTZ32 (uint32_t, uint16_t)                                               \
+  TESTCTZ32 (uint32_t, uint8_t)                                                \
+  TESTCTZ32 (uint32_t, int64_t)                                                \
+  TESTCTZ32 (uint32_t, int32_t)                                                \
+  TESTCTZ32 (uint32_t, int16_t)                                                \
+  TESTCTZ32 (uint32_t, int8_t)                                                 \
+  TESTCTZ32N (int32_t, uint64_t)                                               \
+  TESTCTZ32N (int32_t, uint32_t)                                               \
+  TESTCTZ32N (int32_t, uint16_t)                                               \
+  TESTCTZ32N (int32_t, uint8_t)                                                \
+  TESTCTZ32N (int32_t, int64_t)                                                \
+  TESTCTZ32N (int32_t, int32_t)                                                \
+  TESTCTZ32N (int32_t, int16_t)                                                \
+  TESTCTZ32N (int32_t, int8_t)                                                 \
+  TESTCTZ32 (uint16_t, uint64_t)                                               \
+  TESTCTZ32 (uint16_t, uint32_t)                                               \
+  TESTCTZ32 (uint16_t, uint16_t)                                               \
+  TESTCTZ32 (uint16_t, uint8_t)                                                \
+  TESTCTZ32 (uint16_t, int64_t)                                                \
+  TESTCTZ32 (uint16_t, int32_t)                                                \
+  TESTCTZ32 (uint16_t, int16_t)                                                \
+  TESTCTZ32 (uint16_t, int8_t)                                                 \
+  TESTCTZ32N (int16_t, uint64_t)                                               \
+  TESTCTZ32N (int16_t, uint32_t)                                               \
+  TESTCTZ32N (int16_t, uint16_t)                                               \
+  TESTCTZ32N (int16_t, uint8_t)                                                \
+  TESTCTZ32N (int16_t, int64_t)                                                \
+  TESTCTZ32N (int16_t, int32_t)                                                \
+  TESTCTZ32N (int16_t, int16_t)                                                \
+  TESTCTZ32N (int16_t, int8_t)                                                 \
+  TESTCTZ32 (uint8_t, uint64_t)                                                \
+  TESTCTZ32 (uint8_t, uint32_t)                                                \
+  TESTCTZ32 (uint8_t, uint16_t)                                                \
+  TESTCTZ32 (uint8_t, uint8_t)                                                 \
+  TESTCTZ32 (uint8_t, int64_t)                                                 \
+  TESTCTZ32 (uint8_t, int32_t)                                                 \
+  TESTCTZ32 (uint8_t, int16_t)                                                 \
+  TESTCTZ32 (uint8_t, int8_t)                                                  \
+  TESTCTZ32N (int8_t, uint64_t)                                                \
+  TESTCTZ32N (int8_t, uint32_t)                                                \
+  TESTCTZ32N (int8_t, uint16_t)                                                \
+  TESTCTZ32N (int8_t, uint8_t)                                                 \
+  TESTCTZ32N (int8_t, int64_t)                                                 \
+  TESTCTZ32N (int8_t, int32_t)                                                 \
+  TESTCTZ32N (int8_t, int16_t)                                                 \
+  TESTCTZ32N (int8_t, int8_t)                                                  \
+  TESTFFS64 (uint64_t, uint64_t)                                               \
+  TESTFFS64 (uint64_t, uint32_t)                                               \
+  TESTFFS64 (uint64_t, uint16_t)                                               \
+  TESTFFS64 (uint64_t, uint8_t)                                                \
+  TESTFFS64 (uint64_t, int64_t)                                                \
+  TESTFFS64 (uint64_t, int32_t)                                                \
+  TESTFFS64 (uint64_t, int16_t)                                                \
+  TESTFFS64 (uint64_t, int8_t)                                                 \
+  TESTFFS64N (int64_t, uint64_t)                                               \
+  TESTFFS64N (int64_t, uint32_t)                                               \
+  TESTFFS64N (int64_t, uint16_t)                                               \
+  TESTFFS64N (int64_t, uint8_t)                                                \
+  TESTFFS64N (int64_t, int64_t)                                                \
+  TESTFFS64N (int64_t, int32_t)                                                \
+  TESTFFS64N (int64_t, int16_t)                                                \
+  TESTFFS64N (int64_t, int8_t)                                                 \
+  TESTFFS64 (uint32_t, uint64_t)                                               \
+  TESTFFS64 (uint32_t, uint32_t)                                               \
+  TESTFFS64 (uint32_t, uint16_t)                                               \
+  TESTFFS64 (uint32_t, uint8_t)                                                \
+  TESTFFS64 (uint32_t, int64_t)                                                \
+  TESTFFS64 (uint32_t, int32_t)                                                \
+  TESTFFS64 (uint32_t, int16_t)                                                \
+  TESTFFS64 (uint32_t, int8_t)                                                 \
+  TESTFFS64N (int32_t, uint64_t)                                               \
+  TESTFFS64N (int32_t, uint32_t)                                               \
+  TESTFFS64N (int32_t, uint16_t)                                               \
+  TESTFFS64N (int32_t, uint8_t)                                                \
+  TESTFFS64N (int32_t, int64_t)                                                \
+  TESTFFS64N (int32_t, int32_t)                                                \
+  TESTFFS64N (int32_t, int16_t)                                                \
+  TESTFFS64N (int32_t, int8_t)                                                 \
+  TESTFFS64 (uint16_t, uint64_t)                                               \
+  TESTFFS64 (uint16_t, uint32_t)                                               \
+  TESTFFS64 (uint16_t, uint16_t)                                               \
+  TESTFFS64 (uint16_t, uint8_t)                                                \
+  TESTFFS64 (uint16_t, int64_t)                                                \
+  TESTFFS64 (uint16_t, int32_t)                                                \
+  TESTFFS64 (uint16_t, int16_t)                                                \
+  TESTFFS64 (uint16_t, int8_t)                                                 \
+  TESTFFS64N (int16_t, uint64_t)                                               \
+  TESTFFS64N (int16_t, uint32_t)                                               \
+  TESTFFS64N (int16_t, uint16_t)                                               \
+  TESTFFS64N (int16_t, uint8_t)                                                \
+  TESTFFS64N (int16_t, int64_t)                                                \
+  TESTFFS64N (int16_t, int32_t)                                                \
+  TESTFFS64N (int16_t, int16_t)                                                \
+  TESTFFS64N (int16_t, int8_t)                                                 \
+  TESTFFS64 (uint8_t, uint64_t)                                                \
+  TESTFFS64 (uint8_t, uint32_t)                                                \
+  TESTFFS64 (uint8_t, uint16_t)                                                \
+  TESTFFS64 (uint8_t, uint8_t)                                                 \
+  TESTFFS64 (uint8_t, int64_t)                                                 \
+  TESTFFS64 (uint8_t, int32_t)                                                 \
+  TESTFFS64 (uint8_t, int16_t)                                                 \
+  TESTFFS64 (uint8_t, int8_t)                                                  \
+  TESTFFS64N (int8_t, uint64_t)                                                \
+  TESTFFS64N (int8_t, uint32_t)                                                \
+  TESTFFS64N (int8_t, uint16_t)                                                \
+  TESTFFS64N (int8_t, uint8_t)                                                 \
+  TESTFFS64N (int8_t, int64_t)                                                 \
+  TESTFFS64N (int8_t, int32_t)                                                 \
+  TESTFFS64N (int8_t, int16_t)                                                 \
+  TESTFFS64N (int8_t, int8_t)                                                  \
+  TESTFFS32 (uint64_t, uint64_t)                                               \
+  TESTFFS32 (uint64_t, uint32_t)                                               \
+  TESTFFS32 (uint64_t, uint16_t)                                               \
+  TESTFFS32 (uint64_t, uint8_t)                                                \
+  TESTFFS32 (uint64_t, int64_t)                                                \
+  TESTFFS32 (uint64_t, int32_t)                                                \
+  TESTFFS32 (uint64_t, int16_t)                                                \
+  TESTFFS32 (uint64_t, int8_t)                                                 \
+  TESTFFS32N (int64_t, uint64_t)                                               \
+  TESTFFS32N (int64_t, uint32_t)                                               \
+  TESTFFS32N (int64_t, uint16_t)                                               \
+  TESTFFS32N (int64_t, uint8_t)                                                \
+  TESTFFS32N (int64_t, int64_t)                                                \
+  TESTFFS32N (int64_t, int32_t)                                                \
+  TESTFFS32N (int64_t, int16_t)                                                \
+  TESTFFS32N (int64_t, int8_t)                                                 \
+  TESTFFS32 (uint32_t, uint64_t)                                               \
+  TESTFFS32 (uint32_t, uint32_t)                                               \
+  TESTFFS32 (uint32_t, uint16_t)                                               \
+  TESTFFS32 (uint32_t, uint8_t)                                                \
+  TESTFFS32 (uint32_t, int64_t)                                                \
+  TESTFFS32 (uint32_t, int32_t)                                                \
+  TESTFFS32 (uint32_t, int16_t)                                                \
+  TESTFFS32 (uint32_t, int8_t)                                                 \
+  TESTFFS32N (int32_t, uint64_t)                                               \
+  TESTFFS32N (int32_t, uint32_t)                                               \
+  TESTFFS32N (int32_t, uint16_t)                                               \
+  TESTFFS32N (int32_t, uint8_t)                                                \
+  TESTFFS32N (int32_t, int64_t)                                                \
+  TESTFFS32N (int32_t, int32_t)                                                \
+  TESTFFS32N (int32_t, int16_t)                                                \
+  TESTFFS32N (int32_t, int8_t)                                                 \
+  TESTFFS32 (uint16_t, uint64_t)                                               \
+  TESTFFS32 (uint16_t, uint32_t)                                               \
+  TESTFFS32 (uint16_t, uint16_t)                                               \
+  TESTFFS32 (uint16_t, uint8_t)                                                \
+  TESTFFS32 (uint16_t, int64_t)                                                \
+  TESTFFS32 (uint16_t, int32_t)                                                \
+  TESTFFS32 (uint16_t, int16_t)                                                \
+  TESTFFS32 (uint16_t, int8_t)                                                 \
+  TESTFFS32N (int16_t, uint64_t)                                               \
+  TESTFFS32N (int16_t, uint32_t)                                               \
+  TESTFFS32N (int16_t, uint16_t)                                               \
+  TESTFFS32N (int16_t, uint8_t)                                                \
+  TESTFFS32N (int16_t, int64_t)                                                \
+  TESTFFS32N (int16_t, int32_t)                                                \
+  TESTFFS32N (int16_t, int16_t)                                                \
+  TESTFFS32N (int16_t, int8_t)                                                 \
+  TESTFFS32 (uint8_t, uint64_t)                                                \
+  TESTFFS32 (uint8_t, uint32_t)                                                \
+  TESTFFS32 (uint8_t, uint16_t)                                                \
+  TESTFFS32 (uint8_t, uint8_t)                                                 \
+  TESTFFS32 (uint8_t, int64_t)                                                 \
+  TESTFFS32 (uint8_t, int32_t)                                                 \
+  TESTFFS32 (uint8_t, int16_t)                                                 \
+  TESTFFS32 (uint8_t, int8_t)                                                  \
+  TESTFFS32N (int8_t, uint64_t)                                                \
+  TESTFFS32N (int8_t, uint32_t)                                                \
+  TESTFFS32N (int8_t, uint16_t)                                                \
+  TESTFFS32N (int8_t, uint8_t)                                                 \
+  TESTFFS32N (int8_t, int64_t)                                                 \
+  TESTFFS32N (int8_t, int32_t)                                                 \
+  TESTFFS32N (int8_t, int16_t)                                                 \
+  TESTFFS32N (int8_t, int8_t)
+
+TEST_ALL ()
+
+#define RUN64(TYPEDST, TYPESRC) test64_##TYPEDST##TYPESRC ();
+#define RUN64N(TYPEDST, TYPESRC) test64n_##TYPEDST##TYPESRC ();
+#define RUN32(TYPEDST, TYPESRC) test32_##TYPEDST##TYPESRC ();
+#define RUN32N(TYPEDST, TYPESRC) test32n_##TYPEDST##TYPESRC ();
+#define RUNCTZ64(TYPEDST, TYPESRC) testctz64_##TYPEDST##TYPESRC ();
+#define RUNCTZ64N(TYPEDST, TYPESRC) testctz64n_##TYPEDST##TYPESRC ();
+#define RUNCTZ32(TYPEDST, TYPESRC) testctz32_##TYPEDST##TYPESRC ();
+#define RUNCTZ32N(TYPEDST, TYPESRC) testctz32n_##TYPEDST##TYPESRC ();
+#define RUNFFS64(TYPEDST, TYPESRC) testffs64_##TYPEDST##TYPESRC ();
+#define RUNFFS64N(TYPEDST, TYPESRC) testffs64n_##TYPEDST##TYPESRC ();
+#define RUNFFS32(TYPEDST, TYPESRC) testffs32_##TYPEDST##TYPESRC ();
+#define RUNFFS32N(TYPEDST, TYPESRC) testffs32n_##TYPEDST##TYPESRC ();
+
+#define RUN_ALL()                                                              \
+  RUN64 (uint64_t, uint64_t)                                                   \
+  RUN64 (uint64_t, uint32_t)                                                   \
+  RUN64 (uint64_t, uint16_t)                                                   \
+  RUN64 (uint64_t, uint8_t)                                                    \
+  RUN64 (uint64_t, int64_t)                                                    \
+  RUN64 (uint64_t, int32_t)                                                    \
+  RUN64 (uint64_t, int16_t)                                                    \
+  RUN64 (uint64_t, int8_t)                                                     \
+  RUN64N (int64_t, uint64_t)                                                    \
+  RUN64N (int64_t, uint32_t)                                                    \
+  RUN64N (int64_t, uint16_t)                                                    \
+  RUN64N (int64_t, uint8_t)                                                     \
+  RUN64N (int64_t, int64_t)                                                     \
+  RUN64N (int64_t, int32_t)                                                     \
+  RUN64N (int64_t, int16_t)                                                     \
+  RUN64N (int64_t, int8_t)                                                      \
+  RUN64 (uint32_t, uint64_t)                                                   \
+  RUN64 (uint32_t, uint32_t)                                                   \
+  RUN64 (uint32_t, uint16_t)                                                   \
+  RUN64 (uint32_t, uint8_t)                                                    \
+  RUN64 (uint32_t, int64_t)                                                    \
+  RUN64 (uint32_t, int32_t)                                                    \
+  RUN64 (uint32_t, int16_t)                                                    \
+  RUN64 (uint32_t, int8_t)                                                     \
+  RUN64N (int32_t, uint64_t)                                                    \
+  RUN64N (int32_t, uint32_t)                                                    \
+  RUN64N (int32_t, uint16_t)                                                    \
+  RUN64N (int32_t, uint8_t)                                                     \
+  RUN64N (int32_t, int64_t)                                                     \
+  RUN64N (int32_t, int32_t)                                                     \
+  RUN64N (int32_t, int16_t)                                                     \
+  RUN64N (int32_t, int8_t)                                                      \
+  RUN64 (uint16_t, uint64_t)                                                   \
+  RUN64 (uint16_t, uint32_t)                                                   \
+  RUN64 (uint16_t, uint16_t)                                                   \
+  RUN64 (uint16_t, uint8_t)                                                    \
+  RUN64 (uint16_t, int64_t)                                                    \
+  RUN64 (uint16_t, int32_t)                                                    \
+  RUN64 (uint16_t, int16_t)                                                    \
+  RUN64 (uint16_t, int8_t)                                                     \
+  RUN64N (int16_t, uint64_t)                                                    \
+  RUN64N (int16_t, uint32_t)                                                    \
+  RUN64N (int16_t, uint16_t)                                                    \
+  RUN64N (int16_t, uint8_t)                                                     \
+  RUN64N (int16_t, int64_t)                                                     \
+  RUN64N (int16_t, int32_t)                                                     \
+  RUN64N (int16_t, int16_t)                                                     \
+  RUN64N (int16_t, int8_t)                                                      \
+  RUN64 (uint8_t, uint64_t)                                                    \
+  RUN64 (uint8_t, uint32_t)                                                    \
+  RUN64 (uint8_t, uint16_t)                                                    \
+  RUN64 (uint8_t, uint8_t)                                                     \
+  RUN64 (uint8_t, int64_t)                                                     \
+  RUN64 (uint8_t, int32_t)                                                     \
+  RUN64 (uint8_t, int16_t)                                                     \
+  RUN64 (uint8_t, int8_t)                                                      \
+  RUN64N (int8_t, uint64_t)                                                     \
+  RUN64N (int8_t, uint32_t)                                                     \
+  RUN64N (int8_t, uint16_t)                                                     \
+  RUN64N (int8_t, uint8_t)                                                      \
+  RUN64N (int8_t, int64_t)                                                      \
+  RUN64N (int8_t, int32_t)                                                      \
+  RUN64N (int8_t, int16_t)                                                      \
+  RUN64N (int8_t, int8_t)                                                       \
+  RUN32 (uint64_t, uint64_t)                                                   \
+  RUN32 (uint64_t, uint32_t)                                                   \
+  RUN32 (uint64_t, uint16_t)                                                   \
+  RUN32 (uint64_t, uint8_t)                                                    \
+  RUN32 (uint64_t, int64_t)                                                    \
+  RUN32 (uint64_t, int32_t)                                                    \
+  RUN32 (uint64_t, int16_t)                                                    \
+  RUN32 (uint64_t, int8_t)                                                     \
+  RUN32N (int64_t, uint64_t)                                                    \
+  RUN32N (int64_t, uint32_t)                                                    \
+  RUN32N (int64_t, uint16_t)                                                    \
+  RUN32N (int64_t, uint8_t)                                                     \
+  RUN32N (int64_t, int64_t)                                                     \
+  RUN32N (int64_t, int32_t)                                                     \
+  RUN32N (int64_t, int16_t)                                                     \
+  RUN32N (int64_t, int8_t)                                                      \
+  RUN32 (uint32_t, uint64_t)                                                   \
+  RUN32 (uint32_t, uint32_t)                                                   \
+  RUN32 (uint32_t, uint16_t)                                                   \
+  RUN32 (uint32_t, uint8_t)                                                    \
+  RUN32 (uint32_t, int64_t)                                                    \
+  RUN32 (uint32_t, int32_t)                                                    \
+  RUN32 (uint32_t, int16_t)                                                    \
+  RUN32 (uint32_t, int8_t)                                                     \
+  RUN32N (int32_t, uint64_t)                                                    \
+  RUN32N (int32_t, uint32_t)                                                    \
+  RUN32N (int32_t, uint16_t)                                                    \
+  RUN32N (int32_t, uint8_t)                                                     \
+  RUN32N (int32_t, int64_t)                                                     \
+  RUN32N (int32_t, int32_t)                                                     \
+  RUN32N (int32_t, int16_t)                                                     \
+  RUN32N (int32_t, int8_t)                                                      \
+  RUN32 (uint16_t, uint64_t)                                                   \
+  RUN32 (uint16_t, uint32_t)                                                   \
+  RUN32 (uint16_t, uint16_t)                                                   \
+  RUN32 (uint16_t, uint8_t)                                                    \
+  RUN32 (uint16_t, int64_t)                                                    \
+  RUN32 (uint16_t, int32_t)                                                    \
+  RUN32 (uint16_t, int16_t)                                                    \
+  RUN32 (uint16_t, int8_t)                                                     \
+  RUN32N (int16_t, uint64_t)                                                    \
+  RUN32N (int16_t, uint32_t)                                                    \
+  RUN32N (int16_t, uint16_t)                                                    \
+  RUN32N (int16_t, uint8_t)                                                     \
+  RUN32N (int16_t, int64_t)                                                     \
+  RUN32N (int16_t, int32_t)                                                     \
+  RUN32N (int16_t, int16_t)                                                     \
+  RUN32N (int16_t, int8_t)                                                      \
+  RUN32 (uint8_t, uint64_t)                                                    \
+  RUN32 (uint8_t, uint32_t)                                                    \
+  RUN32 (uint8_t, uint16_t)                                                    \
+  RUN32 (uint8_t, uint8_t)                                                     \
+  RUN32 (uint8_t, int64_t)                                                     \
+  RUN32 (uint8_t, int32_t)                                                     \
+  RUN32 (uint8_t, int16_t)                                                     \
+  RUN32 (uint8_t, int8_t)                                                      \
+  RUN32N (int8_t, uint64_t)                                                     \
+  RUN32N (int8_t, uint32_t)                                                     \
+  RUN32N (int8_t, uint16_t)                                                     \
+  RUN32N (int8_t, uint8_t)                                                      \
+  RUN32N (int8_t, int64_t)                                                      \
+  RUN32N (int8_t, int32_t)                                                      \
+  RUN32N (int8_t, int16_t)                                                      \
+  RUN32N (int8_t, int8_t)                                                       \
+  RUNCTZ64 (uint64_t, uint64_t)                                                \
+  RUNCTZ64 (uint64_t, uint32_t)                                                \
+  RUNCTZ64 (uint64_t, uint16_t)                                                \
+  RUNCTZ64 (uint64_t, uint8_t)                                                 \
+  RUNCTZ64 (uint64_t, int64_t)                                                 \
+  RUNCTZ64 (uint64_t, int32_t)                                                 \
+  RUNCTZ64 (uint64_t, int16_t)                                                 \
+  RUNCTZ64 (uint64_t, int8_t)                                                  \
+  RUNCTZ64N (int64_t, uint64_t)                                                 \
+  RUNCTZ64N (int64_t, uint32_t)                                                 \
+  RUNCTZ64N (int64_t, uint16_t)                                                 \
+  RUNCTZ64N (int64_t, uint8_t)                                                  \
+  RUNCTZ64N (int64_t, int64_t)                                                  \
+  RUNCTZ64N (int64_t, int32_t)                                                  \
+  RUNCTZ64N (int64_t, int16_t)                                                  \
+  RUNCTZ64N (int64_t, int8_t)                                                   \
+  RUNCTZ64 (uint32_t, uint64_t)                                                \
+  RUNCTZ64 (uint32_t, uint32_t)                                                \
+  RUNCTZ64 (uint32_t, uint16_t)                                                \
+  RUNCTZ64 (uint32_t, uint8_t)                                                 \
+  RUNCTZ64 (uint32_t, int64_t)                                                 \
+  RUNCTZ64 (uint32_t, int32_t)                                                 \
+  RUNCTZ64 (uint32_t, int16_t)                                                 \
+  RUNCTZ64 (uint32_t, int8_t)                                                  \
+  RUNCTZ64N (int32_t, uint64_t)                                                 \
+  RUNCTZ64N (int32_t, uint32_t)                                                 \
+  RUNCTZ64N (int32_t, uint16_t)                                                 \
+  RUNCTZ64N (int32_t, uint8_t)                                                  \
+  RUNCTZ64N (int32_t, int64_t)                                                  \
+  RUNCTZ64N (int32_t, int32_t)                                                  \
+  RUNCTZ64N (int32_t, int16_t)                                                  \
+  RUNCTZ64N (int32_t, int8_t)                                                   \
+  RUNCTZ64 (uint16_t, uint64_t)                                                \
+  RUNCTZ64 (uint16_t, uint32_t)                                                \
+  RUNCTZ64 (uint16_t, uint16_t)                                                \
+  RUNCTZ64 (uint16_t, uint8_t)                                                 \
+  RUNCTZ64 (uint16_t, int64_t)                                                 \
+  RUNCTZ64 (uint16_t, int32_t)                                                 \
+  RUNCTZ64 (uint16_t, int16_t)                                                 \
+  RUNCTZ64 (uint16_t, int8_t)                                                  \
+  RUNCTZ64N (int16_t, uint64_t)                                                \
+  RUNCTZ64N (int16_t, uint32_t)                                                \
+  RUNCTZ64N (int16_t, uint16_t)                                                \
+  RUNCTZ64N (int16_t, uint8_t)                                                 \
+  RUNCTZ64N (int16_t, int64_t)                                                 \
+  RUNCTZ64N (int16_t, int32_t)                                                 \
+  RUNCTZ64N (int16_t, int16_t)                                                 \
+  RUNCTZ64N (int16_t, int8_t)                                                  \
+  RUNCTZ64 (uint8_t, uint64_t)                                                 \
+  RUNCTZ64 (uint8_t, uint32_t)                                                 \
+  RUNCTZ64 (uint8_t, uint16_t)                                                 \
+  RUNCTZ64 (uint8_t, uint8_t)                                                  \
+  RUNCTZ64 (uint8_t, int64_t)                                                  \
+  RUNCTZ64 (uint8_t, int32_t)                                                  \
+  RUNCTZ64 (uint8_t, int16_t)                                                  \
+  RUNCTZ64 (uint8_t, int8_t)                                                   \
+  RUNCTZ64N (int8_t, uint64_t)                                                 \
+  RUNCTZ64N (int8_t, uint32_t)                                                 \
+  RUNCTZ64N (int8_t, uint16_t)                                                 \
+  RUNCTZ64N (int8_t, uint8_t)                                                  \
+  RUNCTZ64N (int8_t, int64_t)                                                  \
+  RUNCTZ64N (int8_t, int32_t)                                                  \
+  RUNCTZ64N (int8_t, int16_t)                                                  \
+  RUNCTZ64N (int8_t, int8_t)                                                   \
+  RUNCTZ32 (uint64_t, uint64_t)                                                \
+  RUNCTZ32 (uint64_t, uint32_t)                                                \
+  RUNCTZ32 (uint64_t, uint16_t)                                                \
+  RUNCTZ32 (uint64_t, uint8_t)                                                 \
+  RUNCTZ32 (uint64_t, int64_t)                                                 \
+  RUNCTZ32 (uint64_t, int32_t)                                                 \
+  RUNCTZ32 (uint64_t, int16_t)                                                 \
+  RUNCTZ32 (uint64_t, int8_t)                                                  \
+  RUNCTZ32N (int64_t, uint64_t)                                                \
+  RUNCTZ32N (int64_t, uint32_t)                                                \
+  RUNCTZ32N (int64_t, uint16_t)                                                \
+  RUNCTZ32N (int64_t, uint8_t)                                                 \
+  RUNCTZ32N (int64_t, int64_t)                                                 \
+  RUNCTZ32N (int64_t, int32_t)                                                 \
+  RUNCTZ32N (int64_t, int16_t)                                                 \
+  RUNCTZ32N (int64_t, int8_t)                                                  \
+  RUNCTZ32 (uint32_t, uint64_t)                                                \
+  RUNCTZ32 (uint32_t, uint32_t)                                                \
+  RUNCTZ32 (uint32_t, uint16_t)                                                \
+  RUNCTZ32 (uint32_t, uint8_t)                                                 \
+  RUNCTZ32 (uint32_t, int64_t)                                                 \
+  RUNCTZ32 (uint32_t, int32_t)                                                 \
+  RUNCTZ32 (uint32_t, int16_t)                                                 \
+  RUNCTZ32 (uint32_t, int8_t)                                                  \
+  RUNCTZ32N (int32_t, uint64_t)                                                \
+  RUNCTZ32N (int32_t, uint32_t)                                                \
+  RUNCTZ32N (int32_t, uint16_t)                                                \
+  RUNCTZ32N (int32_t, uint8_t)                                                 \
+  RUNCTZ32N (int32_t, int64_t)                                                 \
+  RUNCTZ32N (int32_t, int32_t)                                                 \
+  RUNCTZ32N (int32_t, int16_t)                                                 \
+  RUNCTZ32N (int32_t, int8_t)                                                  \
+  RUNCTZ32 (uint16_t, uint64_t)                                                \
+  RUNCTZ32 (uint16_t, uint32_t)                                                \
+  RUNCTZ32 (uint16_t, uint16_t)                                                \
+  RUNCTZ32 (uint16_t, uint8_t)                                                 \
+  RUNCTZ32 (uint16_t, int64_t)                                                 \
+  RUNCTZ32 (uint16_t, int32_t)                                                 \
+  RUNCTZ32 (uint16_t, int16_t)                                                 \
+  RUNCTZ32 (uint16_t, int8_t)                                                  \
+  RUNCTZ32N (int16_t, uint64_t)                                                \
+  RUNCTZ32N (int16_t, uint32_t)                                                \
+  RUNCTZ32N (int16_t, uint16_t)                                                \
+  RUNCTZ32N (int16_t, uint8_t)                                                 \
+  RUNCTZ32N (int16_t, int64_t)                                                 \
+  RUNCTZ32N (int16_t, int32_t)                                                 \
+  RUNCTZ32N (int16_t, int16_t)                                                 \
+  RUNCTZ32N (int16_t, int8_t)                                                  \
+  RUNCTZ32 (uint8_t, uint64_t)                                                 \
+  RUNCTZ32 (uint8_t, uint32_t)                                                 \
+  RUNCTZ32 (uint8_t, uint16_t)                                                 \
+  RUNCTZ32 (uint8_t, uint8_t)                                                  \
+  RUNCTZ32 (uint8_t, int64_t)                                                  \
+  RUNCTZ32 (uint8_t, int32_t)                                                  \
+  RUNCTZ32 (uint8_t, int16_t)                                                  \
+  RUNCTZ32 (uint8_t, int8_t)                                                   \
+  RUNCTZ32N (int8_t, uint64_t)                                                 \
+  RUNCTZ32N (int8_t, uint32_t)                                                 \
+  RUNCTZ32N (int8_t, uint16_t)                                                 \
+  RUNCTZ32N (int8_t, uint8_t)                                                  \
+  RUNCTZ32N (int8_t, int64_t)                                                  \
+  RUNCTZ32N (int8_t, int32_t)                                                  \
+  RUNCTZ32N (int8_t, int16_t)                                                  \
+  RUNCTZ32N (int8_t, int8_t)                                                   \
+  RUNFFS64 (uint64_t, uint64_t)                                                \
+  RUNFFS64 (uint64_t, uint32_t)                                                \
+  RUNFFS64 (uint64_t, uint16_t)                                                \
+  RUNFFS64 (uint64_t, uint8_t)                                                 \
+  RUNFFS64 (uint64_t, int64_t)                                                 \
+  RUNFFS64 (uint64_t, int32_t)                                                 \
+  RUNFFS64 (uint64_t, int16_t)                                                 \
+  RUNFFS64 (uint64_t, int8_t)                                                  \
+  RUNFFS64N (int64_t, uint64_t)                                                \
+  RUNFFS64N (int64_t, uint32_t)                                                \
+  RUNFFS64N (int64_t, uint16_t)                                                \
+  RUNFFS64N (int64_t, uint8_t)                                                 \
+  RUNFFS64N (int64_t, int64_t)                                                 \
+  RUNFFS64N (int64_t, int32_t)                                                 \
+  RUNFFS64N (int64_t, int16_t)                                                 \
+  RUNFFS64N (int64_t, int8_t)                                                  \
+  RUNFFS64 (uint32_t, uint64_t)                                                \
+  RUNFFS64 (uint32_t, uint32_t)                                                \
+  RUNFFS64 (uint32_t, uint16_t)                                                \
+  RUNFFS64 (uint32_t, uint8_t)                                                 \
+  RUNFFS64 (uint32_t, int64_t)                                                 \
+  RUNFFS64 (uint32_t, int32_t)                                                 \
+  RUNFFS64 (uint32_t, int16_t)                                                 \
+  RUNFFS64 (uint32_t, int8_t)                                                  \
+  RUNFFS64N (int32_t, uint64_t)                                                \
+  RUNFFS64N (int32_t, uint32_t)                                                \
+  RUNFFS64N (int32_t, uint16_t)                                                \
+  RUNFFS64N (int32_t, uint8_t)                                                 \
+  RUNFFS64N (int32_t, int64_t)                                                 \
+  RUNFFS64N (int32_t, int32_t)                                                 \
+  RUNFFS64N (int32_t, int16_t)                                                 \
+  RUNFFS64N (int32_t, int8_t)                                                  \
+  RUNFFS64 (uint16_t, uint64_t)                                                \
+  RUNFFS64 (uint16_t, uint32_t)                                                \
+  RUNFFS64 (uint16_t, uint16_t)                                                \
+  RUNFFS64 (uint16_t, uint8_t)                                                 \
+  RUNFFS64 (uint16_t, int64_t)                                                 \
+  RUNFFS64 (uint16_t, int32_t)                                                 \
+  RUNFFS64 (uint16_t, int16_t)                                                 \
+  RUNFFS64 (uint16_t, int8_t)                                                  \
+  RUNFFS64N (int16_t, uint64_t)                                                \
+  RUNFFS64N (int16_t, uint32_t)                                                \
+  RUNFFS64N (int16_t, uint16_t)                                                \
+  RUNFFS64N (int16_t, uint8_t)                                                 \
+  RUNFFS64N (int16_t, int64_t)                                                 \
+  RUNFFS64N (int16_t, int32_t)                                                 \
+  RUNFFS64N (int16_t, int16_t)                                                 \
+  RUNFFS64N (int16_t, int8_t)                                                  \
+  RUNFFS64 (uint8_t, uint64_t)                                                 \
+  RUNFFS64 (uint8_t, uint32_t)                                                 \
+  RUNFFS64 (uint8_t, uint16_t)                                                 \
+  RUNFFS64 (uint8_t, uint8_t)                                                  \
+  RUNFFS64 (uint8_t, int64_t)                                                  \
+  RUNFFS64 (uint8_t, int32_t)                                                  \
+  RUNFFS64 (uint8_t, int16_t)                                                  \
+  RUNFFS64 (uint8_t, int8_t)                                                   \
+  RUNFFS64N (int8_t, uint64_t)                                                 \
+  RUNFFS64N (int8_t, uint32_t)                                                 \
+  RUNFFS64N (int8_t, uint16_t)                                                 \
+  RUNFFS64N (int8_t, uint8_t)                                                  \
+  RUNFFS64N (int8_t, int64_t)                                                  \
+  RUNFFS64N (int8_t, int32_t)                                                  \
+  RUNFFS64N (int8_t, int16_t)                                                  \
+  RUNFFS64N (int8_t, int8_t)                                                   \
+  RUNFFS32 (uint64_t, uint64_t)                                                \
+  RUNFFS32 (uint64_t, uint32_t)                                                \
+  RUNFFS32 (uint64_t, uint16_t)                                                \
+  RUNFFS32 (uint64_t, uint8_t)                                                 \
+  RUNFFS32 (uint64_t, int64_t)                                                 \
+  RUNFFS32 (uint64_t, int32_t)                                                 \
+  RUNFFS32 (uint64_t, int16_t)                                                 \
+  RUNFFS32 (uint64_t, int8_t)                                                  \
+  RUNFFS32N (int64_t, uint64_t)                                                \
+  RUNFFS32N (int64_t, uint32_t)                                                \
+  RUNFFS32N (int64_t, uint16_t)                                                \
+  RUNFFS32N (int64_t, uint8_t)                                                 \
+  RUNFFS32N (int64_t, int64_t)                                                 \
+  RUNFFS32N (int64_t, int32_t)                                                 \
+  RUNFFS32N (int64_t, int16_t)                                                 \
+  RUNFFS32N (int64_t, int8_t)                                                  \
+  RUNFFS32 (uint32_t, uint64_t)                                                \
+  RUNFFS32 (uint32_t, uint32_t)                                                \
+  RUNFFS32 (uint32_t, uint16_t)                                                \
+  RUNFFS32 (uint32_t, uint8_t)                                                 \
+  RUNFFS32 (uint32_t, int64_t)                                                 \
+  RUNFFS32 (uint32_t, int32_t)                                                 \
+  RUNFFS32 (uint32_t, int16_t)                                                 \
+  RUNFFS32 (uint32_t, int8_t)                                                  \
+  RUNFFS32N (int32_t, uint64_t)                                                \
+  RUNFFS32N (int32_t, uint32_t)                                                \
+  RUNFFS32N (int32_t, uint16_t)                                                \
+  RUNFFS32N (int32_t, uint8_t)                                                 \
+  RUNFFS32N (int32_t, int64_t)                                                 \
+  RUNFFS32N (int32_t, int32_t)                                                 \
+  RUNFFS32N (int32_t, int16_t)                                                 \
+  RUNFFS32N (int32_t, int8_t)                                                  \
+  RUNFFS32 (uint16_t, uint64_t)                                                \
+  RUNFFS32 (uint16_t, uint32_t)                                                \
+  RUNFFS32 (uint16_t, uint16_t)                                                \
+  RUNFFS32 (uint16_t, uint8_t)                                                 \
+  RUNFFS32 (uint16_t, int64_t)                                                 \
+  RUNFFS32 (uint16_t, int32_t)                                                 \
+  RUNFFS32 (uint16_t, int16_t)                                                 \
+  RUNFFS32 (uint16_t, int8_t)                                                  \
+  RUNFFS32N (int16_t, uint64_t)                                                \
+  RUNFFS32N (int16_t, uint32_t)                                                \
+  RUNFFS32N (int16_t, uint16_t)                                                \
+  RUNFFS32N (int16_t, uint8_t)                                                 \
+  RUNFFS32N (int16_t, int64_t)                                                 \
+  RUNFFS32N (int16_t, int32_t)                                                 \
+  RUNFFS32N (int16_t, int16_t)                                                 \
+  RUNFFS32N (int16_t, int8_t)                                                  \
+  RUNFFS32 (uint8_t, uint64_t)                                                 \
+  RUNFFS32 (uint8_t, uint32_t)                                                 \
+  RUNFFS32 (uint8_t, uint16_t)                                                 \
+  RUNFFS32 (uint8_t, uint8_t)                                                  \
+  RUNFFS32 (uint8_t, int64_t)                                                  \
+  RUNFFS32 (uint8_t, int32_t)                                                  \
+  RUNFFS32 (uint8_t, int16_t)                                                  \
+  RUNFFS32 (uint8_t, int8_t)                                                   \
+  RUNFFS32N (int8_t, uint64_t)                                                 \
+  RUNFFS32N (int8_t, uint32_t)                                                 \
+  RUNFFS32N (int8_t, uint16_t)                                                 \
+  RUNFFS32N (int8_t, uint8_t)                                                  \
+  RUNFFS32N (int8_t, int64_t)                                                  \
+  RUNFFS32N (int8_t, int32_t)                                                  \
+  RUNFFS32N (int8_t, int16_t)                                                  \
+  RUNFFS32N (int8_t, int8_t)
+
+int
+main ()
+{
+  RUN_ALL ()
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 229 "vect" } } */
--
2.41.0
 
 


* Re: [PATCH] RISC-V: Add popcount fallback expander.
  2023-10-18 11:43   ` Robin Dapp
  2023-10-18 11:48     ` juzhe.zhong
  2023-10-18 12:22     ` juzhe.zhong
@ 2023-10-18 13:51     ` Robin Dapp
  2023-10-18 14:08       ` 钟居哲
  2 siblings, 1 reply; 10+ messages in thread
From: Robin Dapp @ 2023-10-18 13:51 UTC (permalink / raw)
  To: juzhe.zhong, gcc-patches, palmer, kito.cheng, jeffreyalaw; +Cc: rdapp.gcc

I didn't push this yet because it would have introduced an UNRESOLVED that
my summary script didn't catch.  Normally I go with just contrib/test_summary
but that only filters out FAIL and XPASS.  I should really be using
compare_testsuite_log.py from riscv-gnu-toolchain/scripts.
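
As an illustration of the gap described above, here is a minimal sketch in C of a filter that treats UNRESOLVED like FAIL and XPASS when scanning a DejaGnu .sum file.  The helper names (is_bad_result, count_bad_results) are made up for this example, and it is not how contrib/test_summary or compare_testsuite_log.py are implemented.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Result prefixes treated as problems.  FAIL and XPASS are the two
   mentioned above; UNRESOLVED is the one that slipped through.  */
static const char *const bad_prefixes[]
  = { "FAIL: ", "XPASS: ", "UNRESOLVED: " };

/* Return true if LINE starts with one of the prefixes above.
   "XFAIL: " and "PASS: " do not match because the comparison is
   anchored at the start of the line.  */
static bool
is_bad_result (const char *line)
{
  size_t n = sizeof bad_prefixes / sizeof bad_prefixes[0];
  for (size_t i = 0; i < n; i++)
    if (strncmp (line, bad_prefixes[i], strlen (bad_prefixes[i])) == 0)
      return true;
  return false;
}

/* Count problem lines in the stream FP.  Assumes no line is longer
   than the buffer; longer lines would be read in several pieces.  */
static unsigned
count_bad_results (FILE *fp)
{
  char buf[4096];
  unsigned count = 0;
  while (fgets (buf, sizeof buf, fp))
    count += is_bad_result (buf);
  return count;
}
```

A non-zero count from count_bad_results on the new .sum file would then flag the run for a closer look, including the UNRESOLVED case.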

It was caused by a typo:

-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp2" } } */
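
For context, a sketch of a test where the dump name matters.  The options line and the loop are assumptions for illustration and are not copied from the patch.  As far as I understand, the last argument of scan-tree-dump-times has to match the suffix of an existing dump file; the basic-block SLP dumps carry a pass number (e.g. "slp2"), so a bare "slp" finds no file and the check ends up UNRESOLVED rather than FAIL.

```c
/* { dg-do compile } */
/* Hypothetical options line; the flags in the real test may differ.  */
/* { dg-additional-options "-O3 -fdump-tree-slp-details" } */

int x[8];
int y[8];

/* Fixed-length loop over small arrays, a candidate for SLP
   vectorization of the popcount calls.  */
void
foo (void)
{
  for (int i = 0; i < 8; i++)
    x[i] = __builtin_popcount (y[i]);
}

/* The last argument names the dump to scan.  */
/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp2" } } */
```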

Regards
 Robin


* Re: Re: [PATCH] RISC-V: Add popcount fallback expander.
  2023-10-18 13:51     ` Robin Dapp
@ 2023-10-18 14:08       ` 钟居哲
  2023-10-18 20:25         ` Robin Dapp
  0 siblings, 1 reply; 10+ messages in thread
From: 钟居哲 @ 2023-10-18 14:08 UTC (permalink / raw)
  To: rdapp.gcc, gcc-patches, palmer, kito.cheng, Jeff Law; +Cc: rdapp.gcc


By the way, could you mention this PR:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111791
and add the test from it?

juzhe.zhong@rivai.ai

From: Robin Dapp
Date: 2023-10-18 21:51
To: juzhe.zhong@rivai.ai; gcc-patches; palmer; kito.cheng; jeffreyalaw
CC: rdapp.gcc
Subject: Re: [PATCH] RISC-V: Add popcount fallback expander.

I didn't push this yet because it would have introduced an UNRESOLVED that
my summary script didn't catch.  Normally I go with just contrib/test_summary
but that only filters out FAIL and XPASS.  I should really be using
compare_testsuite_log.py from riscv-gnu-toolchain/scripts.

It was caused by a typo:

-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp2" } } */

Regards
Robin


* Re: [PATCH] RISC-V: Add popcount fallback expander.
  2023-10-18 14:08       ` 钟居哲
@ 2023-10-18 20:25         ` Robin Dapp
  0 siblings, 0 replies; 10+ messages in thread
From: Robin Dapp @ 2023-10-18 20:25 UTC (permalink / raw)
  To: 钟居哲, gcc-patches, palmer, kito.cheng, Jeff Law
  Cc: rdapp.gcc

> By the way, could you mention this PR:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111791
> and add the test from it?

Commented in that PR.  This patch does not help there.

Regards
 Robin



Thread overview: 10+ messages
2023-10-18  9:20 [PATCH] RISC-V: Add popcount fallback expander Robin Dapp
2023-10-18  9:28 ` juzhe.zhong
2023-10-18  9:32   ` Robin Dapp
2023-10-18 11:43   ` Robin Dapp
2023-10-18 11:48     ` juzhe.zhong
2023-10-18 12:22     ` juzhe.zhong
2023-10-18 13:51     ` Robin Dapp
2023-10-18 14:08       ` 钟居哲
2023-10-18 20:25         ` Robin Dapp
     [not found] ` <202310181728104086621@rivai.ai>
2023-10-18  9:30   ` juzhe.zhong
