* [PATCH] [PR96339] AArch64: Optimise svlast[ab] @ 2023-03-16 11:39 Tejas Belagod 2023-05-04 5:43 ` Tejas Belagod 2023-05-11 19:32 ` Richard Sandiford 0 siblings, 2 replies; 10+ messages in thread From: Tejas Belagod @ 2023-03-16 11:39 UTC (permalink / raw) To: gcc-patches; +Cc: Tejas Belagod, richard.sandiford From: Tejas Belagod <tbelagod@arm.com> This PR optimizes SVE intrinsic sequences such as svlasta (svptrue_pat_b8 (SV_VL1), x), where a scalar is selected from a variable vector based on a constant predicate. Such a sequence is optimized to return the corresponding element of a NEON vector. For example, svlasta (svptrue_pat_b8 (SV_VL1), x) returns umov w0, v0.b[1] Likewise, svlastb (svptrue_pat_b8 (SV_VL1), x) returns umov w0, v0.b[0] This optimization only applies when the constant predicate maps to a range that is within the bounds of a 128-bit NEON register. gcc/ChangeLog: PR target/96339 * config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold SVE calls that have a constant input predicate vector. (svlast_impl::is_lasta): Query to check if intrinsic is svlasta. (svlast_impl::is_lastb): Query to check if intrinsic is svlastb. (svlast_impl::vect_all_same): Check if all vector elements are equal. gcc/testsuite/ChangeLog: PR target/96339 * gcc.target/aarch64/sve/acle/general-c/svlast.c: New. * gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New. * gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New. * gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm to expect optimized code for function body. * gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise. 
* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise. --- .../aarch64/aarch64-sve-builtins-base.cc | 124 +++++++ .../aarch64/sve/acle/general-c/svlast.c | 63 ++++ .../sve/acle/general-c/svlast128_run.c | 313 +++++++++++++++++ .../sve/acle/general-c/svlast256_run.c | 314 ++++++++++++++++++ .../gcc.target/aarch64/sve/pcs/return_4.c | 2 - .../aarch64/sve/pcs/return_4_1024.c | 2 - .../gcc.target/aarch64/sve/pcs/return_4_128.c | 2 - .../aarch64/sve/pcs/return_4_2048.c | 2 - .../gcc.target/aarch64/sve/pcs/return_4_256.c | 2 - .../gcc.target/aarch64/sve/pcs/return_4_512.c | 2 - .../gcc.target/aarch64/sve/pcs/return_5.c | 2 - .../aarch64/sve/pcs/return_5_1024.c | 2 - .../gcc.target/aarch64/sve/pcs/return_5_128.c | 2 - .../aarch64/sve/pcs/return_5_2048.c | 2 - .../gcc.target/aarch64/sve/pcs/return_5_256.c | 2 - .../gcc.target/aarch64/sve/pcs/return_5_512.c | 2 - 16 files changed, 814 insertions(+), 24 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index cd9cace3c9b..db2b4dcaac9 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base> public: CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {} + bool is_lasta () const { return m_unspec == UNSPEC_LASTA; } + bool is_lastb () const { return m_unspec == UNSPEC_LASTB; } + + bool 
vect_all_same (tree v, int step) const + { + int i; + int nelts = vector_cst_encoded_nelts (v); + int first_el = 0; + + for (i = first_el; i < nelts; i += step) + if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el)) + return false; + + return true; + } + + /* Fold an svlast{a/b} call with constant predicate to a BIT_FIELD_REF. + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure + the index of the element being accessed is in the range of a NEON vector + width. */ + gimple *fold (gimple_folder &f) const override + { + tree pred = gimple_call_arg (f.call, 0); + tree val = gimple_call_arg (f.call, 1); + + if (TREE_CODE (pred) == VECTOR_CST) + { + HOST_WIDE_INT pos; + unsigned int const_vg; + int i = 0; + int step = f.type_suffix (0).element_bytes; + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred)); + int npats = VECTOR_CST_NPATTERNS (pred); + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred); + tree b = NULL_TREE; + bool const_vl = aarch64_sve_vg.is_constant (&const_vg); + + /* We can optimize 2 cases common to variable and fixed-length cases + without a linear search of the predicate vector: + 1. LASTA if the predicate is all true, return element 0. + 2. LASTA if the predicate is all false, return element 0. */ + if (is_lasta () && vect_all_same (pred, step_1)) + { + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0)); + return gimple_build_assign (f.lhs, b); + } + + /* Handle the all-false case for LASTB where SVE VL == 128b - + return the highest-numbered element. 
*/ + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16) + && vect_all_same (pred, step_1) + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0))) + { + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, + bitsize_int (step * BITS_PER_UNIT), + bitsize_int ((16 - step) * BITS_PER_UNIT)); + + return gimple_build_assign (f.lhs, b); + } + + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of + 'step_1' in + [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1] + is zero, then we can treat the vector as VECTOR_CST_NPATTERNS + elements followed by all inactive elements. */ + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2) + for (i = npats; i < nelts; i += step_1) + { + /* If there are active elements in the repeated pattern of + a variable-length vector, then return NULL as there is no way + to be sure statically if this falls within the NEON range. */ + if (!integer_zerop (VECTOR_CST_ENCODED_ELT (pred, i))) + return NULL; + } + + /* If we're here, it means either: + 1. The vector is variable-length and there's no active element in the + repeated part of the pattern, or + 2. The vector is fixed-length. + Fall through to a linear search. */ + + /* Restrict the scope of the search to NPATS if the vector is + variable-length. */ + if (!VECTOR_CST_NELTS (pred).is_constant (&nelts)) + nelts = npats; + + /* Fall through to finding the last active element linearly + for all cases where the last active element is known to be + within a statically-determinable range. */ + i = MAX ((int)nelts - step, 0); + for (; i >= 0; i -= step) + if (!integer_zerop (VECTOR_CST_ELT (pred, i))) + break; + + if (is_lastb ()) + { + /* For LASTB, the element is the last active element. */ + pos = i; + } + else + { + /* For LASTA, the element is one after the last active element. */ + pos = i + step; + + /* If the last active element is the last element, wrap around + and return the first NEON element. */ + if (known_ge (pos, BYTES_PER_SVE_VECTOR)) + pos = 0; + } + + /* Out of NEON range. 
*/ + if (pos < 0 || pos > 15) + return NULL; + + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, + bitsize_int (step * BITS_PER_UNIT), + bitsize_int (pos * BITS_PER_UNIT)); + + return gimple_build_assign (f.lhs, b); + } + return NULL; + } + rtx expand (function_expander &e) const override { diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c new file mode 100644 index 00000000000..fdbe5e309af --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c @@ -0,0 +1,63 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -msve-vector-bits=256" } */ + +#include <stdint.h> +#include "arm_sve.h" + +#define NAME(name, size, pat, sign, ab) \ + name ## _ ## size ## _ ## pat ## _ ## sign ## _ ## ab + +#define NAMEF(name, size, pat, sign, ab) \ + name ## _ ## size ## _ ## pat ## _ ## sign ## _ ## ab ## _false + +#define SVTYPE(size, sign) \ + sv ## sign ## int ## size ## _t + +#define STYPE(size, sign) sign ## int ## size ##_t + +#define SVELAST_DEF(size, pat, sign, ab, su) \ + STYPE (size, sign) __attribute__((noinline)) \ + NAME (foo, size, pat, sign, ab) (SVTYPE (size, sign) x) \ + { \ + return svlast ## ab (svptrue_pat_b ## size (pat), x); \ + } \ + STYPE (size, sign) __attribute__((noinline)) \ + NAMEF (foo, size, pat, sign, ab) (SVTYPE (size, sign) x) \ + { \ + return svlast ## ab (svpfalse (), x); \ + } + +#define ALL_PATS(SIZE, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL1, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL2, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL3, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL4, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL5, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL6, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL7, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL8, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL16, SIGN, AB, SU) + +#define ALL_SIGN(SIZE, AB) \ + ALL_PATS (SIZE, , AB, s) \ + ALL_PATS (SIZE, u, AB, u) + +#define ALL_SIZE(AB) \ + ALL_SIGN 
(8, AB) \ + ALL_SIGN (16, AB) \ + ALL_SIGN (32, AB) \ + ALL_SIGN (64, AB) + +#define ALL_POS() \ + ALL_SIZE (a) \ + ALL_SIZE (b) + + +ALL_POS() + +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b} 52 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h} 50 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s} 12 } } */ +/* { dg-final { scan-assembler-times {\tfmov\tw0, s0} 24 } } */ +/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d} 4 } } */ +/* { dg-final { scan-assembler-times {\tfmov\tx0, d0} 32 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c new file mode 100644 index 00000000000..5e1e9303d7b --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c @@ -0,0 +1,313 @@ +/* { dg-do run { target aarch64_sve128_hw } } */ +/* { dg-options "-O3 -msve-vector-bits=128 -std=gnu99" } */ + +#include "svlast.c" + +int +main (void) +{ + int8_t res_8_SV_VL1__a = 1; + int8_t res_8_SV_VL2__a = 2; + int8_t res_8_SV_VL3__a = 3; + int8_t res_8_SV_VL4__a = 4; + int8_t res_8_SV_VL5__a = 5; + int8_t res_8_SV_VL6__a = 6; + int8_t res_8_SV_VL7__a = 7; + int8_t res_8_SV_VL8__a = 8; + int8_t res_8_SV_VL16__a = 0; + uint8_t res_8_SV_VL1_u_a = 1; + uint8_t res_8_SV_VL2_u_a = 2; + uint8_t res_8_SV_VL3_u_a = 3; + uint8_t res_8_SV_VL4_u_a = 4; + uint8_t res_8_SV_VL5_u_a = 5; + uint8_t res_8_SV_VL6_u_a = 6; + uint8_t res_8_SV_VL7_u_a = 7; + uint8_t res_8_SV_VL8_u_a = 8; + uint8_t res_8_SV_VL16_u_a = 0; + int16_t res_16_SV_VL1__a = 1; + int16_t res_16_SV_VL2__a = 2; + int16_t res_16_SV_VL3__a = 3; + int16_t res_16_SV_VL4__a = 4; + int16_t res_16_SV_VL5__a = 5; + int16_t res_16_SV_VL6__a = 6; + int16_t res_16_SV_VL7__a = 7; + int16_t res_16_SV_VL8__a = 0; + int16_t res_16_SV_VL16__a = 0; + uint16_t res_16_SV_VL1_u_a = 1; + uint16_t res_16_SV_VL2_u_a = 2; + uint16_t res_16_SV_VL3_u_a 
= 3; + uint16_t res_16_SV_VL4_u_a = 4; + uint16_t res_16_SV_VL5_u_a = 5; + uint16_t res_16_SV_VL6_u_a = 6; + uint16_t res_16_SV_VL7_u_a = 7; + uint16_t res_16_SV_VL8_u_a = 0; + uint16_t res_16_SV_VL16_u_a = 0; + int32_t res_32_SV_VL1__a = 1; + int32_t res_32_SV_VL2__a = 2; + int32_t res_32_SV_VL3__a = 3; + int32_t res_32_SV_VL4__a = 0; + int32_t res_32_SV_VL5__a = 0; + int32_t res_32_SV_VL6__a = 0; + int32_t res_32_SV_VL7__a = 0; + int32_t res_32_SV_VL8__a = 0; + int32_t res_32_SV_VL16__a = 0; + uint32_t res_32_SV_VL1_u_a = 1; + uint32_t res_32_SV_VL2_u_a = 2; + uint32_t res_32_SV_VL3_u_a = 3; + uint32_t res_32_SV_VL4_u_a = 0; + uint32_t res_32_SV_VL5_u_a = 0; + uint32_t res_32_SV_VL6_u_a = 0; + uint32_t res_32_SV_VL7_u_a = 0; + uint32_t res_32_SV_VL8_u_a = 0; + uint32_t res_32_SV_VL16_u_a = 0; + int64_t res_64_SV_VL1__a = 1; + int64_t res_64_SV_VL2__a = 0; + int64_t res_64_SV_VL3__a = 0; + int64_t res_64_SV_VL4__a = 0; + int64_t res_64_SV_VL5__a = 0; + int64_t res_64_SV_VL6__a = 0; + int64_t res_64_SV_VL7__a = 0; + int64_t res_64_SV_VL8__a = 0; + int64_t res_64_SV_VL16__a = 0; + uint64_t res_64_SV_VL1_u_a = 1; + uint64_t res_64_SV_VL2_u_a = 0; + uint64_t res_64_SV_VL3_u_a = 0; + uint64_t res_64_SV_VL4_u_a = 0; + uint64_t res_64_SV_VL5_u_a = 0; + uint64_t res_64_SV_VL6_u_a = 0; + uint64_t res_64_SV_VL7_u_a = 0; + uint64_t res_64_SV_VL8_u_a = 0; + uint64_t res_64_SV_VL16_u_a = 0; + int8_t res_8_SV_VL1__b = 0; + int8_t res_8_SV_VL2__b = 1; + int8_t res_8_SV_VL3__b = 2; + int8_t res_8_SV_VL4__b = 3; + int8_t res_8_SV_VL5__b = 4; + int8_t res_8_SV_VL6__b = 5; + int8_t res_8_SV_VL7__b = 6; + int8_t res_8_SV_VL8__b = 7; + int8_t res_8_SV_VL16__b = 15; + uint8_t res_8_SV_VL1_u_b = 0; + uint8_t res_8_SV_VL2_u_b = 1; + uint8_t res_8_SV_VL3_u_b = 2; + uint8_t res_8_SV_VL4_u_b = 3; + uint8_t res_8_SV_VL5_u_b = 4; + uint8_t res_8_SV_VL6_u_b = 5; + uint8_t res_8_SV_VL7_u_b = 6; + uint8_t res_8_SV_VL8_u_b = 7; + uint8_t res_8_SV_VL16_u_b = 15; + int16_t res_16_SV_VL1__b = 0; + 
int16_t res_16_SV_VL2__b = 1; + int16_t res_16_SV_VL3__b = 2; + int16_t res_16_SV_VL4__b = 3; + int16_t res_16_SV_VL5__b = 4; + int16_t res_16_SV_VL6__b = 5; + int16_t res_16_SV_VL7__b = 6; + int16_t res_16_SV_VL8__b = 7; + int16_t res_16_SV_VL16__b = 7; + uint16_t res_16_SV_VL1_u_b = 0; + uint16_t res_16_SV_VL2_u_b = 1; + uint16_t res_16_SV_VL3_u_b = 2; + uint16_t res_16_SV_VL4_u_b = 3; + uint16_t res_16_SV_VL5_u_b = 4; + uint16_t res_16_SV_VL6_u_b = 5; + uint16_t res_16_SV_VL7_u_b = 6; + uint16_t res_16_SV_VL8_u_b = 7; + uint16_t res_16_SV_VL16_u_b = 7; + int32_t res_32_SV_VL1__b = 0; + int32_t res_32_SV_VL2__b = 1; + int32_t res_32_SV_VL3__b = 2; + int32_t res_32_SV_VL4__b = 3; + int32_t res_32_SV_VL5__b = 3; + int32_t res_32_SV_VL6__b = 3; + int32_t res_32_SV_VL7__b = 3; + int32_t res_32_SV_VL8__b = 3; + int32_t res_32_SV_VL16__b = 3; + uint32_t res_32_SV_VL1_u_b = 0; + uint32_t res_32_SV_VL2_u_b = 1; + uint32_t res_32_SV_VL3_u_b = 2; + uint32_t res_32_SV_VL4_u_b = 3; + uint32_t res_32_SV_VL5_u_b = 3; + uint32_t res_32_SV_VL6_u_b = 3; + uint32_t res_32_SV_VL7_u_b = 3; + uint32_t res_32_SV_VL8_u_b = 3; + uint32_t res_32_SV_VL16_u_b = 3; + int64_t res_64_SV_VL1__b = 0; + int64_t res_64_SV_VL2__b = 1; + int64_t res_64_SV_VL3__b = 1; + int64_t res_64_SV_VL4__b = 1; + int64_t res_64_SV_VL5__b = 1; + int64_t res_64_SV_VL6__b = 1; + int64_t res_64_SV_VL7__b = 1; + int64_t res_64_SV_VL8__b = 1; + int64_t res_64_SV_VL16__b = 1; + uint64_t res_64_SV_VL1_u_b = 0; + uint64_t res_64_SV_VL2_u_b = 1; + uint64_t res_64_SV_VL3_u_b = 1; + uint64_t res_64_SV_VL4_u_b = 1; + uint64_t res_64_SV_VL5_u_b = 1; + uint64_t res_64_SV_VL6_u_b = 1; + uint64_t res_64_SV_VL7_u_b = 1; + uint64_t res_64_SV_VL8_u_b = 1; + uint64_t res_64_SV_VL16_u_b = 1; + + int8_t res_8_SV_VL1__a_false = 0; + int8_t res_8_SV_VL2__a_false = 0; + int8_t res_8_SV_VL3__a_false = 0; + int8_t res_8_SV_VL4__a_false = 0; + int8_t res_8_SV_VL5__a_false = 0; + int8_t res_8_SV_VL6__a_false = 0; + int8_t 
res_8_SV_VL7__a_false = 0; + int8_t res_8_SV_VL8__a_false = 0; + int8_t res_8_SV_VL16__a_false = 0; + uint8_t res_8_SV_VL1_u_a_false = 0; + uint8_t res_8_SV_VL2_u_a_false = 0; + uint8_t res_8_SV_VL3_u_a_false = 0; + uint8_t res_8_SV_VL4_u_a_false = 0; + uint8_t res_8_SV_VL5_u_a_false = 0; + uint8_t res_8_SV_VL6_u_a_false = 0; + uint8_t res_8_SV_VL7_u_a_false = 0; + uint8_t res_8_SV_VL8_u_a_false = 0; + uint8_t res_8_SV_VL16_u_a_false = 0; + int16_t res_16_SV_VL1__a_false = 0; + int16_t res_16_SV_VL2__a_false = 0; + int16_t res_16_SV_VL3__a_false = 0; + int16_t res_16_SV_VL4__a_false = 0; + int16_t res_16_SV_VL5__a_false = 0; + int16_t res_16_SV_VL6__a_false = 0; + int16_t res_16_SV_VL7__a_false = 0; + int16_t res_16_SV_VL8__a_false = 0; + int16_t res_16_SV_VL16__a_false = 0; + uint16_t res_16_SV_VL1_u_a_false = 0; + uint16_t res_16_SV_VL2_u_a_false = 0; + uint16_t res_16_SV_VL3_u_a_false = 0; + uint16_t res_16_SV_VL4_u_a_false = 0; + uint16_t res_16_SV_VL5_u_a_false = 0; + uint16_t res_16_SV_VL6_u_a_false = 0; + uint16_t res_16_SV_VL7_u_a_false = 0; + uint16_t res_16_SV_VL8_u_a_false = 0; + uint16_t res_16_SV_VL16_u_a_false = 0; + int32_t res_32_SV_VL1__a_false = 0; + int32_t res_32_SV_VL2__a_false = 0; + int32_t res_32_SV_VL3__a_false = 0; + int32_t res_32_SV_VL4__a_false = 0; + int32_t res_32_SV_VL5__a_false = 0; + int32_t res_32_SV_VL6__a_false = 0; + int32_t res_32_SV_VL7__a_false = 0; + int32_t res_32_SV_VL8__a_false = 0; + int32_t res_32_SV_VL16__a_false = 0; + uint32_t res_32_SV_VL1_u_a_false = 0; + uint32_t res_32_SV_VL2_u_a_false = 0; + uint32_t res_32_SV_VL3_u_a_false = 0; + uint32_t res_32_SV_VL4_u_a_false = 0; + uint32_t res_32_SV_VL5_u_a_false = 0; + uint32_t res_32_SV_VL6_u_a_false = 0; + uint32_t res_32_SV_VL7_u_a_false = 0; + uint32_t res_32_SV_VL8_u_a_false = 0; + uint32_t res_32_SV_VL16_u_a_false = 0; + int64_t res_64_SV_VL1__a_false = 0; + int64_t res_64_SV_VL2__a_false = 0; + int64_t res_64_SV_VL3__a_false = 0; + int64_t res_64_SV_VL4__a_false = 
0; + int64_t res_64_SV_VL5__a_false = 0; + int64_t res_64_SV_VL6__a_false = 0; + int64_t res_64_SV_VL7__a_false = 0; + int64_t res_64_SV_VL8__a_false = 0; + int64_t res_64_SV_VL16__a_false = 0; + uint64_t res_64_SV_VL1_u_a_false = 0; + uint64_t res_64_SV_VL2_u_a_false = 0; + uint64_t res_64_SV_VL3_u_a_false = 0; + uint64_t res_64_SV_VL4_u_a_false = 0; + uint64_t res_64_SV_VL5_u_a_false = 0; + uint64_t res_64_SV_VL6_u_a_false = 0; + uint64_t res_64_SV_VL7_u_a_false = 0; + uint64_t res_64_SV_VL8_u_a_false = 0; + uint64_t res_64_SV_VL16_u_a_false = 0; + int8_t res_8_SV_VL1__b_false = 15; + int8_t res_8_SV_VL2__b_false = 15; + int8_t res_8_SV_VL3__b_false = 15; + int8_t res_8_SV_VL4__b_false = 15; + int8_t res_8_SV_VL5__b_false = 15; + int8_t res_8_SV_VL6__b_false = 15; + int8_t res_8_SV_VL7__b_false = 15; + int8_t res_8_SV_VL8__b_false = 15; + int8_t res_8_SV_VL16__b_false = 15; + uint8_t res_8_SV_VL1_u_b_false = 15; + uint8_t res_8_SV_VL2_u_b_false = 15; + uint8_t res_8_SV_VL3_u_b_false = 15; + uint8_t res_8_SV_VL4_u_b_false = 15; + uint8_t res_8_SV_VL5_u_b_false = 15; + uint8_t res_8_SV_VL6_u_b_false = 15; + uint8_t res_8_SV_VL7_u_b_false = 15; + uint8_t res_8_SV_VL8_u_b_false = 15; + uint8_t res_8_SV_VL16_u_b_false = 15; + int16_t res_16_SV_VL1__b_false = 7; + int16_t res_16_SV_VL2__b_false = 7; + int16_t res_16_SV_VL3__b_false = 7; + int16_t res_16_SV_VL4__b_false = 7; + int16_t res_16_SV_VL5__b_false = 7; + int16_t res_16_SV_VL6__b_false = 7; + int16_t res_16_SV_VL7__b_false = 7; + int16_t res_16_SV_VL8__b_false = 7; + int16_t res_16_SV_VL16__b_false = 7; + uint16_t res_16_SV_VL1_u_b_false = 7; + uint16_t res_16_SV_VL2_u_b_false = 7; + uint16_t res_16_SV_VL3_u_b_false = 7; + uint16_t res_16_SV_VL4_u_b_false = 7; + uint16_t res_16_SV_VL5_u_b_false = 7; + uint16_t res_16_SV_VL6_u_b_false = 7; + uint16_t res_16_SV_VL7_u_b_false = 7; + uint16_t res_16_SV_VL8_u_b_false = 7; + uint16_t res_16_SV_VL16_u_b_false = 7; + int32_t res_32_SV_VL1__b_false = 3; + int32_t 
res_32_SV_VL2__b_false = 3; + int32_t res_32_SV_VL3__b_false = 3; + int32_t res_32_SV_VL4__b_false = 3; + int32_t res_32_SV_VL5__b_false = 3; + int32_t res_32_SV_VL6__b_false = 3; + int32_t res_32_SV_VL7__b_false = 3; + int32_t res_32_SV_VL8__b_false = 3; + int32_t res_32_SV_VL16__b_false = 3; + uint32_t res_32_SV_VL1_u_b_false = 3; + uint32_t res_32_SV_VL2_u_b_false = 3; + uint32_t res_32_SV_VL3_u_b_false = 3; + uint32_t res_32_SV_VL4_u_b_false = 3; + uint32_t res_32_SV_VL5_u_b_false = 3; + uint32_t res_32_SV_VL6_u_b_false = 3; + uint32_t res_32_SV_VL7_u_b_false = 3; + uint32_t res_32_SV_VL8_u_b_false = 3; + uint32_t res_32_SV_VL16_u_b_false = 3; + int64_t res_64_SV_VL1__b_false = 1; + int64_t res_64_SV_VL2__b_false = 1; + int64_t res_64_SV_VL3__b_false = 1; + int64_t res_64_SV_VL4__b_false = 1; + int64_t res_64_SV_VL5__b_false = 1; + int64_t res_64_SV_VL6__b_false = 1; + int64_t res_64_SV_VL7__b_false = 1; + int64_t res_64_SV_VL8__b_false = 1; + int64_t res_64_SV_VL16__b_false = 1; + uint64_t res_64_SV_VL1_u_b_false = 1; + uint64_t res_64_SV_VL2_u_b_false = 1; + uint64_t res_64_SV_VL3_u_b_false = 1; + uint64_t res_64_SV_VL4_u_b_false = 1; + uint64_t res_64_SV_VL5_u_b_false = 1; + uint64_t res_64_SV_VL6_u_b_false = 1; + uint64_t res_64_SV_VL7_u_b_false = 1; + uint64_t res_64_SV_VL8_u_b_false = 1; + uint64_t res_64_SV_VL16_u_b_false = 1; + +#undef SVELAST_DEF +#define SVELAST_DEF(size, pat, sign, ab, su) \ + if (NAME (foo, size, pat, sign, ab) \ + (svindex_ ## su ## size (0, 1)) != \ + NAME (res, size, pat, sign, ab)) \ + __builtin_abort (); \ + if (NAMEF (foo, size, pat, sign, ab) \ + (svindex_ ## su ## size (0, 1)) != \ + NAMEF (res, size, pat, sign, ab)) \ + __builtin_abort (); + + ALL_POS () + + return 0; +} diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c new file mode 100644 index 00000000000..f6ba7ea7d89 --- /dev/null +++ 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c @@ -0,0 +1,314 @@ +/* { dg-do run { target aarch64_sve256_hw } } */ +/* { dg-options "-O3 -msve-vector-bits=256 -std=gnu99" } */ + +#include "svlast.c" + +int +main (void) +{ + int8_t res_8_SV_VL1__a = 1; + int8_t res_8_SV_VL2__a = 2; + int8_t res_8_SV_VL3__a = 3; + int8_t res_8_SV_VL4__a = 4; + int8_t res_8_SV_VL5__a = 5; + int8_t res_8_SV_VL6__a = 6; + int8_t res_8_SV_VL7__a = 7; + int8_t res_8_SV_VL8__a = 8; + int8_t res_8_SV_VL16__a = 16; + uint8_t res_8_SV_VL1_u_a = 1; + uint8_t res_8_SV_VL2_u_a = 2; + uint8_t res_8_SV_VL3_u_a = 3; + uint8_t res_8_SV_VL4_u_a = 4; + uint8_t res_8_SV_VL5_u_a = 5; + uint8_t res_8_SV_VL6_u_a = 6; + uint8_t res_8_SV_VL7_u_a = 7; + uint8_t res_8_SV_VL8_u_a = 8; + uint8_t res_8_SV_VL16_u_a = 16; + int16_t res_16_SV_VL1__a = 1; + int16_t res_16_SV_VL2__a = 2; + int16_t res_16_SV_VL3__a = 3; + int16_t res_16_SV_VL4__a = 4; + int16_t res_16_SV_VL5__a = 5; + int16_t res_16_SV_VL6__a = 6; + int16_t res_16_SV_VL7__a = 7; + int16_t res_16_SV_VL8__a = 8; + int16_t res_16_SV_VL16__a = 0; + uint16_t res_16_SV_VL1_u_a = 1; + uint16_t res_16_SV_VL2_u_a = 2; + uint16_t res_16_SV_VL3_u_a = 3; + uint16_t res_16_SV_VL4_u_a = 4; + uint16_t res_16_SV_VL5_u_a = 5; + uint16_t res_16_SV_VL6_u_a = 6; + uint16_t res_16_SV_VL7_u_a = 7; + uint16_t res_16_SV_VL8_u_a = 8; + uint16_t res_16_SV_VL16_u_a = 0; + int32_t res_32_SV_VL1__a = 1; + int32_t res_32_SV_VL2__a = 2; + int32_t res_32_SV_VL3__a = 3; + int32_t res_32_SV_VL4__a = 4; + int32_t res_32_SV_VL5__a = 5; + int32_t res_32_SV_VL6__a = 6; + int32_t res_32_SV_VL7__a = 7; + int32_t res_32_SV_VL8__a = 0; + int32_t res_32_SV_VL16__a = 0; + uint32_t res_32_SV_VL1_u_a = 1; + uint32_t res_32_SV_VL2_u_a = 2; + uint32_t res_32_SV_VL3_u_a = 3; + uint32_t res_32_SV_VL4_u_a = 4; + uint32_t res_32_SV_VL5_u_a = 5; + uint32_t res_32_SV_VL6_u_a = 6; + uint32_t res_32_SV_VL7_u_a = 7; + uint32_t res_32_SV_VL8_u_a = 0; + uint32_t res_32_SV_VL16_u_a = 0; 
+ int64_t res_64_SV_VL1__a = 1; + int64_t res_64_SV_VL2__a = 2; + int64_t res_64_SV_VL3__a = 3; + int64_t res_64_SV_VL4__a = 0; + int64_t res_64_SV_VL5__a = 0; + int64_t res_64_SV_VL6__a = 0; + int64_t res_64_SV_VL7__a = 0; + int64_t res_64_SV_VL8__a = 0; + int64_t res_64_SV_VL16__a = 0; + uint64_t res_64_SV_VL1_u_a = 1; + uint64_t res_64_SV_VL2_u_a = 2; + uint64_t res_64_SV_VL3_u_a = 3; + uint64_t res_64_SV_VL4_u_a = 0; + uint64_t res_64_SV_VL5_u_a = 0; + uint64_t res_64_SV_VL6_u_a = 0; + uint64_t res_64_SV_VL7_u_a = 0; + uint64_t res_64_SV_VL8_u_a = 0; + uint64_t res_64_SV_VL16_u_a = 0; + int8_t res_8_SV_VL1__b = 0; + int8_t res_8_SV_VL2__b = 1; + int8_t res_8_SV_VL3__b = 2; + int8_t res_8_SV_VL4__b = 3; + int8_t res_8_SV_VL5__b = 4; + int8_t res_8_SV_VL6__b = 5; + int8_t res_8_SV_VL7__b = 6; + int8_t res_8_SV_VL8__b = 7; + int8_t res_8_SV_VL16__b = 15; + uint8_t res_8_SV_VL1_u_b = 0; + uint8_t res_8_SV_VL2_u_b = 1; + uint8_t res_8_SV_VL3_u_b = 2; + uint8_t res_8_SV_VL4_u_b = 3; + uint8_t res_8_SV_VL5_u_b = 4; + uint8_t res_8_SV_VL6_u_b = 5; + uint8_t res_8_SV_VL7_u_b = 6; + uint8_t res_8_SV_VL8_u_b = 7; + uint8_t res_8_SV_VL16_u_b = 15; + int16_t res_16_SV_VL1__b = 0; + int16_t res_16_SV_VL2__b = 1; + int16_t res_16_SV_VL3__b = 2; + int16_t res_16_SV_VL4__b = 3; + int16_t res_16_SV_VL5__b = 4; + int16_t res_16_SV_VL6__b = 5; + int16_t res_16_SV_VL7__b = 6; + int16_t res_16_SV_VL8__b = 7; + int16_t res_16_SV_VL16__b = 15; + uint16_t res_16_SV_VL1_u_b = 0; + uint16_t res_16_SV_VL2_u_b = 1; + uint16_t res_16_SV_VL3_u_b = 2; + uint16_t res_16_SV_VL4_u_b = 3; + uint16_t res_16_SV_VL5_u_b = 4; + uint16_t res_16_SV_VL6_u_b = 5; + uint16_t res_16_SV_VL7_u_b = 6; + uint16_t res_16_SV_VL8_u_b = 7; + uint16_t res_16_SV_VL16_u_b = 15; + int32_t res_32_SV_VL1__b = 0; + int32_t res_32_SV_VL2__b = 1; + int32_t res_32_SV_VL3__b = 2; + int32_t res_32_SV_VL4__b = 3; + int32_t res_32_SV_VL5__b = 4; + int32_t res_32_SV_VL6__b = 5; + int32_t res_32_SV_VL7__b = 6; + int32_t 
res_32_SV_VL8__b = 7; + int32_t res_32_SV_VL16__b = 7; + uint32_t res_32_SV_VL1_u_b = 0; + uint32_t res_32_SV_VL2_u_b = 1; + uint32_t res_32_SV_VL3_u_b = 2; + uint32_t res_32_SV_VL4_u_b = 3; + uint32_t res_32_SV_VL5_u_b = 4; + uint32_t res_32_SV_VL6_u_b = 5; + uint32_t res_32_SV_VL7_u_b = 6; + uint32_t res_32_SV_VL8_u_b = 7; + uint32_t res_32_SV_VL16_u_b = 7; + int64_t res_64_SV_VL1__b = 0; + int64_t res_64_SV_VL2__b = 1; + int64_t res_64_SV_VL3__b = 2; + int64_t res_64_SV_VL4__b = 3; + int64_t res_64_SV_VL5__b = 3; + int64_t res_64_SV_VL6__b = 3; + int64_t res_64_SV_VL7__b = 3; + int64_t res_64_SV_VL8__b = 3; + int64_t res_64_SV_VL16__b = 3; + uint64_t res_64_SV_VL1_u_b = 0; + uint64_t res_64_SV_VL2_u_b = 1; + uint64_t res_64_SV_VL3_u_b = 2; + uint64_t res_64_SV_VL4_u_b = 3; + uint64_t res_64_SV_VL5_u_b = 3; + uint64_t res_64_SV_VL6_u_b = 3; + uint64_t res_64_SV_VL7_u_b = 3; + uint64_t res_64_SV_VL8_u_b = 3; + uint64_t res_64_SV_VL16_u_b = 3; + + int8_t res_8_SV_VL1__a_false = 0; + int8_t res_8_SV_VL2__a_false = 0; + int8_t res_8_SV_VL3__a_false = 0; + int8_t res_8_SV_VL4__a_false = 0; + int8_t res_8_SV_VL5__a_false = 0; + int8_t res_8_SV_VL6__a_false = 0; + int8_t res_8_SV_VL7__a_false = 0; + int8_t res_8_SV_VL8__a_false = 0; + int8_t res_8_SV_VL16__a_false = 0; + uint8_t res_8_SV_VL1_u_a_false = 0; + uint8_t res_8_SV_VL2_u_a_false = 0; + uint8_t res_8_SV_VL3_u_a_false = 0; + uint8_t res_8_SV_VL4_u_a_false = 0; + uint8_t res_8_SV_VL5_u_a_false = 0; + uint8_t res_8_SV_VL6_u_a_false = 0; + uint8_t res_8_SV_VL7_u_a_false = 0; + uint8_t res_8_SV_VL8_u_a_false = 0; + uint8_t res_8_SV_VL16_u_a_false = 0; + int16_t res_16_SV_VL1__a_false = 0; + int16_t res_16_SV_VL2__a_false = 0; + int16_t res_16_SV_VL3__a_false = 0; + int16_t res_16_SV_VL4__a_false = 0; + int16_t res_16_SV_VL5__a_false = 0; + int16_t res_16_SV_VL6__a_false = 0; + int16_t res_16_SV_VL7__a_false = 0; + int16_t res_16_SV_VL8__a_false = 0; + int16_t res_16_SV_VL16__a_false = 0; + uint16_t 
res_16_SV_VL1_u_a_false = 0; + uint16_t res_16_SV_VL2_u_a_false = 0; + uint16_t res_16_SV_VL3_u_a_false = 0; + uint16_t res_16_SV_VL4_u_a_false = 0; + uint16_t res_16_SV_VL5_u_a_false = 0; + uint16_t res_16_SV_VL6_u_a_false = 0; + uint16_t res_16_SV_VL7_u_a_false = 0; + uint16_t res_16_SV_VL8_u_a_false = 0; + uint16_t res_16_SV_VL16_u_a_false = 0; + int32_t res_32_SV_VL1__a_false = 0; + int32_t res_32_SV_VL2__a_false = 0; + int32_t res_32_SV_VL3__a_false = 0; + int32_t res_32_SV_VL4__a_false = 0; + int32_t res_32_SV_VL5__a_false = 0; + int32_t res_32_SV_VL6__a_false = 0; + int32_t res_32_SV_VL7__a_false = 0; + int32_t res_32_SV_VL8__a_false = 0; + int32_t res_32_SV_VL16__a_false = 0; + uint32_t res_32_SV_VL1_u_a_false = 0; + uint32_t res_32_SV_VL2_u_a_false = 0; + uint32_t res_32_SV_VL3_u_a_false = 0; + uint32_t res_32_SV_VL4_u_a_false = 0; + uint32_t res_32_SV_VL5_u_a_false = 0; + uint32_t res_32_SV_VL6_u_a_false = 0; + uint32_t res_32_SV_VL7_u_a_false = 0; + uint32_t res_32_SV_VL8_u_a_false = 0; + uint32_t res_32_SV_VL16_u_a_false = 0; + int64_t res_64_SV_VL1__a_false = 0; + int64_t res_64_SV_VL2__a_false = 0; + int64_t res_64_SV_VL3__a_false = 0; + int64_t res_64_SV_VL4__a_false = 0; + int64_t res_64_SV_VL5__a_false = 0; + int64_t res_64_SV_VL6__a_false = 0; + int64_t res_64_SV_VL7__a_false = 0; + int64_t res_64_SV_VL8__a_false = 0; + int64_t res_64_SV_VL16__a_false = 0; + uint64_t res_64_SV_VL1_u_a_false = 0; + uint64_t res_64_SV_VL2_u_a_false = 0; + uint64_t res_64_SV_VL3_u_a_false = 0; + uint64_t res_64_SV_VL4_u_a_false = 0; + uint64_t res_64_SV_VL5_u_a_false = 0; + uint64_t res_64_SV_VL6_u_a_false = 0; + uint64_t res_64_SV_VL7_u_a_false = 0; + uint64_t res_64_SV_VL8_u_a_false = 0; + uint64_t res_64_SV_VL16_u_a_false = 0; + int8_t res_8_SV_VL1__b_false = 31; + int8_t res_8_SV_VL2__b_false = 31; + int8_t res_8_SV_VL3__b_false = 31; + int8_t res_8_SV_VL4__b_false = 31; + int8_t res_8_SV_VL5__b_false = 31; + int8_t res_8_SV_VL6__b_false = 31; + int8_t 
res_8_SV_VL7__b_false = 31; + int8_t res_8_SV_VL8__b_false = 31; + int8_t res_8_SV_VL16__b_false = 31; + uint8_t res_8_SV_VL1_u_b_false = 31; + uint8_t res_8_SV_VL2_u_b_false = 31; + uint8_t res_8_SV_VL3_u_b_false = 31; + uint8_t res_8_SV_VL4_u_b_false = 31; + uint8_t res_8_SV_VL5_u_b_false = 31; + uint8_t res_8_SV_VL6_u_b_false = 31; + uint8_t res_8_SV_VL7_u_b_false = 31; + uint8_t res_8_SV_VL8_u_b_false = 31; + uint8_t res_8_SV_VL16_u_b_false = 31; + int16_t res_16_SV_VL1__b_false = 15; + int16_t res_16_SV_VL2__b_false = 15; + int16_t res_16_SV_VL3__b_false = 15; + int16_t res_16_SV_VL4__b_false = 15; + int16_t res_16_SV_VL5__b_false = 15; + int16_t res_16_SV_VL6__b_false = 15; + int16_t res_16_SV_VL7__b_false = 15; + int16_t res_16_SV_VL8__b_false = 15; + int16_t res_16_SV_VL16__b_false = 15; + uint16_t res_16_SV_VL1_u_b_false = 15; + uint16_t res_16_SV_VL2_u_b_false = 15; + uint16_t res_16_SV_VL3_u_b_false = 15; + uint16_t res_16_SV_VL4_u_b_false = 15; + uint16_t res_16_SV_VL5_u_b_false = 15; + uint16_t res_16_SV_VL6_u_b_false = 15; + uint16_t res_16_SV_VL7_u_b_false = 15; + uint16_t res_16_SV_VL8_u_b_false = 15; + uint16_t res_16_SV_VL16_u_b_false = 15; + int32_t res_32_SV_VL1__b_false = 7; + int32_t res_32_SV_VL2__b_false = 7; + int32_t res_32_SV_VL3__b_false = 7; + int32_t res_32_SV_VL4__b_false = 7; + int32_t res_32_SV_VL5__b_false = 7; + int32_t res_32_SV_VL6__b_false = 7; + int32_t res_32_SV_VL7__b_false = 7; + int32_t res_32_SV_VL8__b_false = 7; + int32_t res_32_SV_VL16__b_false = 7; + uint32_t res_32_SV_VL1_u_b_false = 7; + uint32_t res_32_SV_VL2_u_b_false = 7; + uint32_t res_32_SV_VL3_u_b_false = 7; + uint32_t res_32_SV_VL4_u_b_false = 7; + uint32_t res_32_SV_VL5_u_b_false = 7; + uint32_t res_32_SV_VL6_u_b_false = 7; + uint32_t res_32_SV_VL7_u_b_false = 7; + uint32_t res_32_SV_VL8_u_b_false = 7; + uint32_t res_32_SV_VL16_u_b_false = 7; + int64_t res_64_SV_VL1__b_false = 3; + int64_t res_64_SV_VL2__b_false = 3; + int64_t res_64_SV_VL3__b_false = 3; + 
int64_t res_64_SV_VL4__b_false = 3; + int64_t res_64_SV_VL5__b_false = 3; + int64_t res_64_SV_VL6__b_false = 3; + int64_t res_64_SV_VL7__b_false = 3; + int64_t res_64_SV_VL8__b_false = 3; + int64_t res_64_SV_VL16__b_false = 3; + uint64_t res_64_SV_VL1_u_b_false = 3; + uint64_t res_64_SV_VL2_u_b_false = 3; + uint64_t res_64_SV_VL3_u_b_false = 3; + uint64_t res_64_SV_VL4_u_b_false = 3; + uint64_t res_64_SV_VL5_u_b_false = 3; + uint64_t res_64_SV_VL6_u_b_false = 3; + uint64_t res_64_SV_VL7_u_b_false = 3; + uint64_t res_64_SV_VL8_u_b_false = 3; + uint64_t res_64_SV_VL16_u_b_false = 3; + + +#undef SVELAST_DEF +#define SVELAST_DEF(size, pat, sign, ab, su) \ + if (NAME (foo, size, pat, sign, ab) \ + (svindex_ ## su ## size (0, 1)) != \ + NAME (res, size, pat, sign, ab)) \ + __builtin_abort (); \ + if (NAMEF (foo, size, pat, sign, ab) \ + (svindex_ ## su ## size (0, 1)) != \ + NAMEF (res, size, pat, sign, ab)) \ + __builtin_abort (); + + ALL_POS () + + return 0; +} diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c index 1e38371842f..91fdd3c202e 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, all -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c index 491c35af221..7d824caae1b 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... 
** bl callee_bf16 -** ptrue (p[0-7])\.b, vl128 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c index eebb913273a..e0aa3a5fa68 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl16 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c index 73c3b2ec045..3238015d9eb 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl256 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c index 29744c81402..50861098934 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl32 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c index cf25c31bcbf..300dacce955 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... 
** bl callee_bf16 -** ptrue (p[0-7])\.b, vl64 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c index 9ad3e227654..0a840a38384 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, all -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c index d573e5fc69c..18cefbff1e6 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl128 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c index 200b0eb8242..c622ed55674 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl16 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c index f6f8858fd47..3286280687d 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... 
** bl callee_bf16 -** ptrue (p[0-7])\.b, vl256 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c index e62f59cc885..3c6afa2fdf1 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl32 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c index 483558cb576..bb7d3ebf9d4 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl64 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ -- 2.17.1 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-03-16 11:39 [PATCH] [PR96339] AArch64: Optimise svlast[ab] Tejas Belagod @ 2023-05-04 5:43 ` Tejas Belagod 2023-05-11 19:32 ` Richard Sandiford 1 sibling, 0 replies; 10+ messages in thread From: Tejas Belagod @ 2023-05-04 5:43 UTC (permalink / raw) To: gcc-patches; +Cc: Richard Sandiford [-- Attachment #1: Type: text/plain, Size: 41590 bytes --] [Ping] From: Tejas Belagod <tejas.belagod@arm.com> Date: Thursday, March 16, 2023 at 5:09 PM To: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org> Cc: Tejas Belagod <Tejas.Belagod@arm.com>, Richard Sandiford <Richard.Sandiford@arm.com> Subject: [PATCH] [PR96339] AArch64: Optimise svlast[ab] From: Tejas Belagod <tbelagod@arm.com> This PR optimizes an SVE intrinsics sequence such as svlasta (svptrue_pat_b8 (SV_VL1), x), where a scalar is selected based on a constant predicate and a variable vector. This sequence is optimized to return the corresponding element of a NEON vector. For example, svlasta (svptrue_pat_b8 (SV_VL1), x) returns umov w0, v0.b[1] Likewise, svlastb (svptrue_pat_b8 (SV_VL1), x) returns umov w0, v0.b[0] This optimization only works provided the constant predicate maps to a range that is within the bounds of a 128-bit NEON register. gcc/ChangeLog: PR target/96339 * config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold sve calls that have a constant input predicate vector. (svlast_impl::is_lasta): Query to check if intrinsic is svlasta. (svlast_impl::is_lastb): Query to check if intrinsic is svlastb. (svlast_impl::vect_all_same): Check if all vector elements are equal. gcc/testsuite/ChangeLog: PR target/96339 * gcc.target/aarch64/sve/acle/general-c/svlast.c: New. * gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New. * gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New. * gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm to expect optimized code for function body.
* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise. * gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise. --- .../aarch64/aarch64-sve-builtins-base.cc | 124 +++++++ .../aarch64/sve/acle/general-c/svlast.c | 63 ++++ .../sve/acle/general-c/svlast128_run.c | 313 +++++++++++++++++ .../sve/acle/general-c/svlast256_run.c | 314 ++++++++++++++++++ .../gcc.target/aarch64/sve/pcs/return_4.c | 2 - .../aarch64/sve/pcs/return_4_1024.c | 2 - .../gcc.target/aarch64/sve/pcs/return_4_128.c | 2 - .../aarch64/sve/pcs/return_4_2048.c | 2 - .../gcc.target/aarch64/sve/pcs/return_4_256.c | 2 - .../gcc.target/aarch64/sve/pcs/return_4_512.c | 2 - .../gcc.target/aarch64/sve/pcs/return_5.c | 2 - .../aarch64/sve/pcs/return_5_1024.c | 2 - .../gcc.target/aarch64/sve/pcs/return_5_128.c | 2 - .../aarch64/sve/pcs/return_5_2048.c | 2 - .../gcc.target/aarch64/sve/pcs/return_5_256.c | 2 - .../gcc.target/aarch64/sve/pcs/return_5_512.c | 2 - 16 files changed, 814 insertions(+), 24 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index 
cd9cace3c9b..db2b4dcaac9 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base> public: CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {} + bool is_lasta () const { return m_unspec == UNSPEC_LASTA; } + bool is_lastb () const { return m_unspec == UNSPEC_LASTB; } + + bool vect_all_same (tree v , int step) const + { + int i; + int nelts = vector_cst_encoded_nelts (v); + int first_el = 0; + + for (i = first_el; i < nelts; i += step) + if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el)) + return false; + + return true; + } + + /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF. + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure + the index of the element being accessed is in the range of a NEON vector + width. */ + gimple *fold (gimple_folder & f) const override + { + tree pred = gimple_call_arg (f.call, 0); + tree val = gimple_call_arg (f.call, 1); + + if (TREE_CODE (pred) == VECTOR_CST) + { + HOST_WIDE_INT pos; + unsigned int const_vg; + int i = 0; + int step = f.type_suffix (0).element_bytes; + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred)); + int npats = VECTOR_CST_NPATTERNS (pred); + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred); + tree b = NULL_TREE; + bool const_vl = aarch64_sve_vg.is_constant (&const_vg); + + /* We can optimize 2 cases common to variable and fixed-length cases + without a linear search of the predicate vector: + 1. LASTA if predicate is all true, return element 0. + 2. LASTA if predicate all false, return element 0. 
*/ + if (is_lasta () && vect_all_same (pred, step_1)) + { + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0)); + return gimple_build_assign (f.lhs, b); + } + + /* Handle the all-false case for LASTB where SVE VL == 128b - + return the highest numbered element. */ + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16) + && vect_all_same (pred, step_1) + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0))) + { + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, + bitsize_int (step * BITS_PER_UNIT), + bitsize_int ((16 - step) * BITS_PER_UNIT)); + + return gimple_build_assign (f.lhs, b); + } + + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of + 'step_1' in + [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1] + is zero, then we can treat the vector as VECTOR_CST_NPATTERNS + elements followed by all inactive elements. */ + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2) + for (i = npats; i < nelts; i += step_1) + { + /* If there are active elements in the repeated pattern of + a variable-length vector, then return NULL as there is no way + to be sure statically if this falls within the NEON range. */ + if (!integer_zerop (VECTOR_CST_ENCODED_ELT (pred, i))) + return NULL; + } + + /* If we're here, it means either: + 1. The vector is variable-length and there's no active element in the + repeated part of the pattern, or + 2. The vector is fixed-length. + Fall through to a linear search. */ + + /* Restrict the scope of search to NPATS if vector is + variable-length. */ + if (!VECTOR_CST_NELTS (pred).is_constant (&nelts)) + nelts = npats; + + /* Fall through to finding the last active element linearly for + all cases where the last active element is known to be + within a statically-determinable range.
*/ + i = MAX ((int)nelts - step, 0); + for (; i >= 0; i -= step) + if (!integer_zerop (VECTOR_CST_ELT (pred, i))) + break; + + if (is_lastb ()) + { + /* For LASTB, the element is the last active element. */ + pos = i; + } + else + { + /* For LASTA, the element is one after last active element. */ + pos = i + step; + + /* If last active element is + last element, wrap-around and return first NEON element. */ + if (known_ge (pos, BYTES_PER_SVE_VECTOR)) + pos = 0; + } + + /* Out of NEON range. */ + if (pos < 0 || pos > 15) + return NULL; + + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, + bitsize_int (step * BITS_PER_UNIT), + bitsize_int (pos * BITS_PER_UNIT)); + + return gimple_build_assign (f.lhs, b); + } + return NULL; + } + rtx expand (function_expander &e) const override { diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c new file mode 100644 index 00000000000..fdbe5e309af --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c @@ -0,0 +1,63 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -msve-vector-bits=256" } */ + +#include <stdint.h> +#include "arm_sve.h" + +#define NAME(name, size, pat, sign, ab) \ + name ## _ ## size ## _ ## pat ## _ ## sign ## _ ## ab + +#define NAMEF(name, size, pat, sign, ab) \ + name ## _ ## size ## _ ## pat ## _ ## sign ## _ ## ab ## _false + +#define SVTYPE(size, sign) \ + sv ## sign ## int ## size ## _t + +#define STYPE(size, sign) sign ## int ## size ##_t + +#define SVELAST_DEF(size, pat, sign, ab, su) \ + STYPE (size, sign) __attribute__((noinline)) \ + NAME (foo, size, pat, sign, ab) (SVTYPE (size, sign) x) \ + { \ + return svlast ## ab (svptrue_pat_b ## size (pat), x); \ + } \ + STYPE (size, sign) __attribute__((noinline)) \ + NAMEF (foo, size, pat, sign, ab) (SVTYPE (size, sign) x) \ + { \ + return svlast ## ab (svpfalse (), x); \ + } + +#define ALL_PATS(SIZE, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL1, SIGN, 
AB, SU) \ + SVELAST_DEF (SIZE, SV_VL2, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL3, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL4, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL5, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL6, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL7, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL8, SIGN, AB, SU) \ + SVELAST_DEF (SIZE, SV_VL16, SIGN, AB, SU) + +#define ALL_SIGN(SIZE, AB) \ + ALL_PATS (SIZE, , AB, s) \ + ALL_PATS (SIZE, u, AB, u) + +#define ALL_SIZE(AB) \ + ALL_SIGN (8, AB) \ + ALL_SIGN (16, AB) \ + ALL_SIGN (32, AB) \ + ALL_SIGN (64, AB) + +#define ALL_POS() \ + ALL_SIZE (a) \ + ALL_SIZE (b) + + +ALL_POS() + +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b} 52 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h} 50 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s} 12 } } */ +/* { dg-final { scan-assembler-times {\tfmov\tw0, s0} 24 } } */ +/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d} 4 } } */ +/* { dg-final { scan-assembler-times {\tfmov\tx0, d0} 32 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c new file mode 100644 index 00000000000..5e1e9303d7b --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c @@ -0,0 +1,313 @@ +/* { dg-do run { target aarch64_sve128_hw } } */ +/* { dg-options "-O3 -msve-vector-bits=128 -std=gnu99" } */ + +#include "svlast.c" + +int +main (void) +{ + int8_t res_8_SV_VL1__a = 1; + int8_t res_8_SV_VL2__a = 2; + int8_t res_8_SV_VL3__a = 3; + int8_t res_8_SV_VL4__a = 4; + int8_t res_8_SV_VL5__a = 5; + int8_t res_8_SV_VL6__a = 6; + int8_t res_8_SV_VL7__a = 7; + int8_t res_8_SV_VL8__a = 8; + int8_t res_8_SV_VL16__a = 0; + uint8_t res_8_SV_VL1_u_a = 1; + uint8_t res_8_SV_VL2_u_a = 2; + uint8_t res_8_SV_VL3_u_a = 3; + uint8_t res_8_SV_VL4_u_a = 4; + uint8_t res_8_SV_VL5_u_a = 5; + uint8_t 
res_8_SV_VL6_u_a = 6; + uint8_t res_8_SV_VL7_u_a = 7; + uint8_t res_8_SV_VL8_u_a = 8; + uint8_t res_8_SV_VL16_u_a = 0; + int16_t res_16_SV_VL1__a = 1; + int16_t res_16_SV_VL2__a = 2; + int16_t res_16_SV_VL3__a = 3; + int16_t res_16_SV_VL4__a = 4; + int16_t res_16_SV_VL5__a = 5; + int16_t res_16_SV_VL6__a = 6; + int16_t res_16_SV_VL7__a = 7; + int16_t res_16_SV_VL8__a = 0; + int16_t res_16_SV_VL16__a = 0; + uint16_t res_16_SV_VL1_u_a = 1; + uint16_t res_16_SV_VL2_u_a = 2; + uint16_t res_16_SV_VL3_u_a = 3; + uint16_t res_16_SV_VL4_u_a = 4; + uint16_t res_16_SV_VL5_u_a = 5; + uint16_t res_16_SV_VL6_u_a = 6; + uint16_t res_16_SV_VL7_u_a = 7; + uint16_t res_16_SV_VL8_u_a = 0; + uint16_t res_16_SV_VL16_u_a = 0; + int32_t res_32_SV_VL1__a = 1; + int32_t res_32_SV_VL2__a = 2; + int32_t res_32_SV_VL3__a = 3; + int32_t res_32_SV_VL4__a = 0; + int32_t res_32_SV_VL5__a = 0; + int32_t res_32_SV_VL6__a = 0; + int32_t res_32_SV_VL7__a = 0; + int32_t res_32_SV_VL8__a = 0; + int32_t res_32_SV_VL16__a = 0; + uint32_t res_32_SV_VL1_u_a = 1; + uint32_t res_32_SV_VL2_u_a = 2; + uint32_t res_32_SV_VL3_u_a = 3; + uint32_t res_32_SV_VL4_u_a = 0; + uint32_t res_32_SV_VL5_u_a = 0; + uint32_t res_32_SV_VL6_u_a = 0; + uint32_t res_32_SV_VL7_u_a = 0; + uint32_t res_32_SV_VL8_u_a = 0; + uint32_t res_32_SV_VL16_u_a = 0; + int64_t res_64_SV_VL1__a = 1; + int64_t res_64_SV_VL2__a = 0; + int64_t res_64_SV_VL3__a = 0; + int64_t res_64_SV_VL4__a = 0; + int64_t res_64_SV_VL5__a = 0; + int64_t res_64_SV_VL6__a = 0; + int64_t res_64_SV_VL7__a = 0; + int64_t res_64_SV_VL8__a = 0; + int64_t res_64_SV_VL16__a = 0; + uint64_t res_64_SV_VL1_u_a = 1; + uint64_t res_64_SV_VL2_u_a = 0; + uint64_t res_64_SV_VL3_u_a = 0; + uint64_t res_64_SV_VL4_u_a = 0; + uint64_t res_64_SV_VL5_u_a = 0; + uint64_t res_64_SV_VL6_u_a = 0; + uint64_t res_64_SV_VL7_u_a = 0; + uint64_t res_64_SV_VL8_u_a = 0; + uint64_t res_64_SV_VL16_u_a = 0; + int8_t res_8_SV_VL1__b = 0; + int8_t res_8_SV_VL2__b = 1; + int8_t res_8_SV_VL3__b = 2; + 
int8_t res_8_SV_VL4__b = 3; + int8_t res_8_SV_VL5__b = 4; + int8_t res_8_SV_VL6__b = 5; + int8_t res_8_SV_VL7__b = 6; + int8_t res_8_SV_VL8__b = 7; + int8_t res_8_SV_VL16__b = 15; + uint8_t res_8_SV_VL1_u_b = 0; + uint8_t res_8_SV_VL2_u_b = 1; + uint8_t res_8_SV_VL3_u_b = 2; + uint8_t res_8_SV_VL4_u_b = 3; + uint8_t res_8_SV_VL5_u_b = 4; + uint8_t res_8_SV_VL6_u_b = 5; + uint8_t res_8_SV_VL7_u_b = 6; + uint8_t res_8_SV_VL8_u_b = 7; + uint8_t res_8_SV_VL16_u_b = 15; + int16_t res_16_SV_VL1__b = 0; + int16_t res_16_SV_VL2__b = 1; + int16_t res_16_SV_VL3__b = 2; + int16_t res_16_SV_VL4__b = 3; + int16_t res_16_SV_VL5__b = 4; + int16_t res_16_SV_VL6__b = 5; + int16_t res_16_SV_VL7__b = 6; + int16_t res_16_SV_VL8__b = 7; + int16_t res_16_SV_VL16__b = 7; + uint16_t res_16_SV_VL1_u_b = 0; + uint16_t res_16_SV_VL2_u_b = 1; + uint16_t res_16_SV_VL3_u_b = 2; + uint16_t res_16_SV_VL4_u_b = 3; + uint16_t res_16_SV_VL5_u_b = 4; + uint16_t res_16_SV_VL6_u_b = 5; + uint16_t res_16_SV_VL7_u_b = 6; + uint16_t res_16_SV_VL8_u_b = 7; + uint16_t res_16_SV_VL16_u_b = 7; + int32_t res_32_SV_VL1__b = 0; + int32_t res_32_SV_VL2__b = 1; + int32_t res_32_SV_VL3__b = 2; + int32_t res_32_SV_VL4__b = 3; + int32_t res_32_SV_VL5__b = 3; + int32_t res_32_SV_VL6__b = 3; + int32_t res_32_SV_VL7__b = 3; + int32_t res_32_SV_VL8__b = 3; + int32_t res_32_SV_VL16__b = 3; + uint32_t res_32_SV_VL1_u_b = 0; + uint32_t res_32_SV_VL2_u_b = 1; + uint32_t res_32_SV_VL3_u_b = 2; + uint32_t res_32_SV_VL4_u_b = 3; + uint32_t res_32_SV_VL5_u_b = 3; + uint32_t res_32_SV_VL6_u_b = 3; + uint32_t res_32_SV_VL7_u_b = 3; + uint32_t res_32_SV_VL8_u_b = 3; + uint32_t res_32_SV_VL16_u_b = 3; + int64_t res_64_SV_VL1__b = 0; + int64_t res_64_SV_VL2__b = 1; + int64_t res_64_SV_VL3__b = 1; + int64_t res_64_SV_VL4__b = 1; + int64_t res_64_SV_VL5__b = 1; + int64_t res_64_SV_VL6__b = 1; + int64_t res_64_SV_VL7__b = 1; + int64_t res_64_SV_VL8__b = 1; + int64_t res_64_SV_VL16__b = 1; + uint64_t res_64_SV_VL1_u_b = 0; + uint64_t 
res_64_SV_VL2_u_b = 1; + uint64_t res_64_SV_VL3_u_b = 1; + uint64_t res_64_SV_VL4_u_b = 1; + uint64_t res_64_SV_VL5_u_b = 1; + uint64_t res_64_SV_VL6_u_b = 1; + uint64_t res_64_SV_VL7_u_b = 1; + uint64_t res_64_SV_VL8_u_b = 1; + uint64_t res_64_SV_VL16_u_b = 1; + + int8_t res_8_SV_VL1__a_false = 0; + int8_t res_8_SV_VL2__a_false = 0; + int8_t res_8_SV_VL3__a_false = 0; + int8_t res_8_SV_VL4__a_false = 0; + int8_t res_8_SV_VL5__a_false = 0; + int8_t res_8_SV_VL6__a_false = 0; + int8_t res_8_SV_VL7__a_false = 0; + int8_t res_8_SV_VL8__a_false = 0; + int8_t res_8_SV_VL16__a_false = 0; + uint8_t res_8_SV_VL1_u_a_false = 0; + uint8_t res_8_SV_VL2_u_a_false = 0; + uint8_t res_8_SV_VL3_u_a_false = 0; + uint8_t res_8_SV_VL4_u_a_false = 0; + uint8_t res_8_SV_VL5_u_a_false = 0; + uint8_t res_8_SV_VL6_u_a_false = 0; + uint8_t res_8_SV_VL7_u_a_false = 0; + uint8_t res_8_SV_VL8_u_a_false = 0; + uint8_t res_8_SV_VL16_u_a_false = 0; + int16_t res_16_SV_VL1__a_false = 0; + int16_t res_16_SV_VL2__a_false = 0; + int16_t res_16_SV_VL3__a_false = 0; + int16_t res_16_SV_VL4__a_false = 0; + int16_t res_16_SV_VL5__a_false = 0; + int16_t res_16_SV_VL6__a_false = 0; + int16_t res_16_SV_VL7__a_false = 0; + int16_t res_16_SV_VL8__a_false = 0; + int16_t res_16_SV_VL16__a_false = 0; + uint16_t res_16_SV_VL1_u_a_false = 0; + uint16_t res_16_SV_VL2_u_a_false = 0; + uint16_t res_16_SV_VL3_u_a_false = 0; + uint16_t res_16_SV_VL4_u_a_false = 0; + uint16_t res_16_SV_VL5_u_a_false = 0; + uint16_t res_16_SV_VL6_u_a_false = 0; + uint16_t res_16_SV_VL7_u_a_false = 0; + uint16_t res_16_SV_VL8_u_a_false = 0; + uint16_t res_16_SV_VL16_u_a_false = 0; + int32_t res_32_SV_VL1__a_false = 0; + int32_t res_32_SV_VL2__a_false = 0; + int32_t res_32_SV_VL3__a_false = 0; + int32_t res_32_SV_VL4__a_false = 0; + int32_t res_32_SV_VL5__a_false = 0; + int32_t res_32_SV_VL6__a_false = 0; + int32_t res_32_SV_VL7__a_false = 0; + int32_t res_32_SV_VL8__a_false = 0; + int32_t res_32_SV_VL16__a_false = 0; + uint32_t 
res_32_SV_VL1_u_a_false = 0; + uint32_t res_32_SV_VL2_u_a_false = 0; + uint32_t res_32_SV_VL3_u_a_false = 0; + uint32_t res_32_SV_VL4_u_a_false = 0; + uint32_t res_32_SV_VL5_u_a_false = 0; + uint32_t res_32_SV_VL6_u_a_false = 0; + uint32_t res_32_SV_VL7_u_a_false = 0; + uint32_t res_32_SV_VL8_u_a_false = 0; + uint32_t res_32_SV_VL16_u_a_false = 0; + int64_t res_64_SV_VL1__a_false = 0; + int64_t res_64_SV_VL2__a_false = 0; + int64_t res_64_SV_VL3__a_false = 0; + int64_t res_64_SV_VL4__a_false = 0; + int64_t res_64_SV_VL5__a_false = 0; + int64_t res_64_SV_VL6__a_false = 0; + int64_t res_64_SV_VL7__a_false = 0; + int64_t res_64_SV_VL8__a_false = 0; + int64_t res_64_SV_VL16__a_false = 0; + uint64_t res_64_SV_VL1_u_a_false = 0; + uint64_t res_64_SV_VL2_u_a_false = 0; + uint64_t res_64_SV_VL3_u_a_false = 0; + uint64_t res_64_SV_VL4_u_a_false = 0; + uint64_t res_64_SV_VL5_u_a_false = 0; + uint64_t res_64_SV_VL6_u_a_false = 0; + uint64_t res_64_SV_VL7_u_a_false = 0; + uint64_t res_64_SV_VL8_u_a_false = 0; + uint64_t res_64_SV_VL16_u_a_false = 0; + int8_t res_8_SV_VL1__b_false = 15; + int8_t res_8_SV_VL2__b_false = 15; + int8_t res_8_SV_VL3__b_false = 15; + int8_t res_8_SV_VL4__b_false = 15; + int8_t res_8_SV_VL5__b_false = 15; + int8_t res_8_SV_VL6__b_false = 15; + int8_t res_8_SV_VL7__b_false = 15; + int8_t res_8_SV_VL8__b_false = 15; + int8_t res_8_SV_VL16__b_false = 15; + uint8_t res_8_SV_VL1_u_b_false = 15; + uint8_t res_8_SV_VL2_u_b_false = 15; + uint8_t res_8_SV_VL3_u_b_false = 15; + uint8_t res_8_SV_VL4_u_b_false = 15; + uint8_t res_8_SV_VL5_u_b_false = 15; + uint8_t res_8_SV_VL6_u_b_false = 15; + uint8_t res_8_SV_VL7_u_b_false = 15; + uint8_t res_8_SV_VL8_u_b_false = 15; + uint8_t res_8_SV_VL16_u_b_false = 15; + int16_t res_16_SV_VL1__b_false = 7; + int16_t res_16_SV_VL2__b_false = 7; + int16_t res_16_SV_VL3__b_false = 7; + int16_t res_16_SV_VL4__b_false = 7; + int16_t res_16_SV_VL5__b_false = 7; + int16_t res_16_SV_VL6__b_false = 7; + int16_t 
res_16_SV_VL7__b_false = 7; + int16_t res_16_SV_VL8__b_false = 7; + int16_t res_16_SV_VL16__b_false = 7; + uint16_t res_16_SV_VL1_u_b_false = 7; + uint16_t res_16_SV_VL2_u_b_false = 7; + uint16_t res_16_SV_VL3_u_b_false = 7; + uint16_t res_16_SV_VL4_u_b_false = 7; + uint16_t res_16_SV_VL5_u_b_false = 7; + uint16_t res_16_SV_VL6_u_b_false = 7; + uint16_t res_16_SV_VL7_u_b_false = 7; + uint16_t res_16_SV_VL8_u_b_false = 7; + uint16_t res_16_SV_VL16_u_b_false = 7; + int32_t res_32_SV_VL1__b_false = 3; + int32_t res_32_SV_VL2__b_false = 3; + int32_t res_32_SV_VL3__b_false = 3; + int32_t res_32_SV_VL4__b_false = 3; + int32_t res_32_SV_VL5__b_false = 3; + int32_t res_32_SV_VL6__b_false = 3; + int32_t res_32_SV_VL7__b_false = 3; + int32_t res_32_SV_VL8__b_false = 3; + int32_t res_32_SV_VL16__b_false = 3; + uint32_t res_32_SV_VL1_u_b_false = 3; + uint32_t res_32_SV_VL2_u_b_false = 3; + uint32_t res_32_SV_VL3_u_b_false = 3; + uint32_t res_32_SV_VL4_u_b_false = 3; + uint32_t res_32_SV_VL5_u_b_false = 3; + uint32_t res_32_SV_VL6_u_b_false = 3; + uint32_t res_32_SV_VL7_u_b_false = 3; + uint32_t res_32_SV_VL8_u_b_false = 3; + uint32_t res_32_SV_VL16_u_b_false = 3; + int64_t res_64_SV_VL1__b_false = 1; + int64_t res_64_SV_VL2__b_false = 1; + int64_t res_64_SV_VL3__b_false = 1; + int64_t res_64_SV_VL4__b_false = 1; + int64_t res_64_SV_VL5__b_false = 1; + int64_t res_64_SV_VL6__b_false = 1; + int64_t res_64_SV_VL7__b_false = 1; + int64_t res_64_SV_VL8__b_false = 1; + int64_t res_64_SV_VL16__b_false = 1; + uint64_t res_64_SV_VL1_u_b_false = 1; + uint64_t res_64_SV_VL2_u_b_false = 1; + uint64_t res_64_SV_VL3_u_b_false = 1; + uint64_t res_64_SV_VL4_u_b_false = 1; + uint64_t res_64_SV_VL5_u_b_false = 1; + uint64_t res_64_SV_VL6_u_b_false = 1; + uint64_t res_64_SV_VL7_u_b_false = 1; + uint64_t res_64_SV_VL8_u_b_false = 1; + uint64_t res_64_SV_VL16_u_b_false = 1; + +#undef SVELAST_DEF +#define SVELAST_DEF(size, pat, sign, ab, su) \ + if (NAME (foo, size, pat, sign, ab) \ + (svindex_ ## 
su ## size (0, 1)) != \ + NAME (res, size, pat, sign, ab)) \ + __builtin_abort (); \ + if (NAMEF (foo, size, pat, sign, ab) \ + (svindex_ ## su ## size (0, 1)) != \ + NAMEF (res, size, pat, sign, ab)) \ + __builtin_abort (); + + ALL_POS () + + return 0; +} diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c new file mode 100644 index 00000000000..f6ba7ea7d89 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c @@ -0,0 +1,314 @@ +/* { dg-do run { target aarch64_sve256_hw } } */ +/* { dg-options "-O3 -msve-vector-bits=256 -std=gnu99" } */ + +#include "svlast.c" + +int +main (void) +{ + int8_t res_8_SV_VL1__a = 1; + int8_t res_8_SV_VL2__a = 2; + int8_t res_8_SV_VL3__a = 3; + int8_t res_8_SV_VL4__a = 4; + int8_t res_8_SV_VL5__a = 5; + int8_t res_8_SV_VL6__a = 6; + int8_t res_8_SV_VL7__a = 7; + int8_t res_8_SV_VL8__a = 8; + int8_t res_8_SV_VL16__a = 16; + uint8_t res_8_SV_VL1_u_a = 1; + uint8_t res_8_SV_VL2_u_a = 2; + uint8_t res_8_SV_VL3_u_a = 3; + uint8_t res_8_SV_VL4_u_a = 4; + uint8_t res_8_SV_VL5_u_a = 5; + uint8_t res_8_SV_VL6_u_a = 6; + uint8_t res_8_SV_VL7_u_a = 7; + uint8_t res_8_SV_VL8_u_a = 8; + uint8_t res_8_SV_VL16_u_a = 16; + int16_t res_16_SV_VL1__a = 1; + int16_t res_16_SV_VL2__a = 2; + int16_t res_16_SV_VL3__a = 3; + int16_t res_16_SV_VL4__a = 4; + int16_t res_16_SV_VL5__a = 5; + int16_t res_16_SV_VL6__a = 6; + int16_t res_16_SV_VL7__a = 7; + int16_t res_16_SV_VL8__a = 8; + int16_t res_16_SV_VL16__a = 0; + uint16_t res_16_SV_VL1_u_a = 1; + uint16_t res_16_SV_VL2_u_a = 2; + uint16_t res_16_SV_VL3_u_a = 3; + uint16_t res_16_SV_VL4_u_a = 4; + uint16_t res_16_SV_VL5_u_a = 5; + uint16_t res_16_SV_VL6_u_a = 6; + uint16_t res_16_SV_VL7_u_a = 7; + uint16_t res_16_SV_VL8_u_a = 8; + uint16_t res_16_SV_VL16_u_a = 0; + int32_t res_32_SV_VL1__a = 1; + int32_t res_32_SV_VL2__a = 2; + int32_t res_32_SV_VL3__a = 3; + int32_t 
res_32_SV_VL4__a = 4; + int32_t res_32_SV_VL5__a = 5; + int32_t res_32_SV_VL6__a = 6; + int32_t res_32_SV_VL7__a = 7; + int32_t res_32_SV_VL8__a = 0; + int32_t res_32_SV_VL16__a = 0; + uint32_t res_32_SV_VL1_u_a = 1; + uint32_t res_32_SV_VL2_u_a = 2; + uint32_t res_32_SV_VL3_u_a = 3; + uint32_t res_32_SV_VL4_u_a = 4; + uint32_t res_32_SV_VL5_u_a = 5; + uint32_t res_32_SV_VL6_u_a = 6; + uint32_t res_32_SV_VL7_u_a = 7; + uint32_t res_32_SV_VL8_u_a = 0; + uint32_t res_32_SV_VL16_u_a = 0; + int64_t res_64_SV_VL1__a = 1; + int64_t res_64_SV_VL2__a = 2; + int64_t res_64_SV_VL3__a = 3; + int64_t res_64_SV_VL4__a = 0; + int64_t res_64_SV_VL5__a = 0; + int64_t res_64_SV_VL6__a = 0; + int64_t res_64_SV_VL7__a = 0; + int64_t res_64_SV_VL8__a = 0; + int64_t res_64_SV_VL16__a = 0; + uint64_t res_64_SV_VL1_u_a = 1; + uint64_t res_64_SV_VL2_u_a = 2; + uint64_t res_64_SV_VL3_u_a = 3; + uint64_t res_64_SV_VL4_u_a = 0; + uint64_t res_64_SV_VL5_u_a = 0; + uint64_t res_64_SV_VL6_u_a = 0; + uint64_t res_64_SV_VL7_u_a = 0; + uint64_t res_64_SV_VL8_u_a = 0; + uint64_t res_64_SV_VL16_u_a = 0; + int8_t res_8_SV_VL1__b = 0; + int8_t res_8_SV_VL2__b = 1; + int8_t res_8_SV_VL3__b = 2; + int8_t res_8_SV_VL4__b = 3; + int8_t res_8_SV_VL5__b = 4; + int8_t res_8_SV_VL6__b = 5; + int8_t res_8_SV_VL7__b = 6; + int8_t res_8_SV_VL8__b = 7; + int8_t res_8_SV_VL16__b = 15; + uint8_t res_8_SV_VL1_u_b = 0; + uint8_t res_8_SV_VL2_u_b = 1; + uint8_t res_8_SV_VL3_u_b = 2; + uint8_t res_8_SV_VL4_u_b = 3; + uint8_t res_8_SV_VL5_u_b = 4; + uint8_t res_8_SV_VL6_u_b = 5; + uint8_t res_8_SV_VL7_u_b = 6; + uint8_t res_8_SV_VL8_u_b = 7; + uint8_t res_8_SV_VL16_u_b = 15; + int16_t res_16_SV_VL1__b = 0; + int16_t res_16_SV_VL2__b = 1; + int16_t res_16_SV_VL3__b = 2; + int16_t res_16_SV_VL4__b = 3; + int16_t res_16_SV_VL5__b = 4; + int16_t res_16_SV_VL6__b = 5; + int16_t res_16_SV_VL7__b = 6; + int16_t res_16_SV_VL8__b = 7; + int16_t res_16_SV_VL16__b = 15; + uint16_t res_16_SV_VL1_u_b = 0; + uint16_t 
res_16_SV_VL2_u_b = 1; + uint16_t res_16_SV_VL3_u_b = 2; + uint16_t res_16_SV_VL4_u_b = 3; + uint16_t res_16_SV_VL5_u_b = 4; + uint16_t res_16_SV_VL6_u_b = 5; + uint16_t res_16_SV_VL7_u_b = 6; + uint16_t res_16_SV_VL8_u_b = 7; + uint16_t res_16_SV_VL16_u_b = 15; + int32_t res_32_SV_VL1__b = 0; + int32_t res_32_SV_VL2__b = 1; + int32_t res_32_SV_VL3__b = 2; + int32_t res_32_SV_VL4__b = 3; + int32_t res_32_SV_VL5__b = 4; + int32_t res_32_SV_VL6__b = 5; + int32_t res_32_SV_VL7__b = 6; + int32_t res_32_SV_VL8__b = 7; + int32_t res_32_SV_VL16__b = 7; + uint32_t res_32_SV_VL1_u_b = 0; + uint32_t res_32_SV_VL2_u_b = 1; + uint32_t res_32_SV_VL3_u_b = 2; + uint32_t res_32_SV_VL4_u_b = 3; + uint32_t res_32_SV_VL5_u_b = 4; + uint32_t res_32_SV_VL6_u_b = 5; + uint32_t res_32_SV_VL7_u_b = 6; + uint32_t res_32_SV_VL8_u_b = 7; + uint32_t res_32_SV_VL16_u_b = 7; + int64_t res_64_SV_VL1__b = 0; + int64_t res_64_SV_VL2__b = 1; + int64_t res_64_SV_VL3__b = 2; + int64_t res_64_SV_VL4__b = 3; + int64_t res_64_SV_VL5__b = 3; + int64_t res_64_SV_VL6__b = 3; + int64_t res_64_SV_VL7__b = 3; + int64_t res_64_SV_VL8__b = 3; + int64_t res_64_SV_VL16__b = 3; + uint64_t res_64_SV_VL1_u_b = 0; + uint64_t res_64_SV_VL2_u_b = 1; + uint64_t res_64_SV_VL3_u_b = 2; + uint64_t res_64_SV_VL4_u_b = 3; + uint64_t res_64_SV_VL5_u_b = 3; + uint64_t res_64_SV_VL6_u_b = 3; + uint64_t res_64_SV_VL7_u_b = 3; + uint64_t res_64_SV_VL8_u_b = 3; + uint64_t res_64_SV_VL16_u_b = 3; + + int8_t res_8_SV_VL1__a_false = 0; + int8_t res_8_SV_VL2__a_false = 0; + int8_t res_8_SV_VL3__a_false = 0; + int8_t res_8_SV_VL4__a_false = 0; + int8_t res_8_SV_VL5__a_false = 0; + int8_t res_8_SV_VL6__a_false = 0; + int8_t res_8_SV_VL7__a_false = 0; + int8_t res_8_SV_VL8__a_false = 0; + int8_t res_8_SV_VL16__a_false = 0; + uint8_t res_8_SV_VL1_u_a_false = 0; + uint8_t res_8_SV_VL2_u_a_false = 0; + uint8_t res_8_SV_VL3_u_a_false = 0; + uint8_t res_8_SV_VL4_u_a_false = 0; + uint8_t res_8_SV_VL5_u_a_false = 0; + uint8_t 
res_8_SV_VL6_u_a_false = 0; + uint8_t res_8_SV_VL7_u_a_false = 0; + uint8_t res_8_SV_VL8_u_a_false = 0; + uint8_t res_8_SV_VL16_u_a_false = 0; + int16_t res_16_SV_VL1__a_false = 0; + int16_t res_16_SV_VL2__a_false = 0; + int16_t res_16_SV_VL3__a_false = 0; + int16_t res_16_SV_VL4__a_false = 0; + int16_t res_16_SV_VL5__a_false = 0; + int16_t res_16_SV_VL6__a_false = 0; + int16_t res_16_SV_VL7__a_false = 0; + int16_t res_16_SV_VL8__a_false = 0; + int16_t res_16_SV_VL16__a_false = 0; + uint16_t res_16_SV_VL1_u_a_false = 0; + uint16_t res_16_SV_VL2_u_a_false = 0; + uint16_t res_16_SV_VL3_u_a_false = 0; + uint16_t res_16_SV_VL4_u_a_false = 0; + uint16_t res_16_SV_VL5_u_a_false = 0; + uint16_t res_16_SV_VL6_u_a_false = 0; + uint16_t res_16_SV_VL7_u_a_false = 0; + uint16_t res_16_SV_VL8_u_a_false = 0; + uint16_t res_16_SV_VL16_u_a_false = 0; + int32_t res_32_SV_VL1__a_false = 0; + int32_t res_32_SV_VL2__a_false = 0; + int32_t res_32_SV_VL3__a_false = 0; + int32_t res_32_SV_VL4__a_false = 0; + int32_t res_32_SV_VL5__a_false = 0; + int32_t res_32_SV_VL6__a_false = 0; + int32_t res_32_SV_VL7__a_false = 0; + int32_t res_32_SV_VL8__a_false = 0; + int32_t res_32_SV_VL16__a_false = 0; + uint32_t res_32_SV_VL1_u_a_false = 0; + uint32_t res_32_SV_VL2_u_a_false = 0; + uint32_t res_32_SV_VL3_u_a_false = 0; + uint32_t res_32_SV_VL4_u_a_false = 0; + uint32_t res_32_SV_VL5_u_a_false = 0; + uint32_t res_32_SV_VL6_u_a_false = 0; + uint32_t res_32_SV_VL7_u_a_false = 0; + uint32_t res_32_SV_VL8_u_a_false = 0; + uint32_t res_32_SV_VL16_u_a_false = 0; + int64_t res_64_SV_VL1__a_false = 0; + int64_t res_64_SV_VL2__a_false = 0; + int64_t res_64_SV_VL3__a_false = 0; + int64_t res_64_SV_VL4__a_false = 0; + int64_t res_64_SV_VL5__a_false = 0; + int64_t res_64_SV_VL6__a_false = 0; + int64_t res_64_SV_VL7__a_false = 0; + int64_t res_64_SV_VL8__a_false = 0; + int64_t res_64_SV_VL16__a_false = 0; + uint64_t res_64_SV_VL1_u_a_false = 0; + uint64_t res_64_SV_VL2_u_a_false = 0; + uint64_t 
res_64_SV_VL3_u_a_false = 0; + uint64_t res_64_SV_VL4_u_a_false = 0; + uint64_t res_64_SV_VL5_u_a_false = 0; + uint64_t res_64_SV_VL6_u_a_false = 0; + uint64_t res_64_SV_VL7_u_a_false = 0; + uint64_t res_64_SV_VL8_u_a_false = 0; + uint64_t res_64_SV_VL16_u_a_false = 0; + int8_t res_8_SV_VL1__b_false = 31; + int8_t res_8_SV_VL2__b_false = 31; + int8_t res_8_SV_VL3__b_false = 31; + int8_t res_8_SV_VL4__b_false = 31; + int8_t res_8_SV_VL5__b_false = 31; + int8_t res_8_SV_VL6__b_false = 31; + int8_t res_8_SV_VL7__b_false = 31; + int8_t res_8_SV_VL8__b_false = 31; + int8_t res_8_SV_VL16__b_false = 31; + uint8_t res_8_SV_VL1_u_b_false = 31; + uint8_t res_8_SV_VL2_u_b_false = 31; + uint8_t res_8_SV_VL3_u_b_false = 31; + uint8_t res_8_SV_VL4_u_b_false = 31; + uint8_t res_8_SV_VL5_u_b_false = 31; + uint8_t res_8_SV_VL6_u_b_false = 31; + uint8_t res_8_SV_VL7_u_b_false = 31; + uint8_t res_8_SV_VL8_u_b_false = 31; + uint8_t res_8_SV_VL16_u_b_false = 31; + int16_t res_16_SV_VL1__b_false = 15; + int16_t res_16_SV_VL2__b_false = 15; + int16_t res_16_SV_VL3__b_false = 15; + int16_t res_16_SV_VL4__b_false = 15; + int16_t res_16_SV_VL5__b_false = 15; + int16_t res_16_SV_VL6__b_false = 15; + int16_t res_16_SV_VL7__b_false = 15; + int16_t res_16_SV_VL8__b_false = 15; + int16_t res_16_SV_VL16__b_false = 15; + uint16_t res_16_SV_VL1_u_b_false = 15; + uint16_t res_16_SV_VL2_u_b_false = 15; + uint16_t res_16_SV_VL3_u_b_false = 15; + uint16_t res_16_SV_VL4_u_b_false = 15; + uint16_t res_16_SV_VL5_u_b_false = 15; + uint16_t res_16_SV_VL6_u_b_false = 15; + uint16_t res_16_SV_VL7_u_b_false = 15; + uint16_t res_16_SV_VL8_u_b_false = 15; + uint16_t res_16_SV_VL16_u_b_false = 15; + int32_t res_32_SV_VL1__b_false = 7; + int32_t res_32_SV_VL2__b_false = 7; + int32_t res_32_SV_VL3__b_false = 7; + int32_t res_32_SV_VL4__b_false = 7; + int32_t res_32_SV_VL5__b_false = 7; + int32_t res_32_SV_VL6__b_false = 7; + int32_t res_32_SV_VL7__b_false = 7; + int32_t res_32_SV_VL8__b_false = 7; + int32_t 
res_32_SV_VL16__b_false = 7; + uint32_t res_32_SV_VL1_u_b_false = 7; + uint32_t res_32_SV_VL2_u_b_false = 7; + uint32_t res_32_SV_VL3_u_b_false = 7; + uint32_t res_32_SV_VL4_u_b_false = 7; + uint32_t res_32_SV_VL5_u_b_false = 7; + uint32_t res_32_SV_VL6_u_b_false = 7; + uint32_t res_32_SV_VL7_u_b_false = 7; + uint32_t res_32_SV_VL8_u_b_false = 7; + uint32_t res_32_SV_VL16_u_b_false = 7; + int64_t res_64_SV_VL1__b_false = 3; + int64_t res_64_SV_VL2__b_false = 3; + int64_t res_64_SV_VL3__b_false = 3; + int64_t res_64_SV_VL4__b_false = 3; + int64_t res_64_SV_VL5__b_false = 3; + int64_t res_64_SV_VL6__b_false = 3; + int64_t res_64_SV_VL7__b_false = 3; + int64_t res_64_SV_VL8__b_false = 3; + int64_t res_64_SV_VL16__b_false = 3; + uint64_t res_64_SV_VL1_u_b_false = 3; + uint64_t res_64_SV_VL2_u_b_false = 3; + uint64_t res_64_SV_VL3_u_b_false = 3; + uint64_t res_64_SV_VL4_u_b_false = 3; + uint64_t res_64_SV_VL5_u_b_false = 3; + uint64_t res_64_SV_VL6_u_b_false = 3; + uint64_t res_64_SV_VL7_u_b_false = 3; + uint64_t res_64_SV_VL8_u_b_false = 3; + uint64_t res_64_SV_VL16_u_b_false = 3; + + +#undef SVELAST_DEF +#define SVELAST_DEF(size, pat, sign, ab, su) \ + if (NAME (foo, size, pat, sign, ab) \ + (svindex_ ## su ## size (0 ,1)) != \ + NAME (res, size, pat, sign, ab)) \ + __builtin_abort (); \ + if (NAMEF (foo, size, pat, sign, ab) \ + (svindex_ ## su ## size (0 ,1)) != \ + NAMEF (res, size, pat, sign, ab)) \ + __builtin_abort (); + + ALL_POS () + + return 0; +} diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c index 1e38371842f..91fdd3c202e 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... 
** bl callee_bf16 -** ptrue (p[0-7])\.b, all -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c index 491c35af221..7d824caae1b 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl128 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c index eebb913273a..e0aa3a5fa68 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl16 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c index 73c3b2ec045..3238015d9eb 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl256 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c index 29744c81402..50861098934 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... 
** bl callee_bf16 -** ptrue (p[0-7])\.b, vl32 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c index cf25c31bcbf..300dacce955 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl64 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c index 9ad3e227654..0a840a38384 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, all -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c index d573e5fc69c..18cefbff1e6 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl128 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c index 200b0eb8242..c622ed55674 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... 
** bl callee_bf16 -** ptrue (p[0-7])\.b, vl16 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c index f6f8858fd47..3286280687d 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl256 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c index e62f59cc885..3c6afa2fdf1 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl32 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c index 483558cb576..bb7d3ebf9d4 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) ** caller_bf16: ** ... ** bl callee_bf16 -** ptrue (p[0-7])\.b, vl64 -** lasta h0, \1, z0\.h ** ldp x29, x30, \[sp\], 16 ** ret */ -- 2.17.1 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-03-16 11:39 [PATCH] [PR96339] AArch64: Optimise svlast[ab] Tejas Belagod 2023-05-04 5:43 ` Tejas Belagod @ 2023-05-11 19:32 ` Richard Sandiford 2023-05-16 8:03 ` Tejas Belagod 1 sibling, 1 reply; 10+ messages in thread From: Richard Sandiford @ 2023-05-11 19:32 UTC (permalink / raw) To: Tejas Belagod; +Cc: gcc-patches, Tejas Belagod Tejas Belagod <tejas.belagod@arm.com> writes: > From: Tejas Belagod <tbelagod@arm.com> > > This PR optimizes an SVE intrinsics sequence such as > svlasta (svptrue_pat_b8 (SV_VL1), x) > where a scalar is selected based on a constant predicate and a variable vector. > This sequence is optimized to return the corresponding element of a NEON > vector. For example, > svlasta (svptrue_pat_b8 (SV_VL1), x) > returns > umov w0, v0.b[1] > Likewise, > svlastb (svptrue_pat_b8 (SV_VL1), x) > returns > umov w0, v0.b[0] > This optimization only works provided the constant predicate maps to a range > that is within the bounds of a 128-bit NEON register. > > gcc/ChangeLog: > > PR target/96339 > * config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold sve > calls that have a constant input predicate vector. > (svlast_impl::is_lasta): Query to check if intrinsic is svlasta. > (svlast_impl::is_lastb): Query to check if intrinsic is svlastb. > (svlast_impl::vect_all_same): Check if all vector elements are equal. > > gcc/testsuite/ChangeLog: > > PR target/96339 > * gcc.target/aarch64/sve/acle/general-c/svlast.c: New. > * gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New. > * gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New. > * gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm > to expect optimized code for function body. > * gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise. 
> * gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise. > --- > .../aarch64/aarch64-sve-builtins-base.cc | 124 +++++++ > .../aarch64/sve/acle/general-c/svlast.c | 63 ++++ > .../sve/acle/general-c/svlast128_run.c | 313 +++++++++++++++++ > .../sve/acle/general-c/svlast256_run.c | 314 ++++++++++++++++++ > .../gcc.target/aarch64/sve/pcs/return_4.c | 2 - > .../aarch64/sve/pcs/return_4_1024.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_4_128.c | 2 - > .../aarch64/sve/pcs/return_4_2048.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_4_256.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_4_512.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_5.c | 2 - > .../aarch64/sve/pcs/return_5_1024.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_5_128.c | 2 - > .../aarch64/sve/pcs/return_5_2048.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_5_256.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_5_512.c | 2 - > 16 files changed, 814 insertions(+), 24 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c > > diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc > index cd9cace3c9b..db2b4dcaac9 100644 > --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc > +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc > 
@@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base> > public: > CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {} > > + bool is_lasta () const { return m_unspec == UNSPEC_LASTA; } > + bool is_lastb () const { return m_unspec == UNSPEC_LASTB; } > + > + bool vect_all_same (tree v , int step) const Nit: stray space after "v". > + { > + int i; > + int nelts = vector_cst_encoded_nelts (v); > + int first_el = 0; > + > + for (i = first_el; i < nelts; i += step) > + if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el)) I think this should use !operand_equal_p (..., ..., 0). > + return false; > + > + return true; > + } > + > + /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF. > + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure > + the index of the element being accessed is in the range of a NEON vector > + width. */ s/NEON/Advanced SIMD/. Same in later comments > + gimple *fold (gimple_folder & f) const override > + { > + tree pred = gimple_call_arg (f.call, 0); > + tree val = gimple_call_arg (f.call, 1); > + > + if (TREE_CODE (pred) == VECTOR_CST) > + { > + HOST_WIDE_INT pos; > + unsigned int const_vg; > + int i = 0; > + int step = f.type_suffix (0).element_bytes; > + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred)); > + int npats = VECTOR_CST_NPATTERNS (pred); > + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred); > + tree b = NULL_TREE; > + bool const_vl = aarch64_sve_vg.is_constant (&const_vg); I think this might be left over from previous versions, but: const_vg isn't used and const_vl is only used once, so I think it would be better to remove them. > + > + /* We can optimize 2 cases common to variable and fixed-length cases > + without a linear search of the predicate vector: > + 1. LASTA if predicate is all true, return element 0. > + 2. LASTA if predicate all false, return element 0. 
*/ > + if (is_lasta () && vect_all_same (pred, step_1)) > + { > + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, > + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0)); > + return gimple_build_assign (f.lhs, b); > + } > + > + /* Handle the all-false case for LASTB where SVE VL == 128b - > + return the highest numbered element. */ > + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16) > + && vect_all_same (pred, step_1) > + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0))) Formatting nit: one condition per line once one line isn't enough. > + { > + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, > + bitsize_int (step * BITS_PER_UNIT), > + bitsize_int ((16 - step) * BITS_PER_UNIT)); > + > + return gimple_build_assign (f.lhs, b); > + } > + > + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of > + 'step_1' in > + [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1] > + is zero, then we can treat the vector as VECTOR_CST_NPATTERNS > + elements followed by all inactive elements. */ > + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2) Following on from the above, maybe use: !VECTOR_CST_NELTS (pred).is_constant () instead of !const_vl here. I have a horrible suspicion that I'm contradicting our earlier discussion here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2. OK with those changes, thanks. I'm going to have to take your word for the tests being right :) Richard > + for (i = npats; i < nelts; i += step_1) > + { > + /* If there are active elements in the repeated pattern of > + a variable-length vector, then return NULL as there is no way > + to be sure statically if this falls within the NEON range. */ > + if (!integer_zerop (VECTOR_CST_ENCODED_ELT (pred, i))) > + return NULL; > + } > + > + /* If we're here, it means either: > + 1. The vector is variable-length and there's no active element in the > + repeated part of the pattern, or > + 2. The vector is fixed-length. 
> + Fall-through to a linear search. */ > + > + /* Restrict the scope of search to NPATS if vector is > + variable-length. */ > + if (!VECTOR_CST_NELTS (pred).is_constant (&nelts)) > + nelts = npats; > + > + /* Fall through to finding the last active element linearly for > + for all cases where the last active element is known to be > + within a statically-determinable range. */ > + i = MAX ((int)nelts - step, 0); > + for (; i >= 0; i -= step) > + if (!integer_zerop (VECTOR_CST_ELT (pred, i))) > + break; > + > + if (is_lastb ()) > + { > + /* For LASTB, the element is the last active element. */ > + pos = i; > + } > + else > + { > + /* For LASTA, the element is one after last active element. */ > + pos = i + step; > + > + /* If last active element is > + last element, wrap-around and return first NEON element. */ > + if (known_ge (pos, BYTES_PER_SVE_VECTOR)) > + pos = 0; > + } > + > + /* Out of NEON range. */ > + if (pos < 0 || pos > 15) > + return NULL; > + > + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, > + bitsize_int (step * BITS_PER_UNIT), > + bitsize_int (pos * BITS_PER_UNIT)); > + > + return gimple_build_assign (f.lhs, b); > + } > + return NULL; > + } > + > rtx > expand (function_expander &e) const override > { > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c > new file mode 100644 > index 00000000000..fdbe5e309af > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c > @@ -0,0 +1,63 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3 -msve-vector-bits=256" } */ > + > +#include <stdint.h> > +#include "arm_sve.h" > + > +#define NAME(name, size, pat, sign, ab) \ > + name ## _ ## size ## _ ## pat ## _ ## sign ## _ ## ab > + > +#define NAMEF(name, size, pat, sign, ab) \ > + name ## _ ## size ## _ ## pat ## _ ## sign ## _ ## ab ## _false > + > +#define SVTYPE(size, sign) \ > + sv ## sign ## int ## size ## _t > + > +#define STYPE(size, 
sign) sign ## int ## size ##_t > + > +#define SVELAST_DEF(size, pat, sign, ab, su) \ > + STYPE (size, sign) __attribute__((noinline)) \ > + NAME (foo, size, pat, sign, ab) (SVTYPE (size, sign) x) \ > + { \ > + return svlast ## ab (svptrue_pat_b ## size (pat), x); \ > + } \ > + STYPE (size, sign) __attribute__((noinline)) \ > + NAMEF (foo, size, pat, sign, ab) (SVTYPE (size, sign) x) \ > + { \ > + return svlast ## ab (svpfalse (), x); \ > + } > + > +#define ALL_PATS(SIZE, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL1, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL2, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL3, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL4, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL5, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL6, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL7, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL8, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL16, SIGN, AB, SU) > + > +#define ALL_SIGN(SIZE, AB) \ > + ALL_PATS (SIZE, , AB, s) \ > + ALL_PATS (SIZE, u, AB, u) > + > +#define ALL_SIZE(AB) \ > + ALL_SIGN (8, AB) \ > + ALL_SIGN (16, AB) \ > + ALL_SIGN (32, AB) \ > + ALL_SIGN (64, AB) > + > +#define ALL_POS() \ > + ALL_SIZE (a) \ > + ALL_SIZE (b) > + > + > +ALL_POS() > + > +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b} 52 } } */ > +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h} 50 } } */ > +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s} 12 } } */ > +/* { dg-final { scan-assembler-times {\tfmov\tw0, s0} 24 } } */ > +/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d} 4 } } */ > +/* { dg-final { scan-assembler-times {\tfmov\tx0, d0} 32 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c > new file mode 100644 > index 00000000000..5e1e9303d7b > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c > @@ -0,0 +1,313 @@ > 
+/* { dg-do run { target aarch64_sve128_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=128 -std=gnu99" } */ > + > +#include "svlast.c" > + > +int > +main (void) > +{ > + int8_t res_8_SV_VL1__a = 1; > + int8_t res_8_SV_VL2__a = 2; > + int8_t res_8_SV_VL3__a = 3; > + int8_t res_8_SV_VL4__a = 4; > + int8_t res_8_SV_VL5__a = 5; > + int8_t res_8_SV_VL6__a = 6; > + int8_t res_8_SV_VL7__a = 7; > + int8_t res_8_SV_VL8__a = 8; > + int8_t res_8_SV_VL16__a = 0; > + uint8_t res_8_SV_VL1_u_a = 1; > + uint8_t res_8_SV_VL2_u_a = 2; > + uint8_t res_8_SV_VL3_u_a = 3; > + uint8_t res_8_SV_VL4_u_a = 4; > + uint8_t res_8_SV_VL5_u_a = 5; > + uint8_t res_8_SV_VL6_u_a = 6; > + uint8_t res_8_SV_VL7_u_a = 7; > + uint8_t res_8_SV_VL8_u_a = 8; > + uint8_t res_8_SV_VL16_u_a = 0; > + int16_t res_16_SV_VL1__a = 1; > + int16_t res_16_SV_VL2__a = 2; > + int16_t res_16_SV_VL3__a = 3; > + int16_t res_16_SV_VL4__a = 4; > + int16_t res_16_SV_VL5__a = 5; > + int16_t res_16_SV_VL6__a = 6; > + int16_t res_16_SV_VL7__a = 7; > + int16_t res_16_SV_VL8__a = 0; > + int16_t res_16_SV_VL16__a = 0; > + uint16_t res_16_SV_VL1_u_a = 1; > + uint16_t res_16_SV_VL2_u_a = 2; > + uint16_t res_16_SV_VL3_u_a = 3; > + uint16_t res_16_SV_VL4_u_a = 4; > + uint16_t res_16_SV_VL5_u_a = 5; > + uint16_t res_16_SV_VL6_u_a = 6; > + uint16_t res_16_SV_VL7_u_a = 7; > + uint16_t res_16_SV_VL8_u_a = 0; > + uint16_t res_16_SV_VL16_u_a = 0; > + int32_t res_32_SV_VL1__a = 1; > + int32_t res_32_SV_VL2__a = 2; > + int32_t res_32_SV_VL3__a = 3; > + int32_t res_32_SV_VL4__a = 0; > + int32_t res_32_SV_VL5__a = 0; > + int32_t res_32_SV_VL6__a = 0; > + int32_t res_32_SV_VL7__a = 0; > + int32_t res_32_SV_VL8__a = 0; > + int32_t res_32_SV_VL16__a = 0; > + uint32_t res_32_SV_VL1_u_a = 1; > + uint32_t res_32_SV_VL2_u_a = 2; > + uint32_t res_32_SV_VL3_u_a = 3; > + uint32_t res_32_SV_VL4_u_a = 0; > + uint32_t res_32_SV_VL5_u_a = 0; > + uint32_t res_32_SV_VL6_u_a = 0; > + uint32_t res_32_SV_VL7_u_a = 0; > + uint32_t res_32_SV_VL8_u_a = 0; > + 
uint32_t res_32_SV_VL16_u_a = 0; > + int64_t res_64_SV_VL1__a = 1; > + int64_t res_64_SV_VL2__a = 0; > + int64_t res_64_SV_VL3__a = 0; > + int64_t res_64_SV_VL4__a = 0; > + int64_t res_64_SV_VL5__a = 0; > + int64_t res_64_SV_VL6__a = 0; > + int64_t res_64_SV_VL7__a = 0; > + int64_t res_64_SV_VL8__a = 0; > + int64_t res_64_SV_VL16__a = 0; > + uint64_t res_64_SV_VL1_u_a = 1; > + uint64_t res_64_SV_VL2_u_a = 0; > + uint64_t res_64_SV_VL3_u_a = 0; > + uint64_t res_64_SV_VL4_u_a = 0; > + uint64_t res_64_SV_VL5_u_a = 0; > + uint64_t res_64_SV_VL6_u_a = 0; > + uint64_t res_64_SV_VL7_u_a = 0; > + uint64_t res_64_SV_VL8_u_a = 0; > + uint64_t res_64_SV_VL16_u_a = 0; > + int8_t res_8_SV_VL1__b = 0; > + int8_t res_8_SV_VL2__b = 1; > + int8_t res_8_SV_VL3__b = 2; > + int8_t res_8_SV_VL4__b = 3; > + int8_t res_8_SV_VL5__b = 4; > + int8_t res_8_SV_VL6__b = 5; > + int8_t res_8_SV_VL7__b = 6; > + int8_t res_8_SV_VL8__b = 7; > + int8_t res_8_SV_VL16__b = 15; > + uint8_t res_8_SV_VL1_u_b = 0; > + uint8_t res_8_SV_VL2_u_b = 1; > + uint8_t res_8_SV_VL3_u_b = 2; > + uint8_t res_8_SV_VL4_u_b = 3; > + uint8_t res_8_SV_VL5_u_b = 4; > + uint8_t res_8_SV_VL6_u_b = 5; > + uint8_t res_8_SV_VL7_u_b = 6; > + uint8_t res_8_SV_VL8_u_b = 7; > + uint8_t res_8_SV_VL16_u_b = 15; > + int16_t res_16_SV_VL1__b = 0; > + int16_t res_16_SV_VL2__b = 1; > + int16_t res_16_SV_VL3__b = 2; > + int16_t res_16_SV_VL4__b = 3; > + int16_t res_16_SV_VL5__b = 4; > + int16_t res_16_SV_VL6__b = 5; > + int16_t res_16_SV_VL7__b = 6; > + int16_t res_16_SV_VL8__b = 7; > + int16_t res_16_SV_VL16__b = 7; > + uint16_t res_16_SV_VL1_u_b = 0; > + uint16_t res_16_SV_VL2_u_b = 1; > + uint16_t res_16_SV_VL3_u_b = 2; > + uint16_t res_16_SV_VL4_u_b = 3; > + uint16_t res_16_SV_VL5_u_b = 4; > + uint16_t res_16_SV_VL6_u_b = 5; > + uint16_t res_16_SV_VL7_u_b = 6; > + uint16_t res_16_SV_VL8_u_b = 7; > + uint16_t res_16_SV_VL16_u_b = 7; > + int32_t res_32_SV_VL1__b = 0; > + int32_t res_32_SV_VL2__b = 1; > + int32_t res_32_SV_VL3__b = 2; > 
+ int32_t res_32_SV_VL4__b = 3; > + int32_t res_32_SV_VL5__b = 3; > + int32_t res_32_SV_VL6__b = 3; > + int32_t res_32_SV_VL7__b = 3; > + int32_t res_32_SV_VL8__b = 3; > + int32_t res_32_SV_VL16__b = 3; > + uint32_t res_32_SV_VL1_u_b = 0; > + uint32_t res_32_SV_VL2_u_b = 1; > + uint32_t res_32_SV_VL3_u_b = 2; > + uint32_t res_32_SV_VL4_u_b = 3; > + uint32_t res_32_SV_VL5_u_b = 3; > + uint32_t res_32_SV_VL6_u_b = 3; > + uint32_t res_32_SV_VL7_u_b = 3; > + uint32_t res_32_SV_VL8_u_b = 3; > + uint32_t res_32_SV_VL16_u_b = 3; > + int64_t res_64_SV_VL1__b = 0; > + int64_t res_64_SV_VL2__b = 1; > + int64_t res_64_SV_VL3__b = 1; > + int64_t res_64_SV_VL4__b = 1; > + int64_t res_64_SV_VL5__b = 1; > + int64_t res_64_SV_VL6__b = 1; > + int64_t res_64_SV_VL7__b = 1; > + int64_t res_64_SV_VL8__b = 1; > + int64_t res_64_SV_VL16__b = 1; > + uint64_t res_64_SV_VL1_u_b = 0; > + uint64_t res_64_SV_VL2_u_b = 1; > + uint64_t res_64_SV_VL3_u_b = 1; > + uint64_t res_64_SV_VL4_u_b = 1; > + uint64_t res_64_SV_VL5_u_b = 1; > + uint64_t res_64_SV_VL6_u_b = 1; > + uint64_t res_64_SV_VL7_u_b = 1; > + uint64_t res_64_SV_VL8_u_b = 1; > + uint64_t res_64_SV_VL16_u_b = 1; > + > + int8_t res_8_SV_VL1__a_false = 0; > + int8_t res_8_SV_VL2__a_false = 0; > + int8_t res_8_SV_VL3__a_false = 0; > + int8_t res_8_SV_VL4__a_false = 0; > + int8_t res_8_SV_VL5__a_false = 0; > + int8_t res_8_SV_VL6__a_false = 0; > + int8_t res_8_SV_VL7__a_false = 0; > + int8_t res_8_SV_VL8__a_false = 0; > + int8_t res_8_SV_VL16__a_false = 0; > + uint8_t res_8_SV_VL1_u_a_false = 0; > + uint8_t res_8_SV_VL2_u_a_false = 0; > + uint8_t res_8_SV_VL3_u_a_false = 0; > + uint8_t res_8_SV_VL4_u_a_false = 0; > + uint8_t res_8_SV_VL5_u_a_false = 0; > + uint8_t res_8_SV_VL6_u_a_false = 0; > + uint8_t res_8_SV_VL7_u_a_false = 0; > + uint8_t res_8_SV_VL8_u_a_false = 0; > + uint8_t res_8_SV_VL16_u_a_false = 0; > + int16_t res_16_SV_VL1__a_false = 0; > + int16_t res_16_SV_VL2__a_false = 0; > + int16_t res_16_SV_VL3__a_false = 0; > + int16_t 
res_16_SV_VL4__a_false = 0; > + int16_t res_16_SV_VL5__a_false = 0; > + int16_t res_16_SV_VL6__a_false = 0; > + int16_t res_16_SV_VL7__a_false = 0; > + int16_t res_16_SV_VL8__a_false = 0; > + int16_t res_16_SV_VL16__a_false = 0; > + uint16_t res_16_SV_VL1_u_a_false = 0; > + uint16_t res_16_SV_VL2_u_a_false = 0; > + uint16_t res_16_SV_VL3_u_a_false = 0; > + uint16_t res_16_SV_VL4_u_a_false = 0; > + uint16_t res_16_SV_VL5_u_a_false = 0; > + uint16_t res_16_SV_VL6_u_a_false = 0; > + uint16_t res_16_SV_VL7_u_a_false = 0; > + uint16_t res_16_SV_VL8_u_a_false = 0; > + uint16_t res_16_SV_VL16_u_a_false = 0; > + int32_t res_32_SV_VL1__a_false = 0; > + int32_t res_32_SV_VL2__a_false = 0; > + int32_t res_32_SV_VL3__a_false = 0; > + int32_t res_32_SV_VL4__a_false = 0; > + int32_t res_32_SV_VL5__a_false = 0; > + int32_t res_32_SV_VL6__a_false = 0; > + int32_t res_32_SV_VL7__a_false = 0; > + int32_t res_32_SV_VL8__a_false = 0; > + int32_t res_32_SV_VL16__a_false = 0; > + uint32_t res_32_SV_VL1_u_a_false = 0; > + uint32_t res_32_SV_VL2_u_a_false = 0; > + uint32_t res_32_SV_VL3_u_a_false = 0; > + uint32_t res_32_SV_VL4_u_a_false = 0; > + uint32_t res_32_SV_VL5_u_a_false = 0; > + uint32_t res_32_SV_VL6_u_a_false = 0; > + uint32_t res_32_SV_VL7_u_a_false = 0; > + uint32_t res_32_SV_VL8_u_a_false = 0; > + uint32_t res_32_SV_VL16_u_a_false = 0; > + int64_t res_64_SV_VL1__a_false = 0; > + int64_t res_64_SV_VL2__a_false = 0; > + int64_t res_64_SV_VL3__a_false = 0; > + int64_t res_64_SV_VL4__a_false = 0; > + int64_t res_64_SV_VL5__a_false = 0; > + int64_t res_64_SV_VL6__a_false = 0; > + int64_t res_64_SV_VL7__a_false = 0; > + int64_t res_64_SV_VL8__a_false = 0; > + int64_t res_64_SV_VL16__a_false = 0; > + uint64_t res_64_SV_VL1_u_a_false = 0; > + uint64_t res_64_SV_VL2_u_a_false = 0; > + uint64_t res_64_SV_VL3_u_a_false = 0; > + uint64_t res_64_SV_VL4_u_a_false = 0; > + uint64_t res_64_SV_VL5_u_a_false = 0; > + uint64_t res_64_SV_VL6_u_a_false = 0; > + uint64_t res_64_SV_VL7_u_a_false = 
0; > + uint64_t res_64_SV_VL8_u_a_false = 0; > + uint64_t res_64_SV_VL16_u_a_false = 0; > + int8_t res_8_SV_VL1__b_false = 15; > + int8_t res_8_SV_VL2__b_false = 15; > + int8_t res_8_SV_VL3__b_false = 15; > + int8_t res_8_SV_VL4__b_false = 15; > + int8_t res_8_SV_VL5__b_false = 15; > + int8_t res_8_SV_VL6__b_false = 15; > + int8_t res_8_SV_VL7__b_false = 15; > + int8_t res_8_SV_VL8__b_false = 15; > + int8_t res_8_SV_VL16__b_false = 15; > + uint8_t res_8_SV_VL1_u_b_false = 15; > + uint8_t res_8_SV_VL2_u_b_false = 15; > + uint8_t res_8_SV_VL3_u_b_false = 15; > + uint8_t res_8_SV_VL4_u_b_false = 15; > + uint8_t res_8_SV_VL5_u_b_false = 15; > + uint8_t res_8_SV_VL6_u_b_false = 15; > + uint8_t res_8_SV_VL7_u_b_false = 15; > + uint8_t res_8_SV_VL8_u_b_false = 15; > + uint8_t res_8_SV_VL16_u_b_false = 15; > + int16_t res_16_SV_VL1__b_false = 7; > + int16_t res_16_SV_VL2__b_false = 7; > + int16_t res_16_SV_VL3__b_false = 7; > + int16_t res_16_SV_VL4__b_false = 7; > + int16_t res_16_SV_VL5__b_false = 7; > + int16_t res_16_SV_VL6__b_false = 7; > + int16_t res_16_SV_VL7__b_false = 7; > + int16_t res_16_SV_VL8__b_false = 7; > + int16_t res_16_SV_VL16__b_false = 7; > + uint16_t res_16_SV_VL1_u_b_false = 7; > + uint16_t res_16_SV_VL2_u_b_false = 7; > + uint16_t res_16_SV_VL3_u_b_false = 7; > + uint16_t res_16_SV_VL4_u_b_false = 7; > + uint16_t res_16_SV_VL5_u_b_false = 7; > + uint16_t res_16_SV_VL6_u_b_false = 7; > + uint16_t res_16_SV_VL7_u_b_false = 7; > + uint16_t res_16_SV_VL8_u_b_false = 7; > + uint16_t res_16_SV_VL16_u_b_false = 7; > + int32_t res_32_SV_VL1__b_false = 3; > + int32_t res_32_SV_VL2__b_false = 3; > + int32_t res_32_SV_VL3__b_false = 3; > + int32_t res_32_SV_VL4__b_false = 3; > + int32_t res_32_SV_VL5__b_false = 3; > + int32_t res_32_SV_VL6__b_false = 3; > + int32_t res_32_SV_VL7__b_false = 3; > + int32_t res_32_SV_VL8__b_false = 3; > + int32_t res_32_SV_VL16__b_false = 3; > + uint32_t res_32_SV_VL1_u_b_false = 3; > + uint32_t res_32_SV_VL2_u_b_false = 3; > + 
uint32_t res_32_SV_VL3_u_b_false = 3; > + uint32_t res_32_SV_VL4_u_b_false = 3; > + uint32_t res_32_SV_VL5_u_b_false = 3; > + uint32_t res_32_SV_VL6_u_b_false = 3; > + uint32_t res_32_SV_VL7_u_b_false = 3; > + uint32_t res_32_SV_VL8_u_b_false = 3; > + uint32_t res_32_SV_VL16_u_b_false = 3; > + int64_t res_64_SV_VL1__b_false = 1; > + int64_t res_64_SV_VL2__b_false = 1; > + int64_t res_64_SV_VL3__b_false = 1; > + int64_t res_64_SV_VL4__b_false = 1; > + int64_t res_64_SV_VL5__b_false = 1; > + int64_t res_64_SV_VL6__b_false = 1; > + int64_t res_64_SV_VL7__b_false = 1; > + int64_t res_64_SV_VL8__b_false = 1; > + int64_t res_64_SV_VL16__b_false = 1; > + uint64_t res_64_SV_VL1_u_b_false = 1; > + uint64_t res_64_SV_VL2_u_b_false = 1; > + uint64_t res_64_SV_VL3_u_b_false = 1; > + uint64_t res_64_SV_VL4_u_b_false = 1; > + uint64_t res_64_SV_VL5_u_b_false = 1; > + uint64_t res_64_SV_VL6_u_b_false = 1; > + uint64_t res_64_SV_VL7_u_b_false = 1; > + uint64_t res_64_SV_VL8_u_b_false = 1; > + uint64_t res_64_SV_VL16_u_b_false = 1; > + > +#undef SVELAST_DEF > +#define SVELAST_DEF(size, pat, sign, ab, su) \ > + if (NAME (foo, size, pat, sign, ab) \ > + (svindex_ ## su ## size (0, 1)) != \ > + NAME (res, size, pat, sign, ab)) \ > + __builtin_abort (); \ > + if (NAMEF (foo, size, pat, sign, ab) \ > + (svindex_ ## su ## size (0, 1)) != \ > + NAMEF (res, size, pat, sign, ab)) \ > + __builtin_abort (); > + > + ALL_POS () > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c > new file mode 100644 > index 00000000000..f6ba7ea7d89 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c > @@ -0,0 +1,314 @@ > +/* { dg-do run { target aarch64_sve256_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=256 -std=gnu99" } */ > + > +#include "svlast.c" > + > +int > +main (void) > +{ > + int8_t res_8_SV_VL1__a = 1; > + int8_t res_8_SV_VL2__a = 
2; > + int8_t res_8_SV_VL3__a = 3; > + int8_t res_8_SV_VL4__a = 4; > + int8_t res_8_SV_VL5__a = 5; > + int8_t res_8_SV_VL6__a = 6; > + int8_t res_8_SV_VL7__a = 7; > + int8_t res_8_SV_VL8__a = 8; > + int8_t res_8_SV_VL16__a = 16; > + uint8_t res_8_SV_VL1_u_a = 1; > + uint8_t res_8_SV_VL2_u_a = 2; > + uint8_t res_8_SV_VL3_u_a = 3; > + uint8_t res_8_SV_VL4_u_a = 4; > + uint8_t res_8_SV_VL5_u_a = 5; > + uint8_t res_8_SV_VL6_u_a = 6; > + uint8_t res_8_SV_VL7_u_a = 7; > + uint8_t res_8_SV_VL8_u_a = 8; > + uint8_t res_8_SV_VL16_u_a = 16; > + int16_t res_16_SV_VL1__a = 1; > + int16_t res_16_SV_VL2__a = 2; > + int16_t res_16_SV_VL3__a = 3; > + int16_t res_16_SV_VL4__a = 4; > + int16_t res_16_SV_VL5__a = 5; > + int16_t res_16_SV_VL6__a = 6; > + int16_t res_16_SV_VL7__a = 7; > + int16_t res_16_SV_VL8__a = 8; > + int16_t res_16_SV_VL16__a = 0; > + uint16_t res_16_SV_VL1_u_a = 1; > + uint16_t res_16_SV_VL2_u_a = 2; > + uint16_t res_16_SV_VL3_u_a = 3; > + uint16_t res_16_SV_VL4_u_a = 4; > + uint16_t res_16_SV_VL5_u_a = 5; > + uint16_t res_16_SV_VL6_u_a = 6; > + uint16_t res_16_SV_VL7_u_a = 7; > + uint16_t res_16_SV_VL8_u_a = 8; > + uint16_t res_16_SV_VL16_u_a = 0; > + int32_t res_32_SV_VL1__a = 1; > + int32_t res_32_SV_VL2__a = 2; > + int32_t res_32_SV_VL3__a = 3; > + int32_t res_32_SV_VL4__a = 4; > + int32_t res_32_SV_VL5__a = 5; > + int32_t res_32_SV_VL6__a = 6; > + int32_t res_32_SV_VL7__a = 7; > + int32_t res_32_SV_VL8__a = 0; > + int32_t res_32_SV_VL16__a = 0; > + uint32_t res_32_SV_VL1_u_a = 1; > + uint32_t res_32_SV_VL2_u_a = 2; > + uint32_t res_32_SV_VL3_u_a = 3; > + uint32_t res_32_SV_VL4_u_a = 4; > + uint32_t res_32_SV_VL5_u_a = 5; > + uint32_t res_32_SV_VL6_u_a = 6; > + uint32_t res_32_SV_VL7_u_a = 7; > + uint32_t res_32_SV_VL8_u_a = 0; > + uint32_t res_32_SV_VL16_u_a = 0; > + int64_t res_64_SV_VL1__a = 1; > + int64_t res_64_SV_VL2__a = 2; > + int64_t res_64_SV_VL3__a = 3; > + int64_t res_64_SV_VL4__a = 0; > + int64_t res_64_SV_VL5__a = 0; > + int64_t res_64_SV_VL6__a 
= 0; > + int64_t res_64_SV_VL7__a = 0; > + int64_t res_64_SV_VL8__a = 0; > + int64_t res_64_SV_VL16__a = 0; > + uint64_t res_64_SV_VL1_u_a = 1; > + uint64_t res_64_SV_VL2_u_a = 2; > + uint64_t res_64_SV_VL3_u_a = 3; > + uint64_t res_64_SV_VL4_u_a = 0; > + uint64_t res_64_SV_VL5_u_a = 0; > + uint64_t res_64_SV_VL6_u_a = 0; > + uint64_t res_64_SV_VL7_u_a = 0; > + uint64_t res_64_SV_VL8_u_a = 0; > + uint64_t res_64_SV_VL16_u_a = 0; > + int8_t res_8_SV_VL1__b = 0; > + int8_t res_8_SV_VL2__b = 1; > + int8_t res_8_SV_VL3__b = 2; > + int8_t res_8_SV_VL4__b = 3; > + int8_t res_8_SV_VL5__b = 4; > + int8_t res_8_SV_VL6__b = 5; > + int8_t res_8_SV_VL7__b = 6; > + int8_t res_8_SV_VL8__b = 7; > + int8_t res_8_SV_VL16__b = 15; > + uint8_t res_8_SV_VL1_u_b = 0; > + uint8_t res_8_SV_VL2_u_b = 1; > + uint8_t res_8_SV_VL3_u_b = 2; > + uint8_t res_8_SV_VL4_u_b = 3; > + uint8_t res_8_SV_VL5_u_b = 4; > + uint8_t res_8_SV_VL6_u_b = 5; > + uint8_t res_8_SV_VL7_u_b = 6; > + uint8_t res_8_SV_VL8_u_b = 7; > + uint8_t res_8_SV_VL16_u_b = 15; > + int16_t res_16_SV_VL1__b = 0; > + int16_t res_16_SV_VL2__b = 1; > + int16_t res_16_SV_VL3__b = 2; > + int16_t res_16_SV_VL4__b = 3; > + int16_t res_16_SV_VL5__b = 4; > + int16_t res_16_SV_VL6__b = 5; > + int16_t res_16_SV_VL7__b = 6; > + int16_t res_16_SV_VL8__b = 7; > + int16_t res_16_SV_VL16__b = 15; > + uint16_t res_16_SV_VL1_u_b = 0; > + uint16_t res_16_SV_VL2_u_b = 1; > + uint16_t res_16_SV_VL3_u_b = 2; > + uint16_t res_16_SV_VL4_u_b = 3; > + uint16_t res_16_SV_VL5_u_b = 4; > + uint16_t res_16_SV_VL6_u_b = 5; > + uint16_t res_16_SV_VL7_u_b = 6; > + uint16_t res_16_SV_VL8_u_b = 7; > + uint16_t res_16_SV_VL16_u_b = 15; > + int32_t res_32_SV_VL1__b = 0; > + int32_t res_32_SV_VL2__b = 1; > + int32_t res_32_SV_VL3__b = 2; > + int32_t res_32_SV_VL4__b = 3; > + int32_t res_32_SV_VL5__b = 4; > + int32_t res_32_SV_VL6__b = 5; > + int32_t res_32_SV_VL7__b = 6; > + int32_t res_32_SV_VL8__b = 7; > + int32_t res_32_SV_VL16__b = 7; > + uint32_t 
res_32_SV_VL1_u_b = 0; > + uint32_t res_32_SV_VL2_u_b = 1; > + uint32_t res_32_SV_VL3_u_b = 2; > + uint32_t res_32_SV_VL4_u_b = 3; > + uint32_t res_32_SV_VL5_u_b = 4; > + uint32_t res_32_SV_VL6_u_b = 5; > + uint32_t res_32_SV_VL7_u_b = 6; > + uint32_t res_32_SV_VL8_u_b = 7; > + uint32_t res_32_SV_VL16_u_b = 7; > + int64_t res_64_SV_VL1__b = 0; > + int64_t res_64_SV_VL2__b = 1; > + int64_t res_64_SV_VL3__b = 2; > + int64_t res_64_SV_VL4__b = 3; > + int64_t res_64_SV_VL5__b = 3; > + int64_t res_64_SV_VL6__b = 3; > + int64_t res_64_SV_VL7__b = 3; > + int64_t res_64_SV_VL8__b = 3; > + int64_t res_64_SV_VL16__b = 3; > + uint64_t res_64_SV_VL1_u_b = 0; > + uint64_t res_64_SV_VL2_u_b = 1; > + uint64_t res_64_SV_VL3_u_b = 2; > + uint64_t res_64_SV_VL4_u_b = 3; > + uint64_t res_64_SV_VL5_u_b = 3; > + uint64_t res_64_SV_VL6_u_b = 3; > + uint64_t res_64_SV_VL7_u_b = 3; > + uint64_t res_64_SV_VL8_u_b = 3; > + uint64_t res_64_SV_VL16_u_b = 3; > + > + int8_t res_8_SV_VL1__a_false = 0; > + int8_t res_8_SV_VL2__a_false = 0; > + int8_t res_8_SV_VL3__a_false = 0; > + int8_t res_8_SV_VL4__a_false = 0; > + int8_t res_8_SV_VL5__a_false = 0; > + int8_t res_8_SV_VL6__a_false = 0; > + int8_t res_8_SV_VL7__a_false = 0; > + int8_t res_8_SV_VL8__a_false = 0; > + int8_t res_8_SV_VL16__a_false = 0; > + uint8_t res_8_SV_VL1_u_a_false = 0; > + uint8_t res_8_SV_VL2_u_a_false = 0; > + uint8_t res_8_SV_VL3_u_a_false = 0; > + uint8_t res_8_SV_VL4_u_a_false = 0; > + uint8_t res_8_SV_VL5_u_a_false = 0; > + uint8_t res_8_SV_VL6_u_a_false = 0; > + uint8_t res_8_SV_VL7_u_a_false = 0; > + uint8_t res_8_SV_VL8_u_a_false = 0; > + uint8_t res_8_SV_VL16_u_a_false = 0; > + int16_t res_16_SV_VL1__a_false = 0; > + int16_t res_16_SV_VL2__a_false = 0; > + int16_t res_16_SV_VL3__a_false = 0; > + int16_t res_16_SV_VL4__a_false = 0; > + int16_t res_16_SV_VL5__a_false = 0; > + int16_t res_16_SV_VL6__a_false = 0; > + int16_t res_16_SV_VL7__a_false = 0; > + int16_t res_16_SV_VL8__a_false = 0; > + int16_t 
res_16_SV_VL16__a_false = 0; > + uint16_t res_16_SV_VL1_u_a_false = 0; > + uint16_t res_16_SV_VL2_u_a_false = 0; > + uint16_t res_16_SV_VL3_u_a_false = 0; > + uint16_t res_16_SV_VL4_u_a_false = 0; > + uint16_t res_16_SV_VL5_u_a_false = 0; > + uint16_t res_16_SV_VL6_u_a_false = 0; > + uint16_t res_16_SV_VL7_u_a_false = 0; > + uint16_t res_16_SV_VL8_u_a_false = 0; > + uint16_t res_16_SV_VL16_u_a_false = 0; > + int32_t res_32_SV_VL1__a_false = 0; > + int32_t res_32_SV_VL2__a_false = 0; > + int32_t res_32_SV_VL3__a_false = 0; > + int32_t res_32_SV_VL4__a_false = 0; > + int32_t res_32_SV_VL5__a_false = 0; > + int32_t res_32_SV_VL6__a_false = 0; > + int32_t res_32_SV_VL7__a_false = 0; > + int32_t res_32_SV_VL8__a_false = 0; > + int32_t res_32_SV_VL16__a_false = 0; > + uint32_t res_32_SV_VL1_u_a_false = 0; > + uint32_t res_32_SV_VL2_u_a_false = 0; > + uint32_t res_32_SV_VL3_u_a_false = 0; > + uint32_t res_32_SV_VL4_u_a_false = 0; > + uint32_t res_32_SV_VL5_u_a_false = 0; > + uint32_t res_32_SV_VL6_u_a_false = 0; > + uint32_t res_32_SV_VL7_u_a_false = 0; > + uint32_t res_32_SV_VL8_u_a_false = 0; > + uint32_t res_32_SV_VL16_u_a_false = 0; > + int64_t res_64_SV_VL1__a_false = 0; > + int64_t res_64_SV_VL2__a_false = 0; > + int64_t res_64_SV_VL3__a_false = 0; > + int64_t res_64_SV_VL4__a_false = 0; > + int64_t res_64_SV_VL5__a_false = 0; > + int64_t res_64_SV_VL6__a_false = 0; > + int64_t res_64_SV_VL7__a_false = 0; > + int64_t res_64_SV_VL8__a_false = 0; > + int64_t res_64_SV_VL16__a_false = 0; > + uint64_t res_64_SV_VL1_u_a_false = 0; > + uint64_t res_64_SV_VL2_u_a_false = 0; > + uint64_t res_64_SV_VL3_u_a_false = 0; > + uint64_t res_64_SV_VL4_u_a_false = 0; > + uint64_t res_64_SV_VL5_u_a_false = 0; > + uint64_t res_64_SV_VL6_u_a_false = 0; > + uint64_t res_64_SV_VL7_u_a_false = 0; > + uint64_t res_64_SV_VL8_u_a_false = 0; > + uint64_t res_64_SV_VL16_u_a_false = 0; > + int8_t res_8_SV_VL1__b_false = 31; > + int8_t res_8_SV_VL2__b_false = 31; > + int8_t res_8_SV_VL3__b_false 
= 31; > + int8_t res_8_SV_VL4__b_false = 31; > + int8_t res_8_SV_VL5__b_false = 31; > + int8_t res_8_SV_VL6__b_false = 31; > + int8_t res_8_SV_VL7__b_false = 31; > + int8_t res_8_SV_VL8__b_false = 31; > + int8_t res_8_SV_VL16__b_false = 31; > + uint8_t res_8_SV_VL1_u_b_false = 31; > + uint8_t res_8_SV_VL2_u_b_false = 31; > + uint8_t res_8_SV_VL3_u_b_false = 31; > + uint8_t res_8_SV_VL4_u_b_false = 31; > + uint8_t res_8_SV_VL5_u_b_false = 31; > + uint8_t res_8_SV_VL6_u_b_false = 31; > + uint8_t res_8_SV_VL7_u_b_false = 31; > + uint8_t res_8_SV_VL8_u_b_false = 31; > + uint8_t res_8_SV_VL16_u_b_false = 31; > + int16_t res_16_SV_VL1__b_false = 15; > + int16_t res_16_SV_VL2__b_false = 15; > + int16_t res_16_SV_VL3__b_false = 15; > + int16_t res_16_SV_VL4__b_false = 15; > + int16_t res_16_SV_VL5__b_false = 15; > + int16_t res_16_SV_VL6__b_false = 15; > + int16_t res_16_SV_VL7__b_false = 15; > + int16_t res_16_SV_VL8__b_false = 15; > + int16_t res_16_SV_VL16__b_false = 15; > + uint16_t res_16_SV_VL1_u_b_false = 15; > + uint16_t res_16_SV_VL2_u_b_false = 15; > + uint16_t res_16_SV_VL3_u_b_false = 15; > + uint16_t res_16_SV_VL4_u_b_false = 15; > + uint16_t res_16_SV_VL5_u_b_false = 15; > + uint16_t res_16_SV_VL6_u_b_false = 15; > + uint16_t res_16_SV_VL7_u_b_false = 15; > + uint16_t res_16_SV_VL8_u_b_false = 15; > + uint16_t res_16_SV_VL16_u_b_false = 15; > + int32_t res_32_SV_VL1__b_false = 7; > + int32_t res_32_SV_VL2__b_false = 7; > + int32_t res_32_SV_VL3__b_false = 7; > + int32_t res_32_SV_VL4__b_false = 7; > + int32_t res_32_SV_VL5__b_false = 7; > + int32_t res_32_SV_VL6__b_false = 7; > + int32_t res_32_SV_VL7__b_false = 7; > + int32_t res_32_SV_VL8__b_false = 7; > + int32_t res_32_SV_VL16__b_false = 7; > + uint32_t res_32_SV_VL1_u_b_false = 7; > + uint32_t res_32_SV_VL2_u_b_false = 7; > + uint32_t res_32_SV_VL3_u_b_false = 7; > + uint32_t res_32_SV_VL4_u_b_false = 7; > + uint32_t res_32_SV_VL5_u_b_false = 7; > + uint32_t res_32_SV_VL6_u_b_false = 7; > + uint32_t 
res_32_SV_VL7_u_b_false = 7; > + uint32_t res_32_SV_VL8_u_b_false = 7; > + uint32_t res_32_SV_VL16_u_b_false = 7; > + int64_t res_64_SV_VL1__b_false = 3; > + int64_t res_64_SV_VL2__b_false = 3; > + int64_t res_64_SV_VL3__b_false = 3; > + int64_t res_64_SV_VL4__b_false = 3; > + int64_t res_64_SV_VL5__b_false = 3; > + int64_t res_64_SV_VL6__b_false = 3; > + int64_t res_64_SV_VL7__b_false = 3; > + int64_t res_64_SV_VL8__b_false = 3; > + int64_t res_64_SV_VL16__b_false = 3; > + uint64_t res_64_SV_VL1_u_b_false = 3; > + uint64_t res_64_SV_VL2_u_b_false = 3; > + uint64_t res_64_SV_VL3_u_b_false = 3; > + uint64_t res_64_SV_VL4_u_b_false = 3; > + uint64_t res_64_SV_VL5_u_b_false = 3; > + uint64_t res_64_SV_VL6_u_b_false = 3; > + uint64_t res_64_SV_VL7_u_b_false = 3; > + uint64_t res_64_SV_VL8_u_b_false = 3; > + uint64_t res_64_SV_VL16_u_b_false = 3; > + > + > +#undef SVELAST_DEF > +#define SVELAST_DEF(size, pat, sign, ab, su) \ > + if (NAME (foo, size, pat, sign, ab) \ > + (svindex_ ## su ## size (0 ,1)) != \ > + NAME (res, size, pat, sign, ab)) \ > + __builtin_abort (); \ > + if (NAMEF (foo, size, pat, sign, ab) \ > + (svindex_ ## su ## size (0 ,1)) != \ > + NAMEF (res, size, pat, sign, ab)) \ > + __builtin_abort (); > + > + ALL_POS () > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c > index 1e38371842f..91fdd3c202e 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... 
> ** bl callee_bf16 > -** ptrue (p[0-7])\.b, all > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c > index 491c35af221..7d824caae1b 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl128 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c > index eebb913273a..e0aa3a5fa68 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl16 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c > index 73c3b2ec045..3238015d9eb 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl256 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c > index 29744c81402..50861098934 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... 
> ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl32 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c > index cf25c31bcbf..300dacce955 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl64 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c > index 9ad3e227654..0a840a38384 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, all > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c > index d573e5fc69c..18cefbff1e6 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl128 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c > index 200b0eb8242..c622ed55674 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... 
> ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl16 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c > index f6f8858fd47..3286280687d 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl256 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c > index e62f59cc885..3c6afa2fdf1 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl32 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c > index 483558cb576..bb7d3ebf9d4 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl64 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-05-11 19:32 ` Richard Sandiford @ 2023-05-16 8:03 ` Tejas Belagod 2023-05-16 8:45 ` Richard Sandiford 0 siblings, 1 reply; 10+ messages in thread From: Tejas Belagod @ 2023-05-16 8:03 UTC (permalink / raw) To: Richard Sandiford; +Cc: gcc-patches [-- Attachment #1: Type: text/plain, Size: 44799 bytes --] Thanks for your comments, Richard. From: Richard Sandiford <richard.sandiford@arm.com> Date: Friday, May 12, 2023 at 1:02 AM To: Tejas Belagod <Tejas.Belagod@arm.com> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>, Tejas Belagod <Tejas.Belagod@arm.com> Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] Tejas Belagod <tejas.belagod@arm.com> writes: > From: Tejas Belagod <tbelagod@arm.com> > > This PR optimizes an SVE intrinsics sequence where > svlasta (svptrue_pat_b8 (SV_VL1), x) > a scalar is selected based on a constant predicate and a variable vector. > This sequence is optimized to return the corresponding element of a NEON > vector. For example, > svlasta (svptrue_pat_b8 (SV_VL1), x) > returns > umov w0, v0.b[1] > Likewise, > svlastb (svptrue_pat_b8 (SV_VL1), x) > returns > umov w0, v0.b[0] > This optimization only works provided the constant predicate maps to a range > that is within the bounds of a 128-bit NEON register. > > gcc/ChangeLog: > > PR target/96339 > * config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold SVE > calls that have a constant input predicate vector. > (svlast_impl::is_lasta): Query to check if intrinsic is svlasta. > (svlast_impl::is_lastb): Query to check if intrinsic is svlastb. > (svlast_impl::vect_all_same): Check if all vector elements are equal. > > gcc/testsuite/ChangeLog: > > PR target/96339 > * gcc.target/aarch64/sve/acle/general-c/svlast.c: New. > * gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New. > * gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
> * gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm > to expect optimized code for function body. > * gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise. > * gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise. > --- > .../aarch64/aarch64-sve-builtins-base.cc | 124 +++++++ > .../aarch64/sve/acle/general-c/svlast.c | 63 ++++ > .../sve/acle/general-c/svlast128_run.c | 313 +++++++++++++++++ > .../sve/acle/general-c/svlast256_run.c | 314 ++++++++++++++++++ > .../gcc.target/aarch64/sve/pcs/return_4.c | 2 - > .../aarch64/sve/pcs/return_4_1024.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_4_128.c | 2 - > .../aarch64/sve/pcs/return_4_2048.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_4_256.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_4_512.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_5.c | 2 - > .../aarch64/sve/pcs/return_5_1024.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_5_128.c | 2 - > .../aarch64/sve/pcs/return_5_2048.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_5_256.c | 2 - > .../gcc.target/aarch64/sve/pcs/return_5_512.c | 2 - > 16 files changed, 814 insertions(+), 24 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c > create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c > > diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc > index cd9cace3c9b..db2b4dcaac9 100644 > --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc > +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc > @@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base> > public: > CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {} > > + bool is_lasta () const { return m_unspec == UNSPEC_LASTA; } > + bool is_lastb () const { return m_unspec == UNSPEC_LASTB; } > + > + bool vect_all_same (tree v , int step) const Nit: stray space after "v". > + { > + int i; > + int nelts = vector_cst_encoded_nelts (v); > + int first_el = 0; > + > + for (i = first_el; i < nelts; i += step) > + if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el)) I think this should use !operand_equal_p (..., ..., 0). Oops! I wonder why I thought VECTOR_CST_ENCODED_ELT returned a constant! Thanks for spotting that. Also, should the flags here be OEP_ONLY_CONST ? > + return false; > + > + return true; > + } > + > + /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF. > + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure > + the index of the element being accessed is in the range of a NEON vector > + width. */ s/NEON/Advanced SIMD/. 
Same in later comments > + gimple *fold (gimple_folder & f) const override > + { > + tree pred = gimple_call_arg (f.call, 0); > + tree val = gimple_call_arg (f.call, 1); > + > + if (TREE_CODE (pred) == VECTOR_CST) > + { > + HOST_WIDE_INT pos; > + unsigned int const_vg; > + int i = 0; > + int step = f.type_suffix (0).element_bytes; > + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred)); > + int npats = VECTOR_CST_NPATTERNS (pred); > + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred); > + tree b = NULL_TREE; > + bool const_vl = aarch64_sve_vg.is_constant (&const_vg); I think this might be left over from previous versions, but: const_vg isn't used and const_vl is only used once, so I think it would be better to remove them. > + > + /* We can optimize 2 cases common to variable and fixed-length cases > + without a linear search of the predicate vector: > + 1. LASTA if predicate is all true, return element 0. > + 2. LASTA if predicate all false, return element 0. */ > + if (is_lasta () && vect_all_same (pred, step_1)) > + { > + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, > + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0)); > + return gimple_build_assign (f.lhs, b); > + } > + > + /* Handle the all-false case for LASTB where SVE VL == 128b - > + return the highest numbered element. */ > + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16) > + && vect_all_same (pred, step_1) > + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0))) Formatting nit: one condition per line once one line isn't enough. > + { > + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, > + bitsize_int (step * BITS_PER_UNIT), > + bitsize_int ((16 - step) * BITS_PER_UNIT)); > + > + return gimple_build_assign (f.lhs, b); > + } > + > + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of > + 'step_1' in > + [VECTOR_CST_NPATTERNS .. 
VECTOR_CST_ENCODED_NELTS - 1] > + is zero, then we can treat the vector as VECTOR_CST_NPATTERNS > + elements followed by all inactive elements. */ > + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2) Following on from the above, maybe use: !VECTOR_CST_NELTS (pred).is_constant () instead of !const_vl here. I have a horrible suspicion that I'm contradicting our earlier discussion here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2. IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the encoded constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the base1 repeats in the encoding. This loop is checking this condition and looks for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 in a VL vector. Please correct me if I’m misunderstanding here. OK with those changes, thanks. I'm going to have to take your word for the tests being right :) I’ve manually inspected the assembler tests and compared runtime tests with the optimization switched off which looked Ok. Thanks, Tejas. Richard > + for (i = npats; i < nelts; i += step_1) > + { > + /* If there are active elements in the repeated pattern of > + a variable-length vector, then return NULL as there is no way > + to be sure statically if this falls within the NEON range. */ > + if (!integer_zerop (VECTOR_CST_ENCODED_ELT (pred, i))) > + return NULL; > + } > + > + /* If we're here, it means either: > + 1. The vector is variable-length and there's no active element in the > + repeated part of the pattern, or > + 2. The vector is fixed-length. > + Fall-through to a linear search. */ > + > + /* Restrict the scope of search to NPATS if vector is > + variable-length. */ > + if (!VECTOR_CST_NELTS (pred).is_constant (&nelts)) > + nelts = npats; > + > + /* Fall through to finding the last active element linearly for > + for all cases where the last active element is known to be > + within a statically-determinable range. 
*/ > + i = MAX ((int)nelts - step, 0); > + for (; i >= 0; i -= step) > + if (!integer_zerop (VECTOR_CST_ELT (pred, i))) > + break; > + > + if (is_lastb ()) > + { > + /* For LASTB, the element is the last active element. */ > + pos = i; > + } > + else > + { > + /* For LASTA, the element is one after last active element. */ > + pos = i + step; > + > + /* If last active element is > + last element, wrap-around and return first NEON element. */ > + if (known_ge (pos, BYTES_PER_SVE_VECTOR)) > + pos = 0; > + } > + > + /* Out of NEON range. */ > + if (pos < 0 || pos > 15) > + return NULL; > + > + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, > + bitsize_int (step * BITS_PER_UNIT), > + bitsize_int (pos * BITS_PER_UNIT)); > + > + return gimple_build_assign (f.lhs, b); > + } > + return NULL; > + } > + > rtx > expand (function_expander &e) const override > { > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c > new file mode 100644 > index 00000000000..fdbe5e309af > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c > @@ -0,0 +1,63 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3 -msve-vector-bits=256" } */ > + > +#include <stdint.h> > +#include "arm_sve.h" > + > +#define NAME(name, size, pat, sign, ab) \ > + name ## _ ## size ## _ ## pat ## _ ## sign ## _ ## ab > + > +#define NAMEF(name, size, pat, sign, ab) \ > + name ## _ ## size ## _ ## pat ## _ ## sign ## _ ## ab ## _false > + > +#define SVTYPE(size, sign) \ > + sv ## sign ## int ## size ## _t > + > +#define STYPE(size, sign) sign ## int ## size ##_t > + > +#define SVELAST_DEF(size, pat, sign, ab, su) \ > + STYPE (size, sign) __attribute__((noinline)) \ > + NAME (foo, size, pat, sign, ab) (SVTYPE (size, sign) x) \ > + { \ > + return svlast ## ab (svptrue_pat_b ## size (pat), x); \ > + } \ > + STYPE (size, sign) __attribute__((noinline)) \ > + NAMEF (foo, size, pat, sign, ab) (SVTYPE (size, 
sign) x) \ > + { \ > + return svlast ## ab (svpfalse (), x); \ > + } > + > +#define ALL_PATS(SIZE, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL1, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL2, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL3, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL4, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL5, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL6, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL7, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL8, SIGN, AB, SU) \ > + SVELAST_DEF (SIZE, SV_VL16, SIGN, AB, SU) > + > +#define ALL_SIGN(SIZE, AB) \ > + ALL_PATS (SIZE, , AB, s) \ > + ALL_PATS (SIZE, u, AB, u) > + > +#define ALL_SIZE(AB) \ > + ALL_SIGN (8, AB) \ > + ALL_SIGN (16, AB) \ > + ALL_SIGN (32, AB) \ > + ALL_SIGN (64, AB) > + > +#define ALL_POS() \ > + ALL_SIZE (a) \ > + ALL_SIZE (b) > + > + > +ALL_POS() > + > +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b} 52 } } */ > +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h} 50 } } */ > +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s} 12 } } */ > +/* { dg-final { scan-assembler-times {\tfmov\tw0, s0} 24 } } */ > +/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d} 4 } } */ > +/* { dg-final { scan-assembler-times {\tfmov\tx0, d0} 32 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c > new file mode 100644 > index 00000000000..5e1e9303d7b > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c > @@ -0,0 +1,313 @@ > +/* { dg-do run { target aarch64_sve128_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=128 -std=gnu99" } */ > + > +#include "svlast.c" > + > +int > +main (void) > +{ > + int8_t res_8_SV_VL1__a = 1; > + int8_t res_8_SV_VL2__a = 2; > + int8_t res_8_SV_VL3__a = 3; > + int8_t res_8_SV_VL4__a = 4; > + int8_t res_8_SV_VL5__a = 5; > + int8_t res_8_SV_VL6__a = 6; > + int8_t 
res_8_SV_VL7__a = 7; > + int8_t res_8_SV_VL8__a = 8; > + int8_t res_8_SV_VL16__a = 0; > + uint8_t res_8_SV_VL1_u_a = 1; > + uint8_t res_8_SV_VL2_u_a = 2; > + uint8_t res_8_SV_VL3_u_a = 3; > + uint8_t res_8_SV_VL4_u_a = 4; > + uint8_t res_8_SV_VL5_u_a = 5; > + uint8_t res_8_SV_VL6_u_a = 6; > + uint8_t res_8_SV_VL7_u_a = 7; > + uint8_t res_8_SV_VL8_u_a = 8; > + uint8_t res_8_SV_VL16_u_a = 0; > + int16_t res_16_SV_VL1__a = 1; > + int16_t res_16_SV_VL2__a = 2; > + int16_t res_16_SV_VL3__a = 3; > + int16_t res_16_SV_VL4__a = 4; > + int16_t res_16_SV_VL5__a = 5; > + int16_t res_16_SV_VL6__a = 6; > + int16_t res_16_SV_VL7__a = 7; > + int16_t res_16_SV_VL8__a = 0; > + int16_t res_16_SV_VL16__a = 0; > + uint16_t res_16_SV_VL1_u_a = 1; > + uint16_t res_16_SV_VL2_u_a = 2; > + uint16_t res_16_SV_VL3_u_a = 3; > + uint16_t res_16_SV_VL4_u_a = 4; > + uint16_t res_16_SV_VL5_u_a = 5; > + uint16_t res_16_SV_VL6_u_a = 6; > + uint16_t res_16_SV_VL7_u_a = 7; > + uint16_t res_16_SV_VL8_u_a = 0; > + uint16_t res_16_SV_VL16_u_a = 0; > + int32_t res_32_SV_VL1__a = 1; > + int32_t res_32_SV_VL2__a = 2; > + int32_t res_32_SV_VL3__a = 3; > + int32_t res_32_SV_VL4__a = 0; > + int32_t res_32_SV_VL5__a = 0; > + int32_t res_32_SV_VL6__a = 0; > + int32_t res_32_SV_VL7__a = 0; > + int32_t res_32_SV_VL8__a = 0; > + int32_t res_32_SV_VL16__a = 0; > + uint32_t res_32_SV_VL1_u_a = 1; > + uint32_t res_32_SV_VL2_u_a = 2; > + uint32_t res_32_SV_VL3_u_a = 3; > + uint32_t res_32_SV_VL4_u_a = 0; > + uint32_t res_32_SV_VL5_u_a = 0; > + uint32_t res_32_SV_VL6_u_a = 0; > + uint32_t res_32_SV_VL7_u_a = 0; > + uint32_t res_32_SV_VL8_u_a = 0; > + uint32_t res_32_SV_VL16_u_a = 0; > + int64_t res_64_SV_VL1__a = 1; > + int64_t res_64_SV_VL2__a = 0; > + int64_t res_64_SV_VL3__a = 0; > + int64_t res_64_SV_VL4__a = 0; > + int64_t res_64_SV_VL5__a = 0; > + int64_t res_64_SV_VL6__a = 0; > + int64_t res_64_SV_VL7__a = 0; > + int64_t res_64_SV_VL8__a = 0; > + int64_t res_64_SV_VL16__a = 0; > + uint64_t res_64_SV_VL1_u_a = 1; 
> + uint64_t res_64_SV_VL2_u_a = 0; > + uint64_t res_64_SV_VL3_u_a = 0; > + uint64_t res_64_SV_VL4_u_a = 0; > + uint64_t res_64_SV_VL5_u_a = 0; > + uint64_t res_64_SV_VL6_u_a = 0; > + uint64_t res_64_SV_VL7_u_a = 0; > + uint64_t res_64_SV_VL8_u_a = 0; > + uint64_t res_64_SV_VL16_u_a = 0; > + int8_t res_8_SV_VL1__b = 0; > + int8_t res_8_SV_VL2__b = 1; > + int8_t res_8_SV_VL3__b = 2; > + int8_t res_8_SV_VL4__b = 3; > + int8_t res_8_SV_VL5__b = 4; > + int8_t res_8_SV_VL6__b = 5; > + int8_t res_8_SV_VL7__b = 6; > + int8_t res_8_SV_VL8__b = 7; > + int8_t res_8_SV_VL16__b = 15; > + uint8_t res_8_SV_VL1_u_b = 0; > + uint8_t res_8_SV_VL2_u_b = 1; > + uint8_t res_8_SV_VL3_u_b = 2; > + uint8_t res_8_SV_VL4_u_b = 3; > + uint8_t res_8_SV_VL5_u_b = 4; > + uint8_t res_8_SV_VL6_u_b = 5; > + uint8_t res_8_SV_VL7_u_b = 6; > + uint8_t res_8_SV_VL8_u_b = 7; > + uint8_t res_8_SV_VL16_u_b = 15; > + int16_t res_16_SV_VL1__b = 0; > + int16_t res_16_SV_VL2__b = 1; > + int16_t res_16_SV_VL3__b = 2; > + int16_t res_16_SV_VL4__b = 3; > + int16_t res_16_SV_VL5__b = 4; > + int16_t res_16_SV_VL6__b = 5; > + int16_t res_16_SV_VL7__b = 6; > + int16_t res_16_SV_VL8__b = 7; > + int16_t res_16_SV_VL16__b = 7; > + uint16_t res_16_SV_VL1_u_b = 0; > + uint16_t res_16_SV_VL2_u_b = 1; > + uint16_t res_16_SV_VL3_u_b = 2; > + uint16_t res_16_SV_VL4_u_b = 3; > + uint16_t res_16_SV_VL5_u_b = 4; > + uint16_t res_16_SV_VL6_u_b = 5; > + uint16_t res_16_SV_VL7_u_b = 6; > + uint16_t res_16_SV_VL8_u_b = 7; > + uint16_t res_16_SV_VL16_u_b = 7; > + int32_t res_32_SV_VL1__b = 0; > + int32_t res_32_SV_VL2__b = 1; > + int32_t res_32_SV_VL3__b = 2; > + int32_t res_32_SV_VL4__b = 3; > + int32_t res_32_SV_VL5__b = 3; > + int32_t res_32_SV_VL6__b = 3; > + int32_t res_32_SV_VL7__b = 3; > + int32_t res_32_SV_VL8__b = 3; > + int32_t res_32_SV_VL16__b = 3; > + uint32_t res_32_SV_VL1_u_b = 0; > + uint32_t res_32_SV_VL2_u_b = 1; > + uint32_t res_32_SV_VL3_u_b = 2; > + uint32_t res_32_SV_VL4_u_b = 3; > + uint32_t 
res_32_SV_VL5_u_b = 3; > + uint32_t res_32_SV_VL6_u_b = 3; > + uint32_t res_32_SV_VL7_u_b = 3; > + uint32_t res_32_SV_VL8_u_b = 3; > + uint32_t res_32_SV_VL16_u_b = 3; > + int64_t res_64_SV_VL1__b = 0; > + int64_t res_64_SV_VL2__b = 1; > + int64_t res_64_SV_VL3__b = 1; > + int64_t res_64_SV_VL4__b = 1; > + int64_t res_64_SV_VL5__b = 1; > + int64_t res_64_SV_VL6__b = 1; > + int64_t res_64_SV_VL7__b = 1; > + int64_t res_64_SV_VL8__b = 1; > + int64_t res_64_SV_VL16__b = 1; > + uint64_t res_64_SV_VL1_u_b = 0; > + uint64_t res_64_SV_VL2_u_b = 1; > + uint64_t res_64_SV_VL3_u_b = 1; > + uint64_t res_64_SV_VL4_u_b = 1; > + uint64_t res_64_SV_VL5_u_b = 1; > + uint64_t res_64_SV_VL6_u_b = 1; > + uint64_t res_64_SV_VL7_u_b = 1; > + uint64_t res_64_SV_VL8_u_b = 1; > + uint64_t res_64_SV_VL16_u_b = 1; > + > + int8_t res_8_SV_VL1__a_false = 0; > + int8_t res_8_SV_VL2__a_false = 0; > + int8_t res_8_SV_VL3__a_false = 0; > + int8_t res_8_SV_VL4__a_false = 0; > + int8_t res_8_SV_VL5__a_false = 0; > + int8_t res_8_SV_VL6__a_false = 0; > + int8_t res_8_SV_VL7__a_false = 0; > + int8_t res_8_SV_VL8__a_false = 0; > + int8_t res_8_SV_VL16__a_false = 0; > + uint8_t res_8_SV_VL1_u_a_false = 0; > + uint8_t res_8_SV_VL2_u_a_false = 0; > + uint8_t res_8_SV_VL3_u_a_false = 0; > + uint8_t res_8_SV_VL4_u_a_false = 0; > + uint8_t res_8_SV_VL5_u_a_false = 0; > + uint8_t res_8_SV_VL6_u_a_false = 0; > + uint8_t res_8_SV_VL7_u_a_false = 0; > + uint8_t res_8_SV_VL8_u_a_false = 0; > + uint8_t res_8_SV_VL16_u_a_false = 0; > + int16_t res_16_SV_VL1__a_false = 0; > + int16_t res_16_SV_VL2__a_false = 0; > + int16_t res_16_SV_VL3__a_false = 0; > + int16_t res_16_SV_VL4__a_false = 0; > + int16_t res_16_SV_VL5__a_false = 0; > + int16_t res_16_SV_VL6__a_false = 0; > + int16_t res_16_SV_VL7__a_false = 0; > + int16_t res_16_SV_VL8__a_false = 0; > + int16_t res_16_SV_VL16__a_false = 0; > + uint16_t res_16_SV_VL1_u_a_false = 0; > + uint16_t res_16_SV_VL2_u_a_false = 0; > + uint16_t res_16_SV_VL3_u_a_false = 0; > + 
uint16_t res_16_SV_VL4_u_a_false = 0; > + uint16_t res_16_SV_VL5_u_a_false = 0; > + uint16_t res_16_SV_VL6_u_a_false = 0; > + uint16_t res_16_SV_VL7_u_a_false = 0; > + uint16_t res_16_SV_VL8_u_a_false = 0; > + uint16_t res_16_SV_VL16_u_a_false = 0; > + int32_t res_32_SV_VL1__a_false = 0; > + int32_t res_32_SV_VL2__a_false = 0; > + int32_t res_32_SV_VL3__a_false = 0; > + int32_t res_32_SV_VL4__a_false = 0; > + int32_t res_32_SV_VL5__a_false = 0; > + int32_t res_32_SV_VL6__a_false = 0; > + int32_t res_32_SV_VL7__a_false = 0; > + int32_t res_32_SV_VL8__a_false = 0; > + int32_t res_32_SV_VL16__a_false = 0; > + uint32_t res_32_SV_VL1_u_a_false = 0; > + uint32_t res_32_SV_VL2_u_a_false = 0; > + uint32_t res_32_SV_VL3_u_a_false = 0; > + uint32_t res_32_SV_VL4_u_a_false = 0; > + uint32_t res_32_SV_VL5_u_a_false = 0; > + uint32_t res_32_SV_VL6_u_a_false = 0; > + uint32_t res_32_SV_VL7_u_a_false = 0; > + uint32_t res_32_SV_VL8_u_a_false = 0; > + uint32_t res_32_SV_VL16_u_a_false = 0; > + int64_t res_64_SV_VL1__a_false = 0; > + int64_t res_64_SV_VL2__a_false = 0; > + int64_t res_64_SV_VL3__a_false = 0; > + int64_t res_64_SV_VL4__a_false = 0; > + int64_t res_64_SV_VL5__a_false = 0; > + int64_t res_64_SV_VL6__a_false = 0; > + int64_t res_64_SV_VL7__a_false = 0; > + int64_t res_64_SV_VL8__a_false = 0; > + int64_t res_64_SV_VL16__a_false = 0; > + uint64_t res_64_SV_VL1_u_a_false = 0; > + uint64_t res_64_SV_VL2_u_a_false = 0; > + uint64_t res_64_SV_VL3_u_a_false = 0; > + uint64_t res_64_SV_VL4_u_a_false = 0; > + uint64_t res_64_SV_VL5_u_a_false = 0; > + uint64_t res_64_SV_VL6_u_a_false = 0; > + uint64_t res_64_SV_VL7_u_a_false = 0; > + uint64_t res_64_SV_VL8_u_a_false = 0; > + uint64_t res_64_SV_VL16_u_a_false = 0; > + int8_t res_8_SV_VL1__b_false = 15; > + int8_t res_8_SV_VL2__b_false = 15; > + int8_t res_8_SV_VL3__b_false = 15; > + int8_t res_8_SV_VL4__b_false = 15; > + int8_t res_8_SV_VL5__b_false = 15; > + int8_t res_8_SV_VL6__b_false = 15; > + int8_t res_8_SV_VL7__b_false = 
15; > + int8_t res_8_SV_VL8__b_false = 15; > + int8_t res_8_SV_VL16__b_false = 15; > + uint8_t res_8_SV_VL1_u_b_false = 15; > + uint8_t res_8_SV_VL2_u_b_false = 15; > + uint8_t res_8_SV_VL3_u_b_false = 15; > + uint8_t res_8_SV_VL4_u_b_false = 15; > + uint8_t res_8_SV_VL5_u_b_false = 15; > + uint8_t res_8_SV_VL6_u_b_false = 15; > + uint8_t res_8_SV_VL7_u_b_false = 15; > + uint8_t res_8_SV_VL8_u_b_false = 15; > + uint8_t res_8_SV_VL16_u_b_false = 15; > + int16_t res_16_SV_VL1__b_false = 7; > + int16_t res_16_SV_VL2__b_false = 7; > + int16_t res_16_SV_VL3__b_false = 7; > + int16_t res_16_SV_VL4__b_false = 7; > + int16_t res_16_SV_VL5__b_false = 7; > + int16_t res_16_SV_VL6__b_false = 7; > + int16_t res_16_SV_VL7__b_false = 7; > + int16_t res_16_SV_VL8__b_false = 7; > + int16_t res_16_SV_VL16__b_false = 7; > + uint16_t res_16_SV_VL1_u_b_false = 7; > + uint16_t res_16_SV_VL2_u_b_false = 7; > + uint16_t res_16_SV_VL3_u_b_false = 7; > + uint16_t res_16_SV_VL4_u_b_false = 7; > + uint16_t res_16_SV_VL5_u_b_false = 7; > + uint16_t res_16_SV_VL6_u_b_false = 7; > + uint16_t res_16_SV_VL7_u_b_false = 7; > + uint16_t res_16_SV_VL8_u_b_false = 7; > + uint16_t res_16_SV_VL16_u_b_false = 7; > + int32_t res_32_SV_VL1__b_false = 3; > + int32_t res_32_SV_VL2__b_false = 3; > + int32_t res_32_SV_VL3__b_false = 3; > + int32_t res_32_SV_VL4__b_false = 3; > + int32_t res_32_SV_VL5__b_false = 3; > + int32_t res_32_SV_VL6__b_false = 3; > + int32_t res_32_SV_VL7__b_false = 3; > + int32_t res_32_SV_VL8__b_false = 3; > + int32_t res_32_SV_VL16__b_false = 3; > + uint32_t res_32_SV_VL1_u_b_false = 3; > + uint32_t res_32_SV_VL2_u_b_false = 3; > + uint32_t res_32_SV_VL3_u_b_false = 3; > + uint32_t res_32_SV_VL4_u_b_false = 3; > + uint32_t res_32_SV_VL5_u_b_false = 3; > + uint32_t res_32_SV_VL6_u_b_false = 3; > + uint32_t res_32_SV_VL7_u_b_false = 3; > + uint32_t res_32_SV_VL8_u_b_false = 3; > + uint32_t res_32_SV_VL16_u_b_false = 3; > + int64_t res_64_SV_VL1__b_false = 1; > + int64_t 
res_64_SV_VL2__b_false = 1; > + int64_t res_64_SV_VL3__b_false = 1; > + int64_t res_64_SV_VL4__b_false = 1; > + int64_t res_64_SV_VL5__b_false = 1; > + int64_t res_64_SV_VL6__b_false = 1; > + int64_t res_64_SV_VL7__b_false = 1; > + int64_t res_64_SV_VL8__b_false = 1; > + int64_t res_64_SV_VL16__b_false = 1; > + uint64_t res_64_SV_VL1_u_b_false = 1; > + uint64_t res_64_SV_VL2_u_b_false = 1; > + uint64_t res_64_SV_VL3_u_b_false = 1; > + uint64_t res_64_SV_VL4_u_b_false = 1; > + uint64_t res_64_SV_VL5_u_b_false = 1; > + uint64_t res_64_SV_VL6_u_b_false = 1; > + uint64_t res_64_SV_VL7_u_b_false = 1; > + uint64_t res_64_SV_VL8_u_b_false = 1; > + uint64_t res_64_SV_VL16_u_b_false = 1; > + > +#undef SVELAST_DEF > +#define SVELAST_DEF(size, pat, sign, ab, su) \ > + if (NAME (foo, size, pat, sign, ab) \ > + (svindex_ ## su ## size (0, 1)) != \ > + NAME (res, size, pat, sign, ab)) \ > + __builtin_abort (); \ > + if (NAMEF (foo, size, pat, sign, ab) \ > + (svindex_ ## su ## size (0, 1)) != \ > + NAMEF (res, size, pat, sign, ab)) \ > + __builtin_abort (); > + > + ALL_POS () > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c > new file mode 100644 > index 00000000000..f6ba7ea7d89 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c > @@ -0,0 +1,314 @@ > +/* { dg-do run { target aarch64_sve256_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=256 -std=gnu99" } */ > + > +#include "svlast.c" > + > +int > +main (void) > +{ > + int8_t res_8_SV_VL1__a = 1; > + int8_t res_8_SV_VL2__a = 2; > + int8_t res_8_SV_VL3__a = 3; > + int8_t res_8_SV_VL4__a = 4; > + int8_t res_8_SV_VL5__a = 5; > + int8_t res_8_SV_VL6__a = 6; > + int8_t res_8_SV_VL7__a = 7; > + int8_t res_8_SV_VL8__a = 8; > + int8_t res_8_SV_VL16__a = 16; > + uint8_t res_8_SV_VL1_u_a = 1; > + uint8_t res_8_SV_VL2_u_a = 2; > + uint8_t res_8_SV_VL3_u_a = 3; > + uint8_t 
res_8_SV_VL4_u_a = 4; > + uint8_t res_8_SV_VL5_u_a = 5; > + uint8_t res_8_SV_VL6_u_a = 6; > + uint8_t res_8_SV_VL7_u_a = 7; > + uint8_t res_8_SV_VL8_u_a = 8; > + uint8_t res_8_SV_VL16_u_a = 16; > + int16_t res_16_SV_VL1__a = 1; > + int16_t res_16_SV_VL2__a = 2; > + int16_t res_16_SV_VL3__a = 3; > + int16_t res_16_SV_VL4__a = 4; > + int16_t res_16_SV_VL5__a = 5; > + int16_t res_16_SV_VL6__a = 6; > + int16_t res_16_SV_VL7__a = 7; > + int16_t res_16_SV_VL8__a = 8; > + int16_t res_16_SV_VL16__a = 0; > + uint16_t res_16_SV_VL1_u_a = 1; > + uint16_t res_16_SV_VL2_u_a = 2; > + uint16_t res_16_SV_VL3_u_a = 3; > + uint16_t res_16_SV_VL4_u_a = 4; > + uint16_t res_16_SV_VL5_u_a = 5; > + uint16_t res_16_SV_VL6_u_a = 6; > + uint16_t res_16_SV_VL7_u_a = 7; > + uint16_t res_16_SV_VL8_u_a = 8; > + uint16_t res_16_SV_VL16_u_a = 0; > + int32_t res_32_SV_VL1__a = 1; > + int32_t res_32_SV_VL2__a = 2; > + int32_t res_32_SV_VL3__a = 3; > + int32_t res_32_SV_VL4__a = 4; > + int32_t res_32_SV_VL5__a = 5; > + int32_t res_32_SV_VL6__a = 6; > + int32_t res_32_SV_VL7__a = 7; > + int32_t res_32_SV_VL8__a = 0; > + int32_t res_32_SV_VL16__a = 0; > + uint32_t res_32_SV_VL1_u_a = 1; > + uint32_t res_32_SV_VL2_u_a = 2; > + uint32_t res_32_SV_VL3_u_a = 3; > + uint32_t res_32_SV_VL4_u_a = 4; > + uint32_t res_32_SV_VL5_u_a = 5; > + uint32_t res_32_SV_VL6_u_a = 6; > + uint32_t res_32_SV_VL7_u_a = 7; > + uint32_t res_32_SV_VL8_u_a = 0; > + uint32_t res_32_SV_VL16_u_a = 0; > + int64_t res_64_SV_VL1__a = 1; > + int64_t res_64_SV_VL2__a = 2; > + int64_t res_64_SV_VL3__a = 3; > + int64_t res_64_SV_VL4__a = 0; > + int64_t res_64_SV_VL5__a = 0; > + int64_t res_64_SV_VL6__a = 0; > + int64_t res_64_SV_VL7__a = 0; > + int64_t res_64_SV_VL8__a = 0; > + int64_t res_64_SV_VL16__a = 0; > + uint64_t res_64_SV_VL1_u_a = 1; > + uint64_t res_64_SV_VL2_u_a = 2; > + uint64_t res_64_SV_VL3_u_a = 3; > + uint64_t res_64_SV_VL4_u_a = 0; > + uint64_t res_64_SV_VL5_u_a = 0; > + uint64_t res_64_SV_VL6_u_a = 0; > + uint64_t 
res_64_SV_VL7_u_a = 0; > + uint64_t res_64_SV_VL8_u_a = 0; > + uint64_t res_64_SV_VL16_u_a = 0; > + int8_t res_8_SV_VL1__b = 0; > + int8_t res_8_SV_VL2__b = 1; > + int8_t res_8_SV_VL3__b = 2; > + int8_t res_8_SV_VL4__b = 3; > + int8_t res_8_SV_VL5__b = 4; > + int8_t res_8_SV_VL6__b = 5; > + int8_t res_8_SV_VL7__b = 6; > + int8_t res_8_SV_VL8__b = 7; > + int8_t res_8_SV_VL16__b = 15; > + uint8_t res_8_SV_VL1_u_b = 0; > + uint8_t res_8_SV_VL2_u_b = 1; > + uint8_t res_8_SV_VL3_u_b = 2; > + uint8_t res_8_SV_VL4_u_b = 3; > + uint8_t res_8_SV_VL5_u_b = 4; > + uint8_t res_8_SV_VL6_u_b = 5; > + uint8_t res_8_SV_VL7_u_b = 6; > + uint8_t res_8_SV_VL8_u_b = 7; > + uint8_t res_8_SV_VL16_u_b = 15; > + int16_t res_16_SV_VL1__b = 0; > + int16_t res_16_SV_VL2__b = 1; > + int16_t res_16_SV_VL3__b = 2; > + int16_t res_16_SV_VL4__b = 3; > + int16_t res_16_SV_VL5__b = 4; > + int16_t res_16_SV_VL6__b = 5; > + int16_t res_16_SV_VL7__b = 6; > + int16_t res_16_SV_VL8__b = 7; > + int16_t res_16_SV_VL16__b = 15; > + uint16_t res_16_SV_VL1_u_b = 0; > + uint16_t res_16_SV_VL2_u_b = 1; > + uint16_t res_16_SV_VL3_u_b = 2; > + uint16_t res_16_SV_VL4_u_b = 3; > + uint16_t res_16_SV_VL5_u_b = 4; > + uint16_t res_16_SV_VL6_u_b = 5; > + uint16_t res_16_SV_VL7_u_b = 6; > + uint16_t res_16_SV_VL8_u_b = 7; > + uint16_t res_16_SV_VL16_u_b = 15; > + int32_t res_32_SV_VL1__b = 0; > + int32_t res_32_SV_VL2__b = 1; > + int32_t res_32_SV_VL3__b = 2; > + int32_t res_32_SV_VL4__b = 3; > + int32_t res_32_SV_VL5__b = 4; > + int32_t res_32_SV_VL6__b = 5; > + int32_t res_32_SV_VL7__b = 6; > + int32_t res_32_SV_VL8__b = 7; > + int32_t res_32_SV_VL16__b = 7; > + uint32_t res_32_SV_VL1_u_b = 0; > + uint32_t res_32_SV_VL2_u_b = 1; > + uint32_t res_32_SV_VL3_u_b = 2; > + uint32_t res_32_SV_VL4_u_b = 3; > + uint32_t res_32_SV_VL5_u_b = 4; > + uint32_t res_32_SV_VL6_u_b = 5; > + uint32_t res_32_SV_VL7_u_b = 6; > + uint32_t res_32_SV_VL8_u_b = 7; > + uint32_t res_32_SV_VL16_u_b = 7; > + int64_t res_64_SV_VL1__b = 0; > + 
int64_t res_64_SV_VL2__b = 1; > + int64_t res_64_SV_VL3__b = 2; > + int64_t res_64_SV_VL4__b = 3; > + int64_t res_64_SV_VL5__b = 3; > + int64_t res_64_SV_VL6__b = 3; > + int64_t res_64_SV_VL7__b = 3; > + int64_t res_64_SV_VL8__b = 3; > + int64_t res_64_SV_VL16__b = 3; > + uint64_t res_64_SV_VL1_u_b = 0; > + uint64_t res_64_SV_VL2_u_b = 1; > + uint64_t res_64_SV_VL3_u_b = 2; > + uint64_t res_64_SV_VL4_u_b = 3; > + uint64_t res_64_SV_VL5_u_b = 3; > + uint64_t res_64_SV_VL6_u_b = 3; > + uint64_t res_64_SV_VL7_u_b = 3; > + uint64_t res_64_SV_VL8_u_b = 3; > + uint64_t res_64_SV_VL16_u_b = 3; > + > + int8_t res_8_SV_VL1__a_false = 0; > + int8_t res_8_SV_VL2__a_false = 0; > + int8_t res_8_SV_VL3__a_false = 0; > + int8_t res_8_SV_VL4__a_false = 0; > + int8_t res_8_SV_VL5__a_false = 0; > + int8_t res_8_SV_VL6__a_false = 0; > + int8_t res_8_SV_VL7__a_false = 0; > + int8_t res_8_SV_VL8__a_false = 0; > + int8_t res_8_SV_VL16__a_false = 0; > + uint8_t res_8_SV_VL1_u_a_false = 0; > + uint8_t res_8_SV_VL2_u_a_false = 0; > + uint8_t res_8_SV_VL3_u_a_false = 0; > + uint8_t res_8_SV_VL4_u_a_false = 0; > + uint8_t res_8_SV_VL5_u_a_false = 0; > + uint8_t res_8_SV_VL6_u_a_false = 0; > + uint8_t res_8_SV_VL7_u_a_false = 0; > + uint8_t res_8_SV_VL8_u_a_false = 0; > + uint8_t res_8_SV_VL16_u_a_false = 0; > + int16_t res_16_SV_VL1__a_false = 0; > + int16_t res_16_SV_VL2__a_false = 0; > + int16_t res_16_SV_VL3__a_false = 0; > + int16_t res_16_SV_VL4__a_false = 0; > + int16_t res_16_SV_VL5__a_false = 0; > + int16_t res_16_SV_VL6__a_false = 0; > + int16_t res_16_SV_VL7__a_false = 0; > + int16_t res_16_SV_VL8__a_false = 0; > + int16_t res_16_SV_VL16__a_false = 0; > + uint16_t res_16_SV_VL1_u_a_false = 0; > + uint16_t res_16_SV_VL2_u_a_false = 0; > + uint16_t res_16_SV_VL3_u_a_false = 0; > + uint16_t res_16_SV_VL4_u_a_false = 0; > + uint16_t res_16_SV_VL5_u_a_false = 0; > + uint16_t res_16_SV_VL6_u_a_false = 0; > + uint16_t res_16_SV_VL7_u_a_false = 0; > + uint16_t res_16_SV_VL8_u_a_false = 0; 
> + uint16_t res_16_SV_VL16_u_a_false = 0; > + int32_t res_32_SV_VL1__a_false = 0; > + int32_t res_32_SV_VL2__a_false = 0; > + int32_t res_32_SV_VL3__a_false = 0; > + int32_t res_32_SV_VL4__a_false = 0; > + int32_t res_32_SV_VL5__a_false = 0; > + int32_t res_32_SV_VL6__a_false = 0; > + int32_t res_32_SV_VL7__a_false = 0; > + int32_t res_32_SV_VL8__a_false = 0; > + int32_t res_32_SV_VL16__a_false = 0; > + uint32_t res_32_SV_VL1_u_a_false = 0; > + uint32_t res_32_SV_VL2_u_a_false = 0; > + uint32_t res_32_SV_VL3_u_a_false = 0; > + uint32_t res_32_SV_VL4_u_a_false = 0; > + uint32_t res_32_SV_VL5_u_a_false = 0; > + uint32_t res_32_SV_VL6_u_a_false = 0; > + uint32_t res_32_SV_VL7_u_a_false = 0; > + uint32_t res_32_SV_VL8_u_a_false = 0; > + uint32_t res_32_SV_VL16_u_a_false = 0; > + int64_t res_64_SV_VL1__a_false = 0; > + int64_t res_64_SV_VL2__a_false = 0; > + int64_t res_64_SV_VL3__a_false = 0; > + int64_t res_64_SV_VL4__a_false = 0; > + int64_t res_64_SV_VL5__a_false = 0; > + int64_t res_64_SV_VL6__a_false = 0; > + int64_t res_64_SV_VL7__a_false = 0; > + int64_t res_64_SV_VL8__a_false = 0; > + int64_t res_64_SV_VL16__a_false = 0; > + uint64_t res_64_SV_VL1_u_a_false = 0; > + uint64_t res_64_SV_VL2_u_a_false = 0; > + uint64_t res_64_SV_VL3_u_a_false = 0; > + uint64_t res_64_SV_VL4_u_a_false = 0; > + uint64_t res_64_SV_VL5_u_a_false = 0; > + uint64_t res_64_SV_VL6_u_a_false = 0; > + uint64_t res_64_SV_VL7_u_a_false = 0; > + uint64_t res_64_SV_VL8_u_a_false = 0; > + uint64_t res_64_SV_VL16_u_a_false = 0; > + int8_t res_8_SV_VL1__b_false = 31; > + int8_t res_8_SV_VL2__b_false = 31; > + int8_t res_8_SV_VL3__b_false = 31; > + int8_t res_8_SV_VL4__b_false = 31; > + int8_t res_8_SV_VL5__b_false = 31; > + int8_t res_8_SV_VL6__b_false = 31; > + int8_t res_8_SV_VL7__b_false = 31; > + int8_t res_8_SV_VL8__b_false = 31; > + int8_t res_8_SV_VL16__b_false = 31; > + uint8_t res_8_SV_VL1_u_b_false = 31; > + uint8_t res_8_SV_VL2_u_b_false = 31; > + uint8_t res_8_SV_VL3_u_b_false = 31; > 
+ uint8_t res_8_SV_VL4_u_b_false = 31; > + uint8_t res_8_SV_VL5_u_b_false = 31; > + uint8_t res_8_SV_VL6_u_b_false = 31; > + uint8_t res_8_SV_VL7_u_b_false = 31; > + uint8_t res_8_SV_VL8_u_b_false = 31; > + uint8_t res_8_SV_VL16_u_b_false = 31; > + int16_t res_16_SV_VL1__b_false = 15; > + int16_t res_16_SV_VL2__b_false = 15; > + int16_t res_16_SV_VL3__b_false = 15; > + int16_t res_16_SV_VL4__b_false = 15; > + int16_t res_16_SV_VL5__b_false = 15; > + int16_t res_16_SV_VL6__b_false = 15; > + int16_t res_16_SV_VL7__b_false = 15; > + int16_t res_16_SV_VL8__b_false = 15; > + int16_t res_16_SV_VL16__b_false = 15; > + uint16_t res_16_SV_VL1_u_b_false = 15; > + uint16_t res_16_SV_VL2_u_b_false = 15; > + uint16_t res_16_SV_VL3_u_b_false = 15; > + uint16_t res_16_SV_VL4_u_b_false = 15; > + uint16_t res_16_SV_VL5_u_b_false = 15; > + uint16_t res_16_SV_VL6_u_b_false = 15; > + uint16_t res_16_SV_VL7_u_b_false = 15; > + uint16_t res_16_SV_VL8_u_b_false = 15; > + uint16_t res_16_SV_VL16_u_b_false = 15; > + int32_t res_32_SV_VL1__b_false = 7; > + int32_t res_32_SV_VL2__b_false = 7; > + int32_t res_32_SV_VL3__b_false = 7; > + int32_t res_32_SV_VL4__b_false = 7; > + int32_t res_32_SV_VL5__b_false = 7; > + int32_t res_32_SV_VL6__b_false = 7; > + int32_t res_32_SV_VL7__b_false = 7; > + int32_t res_32_SV_VL8__b_false = 7; > + int32_t res_32_SV_VL16__b_false = 7; > + uint32_t res_32_SV_VL1_u_b_false = 7; > + uint32_t res_32_SV_VL2_u_b_false = 7; > + uint32_t res_32_SV_VL3_u_b_false = 7; > + uint32_t res_32_SV_VL4_u_b_false = 7; > + uint32_t res_32_SV_VL5_u_b_false = 7; > + uint32_t res_32_SV_VL6_u_b_false = 7; > + uint32_t res_32_SV_VL7_u_b_false = 7; > + uint32_t res_32_SV_VL8_u_b_false = 7; > + uint32_t res_32_SV_VL16_u_b_false = 7; > + int64_t res_64_SV_VL1__b_false = 3; > + int64_t res_64_SV_VL2__b_false = 3; > + int64_t res_64_SV_VL3__b_false = 3; > + int64_t res_64_SV_VL4__b_false = 3; > + int64_t res_64_SV_VL5__b_false = 3; > + int64_t res_64_SV_VL6__b_false = 3; > + int64_t 
res_64_SV_VL7__b_false = 3; > + int64_t res_64_SV_VL8__b_false = 3; > + int64_t res_64_SV_VL16__b_false = 3; > + uint64_t res_64_SV_VL1_u_b_false = 3; > + uint64_t res_64_SV_VL2_u_b_false = 3; > + uint64_t res_64_SV_VL3_u_b_false = 3; > + uint64_t res_64_SV_VL4_u_b_false = 3; > + uint64_t res_64_SV_VL5_u_b_false = 3; > + uint64_t res_64_SV_VL6_u_b_false = 3; > + uint64_t res_64_SV_VL7_u_b_false = 3; > + uint64_t res_64_SV_VL8_u_b_false = 3; > + uint64_t res_64_SV_VL16_u_b_false = 3; > + > + > +#undef SVELAST_DEF > +#define SVELAST_DEF(size, pat, sign, ab, su) \ > + if (NAME (foo, size, pat, sign, ab) \ > + (svindex_ ## su ## size (0 ,1)) != \ > + NAME (res, size, pat, sign, ab)) \ > + __builtin_abort (); \ > + if (NAMEF (foo, size, pat, sign, ab) \ > + (svindex_ ## su ## size (0 ,1)) != \ > + NAMEF (res, size, pat, sign, ab)) \ > + __builtin_abort (); > + > + ALL_POS () > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c > index 1e38371842f..91fdd3c202e 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, all > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c > index 491c35af221..7d824caae1b 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_1024.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... 
> ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl128 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c > index eebb913273a..e0aa3a5fa68 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_128.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl16 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c > index 73c3b2ec045..3238015d9eb 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_2048.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl256 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c > index 29744c81402..50861098934 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_256.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl32 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c > index cf25c31bcbf..300dacce955 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_4_512.c > @@ -186,8 +186,6 @@ CALLER (f16, __SVFloat16_t) > ** caller_bf16: > ** ... 
> ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl64 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c > index 9ad3e227654..0a840a38384 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, all > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c > index d573e5fc69c..18cefbff1e6 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_1024.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl128 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c > index 200b0eb8242..c622ed55674 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_128.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl16 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c > index f6f8858fd47..3286280687d 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_2048.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... 
> ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl256 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c > index e62f59cc885..3c6afa2fdf1 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_256.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl32 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c > index 483558cb576..bb7d3ebf9d4 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/return_5_512.c > @@ -186,8 +186,6 @@ CALLER (f16, svfloat16_t) > ** caller_bf16: > ** ... > ** bl callee_bf16 > -** ptrue (p[0-7])\.b, vl64 > -** lasta h0, \1, z0\.h > ** ldp x29, x30, \[sp\], 16 > ** ret > */
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-05-16 8:03 ` Tejas Belagod @ 2023-05-16 8:45 ` Richard Sandiford 2023-05-16 11:28 ` Tejas Belagod 0 siblings, 1 reply; 10+ messages in thread From: Richard Sandiford @ 2023-05-16 8:45 UTC (permalink / raw) To: Tejas Belagod; +Cc: gcc-patches Tejas Belagod <Tejas.Belagod@arm.com> writes: >> + { >> + int i; >> + int nelts = vector_cst_encoded_nelts (v); >> + int first_el = 0; >> + >> + for (i = first_el; i < nelts; i += step) >> + if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, > first_el)) > > I think this should use !operand_equal_p (..., ..., 0). > > > Oops! I wonder why I thought VECTOR_CST_ENCODED_ELT returned a constant! Thanks > for spotting that. It does only return a constant. But there can be multiple trees with the same constant value, through things like TREE_OVERFLOW (not sure where things stand on expunging that from gimple) and the fact that gimple does not maintain a distinction between different types that have the same mode and signedness. (E.g. on ILP32 hosts, gimple does not maintain a distinction between int and long, even though int 0 and long 0 are different trees.) > Also, should the flags here be OEP_ONLY_CONST ? Nah, just 0 should be fine. >> + return false; >> + >> + return true; >> + } >> + >> + /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF. >> + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure >> + the index of the element being accessed is in the range of a NEON > vector >> + width. */ > > s/NEON/Advanced SIMD/. 
Same in later comments > >> + gimple *fold (gimple_folder & f) const override >> + { >> + tree pred = gimple_call_arg (f.call, 0); >> + tree val = gimple_call_arg (f.call, 1); >> + >> + if (TREE_CODE (pred) == VECTOR_CST) >> + { >> + HOST_WIDE_INT pos; >> + unsigned int const_vg; >> + int i = 0; >> + int step = f.type_suffix (0).element_bytes; >> + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred)); >> + int npats = VECTOR_CST_NPATTERNS (pred); >> + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred); >> + tree b = NULL_TREE; >> + bool const_vl = aarch64_sve_vg.is_constant (&const_vg); > > I think this might be left over from previous versions, but: > const_vg isn't used and const_vl is only used once, so I think it > would be better to remove them. > >> + >> + /* We can optimize 2 cases common to variable and fixed-length cases >> + without a linear search of the predicate vector: >> + 1. LASTA if predicate is all true, return element 0. >> + 2. LASTA if predicate all false, return element 0. */ >> + if (is_lasta () && vect_all_same (pred, step_1)) >> + { >> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, >> + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0)); >> + return gimple_build_assign (f.lhs, b); >> + } >> + >> + /* Handle the all-false case for LASTB where SVE VL == 128b - >> + return the highest numbered element. */ >> + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16) >> + && vect_all_same (pred, step_1) >> + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0))) > > Formatting nit: one condition per line once one line isn't enough. > >> + { >> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, >> + bitsize_int (step * BITS_PER_UNIT), >> + bitsize_int ((16 - step) * BITS_PER_UNIT)); >> + >> + return gimple_build_assign (f.lhs, b); >> + } >> + >> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of >> + 'step_1' in >> + [VECTOR_CST_NPATTERNS .. 
VECTOR_CST_ENCODED_NELTS - 1] >> + is zero, then we can treat the vector as VECTOR_CST_NPATTERNS >> + elements followed by all inactive elements. */ >> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2) > > Following on from the above, maybe use: > > !VECTOR_CST_NELTS (pred).is_constant () > > instead of !const_vl here. > > I have a horrible suspicion that I'm contradicting our earlier discussion > here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2. > > > > IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the encoded > constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the > base1 repeats in the encoding. This loop is checking this condition and looks > for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 in a VL vector. > Please correct me if I’m misunderstanding here. NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the entire sequence is repeated to fill a vector. So if an NELTS_PER_PATTERN == 1 constant has elements {0, 1, 0, 0}, the vector is: {0, 1, 0, 0, 0, 1, 0, 0, ...} and the optimisation can't handle that. NELTS_PER_PATTERN == 3 isn't likely to occur for predicates, but in principle it has the same problem. Thanks, Richard ^ permalink raw reply [flat|nested] 10+ messages in thread
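The encoding behaviour Richard describes can be illustrated with a small standalone model (a Python sketch for illustration only; `expand_vector_cst` and its simplifications are editorial inventions, not GCC code — in particular it models constants as plain integers rather than trees):

```python
def expand_vector_cst(encoded, npatterns, nelts_per_pattern, nelts):
    # Element i of the full vector belongs to pattern (i % npatterns) and is
    # occurrence (i // npatterns) within that pattern.  Occurrences beyond
    # the encoded ones repeat the last encoded occurrence (NELTS_PER_PATTERN
    # of 1 or 2) or continue its arithmetic series (NELTS_PER_PATTERN == 3).
    out = []
    for i in range(nelts):
        pat, occ = i % npatterns, i // npatterns
        if occ < nelts_per_pattern:
            out.append(encoded[occ * npatterns + pat])
            continue
        last = encoded[(nelts_per_pattern - 1) * npatterns + pat]
        if nelts_per_pattern < 3:
            out.append(last)
        else:
            prev = encoded[(nelts_per_pattern - 2) * npatterns + pat]
            out.append(last + (occ - nelts_per_pattern + 1) * (last - prev))
    return out

# Richard's example: {0, 1, 0, 0} with NELTS_PER_PATTERN == 1 is four
# single-element patterns, so the whole sequence repeats to fill the vector:
assert expand_vector_cst([0, 1, 0, 0], 4, 1, 12) == [0, 1, 0, 0] * 3
# By contrast, {0, 1 | 0, 0} with NELTS_PER_PATTERN == 2 really is two
# leading elements followed by an all-zero (all-inactive) tail:
assert expand_vector_cst([0, 1, 0, 0], 2, 2, 12) == [0, 1] + [0] * 10
```

Only the second form can be treated as "NPATTERNS elements followed by all inactive elements"; the first repeats its active lane indefinitely, which is why the fold cannot handle it.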
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-05-16 8:45 ` Richard Sandiford @ 2023-05-16 11:28 ` Tejas Belagod 2023-05-16 12:06 ` Richard Sandiford 0 siblings, 1 reply; 10+ messages in thread From: Tejas Belagod @ 2023-05-16 11:28 UTC (permalink / raw) To: Richard Sandiford; +Cc: gcc-patches [-- Attachment #1: Type: text/plain, Size: 5656 bytes --] From: Richard Sandiford <richard.sandiford@arm.com> Date: Tuesday, May 16, 2023 at 2:15 PM To: Tejas Belagod <Tejas.Belagod@arm.com> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org> Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] Tejas Belagod <Tejas.Belagod@arm.com> writes: >> + { >> + int i; >> + int nelts = vector_cst_encoded_nelts (v); >> + int first_el = 0; >> + >> + for (i = first_el; i < nelts; i += step) >> + if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, > first_el)) > > I think this should use !operand_equal_p (..., ..., 0). > > > Oops! I wonder why I thought VECTOR_CST_ENCODED_ELT returned a constant! Thanks > for spotting that. It does only return a constant. But there can be multiple trees with the same constant value, through things like TREE_OVERFLOW (not sure where things stand on expunging that from gimple) and the fact that gimple does not maintain a distinction between different types that have the same mode and signedness. (E.g. on ILP32 hosts, gimple does not maintain a distinction between int and long, even though int 0 and long 0 are different trees.) > Also, should the flags here be OEP_ONLY_CONST ? Nah, just 0 should be fine. >> + return false; >> + >> + return true; >> + } >> + >> + /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF. >> + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure >> + the index of the element being accessed is in the range of a NEON > vector >> + width. */ > > s/NEON/Advanced SIMD/. 
Same in later comments > >> + gimple *fold (gimple_folder & f) const override >> + { >> + tree pred = gimple_call_arg (f.call, 0); >> + tree val = gimple_call_arg (f.call, 1); >> + >> + if (TREE_CODE (pred) == VECTOR_CST) >> + { >> + HOST_WIDE_INT pos; >> + unsigned int const_vg; >> + int i = 0; >> + int step = f.type_suffix (0).element_bytes; >> + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred)); >> + int npats = VECTOR_CST_NPATTERNS (pred); >> + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred); >> + tree b = NULL_TREE; >> + bool const_vl = aarch64_sve_vg.is_constant (&const_vg); > > I think this might be left over from previous versions, but: > const_vg isn't used and const_vl is only used once, so I think it > would be better to remove them. > >> + >> + /* We can optimize 2 cases common to variable and fixed-length cases >> + without a linear search of the predicate vector: >> + 1. LASTA if predicate is all true, return element 0. >> + 2. LASTA if predicate all false, return element 0. */ >> + if (is_lasta () && vect_all_same (pred, step_1)) >> + { >> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, >> + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0)); >> + return gimple_build_assign (f.lhs, b); >> + } >> + >> + /* Handle the all-false case for LASTB where SVE VL == 128b - >> + return the highest numbered element. */ >> + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16) >> + && vect_all_same (pred, step_1) >> + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0))) > > Formatting nit: one condition per line once one line isn't enough. > >> + { >> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, >> + bitsize_int (step * BITS_PER_UNIT), >> + bitsize_int ((16 - step) * BITS_PER_UNIT)); >> + >> + return gimple_build_assign (f.lhs, b); >> + } >> + >> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of >> + 'step_1' in >> + [VECTOR_CST_NPATTERNS .. 
VECTOR_CST_ENCODED_NELTS - 1] >> + is zero, then we can treat the vector as VECTOR_CST_NPATTERNS >> + elements followed by all inactive elements. */ >> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2) > > Following on from the above, maybe use: > > !VECTOR_CST_NELTS (pred).is_constant () > > instead of !const_vl here. > > I have a horrible suspicion that I'm contradicting our earlier discussion > here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2. > > > > IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the encoded > constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the > base1 repeats in the encoding. This loop is checking this condition and looks > for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 in a VL vector. > Please correct me if I’m misunderstanding here. NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the entire sequence is repeated to fill a vector. So if an NELTS_PER_PATTERN == 1 constant has elements {0, 1, 0, 0}, the vector is: {0, 1, 0, 0, 0, 1, 0, 0, ...} Wouldn’t the vect_all_same(pred, step) cover this case for a given value of step? and the optimisation can't handle that. NELTS_PER_PATTERN == 3 isn't likely to occur for predicates, but in principle it has the same problem. OK, I had misunderstood the encoding to always make base1 the repeating value by adjusting the NPATTERNS accordingly – I didn’t know you could also have the base2 value and beyond encoding the repeat value. In this case could I just remove NELTS_PER_PATTERN == 2 condition and the enclosed loop would check for a repeating ‘1’ in the repeated part of the encoded pattern? Thanks, Tejas. Thanks, Richard ^ permalink raw reply [flat|nested] 10+ messages in thread
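The `operand_equal_p` point — that distinct tree nodes can carry the same constant value, so pointer comparison is the wrong test — can be modelled in a few lines (an illustrative Python sketch; `Cst` and `vect_all_same` are simplified stand-ins for a GCC tree node and the patch's helper, not the real implementations):

```python
class Cst:
    # Stand-in for a tree constant node: distinct nodes can carry the same
    # value (e.g. int 0 vs long 0 on an ILP32 host), so identity comparison
    # is the wrong test -- compare values, as operand_equal_p does.
    def __init__(self, value):
        self.value = value

def vect_all_same(encoded, step):
    # Model of the patch's helper: true if every STEP-th encoded element
    # has the same value as element 0.
    return all(encoded[i].value == encoded[0].value
               for i in range(0, len(encoded), step))

zeros = [Cst(0), Cst(0), Cst(0), Cst(0)]
assert vect_all_same(zeros, 1)
# Identity comparison (the original bug) would wrongly distinguish these:
assert zeros[1] is not zeros[0] and zeros[1].value == zeros[0].value
```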
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-05-16 11:28 ` Tejas Belagod @ 2023-05-16 12:06 ` Richard Sandiford 2023-05-19 9:08 ` Tejas Belagod 0 siblings, 1 reply; 10+ messages in thread From: Richard Sandiford @ 2023-05-16 12:06 UTC (permalink / raw) To: Tejas Belagod; +Cc: gcc-patches Tejas Belagod <Tejas.Belagod@arm.com> writes: >>> + { >>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, >>> + bitsize_int (step * BITS_PER_UNIT), >>> + bitsize_int ((16 - step) * BITS_PER_UNIT)); >>> + >>> + return gimple_build_assign (f.lhs, b); >>> + } >>> + >>> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of >>> + 'step_1' in >>> + [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1] >>> + is zero, then we can treat the vector as VECTOR_CST_NPATTERNS >>> + elements followed by all inactive elements. */ >>> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2) >> >> Following on from the above, maybe use: >> >> !VECTOR_CST_NELTS (pred).is_constant () >> >> instead of !const_vl here. >> >> I have a horrible suspicion that I'm contradicting our earlier discussion >> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2. >> >> >> >> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the > encoded >> constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the >> base1 repeats in the encoding. This loop is checking this condition and looks >> for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 in a VL vector. >> Please correct me if I’m misunderstanding here. > > NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the > entire sequence is repeated to fill a vector. So if an NELTS_PER_PATTERN > == 1 constant has elements {0, 1, 0, 0}, the vector is: > > {0, 1, 0, 0, 0, 1, 0, 0, ...} > > > Wouldn’t the vect_all_same(pred, step) cover this case for a given value of > step? > > > and the optimisation can't handle that. 
NELTS_PER_PATTERN == 3 isn't > likely to occur for predicates, but in principle it has the same problem. > > > > OK, I had misunderstood the encoding to always make base1 the repeating value > by adjusting the NPATTERNS accordingly – I didn’t know you could also have the > base2 value and beyond encoding the repeat value. In this case could I just > remove NELTS_PER_PATTERN == 2 condition and the enclosed loop would check for a > repeating ‘1’ in the repeated part of the encoded pattern? But for NELTS_PER_PATTERN==1, the whole encoded sequence repeats. So you would have to start the check at element 0 rather than NPATTERNS. And then (for NELTS_PER_PATTERN==1) the loop would reject any constant that has a nonzero element. But all valid zero-vector cases have been handled by this point, so the effect wouldn't be useful. It should never be the case that all elements from NPATTERNS onwards are zero for NELTS_PER_PATTERN==3; that case should be canonicalised to NELTS_PER_PATTERN==2 instead. So in practice it's simpler and more obviously correct to punt when NELTS_PER_PATTERN != 2. Thanks, Richard ^ permalink raw reply [flat|nested] 10+ messages in thread
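The punting rule Richard arrives at can be stated compactly (an editorial Python sketch; `foldable_as_fixed_prefix` is an invented name for the check the fold would perform on the encoded predicate):

```python
def foldable_as_fixed_prefix(encoded, npatterns, nelts_per_pattern):
    # True if the predicate can be treated as NPATTERNS leading elements
    # followed by all-inactive elements.  Punt unless NELTS_PER_PATTERN == 2:
    # with 1 the whole encoded sequence repeats wholesale, and an all-zero
    # tail with 3 would have been canonicalised to 2 anyway.
    if nelts_per_pattern != 2:
        return False
    return all(e == 0 for e in encoded[npatterns:])

# {0, 1 | 0, 0}: one active lane, then an all-false tail -> foldable.
assert foldable_as_fixed_prefix([0, 1, 0, 0], 2, 2)
# {0, 1, 0, 0} repeating wholesale (NELTS_PER_PATTERN == 1) -> punt.
assert not foldable_as_fixed_prefix([0, 1, 0, 0], 4, 1)
```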
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-05-16 12:06 ` Richard Sandiford @ 2023-05-19 9:08 ` Tejas Belagod 2023-05-19 9:50 ` Richard Sandiford 0 siblings, 1 reply; 10+ messages in thread From: Tejas Belagod @ 2023-05-19 9:08 UTC (permalink / raw) To: Richard Sandiford; +Cc: gcc-patches [-- Attachment #1: Type: text/plain, Size: 3580 bytes --] From: Richard Sandiford <richard.sandiford@arm.com> Date: Tuesday, May 16, 2023 at 5:36 PM To: Tejas Belagod <Tejas.Belagod@arm.com> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org> Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] Tejas Belagod <Tejas.Belagod@arm.com> writes: >>> + { >>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val, >>> + bitsize_int (step * BITS_PER_UNIT), >>> + bitsize_int ((16 - step) * BITS_PER_UNIT)); >>> + >>> + return gimple_build_assign (f.lhs, b); >>> + } >>> + >>> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of >>> + 'step_1' in >>> + [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1] >>> + is zero, then we can treat the vector as VECTOR_CST_NPATTERNS >>> + elements followed by all inactive elements. */ >>> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2) >> >> Following on from the above, maybe use: >> >> !VECTOR_CST_NELTS (pred).is_constant () >> >> instead of !const_vl here. >> >> I have a horrible suspicion that I'm contradicting our earlier discussion >> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2. >> >> >> >> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the > encoded >> constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the >> base1 repeats in the encoding. This loop is checking this condition and looks >> for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 in a VL vector. >> Please correct me if I’m misunderstanding here. 
> > NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the > entire sequence is repeated to fill a vector. So if an NELTS_PER_PATTERN > == 1 constant has elements {0, 1, 0, 0}, the vector is: > > {0, 1, 0, 0, 0, 1, 0, 0, ...} > > > Wouldn’t the vect_all_same(pred, step) cover this case for a given value of > step? > > > and the optimisation can't handle that. NELTS_PER_PATTERN == 3 isn't > likely to occur for predicates, but in principle it has the same problem. > > > > OK, I had misunderstood the encoding to always make base1 the repeating value > by adjusting the NPATTERNS accordingly – I didn’t know you could also have the > base2 value and beyond encoding the repeat value. In this case could I just > remove NELTS_PER_PATTERN == 2 condition and the enclosed loop would check for a > repeating ‘1’ in the repeated part of the encoded pattern? But for NELTS_PER_PATTERN==1, the whole encoded sequence repeats. So you would have to start the check at element 0 rather than NPATTERNS. And then (for NELTS_PER_PATTERN==1) the loop would reject any constant that has a nonzero element. But all valid zero-vector cases have been handled by this point, so the effect wouldn't be useful. It should never be the case that all elements from NPATTERNS onwards are zero for NELTS_PER_PATTERN==3; that case should be canonicalised to NELTS_PER_PATTERN==2 instead. So in practice it's simpler and more obviously correct to punt when NELTS_PER_PATTERN != 2. Thanks for the clarification. I understand all points about punting when NELTS_PER_PATTERN !=2, but one. Am I correct to understand that we still need to check for the case when there's a repeating non-zero elements in the case of NELTS_PER_PATTERN == 2? eg. { 0, 0, 1, 1, 1, 1,....} which should be encoded as {0, 0, 1, 1} with NPATTERNS = 2 ? Thanks, Tejas. Thanks, Richard ^ permalink raw reply [flat|nested] 10+ messages in thread
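Tejas's example can be worked through explicitly (a standalone Python sketch; `expand_npp2` is an invented helper that expands an NELTS_PER_PATTERN == 2 encoding, not GCC code):

```python
def expand_npp2(encoded, npatterns, nelts):
    # For an NELTS_PER_PATTERN == 2 encoding, each of the NPATTERNS
    # interleaved patterns has one explicit leading element (base0) and a
    # second element (base1) that repeats for the rest of the vector.
    base0, base1 = encoded[:npatterns], encoded[npatterns:]
    return [(base0 if i < npatterns else base1)[i % npatterns]
            for i in range(nelts)]

# {0, 0, 1, 1} with NPATTERNS == 2 expands to {0, 0, 1, 1, 1, 1, ...}:
v = expand_npp2([0, 0, 1, 1], 2, 10)
assert v == [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
# The repeated part (elements from NPATTERNS onwards) contains a 1, so the
# predicate cannot be treated as a fixed prefix followed by all-false lanes.
```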
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-05-19 9:08 ` Tejas Belagod @ 2023-05-19 9:50 ` Richard Sandiford 2023-06-12 8:31 ` Tejas Belagod 0 siblings, 1 reply; 10+ messages in thread From: Richard Sandiford @ 2023-05-19 9:50 UTC (permalink / raw) To: Tejas Belagod; +Cc: gcc-patches Tejas Belagod <Tejas.Belagod@arm.com> writes: > Am I correct to understand that we still need to check for the case when > there's a repeating non-zero elements in the case of NELTS_PER_PATTERN == 2? > eg. { 0, 0, 1, 1, 1, 1,....} which should be encoded as {0, 0, 1, 1} with > NPATTERNS = 2 ? Yeah, that's right. The current handling for NPATTERNS==2 looked good to me. It was the other two cases that I was worried about. Thanks, Richard ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] 2023-05-19 9:50 ` Richard Sandiford @ 2023-06-12 8:31 ` Tejas Belagod 0 siblings, 0 replies; 10+ messages in thread From: Tejas Belagod @ 2023-06-12 8:31 UTC (permalink / raw) To: Richard Sandiford; +Cc: gcc-patches [-- Attachment #1: Type: text/plain, Size: 870 bytes --] From: Richard Sandiford <richard.sandiford@arm.com> Date: Friday, May 19, 2023 at 3:20 PM To: Tejas Belagod <Tejas.Belagod@arm.com> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org> Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab] Tejas Belagod <Tejas.Belagod@arm.com> writes: > Am I correct to understand that we still need to check for the case when > there's a repeating non-zero elements in the case of NELTS_PER_PATTERN == 2? > eg. { 0, 0, 1, 1, 1, 1,....} which should be encoded as {0, 0, 1, 1} with > NPATTERNS = 2 ? Yeah, that's right. The current handling for NPATTERNS==2 looked good to me. It was the other two cases that I was worried about. Thanks, Richard Thanks for all the reviews. I’ve posted a new version of the patch here - https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621310.html Thanks, Tejas. ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads: [~2023-06-12 8:31 UTC | newest]

Thread overview: 10+ messages
2023-03-16 11:39 [PATCH] [PR96339] AArch64: Optimise svlast[ab] Tejas Belagod
2023-05-04  5:43 ` Tejas Belagod
2023-05-11 19:32 ` Richard Sandiford
2023-05-16  8:03 ` Tejas Belagod
2023-05-16  8:45 ` Richard Sandiford
2023-05-16 11:28 ` Tejas Belagod
2023-05-16 12:06 ` Richard Sandiford
2023-05-19  9:08 ` Tejas Belagod
2023-05-19  9:50 ` Richard Sandiford
2023-06-12  8:31 ` Tejas Belagod