public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>,
	Richard Earnshaw <Richard.Earnshaw@arm.com>
Subject: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
Date: Thu, 17 Aug 2023 11:31:26 +0100	[thread overview]
Message-ID: <949f5dd0-cdf0-715a-f04c-3de80c9b974f@arm.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 5489 bytes --]

Hi all,


This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.

Mid-end changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
    decrement numbers other that -1 and for `count` to be an
    rtx containing a simple REG (which in this case will contain
    the number of elements to be processed), rather
    than an expression for calculating the number of iterations.
2) Added a new df utility function: `df_bb_regno_only_def_find` that
    will return the DEF of a REG if it is DEF-ed only once within the
    basic block.

And many things in the backend to implement the above optimisation:

3)  Implement the `arm_predict_doloop_p` target hook to instruct the
     mid-end about Low Overhead Loops (MVE or not), as well as
     `arm_loop_unroll_adjust` which will prevent unrolling of any loops
     that are valid for becoming MVE Tail_Predicated Low Overhead Loops
     (unrolling can transform a loop in ways that invalidate the dlstp/
     letp tranformation logic and the benefit of the dlstp/letp loop
     would be considerably higher than that of unrolling)
4)  Appropriate changes to the define_expand of doloop_end, new
     patterns for dlstp and letp, new iterators,  unspecs, etc.
5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
    * `arm_mve_dlstp_check_dec_counter`
    * `arm_mve_dlstp_check_inc_counter`
    * `arm_mve_check_reg_origin_is_num_elems`
    * `arm_mve_check_df_chain_back_for_implic_predic`
    * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
    This all, in smoe way or another, are running checks on the loop
    structure in order to determine if the loop is valid for dlstp/letp
    transformation.
6) `arm_attempt_dlstp_transform`: (called from the define_expand of
     doloop_end) this function re-checks for the loop's suitability for
     dlstp/letp transformation and then implements it, if possible.
7) Various utility functions:
    *`arm_mve_get_vctp_lanes` to map
    from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
    to check an insn to see if it requires the VPR or not.
    * `arm_mve_get_loop_vctp`
    * `arm_mve_get_vctp_lanes`
    * `arm_emit_mve_unpredicated_insn_to_seq`
    * `arm_get_required_vpr_reg`
    * `arm_get_required_vpr_reg_param`
    * `arm_get_required_vpr_reg_ret_val`
    * `arm_mve_is_across_vector_insn`
    * `arm_is_mve_load_store_insn`
    * `arm_mve_vec_insn_is_predicated_with_this_predicate`
    * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

     * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
     (arm_target_bb_ok_for_lob): ...this
     (arm_attempt_dlstp_transform): New.
     * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
     (TARGET_PREDICT_DOLOOP_P): New.
     (arm_block_set_vect):
     (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
     (arm_target_bb_ok_for_lob): New.
     (arm_mve_get_vctp_lanes): New.
     (arm_get_required_vpr_reg): New.
     (arm_get_required_vpr_reg_param): New.
     (arm_get_required_vpr_reg_ret_val): New.
     (arm_mve_get_loop_vctp): New.
     (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
     (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
     (arm_mve_check_df_chain_back_for_implic_predic): New.
     (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
     (arm_mve_check_reg_origin_is_num_elems): New.
     (arm_mve_dlstp_check_inc_counter): New.
     (arm_mve_dlstp_check_dec_counter): New.
     (arm_mve_loop_valid_for_dlstp): New.
     (arm_mve_is_across_vector_insn): New.
     (arm_is_mve_load_store_insn): New.
     (arm_predict_doloop_p): New.
     (arm_loop_unroll_adjust): New.
     (arm_emit_mve_unpredicated_insn_to_seq): New.
     (arm_attempt_dlstp_transform): New.
         * config/arm/iterators.md (DLSTP): New.
         (mode1): Add DLSTP mappings.
         * config/arm/mve.md (*predicated_doloop_end_internal): New.
         (dlstp<mode1>_insn): New.
         * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
         * config/arm/unspecs.md: New unspecs.
     * df-core.cc (df_bb_regno_only_def_find): New.
     * df.h (df_bb_regno_only_def_find): New.
         * loop-doloop.cc (doloop_condition_get): Relax conditions.
         (doloop_optimize): Add support for elementwise LoLs.

gcc/testsuite/ChangeLog:

         * gcc.target/arm/lob.h: Update framework.
         * gcc.target/arm/lob1.c: Likewise.
         * gcc.target/arm/lob6.c: Likewise.
     * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
     * gcc.target/arm/mve/dlstp-int16x8.c: New test.
     * gcc.target/arm/mve/dlstp-int32x4.c: New test.
     * gcc.target/arm/mve/dlstp-int64x2.c: New test.
     * gcc.target/arm/mve/dlstp-int8x16.c: New test.
     * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

[-- Attachment #2: 2.patch --]
[-- Type: text/x-patch, Size: 105189 bytes --]

commit 8564dee09c1258c388094abd614f311e60723368
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other that -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Added a new df utility function: `df_bb_regno_only_def_find` that
       will return the DEF of a REG if it is DEF-ed only once within the
       basic block.
    
    And many things in the backend to implement the above optimisation:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail_Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp tranformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling)
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators,  unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       This all, in smoe way or another, are running checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions:
       *`arm_mve_get_vctp_lanes` to map
       from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
       to check an insn to see if it requires the VPR or not.
       * `arm_mve_get_loop_vctp`
       * `arm_mve_get_vctp_lanes`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_mve_is_across_vector_insn`
       * `arm_is_mve_load_store_insn`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
            (arm_target_bb_ok_for_lob): New.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_mve_is_across_vector_insn): New.
            (arm_is_mve_load_store_insn): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 77e76336e94..74186930f0b 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 6e933c80183..39d97ba5e4d 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34416,19 +34422,1096 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility fuction: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give one only scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  int n_operands = recog_data.n_operands;
+  extract_insn (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+      return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influnce the result of the operation.
+   Any new across-vector instructions to the MVE ISA will have to assessed for
+   inclusion to this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_pattern = PATTERN (insn);
+  if (GET_CODE (insn_pattern) == SET
+      && GET_CODE (XEXP (insn_pattern, 1)) == UNSPEC
+      && (XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAXQ_S))
+    return true;
+  return false;
+}
+
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map, rtx_insn *insn,
+   rtx vctp_vpr_generated)
+{
+  bool* temp = NULL;
+  if ((temp = safe_insn_map->get (INSN_UID (insn))))
+    return *temp;
+
+  basic_block body = BLOCK_FOR_INSN (insn);
+  /* The circumstances under which an instruction is affected by "implicit
+     predication" are as follows:
+      * It is an UNPREDICATED_INSN_P:
+	* That loads/stores from/to memory.
+	* Where any one of its operands is an MVE vector from outside the
+	  loop body bb.
+     Or:
+      * Any of it's operands, recursively backwards, are affected.  */
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      && (arm_is_mve_load_store_insn (insn)
+	  || (arm_is_mve_across_vector_insn (insn)
+	      && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+    {
+      safe_insn_map->put (INSN_UID (insn), true);
+      return true;
+    }
+
+  df_ref insn_uses = NULL;
+  FOR_EACH_INSN_USE (insn_uses, insn)
+  {
+    /* If the operand is in the input reg set to the the basic block,
+       (i.e. it has come from outside the loop!), consider it unsafe if:
+	 * It's being used in an unpredicated insn.
+	 * It is a predicable MVE vector.  */
+    if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	&& VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	&& REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+      {
+	safe_insn_map->put (INSN_UID (insn), true);
+	return true;
+      }
+    /* Scan backwards from the current INSN through the instruction chain
+       until the start of the basic block.  */
+    for (rtx_insn *prev_insn = PREV_INSN (insn);
+	 prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	 prev_insn = PREV_INSN (prev_insn))
+      {
+	/* If a previous insn defines a register that INSN uses, then recurse
+	   in order to check that insn's USEs.
+	   If any of these insns return true as MVE_VPT_UNPREDICATED_INSN_Ps,
+	   then the whole chain is affected by the change in behaviour from
+	   being placed in dlstp/letp loop.  */
+	df_ref prev_insn_defs = NULL;
+	FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	{
+	  if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+	      && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		   (insn, vctp_vpr_generated)
+	      && arm_mve_check_df_chain_back_for_implic_predic
+		  (safe_insn_map, prev_insn, vctp_vpr_generated))
+	    {
+	      safe_insn_map->put (INSN_UID (insn), true);
+	      return true;
+	    }
+	}
+      }
+  }
+  safe_insn_map->put (INSN_UID (insn), false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.*/
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is an differently-predicated store to memory or it is any
+		 across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == IF_THEN_ELSE
+      && REG_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (XEXP (counter_max_last_set, 1), 1), vctp_step);
+
+  if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  if (NONDEBUG_INSN_P (dec_insn))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Find where both of those are modified in the loop body bb.  */
+  rtx condcount_reg_set
+	= PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+				 (body, REGNO (condcount))));
+  rtx vctp_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					    (body, REGNO (vctp_reg))));
+  if (!vctp_reg_set || !condcount_reg_set)
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+  /* Ensure it matches the number of lanes of the vctp instruction.  */
+  if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case makse sure that there is a only simple SET from one to the
+     other inside the loop.*/
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behild a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the VCTP insn is DEF'd inside the loop.  */
+	  rtx vctp_reg_set =
+		PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					(body, REGNO (vctp_reg))));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If its a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows its not negative)  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we need will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     it is valid and as per arm_target_bb_ok_for_lob and the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to hadle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instrunctions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the upredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its out is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementunum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map
+					  = new hash_map<int_hash<int, -1, -2>,
+							 bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+		(safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index ee931ad6ebd..70fade0d0da 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 71e43539616..1401b59dc0b 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2660,6 +2660,9 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2903,6 +2906,8 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 87cbf6c1726..dc4b6301aaa 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6997,7 +6997,7 @@
    (set (reg:SI LR_REGNUM)
 	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
@@ -7017,5 +7017,5 @@
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   "dlstp.<mode1>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..368d5138ca1 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,65 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatibe MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1779,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 6a5b1f8f623..7921bffc169 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -581,6 +581,10 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..f6dbd0515de 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -n)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
      the loop counter register.  Some targets (IA-64) wrap the set of
      the loop counter in an if_then_else too.
 
-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+     2)  (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	{
 	  /* We expect the condition to be of the form (reg != 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 If n == 1, that is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which also for n == 1 is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
+	For the "elementwise" form where the decrement number isn't -1,
+	the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
+       }
 
     return condition;
    }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,11 +713,24 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  count = copy_rtx (desc->niter_expr);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..5ddd994e53d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for loop format of decrementing to zero */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time the p1 will also change in every loop (still fine)  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0cdffb312b3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..7ff789d7650
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..8065bd02469
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..552781001e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..c1c40c2fea7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,343 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */

             reply	other threads:[~2023-08-17 10:32 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-17 10:31 Stamatis Markianos-Wright [this message]
2023-09-06 17:19 ` [PING][PATCH " Stamatis Markianos-Wright
2023-09-14 12:10   ` Kyrylo Tkachov
2023-09-28 12:51     ` Andre Vieira (lists)
2023-10-11 11:34       ` Stamatis Markianos-Wright
2023-10-23 10:16         ` Andre Vieira (lists)
2023-10-24 15:11   ` Richard Sandiford
2023-11-06 11:03     ` Stamatis Markianos-Wright
2023-11-06 11:24       ` Richard Sandiford
2023-11-06 17:29         ` Stamatis Markianos-Wright
2023-11-10 12:41           ` Stamatis Markianos-Wright
2023-11-16 11:36             ` Stamatis Markianos-Wright
2023-11-27 12:47               ` Andre Vieira (lists)
2023-11-30 12:55                 ` Stamatis Markianos-Wright
2023-12-07 18:08                   ` Andre Vieira (lists)
2023-12-09 18:31                   ` Richard Sandiford
2023-12-12 17:56                   ` Richard Earnshaw
  -- strict thread matches above, loose matches on Subject: below --
2023-12-18 11:53 [PATCH 0/2] " Andre Vieira
2023-12-18 11:53 ` [PATCH 2/2] " Andre Vieira
2023-12-20 16:54   ` Andre Vieira (lists)
2023-11-06 11:20 Stamatis Markianos-Wright
2023-06-15 11:47 Stamatis Markianos-Wright
2023-06-22 15:54 ` Andre Vieira (lists)
2023-07-05 16:11   ` Stamatis Markianos-Wright
2023-06-23 10:23 ` Andre Vieira (lists)
2023-07-05 16:13   ` Stamatis Markianos-Wright
2023-06-23 16:25 ` Andre Vieira (lists)
2023-07-05 16:41   ` Stamatis Markianos-Wright
2022-11-11 17:40 Stam Markianos-Wright
2022-11-15 15:51 ` Andre Vieira (lists)
2022-11-28 12:13   ` Stam Markianos-Wright

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=949f5dd0-cdf0-715a-f04c-3de80c9b974f@arm.com \
    --to=stam.markianos-wright@arm.com \
    --cc=Kyrylo.Tkachov@arm.com \
    --cc=Richard.Earnshaw@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).