[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

* [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
@ 2023-08-17 10:31 Stamatis Markianos-Wright
  2023-09-06 17:19 ` [PING][PATCH " Stamatis Markianos-Wright
From: Stamatis Markianos-Wright @ 2023-08-17 10:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: Kyrylo Tkachov, Richard Earnshaw

[-- Attachment #1: Type: text/plain, Size: 5489 bytes --]

Hi all,


This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.
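
For readers without the previous email to hand, here is a rough host-side
model of what a tail-predicated loop does. It is only a sketch under stated
assumptions, not code from the patch: `tail_predicated_add` and `LANES` are
made-up names, and plain C++ stands in for the MVE instructions. The counter
holds the number of elements left (not the number of iterations), drops by
the lane count on each iteration, and the last iteration only touches the
lanes that are still in range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: LANES stands in for the lane count of a 128-bit
// MVE vector of 32-bit elements.  None of these names come from the patch.
constexpr int LANES = 4;

// Adds b to a elementwise, LANES elements per iteration.  The loop counter
// holds the number of elements still to process and drops by LANES each
// time round, as the counter of a dlstp/letp loop would.  Returns the
// number of iterations executed.
int tail_predicated_add (std::vector<int> &a, const std::vector<int> &b)
{
  assert (a.size () == b.size ());
  int remaining = static_cast<int> (a.size ());
  std::size_t base = 0;
  int iterations = 0;
  while (remaining > 0)
    {
      // Implicit predicate: only the first min (remaining, LANES) lanes
      // are active, so the final partial vector never runs past the end.
      int active = std::min (remaining, LANES);
      for (int lane = 0; lane < active; lane++)
        a[base + lane] += b[base + lane];
      base += LANES;
      remaining -= LANES;
      iterations++;
    }
  return iterations;
}
```

With 10 elements and 4 lanes this model runs 3 iterations, the last one with
only 2 active lanes.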

Mid-end changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
    decrement numbers other than -1 and for `count` to be an
    rtx containing a simple REG (which in this case will contain
    the number of elements to be processed), rather
    than an expression for calculating the number of iterations.
2) Add a new df utility function, `df_bb_regno_only_def_find`, that
    returns the DEF of a REG if it is DEF-ed only once within the
    basic block.
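
The "only DEF" lookup in 2) can be pictured with the toy model below. This
is a sketch, assuming a basic block is just a list of instructions that each
define at most one register; `toy_insn` and `only_def_find` are hypothetical
names and do not use GCC's df structures.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for an instruction; hypothetical, for illustration only.
struct toy_insn
{
  int uid;        // Identifier of the instruction.
  int def_regno;  // Register it defines, or -1 if it defines none.
};

// Return a pointer to the single insn in BB that defines REGNO, or nullptr
// if REGNO is defined zero times or more than once in BB.
const toy_insn *
only_def_find (const std::vector<toy_insn> &bb, int regno)
{
  assert (regno >= 0);
  const toy_insn *found = nullptr;
  for (const toy_insn &insn : bb)
    if (insn.def_regno == regno)
      {
        if (found)
          return nullptr;  // Second definition: not a unique DEF.
        found = &insn;
      }
  return found;
}
```

A register defined twice in the block yields no result, which is what lets
callers treat a non-null result as the one place where the register is set.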

Backend changes to implement the above optimisation are:

3)  Implement the `arm_predict_doloop_p` target hook to instruct the
     mid-end about Low Overhead Loops (MVE or not), as well as
     `arm_loop_unroll_adjust` which will prevent unrolling of any loops
     that are valid for becoming MVE Tail-Predicated Low Overhead Loops
     (unrolling can transform a loop in ways that invalidate the dlstp/
     letp transformation logic and the benefit of the dlstp/letp loop
     would be considerably higher than that of unrolling).
4)  Appropriate changes to the define_expand of doloop_end, new
     patterns for dlstp and letp, new iterators,  unspecs, etc.
5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
    * `arm_mve_dlstp_check_dec_counter`
    * `arm_mve_dlstp_check_inc_counter`
    * `arm_mve_check_reg_origin_is_num_elems`
    * `arm_mve_check_df_chain_back_for_implic_predic`
    * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
    These all, in some way or another, run checks on the loop
    structure in order to determine if the loop is valid for dlstp/letp
    transformation.
6) `arm_attempt_dlstp_transform`: (called from the define_expand of
     doloop_end) this function re-checks for the loop's suitability for
     dlstp/letp transformation and then implements it, if possible.
7) Various utility functions, such as `arm_mve_get_vctp_lanes` to map
    from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
    to check an insn to see if it requires the VPR or not:
    * `arm_mve_get_loop_vctp`
    * `arm_mve_get_vctp_lanes`
    * `arm_emit_mve_unpredicated_insn_to_seq`
    * `arm_get_required_vpr_reg`
    * `arm_get_required_vpr_reg_param`
    * `arm_get_required_vpr_reg_ret_val`
    * `arm_is_mve_across_vector_insn`
    * `arm_is_mve_load_store_insn`
    * `arm_mve_vec_insn_is_predicated_with_this_predicate`
    * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
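
The arithmetic behind two of these checks can be sketched on the host as
follows. This is an illustration under assumptions, not the patch's code:
the function names are hypothetical, the lane count is keyed off the element
width rather than an rtx machine mode, and the vctp step is assumed to equal
the lane count. The shift test has the same shape as the final comparison in
`arm_mve_check_reg_origin_is_num_elems`.

```cpp
#include <cassert>
#include <cstdlib>

// Lanes in a 128-bit MVE vector for a given element width in bits
// (vctp8/16/32/64 give 16/8/4/2 lanes); 0 for anything else.
int vctp_lanes_for_element_bits (int element_bits)
{
  switch (element_bits)
    {
    case 8: return 16;
    case 16: return 8;
    case 32: return 4;
    case 64: return 2;
    default: return 0;
    }
}

// A right shift by SHIFT only matches a vctp step of STEP when
// 1 << SHIFT equals |STEP|.
bool shift_matches_vctp_step (int shift, int step)
{
  assert (shift >= 0 && shift < 31);
  return (1 << shift) == std::abs (step);
}

// Iteration count of an incrementing-counter loop over NUM_ELEMS elements:
// (num_of_elem + num_of_lanes - 1) / num_of_lanes.
int iterations_for (int num_elems, int lanes)
{
  assert (lanes > 0 && num_elems >= 0);
  return (num_elems + lanes - 1) / lanes;
}
```

For 32-bit elements the model gives 4 lanes, so a shift by 2 matches a step
of 4 or -4, and 10 elements take 3 iterations.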

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

     * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
     (arm_target_bb_ok_for_lob): ...this
     (arm_attempt_dlstp_transform): New.
     * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
     (TARGET_PREDICT_DOLOOP_P): New.
     (arm_block_set_vect):
     (arm_target_insn_ok_for_lob): Rename to...
     (arm_target_bb_ok_for_lob): ...this.
     (arm_mve_get_vctp_lanes): New.
     (arm_get_required_vpr_reg): New.
     (arm_get_required_vpr_reg_param): New.
     (arm_get_required_vpr_reg_ret_val): New.
     (arm_mve_get_loop_vctp): New.
     (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
     (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
     (arm_mve_check_df_chain_back_for_implic_predic): New.
     (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
     (arm_mve_check_reg_origin_is_num_elems): New.
     (arm_mve_dlstp_check_inc_counter): New.
     (arm_mve_dlstp_check_dec_counter): New.
     (arm_mve_loop_valid_for_dlstp): New.
     (arm_is_mve_across_vector_insn): New.
     (arm_is_mve_load_store_insn): New.
     (arm_predict_doloop_p): New.
     (arm_loop_unroll_adjust): New.
     (arm_emit_mve_unpredicated_insn_to_seq): New.
     (arm_attempt_dlstp_transform): New.
     * config/arm/iterators.md (DLSTP): New.
     (mode1): Add DLSTP mappings.
     * config/arm/mve.md (*predicated_doloop_end_internal): New.
     (dlstp<mode1>_insn): New.
     * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
     * config/arm/unspecs.md: New unspecs.
     * df-core.cc (df_bb_regno_only_def_find): New.
     * df.h (df_bb_regno_only_def_find): New.
     * loop-doloop.cc (doloop_condition_get): Relax conditions.
     (doloop_optimize): Add support for elementwise LoLs.

gcc/testsuite/ChangeLog:

     * gcc.target/arm/lob.h: Update framework.
     * gcc.target/arm/lob1.c: Likewise.
     * gcc.target/arm/lob6.c: Likewise.
     * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
     * gcc.target/arm/mve/dlstp-int16x8.c: New test.
     * gcc.target/arm/mve/dlstp-int32x4.c: New test.
     * gcc.target/arm/mve/dlstp-int64x2.c: New test.
     * gcc.target/arm/mve/dlstp-int8x16.c: New test.
     * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

[-- Attachment #2: 2.patch --]
[-- Type: text/x-patch, Size: 105189 bytes --]

commit 8564dee09c1258c388094abd614f311e60723368
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other than -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Add a new df utility function, `df_bb_regno_only_def_find`, that
       returns the DEF of a REG if it is DEF-ed only once within the
       basic block.
    
    Backend changes to implement the above optimisation are:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail-Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp transformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling).
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators,  unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       These all, in some way or another, run checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions, such as `arm_mve_get_vctp_lanes` to map
       from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
       to check an insn to see if it requires the VPR or not:
       * `arm_mve_get_loop_vctp`
       * `arm_mve_get_vctp_lanes`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_is_mve_across_vector_insn`
       * `arm_is_mve_load_store_insn`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_is_mve_across_vector_insn): New.
            (arm_is_mve_load_store_insn): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 77e76336e94..74186930f0b 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 6e933c80183..39d97ba5e4d 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34416,19 +34422,1096 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is a VPR-generating instruction, like a
+	 vctp, vcmp, etc., or it is a VPT-predicated instruction.  Return the
+	 subrtx of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least one vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give only one scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  extract_insn (insn);
+  int n_operands = recog_data.n_operands;
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+	return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influence the result of the operation.
+   Any new across-vector instructions to the MVE ISA will have to be assessed
+   for inclusion to this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_pattern = PATTERN (insn);
+  if (GET_CODE (insn_pattern) == SET
+      && GET_CODE (XEXP (insn_pattern, 1)) == UNSPEC
+      && (XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAXQ_S))
+    return true;
+  return false;
+}
+
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where they were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected by implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map, rtx_insn *insn,
+   rtx vctp_vpr_generated)
+{
+  bool* temp = NULL;
+  if ((temp = safe_insn_map->get (INSN_UID (insn))))
+    return *temp;
+
+  basic_block body = BLOCK_FOR_INSN (insn);
+  /* The circumstances under which an instruction is affected by "implicit
+     predication" are as follows:
+      * It is an UNPREDICATED_INSN_P:
+	* That loads/stores from/to memory.
+	* Where any one of its operands is an MVE vector from outside the
+	  loop body bb.
+     Or:
+      * Any of its operands, recursively backwards, are affected.  */
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      && (arm_is_mve_load_store_insn (insn)
+	  || (arm_is_mve_across_vector_insn (insn)
+	      && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+    {
+      safe_insn_map->put (INSN_UID (insn), true);
+      return true;
+    }
+
+  df_ref insn_uses = NULL;
+  FOR_EACH_INSN_USE (insn_uses, insn)
+  {
+    /* If the operand is in the input reg set to the basic block,
+       (i.e. it has come from outside the loop!), consider it unsafe if:
+	 * It's being used in an unpredicated insn.
+	 * It is a predicable MVE vector.  */
+    if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	&& VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	&& REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+      {
+	safe_insn_map->put (INSN_UID (insn), true);
+	return true;
+      }
+    /* Scan backwards from the current INSN through the instruction chain
+       until the start of the basic block.  */
+    for (rtx_insn *prev_insn = PREV_INSN (insn);
+	 prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	 prev_insn = PREV_INSN (prev_insn))
+      {
+	/* If a previous insn defines a register that INSN uses, then recurse
+	   in order to check that insn's USEs.
+	   If any of these insns return true as MVE_VPT_UNPREDICATED_INSN_Ps,
+	   then the whole chain is affected by the change in behaviour from
+	   being placed in dlstp/letp loop.  */
+	df_ref prev_insn_defs = NULL;
+	FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	{
+	  if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+	      && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		   (insn, vctp_vpr_generated)
+	      && arm_mve_check_df_chain_back_for_implic_predic
+		  (safe_insn_map, prev_insn, vctp_vpr_generated))
+	    {
+	      safe_insn_map->put (INSN_UID (insn), true);
+	      return true;
+	    }
+	}
+      }
+  }
+  safe_insn_map->put (INSN_UID (insn), false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.  */
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* Bail out if the USE is outside the loop body bb, or if it is
+		 inside, but is a differently-predicated store to memory or
+		 any across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFTRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == IF_THEN_ELSE
+      && REG_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (XEXP (counter_max_last_set, 1), 1), vctp_step);
+
+  if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  if (NONDEBUG_INSN_P (dec_insn))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn.  So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		   vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Find where both of those are modified in the loop body bb.  Check the
+     defs exist before dereferencing them.  */
+  df_ref condcount_def
+    = df_bb_regno_only_def_find (body, REGNO (condcount));
+  df_ref vctp_reg_def = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+  if (!condcount_def || !vctp_reg_def)
+    return NULL;
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_def));
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+  /* Ensure it matches the number of lanes of the vctp instruction.  */
+  if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+    {
+      df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+      /* If we haven't been able to find the decrement, bail out.  */
+      if (!temp)
+	return NULL;
+      dec_insn = DF_REF_INSN (temp);
+    }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case make sure that there is only a simple SET from one to the
+     other inside the loop.  */
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behind a
+	     (set (vctp_reg) (condcount)), so find its DEF inside the loop.  */
+	  df_ref vctp_def = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+	  if (!vctp_def)
+	    return NULL;
+	  rtx vctp_reg_set = PATTERN (DF_REF_INSN (vctp_def));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If it's a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       || GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or known to be non-negative).  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      num_of_elem -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      num_of_elem -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     that it is valid as per arm_target_bb_ok_for_lob and that the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to handle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instructions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the unpredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its output is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (next_use1 && single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (next_use2 && single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementnum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map
+					  = new hash_map<int_hash<int, -1, -2>,
+							 bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+      /* If the insn is predicated on the VPR value generated by the vctp,
+	 emit its unpredicated version to the sequence instead.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	    {
+	      if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+		(safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index ee931ad6ebd..70fade0d0da 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 71e43539616..1401b59dc0b 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2660,6 +2660,9 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2903,6 +2906,8 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 87cbf6c1726..dc4b6301aaa 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6997,7 +6997,7 @@
    (set (reg:SI LR_REGNUM)
 	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
@@ -7017,5 +7017,5 @@
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   "dlstp.<mode1>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..368d5138ca1 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,65 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+      if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatible MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1779,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 6a5b1f8f623..7921bffc169 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -581,6 +581,10 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..f6dbd0515de 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -n)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
      the loop counter register.  Some targets (IA-64) wrap the set of
      the loop counter in an if_then_else too.
 
-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+     2)  (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	{
 	  /* We expect the condition to be of the form (reg != 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 If n == 1, that is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which, also for n == 1, is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
+	For the "elementwise" form where the decrement number isn't -1,
+	the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+       }
 
     return condition;
    }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,11 +713,24 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, the backend may have altered the
+     loop's desc->niter_expr, so only extract that data after the
+     call to gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  count = copy_rtx (desc->niter_expr);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
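Note on the NE/GE distinction in doloop_condition_get: when the counter is decremented by more than one per iteration it can step over zero, so an equality test against the final value is no longer a safe exit condition. The sketch below is a simplified scalar model of that reasoning, not the RTL the patch generates. The function name iterations and the max_iters cap are assumptions made purely for illustration.

```c
#include <stdbool.h>

/* Count how many times a loop body runs when COUNT elements are consumed
   STEP at a time.  If USE_NE is true the loop exits only when the counter
   is exactly zero; otherwise it exits once the counter is no longer
   positive.  MAX_ITERS caps the loop so that the equality test cannot
   spin forever in this model.  */
static int
iterations (int count, int step, bool use_ne, int max_iters)
{
  int iters = 0;
  while (iters < max_iters)
    {
      iters++;
      count -= step;
      if (use_ne ? count == 0 : count <= 0)
	break;
    }
  return iters;
}
```

With a count of 9 and a step of 4 the counter takes the values 5, 1, -3, so the ordered test exits after three iterations while the equality test only stops at the cap. When the count is a multiple of the step, or the step is 1, both tests agree.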
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..5ddd994e53d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for-loop form that decrements to zero.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* This is affected by implicit predication, because vb also
+	   came from an unpredicated load, but there is no functional
+	   problem, because the result is used in a predicated store.  */
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time p1 also changes on every iteration (still fine).  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* As above, but with a scalar op on "res" before the predicated instruction.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The expected number of DLSTPs is currently the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` invocations * 6, plus 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
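Note on the loop shape exercised by these tests: each iteration builds a predicate from the remaining element count and applies it to the loads, the arithmetic and the store, so the final partial vector touches only the elements that are left. The sketch below is a scalar model of that behaviour for 32-bit lanes, assuming a vctp32q-style predicate in which the first min(n, 4) lanes are active. The names predicated_add and model_loop are hypothetical and do not use the MVE intrinsics.

```c
#include <stdint.h>

#define LANES 4

/* Scalar model of one predicated vector add: only the first ACTIVE lanes
   are loaded, added and stored, mirroring a vctp32q-style predicate.  */
static void
predicated_add (const int32_t *a, const int32_t *b, int32_t *c, int active)
{
  for (int lane = 0; lane < LANES && lane < active; lane++)
    c[lane] = a[lane] + b[lane];
}

/* Scalar model of the "while (n > 0)" loop shape used by the tests.  */
static void
model_loop (const int32_t *a, const int32_t *b, int32_t *c, int n)
{
  while (n > 0)
    {
      predicated_add (a, b, c, n);
      a += LANES;
      b += LANES;
      c += LANES;
      n -= LANES;
    }
}
```

A model like this can serve as a reference when checking that elements past n are left untouched, which is what the run tests verify through the check_plus helpers.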
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0cdffb312b3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
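Note on the element counts used in main above: the values 0, 1, 7, 8, 9, 16 and 17 appear chosen to sit on either side of the 8-lane vector boundary. The sketch below computes, for a given count, how many vector iterations are needed and how many lanes remain active in the last one. It uses the same round-up division that test3 and test4 in dlstp-compile-asm.c use for num_iter; the function names vector_iterations and tail_lanes are hypothetical.

```c
/* Number of vector iterations needed to process N elements, LANES at a
   time, or 0 if there is nothing to process.  */
static int
vector_iterations (int n, int lanes)
{
  return n > 0 ? (n + lanes - 1) / lanes : 0;
}

/* Number of active lanes in the final iteration; 0 if the loop never runs.  */
static int
tail_lanes (int n, int lanes)
{
  if (n <= 0)
    return 0;
  int rem = n % lanes;
  return rem ? rem : lanes;
}
```

For 8 lanes this gives one full iteration at n = 8, a second iteration with a single active lane at n = 9, and a third iteration with a single active lane at n = 17.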
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..7ff789d7650
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..8065bd02469
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..552781001e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..c1c40c2fea7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,343 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */


* [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-08-17 10:31 [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops Stamatis Markianos-Wright
@ 2023-09-06 17:19 ` Stamatis Markianos-Wright
  2023-09-14 12:10   ` Kyrylo Tkachov
  2023-10-24 15:11   ` Richard Sandiford
  0 siblings, 2 replies; 17+ messages in thread
From: Stamatis Markianos-Wright @ 2023-09-06 17:19 UTC (permalink / raw)
  To: gcc-patches; +Cc: Kyrylo Tkachov, Richard Earnshaw

[-- Attachment #1: Type: text/plain, Size: 5488 bytes --]

Hi all,

This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.

Mid-end changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
    decrement numbers other than -1 and for `count` to be an
    rtx containing a simple REG (which in this case will contain
    the number of elements to be processed), rather
    than an expression for calculating the number of iterations.
2) Add a new df utility function: `df_bb_regno_only_def_find` that
    will return the DEF of a REG if it is DEF-ed only once within the
    basic block.
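
As a rough illustration of change 1), the following plain C sketch models
a loop whose counter holds the number of elements and is decremented by
the number of lanes each iteration, rather than holding an iteration count
decremented by 1.  It is a scalar model only, not the RTL or the generated
code; the names `model_iterations` and `model_add16` and the constant
`LANES_16` are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

/* Assumption for this sketch: 16-bit elements in a 128-bit vector,
   so 8 lanes per iteration (the dlstp.16 case).  */
enum { LANES_16 = 8 };

/* Iteration count that the classic doloop form would need to compute
   up front: ceil (n / lanes), or 0 when there is nothing to do.  */
int
model_iterations (int n)
{
  return n > 0 ? (n + LANES_16 - 1) / LANES_16 : 0;
}

/* Scalar model of the element-counting loop.  The counter N holds
   elements; only the first min (n, lanes) lanes are active, which is
   what tail predication provides on the final iteration.  Returns the
   number of iterations executed.  */
int
model_add16 (const int16_t *a, const int16_t *b, int16_t *c, int n)
{
  int iterations = 0;
  while (n > 0)
    {
      int active = n < LANES_16 ? n : LANES_16;
      for (int lane = 0; lane < active; lane++)
	c[lane] = (int16_t) (a[lane] + b[lane]);
      a += LANES_16;
      b += LANES_16;
      c += LANES_16;
      n -= LANES_16;	/* Decrement by lanes, not by 1.  */
      iterations++;
    }
  return iterations;
}
```

With n = 17 the model runs three iterations and leaves element 17 of the
output untouched, which matches what the run tests in this patch check
for the real intrinsics.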

Backend changes to implement the above optimisation are:

3)  Implement the `arm_predict_doloop_p` target hook to instruct the
     mid-end about Low Overhead Loops (MVE or not), as well as
     `arm_loop_unroll_adjust` which will prevent unrolling of any loops
     that are valid for becoming MVE Tail-Predicated Low Overhead Loops
     (unrolling can transform a loop in ways that invalidate the dlstp/
     letp transformation logic and the benefit of the dlstp/letp loop
     would be considerably higher than that of unrolling).
4)  Appropriate changes to the define_expand of doloop_end, new
     patterns for dlstp and letp, new iterators, unspecs, etc.
5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
    * `arm_mve_dlstp_check_dec_counter`
    * `arm_mve_dlstp_check_inc_counter`
    * `arm_mve_check_reg_origin_is_num_elems`
    * `arm_mve_check_df_chain_back_for_implic_predic`
    * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
    These all, in some way or another, run checks on the loop
    structure in order to determine if the loop is valid for dlstp/letp
    transformation.
6) `arm_attempt_dlstp_transform`: (called from the define_expand of
     doloop_end) this function re-checks for the loop's suitability for
     dlstp/letp transformation and then implements it, if possible.
7) Various utility functions:
    * `arm_mve_get_vctp_lanes`, to map from vctp unspecs to number
      of lanes
    * `arm_get_required_vpr_reg`, to check an insn to see if it
      requires the VPR or not
    * `arm_mve_get_loop_vctp`
    * `arm_emit_mve_unpredicated_insn_to_seq`
    * `arm_get_required_vpr_reg_param`
    * `arm_get_required_vpr_reg_ret_val`
    * `arm_mve_is_across_vector_insn`
    * `arm_is_mve_load_store_insn`
    * `arm_mve_vec_insn_is_predicated_with_this_predicate`
    * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
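
To make the lane mapping in 7) concrete, here is a small plain C model of
what a vctp produces for a given element size.  It is a sketch under the
assumption of 128-bit MVE vectors and a 16-bit predicate with one bit per
byte of the vector; `model_vctp_lanes` and `model_vctp` are hypothetical
names, not functions from the patch, and the code works on element sizes
rather than on the unspecs and machine modes that `arm_mve_get_vctp_lanes`
inspects.

```c
#include <stdint.h>

/* Number of lanes in a 128-bit vector for the given element size in
   bits.  Returns 0 for sizes that have no vctp form.  */
int
model_vctp_lanes (int elem_bits)
{
  switch (elem_bits)
    {
    case 8:
    case 16:
    case 32:
    case 64:
      return 128 / elem_bits;
    default:
      return 0;
    }
}

/* Model of the predicate for N remaining elements: the first
   min (n, lanes) lanes are active, and every active lane sets one
   predicate bit per byte of the element.  */
uint16_t
model_vctp (int elem_bits, uint32_t n)
{
  int lanes = model_vctp_lanes (elem_bits);
  if (lanes == 0)
    return 0;
  uint32_t active = n < (uint32_t) lanes ? n : (uint32_t) lanes;
  uint32_t bytes = active * (uint32_t) (elem_bits / 8);
  /* BYTES is at most 16, so the shift is well defined on 32 bits.  */
  return (uint16_t) ((1u << bytes) - 1u);
}
```

For example, three remaining 16-bit elements give the mask 0x003f, while
any count of eight or more gives 0xffff.  That all-lanes-active case is
why only the final iteration of a dlstp/letp loop is affected by the
tail predicate.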

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

     * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
     (arm_target_bb_ok_for_lob): ...this
     (arm_attempt_dlstp_transform): New.
     * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
     (TARGET_PREDICT_DOLOOP_P): New.
     (arm_block_set_vect):
     (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
     (arm_target_bb_ok_for_lob): New.
     (arm_mve_get_vctp_lanes): New.
     (arm_get_required_vpr_reg): New.
     (arm_get_required_vpr_reg_param): New.
     (arm_get_required_vpr_reg_ret_val): New.
     (arm_mve_get_loop_vctp): New.
     (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
     (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
     (arm_mve_check_df_chain_back_for_implic_predic): New.
     (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
     (arm_mve_check_reg_origin_is_num_elems): New.
     (arm_mve_dlstp_check_inc_counter): New.
     (arm_mve_dlstp_check_dec_counter): New.
     (arm_mve_loop_valid_for_dlstp): New.
     (arm_mve_is_across_vector_insn): New.
     (arm_is_mve_load_store_insn): New.
     (arm_predict_doloop_p): New.
     (arm_loop_unroll_adjust): New.
     (arm_emit_mve_unpredicated_insn_to_seq): New.
     (arm_attempt_dlstp_transform): New.
     * config/arm/iterators.md (DLSTP): New.
     (mode1): Add DLSTP mappings.
     * config/arm/mve.md (*predicated_doloop_end_internal): New.
     (dlstp<mode1>_insn): New.
     * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
     * config/arm/unspecs.md: New unspecs.
     * df-core.cc (df_bb_regno_only_def_find): New.
     * df.h (df_bb_regno_only_def_find): New.
     * loop-doloop.cc (doloop_condition_get): Relax conditions.
     (doloop_optimize): Add support for elementwise LoLs.

gcc/testsuite/ChangeLog:

     * gcc.target/arm/lob.h: Update framework.
     * gcc.target/arm/lob1.c: Likewise.
     * gcc.target/arm/lob6.c: Likewise.
     * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
     * gcc.target/arm/mve/dlstp-int16x8.c: New test.
     * gcc.target/arm/mve/dlstp-int32x4.c: New test.
     * gcc.target/arm/mve/dlstp-int64x2.c: New test.
     * gcc.target/arm/mve/dlstp-int8x16.c: New test.
     * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

[-- Attachment #2: 2.patch --]
[-- Type: text/x-patch, Size: 105189 bytes --]

commit 8564dee09c1258c388094abd614f311e60723368
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other than -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Add a new df utility function: `df_bb_regno_only_def_find` that
       will return the DEF of a REG if it is DEF-ed only once within the
       basic block.
    
    Backend changes to implement the above optimisation are:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail-Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp transformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling).
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators, unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       These all, in some way or another, run checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions:
       * `arm_mve_get_vctp_lanes`, to map from vctp unspecs to number
         of lanes
       * `arm_get_required_vpr_reg`, to check an insn to see if it
         requires the VPR or not
       * `arm_mve_get_loop_vctp`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_mve_is_across_vector_insn`
       * `arm_is_mve_load_store_insn`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
            (arm_target_bb_ok_for_lob): New.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_mve_is_across_vector_insn): New.
            (arm_is_mve_load_store_insn): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 77e76336e94..74186930f0b 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 6e933c80183..39d97ba5e4d 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34416,19 +34422,1096 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is a VPR-generating instruction, like a
+	 vctp, vcmp, etc., or it is a VPT-predicated instruction.  Return the
+	 subrtx of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least one vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is an MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is an MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give only one scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  extract_insn (insn);
+  int n_operands = recog_data.n_operands;
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+	return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influence the result of the operation.
+   Any new across-vector instructions added to the MVE ISA will have to be
+   assessed for inclusion in this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_pattern = PATTERN (insn);
+  if (GET_CODE (insn_pattern) == SET
+      && GET_CODE (XEXP (insn_pattern, 1)) == UNSPEC
+      && (XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAXQ_S))
+    return true;
+  return false;
+}
+
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where they were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected by implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map, rtx_insn *insn,
+   rtx vctp_vpr_generated)
+{
+  bool* temp = NULL;
+  if ((temp = safe_insn_map->get (INSN_UID (insn))))
+    return *temp;
+
+  basic_block body = BLOCK_FOR_INSN (insn);
+  /* The circumstances under which an instruction is affected by "implicit
+     predication" are as follows:
+      * It is an UNPREDICATED_INSN_P:
+	* That loads/stores from/to memory.
+	* That is an across-vector insn not on the allowed list above.
+	* Where any one of its operands is an MVE vector from outside the
+	  loop body bb.
+     Or:
+      * Any of its operands, recursively backwards, are affected.  */
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      && (arm_is_mve_load_store_insn (insn)
+	  || (arm_is_mve_across_vector_insn (insn)
+	      && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+    {
+      safe_insn_map->put (INSN_UID (insn), true);
+      return true;
+    }
+
+  df_ref insn_uses = NULL;
+  FOR_EACH_INSN_USE (insn_uses, insn)
+  {
+    /* If the operand is in the input reg set to the basic block,
+       (i.e. it has come from outside the loop!), consider it unsafe if:
+	 * It's being used in an unpredicated insn.
+	 * It is a predicable MVE vector.  */
+    if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	&& VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	&& REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+      {
+	safe_insn_map->put (INSN_UID (insn), true);
+	return true;
+      }
+    /* Scan backwards from the current INSN through the instruction chain
+       until the start of the basic block.  */
+    for (rtx_insn *prev_insn = PREV_INSN (insn);
+	 prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	 prev_insn = PREV_INSN (prev_insn))
+      {
+	/* If a previous insn defines a register that INSN uses, then recurse
+	   in order to check that insn's USEs.
+	   If any of these insns return true as MVE_VPT_UNPREDICATED_INSN_Ps,
+	   then the whole chain is affected by the change in behaviour from
+	   being placed in a dlstp/letp loop.  */
+	df_ref prev_insn_defs = NULL;
+	FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	{
+	  if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+	      && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		   (insn, vctp_vpr_generated)
+	      && arm_mve_check_df_chain_back_for_implic_predic
+		  (safe_insn_map, prev_insn, vctp_vpr_generated))
+	    {
+	      safe_insn_map->put (INSN_UID (insn), true);
+	      return true;
+	    }
+	}
+      }
+  }
+  safe_insn_map->put (INSN_UID (insn), false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.  */
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is an unpredicated or differently-predicated store to
+     memory or across-vector insn, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* Bail out if the USE is outside the loop body bb, or if it is
+		 inside, but is a differently-predicated store to memory or
+		 any across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFTRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
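+  /* Bail out if we ran out of basic blocks without finding a DEF.  */
+  if (!pre_loop_bb || !BB_END (pre_loop_bb))
+    return false;
+
+  df_ref counter_max_last_def
+    = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));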
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == IF_THEN_ELSE
+      && REG_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (XEXP (counter_max_last_set, 1), 1),
+	      vctp_step);
+
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  if (NONDEBUG_INSN_P (dec_insn))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		   vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Find where both of those are modified in the loop body bb.  */
+  df_ref condcount_reg_set_df
+    = df_bb_regno_only_def_find (body, REGNO (condcount));
+  df_ref vctp_reg_set_df
+    = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+  if (!condcount_reg_set_df || !vctp_reg_set_df)
+    return NULL;
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+  /* Ensure it matches the number of lanes of the vctp instruction.  */
+  if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  rtx dec_pat = PATTERN (dec_insn);
+  if (GET_CODE (dec_pat) == SET
+      && REG_P (SET_DEST (dec_pat))
+      && GET_CODE (SET_SRC (dec_pat)) == PLUS
+      && REG_P (XEXP (SET_SRC (dec_pat), 0))
+      && CONST_INT_P (XEXP (SET_SRC (dec_pat), 1))
+      && REGNO (SET_DEST (dec_pat)) == REGNO (XEXP (SET_SRC (dec_pat), 0))
+      && REGNO (SET_DEST (dec_pat)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (SET_SRC (dec_pat), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount,
+						  vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case make sure that there is only a simple SET from one to the
+     other inside the loop.  */
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behind a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     vctp_reg is DEF'd inside the loop.  */
+	  df_ref vctp_reg_set_df
+	    = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+	  if (!vctp_reg_set_df)
+	    return NULL;
+	  rtx vctp_reg_set = PATTERN (DF_REF_INSN (vctp_reg_set_df));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If it's a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       || GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows it's not
+	 negative).  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem);
+	      num_of_elem -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes;
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem);
+	      num_of_elem -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!loop_cond || !INSN_P (loop_cond)
+      || GET_CODE (PATTERN (loop_cond)) != SET
+      || !cc_register (SET_DEST (PATTERN (loop_cond)), VOIDmode)
+      || GET_CODE (SET_SRC (PATTERN (loop_cond))) != COMPARE)
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      if (!iv_analyze (loop_cond,
+		       as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		       cond_arg_2, &cond_counter_iv))
+	return NULL;
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      if (!iv_analyze (loop_cond,
+		       as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		       cond_arg_1, &cond_counter_iv))
+	return NULL;
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      if (!iv_analyze (loop_cond,
+		       as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		       cond_arg_1, &cond_counter_iv)
+	  || !iv_analyze (loop_cond,
+			  as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+			  cond_arg_2, &cond_temp_iv))
+	return NULL;
+
+      if (!CONST_INT_P (cond_counter_iv.step)
+	  || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (!CONST_INT_P (cond_counter_iv.step))
+    return NULL;
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     that it is valid as per arm_target_bb_ok_for_lob and that the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to handle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instructions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the unpredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its output is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the contents of the loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementnum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map
+					  = new hash_map<int_hash<int, -1, -2>,
+							 bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+      /* If the insn pattern requires the use of the VPR value from the
+	 vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+		(safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index ee931ad6ebd..70fade0d0da 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 71e43539616..1401b59dc0b 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2660,6 +2660,9 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2903,6 +2906,8 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 87cbf6c1726..dc4b6301aaa 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6997,7 +6997,7 @@
    (set (reg:SI LR_REGNUM)
 	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
@@ -7017,5 +7017,5 @@
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   "dlstp.<mode1>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..368d5138ca1 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,65 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+      if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatible MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1779,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 6a5b1f8f623..7921bffc169 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -581,6 +581,10 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..f6dbd0515de 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -n)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
      the loop counter register.  Some targets (IA-64) wrap the set of
      the loop counter in an if_then_else too.
 
-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+     2)  (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	{
 	  /* We expect the condition to be of the form (reg != 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 If n == 1, that is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	which, also for n == 1, is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
+	For the "elementwise" form where the decrement number isn't -1,
+	the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+       }
 
     return condition;
    }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,11 +713,24 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  count = copy_rtx (desc->niter_expr);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..5ddd994e53d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for-loop format of decrementing to zero.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time p1 also changes on every iteration (still fine).  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Same as above, but with a scalar op on res before the predicated op.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0cdffb312b3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..7ff789d7650
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..8065bd02469
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..552781001e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..c1c40c2fea7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,343 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */

* RE: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-09-06 17:19 ` [PING][PATCH " Stamatis Markianos-Wright
@ 2023-09-14 12:10   ` Kyrylo Tkachov
  2023-09-28 12:51     ` Andre Vieira (lists)
  2023-10-24 15:11   ` Richard Sandiford
  1 sibling, 1 reply; 17+ messages in thread
From: Kyrylo Tkachov @ 2023-09-14 12:10 UTC (permalink / raw)
  To: Stam Markianos-Wright, gcc-patches; +Cc: Richard Earnshaw, jlaw

Hi Stam,

> -----Original Message-----
> From: Stam Markianos-Wright <Stam.Markianos-Wright@arm.com>
> Sent: Wednesday, September 6, 2023 6:19 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>
> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low
> Overhead Loops
> 
> Hi all,
> 
> This is the 2/2 patch that contains the functional changes needed
> for MVE Tail Predicated Low Overhead Loops.  See my previous email
> for a general introduction to MVE LOLs.
> 
> This support is added through the already existing loop-doloop
> mechanisms that are used for non-MVE dls/le looping.
> 
> Mid-end changes are:
> 
> 1) Relax the loop-doloop mechanism in the mid-end to allow for
>     decrement numbers other than -1 and for `count` to be an
>     rtx containing a simple REG (which in this case will contain
>     the number of elements to be processed), rather
>     than an expression for calculating the number of iterations.
> 2) Added a new df utility function: `df_bb_regno_only_def_find` that
>     will return the DEF of a REG if it is DEF-ed only once within the
>     basic block.
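
The "only DEF within the basic block" idea in 2) can be illustrated with
a toy model.  This is a minimal sketch and not the patch's implementation:
`struct toy_insn` and `toy_only_def_find` are hypothetical names, and a
basic block is modelled as a plain array of instructions that each write
at most one register, rather than through GCC's df data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Toy instruction (hypothetical): it defines at most one register,
   identified by number, or -1 if it defines none.  */
struct toy_insn
{
  int def_regno;
};

/* Return the index of the single instruction in INSNS[0..N) that defines
   REGNO, or -1 if REGNO is defined zero times or more than once.  */
static int
toy_only_def_find (const struct toy_insn *insns, size_t n, int regno)
{
  assert (insns != NULL || n == 0);

  int found = -1;
  for (size_t i = 0; i < n; i++)
    {
      if (insns[i].def_regno != regno)
        continue;
      if (found != -1)
        return -1;	/* Second definition seen: not a unique DEF.  */
      found = (int) i;
    }
  return found;
}
```

A caller would treat a result of -1 as "cannot reason about this register
in this block" and give up on whatever analysis needed the unique DEF.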
> 
> Back-end changes to implement the above optimisation are:
> 
> 3)  Implement the `arm_predict_doloop_p` target hook to instruct the
>      mid-end about Low Overhead Loops (MVE or not), as well as
>      `arm_loop_unroll_adjust` which will prevent unrolling of any loops
>      that are valid for becoming MVE Tail_Predicated Low Overhead Loops
>      (unrolling can transform a loop in ways that invalidate the dlstp/
>      letp transformation logic and the benefit of the dlstp/letp loop
>      would be considerably higher than that of unrolling)
> 4)  Appropriate changes to the define_expand of doloop_end, new
>      patterns for dlstp and letp, new iterators,  unspecs, etc.
> 5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
>     * `arm_mve_dlstp_check_dec_counter`
>     * `arm_mve_dlstp_check_inc_counter`
>     * `arm_mve_check_reg_origin_is_num_elems`
>     * `arm_mve_check_df_chain_back_for_implic_predic`
>     * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
>     These all, in some way or another, run checks on the loop
>     structure in order to determine if the loop is valid for dlstp/letp
>     transformation.
> 6) `arm_attempt_dlstp_transform`: (called from the define_expand of
>      doloop_end) this function re-checks for the loop's suitability for
>      dlstp/letp transformation and then implements it, if possible.
> 7) Various utility functions:
>     *`arm_mve_get_vctp_lanes` to map
>     from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
>     to check an insn to see if it requires the VPR or not.
>     * `arm_mve_get_loop_vctp`
>     * `arm_mve_get_vctp_lanes`
>     * `arm_emit_mve_unpredicated_insn_to_seq`
>     * `arm_get_required_vpr_reg`
>     * `arm_get_required_vpr_reg_param`
>     * `arm_get_required_vpr_reg_ret_val`
>     * `arm_mve_is_across_vector_insn`
>     * `arm_is_mve_load_store_insn`
>     * `arm_mve_vec_insn_is_predicated_with_this_predicate`
>     * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
> 
> No regressions on arm-none-eabi with various targets and on
> aarch64-none-elf. Thoughts on getting this into trunk?

The arm parts look sensible but we'd need review for the df-core.h and df-core.cc changes.
Maybe Jeff can help or can recommend someone to take a look?
Thanks,
Kyrill

> 
> Thank you,
> Stam Markianos-Wright
> 
> gcc/ChangeLog:
> 
>      * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
>      (arm_target_bb_ok_for_lob): ...this
>      (arm_attempt_dlstp_transform): New.
>      * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
>      (TARGET_PREDICT_DOLOOP_P): New.
>      (arm_block_set_vect):
>      (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
>      (arm_target_bb_ok_for_lob): New.
>      (arm_mve_get_vctp_lanes): New.
>      (arm_get_required_vpr_reg): New.
>      (arm_get_required_vpr_reg_param): New.
>      (arm_get_required_vpr_reg_ret_val): New.
>      (arm_mve_get_loop_vctp): New.
>      (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
>      (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
>      (arm_mve_check_df_chain_back_for_implic_predic): New.
>      (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
>      (arm_mve_check_reg_origin_is_num_elems): New.
>      (arm_mve_dlstp_check_inc_counter): New.
>      (arm_mve_dlstp_check_dec_counter): New.
>      (arm_mve_loop_valid_for_dlstp): New.
>      (arm_mve_is_across_vector_insn): New.
>      (arm_is_mve_load_store_insn): New.
>      (arm_predict_doloop_p): New.
>      (arm_loop_unroll_adjust): New.
>      (arm_emit_mve_unpredicated_insn_to_seq): New.
>      (arm_attempt_dlstp_transform): New.
>          * config/arm/iterators.md (DLSTP): New.
>          (mode1): Add DLSTP mappings.
>          * config/arm/mve.md (*predicated_doloop_end_internal): New.
>          (dlstp<mode1>_insn): New.
>          * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>          * config/arm/unspecs.md: New unspecs.
>      * df-core.cc (df_bb_regno_only_def_find): New.
>      * df.h (df_bb_regno_only_def_find): New.
>          * loop-doloop.cc (doloop_condition_get): Relax conditions.
>          (doloop_optimize): Add support for elementwise LoLs.
> 
> gcc/testsuite/ChangeLog:
> 
>          * gcc.target/arm/lob.h: Update framework.
>          * gcc.target/arm/lob1.c: Likewise.
>          * gcc.target/arm/lob6.c: Likewise.
>      * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
>      * gcc.target/arm/mve/dlstp-int16x8.c: New test.
>      * gcc.target/arm/mve/dlstp-int32x4.c: New test.
>      * gcc.target/arm/mve/dlstp-int64x2.c: New test.
>      * gcc.target/arm/mve/dlstp-int8x16.c: New test.
>      * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-09-14 12:10   ` Kyrylo Tkachov
@ 2023-09-28 12:51     ` Andre Vieira (lists)
  2023-10-11 11:34       ` Stamatis Markianos-Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Andre Vieira (lists) @ 2023-09-28 12:51 UTC (permalink / raw)
  To: Kyrylo Tkachov, Stam Markianos-Wright, gcc-patches; +Cc: Richard Earnshaw, jlaw

Hi,

On 14/09/2023 13:10, Kyrylo Tkachov via Gcc-patches wrote:
> Hi Stam,
> 

> 
> The arm parts look sensible but we'd need review for the df-core.h and df-core.cc changes.
> Maybe Jeff can help or can recommend someone to take a look?
> Thanks,
> Kyrill
> 

FWIW the changes LGTM. If we don't want these in df-core we can always 
implement the extra utility locally. It's really just a helper function 
to check if df_bb_regno_first_def_find and df_bb_regno_last_def_find 
yield the same result, meaning we only have a single definition.
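
To illustrate the first-def/last-def comparison, here is a minimal standalone sketch. It does not use GCC's df API: `Block`, `first_def`, `last_def` and `only_def` are hypothetical stand-ins, and a "block" is modelled as just the list of register numbers defined by its instructions, in order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a basic block: entry I is the register number
// defined by the I-th instruction, in program order.
using Block = std::vector<unsigned>;

// Index of the first definition of REGNO in BB, or -1 if there is none.
static long
first_def (const Block &bb, unsigned regno)
{
  for (std::size_t i = 0; i < bb.size (); ++i)
    if (bb[i] == regno)
      return static_cast<long> (i);
  return -1;
}

// Index of the last definition of REGNO in BB, or -1 if there is none.
static long
last_def (const Block &bb, unsigned regno)
{
  for (std::size_t i = bb.size (); i-- > 0;)
    if (bb[i] == regno)
      return static_cast<long> (i);
  return -1;
}

// Index of the one and only definition of REGNO in BB.  Returns -1 if
// there is no definition or if there is more than one.
static long
only_def (const Block &bb, unsigned regno)
{
  long first = first_def (bb, regno);
  if (first < 0)
    return -1;
  return first == last_def (bb, regno) ? first : -1;
}
```

With this model, a block that defines registers 1, 2, 3 has a single definition of register 2, while a block that defines 1, 2, 2 does not.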

Kind regards,
Andre

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-09-28 12:51     ` Andre Vieira (lists)
@ 2023-10-11 11:34       ` Stamatis Markianos-Wright
  2023-10-23 10:16         ` Andre Vieira (lists)
  0 siblings, 1 reply; 17+ messages in thread
From: Stamatis Markianos-Wright @ 2023-10-11 11:34 UTC (permalink / raw)
  To: Andre Vieira (lists), Kyrylo Tkachov, gcc-patches; +Cc: Richard Earnshaw, jlaw

Hi all,

On 28/09/2023 13:51, Andre Vieira (lists) wrote:
> Hi,
>
> On 14/09/2023 13:10, Kyrylo Tkachov via Gcc-patches wrote:
>> Hi Stam,
>>
>
>>
>> The arm parts look sensible but we'd need review for the df-core.h 
>> and df-core.cc changes.
>> Maybe Jeff can help or can recommend someone to take a look?

Just thought I'd do a follow-up "ping" on this :)


>> Thanks,
>> Kyrill
>>
>
> FWIW the changes LGTM, if we don't want these in df-core we can always 
> implement the extra utility locally. It's really just a helper 
> function to check if df_bb_regno_first_def_find and 
> df_bb_regno_last_def_find yield the same result, meaning we only have 
> a single definition.
>
> Kind regards,
> Andre

Thanks,

Stam


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-10-11 11:34       ` Stamatis Markianos-Wright
@ 2023-10-23 10:16         ` Andre Vieira (lists)
  0 siblings, 0 replies; 17+ messages in thread
From: Andre Vieira (lists) @ 2023-10-23 10:16 UTC (permalink / raw)
  To: Stamatis Markianos-Wright, Kyrylo Tkachov, gcc-patches
  Cc: Richard Earnshaw, jlaw

Ping for Jeff or another global maintainer to review the target-agnostic 
bits of this, namely:
loop-doloop.cc
df-core.cc and df.h

I do have a nitpick myself that I missed last time around:
  	  /* We expect the condition to be of the form (reg != 0)  */
  	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
  	    return 0;
  	}
Could do with updating the comment to reflect allowing >= now. But happy 
for you to change this once approved by a maintainer.

Kind regards,
Andre

On 11/10/2023 12:34, Stamatis Markianos-Wright wrote:
> Hi all,
> 
> On 28/09/2023 13:51, Andre Vieira (lists) wrote:
>> Hi,
>>
>> On 14/09/2023 13:10, Kyrylo Tkachov via Gcc-patches wrote:
>>> Hi Stam,
>>>
>>
>>>
>>> The arm parts look sensible but we'd need review for the df-core.h 
>>> and df-core.cc changes.
>>> Maybe Jeff can help or can recommend someone to take a look?
> 
> Just thought I'd do a follow-up "ping" on this :)
> 
> 
>>> Thanks,
>>> Kyrill
>>>
>>
>> FWIW the changes LGTM, if we don't want these in df-core we can always 
>> implement the extra utility locally. It's really just a helper 
>> function to check if df_bb_regno_first_def_find and 
>> df_bb_regno_last_def_find yield the same result, meaning we only have 
>> a single definition.
>>
>> Kind regards,
>> Andre
> 
> Thanks,
> 
> Stam
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-09-06 17:19 ` [PING][PATCH " Stamatis Markianos-Wright
  2023-09-14 12:10   ` Kyrylo Tkachov
@ 2023-10-24 15:11   ` Richard Sandiford
  2023-11-06 11:03     ` Stamatis Markianos-Wright
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Sandiford @ 2023-10-24 15:11 UTC (permalink / raw)
  To: Stamatis Markianos-Wright via Gcc-patches
  Cc: Stamatis Markianos-Wright, Richard Earnshaw

Sorry for the slow review.  I had a look at the arm bits too, to get
some context for the target-independent bits.

Stamatis Markianos-Wright via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> [...]
> diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
> index 77e76336e94..74186930f0b 100644
> --- a/gcc/config/arm/arm-protos.h
> +++ b/gcc/config/arm/arm-protos.h
> @@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
>  extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
>  extern bool arm_q_bit_access (void);
>  extern bool arm_ge_bits_access (void);
> -extern bool arm_target_insn_ok_for_lob (rtx);
> -
> +extern bool arm_target_bb_ok_for_lob (basic_block);
> +extern rtx arm_attempt_dlstp_transform (rtx);
>  #ifdef RTX_CODE
>  enum reg_class
>  arm_mode_base_reg_class (machine_mode);
> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
> index 6e933c80183..39d97ba5e4d 100644
> --- a/gcc/config/arm/arm.cc
> +++ b/gcc/config/arm/arm.cc
> @@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[]
> [...]
> +/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
> +   something only if the VPR reg is an input operand to the insn.  */
> +
> +static rtx
> +ALWAYS_INLINE

Probably best to leave out the ALWAYS_INLINE.  That's generally only
appropriate for things that need to be inlined for correctness.

> +arm_get_required_vpr_reg_param (rtx_insn *insn)
> +{
> +  return arm_get_required_vpr_reg (insn, 1);
> +}
> [...]
> +/* Recursively scan through the DF chain backwards within the basic block and
> +   determine if any of the USEs of the original insn (or the USEs of the insns
> +   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
> +   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
> +   This function returns true if the insn is affected implicit predication
> +   and false otherwise.
> +   Having such implicit predication on an unpredicated insn wouldn't in itself
> +   block tail predication, because the output of that insn might then be used
> +   in a correctly predicated store insn, where the disabled lanes will be
> +   ignored.  To verify this we later call:
> +   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
> +   DF chains forward to see if any implicitly-predicated operand gets used in
> +   an improper way.  */
> +
> +static bool
> +arm_mve_check_df_chain_back_for_implic_predic
> +  (hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map, rtx_insn *insn,
> +   rtx vctp_vpr_generated)
> +{
> +  bool* temp = NULL;
> +  if ((temp = safe_insn_map->get (INSN_UID (insn))))
> +    return *temp;
> +
> +  basic_block body = BLOCK_FOR_INSN (insn);
> +  /* The circumstances under which an instruction is affected by "implicit
> +     predication" are as follows:
> +      * It is an UNPREDICATED_INSN_P:
> +	* That loads/stores from/to memory.
> +	* Where any one of its operands is an MVE vector from outside the
> +	  loop body bb.
> +     Or:
> +      * Any of it's operands, recursively backwards, are affected.  */
> +  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
> +      && (arm_is_mve_load_store_insn (insn)
> +	  || (arm_is_mve_across_vector_insn (insn)
> +	      && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
> +    {
> +      safe_insn_map->put (INSN_UID (insn), true);
> +      return true;
> +    }
> +
> +  df_ref insn_uses = NULL;
> +  FOR_EACH_INSN_USE (insn_uses, insn)
> +  {
> +    /* If the operand is in the input reg set to the the basic block,
> +       (i.e. it has come from outside the loop!), consider it unsafe if:
> +	 * It's being used in an unpredicated insn.
> +	 * It is a predicable MVE vector.  */
> +    if (MVE_VPT_UNPREDICATED_INSN_P (insn)
> +	&& VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
> +	&& REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
> +      {
> +	safe_insn_map->put (INSN_UID (insn), true);
> +	return true;
> +      }
> +    /* Scan backwards from the current INSN through the instruction chain
> +       until the start of the basic block.  */
> +    for (rtx_insn *prev_insn = PREV_INSN (insn);
> +	 prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
> +	 prev_insn = PREV_INSN (prev_insn))
> +      {
> +	/* If a previous insn defines a register that INSN uses, then recurse
> +	   in order to check that insn's USEs.
> +	   If any of these insns return true as MVE_VPT_UNPREDICATED_INSN_Ps,
> +	   then the whole chain is affected by the change in behaviour from
> +	   being placed in dlstp/letp loop.  */
> +	df_ref prev_insn_defs = NULL;
> +	FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
> +	{
> +	  if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
> +	      && !arm_mve_vec_insn_is_predicated_with_this_predicate
> +		   (insn, vctp_vpr_generated)
> +	      && arm_mve_check_df_chain_back_for_implic_predic
> +		  (safe_insn_map, prev_insn, vctp_vpr_generated))
> +	    {
> +	      safe_insn_map->put (INSN_UID (insn), true);
> +	      return true;
> +	    }
> +	}
> +      }
> +  }
> +  safe_insn_map->put (INSN_UID (insn), false);
> +  return false;
> +}

It looks like the recursion depth here is proportional to the length
of the longest insn-to-insn DU chain.  That could be a problem for
pathologically large loops.  Would it be possible to restructure
this to use a worklist instead?
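
As a rough illustration of the worklist idea, and not a drop-in replacement for the function above, the backwards walk could be shaped like the standalone sketch below. The `Insn` struct and `affected_by_implicit_predication` are hypothetical names, instructions are identified by index, and the sketch leaves out the memoisation map and the "predicated with this predicate" cut-off that the real function has.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical model of an instruction in the loop body.
struct Insn
{
  bool directly_affected;         // e.g. an unpredicated load/store
  std::vector<int> operand_defs;  // indices of the insns defining its operands
};

// Return true if insn START, or anything it transitively depends on, is
// directly affected.  The explicit worklist keeps the native stack depth
// constant regardless of how long the dependency chain is.
static bool
affected_by_implicit_predication (const std::vector<Insn> &insns, int start)
{
  std::vector<int> worklist{start};
  std::unordered_set<int> visited{start};
  while (!worklist.empty ())
    {
      int cur = worklist.back ();
      worklist.pop_back ();
      if (insns[cur].directly_affected)
	return true;
      // Queue every defining insn we have not looked at yet.
      for (int def : insns[cur].operand_defs)
	if (visited.insert (def).second)
	  worklist.push_back (def);
    }
  return false;
}
```

The visited set plays the role that the per-insn map plays in the recursive version: each instruction is examined at most once.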

> [...]
> +/* If we have identified the loop to have an incrementing counter, we need to
> +   make sure that it increments by 1 and that the loop is structured correctly:
> +    * The counter starts from 0
> +    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
> +    * The vctp insn uses a reg that decrements appropriately in each iteration.
> +*/
> +
> +static rtx_insn*
> +arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
> +				 rtx condconst, rtx condcount)
> +{
> +  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
> +  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
> +     user applications, none of those with incrementing counters had any real
> +     insns in the loop latch.  As such, this function has only been tested with
> +     an empty latch and may misbehave or ICE if we somehow get here with an
> +     increment in the latch, so, for correctness, error out early.  */
> +  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
> +  if (NONDEBUG_INSN_P (dec_insn))
> +    return NULL;

Could this use empty_block_p instead?  It would avoid hard-coding the
assumption that BB_END is not a debug instruction.
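
To spell out the concern with a toy model (hypothetical types, not GCC's rtl or the actual implementation of empty_block_p): a check that only looks at the last insn of the block can be fooled when that insn happens to be a debug insn, whereas a whole-block check cannot.

```cpp
#include <cassert>
#include <vector>

// Hypothetical instruction kinds for the model.
enum class Kind { Real, Debug, Note };

using Block = std::vector<Kind>;

// Modelled on the quoted check: only the last insn of the block is tested.
static bool
last_insn_is_not_real (const Block &bb)
{
  return bb.empty () || bb.back () != Kind::Real;
}

// Modelled on the suggestion: the block counts as empty only if it
// contains no real insn anywhere.
static bool
block_has_no_real_insns (const Block &bb)
{
  for (Kind k : bb)
    if (k == Kind::Real)
      return false;
  return true;
}
```

For a latch modelled as a real insn followed by a debug insn, the first check reports "empty" and the second does not.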

> +
> +  class rtx_iv vctp_reg_iv;
> +  /* For loops of type B) the loop counter is independent of the decrement
> +     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
> +     has to succeed for such loops to be supported.  */
> +  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
> +      vctp_reg, &vctp_reg_iv))
> +    return NULL;
> +
> +  /* Find where both of those are modified in the loop body bb.  */
> +  rtx condcount_reg_set
> +	= PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
> +				 (body, REGNO (condcount))));
> +  rtx vctp_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
> +					    (body, REGNO (vctp_reg))));
> +  if (!vctp_reg_set || !condcount_reg_set)
> +    return NULL;

It looks like these should be testing whether df_bb_regno_only_def_find
returns null instead.  If it does, the DF_REF_INSN will segfault.
If it doesn't, the rest will succeed.

> +
> +  if (REG_P (condcount) && REG_P (condconst))
> +    {
> +      /* First we need to prove that the loop is going 0..condconst with an
> +	 inc of 1 in each iteration.  */
> +      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
> +	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
> +	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
> +	{
> +	    rtx counter_reg = XEXP (condcount_reg_set, 0);
> +	    /* Check that the counter did indeed start from zero.  */
> +	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
> +	    if (!this_set)
> +	      return NULL;
> +	    df_ref last_set = DF_REF_NEXT_REG (this_set);
> +	    if (!last_set)
> +	      return NULL;
> +	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
> +	    if (!single_set (last_set_insn))
> +	      return NULL;
> +	    rtx counter_orig_set;
> +	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
> +	    if (!CONST_INT_P (counter_orig_set)
> +		|| (INTVAL (counter_orig_set) != 0))
> +	      return NULL;
> +	    /* And finally check that the target value of the counter,
> +	       condconst is of the correct shape.  */
> +	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
> +							vctp_reg_iv.step))
> +	      return NULL;
> +	}
> +      else
> +	return NULL;
> +    }
> +  else
> +    return NULL;
> +
> +  /* Extract the decrementnum of the vctp reg.  */
> +  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
> +  /* Ensure it matches the number of lanes of the vctp instruction.  */
> +  if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
> +    return NULL;
> +
> +  /* Everything looks valid.  */
> +  return vctp_insn;
> +}

One of the main reasons for reading the arm bits was to try to answer
the question: if we switch to a downcounting loop with a GE condition,
how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?
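
To make the concern concrete, here is a small standalone model. It is only a sketch of the arithmetic, not of the RTL the patch generates: the function names are hypothetical, the counter is assumed to be 32 bits wide, and the loop is assumed to continue while the remaining element count is positive when viewed as signed.

```cpp
#include <cassert>
#include <cstdint>

// True if X is > 0 when its 32 bits are reinterpreted as a signed value.
static bool
positive_as_signed (std::uint32_t x)
{
  return x != 0 && x <= 0x7fffffffu;
}

// Iterations executed by a loop that counts ELEMS down in steps of LANES
// and uses a signed comparison against zero to decide whether to continue.
static std::uint64_t
iterations_signed_test (std::uint32_t elems, std::uint32_t lanes)
{
  std::uint64_t iters = 0;
  std::uint32_t remaining = elems;
  while (positive_as_signed (remaining))
    {
      ++iters;
      remaining -= lanes;  // unsigned arithmetic, so wrapping is well defined
    }
  return iters;
}

// Iterations the original up-counting loop with an unsigned bound would
// execute: ceil (ELEMS / LANES).
static std::uint64_t
iterations_expected (std::uint32_t elems, std::uint32_t lanes)
{
  return (static_cast<std::uint64_t> (elems) + lanes - 1) / lanes;
}
```

In this model the two functions agree for element counts up to INT_MAX, but for a count of 0x80000000 the signed test runs zero iterations while the unsigned count calls for over half a billion.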

[I didn't look at the Arm parts much beyond this point]

> [...]
> diff --git a/gcc/df-core.cc b/gcc/df-core.cc
> index d4812b04a7c..4fcc14bf790 100644
> --- a/gcc/df-core.cc
> +++ b/gcc/df-core.cc
> @@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
>    return NULL;
>  }
>  
> +/* Return the one and only def of REGNO within BB.  If there is no def or
> +   there are multiple defs, return NULL.  */
> +
> +df_ref
> +df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
> +{
> +  df_ref temp = df_bb_regno_first_def_find (bb, regno);
> +  if (!temp)
> +    return NULL;
> +  else if (temp == df_bb_regno_last_def_find (bb, regno))
> +    return temp;
> +  else
> +    return NULL;
> +}
> +
>  /* Finds the reference corresponding to the definition of REG in INSN.
>     DF is the dataflow object.  */
>  
> diff --git a/gcc/df.h b/gcc/df.h
> index 402657a7076..98623637f9c 100644
> --- a/gcc/df.h
> +++ b/gcc/df.h
> @@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
>  #endif
>  extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
>  extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
> +extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
>  extern df_ref df_find_def (rtx_insn *, rtx);
>  extern bool df_reg_defined (rtx_insn *, rtx);
>  extern df_ref df_find_use (rtx_insn *, rtx);
> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
> index 4feb0a25ab9..f6dbd0515de 100644
> --- a/gcc/loop-doloop.cc
> +++ b/gcc/loop-doloop.cc
> @@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
>       forms:
>  
>       1)  (parallel [(set (pc) (if_then_else (condition)
> -	  			            (label_ref (label))
> -				            (pc)))
> -	             (set (reg) (plus (reg) (const_int -1)))
> -	             (additional clobbers and uses)])
> +					    (label_ref (label))
> +					    (pc)))
> +		     (set (reg) (plus (reg) (const_int -n)))
> +		     (additional clobbers and uses)])
>  
>       The branch must be the first entry of the parallel (also required
>       by jump.cc), and the second entry of the parallel must be a set of
>       the loop counter register.  Some targets (IA-64) wrap the set of
>       the loop counter in an if_then_else too.
>  
> -     2)  (set (reg) (plus (reg) (const_int -1))
> -         (set (pc) (if_then_else (reg != 0)
> -	                         (label_ref (label))
> -			         (pc))).  
> +     2)  (set (reg) (plus (reg) (const_int -n))
> +	 (set (pc) (if_then_else (reg != 0)
> +				 (label_ref (label))
> +				 (pc))).
>  
>       Some targets (ARM) do the comparison before the branch, as in the
>       following form:
>  
> -     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
> -                   (set (reg) (plus (reg) (const_int -1)))])
> -        (set (pc) (if_then_else (cc == NE)
> -                                (label_ref (label))
> -                                (pc))) */
> +     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))

Pre-existing, but I think this should be:

  (set (cc) (compare (plus (reg) (const_int -n)) 0))

Same for the copy further down.

> +		   (set (reg) (plus (reg) (const_int -n)))])
> +	(set (pc) (if_then_else (cc == NE)
> +				(label_ref (label))
> +				(pc))) */
>  
>    pattern = PATTERN (doloop_pat);
>  

I agree with Andre that it would be good to include the GE possibility
in the comments, e.g. ==/>=.

> @@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>  	      || GET_CODE (cmp_arg1) != PLUS)
>  	    return 0;
>  	  reg_orig = XEXP (cmp_arg1, 0);
> -	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
> +	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
>  	      || !REG_P (reg_orig))
>  	    return 0;
>  	  cc_reg = SET_DEST (cmp_orig);
> @@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
>  	{
>  	  /* We expect the condition to be of the form (reg != 0)  */
>  	  cond = XEXP (SET_SRC (cmp), 0);
> -	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
> +	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
> +	      || XEXP (cond, 1) != const0_rtx)
>  	    return 0;
>  	}
>      }
> @@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>    if (! REG_P (reg))
>      return 0;
>  
> -  /* Check if something = (plus (reg) (const_int -1)).
> +  /* Check if something = (plus (reg) (const_int -n)).
>       On IA-64, this decrement is wrapped in an if_then_else.  */
>    inc_src = SET_SRC (inc);
>    if (GET_CODE (inc_src) == IF_THEN_ELSE)
>      inc_src = XEXP (inc_src, 1);
>    if (GET_CODE (inc_src) != PLUS
>        || XEXP (inc_src, 0) != reg
> -      || XEXP (inc_src, 1) != constm1_rtx)
> +      || !CONST_INT_P (XEXP (inc_src, 1)))
>      return 0;
>  
>    /* Check for (set (pc) (if_then_else (condition)
> @@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
>        || (GET_CODE (XEXP (condition, 0)) == PLUS
>  	  && XEXP (XEXP (condition, 0), 0) == reg))
>     {
> -     if (GET_CODE (pattern) != PARALLEL)
>       /*  For the second form we expect:
>  
> -         (set (reg) (plus (reg) (const_int -1))
> -         (set (pc) (if_then_else (reg != 0)
> -                                 (label_ref (label))
> -                                 (pc))).
> +	 (set (reg) (plus (reg) (const_int -n))
> +	 (set (pc) (if_then_else (reg != 0)
> +				 (label_ref (label))
> +				 (pc))).
>  
> -         is equivalent to the following:
> +	 If n == 1, that is equivalent to the following:
>  
> -         (parallel [(set (pc) (if_then_else (reg != 1)
> -                                            (label_ref (label))
> -                                            (pc)))
> -                     (set (reg) (plus (reg) (const_int -1)))
> -                     (additional clobbers and uses)])
> +	 (parallel [(set (pc) (if_then_else (reg != 1)
> +					    (label_ref (label))
> +					    (pc)))
> +		     (set (reg) (plus (reg) (const_int -1)))
> +		     (additional clobbers and uses)])
>  
> -        For the third form we expect:
> +	For the third form we expect:
>  
> -        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
> -                   (set (reg) (plus (reg) (const_int -1)))])
> -        (set (pc) (if_then_else (cc == NE)
> -                                (label_ref (label))
> -                                (pc))) 
> +	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
> +		   (set (reg) (plus (reg) (const_int -n)))])
> +	(set (pc) (if_then_else (cc == NE)
> +				(label_ref (label))
> +				(pc)))
>  
> -        which is equivalent to the following:
> +	Which also for n == 1 is equivalent to the following:
>  
> -        (parallel [(set (cc) (compare (reg,  1))
> -                   (set (reg) (plus (reg) (const_int -1)))
> -                   (set (pc) (if_then_else (NE == cc)
> -                                           (label_ref (label))
> -                                           (pc))))])
> +	(parallel [(set (cc) (compare (reg,  1))
> +		   (set (reg) (plus (reg) (const_int -1)))
> +		   (set (pc) (if_then_else (NE == cc)
> +					   (label_ref (label))
> +					   (pc))))])
>  
> -        So we return the second form instead for the two cases.
> +	So we return the second form instead for the two cases.
>  
> +	For the "elementwise" form where the decrement number isn't -1,
> +	the final value may be exceeded, so use GE instead of NE.
>       */
> -        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
> +     if (GET_CODE (pattern) != PARALLEL)
> +       {
> +	if (INTVAL (XEXP (inc_src, 1)) != -1)
> +	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
> +	else
> +	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
> +       }
>  
>      return condition;
>     }
> @@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
>        return false;
>      }
>  
> -  max_cost
> -    = COSTS_N_INSNS (param_max_iterations_computation_cost);
> -  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
> -      > max_cost)
> -    {
> -      if (dump_file)
> -	fprintf (dump_file,
> -		 "Doloop: number of iterations too costly to compute.\n");
> -      return false;
> -    }
> -
>    if (desc->const_iter)
>      iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
>  				   UNSIGNED);
> @@ -716,11 +713,24 @@ doloop_optimize (class loop *loop)
>  
>    /* Generate looping insn.  If the pattern FAILs then give up trying
>       to modify the loop since there is some aspect the back-end does
> -     not like.  */
> -  count = copy_rtx (desc->niter_expr);
> +     not like.  If this succeeds, there is a chance that the loop
> +     desc->niter_expr has been altered by the backend, so only extract
> +     that data after the gen_doloop_end.  */
>    start_label = block_label (desc->in_edge->dest);
>    doloop_reg = gen_reg_rtx (mode);
>    rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
> +  count = copy_rtx (desc->niter_expr);

Very minor, but I think the copy should still happen after the cost check.

OK for the df and doloop parts with those changes.

Thanks,
Richard

> +
> +  max_cost
> +    = COSTS_N_INSNS (param_max_iterations_computation_cost);
> +  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
> +      > max_cost)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "Doloop: number of iterations too costly to compute.\n");
> +      return false;
> +    }
>  
>    word_mode_size = GET_MODE_PRECISION (word_mode);
>    word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-10-24 15:11   ` Richard Sandiford
@ 2023-11-06 11:03     ` Stamatis Markianos-Wright
  2023-11-06 11:24       ` Richard Sandiford
  0 siblings, 1 reply; 17+ messages in thread
From: Stamatis Markianos-Wright @ 2023-11-06 11:03 UTC (permalink / raw)
  To: Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 21449 bytes --]


On 24/10/2023 16:11, Richard Sandiford wrote:
> Sorry for the slow review.  I had a look at the arm bits too, to get
> some context for the target-independent bits.

No worries, and thanks for the review! :D

> Stamatis Markianos-Wright via Gcc-patches<gcc-patches@gcc.gnu.org>  writes:
>> [...]
>> diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
>> index 77e76336e94..74186930f0b 100644
>> --- a/gcc/config/arm/arm-protos.h
>> +++ b/gcc/config/arm/arm-protos.h
>> @@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
>>   extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
>>   extern bool arm_q_bit_access (void);
>>   extern bool arm_ge_bits_access (void);
>> -extern bool arm_target_insn_ok_for_lob (rtx);
>> -
>> +extern bool arm_target_bb_ok_for_lob (basic_block);
>> +extern rtx arm_attempt_dlstp_transform (rtx);
>>   #ifdef RTX_CODE
>>   enum reg_class
>>   arm_mode_base_reg_class (machine_mode);
>> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
>> index 6e933c80183..39d97ba5e4d 100644
>> --- a/gcc/config/arm/arm.cc
>> +++ b/gcc/config/arm/arm.cc
>> @@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[]
>> [...]
>> +/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
>> +   something only if the VPR reg is an input operand to the insn.  */
>> +
>> +static rtx
>> +ALWAYS_INLINE
> Probably best to leave out the ALWAYS_INLINE.  That's generally only
> appropriate for things that need to be inlined for correctness.
Done
>> +arm_get_required_vpr_reg_param (rtx_insn *insn)
>> +{
>> +  return arm_get_required_vpr_reg (insn, 1);
>> +}
>> [...]
>> +/* Recursively scan through the DF chain backwards within the basic block and
>> +   determine if any of the USEs of the original insn (or the USEs of the insns
>> +   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
>> +   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
>> +   This function returns true if the insn is affected implicit predication
>> +   and false otherwise.
>> +   Having such implicit predication on an unpredicated insn wouldn't in itself
>> +   block tail predication, because the output of that insn might then be used
>> +   in a correctly predicated store insn, where the disabled lanes will be
>> +   ignored.  To verify this we later call:
>> +   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
>> +   DF chains forward to see if any implicitly-predicated operand gets used in
>> +   an improper way.  */
>> +
>> +static bool
>> +arm_mve_check_df_chain_back_for_implic_predic
>> +  (hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map, rtx_insn *insn,
>> +   rtx vctp_vpr_generated)
>> +{
>> +  bool* temp = NULL;
>> +  if ((temp = safe_insn_map->get (INSN_UID (insn))))
>> +    return *temp;
>> +
>> +  basic_block body = BLOCK_FOR_INSN (insn);
>> +  /* The circumstances under which an instruction is affected by "implicit
>> +     predication" are as follows:
>> +      * It is an UNPREDICATED_INSN_P:
>> +	* That loads/stores from/to memory.
>> +	* Where any one of its operands is an MVE vector from outside the
>> +	  loop body bb.
>> +     Or:
>> +      * Any of its operands, recursively backwards, are affected.  */
>> +  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
>> +      && (arm_is_mve_load_store_insn (insn)
>> +	  || (arm_is_mve_across_vector_insn (insn)
>> +	      && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
>> +    {
>> +      safe_insn_map->put (INSN_UID (insn), true);
>> +      return true;
>> +    }
>> +
>> +  df_ref insn_uses = NULL;
>> +  FOR_EACH_INSN_USE (insn_uses, insn)
>> +  {
>> +    /* If the operand is in the input reg set to the basic block,
>> +       (i.e. it has come from outside the loop!), consider it unsafe if:
>> +	 * It's being used in an unpredicated insn.
>> +	 * It is a predicable MVE vector.  */
>> +    if (MVE_VPT_UNPREDICATED_INSN_P (insn)
>> +	&& VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
>> +	&& REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
>> +      {
>> +	safe_insn_map->put (INSN_UID (insn), true);
>> +	return true;
>> +      }
>> +    /* Scan backwards from the current INSN through the instruction chain
>> +       until the start of the basic block.  */
>> +    for (rtx_insn *prev_insn = PREV_INSN (insn);
>> +	 prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
>> +	 prev_insn = PREV_INSN (prev_insn))
>> +      {
>> +	/* If a previous insn defines a register that INSN uses, then recurse
>> +	   in order to check that insn's USEs.
>> +	   If any of these insns return true as MVE_VPT_UNPREDICATED_INSN_Ps,
>> +	   then the whole chain is affected by the change in behaviour from
>> +	   being placed in a dlstp/letp loop.  */
>> +	df_ref prev_insn_defs = NULL;
>> +	FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
>> +	{
>> +	  if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
>> +	      && !arm_mve_vec_insn_is_predicated_with_this_predicate
>> +		   (insn, vctp_vpr_generated)
>> +	      && arm_mve_check_df_chain_back_for_implic_predic
>> +		  (safe_insn_map, prev_insn, vctp_vpr_generated))
>> +	    {
>> +	      safe_insn_map->put (INSN_UID (insn), true);
>> +	      return true;
>> +	    }
>> +	}
>> +      }
>> +  }
>> +  safe_insn_map->put (INSN_UID (insn), false);
>> +  return false;
>> +}
> It looks like the recursion depth here is proportional to the length
> of the longest insn-to-insn DU chain.  That could be a problem for
> pathologically large loops.  Would it be possible to restructure
> this to use a worklist instead?

Done. I also changed the hash_map back to hashing the rtx_insn* pointer, 
so that it's more visually consistent with the worklist that I also did 
on the rtx_insn*s.
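
For readers following along, here is a minimal sketch of the
recursion-to-worklist restructuring discussed above.  It is a standalone
toy model, not the code in the patch: `toy_insn`, `unsafe_here` and
`depends_on_unsafe` are hypothetical names, and the DF chain is reduced to
a list of indices.  It does not cache per-insn results across calls.

```cpp
#include <unordered_set>
#include <vector>

// Toy stand-in for an insn.  UNSAFE_HERE models the local "implicitly
// predicated" test; DEFS lists the earlier insns whose results it uses.
struct toy_insn
{
  bool unsafe_here = false;
  std::vector<int> defs;
};

// Return true if insn START, or anything it transitively depends on,
// is locally unsafe.  An explicit stack replaces the call stack, so the
// depth of the DU chain no longer limits how large a loop we can scan.
bool
depends_on_unsafe (const std::vector<toy_insn> &insns, int start)
{
  std::unordered_set<int> visited;
  std::vector<int> worklist{start};
  while (!worklist.empty ())
    {
      int cur = worklist.back ();
      worklist.pop_back ();
      // Skip insns that have already been examined.
      if (!visited.insert (cur).second)
	continue;
      if (insns[cur].unsafe_here)
	return true;
      for (int def : insns[cur].defs)
	worklist.push_back (def);
    }
  return false;
}
```

With this shape a chain of a hundred thousand dependent insns is walked
in a flat loop rather than in a hundred thousand nested calls.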


>> [...]
>> +/* If we have identified the loop to have an incrementing counter, we need to
>> +   make sure that it increments by 1 and that the loop is structured correctly:
>> +    * The counter starts from 0
>> +    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
>> +    * The vctp insn uses a reg that decrements appropriately in each iteration.
>> +*/
>> +
>> +static rtx_insn*
>> +arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
>> +				 rtx condconst, rtx condcount)
>> +{
>> +  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
>> +  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
>> +     user applications, none of those with incrementing counters had any real
>> +     insns in the loop latch.  As such, this function has only been tested with
>> +     an empty latch and may misbehave or ICE if we somehow get here with an
>> +     increment in the latch, so, for correctness, error out early.  */
>> +  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
>> +  if (NONDEBUG_INSN_P (dec_insn))
>> +    return NULL;
> Could this use empty_block_p instead?  It would avoid hard-coding the
> assumption that BB_END is not a debug instruction.
Done
>> +
>> +  class rtx_iv vctp_reg_iv;
>> +  /* For loops of type B) the loop counter is independent of the decrement
>> +     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
>> +     has to succeed for such loops to be supported.  */
>> +  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
>> +      vctp_reg, &vctp_reg_iv))
>> +    return NULL;
>> +
>> +  /* Find where both of those are modified in the loop body bb.  */
>> +  rtx condcount_reg_set
>> +	= PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
>> +				 (body, REGNO (condcount))));
>> +  rtx vctp_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
>> +					    (body, REGNO (vctp_reg))));
>> +  if (!vctp_reg_set || !condcount_reg_set)
>> +    return NULL;
> It looks like these should be testing whether df_bb_regno_only_def_find
> returns null instead.  If they do, the DF_REF_INSN will segfault.
> If they don't, the rest will succeed.
Done by moving the `if` to be before the assignments.
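
As an illustration of the ordering issue, a minimal standalone sketch
follows.  It assumes a toy model in which a basic block is just a list of
(regno, insn uid) pairs; `toy_def`, `only_def_find` and
`only_def_insn_uid` are hypothetical names and not GCC APIs.  The point
is only that the lookup result is tested before anything is read through
it.

```cpp
#include <vector>

// Toy stand-in for a df_ref: which register is defined, and by which insn.
struct toy_def
{
  unsigned regno;
  int insn_uid;
};

// Return the one and only def of REGNO in BB, or nullptr if there is
// no def or more than one def.
const toy_def *
only_def_find (const std::vector<toy_def> &bb, unsigned regno)
{
  const toy_def *found = nullptr;
  for (const toy_def &d : bb)
    if (d.regno == regno)
      {
	if (found)
	  return nullptr;
	found = &d;
      }
  return found;
}

// Return the uid of the insn holding the only def of REGNO, or -1.
// The null test comes first; dereferencing before it would crash.
int
only_def_insn_uid (const std::vector<toy_def> &bb, unsigned regno)
{
  const toy_def *def = only_def_find (bb, regno);
  if (!def)
    return -1;
  return def->insn_uid;
}
```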
>> +
>> +  if (REG_P (condcount) && REG_P (condconst))
>> +    {
>> +      /* First we need to prove that the loop is going 0..condconst with an
>> +	 inc of 1 in each iteration.  */
>> +      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
>> +	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
>> +	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
>> +	{
>> +	    rtx counter_reg = XEXP (condcount_reg_set, 0);
>> +	    /* Check that the counter did indeed start from zero.  */
>> +	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
>> +	    if (!this_set)
>> +	      return NULL;
>> +	    df_ref last_set = DF_REF_NEXT_REG (this_set);
>> +	    if (!last_set)
>> +	      return NULL;
>> +	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
>> +	    if (!single_set (last_set_insn))
>> +	      return NULL;
>> +	    rtx counter_orig_set;
>> +	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
>> +	    if (!CONST_INT_P (counter_orig_set)
>> +		|| (INTVAL (counter_orig_set) != 0))
>> +	      return NULL;
>> +	    /* And finally check that the target value of the counter,
>> +	       condconst is of the correct shape.  */
>> +	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
>> +							vctp_reg_iv.step))
>> +	      return NULL;
>> +	}
>> +      else
>> +	return NULL;
>> +    }
>> +  else
>> +    return NULL;
>> +
>> +  /* Extract the decrementnum of the vctp reg.  */
>> +  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
>> +  /* Ensure it matches the number of lanes of the vctp instruction.  */
>> +  if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
>> +    return NULL;
>> +
>> +  /* Everything looks valid.  */
>> +  return vctp_insn;
>> +}
> One of the main reasons for reading the arm bits was to try to answer
> the question: if we switch to a downcounting loop with a GE condition,
> how do we make sure that the start value is not a large unsigned
> number that is interpreted as negative by GE?  E.g. if the loop
> originally counted up in steps of N and used an LTU condition,
> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
> But the loop might never iterate if we start counting down from
> most values in that range.
>
> Does the patch handle that?

So AFAICT this is actually handled in the generic code in `doloop_valid_p`:

Loops of this kind fail that check because they are "desc->infinite", so
no loop-doloop conversion is attempted at all (even for standard dls/le
loops).

Thanks to that check I haven't been able to trigger anything like the
behaviour you describe.  Do you think the doloop_valid_p checks are
robust enough?

This did expose a gap in my testing, though, so I've added a couple more 
test cases to check for this. Thank you!
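
To make the concern concrete, here is a minimal sketch of the arithmetic,
under the simplifying assumption that the loop runs while the remaining
element count is positive.  `trips_unsigned` and `trips_signed` are
hypothetical helper names; they compare the trip count when a 32-bit
count is read as unsigned with the trip count when the same bits are
tested with a signed comparison.  LANES is assumed to be non-zero.

```cpp
#include <cstdint>

// Trip count when the 32-bit element count is treated as unsigned:
// ceil (ELEMS / LANES).
std::uint64_t
trips_unsigned (std::uint32_t elems, std::uint32_t lanes)
{
  return (static_cast<std::uint64_t> (elems) + lanes - 1) / lanes;
}

// Trip count when the same 32 bits are tested with a signed comparison.
// Values above INT32_MAX look negative, so the loop never iterates.
std::uint64_t
trips_signed (std::uint32_t elems, std::uint32_t lanes)
{
  std::int64_t view = elems;
  if (view > INT32_MAX)
    view -= (std::int64_t{1} << 32);
  if (view <= 0)
    return 0;
  return (static_cast<std::uint64_t> (view) + lanes - 1) / lanes;
}
```

For small counts the two agree; for a count of 0x80000000 with 4 lanes
the unsigned reading gives 0x20000000 trips while the signed reading
gives none, which is the mismatch the question is about.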

> [I didn't look at the Arm parts much beyond this point]
>
>> [...]
>> diff --git a/gcc/df-core.cc b/gcc/df-core.cc
>> index d4812b04a7c..4fcc14bf790 100644
>> --- a/gcc/df-core.cc
>> +++ b/gcc/df-core.cc
>> @@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
>>     return NULL;
>>   }
>>   
>> +/* Return the one and only def of REGNO within BB.  If there is no def or
>> +   there are multiple defs, return NULL.  */
>> +
>> +df_ref
>> +df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
>> +{
>> +  df_ref temp = df_bb_regno_first_def_find (bb, regno);
>> +  if (!temp)
>> +    return NULL;
>> +  else if (temp == df_bb_regno_last_def_find (bb, regno))
>> +    return temp;
>> +  else
>> +    return NULL;
>> +}
>> +
>>   /* Finds the reference corresponding to the definition of REG in INSN.
>>      DF is the dataflow object.  */
>>   
>> diff --git a/gcc/df.h b/gcc/df.h
>> index 402657a7076..98623637f9c 100644
>> --- a/gcc/df.h
>> +++ b/gcc/df.h
>> @@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
>>   #endif
>>   extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
>>   extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
>> +extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
>>   extern df_ref df_find_def (rtx_insn *, rtx);
>>   extern bool df_reg_defined (rtx_insn *, rtx);
>>   extern df_ref df_find_use (rtx_insn *, rtx);
>> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
>> index 4feb0a25ab9..f6dbd0515de 100644
>> --- a/gcc/loop-doloop.cc
>> +++ b/gcc/loop-doloop.cc
>> @@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>        forms:
>>   
>>        1)  (parallel [(set (pc) (if_then_else (condition)
>> -	  			            (label_ref (label))
>> -				            (pc)))
>> -	             (set (reg) (plus (reg) (const_int -1)))
>> -	             (additional clobbers and uses)])
>> +					    (label_ref (label))
>> +					    (pc)))
>> +		     (set (reg) (plus (reg) (const_int -n)))
>> +		     (additional clobbers and uses)])
>>   
>>        The branch must be the first entry of the parallel (also required
>>        by jump.cc), and the second entry of the parallel must be a set of
>>        the loop counter register.  Some targets (IA-64) wrap the set of
>>        the loop counter in an if_then_else too.
>>   
>> -     2)  (set (reg) (plus (reg) (const_int -1))
>> -         (set (pc) (if_then_else (reg != 0)
>> -	                         (label_ref (label))
>> -			         (pc))).
>> +     2)  (set (reg) (plus (reg) (const_int -n))
>> +	 (set (pc) (if_then_else (reg != 0)
>> +				 (label_ref (label))
>> +				 (pc))).
>>   
>>        Some targets (ARM) do the comparison before the branch, as in the
>>        following form:
>>   
>> -     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
>> -                   (set (reg) (plus (reg) (const_int -1)))])
>> -        (set (pc) (if_then_else (cc == NE)
>> -                                (label_ref (label))
>> -                                (pc))) */
>> +     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
> Pre-existing, but I think this should be:
>
>    (set (cc) (compare (plus (reg) (const_int -n)) 0))
>
> Same for the copy further down.
Done
>> +		   (set (reg) (plus (reg) (const_int -n)))])
>> +	(set (pc) (if_then_else (cc == NE)
>> +				(label_ref (label))
>> +				(pc))) */
>>   
>>     pattern = PATTERN (doloop_pat);
>>   
> I agree with Andre that it would be good to include the GE possibility
> in the comments, e.g. ==/>=.
Done. I also had a bit of a re-think of those comments and changed some
back to having `-1` on the counter, because they referred only to forms
`1)` and `2)` (which do not apply to the ARM backend). This makes it
more explicit that only the ARM backend, with form `3)`, supports `-n`
on the counter.
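
A minimal sketch of why an equality test stops working once the
decrement is not 1.  This is a toy model rather than the rtx condition
the patch builds: the ordered test is simplified here to "remaining > 0",
and `trips_with_ne`, `trips_with_ordered` and the CAP safety limit are
hypothetical.

```cpp
// Bottom-tested loop that exits when the counter equals zero.
// Returns the trip count, or -1 if zero is never hit within CAP trips.
long
trips_with_ne (long count, long step, long cap)
{
  long trips = 0;
  do
    {
      count -= step;
      trips++;
    }
  while (count != 0 && trips < cap);
  return count == 0 ? trips : -1;
}

// Same loop, but exiting as soon as the counter is no longer positive.
long
trips_with_ordered (long count, long step)
{
  long trips = 0;
  do
    {
      count -= step;
      trips++;
    }
  while (count > 0);
  return trips;
}
```

With 10 elements and a step of 4 the counter goes 6, 2, -2 and steps
over zero, so the equality test never fires, whereas the ordered test
exits after 3 trips.  With a step of 1 both forms agree.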
>> @@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>   	      || GET_CODE (cmp_arg1) != PLUS)
>>   	    return 0;
>>   	  reg_orig = XEXP (cmp_arg1, 0);
>> -	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1)
>> +	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
>>   	      || !REG_P (reg_orig))
>>   	    return 0;
>>   	  cc_reg = SET_DEST (cmp_orig);
>> @@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>   	{
>>   	  /* We expect the condition to be of the form (reg != 0)  */
>>   	  cond = XEXP (SET_SRC (cmp), 0);
>> -	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
>> +	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
>> +	      || XEXP (cond, 1) != const0_rtx)
>>   	    return 0;
>>   	}
>>       }
>> @@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>     if (! REG_P (reg))
>>       return 0;
>>   
>> -  /* Check if something = (plus (reg) (const_int -1)).
>> +  /* Check if something = (plus (reg) (const_int -n)).
>>        On IA-64, this decrement is wrapped in an if_then_else.  */
>>     inc_src = SET_SRC (inc);
>>     if (GET_CODE (inc_src) == IF_THEN_ELSE)
>>       inc_src = XEXP (inc_src, 1);
>>     if (GET_CODE (inc_src) != PLUS
>>         || XEXP (inc_src, 0) != reg
>> -      || XEXP (inc_src, 1) != constm1_rtx)
>> +      || !CONST_INT_P (XEXP (inc_src, 1)))
>>       return 0;
>>   
>>     /* Check for (set (pc) (if_then_else (condition)
>> @@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>         || (GET_CODE (XEXP (condition, 0)) == PLUS
>>   	  && XEXP (XEXP (condition, 0), 0) == reg))
>>      {
>> -     if (GET_CODE (pattern) != PARALLEL)
>>        /*  For the second form we expect:
>>   
>> -         (set (reg) (plus (reg) (const_int -1))
>> -         (set (pc) (if_then_else (reg != 0)
>> -                                 (label_ref (label))
>> -                                 (pc))).
>> +	 (set (reg) (plus (reg) (const_int -n))
>> +	 (set (pc) (if_then_else (reg != 0)
>> +				 (label_ref (label))
>> +				 (pc))).
>>   
>> -         is equivalent to the following:
>> +	 If n == 1, that is equivalent to the following:
>>   
>> -         (parallel [(set (pc) (if_then_else (reg != 1)
>> -                                            (label_ref (label))
>> -                                            (pc)))
>> -                     (set (reg) (plus (reg) (const_int -1)))
>> -                     (additional clobbers and uses)])
>> +	 (parallel [(set (pc) (if_then_else (reg != 1)
>> +					    (label_ref (label))
>> +					    (pc)))
>> +		     (set (reg) (plus (reg) (const_int -1)))
>> +		     (additional clobbers and uses)])
>>   
>> -        For the third form we expect:
>> +	For the third form we expect:
>>   
>> -        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
>> -                   (set (reg) (plus (reg) (const_int -1)))])
>> -        (set (pc) (if_then_else (cc == NE)
>> -                                (label_ref (label))
>> -                                (pc)))
>> +	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
>> +		   (set (reg) (plus (reg) (const_int -n)))])
>> +	(set (pc) (if_then_else (cc == NE)
>> +				(label_ref (label))
>> +				(pc)))
>>   
>> -        which is equivalent to the following:
>> +	Which also for n == 1 is equivalent to the following:
>>   
>> -        (parallel [(set (cc) (compare (reg,  1))
>> -                   (set (reg) (plus (reg) (const_int -1)))
>> -                   (set (pc) (if_then_else (NE == cc)
>> -                                           (label_ref (label))
>> -                                           (pc))))])
>> +	(parallel [(set (cc) (compare (reg,  1))
>> +		   (set (reg) (plus (reg) (const_int -1)))
>> +		   (set (pc) (if_then_else (NE == cc)
>> +					   (label_ref (label))
>> +					   (pc))))])
>>   
>> -        So we return the second form instead for the two cases.
>> +	So we return the second form instead for the two cases.
>>   
>> +	For the "elementwise" form where the decrement number isn't -1,
>> +	the final value may be exceeded, so use GE instead of NE.
>>        */
>> -        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>> +     if (GET_CODE (pattern) != PARALLEL)
>> +       {
>> +	if (INTVAL (XEXP (inc_src, 1)) != -1)
>> +	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
>> +	else
>> +	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>> +       }
>>   
>>       return condition;
>>      }
>> @@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
>>         return false;
>>       }
>>   
>> -  max_cost
>> -    = COSTS_N_INSNS (param_max_iterations_computation_cost);
>> -  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
>> -      > max_cost)
>> -    {
>> -      if (dump_file)
>> -	fprintf (dump_file,
>> -		 "Doloop: number of iterations too costly to compute.\n");
>> -      return false;
>> -    }
>> -
>>     if (desc->const_iter)
>>       iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
>>   				   UNSIGNED);
>> @@ -716,11 +713,24 @@ doloop_optimize (class loop *loop)
>>   
>>     /* Generate looping insn.  If the pattern FAILs then give up trying
>>        to modify the loop since there is some aspect the back-end does
>> -     not like.  */
>> -  count = copy_rtx (desc->niter_expr);
>> +     not like.  If this succeeds, there is a chance that the loop
>> +     desc->niter_expr has been altered by the backend, so only extract
>> +     that data after the gen_doloop_end.  */
>>     start_label = block_label (desc->in_edge->dest);
>>     doloop_reg = gen_reg_rtx (mode);
>>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
>> +  count = copy_rtx (desc->niter_expr);
> Very minor, but I think the copy should still happen after the cost check.
Done.
> OK for the df and doloop parts with those changes.

Thanks! I've attached the [2/2] patch containing the changes here, but I
will also send the entire series again, so that it can go through
patchwork correctly now (previous versions had merge conflicts with
trunk, so they failed to go through the public testing infra).


> Thanks,
> Richard
>
>> +
>> +  max_cost
>> +    = COSTS_N_INSNS (param_max_iterations_computation_cost);
>> +  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
>> +      > max_cost)
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "Doloop: number of iterations too costly to compute.\n");
>> +      return false;
>> +    }
>>   
>>     word_mode_size = GET_MODE_PRECISION (word_mode);
>>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;

[-- Attachment #2: 2.patch --]
[-- Type: text/x-patch, Size: 107540 bytes --]

commit 1eb7b8c84d8687e493b65406ca7d4fe4cdb08e59
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other than -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Added a new df utility function: `df_bb_regno_only_def_find` that
       will return the DEF of a REG if it is DEF-ed only once within the
       basic block.
    
    And many things in the backend to implement the above optimisation:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail-Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp transformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling)
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators,  unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       These all, in some way or another, run checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions:
       * `arm_mve_get_vctp_lanes` to map from vctp unspecs to number of
         lanes
       * `arm_get_required_vpr_reg` to check an insn to see if it requires
         the VPR or not
       * `arm_mve_get_loop_vctp`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_is_mve_across_vector_insn`
       * `arm_is_mve_load_store_insn`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
            (arm_target_bb_ok_for_lob): New.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_is_mve_across_vector_insn): New.
            (arm_is_mve_load_store_insn): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8d..4f164c54740 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 620ef7bfb2f..def2f0d6a58 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34472,19 +34478,1103 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is a VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated instruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least one vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is an MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is an MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give only one scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  extract_insn (insn);
+  int n_operands = recog_data.n_operands;
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+	return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influence the result of the operation.
+   Any new across-vector instructions to the MVE ISA will have to be assessed
+   for inclusion in this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_pattern = PATTERN (insn);
+  if (GET_CODE (insn_pattern) == SET
+      && GET_CODE (XEXP (insn_pattern, 1)) == UNSPEC
+      && (XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAXQ_S))
+    return true;
+  return false;
+}
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where they were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected by implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
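+
+/* Illustrative sketch only (hypothetical user code, assuming the ACLE MVE
+   intrinsics from arm_mve.h; the variable names are made up):
+     p = vctp32q (n);
+     va = vldrwq_z_s32 (a, p);
+     vb = vaddq_s32 (va, va);    <- unpredicated, so implicitly predicated
+     vstrwq_p_s32 (c, vb, p);    <- predicated store, tail lanes are ignored
+   Here the implicit predication of the vaddq is harmless, because its result
+   only reaches memory through a store predicated on the loop VPR.  If the
+   store were instead the unpredicated vstrwq_s32 (c, vb), the dlstp/letp
+   form would change which lanes get written, so the loop must be
+   rejected.  */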
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map <rtx_insn *, bool> *safe_insn_map, rtx_insn *insn_in,
+   rtx vctp_vpr_generated)
+{
+
+  auto_vec<rtx_insn *> worklist;
+  worklist.safe_push (insn_in);
+
+  bool *temp = NULL;
+
+  while (worklist.length () > 0)
+    {
+      rtx_insn *insn = worklist.pop ();
+
+      if ((temp = safe_insn_map->get (insn)))
+	return *temp;
+
+      basic_block body = BLOCK_FOR_INSN (insn);
+
+      /* The circumstances under which an instruction is affected by "implicit
+	 predication" are as follows:
+	  * It is an UNPREDICATED_INSN_P:
+	    * That loads/stores from/to memory.
+	    * Where any one of its operands is an MVE vector from outside the
+	      loop body bb.
+	 Or:
+	  * Any of its operands, recursively backwards, are affected.  */
+      if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	  && (arm_is_mve_load_store_insn (insn)
+	      || (arm_is_mve_across_vector_insn (insn)
+		  && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+	{
+	  safe_insn_map->put (insn, true);
+	  return true;
+	}
+
+      df_ref insn_uses = NULL;
+      FOR_EACH_INSN_USE (insn_uses, insn)
+      {
+	/* If the operand is in the input reg set to the basic block,
+	   (i.e. it has come from outside the loop!), consider it unsafe if:
+	     * It's being used in an unpredicated insn.
+	     * It is a predicable MVE vector.  */
+	if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	    && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	    && REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+	  {
+	    safe_insn_map->put (insn, true);
+	    return true;
+	  }
+
+	/* Scan backwards from the current INSN through the instruction chain
+	   until the start of the basic block.  */
+	for (rtx_insn *prev_insn = PREV_INSN (insn);
+	     prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	     prev_insn = PREV_INSN (prev_insn))
+	  {
+	    /* If a previous insn defines a register that INSN uses, then
+	       recurse in order to check that insn's USEs. If any of these
+	       insns return true as MVE_VPT_UNPREDICATED_INSN_Ps, then the
+	       whole chain is affected by the change in behaviour from being
+	       placed in dlstp/letp loop.  */
+	    df_ref prev_insn_defs = NULL;
+	    FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	    {
+	      if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+		  && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		       (insn, vctp_vpr_generated))
+		worklist.safe_push (prev_insn);
+	    }
+	  }
+      }
+    }
+  safe_insn_map->put (insn_in, false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.  */
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* Bail out if the USE is outside the loop body bb, or if it is
+		 inside, but is a differently-predicated store to memory or
+		 any across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFTRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  if (!pre_loop_bb)
+    return false;
+
+  df_ref counter_max_last_def
+    = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == IF_THEN_ELSE
+      && REG_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (XEXP (counter_max_last_set, 1), 1), vctp_step);
+
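+  /* Worked example (illustrative only; the register numbers are
+     hypothetical): a VCTP32 loop has 4 lanes, so VCTP_STEP has a magnitude
+     of 4 and the iteration count is expected to be set by something like:
+       (set (reg:SI 120) (ashiftrt:SI (reg:SI 119) (const_int 2)))
+     which passes the check below because 1 << 2 == 4.  */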
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  if (!empty_block_p (body->loop_father->latch))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg from the iv.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+
+  /* Find where both of those are modified in the loop body bb.  */
+  df_ref condcount_reg_set_df = df_bb_regno_only_def_find (body, REGNO (condcount));
+  df_ref vctp_reg_set_df = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+  if (!condcount_reg_set_df || !vctp_reg_set_df)
+    return NULL;
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+  rtx_insn* vctp_reg_set = DF_REF_INSN (vctp_reg_set_df);
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (!(GET_CODE (XEXP (PATTERN (vctp_reg_set), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (vctp_reg_set), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 0))))
+    return NULL;
+  if (decrementnum != abs (INTVAL (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 1)))
+      || decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst, is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+    {
+      df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+      /* If we haven't been able to find the decrement, bail out.  */
+      if (!temp)
+	return NULL;
+      dec_insn = DF_REF_INSN (temp);
+    }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case make sure that there is only a simple SET from one to the
+     other inside the loop.  */
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behind a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the vctp_reg is DEF'd inside the loop.  */
+	  df_ref vctp_reg_def
+	    = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+	  if (!vctp_reg_def)
+	    return NULL;
+	  rtx vctp_reg_set = PATTERN (DF_REF_INSN (vctp_reg_def));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If it is a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows it is not
+	 negative).  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem);
+	      num_of_elem -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes;
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem);
+	      num_of_elem -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we will need to do
+    different sets of checks.  */
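+  /* For illustration only (hypothetical user code, assuming the ACLE MVE
+     intrinsics from arm_mve.h; the names a, c and n are made up), a type A
+     loop with 4 lanes per iteration could be written as:
+	  while (n > 0)
+	    {
+	      mve_pred16_t p = vctp32q (n);
+	      int32x4_t va = vldrwq_z_s32 (a, p);
+	      vstrwq_p_s32 (c, va, p);
+	      a += 4;
+	      c += 4;
+	      n -= 4;
+	    }
+     A type B loop has the same body, but is controlled by a separate
+     counter i that runs from 0 to (n + 3) / 4.  */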
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!loop_cond || !INSN_P (loop_cond)
+      || GET_CODE (PATTERN (loop_cond)) != SET)
+    return NULL;
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     that it is valid as per arm_target_bb_ok_for_lob and that the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to handle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instructions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
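+
+  /* For illustration (a hypothetical operand layout, not a specific pattern
+     from mve.md): a predicated insn with operands
+       { dest, merge (constraint "0"), src1, src2, vpr }
+     has n_operands_diff == 2 against its unpredicated form and is re-emitted
+     with { dest, src1, src2 }, whereas a predicated insn with operands
+       { dest, src1, src2, vpr }
+     has n_operands_diff == 1 and only drops the vpr.  */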
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the unpredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its output is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (next_use1 && single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (next_use2 && single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementnum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map <rtx_insn *, bool> *safe_insn_map
+      = new hash_map <rtx_insn *, bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+	       (safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 8efdebecc3c..da745288f26 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 5ea2d9e8668..a6a7ff507a5 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2673,6 +2673,9 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2916,6 +2919,8 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 44a04b86cb5..93905583b18 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6933,7 +6933,7 @@
    (set (reg:SI LR_REGNUM)
 	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
@@ -6953,5 +6953,5 @@
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   "dlstp.<mode1>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..368d5138ca1 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,65 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+      if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatible MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1779,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 4713ec840ab..12ae4c4f820 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -583,6 +583,10 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..6a72700a127 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
@@ -96,18 +96,19 @@ doloop_condition_get (rtx_insn *doloop_pat)
      the loop counter in an if_then_else too.
 
      2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-     Some targets (ARM) do the comparison before the branch, as in the
-     following form:
+     Some targets (ARM) do the comparison before the branch.  The ARM target
+     also supports a counter that can decrement by `n`.  As such, the
+     following form is expected:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare (plus (reg) (const_int -n)) 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE/GE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +144,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -154,9 +155,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
         inc = PATTERN (prev_insn);
       if (GET_CODE (cmp) == SET && GET_CODE (SET_SRC (cmp)) == IF_THEN_ELSE)
 	{
-	  /* We expect the condition to be of the form (reg != 0)  */
+	  /* We expect the condition to be of the form (reg != 0)
+	     or (reg >= 0).  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +214,48 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -1))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 That is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare (plus (reg) (const_int -n)) 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE/GE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which, for n == 1, is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases when n == 1.
 
+	For n > 1, the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+       }
 
     return condition;
    }
@@ -642,7 +651,7 @@ doloop_optimize (class loop *loop)
 {
   scalar_int_mode mode;
   rtx doloop_reg;
-  rtx count;
+  rtx count = NULL_RTX;
   widest_int iterations, iterations_max;
   rtx_code_label *start_label;
   rtx condition;
@@ -685,17 +694,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,12 +714,25 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
 
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
+
+  count = copy_rtx (desc->niter_expr);
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
   if (! doloop_seq
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..5ddd994e53d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for-loop form that decrements to zero.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time p1 will also change on every iteration (still fine).  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0125a2a15fa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..06b960ad9ca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..5a782dd7f74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..8ea181c82d4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..7c331c7895f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,388 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* Using an unsigned number of elements to count down from, with a > 0 test.  */
+void test23 (int32_t *a, int32_t *b, int32_t *c, unsigned int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Using an unsigned number of elements to count up to, with a < n test.  */
+void test24 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 0; i < n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+    }
+}
+
+
+/* Using an unsigned number of elements to count up to, with a <= n test.  */
+void test25 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 1; i <= n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+    }
+}
+
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */
\ No newline at end of file

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-06 11:03     ` Stamatis Markianos-Wright
@ 2023-11-06 11:24       ` Richard Sandiford
  2023-11-06 17:29         ` Stamatis Markianos-Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Sandiford @ 2023-11-06 11:24 UTC (permalink / raw)
  To: Stamatis Markianos-Wright
  Cc: Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw

Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>> One of the main reasons for reading the arm bits was to try to answer
>> the question: if we switch to a downcounting loop with a GE condition,
>> how do we make sure that the start value is not a large unsigned
>> number that is interpreted as negative by GE?  E.g. if the loop
>> originally counted up in steps of N and used an LTU condition,
>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>> But the loop might never iterate if we start counting down from
>> most values in that range.
>>
>> Does the patch handle that?
>
> So AFAICT this is actually handled in the generic code in `doloop_valid_p`:
>
> Loops of this kind fail the check because they are "desc->infinite", so no
> loop-doloop conversion is attempted at all (even for standard dls/le loops).
>
> Thanks to that check I haven't been able to trigger anything like the
> behaviour you describe.  Do you think the doloop_valid_p checks are
> robust enough?

The loops I was thinking of are provably not infinite though.  E.g.:

  for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
    ...

is known to terminate.  And doloop conversion is safe with the normal
count-down-by-1 approach, so I don't think current code would need
to reject it.  I.e. a conversion to:

  unsigned int i = UINT_MAX - 101;
  do
    ...
  while (--i != ~0U);

would be safe, but a conversion to:

  int i = UINT_MAX - 101;
  do
    ...
  while ((i -= step, i > 0));

wouldn't, because the loop body would only be executed once.

I'm only going off the name "infinite" though :)  It's possible that
it has more connotations than that.
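
To make the difference concrete, here is a minimal C sketch (not code from the patch) that counts iterations of a down-counting loop under the two exit tests.  The helper names are hypothetical, and the signed test is written on the unsigned value so that no implementation-defined cast is needed; STEP is assumed to be nonzero.

```c
#include <limits.h>

/* Hypothetical helper: iterations of a down-counting loop whose exit test
   reads the counter as signed, i.e. "continue while (int) i > 0".  */
static unsigned long
iterations_signed_test (unsigned int start, unsigned int step)
{
  unsigned int i = start;
  unsigned long n = 0;
  do
    {
      n++;
      i -= step;
    }
  while (i != 0 && i <= (unsigned int) INT_MAX);
  return n;
}

/* Hypothetical helper: the same loop with an unsigned exit test, stopping
   once no more than STEP elements were left before the decrement.  */
static unsigned long
iterations_unsigned_test (unsigned int start, unsigned int step)
{
  unsigned int i = start;
  unsigned long n = 0;
  for (;;)
    {
      n++;
      if (i <= step)
        break;
      i -= step;
    }
  return n;
}
```

With start = UINT_MAX - 101 and a step of 1u << 24 (chosen only to keep the run short), the signed variant runs the body once while the unsigned variant runs it 256 times; for small counts such as start = 10, step = 4 both give 3.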

Thanks,
Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-06 11:24       ` Richard Sandiford
@ 2023-11-06 17:29         ` Stamatis Markianos-Wright
  2023-11-10 12:41           ` Stamatis Markianos-Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Stamatis Markianos-Wright @ 2023-11-06 17:29 UTC (permalink / raw)
  To: Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw,
	richard.sandiford


On 06/11/2023 11:24, Richard Sandiford wrote:
> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>> One of the main reasons for reading the arm bits was to try to answer
>>> the question: if we switch to a downcounting loop with a GE condition,
>>> how do we make sure that the start value is not a large unsigned
>>> number that is interpreted as negative by GE?  E.g. if the loop
>>> originally counted up in steps of N and used an LTU condition,
>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>> But the loop might never iterate if we start counting down from
>>> most values in that range.
>>>
>>> Does the patch handle that?
>> So AFAICT this is actually handled in the generic code in `doloop_valid_p`:
>>
>> This kind of loops fail because of they are "desc->infinite", then no
>> loop-doloop conversion is attempted at all (even for standard dls/le loops)
>>
>> Thanks to that check I haven't been able to trigger anything like the
>> behaviour you describe, do you think the doloop_valid_p checks are
>> robust enough?
> The loops I was thinking of are provably not infinite though.  E.g.:
>
>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>      ...
>
> is known to terminate.  And doloop conversion is safe with the normal
> count-down-by-1 approach, so I don't think current code would need
> to reject it.  I.e. a conversion to:
>
>    unsigned int i = UINT_MAX - 101;
>    do
>      ...
>    while (--i != ~0U);
>
> would be safe, but a conversion to:
>
>    int i = UINT_MAX - 101;
>    do
>      ...
>    while ((i -= step, i > 0));
>
> wouldn't, because the loop body would only be executed once.
>
> I'm only going off the name "infinite" though :)  It's possible that
> it has more connotations than that.
>
> Thanks,
> Richard

Ack, yep, I see what you mean now, and yep, that kind of loop does 
indeed pass through doloop_valid_p.

Interestingly, in the v8-M Arm ARM this is done with:

```

boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
// This does not check whether a loop is currently active.
// If the PE were in a loop, would this be the last one?
return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));

```

So architecturally the asm we output would be OK (except maybe the 
"branch too far" subs;bgt;lctp fallback in 
`predicated_doloop_end_internal`, where the `bgt` should perhaps be a 
`bhi`).  But GE no longer looks like an accurate representation of this 
operation in the compiler.
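
For reference, a rough C rendering of the pseudocode above (a sketch, not code from the patch).  The function name is hypothetical, and it assumes my reading of the LTPSIZE encoding: 0 to 3 for 8-bit to 64-bit elements, and 4 for no tail predication, so that 1 << (4 - LTPSIZE) is the number of lanes consumed per iteration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical C version of IsLastLowOverheadLoop.  The loop counter is
   compared as an unsigned value, so counts above INT32_MAX are simply
   large counts rather than negative ones.  */
static bool
is_last_low_overhead_loop (uint32_t loop_count, unsigned int ltpsize)
{
  return loop_count <= (UINT32_C (1) << (4 - ltpsize));
}
```

So a count of 0x80000000 with 32-bit elements (LTPSIZE == 2) is not the last iteration here, whereas a signed GE/GT test on the same bits would treat it as negative and leave the loop.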

I'm wondering if I should try to make `predicated_doloop_end_internal` 
contain a comparison along the lines of:
(gtu: (plus: (LR) (const_int -num_lanes)) (const_int num_lanes_minus_1))

I'll give that a try :)

The only reason I'd chosen to go with GE earlier, tbh, was because of 
the existing handling of GE in loop-doloop.cc

Let me know if any other ideas come to your mind!


Cheers,

Stam



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-06 17:29         ` Stamatis Markianos-Wright
@ 2023-11-10 12:41           ` Stamatis Markianos-Wright
  2023-11-16 11:36             ` Stamatis Markianos-Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Stamatis Markianos-Wright @ 2023-11-10 12:41 UTC (permalink / raw)
  To: Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw,
	richard.sandiford, Kyrylo Tkachov

[-- Attachment #1: Type: text/plain, Size: 12271 bytes --]


On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>
> On 06/11/2023 11:24, Richard Sandiford wrote:
>> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>>> One of the main reasons for reading the arm bits was to try to answer
>>>> the question: if we switch to a downcounting loop with a GE condition,
>>>> how do we make sure that the start value is not a large unsigned
>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>> originally counted up in steps of N and used an LTU condition,
>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>> But the loop might never iterate if we start counting down from
>>>> most values in that range.
>>>>
>>>> Does the patch handle that?
>>> So AFAICT this is actually handled in the generic code in 
>>> `doloop_valid_p`:
>>>
>>> This kind of loops fail because of they are "desc->infinite", then no
>>> loop-doloop conversion is attempted at all (even for standard dls/le 
>>> loops)
>>>
>>> Thanks to that check I haven't been able to trigger anything like the
>>> behaviour you describe, do you think the doloop_valid_p checks are
>>> robust enough?
>> The loops I was thinking of are provably not infinite though. E.g.:
>>
>>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>      ...
>>
>> is known to terminate.  And doloop conversion is safe with the normal
>> count-down-by-1 approach, so I don't think current code would need
>> to reject it.  I.e. a conversion to:
>>
>>    unsigned int i = UINT_MAX - 101;
>>    do
>>      ...
>>    while (--i != ~0U);
>>
>> would be safe, but a conversion to:
>>
>>    int i = UINT_MAX - 101;
>>    do
>>      ...
>>    while ((i -= step, i > 0));
>>
>> wouldn't, because the loop body would only be executed once.
>>
>> I'm only going off the name "infinite" though :)  It's possible that
>> it has more connotations than that.
>>
>> Thanks,
>> Richard
>
> Ack, yep, I see what you mean now, and yep, that kind of loop does 
> indeed pass through doloop_valid_p
>
> Interestingly , in the v8-M Arm ARM this is done with:
>
> ```
>
> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
> // This does not check whether a loop is currently active.
> // If the PE were in a loop, would this be the last one?
> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>
> ```
>
> So architecturally the asm we output would be ok (except maybe the 
> "branch too far subs;bgt;lctp" fallback at 
> `predicated_doloop_end_internal` (maybe that should be `bhi`))... But 
> now GE: isn't looking like an accurate representation of this 
> operation in the compiler.
>
> I'm wondering if I should try to make `predicated_doloop_end_internal` 
> contain a comparison along the lines of:
> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int num_lanes_minus_1))
>
> I'll give that a try :)
>
> The only reason I'd chosen to go with GE earlier, tbh, was because of 
> the existing handling of GE in loop-doloop.cc
>
> Let me know if any other ideas come to your mind!
>
>
> Cheers,
>
> Stam


It looks like I've had success with the below (diff to previous patch),
trimmed a bit to only the functionally interesting things:




diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index 368d5138ca1..54dd4ee564b 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1649,16 +1649,28 @@
            && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
            && (INTVAL (decrement_num) != 1))
          {
-          insn = emit_insn
-              (gen_thumb2_addsi3_compare0
-              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
-          cmp = XVECEXP (PATTERN (insn), 0, 0);
-          cc_reg = SET_DEST (cmp);
-          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
-          emit_jump_insn (gen_rtx_SET (pc_rtx,
-                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                 loc_ref, pc_rtx)));
+          switch (INTVAL (decrement_num))
+        {
+          case 2:
+            insn = emit_jump_insn (gen_predicated_doloop_end_internal2
+                        (s0, loc_ref));
+            break;
+          case 4:
+            insn = emit_jump_insn (gen_predicated_doloop_end_internal4
+                        (s0, loc_ref));
+            break;
+          case 8:
+            insn = emit_jump_insn (gen_predicated_doloop_end_internal8
+                        (s0, loc_ref));
+            break;
+          case 16:
+            insn = emit_jump_insn (gen_predicated_doloop_end_internal16
+                        (s0, loc_ref));
+            break;
+          default:
+            gcc_unreachable ();
+        }
            DONE;
          }
      }

diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 93905583b18..c083f965fa9 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6922,23 +6922,24 @@
  ;; Originally expanded by 'predicated_doloop_end'.
  ;; In the rare situation where the branch is too far, we do also need to
  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
-(define_insn "*predicated_doloop_end_internal"
+(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
    [(set (pc)
      (if_then_else
-       (ge (plus:SI (reg:SI LR_REGNUM)
-            (match_operand:SI 0 "const_int_operand" ""))
-        (const_int 0))
-     (label_ref (match_operand 1 "" ""))
+       (gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" "=r")
+                     (const_int <letp_num_lanes_neg>))]
+        LETP)
+        (const_int <letp_num_lanes_minus_1>))
+     (match_operand 1 "" "")
       (pc)))
-   (set (reg:SI LR_REGNUM)
-    (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (set (match_dup 0)
+    (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
     (clobber (reg:CC CC_REGNUM))]
    "TARGET_HAVE_MVE"
    {
      if (get_attr_length (insn) == 4)
        return "letp\t%|lr, %l1";
      else
-      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
+      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
    }
    [(set (attr "length")
      (if_then_else
@@ -6947,11 +6948,11 @@
          (const_int 6)))
     (set_attr "type" "branch")])

-(define_insn "dlstp<mode1>_insn"
+(define_insn "dlstp<dlstp_elemsize>_insn"
    [
      (set (reg:SI LR_REGNUM)
       (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
        DLSTP))
    ]
    "TARGET_HAVE_MVE"
-  "dlstp.<mode1>\t%|lr, %0")
+  "dlstp.<dlstp_elemsize>\t%|lr, %0")

diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 6a72700a127..47fdef989b4 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
        || XEXP (inc_src, 0) != reg
        || !CONST_INT_P (XEXP (inc_src, 1)))
      return 0;
+  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));

    /* Check for (set (pc) (if_then_else (condition)
                                         (label_ref (label))
@@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
    /* Extract loop termination condition.  */
    condition = XEXP (SET_SRC (cmp), 0);

-  /* We expect a GE or NE comparison with 0 or 1.  */
-  if ((GET_CODE (condition) != GE
-       && GET_CODE (condition) != NE)
-      || (XEXP (condition, 1) != const0_rtx
-          && XEXP (condition, 1) != const1_rtx))
+  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
+     dec_num - 1.  */
+  if (!((GET_CODE (condition) == GE
+     || GET_CODE (condition) == NE)
+    && (XEXP (condition, 1) == const0_rtx
+        || XEXP (condition, 1) == const1_rtx ))
+      &&!(GET_CODE (condition) == GTU
+      && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
      return 0;

-  if ((XEXP (condition, 0) == reg)
+  /* For the ARM special case of having a GTU: re-form the condition without
+     the unspec for the benefit of the middle-end.  */
+  if (GET_CODE (condition) == GTU)
+    {
+      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT (dec_num - 1));
+      return condition;
+    }
+  else if ((XEXP (condition, 0) == reg)
        /* For the third case:  */
        || ((cc_reg != NULL_RTX)
        && (XEXP (condition, 0) == cc_reg)
        && (reg_orig == reg))
@@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
                         (label_ref (label))
                         (pc))))])

-    So we return the second form instead for the two cases when n == 1.
-
-    For n > 1, the final value may be exceeded, so use GE instead of NE.
+    So we return the second form instead for the two cases.
       */
-     if (GET_CODE (pattern) != PARALLEL)
-       {
-    if (INTVAL (XEXP (inc_src, 1)) != -1)
-      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
-    else
-      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
-       }
-
+    condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
      return condition;
-   }
+    }

    /* ??? If a machine uses a funny comparison, we could return a
       canonicalized form here.  */
@@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
      case GE:
        /* Currently only GE tests against zero are supported.  */
        gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+      /* FALLTHRU */
+    case GTU:
        noloop = constm1_rtx;
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index a6a7ff507a5..9398702cddd 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2673,8 +2673,16 @@
  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])

-(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
-            (DLSTP64 "64")])
+(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+                 (DLSTP64 "64")])
+
+(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
+                 (LETP64 "2")])
+(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") (LETP32 "-4")
+                     (LETP64 "-2")])
+
+(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") (LETP32 "3")
+                     (LETP64 "1")])

  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
                 (UNSPEC_DOT_U "u8")
@@ -2921,6 +2929,8 @@
  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
                     DLSTP64])
+(define_int_iterator LETP [LETP8 LETP16 LETP32
+               LETP64])

  ;; Define iterators for VCMLA operations
  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
        /* The iteration count does not need incrementing for a GE test.  */
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 12ae4c4f820..2d6f27c14f4 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -587,6 +587,10 @@
    DLSTP16
    DLSTP32
    DLSTP64
+  LETP8
+  LETP16
+  LETP32
+  LETP64
    VPNOT
    VCREATEQ_F
    VCVTQ_N_TO_F_S


I've attached the whole [2/2] patch diff with this change and
the required comment changes in doloop_condition_get.
WDYT?
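
As a sanity check on the three new letp_* attributes in iterators.md, the values all follow from the 128-bit MVE vector width.  A small C sketch with hypothetical helper names (these functions are not part of the patch) shows the relationship:

```c
/* Hypothetical helpers mirroring the letp_num_lanes, letp_num_lanes_neg and
   letp_num_lanes_minus_1 attributes.  ELEM_BITS is the element size in
   bits (8, 16, 32 or 64); an MVE vector is 128 bits wide.  */
static int
letp_num_lanes (int elem_bits)
{
  return 128 / elem_bits;
}

static int
letp_num_lanes_neg (int elem_bits)
{
  return -letp_num_lanes (elem_bits);
}

static int
letp_num_lanes_minus_1 (int elem_bits)
{
  return letp_num_lanes (elem_bits) - 1;
}
```

For example, LETP32 corresponds to 4 lanes, a decrement of -4 and a comparison constant of 3, matching the table in the diff.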


Thanks,

Stam


>
>

[-- Attachment #2: 2.patch --]
[-- Type: text/x-patch, Size: 110925 bytes --]

commit 45c87b24abb0eb35cbf2d8184d35207339bd6be6
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other than -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Added a new df utility function: `df_bb_regno_only_def_find` that
       will return the DEF of a REG if it is DEF-ed only once within the
       basic block.
    
    And many things in the backend to implement the above optimisation:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail-Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp transformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling)
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators,  unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       These all, in some way or another, run checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions, e.g. `arm_mve_get_vctp_lanes` to map
       from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
       to check an insn to see if it requires the VPR or not:
       * `arm_mve_get_loop_vctp`
       * `arm_mve_get_vctp_lanes`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_mve_is_across_vector_insn`
       * `arm_is_mve_load_store_insn`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
            (arm_target_bb_ok_for_lob): New.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_mve_is_across_vector_insn): New.
            (arm_is_mve_load_store_insn): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/.vscode/settings.json b/gcc/.vscode/settings.json
new file mode 100644
index 00000000000..4995c792b8d
--- /dev/null
+++ b/gcc/.vscode/settings.json
@@ -0,0 +1,6 @@
+{
+	"files.associations": {
+		"*.cc": "cpp",
+		"*.def": "cpp"
+	}
+}
\ No newline at end of file
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8d..4f164c54740 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 620ef7bfb2f..0e46a2728b6 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34472,19 +34478,1103 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is a VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated instruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least one vctp instruction, return the first such rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give only one scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  extract_insn (insn);
+  int n_operands = recog_data.n_operands;
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+	return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influence the result of the operation.
+   Any new across-vector instructions to the MVE ISA will have to be assessed for
+   inclusion to this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_pattern = PATTERN (insn);
+  if (GET_CODE (insn_pattern) == SET
+      && GET_CODE (XEXP (insn_pattern, 1)) == UNSPEC
+      && (XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAXQ_S))
+    return true;
+  return false;
+}
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where they were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected by implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map <rtx_insn *, bool> *safe_insn_map, rtx_insn *insn_in,
+   rtx vctp_vpr_generated)
+{
+
+  auto_vec<rtx_insn *> worklist;
+  worklist.safe_push (insn_in);
+
+  bool *temp = NULL;
+
+  while (worklist.length () > 0)
+    {
+      rtx_insn *insn = worklist.pop ();
+
+      if ((temp = safe_insn_map->get (insn)))
+	return *temp;
+
+      basic_block body = BLOCK_FOR_INSN (insn);
+
+      /* The circumstances under which an instruction is affected by "implicit
+	 predication" are as follows:
+	  * It is an UNPREDICATED_INSN_P:
+	    * That loads/stores from/to memory.
+	    * Where any one of its operands is an MVE vector from outside the
+	      loop body bb.
+	 Or:
+	  * Any of its operands, recursively backwards, are affected.  */
+      if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	  && (arm_is_mve_load_store_insn (insn)
+	      || (arm_is_mve_across_vector_insn (insn)
+		  && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+	{
+	  safe_insn_map->put (insn, true);
+	  return true;
+	}
+
+      df_ref insn_uses = NULL;
+      FOR_EACH_INSN_USE (insn_uses, insn)
+      {
+	/* If the operand is in the input reg set to the basic block,
+	   (i.e. it has come from outside the loop!), consider it unsafe if:
+	     * It's being used in an unpredicated insn.
+	     * It is a predicable MVE vector.  */
+	if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	    && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	    && REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+	  {
+	    safe_insn_map->put (insn, true);
+	    return true;
+	  }
+
+	/* Scan backwards from the current INSN through the instruction chain
+	   until the start of the basic block.  */
+	for (rtx_insn *prev_insn = PREV_INSN (insn);
+	     prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	     prev_insn = PREV_INSN (prev_insn))
+	  {
+	    /* If a previous insn defines a register that INSN uses, then
+	       recurse in order to check that insn's USEs.  If any of these
+	       insns return true as MVE_VPT_UNPREDICATED_INSN_Ps, then the
+	       whole chain is affected by the change in behaviour from being
+	       placed in a dlstp/letp loop.  */
+	    df_ref prev_insn_defs = NULL;
+	    FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	    {
+	      if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+		  && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		       (insn, vctp_vpr_generated))
+		worklist.safe_push (prev_insn);
+	    }
+	  }
+      }
+    }
+  safe_insn_map->put (insn_in, false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.  */
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside but
+		 is a differently-predicated store to memory or any
+		 across-vector instruction, bail out.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFTRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == IF_THEN_ELSE
+      && REG_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (XEXP (counter_max_last_set, 1), 1), vctp_step);
+
+  if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  if (!empty_block_p (body->loop_father->latch))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn.  So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg from the iv.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+
+  /* Find where both of those are modified in the loop body bb.  */
+  df_ref condcount_reg_set_df = df_bb_regno_only_def_find (body, REGNO (condcount));
+  df_ref vctp_reg_set_df = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+  if (!condcount_reg_set_df || !vctp_reg_set_df)
+    return NULL;
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+  rtx_insn* vctp_reg_set = DF_REF_INSN (vctp_reg_set_df);
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (!(GET_CODE (XEXP (PATTERN (vctp_reg_set), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (vctp_reg_set), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 0))))
+    return NULL;
+  if (decrementnum != abs (INTVAL (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 1)))
+      || decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst, is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case make sure that there is only a simple SET from one to the
+     other inside the loop.  */
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behind a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the VCTP reg is DEF'd inside the loop.  */
+	  rtx vctp_reg_set =
+		PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					(body, REGNO (vctp_reg))));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If it's a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows it's not negative).  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem);
+	      num_of_elem -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes;
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem);
+	      num_of_elem -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     that it is valid as per arm_target_bb_ok_for_lob and that the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to handle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instructions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the unpredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its output is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (next_use1 && single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (next_use2 && single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of the loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementnum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map <rtx_insn *, bool> *safe_insn_map
+      = new hash_map <rtx_insn *, bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+      /* If the insn pattern requires the use of the VPR value from the
+	 vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+	       (safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 8efdebecc3c..da745288f26 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 5ea2d9e8668..9398702cddd 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2673,6 +2673,17 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+				 (DLSTP64 "64")])
+
+(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
+				 (LETP64 "2")])
+(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") (LETP32 "-4")
+				     (LETP64 "-2")])
+
+(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") (LETP32 "3")
+					 (LETP64 "1")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2916,6 +2927,10 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
+(define_int_iterator LETP [LETP8 LETP16 LETP32
+			   LETP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 44a04b86cb5..c083f965fa9 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6922,23 +6922,24 @@
 ;; Originally expanded by 'predicated_doloop_end'.
 ;; In the rare situation where the branch is too far, we do also need to
 ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
-(define_insn "*predicated_doloop_end_internal"
+(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
   [(set (pc)
 	(if_then_else
-	   (ge (plus:SI (reg:SI LR_REGNUM)
-			(match_operand:SI 0 "const_int_operand" ""))
-		(const_int 0))
-	 (label_ref (match_operand 1 "" ""))
+	   (gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" "=r")
+				     (const_int <letp_num_lanes_neg>))]
+		LETP)
+		(const_int <letp_num_lanes_minus_1>))
+	 (match_operand 1 "" "")
 	 (pc)))
-   (set (reg:SI LR_REGNUM)
-	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (set (match_dup 0)
+	(plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
     else
-      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
+      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
   }
   [(set (attr "length")
 	(if_then_else
@@ -6947,11 +6948,11 @@
 	    (const_int 6)))
    (set_attr "type" "branch")])
 
-(define_insn "dlstp<mode1>_insn"
+(define_insn "dlstp<dlstp_elemsize>_insn"
   [
     (set (reg:SI LR_REGNUM)
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
-  "dlstp.<mode1>\t%|lr, %0")
+  "TARGET_HAVE_MVE"
+  "dlstp.<dlstp_elemsize>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..54dd4ee564b 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,77 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+      if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatible MVE target, try to analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      switch (INTVAL (decrement_num))
+		{
+		  case 2:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal2
+					    (s0, loc_ref));
+		    break;
+		  case 4:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal4
+					    (s0, loc_ref));
+		    break;
+		  case 8:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal8
+					    (s0, loc_ref));
+		    break;
+		  case 16:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal16
+					    (s0, loc_ref));
+		    break;
+		  default:
+		    gcc_unreachable ();
+		}
+	      DONE;
+	    }
+	}
+
+      /* Otherwise, try standard decrement-by-one dls/le looping.  */
+      if (TARGET_THUMB2)
+	insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+						      GEN_INT (-1)));
+      else
+	insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+      cmp = XVECEXP (PATTERN (insn), 0, 0);
+      cc_reg = SET_DEST (cmp);
+      bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				   gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							 loc_ref, pc_rtx)));
+      DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1791,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 4713ec840ab..2d6f27c14f4 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -583,6 +583,14 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
+  LETP8
+  LETP16
+  LETP32
+  LETP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..d54ed792203 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,34 @@ doloop_condition_get (rtx_insn *doloop_pat)
      the loop counter in an if_then_else too.
 
      2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
-
+     3) (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		   (set (reg) (plus (reg) (const_int -1)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
+
+     The ARM target also supports a special case of a counter that decrements
+     by `n` and terminates on a GTU condition.  In that case, the compare and
+     branch are both part of one insn, containing an UNSPEC:
+
+     4) (parallel [
+	    (set (pc)
+		(if_then_else (gtu (unspec:SI [(plus:SI (reg:SI 14 lr)
+							(const_int -n))])
+				   (const_int n-1))
+		    (label_ref)
+		    (pc)))
+	    (set (reg:SI 14 lr)
+		 (plus:SI (reg:SI 14 lr)
+			  (const_int -n)))])
+     */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -143,7 +158,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1)
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -173,15 +188,16 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
+  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
 
   /* Check for (set (pc) (if_then_else (condition)
                                        (label_ref (label))
@@ -196,60 +212,71 @@ doloop_condition_get (rtx_insn *doloop_pat)
   /* Extract loop termination condition.  */
   condition = XEXP (SET_SRC (cmp), 0);
 
-  /* We expect a GE or NE comparison with 0 or 1.  */
-  if ((GET_CODE (condition) != GE
-       && GET_CODE (condition) != NE)
-      || (XEXP (condition, 1) != const0_rtx
-          && XEXP (condition, 1) != const1_rtx))
+  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
+     dec_num - 1.  */
+  if (!((GET_CODE (condition) == GE
+	 || GET_CODE (condition) == NE)
+	&& (XEXP (condition, 1) == const0_rtx
+	    || XEXP (condition, 1) == const1_rtx))
+      && !(GET_CODE (condition) == GTU
+	   && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
     return 0;
 
-  if ((XEXP (condition, 0) == reg)
+  /* For the ARM special case of having a GTU: re-form the condition without
+     the unspec for the benefit of the middle-end.  */
+  if (GET_CODE (condition) == GTU)
+    {
+      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src,
+				  GEN_INT (dec_num - 1));
+      return condition;
+    }
+  else if ((XEXP (condition, 0) == reg)
       /* For the third case:  */  
       || ((cc_reg != NULL_RTX)
 	  && (XEXP (condition, 0) == cc_reg)
 	  && (reg_orig == reg))
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
-   {
+    {
      if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -1))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		   (set (reg) (plus (reg) (const_int -1)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	which is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+	condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
 
     return condition;
-   }
+    }
 
   /* ??? If a machine uses a funny comparison, we could return a
      canonicalized form here.  */
@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
     case GE:
       /* Currently only GE tests against zero are supported.  */
       gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+      /* FALLTHRU */
+    case GTU:
       noloop = constm1_rtx;
 
       /* The iteration count does not need incrementing for a GE test.  */
@@ -642,7 +670,7 @@ doloop_optimize (class loop *loop)
 {
   scalar_int_mode mode;
   rtx doloop_reg;
-  rtx count;
+  rtx count = NULL_RTX;
   widest_int iterations, iterations_max;
   rtx_code_label *start_label;
   rtx condition;
@@ -685,17 +713,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,12 +733,25 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
 
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
+
+  count = copy_rtx (desc->niter_expr);
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
   if (! doloop_seq
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..5ddd994e53d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for loop format of decrementing to zero */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* vc is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time p1 will also change on every iteration (still fine).  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Same as above, but with a scalar op on the accumulator beforehand.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0125a2a15fa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..06b960ad9ca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..5a782dd7f74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..8ea181c82d4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..f7c3e04f883
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,391 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <limits.h>
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* Using an unsigned number of elements to count down from, with a >0.  */
+void test23 (int32_t *a, int32_t *b, int32_t *c, unsigned int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Using an unsigned number of elements to count up to, with a <n.  */
+void test24 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 0; i < n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+
+/* Using an unsigned number of elements to count up to, with a <=n.  */
+void test25 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 1; i <= n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i+1);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */
\ No newline at end of file

* [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-10 12:41           ` Stamatis Markianos-Wright
@ 2023-11-16 11:36             ` Stamatis Markianos-Wright
  2023-11-27 12:47               ` Andre Vieira (lists)
  0 siblings, 1 reply; 17+ messages in thread
From: Stamatis Markianos-Wright @ 2023-11-16 11:36 UTC (permalink / raw)
  To: Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw,
	richard.sandiford, Kyrylo Tkachov

Pinging back to the top of reviewers' inboxes due to worry about Stage 1 
End in a few days :)


See the last email for the latest version of the 2/2 patch. The 1/2 
patch is A-Ok from Kyrill's earlier target-backend review.


On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>
> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>
>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>>>> One of the main reasons for reading the arm bits was to try to answer
>>>>> the question: if we switch to a downcounting loop with a GE condition,
>>>>> how do we make sure that the start value is not a large unsigned
>>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>>> originally counted up in steps of N and used an LTU condition,
>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>> But the loop might never iterate if we start counting down from
>>>>> most values in that range.
>>>>>
>>>>> Does the patch handle that?
>>>> So AFAICT this is actually handled in the generic code in 
>>>> `doloop_valid_p`:
>>>>
>>>> Loops of this kind fail because they are "desc->infinite", so no
>>>> loop-doloop conversion is attempted at all (even for standard
>>>> dls/le loops)
>>>>
>>>> Thanks to that check I haven't been able to trigger anything like the
>>>> behaviour you describe, do you think the doloop_valid_p checks are
>>>> robust enough?
>>> The loops I was thinking of are provably not infinite though. E.g.:
>>>
>>>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>      ...
>>>
>>> is known to terminate.  And doloop conversion is safe with the normal
>>> count-down-by-1 approach, so I don't think current code would need
>>> to reject it.  I.e. a conversion to:
>>>
>>>    unsigned int i = UINT_MAX - 101;
>>>    do
>>>      ...
>>>    while (--i != ~0U);
>>>
>>> would be safe, but a conversion to:
>>>
>>>    int i = UINT_MAX - 101;
>>>    do
>>>      ...
>>>    while ((i -= step, i > 0));
>>>
>>> wouldn't, because the loop body would only be executed once.
>>>
>>> I'm only going off the name "infinite" though :)  It's possible that
>>> it has more connotations than that.
>>>
>>> Thanks,
>>> Richard
>>
>> Ack, yep, I see what you mean now, and yep, that kind of loop does 
>> indeed pass through doloop_valid_p
>>
>> Interestingly, in the v8-M Arm ARM this is done with:
>>
>> ```
>>
>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>> // This does not check whether a loop is currently active.
>> // If the PE were in a loop, would this be the last one?
>> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>
>> ```
>>
>> So architecturally the asm we output would be ok (except maybe the 
>> "branch too far subs;bgt;lctp" fallback at 
>> `predicated_doloop_end_internal` (maybe that should be `bhi`))... But 
>> now GE: isn't looking like an accurate representation of this 
>> operation in the compiler.
>>
>> I'm wondering if I should try to make 
>> `predicated_doloop_end_internal` contain a comparison along the lines 
>> of:
>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int num_lanes_minus_1))
>>
>> I'll give that a try :)
>>
>> The only reason I'd chosen to go with GE earlier, tbh, was because of 
>> the existing handling of GE in loop-doloop.cc
>>
>> Let me know if any other ideas come to your mind!
>>
>>
>> Cheers,
>>
>> Stam
>
>
> It looks like I've had success with the below (diff to previous patch),
> trimmed a bit to only the functionally interesting things:
>
>
>
>
> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
> index 368d5138ca1..54dd4ee564b 100644
> --- a/gcc/config/arm/thumb2.md
> +++ b/gcc/config/arm/thumb2.md
> @@ -1649,16 +1649,28 @@
>            && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
>            && (INTVAL (decrement_num) != 1))
>          {
> -          insn = emit_insn
> -              (gen_thumb2_addsi3_compare0
> -              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
> -          cmp = XVECEXP (PATTERN (insn), 0, 0);
> -          cc_reg = SET_DEST (cmp);
> -          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
>            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
> -          emit_jump_insn (gen_rtx_SET (pc_rtx,
> -                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
> -                                 loc_ref, pc_rtx)));
> +          switch (INTVAL (decrement_num))
> +        {
> +          case 2:
> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal2
> +                        (s0, loc_ref));
> +            break;
> +          case 4:
> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal4
> +                        (s0, loc_ref));
> +            break;
> +          case 8:
> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal8
> +                        (s0, loc_ref));
> +            break;
> +          case 16:
> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal16
> +                        (s0, loc_ref));
> +            break;
> +          default:
> +            gcc_unreachable ();
> +        }
>            DONE;
>          }
>      }
>
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 93905583b18..c083f965fa9 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -6922,23 +6922,24 @@
>  ;; Originally expanded by 'predicated_doloop_end'.
>  ;; In the rare situation where the branch is too far, we do also need to
>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
> -(define_insn "*predicated_doloop_end_internal"
> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>    [(set (pc)
>      (if_then_else
> -       (ge (plus:SI (reg:SI LR_REGNUM)
> -            (match_operand:SI 0 "const_int_operand" ""))
> -        (const_int 0))
> -     (label_ref (match_operand 1 "" ""))
> +       (gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" "=r")
> +                     (const_int <letp_num_lanes_neg>))]
> +        LETP)
> +        (const_int <letp_num_lanes_minus_1>))
> +     (match_operand 1 "" "")
>       (pc)))
> -   (set (reg:SI LR_REGNUM)
> -    (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
> +   (set (match_dup 0)
> +    (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
>     (clobber (reg:CC CC_REGNUM))]
>    "TARGET_HAVE_MVE"
>    {
>      if (get_attr_length (insn) == 4)
>        return "letp\t%|lr, %l1";
>      else
> -      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
> +      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
>    }
>    [(set (attr "length")
>      (if_then_else
> @@ -6947,11 +6948,11 @@
>          (const_int 6)))
>     (set_attr "type" "branch")])
>
> -(define_insn "dlstp<mode1>_insn"
> +(define_insn "dlstp<dlstp_elemsize>_insn"
>    [
>      (set (reg:SI LR_REGNUM)
>       (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
>        DLSTP))
>    ]
>    "TARGET_HAVE_MVE"
> -  "dlstp.<mode1>\t%|lr, %0")
> +  "dlstp.<dlstp_elemsize>\t%|lr, %0")
>
> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
> index 6a72700a127..47fdef989b4 100644
> --- a/gcc/loop-doloop.cc
> +++ b/gcc/loop-doloop.cc
> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>        || XEXP (inc_src, 0) != reg
>        || !CONST_INT_P (XEXP (inc_src, 1)))
>      return 0;
> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>
>    /* Check for (set (pc) (if_then_else (condition)
>                                         (label_ref (label))
> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
>    /* Extract loop termination condition.  */
>    condition = XEXP (SET_SRC (cmp), 0);
>
> -  /* We expect a GE or NE comparison with 0 or 1.  */
> -  if ((GET_CODE (condition) != GE
> -       && GET_CODE (condition) != NE)
> -      || (XEXP (condition, 1) != const0_rtx
> -          && XEXP (condition, 1) != const1_rtx))
> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
> +     dec_num - 1.  */
> +  if (!((GET_CODE (condition) == GE
> +     || GET_CODE (condition) == NE)
> +    && (XEXP (condition, 1) == const0_rtx
> +        || XEXP (condition, 1) == const1_rtx ))
> +      &&!(GET_CODE (condition) == GTU
> +      && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
>      return 0;
>
> -  if ((XEXP (condition, 0) == reg)
> +  /* For the ARM special case of having a GTU: re-form the condition without
> +     the unspec for the benefit of the middle-end.  */
> +  if (GET_CODE (condition) == GTU)
> +    {
> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT (dec_num - 1));
> +      return condition;
> +    }
> +  else if ((XEXP (condition, 0) == reg)
>        /* For the third case:  */
>        || ((cc_reg != NULL_RTX)
>        && (XEXP (condition, 0) == cc_reg)
>        && (reg_orig == reg))
> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
>                         (label_ref (label))
>                         (pc))))])
>
> -    So we return the second form instead for the two cases when n == 1.
> -
> -    For n > 1, the final value may be exceeded, so use GE instead of NE.
> +    So we return the second form instead for the two cases.
>       */
> -     if (GET_CODE (pattern) != PARALLEL)
> -       {
> -    if (INTVAL (XEXP (inc_src, 1)) != -1)
> -      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
> -    else
> -      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
> -       }
> -
> +    condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>      return condition;
> -   }
> +    }
>
>    /* ??? If a machine uses a funny comparison, we could return a
>       canonicalized form here.  */
> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
>      case GE:
>        /* Currently only GE tests against zero are supported.  */
>        gcc_assert (XEXP (condition, 1) == const0_rtx);
> -
> +      /* FALLTHRU */
> +    case GTU:
>        noloop = constm1_rtx;
> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
> index a6a7ff507a5..9398702cddd 100644
> --- a/gcc/config/arm/iterators.md
> +++ b/gcc/config/arm/iterators.md
> @@ -2673,8 +2673,16 @@
>  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
>  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
>
> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
> -            (DLSTP64 "64")])
> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
> +                 (DLSTP64 "64")])
> +
> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
> +                 (LETP64 "2")])
> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") (LETP32 "-4")
> +                     (LETP64 "-2")])
> +
> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") (LETP32 "3")
> +                     (LETP64 "1")])
>
>  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
>                 (UNSPEC_DOT_U "u8")
> @@ -2921,6 +2929,8 @@
>  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
>  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
>                     DLSTP64])
> +(define_int_iterator LETP [LETP8 LETP16 LETP32
> +               LETP64])
>
>  ;; Define iterators for VCMLA operations
>  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
>        /* The iteration count does not need incrementing for a GE test.  */
> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
> index 12ae4c4f820..2d6f27c14f4 100644
> --- a/gcc/config/arm/unspecs.md
> +++ b/gcc/config/arm/unspecs.md
> @@ -587,6 +587,10 @@
>    DLSTP16
>    DLSTP32
>    DLSTP64
> +  LETP8
> +  LETP16
> +  LETP32
> +  LETP64
>    VPNOT
>    VCREATEQ_F
>    VCVTQ_N_TO_F_S
>
>
> I've attached the whole [2/2] patch diff with this change and
> the required comment changes in doloop_condition_get.
> WDYT?
>
>
> Thanks,
>
> Stam
>
>
>>
>>
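
For reference, the signed/unsigned distinction in the quoted discussion can
be modelled in plain C.  This is only a sketch under stated assumptions: the
function names are invented for illustration and are not part of GCC or the
patch, the two predicates model the branch decision of the "subs; bgt" and
"subs; bhi" fallback sequences, and the lane count follows the quoted
`1 << (4 - LTPSIZE)` expression.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lanes per vector for a given LTPSIZE (0 = 8-bit elements, 1 = 16-bit,
   2 = 32-bit, 3 = 64-bit), following the quoted pseudocode.
   Hypothetical helper, not a GCC function.  */
uint32_t
lanes_for_ltpsize (unsigned ltpsize)
{
  return 1u << (4 - ltpsize);
}

/* Model of "subs lr, #lanes; bgt": the branch is taken when the old
   counter is greater than LANES as a signed comparison.  The conversion
   to int32_t is implementation-defined before C23; GCC wraps modulo
   2^32.  */
bool
continues_signed (uint32_t count, uint32_t lanes)
{
  return (int32_t) count > (int32_t) lanes;
}

/* Model of "subs lr, #lanes; bhi": the same test, unsigned.  This is the
   negation of the quoted IsLastLowOverheadLoop check (count <= lanes).  */
bool
continues_unsigned (uint32_t count, uint32_t lanes)
{
  return count > lanes;
}
```

With 32-bit elements (LTPSIZE 2, so 4 lanes) and a start count of
UINT32_MAX - 101, continues_signed returns false, so a bgt-style loop would
exit after a single iteration, while continues_unsigned returns true.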

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-16 11:36             ` Stamatis Markianos-Wright
@ 2023-11-27 12:47               ` Andre Vieira (lists)
  2023-11-30 12:55                 ` Stamatis Markianos-Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Andre Vieira (lists) @ 2023-11-27 12:47 UTC (permalink / raw)
  To: Stam Markianos-Wright, Stamatis Markianos-Wright via Gcc-patches,
	Richard Earnshaw, Richard Sandiford, Kyrylo Tkachov

Hi Stam,

Just some comments.

+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns

s/Recursively scan/Scan/ as you no longer recurse, thanks for that by the
way :)

+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT

Remove "recursively" for the same reasons.

+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;

Move the comment above the preceding if, as that is where the erroring out
it describes happens.

+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
  space after 'insn_note)'

@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
    if (! REG_P (reg))
      return 0;
  -  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
       On IA-64, this decrement is wrapped in an if_then_else.  */
    inc_src = SET_SRC (inc);
    if (GET_CODE (inc_src) == IF_THEN_ELSE)
      inc_src = XEXP (inc_src, 1);
    if (GET_CODE (inc_src) != PLUS
        || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))

Do we ever check that the increment in inc_src is negative? We used to
check that it was -1; now we only check that it is a constant, not that it
is negative, so I suspect this needs a:
|| INTVAL (XEXP (inc_src, 1)) >= 0
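
To illustrate why the sign test matters, here is a minimal standalone model
of the condition checks from the quoted doloop_condition_get hunks.  It is a
sketch only: the enum, the function name and the use of plain integers in
place of RTL are all assumptions made for illustration and do not exist in
GCC.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the RTL comparison codes involved.  */
enum cmp_code { CMP_GE, CMP_NE, CMP_GTU, CMP_OTHER };

/* STEP is the constant added to the counter and BOUND the constant on the
   right-hand side of the comparison.  Returns true when the pattern would
   be accepted.  */
bool
doloop_condition_ok (enum cmp_code code, long step, long bound)
{
  /* The suggested addition: a doloop counter must count down.  */
  if (step >= 0)
    return false;

  long dec_num = labs (step);

  /* GE or NE against 0 or 1 (the pre-existing forms).  */
  if ((code == CMP_GE || code == CMP_NE) && (bound == 0 || bound == 1))
    return true;

  /* GTU against dec_num - 1 (the new form).  */
  return code == CMP_GTU && bound == dec_num - 1;
}
```

Because dec_num is computed with abs, a step of +4 and a step of -4 both
give dec_num == 4, so without the first test a GTU comparison against 3
would be accepted for an up-counting register as well.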

@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
      case GE:
        /* Currently only GE tests against zero are supported.  */
        gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+      /* FALLTHRU */
+    case GTU:
        noloop = constm1_rtx;

I spent a very long time staring at this trying to understand why noloop
= constm1_rtx for GTU, where I thought it should have been (count &
(n-1)). For the current use of doloop it doesn't matter, because ARM is
the only target using it and you set desc->noloop_assumptions to
null_rtx in 'arm_attempt_dlstp_transform', so noloop is never used.
However, if a different target accepts this GTU pattern then this
target-agnostic code will do the wrong thing.  I suggest we either:
  - set noloop to what we think might be the correct value, which if you
ask me should be 'count & (XEXP (condition, 1))',
  - or add a gcc_assert (GET_CODE (condition) != GTU); under the if
(desc->noloop_assumptions) part and document why.  I have a slight
preference for the assert, given that otherwise we are adding code that
we can't test.

LGTM otherwise (but I don't have the power to approve this ;)).

Kind regards,
Andre

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-27 12:47               ` Andre Vieira (lists)
@ 2023-11-30 12:55                 ` Stamatis Markianos-Wright
  2023-12-07 18:08                   ` Andre Vieira (lists)
                                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Stamatis Markianos-Wright @ 2023-11-30 12:55 UTC (permalink / raw)
  To: Andre Vieira (lists),
	Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw,
	Richard Sandiford, Kyrylo Tkachov

[-- Attachment #1: Type: text/plain, Size: 17354 bytes --]

Hi Andre,

Thanks for the comments, see latest revision attached.

On 27/11/2023 12:47, Andre Vieira (lists) wrote:
> Hi Stam,
>
> Just some comments.
>
> +/* Recursively scan through the DF chain backwards within the basic block and
> +   determine if any of the USEs of the original insn (or the USEs of the insns
>
> s/Recursively scan/Scan/ as you no longer recurse, thanks for that by the
> way :)
>
> +   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
>
> Remove "recursively" for the same reasons.
>
> +      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
> +    return NULL;
> +      /* Look at the steps and swap around the rtx's if needed. Error out if
> +     one of them cannot be identified as constant.  */
> +      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
> +    return NULL;
>
> Move the comment above the preceding if, as that is where the erroring out
> it describes happens.
Done
>
> +      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
>  space after 'insn_note)'
>
> @@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>    if (! REG_P (reg))
>      return 0;
>  -  /* Check if something = (plus (reg) (const_int -1)).
> +  /* Check if something = (plus (reg) (const_int -n)).
>       On IA-64, this decrement is wrapped in an if_then_else.  */
>    inc_src = SET_SRC (inc);
>    if (GET_CODE (inc_src) == IF_THEN_ELSE)
>      inc_src = XEXP (inc_src, 1);
>    if (GET_CODE (inc_src) != PLUS
>        || XEXP (inc_src, 0) != reg
> -      || XEXP (inc_src, 1) != constm1_rtx)
> +      || !CONST_INT_P (XEXP (inc_src, 1)))
>
> Do we ever check that the increment in inc_src is negative? We used to
> check that it was -1; now we only check that it is a constant, not that it
> is negative, so I suspect this needs a:
> || INTVAL (XEXP (inc_src, 1)) >= 0
Good point. Done
>
> @@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
>      case GE:
>        /* Currently only GE tests against zero are supported.  */
>        gcc_assert (XEXP (condition, 1) == const0_rtx);
> -
> +      /* FALLTHRU */
> +    case GTU:
>        noloop = constm1_rtx;
>
> I spent a very long time staring at this trying to understand why
> noloop = constm1_rtx for GTU, where I thought it should have been (count
> & (n-1)). For the current use of doloop it doesn't matter, because ARM
> is the only target using it and you set desc->noloop_assumptions to
> null_rtx in 'arm_attempt_dlstp_transform', so noloop is never used.
> However, if a different target accepts this GTU pattern then this
> target-agnostic code will do the wrong thing.  I suggest we either:
>  - set noloop to what we think might be the correct value, which if
> you ask me should be 'count & (XEXP (condition, 1))',
>  - or add a gcc_assert (GET_CODE (condition) != GTU); under the if
> (desc->noloop_assumptions) part and document why.  I have a slight
> preference for the assert, given that otherwise we are adding code that
> we can't test.

Yeah, that's true tbh. I've done the latter, and also separated out the
"case GTU:" and added a comment, so that it's clearer that the noloop
handling isn't used in the only implemented GTU case (Arm).

Thank you :)

>
> LGTM otherwise (but I don't have the power to approve this ;)).
>
> Kind regards,
> Andre
> ________________________________________
> From: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>
> Sent: Thursday, November 16, 2023 11:36 AM
> To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
> Richard Sandiford; Kyrylo Tkachov
> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
> Low Overhead Loops
>
> Pinging back to the top of reviewers' inboxes due to worry about Stage 1
> End in a few days :)
>
>
> See the last email for the latest version of the 2/2 patch. The 1/2
> patch is A-Ok from Kyrill's earlier target-backend review.
>
>
> On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>>
>> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>>
>>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>>> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>>>>> One of the main reasons for reading the arm bits was to try to 
>>>>>> answer
>>>>>> the question: if we switch to a downcounting loop with a GE
>>>>>> condition,
>>>>>> how do we make sure that the start value is not a large unsigned
>>>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>>>> originally counted up in steps of N and used an LTU condition,
>>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>>> But the loop might never iterate if we start counting down from
>>>>>> most values in that range.
>>>>>>
>>>>>> Does the patch handle that?
>>>>> So AFAICT this is actually handled in the generic code in
>>>>> `doloop_valid_p`:
>>>>>
>>>>> These kinds of loops fail because they are "desc->infinite", so no
>>>>> loop-doloop conversion is attempted at all (even for standard
>>>>> dls/le loops)
>>>>>
>>>>> Thanks to that check I haven't been able to trigger anything like the
>>>>> behaviour you describe, do you think the doloop_valid_p checks are
>>>>> robust enough?
>>>> The loops I was thinking of are provably not infinite though. E.g.:
>>>>
>>>>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>>      ...
>>>>
>>>> is known to terminate.  And doloop conversion is safe with the normal
>>>> count-down-by-1 approach, so I don't think current code would need
>>>> to reject it.  I.e. a conversion to:
>>>>
>>>>    unsigned int i = UINT_MAX - 101;
>>>>    do
>>>>      ...
>>>>    while (--i != ~0U);
>>>>
>>>> would be safe, but a conversion to:
>>>>
>>>>    int i = UINT_MAX - 101;
>>>>    do
>>>>      ...
>>>>    while ((i -= step, i > 0));
>>>>
>>>> wouldn't, because the loop body would only be executed once.
>>>>
>>>> I'm only going off the name "infinite" though :)  It's possible that
>>>> it has more connotations than that.
>>>>
>>>> Thanks,
>>>> Richard
>>>
>>> Ack, yep, I see what you mean now, and yep, that kind of loop does
>>> indeed pass through doloop_valid_p
>>>
>>> Interestingly, in the v8-M Arm ARM this is done with:
>>>
>>> ```
>>>
>>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>>> // This does not check whether a loop is currently active.
>>> // If the PE were in a loop, would this be the last one?
>>> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>>
>>> ```
>>>
>>> So architecturally the asm we output would be ok (except maybe the
>>> "branch too far subs;bgt;lctp" fallback at
>>> `predicated_doloop_end_internal` (maybe that should be `bhi`))... But
>>> now GE: isn't looking like an accurate representation of this
>>> operation in the compiler.
>>>
>>> I'm wondering if I should try to make
>>> `predicated_doloop_end_internal` contain a comparison along the lines
>>> of:
>>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int 
>>> num_lanes_minus_1))
>>>
>>> I'll give that a try :)
>>>
>>> The only reason I'd chosen to go with GE earlier, tbh, was because of
>>> the existing handling of GE in loop-doloop.cc
>>>
>>> Let me know if any other ideas come to your mind!
>>>
>>>
>>> Cheers,
>>>
>>> Stam
>>
>>
>> It looks like I've had success with the below (diff to previous patch),
>> trimmed a bit to only the functionally interesting things:
>>
>>
>>
>>
>> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
>> index 368d5138ca1..54dd4ee564b 100644
>> --- a/gcc/config/arm/thumb2.md
>> +++ b/gcc/config/arm/thumb2.md
>> @@ -1649,16 +1649,28 @@
>>            && (decrement_num = arm_attempt_dlstp_transform 
>> (operands[1]))
>>            && (INTVAL (decrement_num) != 1))
>>          {
>> -          insn = emit_insn
>> -              (gen_thumb2_addsi3_compare0
>> -              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
>> -          cmp = XVECEXP (PATTERN (insn), 0, 0);
>> -          cc_reg = SET_DEST (cmp);
>> -          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
>>            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
>> -          emit_jump_insn (gen_rtx_SET (pc_rtx,
>> -                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
>> -                                 loc_ref, pc_rtx)));
>> +          switch (INTVAL (decrement_num))
>> +        {
>> +          case 2:
>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal2
>> +                        (s0, loc_ref));
>> +            break;
>> +          case 4:
>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal4
>> +                        (s0, loc_ref));
>> +            break;
>> +          case 8:
>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal8
>> +                        (s0, loc_ref));
>> +            break;
>> +          case 16:
>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal16
>> +                        (s0, loc_ref));
>> +            break;
>> +          default:
>> +            gcc_unreachable ();
>> +        }
>>            DONE;
>>          }
>>      }
>>
>> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
>> index 93905583b18..c083f965fa9 100644
>> --- a/gcc/config/arm/mve.md
>> +++ b/gcc/config/arm/mve.md
>> @@ -6922,23 +6922,24 @@
>>  ;; Originally expanded by 'predicated_doloop_end'.
>>  ;; In the rare situation where the branch is too far, we do also 
>> need to
>>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
>> -(define_insn "*predicated_doloop_end_internal"
>> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>>    [(set (pc)
>>      (if_then_else
>> -       (ge (plus:SI (reg:SI LR_REGNUM)
>> -            (match_operand:SI 0 "const_int_operand" ""))
>> -        (const_int 0))
>> -     (label_ref (match_operand 1 "" ""))
>> +       (gtu (unspec:SI [(plus:SI (match_operand:SI 0
>> "s_register_operand" "=r")
>> +                     (const_int <letp_num_lanes_neg>))]
>> +        LETP)
>> +        (const_int <letp_num_lanes_minus_1>))
>> +     (match_operand 1 "" "")
>>       (pc)))
>> -   (set (reg:SI LR_REGNUM)
>> -    (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
>> +   (set (match_dup 0)
>> +    (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
>>     (clobber (reg:CC CC_REGNUM))]
>>    "TARGET_HAVE_MVE"
>>    {
>>      if (get_attr_length (insn) == 4)
>>        return "letp\t%|lr, %l1";
>>      else
>> -      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
>> +      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
>>    }
>>    [(set (attr "length")
>>      (if_then_else
>> @@ -6947,11 +6948,11 @@
>>          (const_int 6)))
>>     (set_attr "type" "branch")])
>>
>> -(define_insn "dlstp<mode1>_insn"
>> +(define_insn "dlstp<dlstp_elemsize>_insn"
>>    [
>>      (set (reg:SI LR_REGNUM)
>>       (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
>>        DLSTP))
>>    ]
>>    "TARGET_HAVE_MVE"
>> -  "dlstp.<mode1>\t%|lr, %0")
>> +  "dlstp.<dlstp_elemsize>\t%|lr, %0")
>>
>> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
>> index 6a72700a127..47fdef989b4 100644
>> --- a/gcc/loop-doloop.cc
>> +++ b/gcc/loop-doloop.cc
>> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>        || XEXP (inc_src, 0) != reg
>>        || !CONST_INT_P (XEXP (inc_src, 1)))
>>      return 0;
>> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>>
>>    /* Check for (set (pc) (if_then_else (condition)
>>                                         (label_ref (label))
>> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>    /* Extract loop termination condition.  */
>>    condition = XEXP (SET_SRC (cmp), 0);
>>
>> -  /* We expect a GE or NE comparison with 0 or 1.  */
>> -  if ((GET_CODE (condition) != GE
>> -       && GET_CODE (condition) != NE)
>> -      || (XEXP (condition, 1) != const0_rtx
>> -          && XEXP (condition, 1) != const1_rtx))
>> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison
>> with
>> +     dec_num - 1.  */
>> +  if (!((GET_CODE (condition) == GE
>> +     || GET_CODE (condition) == NE)
>> +    && (XEXP (condition, 1) == const0_rtx
>> +        || XEXP (condition, 1) == const1_rtx ))
>> +      &&!(GET_CODE (condition) == GTU
>> +      && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
>>      return 0;
>>
>> -  if ((XEXP (condition, 0) == reg)
>> +  /* For the ARM special case of having a GTU: re-form the condition
>> without
>> +     the unspec for the benefit of the middle-end.  */
>> +  if (GET_CODE (condition) == GTU)
>> +    {
>> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT
>> (dec_num - 1));
>> +      return condition;
>> +    }
>> +  else if ((XEXP (condition, 0) == reg)
>>        /* For the third case:  */
>>        || ((cc_reg != NULL_RTX)
>>        && (XEXP (condition, 0) == cc_reg)
>>        && (reg_orig == reg))
>> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>                         (label_ref (label))
>>                         (pc))))])
>>
>> -    So we return the second form instead for the two cases when n == 1.
>> -
>> -    For n > 1, the final value may be exceeded, so use GE instead of 
>> NE.
>> +    So we return the second form instead for the two cases.
>>       */
>> -     if (GET_CODE (pattern) != PARALLEL)
>> -       {
>> -    if (INTVAL (XEXP (inc_src, 1)) != -1)
>> -      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
>> -    else
>> -      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
>> -       }
>> -
>> +    condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>>      return condition;
>> -   }
>> +    }
>>
>>    /* ??? If a machine uses a funny comparison, we could return a
>>       canonicalized form here.  */
>> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc
>> *desc,
>>      case GE:
>>        /* Currently only GE tests against zero are supported. */
>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>> -
>> +      /* FALLTHRU */
>> +    case GTU:
>>        noloop = constm1_rtx;
>> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
>> index a6a7ff507a5..9398702cddd 100644
>> --- a/gcc/config/arm/iterators.md
>> +++ b/gcc/config/arm/iterators.md
>> @@ -2673,8 +2673,16 @@
>>  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
>>  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
>>
>> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
>> -            (DLSTP64 "64")])
>> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32
>> "32")
>> +                 (DLSTP64 "64")])
>> +
>> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
>> +                 (LETP64 "2")])
>> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8")
>> (LETP32 "-4")
>> +                     (LETP64 "-2")])
>> +
>> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7")
>> (LETP32 "3")
>> +                     (LETP64 "1")])
>>
>>  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
>>                 (UNSPEC_DOT_U "u8")
>> @@ -2921,6 +2929,8 @@
>>  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
>>  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
>>                     DLSTP64])
>> +(define_int_iterator LETP [LETP8 LETP16 LETP32
>> +               LETP64])
>>
>>  ;; Define iterators for VCMLA operations
>>  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
>>        /* The iteration count does not need incrementing for a GE
>> test.  */
>> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
>> index 12ae4c4f820..2d6f27c14f4 100644
>> --- a/gcc/config/arm/unspecs.md
>> +++ b/gcc/config/arm/unspecs.md
>> @@ -587,6 +587,10 @@
>>    DLSTP16
>>    DLSTP32
>>    DLSTP64
>> +  LETP8
>> +  LETP16
>> +  LETP32
>> +  LETP64
>>    VPNOT
>>    VCREATEQ_F
>>    VCVTQ_N_TO_F_S
>>
>>
>> I've attached the whole [2/2] patch diff with this change and
>> the required comment changes in doloop_condition_get.
>> WDYT?
>>
>>
>> Thanks,
>>
>> Stam
>>
>>
>>>
>>>
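
As a quick cross-check of the new letp_num_lanes* attributes in
iterators.md: they all follow from the element size, since an MVE vector is
128 bits wide and the lane count is 128 divided by the element size in
bits. A small sketch (plain C, not part of the patch; the helper names are
invented for illustration) that recomputes the values, plus the iteration
count formula used in the comments of arm_mve_dlstp_check_inc_counter:

```c
#include <assert.h>

/* MVE vectors are 128 bits wide.  */
enum { MVE_VECTOR_BITS = 128 };

/* Hypothetical helper: number of lanes for an element size in bits
   (8, 16, 32 or 64), i.e. the value of letp_num_lanes.  */
static int
letp_lanes (int elem_bits)
{
  return MVE_VECTOR_BITS / elem_bits;
}

/* Hypothetical helper: the constant the GTU comparison is made against,
   i.e. the value of letp_num_lanes_minus_1.  */
static int
letp_gtu_bound (int elem_bits)
{
  return letp_lanes (elem_bits) - 1;
}

/* Hypothetical helper: iterations needed to process ELEMS elements,
   (num_of_elem + num_of_lanes - 1) / num_of_lanes.  The addition can
   wrap for ELEMS close to UINT_MAX; this sketch ignores that.  */
static unsigned
letp_iterations (unsigned elems, int elem_bits)
{
  unsigned lanes = (unsigned) letp_lanes (elem_bits);
  return (elems + lanes - 1) / lanes;
}
```

For 32-bit elements this gives 4 lanes and a GTU bound of 3, matching the
LETP32 entries, and 10 elements take 3 iterations.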

[-- Attachment #2: diff.txt --]
[-- Type: text/plain, Size: 106360 bytes --]

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8ddd647b212782a0454ee4fefc07257..4f164c547406c43219900c111401540c7ef9d7d1 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 620ef7bfb2f3af9b8de576359a6157190c439aad..6056d9cdb7bc839175a95c1421ead94b02a2bc18 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34472,19 +34478,1103 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is a VPR-generating instruction, like a
+	 vctp, vcmp, etc., or it is a VPT-predicated instruction.  Return the
+	 subrtx of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least one vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give only one scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  extract_insn (insn);
+  int n_operands = recog_data.n_operands;
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+	return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influence the result of the operation.
+   Any new across-vector instructions added to the MVE ISA will have to be
+   assessed for inclusion in this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_pattern = PATTERN (insn);
+  if (GET_CODE (insn_pattern) == SET
+      && GET_CODE (XEXP (insn_pattern, 1)) == UNSPEC
+      && (XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAXQ_S))
+    return true;
+  return false;
+}
+
+/* Scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where they were DEF-ed, etc.) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected by implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map <rtx_insn *, bool> *safe_insn_map, rtx_insn *insn_in,
+   rtx vctp_vpr_generated)
+{
+
+  auto_vec<rtx_insn *> worklist;
+  worklist.safe_push (insn_in);
+
+  bool *temp = NULL;
+
+  while (worklist.length () > 0)
+    {
+      rtx_insn *insn = worklist.pop ();
+
+      if ((temp = safe_insn_map->get (insn)))
+	return *temp;
+
+      basic_block body = BLOCK_FOR_INSN (insn);
+
+      /* The circumstances under which an instruction is affected by "implicit
+	 predication" are as follows:
+	  * It is an UNPREDICATED_INSN_P:
+	    * That loads/stores from/to memory.
+	    * Where any one of its operands is an MVE vector from outside the
+	      loop body bb.
+	 Or:
+	  * Any of its operands were affected earlier in the insn chain.  */
+      if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	  && (arm_is_mve_load_store_insn (insn)
+	      || (arm_is_mve_across_vector_insn (insn)
+		  && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+	{
+	  safe_insn_map->put (insn, true);
+	  return true;
+	}
+
+      df_ref insn_uses = NULL;
+      FOR_EACH_INSN_USE (insn_uses, insn)
+      {
+	/* If the operand is in the input reg set to the basic block,
+	   (i.e. it has come from outside the loop!), consider it unsafe if:
+	     * It's being used in an unpredicated insn.
+	     * It is a predicable MVE vector.  */
+	if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	    && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	    && REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+	  {
+	    safe_insn_map->put (insn, true);
+	    return true;
+	  }
+
+	/* Scan backwards from the current INSN through the instruction chain
+	   until the start of the basic block.  */
+	for (rtx_insn *prev_insn = PREV_INSN (insn);
+	     prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	     prev_insn = PREV_INSN (prev_insn))
+	  {
+	    /* If a previous insn defines a register that INSN uses, then
+	       add to the worklist to check that insn's USEs.  If any of these
+	       insns return true as MVE_VPT_UNPREDICATED_INSN_Ps, then the
+	       whole chain is affected by the change in behaviour from being
+	       placed in dlstp/letp loop.  */
+	    df_ref prev_insn_defs = NULL;
+	    FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	    {
+	      if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+		  && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		       (insn, vctp_vpr_generated))
+		worklist.safe_push (prev_insn);
+	    }
+	  }
+      }
+    }
+  safe_insn_map->put (insn_in, false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.  */
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is a differently-predicated store to memory or it is any
+		 across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFTRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == IF_THEN_ELSE
+      && REG_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (XEXP (counter_max_last_set, 1), 1), vctp_step);
+
+  if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  if (!empty_block_p (body->loop_father->latch))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn.  So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg from the iv.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+
+  /* Find where both of those are modified in the loop body bb.  */
+  df_ref condcount_reg_set_df = df_bb_regno_only_def_find (body, REGNO (condcount));
+  df_ref vctp_reg_set_df = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+  if (!condcount_reg_set_df || !vctp_reg_set_df)
+    return NULL;
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+  rtx_insn* vctp_reg_set = DF_REF_INSN (vctp_reg_set_df);
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (!(GET_CODE (XEXP (PATTERN (vctp_reg_set), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (vctp_reg_set), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 0))))
+    return NULL;
+  if (decrementnum != abs (INTVAL (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 1)))
+      || decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst, is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case make sure that there is only a simple SET from one to the
+     other inside the loop.  */
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behind a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the vctp_reg is DEF'd inside the loop.  */
+	  rtx vctp_reg_set =
+		PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					(body, REGNO (vctp_reg))));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If it is a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows it is not negative).  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem);
+	      num_of_elem -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes;
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem);
+	      num_of_elem -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above, we will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     that it is valid as per arm_target_bb_ok_for_lob, and that the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to handle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instructions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the unpredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its output is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (next_use1 && single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (next_use2 && single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementnum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map <rtx_insn *, bool> *safe_insn_map
+      = new hash_map <rtx_insn *, bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+	       (safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 8efdebecc3cab53bb99cb2a5000d6d3c6c8e3798..da745288f2669427e48aaa5c116bbc6c08ad2b30 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 5ea2d9e866891bdb3dc73fcf6cbd6cdd2f989951..9398702cddd076a7eacf1ca6eac6c5a1fbd9a3d0 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2673,6 +2673,17 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+				 (DLSTP64 "64")])
+
+(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
+				 (LETP64 "2")])
+(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") (LETP32 "-4")
+				     (LETP64 "-2")])
+
+(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") (LETP32 "3")
+					 (LETP64 "1")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2916,6 +2927,10 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
+(define_int_iterator LETP [LETP8 LETP16 LETP32
+			   LETP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 44a04b86cb5806fcf50917826512fd203d42106c..c083f965fa9a40781bc86beb6e63654afd14eac4 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6922,23 +6922,24 @@
 ;; Originally expanded by 'predicated_doloop_end'.
 ;; In the rare situation where the branch is too far, we do also need to
 ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
-(define_insn "*predicated_doloop_end_internal"
+(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
   [(set (pc)
 	(if_then_else
-	   (ge (plus:SI (reg:SI LR_REGNUM)
-			(match_operand:SI 0 "const_int_operand" ""))
-		(const_int 0))
-	 (label_ref (match_operand 1 "" ""))
+	   (gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" "=r")
+				     (const_int <letp_num_lanes_neg>))]
+		LETP)
+		(const_int <letp_num_lanes_minus_1>))
+	 (match_operand 1 "" "")
 	 (pc)))
-   (set (reg:SI LR_REGNUM)
-	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (set (match_dup 0)
+	(plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
     else
-      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
+      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
   }
   [(set (attr "length")
 	(if_then_else
@@ -6947,11 +6948,11 @@
 	    (const_int 6)))
    (set_attr "type" "branch")])
 
-(define_insn "dlstp<mode1>_insn"
+(define_insn "dlstp<dlstp_elemsize>_insn"
   [
     (set (reg:SI LR_REGNUM)
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
-  "dlstp.<mode1>\t%|lr, %0")
+  "TARGET_HAVE_MVE"
+  "dlstp.<dlstp_elemsize>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7a67ddbf517bf22797bdaeeb96b94f..54dd4ee564b71d8e0b9b276fca388deb4018ce7d 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,77 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatible MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      switch (INTVAL (decrement_num))
+		{
+		  case 2:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal2
+					    (s0, loc_ref));
+		    break;
+		  case 4:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal4
+					    (s0, loc_ref));
+		    break;
+		  case 8:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal8
+					    (s0, loc_ref));
+		    break;
+		  case 16:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal16
+					    (s0, loc_ref));
+		    break;
+		  default:
+		    gcc_unreachable ();
+		}
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1791,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 4713ec840abae48ca70f418dbc0d4028ad4ad527..2d6f27c14f4a1e7db05b9684a8958a76a1c79ef2 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -583,6 +583,14 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
+  LETP8
+  LETP16
+  LETP32
+  LETP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7cb97ea1606082e26e910472da5bcc1..4fcc14bf790d43e792b3c926fe1f80073d908c17 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076f1bcad24e9c50682e033e57f432f9..98623637f9c839c799222e99df2a7173a770b2ac 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
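
Note on df_bb_regno_only_def_find: it relies on the observation that a register has exactly one def in a block when the first def and the last def are the same object. A toy model in plain C, assuming a block can be represented as an array of defined register numbers in program order. None of these names are GCC's df API; they are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not GCC's df API: a "block" is an array of register
   numbers, one entry per defining insn, in program order.  */
static const int *
first_def (const int *defs, size_t n, int regno)
{
  for (size_t i = 0; i < n; i++)
    if (defs[i] == regno)
      return &defs[i];
  return NULL;
}

static const int *
last_def (const int *defs, size_t n, int regno)
{
  for (size_t i = n; i > 0; i--)
    if (defs[i - 1] == regno)
      return &defs[i - 1];
  return NULL;
}

/* Same shape as df_bb_regno_only_def_find: the def is unique exactly
   when the first and the last def found are the same object.  */
static const int *
only_def (const int *defs, size_t n, int regno)
{
  const int *first = first_def (defs, n, regno);
  if (first && first == last_def (defs, n, regno))
    return first;
  return NULL;
}
```

A register that is defined twice, or not at all, yields NULL in this model, matching the comment on the new df function.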
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9331b7124df900f73c9fc6fb3eb10b..d919207505c472c8a54a2c9c982a09061584177b 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,34 @@ doloop_condition_get (rtx_insn *doloop_pat)
      the loop counter in an if_then_else too.
 
      2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
-
+     3) (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		   (set (reg) (plus (reg) (const_int -1)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
+
+      The ARM target also supports a special case of a counter that decrements
+      by `n` and terminates on a GTU condition.  In that case, the compare and
+      branch are both part of one insn, containing an UNSPEC:
+
+      4) (parallel [
+	    (set (pc)
+		(if_then_else (gtu (unspec:SI [(plus:SI (reg:SI 14 lr)
+							(const_int -n))])
+				   (const_int n-1))
+		    (label_ref)
+		    (pc)))
+	    (set (reg:SI 14 lr)
+		 (plus:SI (reg:SI 14 lr)
+			  (const_int -n)))])
+     */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -143,7 +158,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1)
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -173,15 +188,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1))
+      || INTVAL (XEXP (inc_src, 1)) >= 0)
     return 0;
+  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
 
   /* Check for (set (pc) (if_then_else (condition)
                                        (label_ref (label))
@@ -196,60 +213,71 @@ doloop_condition_get (rtx_insn *doloop_pat)
   /* Extract loop termination condition.  */
   condition = XEXP (SET_SRC (cmp), 0);
 
-  /* We expect a GE or NE comparison with 0 or 1.  */
-  if ((GET_CODE (condition) != GE
-       && GET_CODE (condition) != NE)
-      || (XEXP (condition, 1) != const0_rtx
-          && XEXP (condition, 1) != const1_rtx))
+  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
+     dec_num - 1.  */
+  if (!((GET_CODE (condition) == GE
+	 || GET_CODE (condition) == NE)
+	&& (XEXP (condition, 1) == const0_rtx
+	    || XEXP (condition, 1) == const1_rtx))
+      && !(GET_CODE (condition) == GTU
+	   && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
     return 0;
 
-  if ((XEXP (condition, 0) == reg)
+  /* For the ARM special case of having a GTU: re-form the condition without
+     the unspec for the benefit of the middle-end.  */
+  if (GET_CODE (condition) == GTU)
+    {
+      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src,
+				  GEN_INT (dec_num - 1));
+      return condition;
+    }
+  else if ((XEXP (condition, 0) == reg)
       /* For the third case:  */  
       || ((cc_reg != NULL_RTX)
 	  && (XEXP (condition, 0) == cc_reg)
 	  && (reg_orig == reg))
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
-   {
+    {
      if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -1))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
+		   (set (reg) (plus (reg) (const_int -1)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) 
 
-        which is equivalent to the following:
+	which is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+	condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
 
     return condition;
-   }
+    }
 
   /* ??? If a machine uses a funny comparison, we could return a
      canonicalized form here.  */
@@ -507,6 +535,11 @@ doloop_modify (class loop *loop, class niter_desc *desc,
 	nonneg = 1;
       break;
 
+    case GTU:
+      /* The iteration count does not need incrementing for a GTU test.  */
+      increment_count = false;
+      break;
+
       /* Abort if an invalid doloop pattern has been generated.  */
     default:
       gcc_unreachable ();
@@ -529,6 +562,10 @@ doloop_modify (class loop *loop, class niter_desc *desc,
 
   if (desc->noloop_assumptions)
     {
+      /* The GTU case has only been implemented for the ARM target, where
+	 noloop_assumptions gets explicitly set to NULL for that case, so
+	 assert here for safety.  */
+      gcc_assert (GET_CODE (condition) != GTU);
       rtx ass = copy_rtx (desc->noloop_assumptions);
       basic_block preheader = loop_preheader_edge (loop)->src;
       basic_block set_zero = split_edge (loop_preheader_edge (loop));
@@ -642,7 +679,7 @@ doloop_optimize (class loop *loop)
 {
   scalar_int_mode mode;
   rtx doloop_reg;
-  rtx count;
+  rtx count = NULL_RTX;
   widest_int iterations, iterations_max;
   rtx_code_label *start_label;
   rtx condition;
@@ -685,17 +722,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,12 +742,25 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop's
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
 
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
+
+  count = copy_rtx (desc->niter_expr);
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
   if (! doloop_seq
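
Note on the relaxed doloop handling: the counter now holds the number of elements rather than the number of iterations, and it is decremented by the lane count on every pass. The scalar model below is not MVE code. It assumes a vctp-style predicate enables min (remaining, lanes) lanes, and shows that such a loop runs ceil (n / lanes) times while touching exactly n elements. `active_lanes` and `count_iterations` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Scalar model, not MVE code: number of lanes a vctp-style predicate
   enables when `remaining` elements are left and a vector has `lanes`
   lanes.  */
static int
active_lanes (int remaining, int lanes)
{
  if (remaining <= 0)
    return 0;
  return remaining < lanes ? remaining : lanes;
}

/* Count the iterations of a loop whose counter holds elements rather
   than iterations and is decremented by `lanes` each time round.  Also
   accumulates the number of elements actually processed.  */
static int
count_iterations (int n, int lanes, int *processed)
{
  int iters = 0;
  *processed = 0;
  while (n > 0)
    {
      *processed += active_lanes (n, lanes);
      n -= lanes;
      iters++;
    }
  return iters;
}
```

The iteration count agrees with the `(n + 15) / 16` form used by test3 and test4 in dlstp-compile-asm.c for 16 lanes.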
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc89959b3147368980120700bbc3e85ecb..3941fe7a8b620e62a5f742722be1ba2d031f5a8d 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
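
Note on the new lob.h helpers: the extra `x` argument lets a test process only the first x of N elements and then verify that everything from index x onwards is still zero, which is the property a tail-predicated loop has to preserve. A portable sketch of that check, assuming a version that returns a flag instead of calling abort () and that drops the ARM-specific NO_LOB asm. `check_plus32_ok` and `add_first_x` are hypothetical stand-ins, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 100

/* Portable stand-in for check_plus32: returns 1 where the header's
   version would fall through and 0 where it would call abort ().  */
static int
check_plus32_ok (const int32_t *a, const int32_t *b, const int32_t *c, int x)
{
  for (int i = 0; i < N; i++)
    {
      int32_t expect = i < x ? a[i] + b[i] : 0;
      if (c[i] != expect)
	return 0;
    }
  return 1;
}

/* Stand-in for a loop under test that handles only the first x
   elements and leaves the tail untouched.  */
static void
add_first_x (const int32_t *a, const int32_t *b, int32_t *c, int x)
{
  for (int i = 0; i < x; i++)
    c[i] = a[i] + b[i];
}
```

Checking with an `x` one larger than the count actually processed fails, because the first untouched element is still zero rather than the expected sum.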
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c582c96a18ad417a3041e43d843613..c8ce653a5c39fb1ffcf82a6e584d9a0467a130c0 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e8ae9e1cb57e41fa43a954b3390eec..4fe116e2c2be3748d1bb6da7bb9092db8f962abc 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 0000000000000000000000000000000000000000..5ddd994e53d55c7b4d05bfb858e6078ce7da4ce4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for-loop form that decrements to zero.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* This add is affected by implicit predication, as vb also came
+	   from an unpredicated load, but there is no functional problem
+	   because the result is used in a predicated store.  */
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time p1 also changes in every iteration (still fine).  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 0000000000000000000000000000000000000000..0125a2a15faa1a7071fc821b6db66d10f1bff6da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 0000000000000000000000000000000000000000..06b960ad9caadb422e04007d17b64676f379c776
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 0000000000000000000000000000000000000000..5a782dd7f742d6e8d177ec8632b2ae4dcff664f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 0000000000000000000000000000000000000000..8ea181c82d45a008d60a66c1f9e9b289c5f05611
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 0000000000000000000000000000000000000000..f7c3e04f8831e6b6eb709c8f3b0a0a896313ca64
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,391 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <limits.h>
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* Using an unsigned number of elements to count down from, with a >0*/
+void test23 (int32_t *a, int32_t *b, int32_t *c, unsigned int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Using an unsigned number of elements to count up to, with a <n*/
+void test24 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 0; i < n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+
+/* Using an unsigned number of elements to count up to, with a <=n*/
+void test25 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 1; i <= n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i+1);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */
\ No newline at end of file

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-30 12:55                 ` Stamatis Markianos-Wright
@ 2023-12-07 18:08                   ` Andre Vieira (lists)
  2023-12-09 18:31                   ` Richard Sandiford
  2023-12-12 17:56                   ` Richard Earnshaw
  2 siblings, 0 replies; 17+ messages in thread
From: Andre Vieira (lists) @ 2023-12-07 18:08 UTC (permalink / raw)
  To: Stamatis Markianos-Wright,
	Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw,
	Richard Sandiford, Kyrylo Tkachov

Thanks for addressing my comments. I have reviewed this and the other 
patch before and they LGTM. I however do not have approval rights so you 
will need the OK from a maintainer.

Thanks for doing this :)

Andre

On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:
> Hi Andre,
> 
> Thanks for the comments, see latest revision attached.
> 
> On 27/11/2023 12:47, Andre Vieira (lists) wrote:
>> Hi Stam,
>>
>> Just some comments.
>>
>> +/* Recursively scan through the DF chain backwards within the basic block and
>> +   determine if any of the USEs of the original insn (or the USEs of the insns
>> s/Recursively scan/Scan/ as you no longer recurse, thanks for that by the way :)
>> +   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
>> remove "recursively" for the same reasons.
>>
>> +      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
>> +    return NULL;
>> +      /* Look at the steps and swap around the rtx's if needed. Error out if
>> +     one of them cannot be identified as constant.  */
>> +      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
>> +    return NULL;
>>
>> Move the comment above the preceding if, since that is where the
>> erroring out it talks about happens.
> Done
>>
>> +      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
>>  space after 'insn_note)'
>>
>> @@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>    if (! REG_P (reg))
>>      return 0;
>>  -  /* Check if something = (plus (reg) (const_int -1)).
>> +  /* Check if something = (plus (reg) (const_int -n)).
>>       On IA-64, this decrement is wrapped in an if_then_else.  */
>>    inc_src = SET_SRC (inc);
>>    if (GET_CODE (inc_src) == IF_THEN_ELSE)
>>      inc_src = XEXP (inc_src, 1);
>>    if (GET_CODE (inc_src) != PLUS
>>        || XEXP (inc_src, 0) != reg
>> -      || XEXP (inc_src, 1) != constm1_rtx)
>> +      || !CONST_INT_P (XEXP (inc_src, 1)))
>>
>> Do we ever check that inc_src is negative? We used to check if it was
>> -1, now we only check it's a constant, but not a negative one, so I
>> suspect this needs a:
>> || INTVAL (XEXP (inc_src, 1)) >= 0
> Good point. Done
>>
>> @@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
>>      case GE:
>>        /* Currently only GE tests against zero are supported.  */
>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>> -
>> +      /* FALLTHRU */
>> +    case GTU:
>>        noloop = constm1_rtx;
>>
>> I spent a very long time staring at this trying to understand why 
>> noloop = constm1_rtx for GTU, where I thought it should've been (count 
>> & (n-1)). For the current use of doloop it doesn't matter because ARM 
>> is the only target using it and you set desc->noloop_assumptions to 
>> null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used. 
>> However, if a different target accepts this GTU pattern then this 
>> target agnostic code will do the wrong thing.  I suggest we either:
>>  - set noloop to what we think might be the correct value, which if 
>> you ask me should be 'count & (XEXP (condition, 1))',
>>  - or add a gcc_assert (GET_CODE (condition) != GTU); under the if
>> (desc->noloop_assumptions); part and document why.  I have a slight
>> preference for the assert given otherwise we are adding code that we
>> can't test.
> 
> Yea, that's true tbh. I've done the latter, but also separated out the 
> "case GTU:" and added a comment, so that it's more clear that the noloop 
> things aren't used in the only implemented GTU case (Arm)
> 
> Thank you :)
> 
>>
>> LGTM otherwise (but I don't have the power to approve this ;)).
>>
>> Kind regards,
>> Andre
>> ________________________________________
>> From: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>
>> Sent: Thursday, November 16, 2023 11:36 AM
>> To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
>> Richard Sandiford; Kyrylo Tkachov
>> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
>> Low Overhead Loops
>>
>> Pinging back to the top of reviewers' inboxes due to worry about Stage 1
>> End in a few days :)
>>
>>
>> See the last email for the latest version of the 2/2 patch. The 1/2
>> patch is A-Ok from Kyrill's earlier target-backend review.
>>
>>
>> On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>>>
>>> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>>>
>>>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>>>> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>>>>>> One of the main reasons for reading the arm bits was to try to answer
>>>>>>> the question: if we switch to a downcounting loop with a GE condition,
>>>>>>> how do we make sure that the start value is not a large unsigned
>>>>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>>>>> originally counted up in steps of N and used an LTU condition,
>>>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>>>> But the loop might never iterate if we start counting down from
>>>>>>> most values in that range.
>>>>>>>
>>>>>>> Does the patch handle that?
>>>>>> So AFAICT this is actually handled in the generic code in
>>>>>> `doloop_valid_p`:
>>>>>>
>>>>>> Loops of this kind fail because they are "desc->infinite", so no
>>>>>> loop-doloop conversion is attempted at all (even for standard
>>>>>> dls/le loops)
>>>>>>
>>>>>> Thanks to that check I haven't been able to trigger anything like the
>>>>>> behaviour you describe, do you think the doloop_valid_p checks are
>>>>>> robust enough?
>>>>> The loops I was thinking of are provably not infinite though. E.g.:
>>>>>
>>>>>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>>>      ...
>>>>>
>>>>> is known to terminate.  And doloop conversion is safe with the normal
>>>>> count-down-by-1 approach, so I don't think current code would need
>>>>> to reject it.  I.e. a conversion to:
>>>>>
>>>>>    unsigned int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while (--i != ~0U);
>>>>>
>>>>> would be safe, but a conversion to:
>>>>>
>>>>>    int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while ((i -= step, i > 0));
>>>>>
>>>>> wouldn't, because the loop body would only be executed once.
>>>>>
>>>>> I'm only going off the name "infinite" though :)  It's possible that
>>>>> it has more connotations than that.
>>>>>
>>>>> Thanks,
>>>>> Richard
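
To make the concern above concrete, here is a minimal C sketch (not taken
from the patch) that counts how many times the loop body would run.  The
names `expected_trips` and `signed_countdown_trips` are hypothetical, and the
cast to int32_t assumes the wrap-around conversion that GCC implements.

```c
#include <assert.h>
#include <stdint.h>

/* Trip count the original loop needs: ceil (start / step), for start > 0.  */
static uint32_t
expected_trips (uint32_t start, uint32_t step)
{
  assert (step != 0);
  return start / step + (start % step != 0);
}

/* Trip count of a do-while that counts down by STEP and continues while the
   counter, reinterpreted as signed, is still greater than zero.  The
   subtraction is done on the unsigned value to avoid signed overflow; the
   cast assumes two's-complement wrap-around (an assumption, see above).  */
static uint32_t
signed_countdown_trips (uint32_t start, uint32_t step)
{
  assert (step != 0);
  uint32_t i = start;
  uint32_t trips = 0;
  do
    {
      trips++;
      i -= step;
    }
  while ((int32_t) i > 0);
  return trips;
}
```

For small counts the two agree (9 elements in steps of 4 gives 3 trips either
way).  For a start value of UINT_MAX - 101 with a step of 4, the signed test
stops after a single trip, whereas 1073741799 trips are needed.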
>>>>
>>>> Ack, yep, I see what you mean now, and yep, that kind of loop does
>>>> indeed pass through doloop_valid_p
>>>>
>>>> Interestingly, in the v8-M Arm ARM this is done with:
>>>>
>>>> ```
>>>>
>>>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>>>> // This does not check whether a loop is currently active.
>>>> // If the PE were in a loop, would this be the last one?
>>>> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>>>
>>>> ```
>>>>
>>>> So architecturally the asm we output would be ok (except maybe the
>>>> "branch too far subs;bgt;lctp" fallback at
>>>> `predicated_doloop_end_internal` (maybe that should be `bhi`))... But
>>>> now GE isn't looking like an accurate representation of this
>>>> operation in the compiler.
>>>>
>>>> I'm wondering if I should try to make
>>>> `predicated_doloop_end_internal` contain a comparison along the lines
>>>> of:
>>>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int num_lanes_minus_1))
>>>>
>>>> I'll give that a try :)
>>>>
>>>> The only reason I'd chosen to go with GE earlier, tbh, was because of
>>>> the existing handling of GE in loop-doloop.cc
>>>>
>>>> Let me know if any other ideas come to your mind!
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Stam
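
The architectural check quoted above can be modelled in plain C to compare
the two fallback branch conditions.  This is only a sketch under stated
assumptions: `lanes_for_ltpsize`, `is_last_iteration`, `continues_bhi` and
`continues_bgt` are hypothetical names, LTPSIZE is taken as log2 of the
element size in bytes (0 for 8-bit up to 3 for 64-bit, 4 for no tail
predication), and the cast to int32_t relies on wrap-around conversion as
GCC implements it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Elements consumed per iteration: 1 << (4 - LTPSIZE), as in the
   pseudocode.  */
static uint32_t
lanes_for_ltpsize (unsigned ltpsize)
{
  assert (ltpsize <= 4);
  return UINT32_C (1) << (4 - ltpsize);
}

/* Model of IsLastLowOverheadLoop: an unsigned comparison on the count.  */
static bool
is_last_iteration (uint32_t loop_count, unsigned ltpsize)
{
  return loop_count <= lanes_for_ltpsize (ltpsize);
}

/* "subs lr, #lanes; bhi": branch back if count > lanes, unsigned.  */
static bool
continues_bhi (uint32_t count, uint32_t lanes)
{
  return count > lanes;
}

/* "subs lr, #lanes; bgt": branch back if count > lanes, signed.  */
static bool
continues_bgt (uint32_t count, uint32_t lanes)
{
  return (int32_t) count > (int32_t) lanes;
}
```

With a count of UINT_MAX - 101 and 4 lanes, `continues_bhi` agrees with
`is_last_iteration` (the loop carries on) while `continues_bgt` does not,
which is consistent with the suspicion that the fallback should use `bhi`.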
>>>
>>>
>>> It looks like I've had success with the below (diff to previous patch),
>>> trimmed a bit to only the functionally interesting things:
>>>
>>>
>>>
>>>
>>> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
>>> index 368d5138ca1..54dd4ee564b 100644
>>> --- a/gcc/config/arm/thumb2.md
>>> +++ b/gcc/config/arm/thumb2.md
>>> @@ -1649,16 +1649,28 @@
>>>            && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
>>>            && (INTVAL (decrement_num) != 1))
>>>          {
>>> -          insn = emit_insn
>>> -              (gen_thumb2_addsi3_compare0
>>> -              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
>>> -          cmp = XVECEXP (PATTERN (insn), 0, 0);
>>> -          cc_reg = SET_DEST (cmp);
>>> -          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
>>>            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
>>> -          emit_jump_insn (gen_rtx_SET (pc_rtx,
>>> -                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
>>> -                                 loc_ref, pc_rtx)));
>>> +          switch (INTVAL (decrement_num))
>>> +        {
>>> +          case 2:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal2
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 4:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal4
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 8:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal8
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 16:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal16
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          default:
>>> +            gcc_unreachable ();
>>> +        }
>>>            DONE;
>>>          }
>>>      }
>>>
>>> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
>>> index 93905583b18..c083f965fa9 100644
>>> --- a/gcc/config/arm/mve.md
>>> +++ b/gcc/config/arm/mve.md
>>> @@ -6922,23 +6922,24 @@
>>>  ;; Originally expanded by 'predicated_doloop_end'.
>>>  ;; In the rare situation where the branch is too far, we do also 
>>> need to
>>>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
>>> -(define_insn "*predicated_doloop_end_internal"
>>> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>>>    [(set (pc)
>>>      (if_then_else
>>> -       (ge (plus:SI (reg:SI LR_REGNUM)
>>> -            (match_operand:SI 0 "const_int_operand" ""))
>>> -        (const_int 0))
>>> -     (label_ref (match_operand 1 "" ""))
>>> +       (gtu (unspec:SI [(plus:SI (match_operand:SI 0
>>> "s_register_operand" "=r")
>>> +                     (const_int <letp_num_lanes_neg>))]
>>> +        LETP)
>>> +        (const_int <letp_num_lanes_minus_1>))
>>> +     (match_operand 1 "" "")
>>>       (pc)))
>>> -   (set (reg:SI LR_REGNUM)
>>> -    (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
>>> +   (set (match_dup 0)
>>> +    (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
>>>     (clobber (reg:CC CC_REGNUM))]
>>>    "TARGET_HAVE_MVE"
>>>    {
>>>      if (get_attr_length (insn) == 4)
>>>        return "letp\t%|lr, %l1";
>>>      else
>>> -      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
>>> +      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
>>>    }
>>>    [(set (attr "length")
>>>      (if_then_else
>>> @@ -6947,11 +6948,11 @@
>>>          (const_int 6)))
>>>     (set_attr "type" "branch")])
>>>
>>> -(define_insn "dlstp<mode1>_insn"
>>> +(define_insn "dlstp<dlstp_elemsize>_insn"
>>>    [
>>>      (set (reg:SI LR_REGNUM)
>>>       (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
>>>        DLSTP))
>>>    ]
>>>    "TARGET_HAVE_MVE"
>>> -  "dlstp.<mode1>\t%|lr, %0")
>>> +  "dlstp.<dlstp_elemsize>\t%|lr, %0")
>>>
>>> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
>>> index 6a72700a127..47fdef989b4 100644
>>> --- a/gcc/loop-doloop.cc
>>> +++ b/gcc/loop-doloop.cc
>>> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>        || XEXP (inc_src, 0) != reg
>>>        || !CONST_INT_P (XEXP (inc_src, 1)))
>>>      return 0;
>>> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>>>
>>>    /* Check for (set (pc) (if_then_else (condition)
>>>                                         (label_ref (label))
>>> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>    /* Extract loop termination condition.  */
>>>    condition = XEXP (SET_SRC (cmp), 0);
>>>
>>> -  /* We expect a GE or NE comparison with 0 or 1.  */
>>> -  if ((GET_CODE (condition) != GE
>>> -       && GET_CODE (condition) != NE)
>>> -      || (XEXP (condition, 1) != const0_rtx
>>> -          && XEXP (condition, 1) != const1_rtx))
>>> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison
>>> with
>>> +     dec_num - 1.  */
>>> +  if (!((GET_CODE (condition) == GE
>>> +     || GET_CODE (condition) == NE)
>>> +    && (XEXP (condition, 1) == const0_rtx
>>> +        || XEXP (condition, 1) == const1_rtx ))
>>> +      &&!(GET_CODE (condition) == GTU
>>> +      && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
>>>      return 0;
>>>
>>> -  if ((XEXP (condition, 0) == reg)
>>> +  /* For the ARM special case of having a GTU: re-form the condition
>>> without
>>> +     the unspec for the benefit of the middle-end.  */
>>> +  if (GET_CODE (condition) == GTU)
>>> +    {
>>> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT
>>> (dec_num - 1));
>>> +      return condition;
>>> +    }
>>> +  else if ((XEXP (condition, 0) == reg)
>>>        /* For the third case:  */
>>>        || ((cc_reg != NULL_RTX)
>>>        && (XEXP (condition, 0) == cc_reg)
>>>        && (reg_orig == reg))
>>> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>                         (label_ref (label))
>>>                         (pc))))])
>>>
>>> -    So we return the second form instead for the two cases when n == 1.
>>> -
>>> -    For n > 1, the final value may be exceeded, so use GE instead of 
>>> NE.
>>> +    So we return the second form instead for the two cases.
>>>       */
>>> -     if (GET_CODE (pattern) != PARALLEL)
>>> -       {
>>> -    if (INTVAL (XEXP (inc_src, 1)) != -1)
>>> -      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
>>> -    else
>>> -      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
>>> -       }
>>> -
>>> +    condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>>>      return condition;
>>> -   }
>>> +    }
>>>
>>>    /* ??? If a machine uses a funny comparison, we could return a
>>>       canonicalized form here.  */
>>> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc
>>> *desc,
>>>      case GE:
>>>        /* Currently only GE tests against zero are supported. */
>>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>>> -
>>> +      /* FALLTHRU */
>>> +    case GTU:
>>>        noloop = constm1_rtx;
>>> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
>>> index a6a7ff507a5..9398702cddd 100644
>>> --- a/gcc/config/arm/iterators.md
>>> +++ b/gcc/config/arm/iterators.md
>>> @@ -2673,8 +2673,16 @@
>>>  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
>>>  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
>>>
>>> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
>>> -            (DLSTP64 "64")])
>>> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32
>>> "32")
>>> +                 (DLSTP64 "64")])
>>> +
>>> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
>>> +                 (LETP64 "2")])
>>> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8")
>>> (LETP32 "-4")
>>> +                     (LETP64 "-2")])
>>> +
>>> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7")
>>> (LETP32 "3")
>>> +                     (LETP64 "1")])
>>>
>>>  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
>>>                 (UNSPEC_DOT_U "u8")
>>> @@ -2921,6 +2929,8 @@
>>>  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
>>>  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
>>>                     DLSTP64])
>>> +(define_int_iterator LETP [LETP8 LETP16 LETP32
>>> +               LETP64])
>>>
>>>  ;; Define iterators for VCMLA operations
>>>  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
>>>        /* The iteration count does not need incrementing for a GE
>>> test.  */
>>> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
>>> index 12ae4c4f820..2d6f27c14f4 100644
>>> --- a/gcc/config/arm/unspecs.md
>>> +++ b/gcc/config/arm/unspecs.md
>>> @@ -587,6 +587,10 @@
>>>    DLSTP16
>>>    DLSTP32
>>>    DLSTP64
>>> +  LETP8
>>> +  LETP16
>>> +  LETP32
>>> +  LETP64
>>>    VPNOT
>>>    VCREATEQ_F
>>>    VCVTQ_N_TO_F_S
>>>
>>>
>>> I've attached the whole [2/2] patch diff with this change and
>>> the required comment changes in doloop_condition_get.
>>> WDYT?
>>>
>>>
>>> Thanks,
>>>
>>> Stam
>>>
>>>
>>>>
>>>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-30 12:55                 ` Stamatis Markianos-Wright
  2023-12-07 18:08                   ` Andre Vieira (lists)
@ 2023-12-09 18:31                   ` Richard Sandiford
  2023-12-12 17:56                   ` Richard Earnshaw
  2 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2023-12-09 18:31 UTC (permalink / raw)
  To: Stamatis Markianos-Wright
  Cc: Andre Vieira (lists),
	Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw,
	Kyrylo Tkachov

Sorry for the slow review.

Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
> [...]
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 44a04b86cb5806fcf50917826512fd203d42106c..c083f965fa9a40781bc86beb6e63654afd14eac4 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -6922,23 +6922,24 @@
>  ;; Originally expanded by 'predicated_doloop_end'.
>  ;; In the rare situation where the branch is too far, we do also need to
>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
> -(define_insn "*predicated_doloop_end_internal"
> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>    [(set (pc)
>  	(if_then_else
> -	   (ge (plus:SI (reg:SI LR_REGNUM)
> -			(match_operand:SI 0 "const_int_operand" ""))
> -		(const_int 0))
> -	 (label_ref (match_operand 1 "" ""))
> +	   (gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" "=r")
> +				     (const_int <letp_num_lanes_neg>))]
> +		LETP)
> +		(const_int <letp_num_lanes_minus_1>))

Is there any need for the unspec?  I couldn't see why this wasn't simply:

  (gtu (match_operand:SI 0 "s_register_operand" "=r")
       (const_int <letp_num_lanes_minus_1>))

But I agree that using gtu rather than ge is nicer if it's what the
instruction does.
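
As a rough illustration of why the unsigned form matters (a sketch only:
the function names are made up for this example and are not part of the
patch, and the "more than one vector's worth of elements left" test is
taken from the IsLastLowOverheadLoop pseudocode quoted earlier in the
thread), compare the two branch-back decisions for a 32-bit element count:

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed view, as in "while ((i -= step, i > 0))": subtract first, then
   treat the result as a signed 32-bit value and branch back if it is > 0.
   A result above INT32_MAX would read as negative.  */
static bool continues_signed_gt (uint32_t count, uint32_t lanes)
{
  uint32_t remaining = (uint32_t) (count - lanes); /* wraps modulo 2^32 */
  return remaining != 0 && remaining <= (uint32_t) INT32_MAX;
}

/* Unsigned view: this is the last iteration when count <= lanes, so
   branch back only when count > lanes.  */
static bool continues_unsigned_gtu (uint32_t count, uint32_t lanes)
{
  return count > lanes;
}
```

For small counts the two agree, but for a count such as UINT32_MAX - 101
the signed view stops after one iteration while the unsigned view keeps
going.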

> diff --git a/gcc/df-core.cc b/gcc/df-core.cc
> index d4812b04a7cb97ea1606082e26e910472da5bcc1..4fcc14bf790d43e792b3c926fe1f80073d908c17 100644
> --- a/gcc/df-core.cc
> +++ b/gcc/df-core.cc
> @@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
>    return NULL;
>  }
>  
> +/* Return the one and only def of REGNO within BB.  If there is no def or
> +   there are multiple defs, return NULL.  */
> +
> +df_ref
> +df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
> +{
> +  df_ref temp = df_bb_regno_first_def_find (bb, regno);
> +  if (!temp)
> +    return NULL;
> +  else if (temp == df_bb_regno_last_def_find (bb, regno))
> +    return temp;
> +  else
> +    return NULL;
> +}
> +
>  /* Finds the reference corresponding to the definition of REG in INSN.
>     DF is the dataflow object.  */
>  
> diff --git a/gcc/df.h b/gcc/df.h
> index 402657a7076f1bcad24e9c50682e033e57f432f9..98623637f9c839c799222e99df2a7173a770b2ac 100644
> --- a/gcc/df.h
> +++ b/gcc/df.h
> @@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
>  #endif
>  extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
>  extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
> +extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
>  extern df_ref df_find_def (rtx_insn *, rtx);
>  extern bool df_reg_defined (rtx_insn *, rtx);
>  extern df_ref df_find_use (rtx_insn *, rtx);
> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
> index 4feb0a25ab9331b7124df900f73c9fc6fb3eb10b..d919207505c472c8a54a2c9c982a09061584177b 100644
> --- a/gcc/loop-doloop.cc
> +++ b/gcc/loop-doloop.cc
> @@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
>       forms:
>  
>       1)  (parallel [(set (pc) (if_then_else (condition)
> -	  			            (label_ref (label))
> -				            (pc)))
> -	             (set (reg) (plus (reg) (const_int -1)))
> -	             (additional clobbers and uses)])
> +					    (label_ref (label))
> +					    (pc)))
> +		     (set (reg) (plus (reg) (const_int -1)))
> +		     (additional clobbers and uses)])
>  
>       The branch must be the first entry of the parallel (also required
>       by jump.cc), and the second entry of the parallel must be a set of
> @@ -96,19 +96,34 @@ doloop_condition_get (rtx_insn *doloop_pat)
>       the loop counter in an if_then_else too.
>  
>       2)  (set (reg) (plus (reg) (const_int -1))
> -         (set (pc) (if_then_else (reg != 0)
> -	                         (label_ref (label))
> -			         (pc))).  
> +	 (set (pc) (if_then_else (reg != 0)
> +				 (label_ref (label))
> +				 (pc))).
>  
>       Some targets (ARM) do the comparison before the branch, as in the
>       following form:
>  
> -     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
> -                   (set (reg) (plus (reg) (const_int -1)))])
> -        (set (pc) (if_then_else (cc == NE)
> -                                (label_ref (label))
> -                                (pc))) */
> -
> +     3) (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
> +		   (set (reg) (plus (reg) (const_int -1)))])
> +	(set (pc) (if_then_else (cc == NE)
> +				(label_ref (label))
> +				(pc)))
> +
> +      The ARM target also supports a special case of a counter that decrements
> +      by `n` and terminating in a GTU condition.  In that case, the compare and
> +      branch are all part of one insn, containing an UNSPEC:
> +
> +      4) (parallel [
> +	    (set (pc)
> +		(if_then_else (gtu (unspec:SI [(plus:SI (reg:SI 14 lr)
> +							(const_int -n))])
> +				   (const_int n-1]))

Similarly here.

> +		    (label_ref)
> +		    (pc)))
> +	    (set (reg:SI 14 lr)
> +		 (plus:SI (reg:SI 14 lr)
> +			  (const_int -n)))
> +     */
>    pattern = PATTERN (doloop_pat);
>  
>    if (GET_CODE (pattern) != PARALLEL)
> @@ -143,7 +158,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>  	      || GET_CODE (cmp_arg1) != PLUS)
>  	    return 0;
>  	  reg_orig = XEXP (cmp_arg1, 0);
> -	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
> +	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1)
>  	      || !REG_P (reg_orig))
>  	    return 0;
>  	  cc_reg = SET_DEST (cmp_orig);
> @@ -173,15 +188,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
>    if (! REG_P (reg))
>      return 0;
>  
> -  /* Check if something = (plus (reg) (const_int -1)).
> +  /* Check if something = (plus (reg) (const_int -n)).
>       On IA-64, this decrement is wrapped in an if_then_else.  */
>    inc_src = SET_SRC (inc);
>    if (GET_CODE (inc_src) == IF_THEN_ELSE)
>      inc_src = XEXP (inc_src, 1);
>    if (GET_CODE (inc_src) != PLUS
>        || XEXP (inc_src, 0) != reg
> -      || XEXP (inc_src, 1) != constm1_rtx)
> +      || !CONST_INT_P (XEXP (inc_src, 1))
> +      || INTVAL (XEXP (inc_src, 1)) >= 0)
>      return 0;
> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>  
>    /* Check for (set (pc) (if_then_else (condition)
>                                         (label_ref (label))
> @@ -196,60 +213,71 @@ doloop_condition_get (rtx_insn *doloop_pat)
>    /* Extract loop termination condition.  */
>    condition = XEXP (SET_SRC (cmp), 0);
>  
> -  /* We expect a GE or NE comparison with 0 or 1.  */
> -  if ((GET_CODE (condition) != GE
> -       && GET_CODE (condition) != NE)
> -      || (XEXP (condition, 1) != const0_rtx
> -          && XEXP (condition, 1) != const1_rtx))
> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
> +     dec_num - 1.  */
> +  if (!((GET_CODE (condition) == GE
> +	 || GET_CODE (condition) == NE)
> +	&& (XEXP (condition, 1) == const0_rtx
> +	    || XEXP (condition, 1) == const1_rtx ))
> +      &&!(GET_CODE (condition) == GTU
> +	  && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
>      return 0;

Formatting nit: should be:

  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
     dec_num - 1.  */
  if (!((GET_CODE (condition) == GE
	 || GET_CODE (condition) == NE)
	&& (XEXP (condition, 1) == const0_rtx
	    || XEXP (condition, 1) == const1_rtx))
      && !(GET_CODE (condition) == GTU
	   && CONST_INT_P (XEXP (condition, 1))
	   && INTVAL (XEXP (condition, 1)) == dec_num - 1))
    return 0;
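
To make the shape of that test easier to check, here is a standalone model
of the acceptance logic.  It is only a sketch: the enum and the function
name are hypothetical stand-ins for the rtx codes and are not GCC code, and
the right-hand operand is assumed to already be a known integer constant
(which is what the CONST_INT_P check guarantees in the real code):

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the rtx comparison codes.  */
enum cmp_code { CMP_GE, CMP_NE, CMP_GTU, CMP_OTHER };

/* Return true for the condition shapes accepted by the check above:
   GE or NE against 0 or 1, or GTU against dec_num - 1.  The real code
   returns 0 from doloop_condition_get when this would be false.  */
static bool condition_accepted (enum cmp_code code, long rhs, int dec_num)
{
  bool ge_ne_form = ((code == CMP_GE || code == CMP_NE)
		     && (rhs == 0 || rhs == 1));
  bool gtu_form = (code == CMP_GTU && rhs == dec_num - 1);
  return ge_ne_form || gtu_form;
}
```

The rejection test in the patch is the De Morgan form of this, i.e.
"!ge_ne_form && !gtu_form".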

>  
> -  if ((XEXP (condition, 0) == reg)
> +  /* For the ARM special case of having a GTU: re-form the condition without
> +     the unspec for the benefit of the middle-end.  */
> +  if (GET_CODE (condition) == GTU)
> +    {
> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src,
> +				  GEN_INT (dec_num - 1));
> +      return condition;
> +    }

Hopefully the gen_rtx_fmt_ee wouldn't be needed then.  It should just
be enough to return the original condition.

OK for the target-independent parts with those changed if you agree.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-11-30 12:55                 ` Stamatis Markianos-Wright
  2023-12-07 18:08                   ` Andre Vieira (lists)
  2023-12-09 18:31                   ` Richard Sandiford
@ 2023-12-12 17:56                   ` Richard Earnshaw
  2 siblings, 0 replies; 17+ messages in thread
From: Richard Earnshaw @ 2023-12-12 17:56 UTC (permalink / raw)
  To: Stamatis Markianos-Wright, Andre Vieira (lists),
	Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw,
	Richard Sandiford, Kyrylo Tkachov



On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:
> Hi Andre,
> 
> Thanks for the comments, see latest revision attached.
> 
> On 27/11/2023 12:47, Andre Vieira (lists) wrote:
>> Hi Stam,
>>
>> Just some comments.
>>
>> +/* Recursively scan through the DF chain backwards within the basic 
>> block and
>> +   determine if any of the USEs of the original insn (or the USEs of 
>> the insns
>> s/Recursively scan/Scan/ as you no longer recurse, thanks for that by
>> the way :)
>>
>> +   where thy were DEF-ed, etc., recursively) were affected
>> by implicit VPT
>> remove recursively for the same reasons.
>>
>> +      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P 
>> (cond_temp_iv.step))
>> +    return NULL;
>> +      /* Look at the steps and swap around the rtx's if needed. Error 
>> out if
>> +     one of them cannot be identified as constant.  */
>> +      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL 
>> (cond_temp_iv.step) != 0)
>> +    return NULL;
>>
>> Move the comment above the if before, as the erroring out it talks 
>> about is there.
> Done
>>
>> +      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
>>  space after 'insn_note)'
>>
>> @@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>    if (! REG_P (reg))
>>      return 0;
>>  -  /* Check if something = (plus (reg) (const_int -1)).
>> +  /* Check if something = (plus (reg) (const_int -n)).
>>       On IA-64, this decrement is wrapped in an if_then_else.  */
>>    inc_src = SET_SRC (inc);
>>    if (GET_CODE (inc_src) == IF_THEN_ELSE)
>>      inc_src = XEXP (inc_src, 1);
>>    if (GET_CODE (inc_src) != PLUS
>>        || XEXP (inc_src, 0) != reg
>> -      || XEXP (inc_src, 1) != constm1_rtx)
>> +      || !CONST_INT_P (XEXP (inc_src, 1)))
>>
>> Do we ever check that inc_src is negative? We used to check if it was 
>> -1, now we only check it's a constant, but not a negative one, so I 
>> suspect this needs a:
>> || INTVAL (XEXP (inc_src, 1)) >= 0
> Good point. Done
>>
>> @@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc 
>> *desc,
>>      case GE:
>>        /* Currently only GE tests against zero are supported.  */
>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>> -
>> +      /* FALLTHRU */
>> +    case GTU:
>>        noloop = constm1_rtx;
>>
>> I spent a very long time staring at this trying to understand why 
>> noloop = constm1_rtx for GTU, where I thought it should've been (count 
>> & (n-1)). For the current use of doloop it doesn't matter because ARM 
>> is the only target using it and you set desc->noloop_assumptions to 
>> null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used. 
>> However, if a different target accepts this GTU pattern then this 
>> target agnostic code will do the wrong thing.  I suggest we either:
>>  - set noloop to what we think might be the correct value, which if 
>> you ask me should be 'count & (XEXP (condition, 1))',
>>  - or add a gcc_assert (GET_CODE (condition) != GTU); under the if 
>> (desc->noloop_assumptions); part and document why.  I have a slight 
>> preference for the assert given otherwise we are adding code that we 
>> can't test.
> 
> Yea, that's true tbh. I've done the latter, but also separated out the 
> "case GTU:" and added a comment, so that it's more clear that the noloop 
> things aren't used in the only implemented GTU case (Arm)
> 
> Thank you :)
> 
>>
>> LGTM otherwise (but I don't have the power to approve this ;)).
>>
>> Kind regards,
>> Andre
>> ________________________________________
>> From: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>
>> Sent: Thursday, November 16, 2023 11:36 AM
>> To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
>> Richard Sandiford; Kyrylo Tkachov
>> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
>> Low Overhead Loops
>>
>> Pinging back to the top of reviewers' inboxes due to worry about Stage 1
>> End in a few days :)
>>
>>
>> See the last email for the latest version of the 2/2 patch. The 1/2
>> patch is A-Ok from Kyrill's earlier target-backend review.
>>
>>
>> On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>>>
>>> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>>>
>>>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>>>> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>>>>>> One of the main reasons for reading the arm bits was to try to 
>>>>>>> answer
>>>>>>> the question: if we switch to a downcounting loop with a GE
>>>>>>> condition,
>>>>>>> how do we make sure that the start value is not a large unsigned
>>>>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>>>>> originally counted up in steps of N and used an LTU condition,
>>>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>>>> But the loop might never iterate if we start counting down from
>>>>>>> most values in that range.
>>>>>>>
>>>>>>> Does the patch handle that?
>>>>>> So AFAICT this is actually handled in the generic code in
>>>>>> `doloop_valid_p`:
>>>>>>
>>>>>> Loops of this kind fail because they are "desc->infinite", so no
>>>>>> loop-doloop conversion is attempted at all (even for standard
>>>>>> dls/le loops)
>>>>>>
>>>>>> Thanks to that check I haven't been able to trigger anything like the
>>>>>> behaviour you describe, do you think the doloop_valid_p checks are
>>>>>> robust enough?
>>>>> The loops I was thinking of are provably not infinite though. E.g.:
>>>>>
>>>>>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>>>      ...
>>>>>
>>>>> is known to terminate.  And doloop conversion is safe with the normal
>>>>> count-down-by-1 approach, so I don't think current code would need
>>>>> to reject it.  I.e. a conversion to:
>>>>>
>>>>>    unsigned int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while (--i != ~0U);
>>>>>
>>>>> would be safe, but a conversion to:
>>>>>
>>>>>    int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while ((i -= step, i > 0));
>>>>>
>>>>> wouldn't, because the loop body would only be executed once.
>>>>>
>>>>> I'm only going off the name "infinite" though :)  It's possible that
>>>>> it has more connotations than that.
>>>>>
>>>>> Thanks,
>>>>> Richard
>>>>
>>>> Ack, yep, I see what you mean now, and yep, that kind of loop does
>>>> indeed pass through doloop_valid_p
>>>>
>>>> Interestingly, in the v8-M Arm ARM this is done with:
>>>>
>>>> ```
>>>>
>>>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>>>> // This does not check whether a loop is currently active.
>>>> // If the PE were in a loop, would this be the last one?
>>>> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>>>
>>>> ```
>>>>
>>>> So architecturally the asm we output would be ok (except maybe the
>>>> "branch too far subs;bgt;lctp" fallback at
>>>> `predicated_doloop_end_internal` (maybe that should be `bhi`))... But
>>>> now GE: isn't looking like an accurate representation of this
>>>> operation in the compiler.
>>>>
>>>> I'm wondering if I should try to make
>>>> `predicated_doloop_end_internal` contain a comparison along the lines
>>>> of:
>>>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int 
>>>> num_lanes_minus_1))
>>>>
>>>> I'll give that a try :)
>>>>
>>>> The only reason I'd chosen to go with GE earlier, tbh, was because of
>>>> the existing handling of GE in loop-doloop.cc
>>>>
>>>> Let me know if any other ideas come to your mind!
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Stam
>>>
>>>
>>> It looks like I've had success with the below (diff to previous patch),
>>> trimmed a bit to only the functionally interesting things:
>>>
>>>
>>>
>>>
>>> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
>>> index 368d5138ca1..54dd4ee564b 100644
>>> --- a/gcc/config/arm/thumb2.md
>>> +++ b/gcc/config/arm/thumb2.md
>>> @@ -1649,16 +1649,28 @@
>>>            && (decrement_num = arm_attempt_dlstp_transform 
>>> (operands[1]))
>>>            && (INTVAL (decrement_num) != 1))
>>>          {
>>> -          insn = emit_insn
>>> -              (gen_thumb2_addsi3_compare0
>>> -              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
>>> -          cmp = XVECEXP (PATTERN (insn), 0, 0);
>>> -          cc_reg = SET_DEST (cmp);
>>> -          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
>>>            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
>>> -          emit_jump_insn (gen_rtx_SET (pc_rtx,
>>> -                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
>>> -                                 loc_ref, pc_rtx)));
>>> +          switch (INTVAL (decrement_num))
>>> +        {
>>> +          case 2:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal2
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 4:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal4
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 8:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal8
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 16:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal16
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          default:
>>> +            gcc_unreachable ();
>>> +        }
>>>            DONE;
>>>          }
>>>      }
>>>
>>> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
>>> index 93905583b18..c083f965fa9 100644
>>> --- a/gcc/config/arm/mve.md
>>> +++ b/gcc/config/arm/mve.md
>>> @@ -6922,23 +6922,24 @@
>>>  ;; Originally expanded by 'predicated_doloop_end'.
>>>  ;; In the rare situation where the branch is too far, we do also 
>>> need to
>>>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
>>> -(define_insn "*predicated_doloop_end_internal"
>>> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>>>    [(set (pc)
>>>      (if_then_else
>>> -       (ge (plus:SI (reg:SI LR_REGNUM)
>>> -            (match_operand:SI 0 "const_int_operand" ""))
>>> -        (const_int 0))
>>> -     (label_ref (match_operand 1 "" ""))
>>> +       (gtu (unspec:SI [(plus:SI (match_operand:SI 0
>>> "s_register_operand" "=r")
>>> +                     (const_int <letp_num_lanes_neg>))]
>>> +        LETP)
>>> +        (const_int <letp_num_lanes_minus_1>))
>>> +     (match_operand 1 "" "")
>>>       (pc)))
>>> -   (set (reg:SI LR_REGNUM)
>>> -    (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
>>> +   (set (match_dup 0)
>>> +    (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
>>>     (clobber (reg:CC CC_REGNUM))]
>>>    "TARGET_HAVE_MVE"
>>>    {
>>>      if (get_attr_length (insn) == 4)
>>>        return "letp\t%|lr, %l1";
>>>      else
>>> -      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
>>> +      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
>>>    }
>>>    [(set (attr "length")
>>>      (if_then_else
>>> @@ -6947,11 +6948,11 @@
>>>          (const_int 6)))
>>>     (set_attr "type" "branch")])
>>>
>>> -(define_insn "dlstp<mode1>_insn"
>>> +(define_insn "dlstp<dlstp_elemsize>_insn"
>>>    [
>>>      (set (reg:SI LR_REGNUM)
>>>       (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
>>>        DLSTP))
>>>    ]
>>>    "TARGET_HAVE_MVE"
>>> -  "dlstp.<mode1>\t%|lr, %0")
>>> +  "dlstp.<dlstp_elemsize>\t%|lr, %0")
>>>
>>> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
>>> index 6a72700a127..47fdef989b4 100644
>>> --- a/gcc/loop-doloop.cc
>>> +++ b/gcc/loop-doloop.cc
>>> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>        || XEXP (inc_src, 0) != reg
>>>        || !CONST_INT_P (XEXP (inc_src, 1)))
>>>      return 0;
>>> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>>>
>>>    /* Check for (set (pc) (if_then_else (condition)
>>>                                         (label_ref (label))
>>> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>    /* Extract loop termination condition.  */
>>>    condition = XEXP (SET_SRC (cmp), 0);
>>>
>>> -  /* We expect a GE or NE comparison with 0 or 1.  */
>>> -  if ((GET_CODE (condition) != GE
>>> -       && GET_CODE (condition) != NE)
>>> -      || (XEXP (condition, 1) != const0_rtx
>>> -          && XEXP (condition, 1) != const1_rtx))
>>> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison
>>> with
>>> +     dec_num - 1.  */
>>> +  if (!((GET_CODE (condition) == GE
>>> +     || GET_CODE (condition) == NE)
>>> +    && (XEXP (condition, 1) == const0_rtx
>>> +        || XEXP (condition, 1) == const1_rtx ))
>>> +      &&!(GET_CODE (condition) == GTU
>>> +      && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
>>>      return 0;
>>>
>>> -  if ((XEXP (condition, 0) == reg)
>>> +  /* For the ARM special case of having a GTU: re-form the condition
>>> without
>>> +     the unspec for the benefit of the middle-end.  */
>>> +  if (GET_CODE (condition) == GTU)
>>> +    {
>>> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT
>>> (dec_num - 1));
>>> +      return condition;
>>> +    }
>>> +  else if ((XEXP (condition, 0) == reg)
>>>        /* For the third case:  */
>>>        || ((cc_reg != NULL_RTX)
>>>        && (XEXP (condition, 0) == cc_reg)
>>>        && (reg_orig == reg))
>>> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>                         (label_ref (label))
>>>                         (pc))))])
>>>
>>> -    So we return the second form instead for the two cases when n == 1.
>>> -
>>> -    For n > 1, the final value may be exceeded, so use GE instead of 
>>> NE.
>>> +    So we return the second form instead for the two cases.
>>>       */
>>> -     if (GET_CODE (pattern) != PARALLEL)
>>> -       {
>>> -    if (INTVAL (XEXP (inc_src, 1)) != -1)
>>> -      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
>>> -    else
>>> -      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
>>> -       }
>>> -
>>> +    condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>>>      return condition;
>>> -   }
>>> +    }
>>>
>>>    /* ??? If a machine uses a funny comparison, we could return a
>>>       canonicalized form here.  */
>>> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc
>>> *desc,
>>>      case GE:
>>>        /* Currently only GE tests against zero are supported. */
>>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>>> -
>>> +      /* FALLTHRU */
>>> +    case GTU:
>>>        noloop = constm1_rtx;
>>> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
>>> index a6a7ff507a5..9398702cddd 100644
>>> --- a/gcc/config/arm/iterators.md
>>> +++ b/gcc/config/arm/iterators.md
>>> @@ -2673,8 +2673,16 @@
>>>  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
>>>  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
>>>
>>> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
>>> -            (DLSTP64 "64")])
>>> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32
>>> "32")
>>> +                 (DLSTP64 "64")])
>>> +
>>> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
>>> +                 (LETP64 "2")])
>>> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") (LETP32 "-4")
>>> +                     (LETP64 "-2")])
>>> +
>>> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") (LETP32 "3")
>>> +                     (LETP64 "1")])
>>>
>>>  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
>>>                 (UNSPEC_DOT_U "u8")
>>> @@ -2921,6 +2929,8 @@
>>>  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
>>>  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
>>>                     DLSTP64])
>>> +(define_int_iterator LETP [LETP8 LETP16 LETP32
>>> +               LETP64])
>>>
>>>  ;; Define iterators for VCMLA operations
>>>  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
>>>        /* The iteration count does not need incrementing for a GE test.  */
>>> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
>>> index 12ae4c4f820..2d6f27c14f4 100644
>>> --- a/gcc/config/arm/unspecs.md
>>> +++ b/gcc/config/arm/unspecs.md
>>> @@ -587,6 +587,10 @@
>>>    DLSTP16
>>>    DLSTP32
>>>    DLSTP64
>>> +  LETP8
>>> +  LETP16
>>> +  LETP32
>>> +  LETP64
>>>    VPNOT
>>>    VCREATEQ_F
>>>    VCVTQ_N_TO_F_S
>>>
>>>
>>> I've attached the whole [2/2] patch diff with this change and
>>> the required comment changes in doloop_condition_get.
>>> WDYT?
>>>
>>>
>>> Thanks,
>>>
>>> Stam
>>>
>>>
>>>>
>>>>

[I'm still working through this patch, but there are a number of things 
which clearly need addressing, so I'll stop at this point today.]

+arm_predict_doloop_p (struct loop *loop)
+{

Is it feasible to add something here to check the overall size of the 
loop, so that we don't try to convert loops that are clearly too big?

+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
...
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+  rtx_insn* vctp_reg_set = DF_REF_INSN (vctp_reg_set_df);
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (!(GET_CODE (XEXP (PATTERN (vctp_reg_set), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (vctp_reg_set), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 0))))

You seem to be assuming that the pattern in the insn you've found will
always be of the form (set x y), but that's unsafe.  When scanning RTL
you must first check that you do have a genuine single set.  The easiest
way to do that is to call single_set (insn), which returns the SET RTL
if it is a single set, and NULL otherwise; it is also able to handle
irrelevant clobbers which might be added for register allocation purposes.
Also, rather than using XEXP, you should use SET_SRC and SET_DEST when
looking at the arms of a SET operation, for better clarity.

+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);

There's a similar problem here, in that single_set returns a valid value 
for something like
   (parallel [(set a b)
              (clobber scratch_reg)])

so looking directly at the pattern of the insn is wrong.  Instead you 
should use the value returned by single_set for further analysis.
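
To make the point above concrete, here is a minimal, self-contained
sketch.  It does not use GCC's headers: struct toy_rtx and every toy_*
name are hypothetical stand-ins invented for this illustration, and
toy_single_set only models the "look through clobbers" part of what
single_set does.  It shows the analysis being driven by the SET that
was returned, rather than by operand 1 of the insn's pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for RTL.  None of these names exist in GCC.  */
enum toy_code { TOY_REG, TOY_CONST, TOY_PLUS, TOY_SET, TOY_CLOBBER,
                TOY_PARALLEL };

struct toy_rtx
{
  enum toy_code code;
  int value;                    /* Register number or constant value.  */
  const struct toy_rtx *op[2];  /* Operands; for PARALLEL, its elements.  */
};

/* Simplified analogue of single_set: return the one SET in PAT, looking
   through a PARALLEL whose other elements are CLOBBERs, else NULL.  */
const struct toy_rtx *
toy_single_set (const struct toy_rtx *pat)
{
  assert (pat != NULL);
  if (pat->code == TOY_SET)
    return pat;
  if (pat->code != TOY_PARALLEL)
    return NULL;

  const struct toy_rtx *set = NULL;
  for (int i = 0; i < 2; i++)
    {
      const struct toy_rtx *elt = pat->op[i];
      if (elt == NULL || elt->code == TOY_CLOBBER)
        continue;
      if (elt->code != TOY_SET || set != NULL)
        return NULL;
      set = elt;
    }
  return set;
}

/* Return C if PAT is a single set of the form
   (set (reg a) (plus (reg a) (const C))), else 0.  */
int
toy_self_increment_step (const struct toy_rtx *pat)
{
  const struct toy_rtx *set = toy_single_set (pat);
  if (set == NULL)
    return 0;

  const struct toy_rtx *dest = set->op[0];  /* SET_DEST analogue.  */
  const struct toy_rtx *src = set->op[1];   /* SET_SRC analogue.  */
  if (dest->code != TOY_REG || src->code != TOY_PLUS)
    return 0;
  if (src->op[0]->code != TOY_REG || src->op[0]->value != dest->value)
    return 0;
  if (src->op[1]->code != TOY_CONST)
    return 0;
  return src->op[1]->value;
}
```

In this model, operand 1 of a (parallel [(set ...) (clobber ...)])
is the clobber, not the source of the set, which is why indexing the
pattern directly gives the wrong answer for such insns.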

+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))

And again, you need to validate that dec_insn is a set first.  There are
several other cases where you need to check for a SET as well, but I 
won't mention any more here.

Can this code be run before register allocation? If so, there's a risk 
that we will have different pseudos for the source and dest operands 
here, but expect register allocation to tie them back together; 
something like

t1 = count
loop_head:
   ...
   t2 = t1
   ...
   t1 = t2 + const
   if (t1 < end)
      goto loop_head;

Register allocation would be expected to eliminate the t2 = t1 insn by 
allocating the same physical register here.


+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));

Why the use of abs()?  I don't see where we validate that the direction
really matches our expectation.

+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))

And again, we only look at the absolute value.  Shouldn't we be 
validating the sign here against the sign we ignored earlier?
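
As a rough illustration of the concern, the sketch below contrasts a
magnitude-only comparison with one that also checks the direction.
Both helper names are hypothetical and not taken from the patch, and it
assumes that the expected direction is a counter that decreases by the
number of lanes on each iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical: mirrors a check that only looks at the magnitude of
   the step, so +LANES and -LANES are both accepted.  */
bool
step_matches_lanes_abs_only (long step, int lanes)
{
  assert (lanes > 0);
  return labs (step) == lanes;
}

/* Hypothetical: accept the step only if the counter goes down by
   exactly LANES per iteration (assumed direction, see above).  */
bool
step_matches_lanes_signed (long step, int lanes)
{
  assert (lanes > 0);
  return step == -(long) lanes;
}
```

With a step of +4 and 4 lanes, the magnitude-only form accepts the loop
while the signed form rejects it; both accept a step of -4.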

+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))

Again, the test for TARGET_32BIT seems redundant.

+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)

Don't call single_set twice, just save the result of the first call and 
use that.  Perhaps
   rtx next_use1_set = single_set (next_use1);
   if (next_use1_set && ...)

+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)

Please describe what is returned in the comment.  I'm guessing it's some 
form of iteration count, but why return 1 on failure?

+    return GEN_INT (1);

It's more efficient to write "return const1_rtx;".  In fact, it looks 
like this function always returns a CONST_INT and is only ever called 
from one place in thumb2.md, where we only ever look at the integer 
value.  So why not make it return a HOST_WIDE_INT in the first place?


+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+
I think this function needs to also copy across the INSN_LOCATION 
information from the insn it is rewriting; that way we keep any 
diagnostic information and debug information more accurate.  Something like:

+    }
 >> INSN_LOCATION (new_insn) = INSN_LOCATION (insn);
+  return true;

That will be enough if we never get more than one insn emitted from the
emit_insn sequence above (emit_insn returns new_insn, which you'll need 
to save).  If you need something more complex, then perhaps you could use
   emit_insn_after_setloc (GEN_FCN..., get_last_insn (),
			  INSN_LOCATION (insn));

+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+

I'm not sure I follow why you can't replace this entire loop with

   emit_insn_after (seq, BB_END (body));

which should do all of the above for you.  But there's another problem here:

+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));

Notes have data (see emit_note_copy in emit-rtl.cc).  If you can't use 
that function, you'll need to copy the note data manually in the same 
way as emit_note_copy does (or add emit_note_copy_after() as a new 
function in emit-rtl.cc).

I also note that you're already copying the note in 
arm_attempt_dlstp_transform and that also drops the note's data, but 
that really can use emit_note_copy().

thumb2.md:
+	  /* If we have a compatibe MVE target, try and analyse the loop

Typo: compatible

+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+

...

+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */

Please do not mix scan-assembler tests with execution tests.  The latter
require specific hardware to run, while the former do not, which means
these tests get run by far fewer configurations of the compiler.
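
One possible way to split such a test is sketched below, reusing only
the directives already quoted above.  The two file names are
hypothetical, the test bodies are left out, and whether the run variant
still needs -save-temps is an assumption rather than something checked
here.

```c
/* Hypothetical file: dlstp-int16x8.c (compile only, no MVE hardware
   needed).  */
/* { dg-do compile } */
/* { dg-require-effective-target arm_v8_1m_mve_ok } */
/* { dg-options "-O2 -save-temps" } */
/* { dg-add-options arm_v8_1m_mve } */
/* Test body elided.  */
/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */

/* Hypothetical file: dlstp-int16x8-run.c (execution only).  */
/* { dg-do run { target { arm*-*-* } } } */
/* { dg-require-effective-target arm_v8_1m_mve_ok } */
/* { dg-require-effective-target arm_mve_hw } */
/* { dg-options "-O2" } */
/* { dg-add-options arm_v8_1m_mve } */
/* Test body elided.  */
```

With a split along these lines, the assembler scan would still run on
configurations that cannot execute MVE code.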





end of thread, other threads:[~2023-12-12 17:56 UTC | newest]

Thread overview: 17+ messages
2023-08-17 10:31 [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops Stamatis Markianos-Wright
2023-09-06 17:19 ` [PING][PATCH " Stamatis Markianos-Wright
2023-09-14 12:10   ` Kyrylo Tkachov
2023-09-28 12:51     ` Andre Vieira (lists)
2023-10-11 11:34       ` Stamatis Markianos-Wright
2023-10-23 10:16         ` Andre Vieira (lists)
2023-10-24 15:11   ` Richard Sandiford
2023-11-06 11:03     ` Stamatis Markianos-Wright
2023-11-06 11:24       ` Richard Sandiford
2023-11-06 17:29         ` Stamatis Markianos-Wright
2023-11-10 12:41           ` Stamatis Markianos-Wright
2023-11-16 11:36             ` Stamatis Markianos-Wright
2023-11-27 12:47               ` Andre Vieira (lists)
2023-11-30 12:55                 ` Stamatis Markianos-Wright
2023-12-07 18:08                   ` Andre Vieira (lists)
2023-12-09 18:31                   ` Richard Sandiford
2023-12-12 17:56                   ` Richard Earnshaw
