[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
@ 2022-11-11 17:40 Stam Markianos-Wright
  2022-11-15 15:51 ` Andre Vieira (lists)
  0 siblings, 1 reply; 14+ messages in thread
From: Stam Markianos-Wright @ 2022-11-11 17:40 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 60777 bytes --]

Hi all,

This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.

Changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
    decrement numbers other that -1 and for `count` to be an
    rtx containing the number of elements to be processed, rather
    than an expression for calculating the number of iterations.
2) Add a `allow_elementwise_doloop` target hook. This allows the
    target backend to manipulate the iteration count as it needs:
    in our case to change it from a pre-calculation of the number
    of iterations to the number of elements to be processed.
3) The doloop_end target-insn now had an additional parameter:
    the `count` (note: this is before it gets modified to just be
    the number of elements), so that the decrement value is
    extracted from that parameter.

And many things in the backend to implement the above optimisation:

4)  Appropriate changes to the define_expand of doloop_end and new
     patterns for dlstp and letp.
5) `arm_attempt_dlstp_transform`: (called from the define_expand of
     doloop_end) this function checks for the loop's suitability for
     dlstp/letp transformation and then implements it, if possible.
6) `arm_mve_get_loop_unique_vctp`: A function that loops through
     the loop contents and returns the vctp VPR-genereting operation
     within the loop, if it is unique and there is exclusively one
     vctp within the loop.
7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map
    from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
    to check an insn to see if it requires the VPR or not.

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

         * config/aarch64/aarch64.md: Add extra doloop_end arg.
         * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
         * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
         (arm_mve_get_vctp_lanes): New.
         (arm_get_required_vpr_reg): New.
         (arm_mve_get_loop_unique_vctp): New.
         (arm_attempt_dlstp_transform): New.
         (arm_allow_elementwise_doloop): New.
         * config/arm/iterators.md:
         * config/arm/mve.md (*predicated_doloop_end_internal): New.
         (dlstp<mode1>_insn): New.
         * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
         * config/arm/unspecs.md: New unspecs.
         * config/ia64/ia64.md: Add extra doloop_end arg.
         * config/pru/pru.md: Add extra doloop_end arg.
         * config/rs6000/rs6000.md: Add extra doloop_end arg.
         * config/s390/s390.md: Add extra doloop_end arg.
         * config/v850/v850.md: Add extra doloop_end arg.
         * doc/tm.texi: Document new hook.
         * doc/tm.texi.in: Likewise.
         * loop-doloop.cc (doloop_condition_get): Relax conditions.
         (doloop_optimize): Add support for elementwise LoLs.
         * target-insns.def (doloop_end): Add extra arg.
         * target.def (allow_elementwise_doloop): New hook.
         * targhooks.cc (default_allow_elementwise_doloop): New.
         * targhooks.h (default_allow_elementwise_doloop): New.

gcc/testsuite/ChangeLog:

         * gcc.target/arm/lob.h: Update framework.
         * gcc.target/arm/lob1.c: Likewise.
         * gcc.target/arm/lob6.c: Likewise.
         * gcc.target/arm/dlstp-int16x8.c: New test.
         * gcc.target/arm/dlstp-int32x4.c: New test.
         * gcc.target/arm/dlstp-int64x2.c: New test.
         * gcc.target/arm/dlstp-int8x16.c: New test.


### Inline copy of patch ###

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 
100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -7366,7 +7366,8 @@
  ;; knows what to generate.
  (define_expand "doloop_end"
    [(use (match_operand 0 "" ""))      ; loop pseudo
-   (use (match_operand 1 "" ""))]     ; label
+   (use (match_operand 1 "" ""))      ; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
    "optimize > 0 && flag_modulo_sched"
  {
    rtx s0;
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 
550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 
100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, 
rtx *, rtx *, rtx *);
  extern bool arm_q_bit_access (void);
  extern bool arm_ge_bits_access (void);
  extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern rtx arm_attempt_dlstp_transform (rtx, rtx);
  #ifdef RTX_CODE
  enum reg_class
  arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 
ee8f1babf8a1319e77e0db0fa55851c038048804..99e144d52c26597c64b982b3d4ae9a62a114cf18 
100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -470,6 +470,9 @@ static const struct attribute_spec 
arm_attribute_table[] =
  #undef TARGET_SCHED_REORDER
  #define TARGET_SCHED_REORDER arm_sched_reorder

+#undef TARGET_ALLOW_ELEMENTWISE_DOLOOP
+#define TARGET_ALLOW_ELEMENTWISE_DOLOOP arm_allow_elementwise_doloop
+
  #undef TARGET_REGISTER_MOVE_COST
  #define TARGET_REGISTER_MOVE_COST arm_register_move_cost

@@ -34138,8 +34141,339 @@ arm_target_insn_ok_for_lob (rtx insn)

    return single_succ_p (bb)
      && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC)
+    {
+      switch (XINT (XEXP (x, 1), 1))
+    {
+      case VCTP8Q:
+        return 16;
+      case VCTP16Q:
+        return 8;
+      case VCTP32Q:
+        return 4;
+      case VCTP64Q:
+        return 2;
+      default:
+        break;
+    }
+    }
+  return 0;
+}
+
+/* Check if an insn requires the use of the VPR_REG, if it does, return the
+   sub-rtx of the matched operand.  If there are more than one operand 
(e.g. an
+   input operand and an output operand) that use VPR_REG, return the first
+   occurance, which is usually the output operand.  */
+
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn)
+{
+  bool requires_vpr;
+
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int use = 0; use < n_operands; use++)
+    {
+      requires_vpr = true;
+      /* Iterate through alternatives of operand "use" in recog_op_alt and
+       * identify if the operand is required to be the VPR.  */
+      for (int alt1 = 0; alt1 < recog_data.n_alternatives; alt1++)
+    {
+      const operand_alternative *op_alt1
+          = &recog_op_alt[alt1 * n_operands];
+      /* Fetch the reg_class for each entry and check it against the
+       * VPR_REG reg_class.  */
+      if (alternative_class (op_alt1, use) != VPR_REG)
+        requires_vpr = false;
+    }
+      /* If all alternatives of the insn require the VPR reg for this 
operand,
+     it means that either this is VPR-generating instruction, like a vctp,
+     vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+     of the VPR reg operand.  */
+      if (requires_vpr)
+    return recog_data.operand[use];
+    }
+  return NULL_RTX;
+}
+
+/* Scan the basic block of a loop body for a vctp instruction. If there is
+   exactly one unique vctp instruction, return its rtx_insn *. */
+
+static rtx_insn *
+arm_mve_get_loop_unique_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+  rtx_insn *vctp_op = NULL;
+
+  /* Now scan through all the instruction patterns and
+     pick out any MVE instructions.  */
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (INSN_P (insn))
+    {
+      /* First check if this is a vctp instruction.  There needs to be
+         exactly one vctp instruction within the loop.  */
+      if (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+        {
+          /* If we already found one vctp instruction, then the
+         loop is not consistent internally.  */
+          if (vctp_op)
+        return NULL;
+
+          vctp_op = insn;
+        }
+    }
+    }
+  return vctp_op;
+}
+
+rtx
+arm_attempt_dlstp_transform (rtx label, rtx count)
+{
+  int decrementnum;
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+  rtx initial_compare;
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp
+     when the iteration counter gets deprecated by the number of MVE
+     lanes.  This can be exxtracted from the `count`, which is the 
expression
+     used to calculate the number of iterations that the loop would execute
+     for a standard dls/le loop.  Since we only support cases where 
this is a
+     power of 2, we can assume that this expression arrives here as:
+       (lshiftrt: (A) (const_int y))
+     Then we can extract the decrementnum from y.  */
+  if (GET_CODE (count) == LSHIFTRT && ARITHMETIC_P (XEXP (count, 0))
+      && (decrementnum = (1 << (INTVAL (XEXP (count, 1)))))
+      /* There is one final condition that needs to be met for the loop 
to be
+     transformable: dlstp/letp will continue looping until there are
+     elements still to process.  This can only work if the looping ends
+     when the element counter reaches zero and not some other value
+     (e.g. n > 0 works, not n > 1), or we can incorrectly end up running
+     one additional iteration.  To by-pass any hoisting that the compiler
+     may have done with the `A` in `count` above, we can instead look up
+     to the bb before the loop preheader: this should end with a cmp+jump
+     pair, where the cmp needs to be with (const_int 0).  */
+      && loop_preheader_edge (body->loop_father)->src->prev_bb
+      && BB_END (loop_preheader_edge (body->loop_father)->src->prev_bb)
+      && PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+                ->src->prev_bb))
+      && INSN_P (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+                    ->src->prev_bb)))
+      && (initial_compare
+      = PATTERN (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+                        ->src->prev_bb))))
+      && GET_CODE (initial_compare) == SET
+      && cc_register (XEXP (initial_compare, 0), VOIDmode)
+      && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+      && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+      && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0)
+    {
+      /* Find the vctp predicate generation inside the loop body BB.  */
+      rtx_insn *vctp_insn = arm_mve_get_loop_unique_vctp (body);
+
+      /* If we have successfully found one exactly vctp 
predicate-generating
+     instruction within the loop and the number by which we deprecate the
+     loop counter in each iteration matches the number of lanes of the
+     vctp instruction, we can attempt to turn this into a dlstp/letp loop.
+     */
+      if (vctp_insn
+      && decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      rtx_insn *insn = 0;
+      rtx_insn *cur_insn = 0;
+      rtx_insn *seq;
+      rtx vctp_vpr_generated = NULL_RTX;
+      rtx insn_vpr_reg_operand = NULL_RTX;
+      bool transform_worked = true;
+      int new_icode;
+
+      /* Scan through the insns in the loop bb and emit the transformed bb
+         insns to a sequence.  */
+      start_sequence ();
+      FOR_BB_INSNS (body, insn)
+        {
+          if (INSN_P (insn))
+        {
+          /* When we find the vctp instruction: This may be followed by
+             a sign-extend insn to SImode.  If it is, then save the
+             sign-extended REG into vctp_vpr_generated.  If there is no
+             sign-extend, then store the raw output of the vctp.
+             For any VPT-predicated instructions we need to ensure that
+             the VPR they use is the same as the one given here and
+             they often consume the output of a subreg of the SImode
+             sign-extended VPR-reg.  As a result, comparing against the
+             output of the sign-extend is more likely to succeed.
+             This code also guarantees to us that the vctp comes before
+             any instructions that use the VPR within the loop, for the
+             dlstp/letp transform to succeed.  */
+          if (insn == vctp_insn)
+            {
+              if (GET_CODE (XEXP (PATTERN (NEXT_INSN (insn)), 1))
+                  == SIGN_EXTEND
+              && GET_CODE (XEXP (
+                 PATTERN (NEXT_INSN (NEXT_INSN (insn))), 1))
+                 == SUBREG)
+            vctp_vpr_generated
+                = XEXP (PATTERN (NEXT_INSN (NEXT_INSN (insn))), 0);
+              else
+            vctp_vpr_generated = XEXP (PATTERN (insn), 0);
+              /* Also emit a USE of the source register of the vctp.
+             This holds the number of elements being processed
+             by the loop.  This later gets stored into `count`.
+             */
+              emit_use (XVECEXP (XEXP (PATTERN (insn), 1), 0, 0));
+              continue;
+            }
+          /* If the insn pattern requires the use of the VPR, then it
+             a VPT-predicated instruction, so it will need to be
+             transformed into the non-predicated version of the
+             instruction.  */
+          else if ((insn_vpr_reg_operand
+                = arm_get_required_vpr_reg (insn))
+               != NULL_RTX)
+            {
+              /* If the VPR value is different to the one generated by
+             the vctp, then fail the conversion.  */
+              if (!rtx_equal_p (vctp_vpr_generated,
+                    insn_vpr_reg_operand))
+            {
+              transform_worked = false;
+              break;
+            }
+              /* Also ensure that it's a valid recog-ed instruction with
+             the mve_unpredicated_insn atrribute.  */
+              else if (recog_memoized (insn) >= 0
+                   && (new_icode
+                   = get_attr_mve_unpredicated_insn (insn)))
+            {
+              extract_insn (insn);
+              rtx arr[8];
+              int j = 0;
+
+              /* When transforming a VPT-predicated instruction
+                 into its unpredicated equivalent we need to drop
+                 the VPR operand and we may need to also drop a
+                 merge "vuninit" input operand, depending on the
+                 instruction pattern.  Here ensure that we have at
+                 most a two-operand difference between the two
+                 instrunctions.  */
+              int n_operands_diff
+                  = recog_data.n_operands
+                - insn_data[new_icode].n_operands;
+              gcc_assert (n_operands_diff > 0
+                      && n_operands_diff <= 2);
+
+              /* Then, loop through the operands of the predicated
+                 instruction, and retain the ones that map to the
+                 unpredicated instruction.  */
+              for (int i = 0; i < recog_data.n_operands; i++)
+                {
+                  /* Ignore the VPR and, if needed, the vuninit
+                 operand.  */
+                  if (insn_vpr_reg_operand == recog_data.operand[i]
+                  || (n_operands_diff == 2
+                      && !strcmp (recog_data.constraints[i],
+                          "0")))
+                continue;
+                  else
+                {
+                  arr[j] = recog_data.operand[i];
+                  j++;
+                }
+                }
+
+              /* Finally, emit the upredicated instruction.  */
+              switch (j)
+                {
+                  case 2:
+                emit_insn (GEN_FCN (new_icode) (arr[0],
+                                arr[1]));
+                break;
+                  case 3:
+                emit_insn (GEN_FCN (new_icode) (arr[0], arr[1],
+                                arr[2]));
+                break;
+                  default:
+                gcc_unreachable ();
+                }
+            }
+              /* If we can't identify the INSN as either being either
+             for deletion or to re-map, then we don't know how to
+             handle it, so fail the whole conversion.  */
+              else
+            {
+              transform_worked = false;
+              break;
+            }
+            }
+          /* Instructions that dont's require the VPR can be carried
+             over as-is.  */
+          else
+            emit_insn (PATTERN (insn));
+        }
+        }
+      seq = get_insns ();
+      end_sequence ();
+
+      if (transform_worked)
+        {
+          /* Re-write the entire BB contents with the transformed
+         sequence.  */
+          FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+        if (INSN_P (insn))
+          delete_insn (insn);
+          for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+        emit_insn_after (PATTERN (insn), BB_END (body));
+          emit_jump_insn_after (PATTERN (insn), BB_END (body));
+          return GEN_INT (decrementnum);
+        }
+    }
+    }
+  /* Bail out: we can't use dlstp/letp, so return 1 to allow 
loop-doloop to try
+     the standard dls/le pair.  */
+  return GEN_INT (1);
+}
+
+/* Target hook to the number of elements to be processed by a 
dlstp/letp loop
+   into `count` to intialise the counter register.  The number of 
elements was
+   previously extracted from the vctp insn and placed into a USE rtx.
+   We only check that the doloop_end pattern successfully decrements by a
+   number other than -1 for a valid dlstp/letp loop.  No other checking is
+   needed as that was done previously.  */
+
+rtx
+arm_allow_elementwise_doloop (rtx count, rtx label, rtx doloop)
+{
+  if (doloop
+      && INTVAL (XEXP (SET_SRC (XVECEXP (PATTERN (doloop), 0, 1)), 1)) 
!= -1
+      && ARITHMETIC_P (XEXP (count, 0)))
+    {
+      basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+      rtx_insn* insn;
+      FOR_BB_INSNS (body, insn)
+    {
+      if (INSN_P (insn) && GET_CODE (PATTERN (insn)) == USE)
+        {
+          rtx num_elem_reg = copy_rtx (XEXP (PATTERN (insn), 0));
+          delete_insn (insn);
+          return num_elem_reg;
+        }
+    }
+    }
+  else
+    return count;
  }

  #if CHECKING_P
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 
29062cd6fb3c61c22417ef8fc25abd7819d2c034..aa2fdac22f3c7884c18ebf197b26a2d54da44c93 
100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -1464,7 +1464,9 @@

  (define_int_attr mode1 [(VCTP8Q "8") (VCTP16Q "16") (VCTP32Q "32")
              (VCTP64Q "64") (VCTP8Q_M "8") (VCTP16Q_M "16")
-            (VCTP32Q_M "32") (VCTP64Q_M "64")])
+            (VCTP32Q_M "32") (VCTP64Q_M "64")
+            (DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+            (DLSTP64 "64")])

  ;; Both kinds of return insn.
  (define_code_iterator RETURNS [return simple_return])
@@ -1773,6 +1775,8 @@
  (define_int_iterator UQRSHLLQ [UQRSHLL_64 UQRSHLL_48])
  (define_int_iterator SQRSHRLQ [SQRSHRL_64 SQRSHRL_48])
  (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+                   DLSTP64])

  ;; Define iterators for VCMLA operations
  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 
b1c8c1c569f31a6cb1bfdc16394047f02d6cddf4..3baddab5905a8e45bbf55f53e876a671d6d58c5e 
100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -10837,3 +10837,38 @@
      }
    DONE;
  })
+
+;; Originally expanded by 'predicated_doloop_end'.
+(define_insn "*predicated_doloop_end_internal"
+  [(set (pc)
+    (if_then_else
+       (ge (plus:SI (reg:SI LR_REGNUM)
+            (match_operand:SI 0 "const_int_operand" ""))
+        (const_int 0))
+     (label_ref (match_operand 1 "" ""))
+     (pc)))
+   (set (reg:SI LR_REGNUM)
+    (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (clobber (reg:CC CC_REGNUM))]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  {
+    if (get_attr_length (insn) == 4)
+      return "letp\t%|lr, %l1";
+    else
+      return "subs\t%|lr, #%0;bgt\t%l1";
+  }
+  [(set (attr "length")
+    (if_then_else
+       (ltu (minus (pc) (match_dup 1)) (const_int 1024))
+        (const_int 4)
+        (const_int 6)))
+   (set_attr "type" "branch")])
+
+(define_insn "dlstp<mode1>_insn"
+  [
+    (set (reg:SI LR_REGNUM)
+     (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
+      DLSTP))
+  ]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "dlstp.<mode1>\t%|lr, %0")
\ No newline at end of file
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index 
b2309a5216550104f29a71f33506d421f79f5e2c..a21e2909872a618a8b8b2b69efb9a412fcc96a85 
100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1610,10 +1610,11 @@
  ;; knows what to generate.
  (define_expand "doloop_end"
    [(use (match_operand 0 "" ""))      ; loop pseudo
-   (use (match_operand 1 "" ""))]     ; label
+   (use (match_operand 1 "" ""))      ; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
    "TARGET_32BIT"
    "
- {
+{
     /* Currently SMS relies on the do-loop pattern to recognize loops
        where (1) the control part consists of all insns defining and/or
        using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1624,68 @@

        Also used to implement the low over head loops feature, which is 
part of
        the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be 
LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT 
(-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+    FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands[1]))
+    {
+      s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+      /* If we have a compatibe MVE target, try and analyse the loop
+         contents to determine if we can use predicated dlstp/letp
+         looping.  */
+      if (TARGET_HAVE_MVE && TARGET_THUMB2
+          && (decrement_num = arm_attempt_dlstp_transform (operands[1],
+                                   operands[2]))
+          && (INTVAL (decrement_num) != 1))
+        {
+          insn = emit_insn
+              (gen_thumb2_addsi3_compare0
+              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+          cmp = XVECEXP (PATTERN (insn), 0, 0);
+          cc_reg = SET_DEST (cmp);
+          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+          loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+          emit_jump_insn (gen_rtx_SET (pc_rtx,
+                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+                                 loc_ref, pc_rtx)));
+          DONE;
+        }
+
+      /* Otherwise, try standard decrement-by-one dls/le looping.  */
+      if (TARGET_THUMB2)
+        insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+                              GEN_INT (-1)));
+      else
+        insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+      cmp = XVECEXP (PATTERN (insn), 0, 0);
+      cc_reg = SET_DEST (cmp);
+      bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+      emit_jump_insn (gen_rtx_SET (pc_rtx,
+                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+                                 loc_ref, pc_rtx)));
+      DONE;
+    }
+      else
+    FAIL;
+    }
+  else
+    FAIL;
+}")

  (define_insn "*clear_apsr"
    [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1783,37 @@
    {
      if (REGNO (operands[0]) == LR_REGNUM)
        {
-    emit_insn (gen_dls_insn (operands[0]));
+    /* Pick out the number by which we are decrementing the loop counter
+       in every iteration.  If it's > 1, then use dlstp.  */
+    int const_int_dec_num
+         = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+                  1),
+                1)));
+    switch (const_int_dec_num)
+      {
+        case 16:
+          emit_insn (gen_dlstp8_insn (operands[0]));
+          break;
+
+        case 8:
+          emit_insn (gen_dlstp16_insn (operands[0]));
+          break;
+
+        case 4:
+          emit_insn (gen_dlstp32_insn (operands[0]));
+          break;
+
+        case 2:
+          emit_insn (gen_dlstp64_insn (operands[0]));
+          break;
+
+        case 1:
+          emit_insn (gen_dls_insn (operands[0]));
+          break;
+
+        default:
+          gcc_unreachable ();
+      }
      DONE;
        }
      else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 
7748e78437943ca0cd0d8909330ea8d3b4948ae3..744e7ab5731b5010ce280a046c5a94f6f590c350 
100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -579,6 +579,10 @@
    VCTP16Q
    VCTP32Q
    VCTP64Q
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
    VPNOT
    VCREATEQ_F
    VCVTQ_N_TO_F_S
diff --git a/gcc/config/ia64/ia64.md b/gcc/config/ia64/ia64.md
index 
5d1d47da55b2f0a2ae001c3277467ae19513fbe1..5b24bd76ace5789ccc7e860c5872413bb661cf88 
100644
--- a/gcc/config/ia64/ia64.md
+++ b/gcc/config/ia64/ia64.md
@@ -3956,7 +3956,8 @@

  (define_expand "doloop_end"
    [(use (match_operand 0 "" ""))    ; loop pseudo
-   (use (match_operand 1 "" ""))]    ; label
+   (use (match_operand 1 "" ""))    ; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
    ""
  {
    if (GET_MODE (operands[0]) != DImode)
diff --git a/gcc/config/pru/pru.md b/gcc/config/pru/pru.md
index 
bdc5ad79ba06008e9fbf2aade395b4195e1b2a76..990cdfe0974f1f61082b1fc6508c405cde32a254 
100644
--- a/gcc/config/pru/pru.md
+++ b/gcc/config/pru/pru.md
@@ -1636,7 +1636,8 @@

  (define_expand "doloop_end"
    [(use (match_operand 0 "nonimmediate_operand"))
-   (use (label_ref (match_operand 1 "")))]
+   (use (label_ref (match_operand 1 "")))
+   (use (match_operand 2 "" ""))]     ; decrement constant
    "TARGET_OPT_LOOP"
  {
    if (GET_CODE (operands[0]) == REG && GET_MODE (operands[0]) == QImode)
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 
ad5a4cf2ef83329fb72963c330e6cc42deb13dcc..fdaeb6a02ca06ffcfe87c41852fefa554707e3f0 
100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -13422,7 +13422,8 @@

  (define_expand "doloop_end"
    [(use (match_operand 0))    ; loop pseudo
-   (use (match_operand 1))]    ; label
+   (use (match_operand 1))    ; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
    ""
  {
    if (GET_MODE (operands[0]) != Pmode)
diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index 
962927c31126b3dfade6a9314bb54e24f1df5ead..bad317fdba73c7f8150a4cd779444ced45d6ee5d 
100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -9780,7 +9780,8 @@

  (define_expand "doloop_end"
    [(use (match_operand 0 "" ""))        ; loop pseudo
-   (use (match_operand 1 "" ""))]       ; label
+   (use (match_operand 1 "" ""))        ; label
+   (use (match_operand 2 "" ""))]       ; decrement constant
    ""
  {
    if (GET_MODE (operands[0]) == SImode)
diff --git a/gcc/config/v850/v850.md b/gcc/config/v850/v850.md
index 
6ca31e3f43f4e67777e3c1bdb93a474a21e883a7..8d0812abac7e84a29d77ada60e172d89d3bbe387 
100644
--- a/gcc/config/v850/v850.md
+++ b/gcc/config/v850/v850.md
@@ -1434,7 +1434,8 @@

  (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))        ; loop pseudo
-  (use (match_operand 1 "" ""))]       ; label
+  (use (match_operand 1 "" ""))        ; label
+  (use (match_operand 2 "" ""))]       ; decrement constant
    "TARGET_V850E3V5_UP && TARGET_LOOP"
    {
      rtx loop_cnt = operands[0];
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 
110f8dfa0a9fc276e584ce4db2dfe2257409615e..7dde0b0096dfd9e996a394a2000d42e3f637c323 
100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11774,6 +11774,14 @@ loops, and will help ivopts to make some decisions.
  The default version of this hook returns false.
  @end deftypefn

+@deftypefn {Target Hook} rtx TARGET_ALLOW_ELEMENTWISE_DOLOOP (rtx 
@var{count}, rtx @var{label}, rtx @var{doloop})
+This target hook allows the target to support loop-doloop optimisations
+where the value that gets put into the loop counter register is not a
+pre-calculation of the number of iteration of the loop.  For instance,
+the value used can be the number of elements that the loop will process.
+The default version of this hook returns the same rtx it was given.
+@end deftypefn
+
  @deftypevr {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P
  Return true if the target supports hardware count register for decrement
  and branch.
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 
501ddf147e40b68b757269f703f58d1091b9e06f..cc1a81f3957d51caebec98992a4119d86878cdb8 
100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7730,6 +7730,8 @@ to by @var{ce_info}.

  @hook TARGET_PREDICT_DOLOOP_P

+@hook TARGET_ALLOW_ELEMENTWISE_DOLOOP
+
  @hook TARGET_HAVE_COUNT_REG_DECR_P

  @hook TARGET_DOLOOP_COST_FOR_GENERIC
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 
30b45c8071a709d9f0d03037c3876b532727e6f0..674eb429f1f5e915d67513ad6a4aa170444fc00d 
100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
       forms:

       1)  (parallel [(set (pc) (if_then_else (condition)
-                              (label_ref (label))
-                            (pc)))
-                 (set (reg) (plus (reg) (const_int -1)))
-                 (additional clobbers and uses)])
+                        (label_ref (label))
+                        (pc)))
+             (set (reg) (plus (reg) (const_int -n)))
+             (additional clobbers and uses)])

       The branch must be the first entry of the parallel (also required
       by jump.cc), and the second entry of the parallel must be a set of
       the loop counter register.  Some targets (IA-64) wrap the set of
       the loop counter in an if_then_else too.

-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                             (label_ref (label))
-                     (pc))).
+     2)  (set (reg) (plus (reg) (const_int -n))
+     (set (pc) (if_then_else (reg != 0)
+                 (label_ref (label))
+                 (pc))).

       Some targets (ARM) do the comparison before the branch, as in the
       following form:

-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+           (set (reg) (plus (reg) (const_int -n)))])
+    (set (pc) (if_then_else (cc == NE)
+                (label_ref (label))
+                (pc))) */

    pattern = PATTERN (doloop_pat);

@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
            || GET_CODE (cmp_arg1) != PLUS)
          return 0;
        reg_orig = XEXP (cmp_arg1, 0);
-      if (XEXP (cmp_arg1, 1) != GEN_INT (-1)
+      if (!CONST_INT_P (XEXP (cmp_arg1, 1))
            || !REG_P (reg_orig))
          return 0;
        cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
      {
        /* We expect the condition to be of the form (reg != 0)  */
        cond = XEXP (SET_SRC (cmp), 0);
-      if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+      if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+          || XEXP (cond, 1) != const0_rtx)
          return 0;
      }
      }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
    if (! REG_P (reg))
      return 0;

-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
       On IA-64, this decrement is wrapped in an if_then_else.  */
    inc_src = SET_SRC (inc);
    if (GET_CODE (inc_src) == IF_THEN_ELSE)
      inc_src = XEXP (inc_src, 1);
    if (GET_CODE (inc_src) != PLUS
        || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
      return 0;

    /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
        || (GET_CODE (XEXP (condition, 0)) == PLUS
        && XEXP (XEXP (condition, 0), 0) == reg))
     {
-     if (GET_CODE (pattern) != PARALLEL)
       /*  For the second form we expect:

-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+     (set (reg) (plus (reg) (const_int -n))
+     (set (pc) (if_then_else (reg != 0)
+                 (label_ref (label))
+                 (pc))).

-         is equivalent to the following:
+     If n == 1, that is equivalent to the following:

-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+     (parallel [(set (pc) (if_then_else (reg != 1)
+                        (label_ref (label))
+                        (pc)))
+             (set (reg) (plus (reg) (const_int -1)))
+             (additional clobbers and uses)])

-        For the third form we expect:
+    For the third form we expect:

-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc)))
+    (parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+           (set (reg) (plus (reg) (const_int -n)))])
+    (set (pc) (if_then_else (cc == NE)
+                (label_ref (label))
+                (pc)))

-        which is equivalent to the following:
+    Which also for n == 1 is equivalent to the following:

-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+    (parallel [(set (cc) (compare (reg,  1))
+           (set (reg) (plus (reg) (const_int -1)))
+           (set (pc) (if_then_else (NE == cc)
+                       (label_ref (label))
+                       (pc))))])

-        So we return the second form instead for the two cases.
+    So we return the second form instead for the two cases.

+    For the "elementwise" form where the decrement number isn't -1,
+    the final value may be exceeded, so use GE instead of NE.
       */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+    if (INTVAL (XEXP (inc_src, 1)) != -1)
+      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+    else
+      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
+       }

      return condition;
     }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
        return false;
      }

-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p 
(loop))
-      > max_cost)
-    {
-      if (dump_file)
-    fprintf (dump_file,
-         "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
    if (desc->const_iter)
      iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
                     UNSIGNED);
@@ -720,7 +717,25 @@ doloop_optimize (class loop *loop)
    count = copy_rtx (desc->niter_expr);
    start_label = block_label (desc->in_edge->dest);
    doloop_reg = gen_reg_rtx (mode);
-  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label,
+                         count);
+
+  /* Not all targets need to pre-calculate the number of the iterations of
+     the loop, they instead work by storing the number of elements in the
+     counter_reg and decrementing that.  Call the appropriate target 
hook to
+     change the value of count.  */
+  count = targetm.allow_elementwise_doloop (count, start_label, 
doloop_seq);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+    fprintf (dump_file,
+         "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }

    word_mode_size = GET_MODE_PRECISION (word_mode);
    word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
@@ -737,7 +752,7 @@ doloop_optimize (class loop *loop)
        else
      count = lowpart_subreg (word_mode, count, mode);
        PUT_MODE (doloop_reg, word_mode);
-      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label, count);
      }
    if (! doloop_seq)
      {
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index 
de8c0092f989efa9e7d1f8a2a12fcf4ffc4a6013..b77b79724262304b070c511de301d898bc3ca444 
100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -48,7 +48,7 @@ DEF_TARGET_INSN (casesi, (rtx x0, rtx x1, rtx x2, rtx 
x3, rtx x4))
  DEF_TARGET_INSN (check_stack, (rtx x0))
  DEF_TARGET_INSN (clear_cache, (rtx x0, rtx x1))
  DEF_TARGET_INSN (doloop_begin, (rtx x0, rtx x1))
-DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1))
+DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1, rtx x2))
  DEF_TARGET_INSN (eh_return, (rtx x0))
  DEF_TARGET_INSN (epilogue, (void))
  DEF_TARGET_INSN (exception_receiver, (void))
diff --git a/gcc/target.def b/gcc/target.def
index 
a3d3b04a1657bfa006b8e9ccdc65ead2bcc7fde7..1f1c07f406fbc75b44f1afe7a1ab313005be82b0 
100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4392,6 +4392,16 @@ The default version of this hook returns false.",
   bool, (class loop *loop),
   default_predict_doloop_p)

+DEFHOOK
+(allow_elementwise_doloop,
+ "This target hook allows the target to support loop-doloop 
optimisations\n\
+where the value that gets put into the loop counter register is not a\n\
+pre-calculation of the number of iteration of the loop.  For instance,\n\
+the value used can be the number of elements that the loop will process.\n\
+The default version of this hook returns the same rtx it was given.",
+ rtx, (rtx count, rtx label, rtx doloop),
+ default_allow_elementwise_doloop)
+
  DEFHOOKPOD
  (have_count_reg_decr_p,
   "Return true if the target supports hardware count register for 
decrement\n\
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 
ecce55ebe797cedc940620e8d89816973a045d49..6f1f0747f300b774493316b2aa6a7863a69a085a 
100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -88,6 +88,7 @@ extern bool default_fixed_point_supported_p (void);
  extern bool default_has_ifunc_p (void);

  extern bool default_predict_doloop_p (class loop *);
+extern rtx default_allow_elementwise_doloop (rtx, rtx, rtx);
  extern machine_mode default_preferred_doloop_mode (machine_mode);
  extern const char * default_invalid_within_doloop (const rtx_insn *);

diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index 
d17d393baedc16574ddecb86e868799a65736660..8a3b9e7571092df3b9acc90d1bc5b508110fe912 
100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -661,6 +661,12 @@ default_predict_doloop_p (class loop *loop 
ATTRIBUTE_UNUSED)
    return false;
  }

+rtx
+default_allow_elementwise_doloop (rtx count, rtx, rtx)
+{
+  return count;
+}
+
  /* By default, just use the input MODE itself.  */

  machine_mode
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c 
b/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c
new file mode 100644
index 
0000000000000000000000000000000000000000..a61f02ed3a23e077fc6bd354f89832223df4acdc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { 
"-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard 
-mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t 
*c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c 
b/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c
new file mode 100644
index 
0000000000000000000000000000000000000000..31a7264ae265a02d617ff31928cece9e37fbcc3f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { 
"-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard 
-mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t 
*c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c 
b/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c
new file mode 100644
index 
0000000000000000000000000000000000000000..a09c01884da50e68de8ae5bab994e3c7d07963b6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { 
"-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard 
-mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (8, 
0), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (8, 0), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c 
b/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c
new file mode 100644
index 
0000000000000000000000000000000000000000..49fbd4c16a20af929e7cccd62f6a73d033afb7f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { 
"-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard 
-mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, 
int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/lob.h 
b/gcc/testsuite/gcc.target/arm/lob.h
index 
feaae7cc89959b3147368980120700bbc3e85ecb..9a34a682160b9c0b0e9393438e7be336ed7d29c7 
100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,106 @@
  #include <string.h>
-
+#include <stdint.h>
  /* Common code for lob tests.  */

  #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )

-#define N 10000
+#define N 100
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+    {
+      if (c[i] != (a[i] + b[i])) abort ();
+    }
+      else
+    {
+      if (c[i] != 0) abort ();
+    }
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+    {
+      if (c[i] != (a[i] + b[i])) abort ();
+    }
+      else
+    {
+      if (c[i] != 0) abort ();
+    }
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+    {
+      if (c[i] != (a[i] + b[i])) abort ();
+    }
+      else
+    {
+      if (c[i] != 0) abort ();
+    }
+    }
+}

  static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
  {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+    {
+      if (c[i] != a[i]) abort ();
+    }
+      else
+    {
+      if (c[i] != 0) abort ();
+    }
+    }
  }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c 
b/gcc/testsuite/gcc.target/arm/lob1.c
index 
ba5c82cd55c582c96a18ad417a3041e43d843613..e7fa3f155f926e71b00c9f294e7878e8f13766ff 
100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
      } while (i < N);
  }

-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-    abort ();
-    }
-}
-
  int
  main (void)
  {
-  reset_data (a, b, c);
+  reset_data32 (a, b, c, N);
    loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus32 (a, b, c, N);
+  reset_data32 (a, b, c, N);
    loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus32 (a, b, c, N);
+  reset_data32 (a, b, c, N);
    loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus32 (a, b, c, N);

    return 0;
  }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c 
b/gcc/testsuite/gcc.target/arm/lob6.c
index 
17b6124295e8ae9e1cb57e41fa43a954b3390eec..b7f49cf70c98d78af0d7cc1c291bf83ef699863b 
100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
  int
  main (void)
  {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data32 (a1, b1, c1, N);
+  reset_data32 (a2, b2, c2, N);
    loop1 (a1, b1, c1);
    ref1 (a2, b2, c2);
    check ();

-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data32 (a1, b1, c1, N);
+  reset_data32 (a2, b2, c2, N);
    loop2 (a1, b1, c1);
    ref2 (a2, b2, c2);
    check ();

[-- Attachment #2: rb16111.patch --]
[-- Type: text/x-patch, Size: 48724 bytes --]

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -7366,7 +7366,8 @@
 ;; knows what to generate.
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))      ; loop pseudo
-   (use (match_operand 1 "" ""))]     ; label
+   (use (match_operand 1 "" ""))      ; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
   "optimize > 0 && flag_modulo_sched"
 {
   rtx s0;
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
 extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern rtx arm_attempt_dlstp_transform (rtx, rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index ee8f1babf8a1319e77e0db0fa55851c038048804..99e144d52c26597c64b982b3d4ae9a62a114cf18 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -470,6 +470,9 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_SCHED_REORDER
 #define TARGET_SCHED_REORDER arm_sched_reorder
 
+#undef TARGET_ALLOW_ELEMENTWISE_DOLOOP
+#define TARGET_ALLOW_ELEMENTWISE_DOLOOP arm_allow_elementwise_doloop
+
 #undef TARGET_REGISTER_MOVE_COST
 #define TARGET_REGISTER_MOVE_COST arm_register_move_cost
 
@@ -34138,8 +34141,339 @@ arm_target_insn_ok_for_lob (rtx insn)
 
   return single_succ_p (bb)
     && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC)
+    {
+      switch (XINT (XEXP (x, 1), 1))
+	{
+	  case VCTP8Q:
+	    return 16;
+	  case VCTP16Q:
+	    return 8;
+	  case VCTP32Q:
+	    return 4;
+	  case VCTP64Q:
+	    return 2;
+	  default:
+	    break;
+	}
+    }
+  return 0;
+}
+
+/* Check if an insn requires the use of the VPR_REG, if it does, return the
+   sub-rtx of the matched operand.  If there are more than one operand (e.g. an
+   input operand and an output operand) that use VPR_REG, return the first
+   occurance, which is usually the output operand.  */
+
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn)
+{
+  bool requires_vpr;
+
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int use = 0; use < n_operands; use++)
+    {
+      requires_vpr = true;
+      /* Iterate through alternatives of operand "use" in recog_op_alt and
+       * identify if the operand is required to be the VPR.  */
+      for (int alt1 = 0; alt1 < recog_data.n_alternatives; alt1++)
+	{
+	  const operand_alternative *op_alt1
+	      = &recog_op_alt[alt1 * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	   * VPR_REG reg_class.  */
+	  if (alternative_class (op_alt1, use) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[use];
+    }
+  return NULL_RTX;
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   exactly one unique vctp instruction, return its rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_unique_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+  rtx_insn *vctp_op = NULL;
+
+  /* Now scan through all the instruction patterns and
+     pick out any MVE instructions.  */
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (INSN_P (insn))
+	{
+	  /* First check if this is a vctp instruction.  There needs to be
+	     exactly one vctp instruction within the loop.  */
+	  if (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	    {
+	      /* If we already found one vctp instruction, then the
+		 loop is not consistent internally.  */
+	      if (vctp_op)
+		return NULL;
+
+	      vctp_op = insn;
+	    }
+	}
+    }
+  return vctp_op;
+}
+
+rtx
+arm_attempt_dlstp_transform (rtx label, rtx count)
+{
+  int decrementnum;
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+  rtx initial_compare;
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp
+     when the iteration counter gets deprecated by the number of MVE
+     lanes.  This can be exxtracted from the `count`, which is the expression
+     used to calculate the number of iterations that the loop would execute
+     for a standard dls/le loop.  Since we only support cases where this is a
+     power of 2, we can assume that this expression arrives here as:
+       (lshiftrt: (A) (const_int y))
+     Then we can extract the decrementnum from y.  */
+  if (GET_CODE (count) == LSHIFTRT && ARITHMETIC_P (XEXP (count, 0))
+      && (decrementnum = (1 << (INTVAL (XEXP (count, 1)))))
+      /* There is one final condition that needs to be met for the loop to be
+	 transformable: dlstp/letp will continue looping until there are
+	 elements still to process.  This can only work if the looping ends
+	 when the element counter reaches zero and not some other value
+	 (e.g. n > 0 works, not n > 1), or we can incorrectly end up running
+	 one additional iteration.  To by-pass any hoisting that the compiler
+	 may have done with the `A` in `count` above, we can instead look up
+	 to the bb before the loop preheader: this should end with a cmp+jump
+	 pair, where the cmp needs to be with (const_int 0).  */
+      && loop_preheader_edge (body->loop_father)->src->prev_bb
+      && BB_END (loop_preheader_edge (body->loop_father)->src->prev_bb)
+      && PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+				->src->prev_bb))
+      && INSN_P (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+					->src->prev_bb)))
+      && (initial_compare
+	  = PATTERN (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+					    ->src->prev_bb))))
+      && GET_CODE (initial_compare) == SET
+      && cc_register (XEXP (initial_compare, 0), VOIDmode)
+      && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+      && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+      && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0)
+    {
+      /* Find the vctp predicate generation inside the loop body BB.  */
+      rtx_insn *vctp_insn = arm_mve_get_loop_unique_vctp (body);
+
+      /* If we have successfully found one exactly vctp predicate-generating
+	 instruction within the loop and the number by which we deprecate the
+	 loop counter in each iteration matches the number of lanes of the
+	 vctp instruction, we can attempt to turn this into a dlstp/letp loop.
+	 */
+      if (vctp_insn
+	  && decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	{
+	  rtx_insn *insn = 0;
+	  rtx_insn *cur_insn = 0;
+	  rtx_insn *seq;
+	  rtx vctp_vpr_generated = NULL_RTX;
+	  rtx insn_vpr_reg_operand = NULL_RTX;
+	  bool transform_worked = true;
+	  int new_icode;
+
+	  /* Scan through the insns in the loop bb and emit the transformed bb
+	     insns to a sequence.  */
+	  start_sequence ();
+	  FOR_BB_INSNS (body, insn)
+	    {
+	      if (INSN_P (insn))
+		{
+		  /* When we find the vctp instruction: This may be followed by
+		     a sign-extend insn to SImode.  If it is, then save the
+		     sign-extended REG into vctp_vpr_generated.  If there is no
+		     sign-extend, then store the raw output of the vctp.
+		     For any VPT-predicated instructions we need to ensure that
+		     the VPR they use is the same as the one given here and
+		     they often consume the output of a subreg of the SImode
+		     sign-extended VPR-reg.  As a result, comparing against the
+		     output of the sign-extend is more likely to succeed.
+		     This code also guarantees to us that the vctp comes before
+		     any instructions that use the VPR within the loop, for the
+		     dlstp/letp transform to succeed.  */
+		  if (insn == vctp_insn)
+		    {
+		      if (GET_CODE (XEXP (PATTERN (NEXT_INSN (insn)), 1))
+			      == SIGN_EXTEND
+			  && GET_CODE (XEXP (
+				 PATTERN (NEXT_INSN (NEXT_INSN (insn))), 1))
+				 == SUBREG)
+			vctp_vpr_generated
+			    = XEXP (PATTERN (NEXT_INSN (NEXT_INSN (insn))), 0);
+		      else
+			vctp_vpr_generated = XEXP (PATTERN (insn), 0);
+		      /* Also emit a USE of the source register of the vctp.
+			 This holds the number of elements being processed
+			 by the loop.  This later gets stored into `count`.
+			 */
+		      emit_use (XVECEXP (XEXP (PATTERN (insn), 1), 0, 0));
+		      continue;
+		    }
+		  /* If the insn pattern requires the use of the VPR, then it
+		     a VPT-predicated instruction, so it will need to be
+		     transformed into the non-predicated version of the
+		     instruction.  */
+		  else if ((insn_vpr_reg_operand
+			    = arm_get_required_vpr_reg (insn))
+			   != NULL_RTX)
+		    {
+		      /* If the VPR value is different to the one generated by
+			 the vctp, then fail the conversion.  */
+		      if (!rtx_equal_p (vctp_vpr_generated,
+					insn_vpr_reg_operand))
+			{
+			  transform_worked = false;
+			  break;
+			}
+		      /* Also ensure that it's a valid recog-ed instruction with
+			 the mve_unpredicated_insn atrribute.  */
+		      else if (recog_memoized (insn) >= 0
+			       && (new_icode
+				   = get_attr_mve_unpredicated_insn (insn)))
+			{
+			  extract_insn (insn);
+			  rtx arr[8];
+			  int j = 0;
+
+			  /* When transforming a VPT-predicated instruction
+			     into its unpredicated equivalent we need to drop
+			     the VPR operand and we may need to also drop a
+			     merge "vuninit" input operand, depending on the
+			     instruction pattern.  Here ensure that we have at
+			     most a two-operand difference between the two
+			     instrunctions.  */
+			  int n_operands_diff
+			      = recog_data.n_operands
+				- insn_data[new_icode].n_operands;
+			  gcc_assert (n_operands_diff > 0
+				      && n_operands_diff <= 2);
+
+			  /* Then, loop through the operands of the predicated
+			     instruction, and retain the ones that map to the
+			     unpredicated instruction.  */
+			  for (int i = 0; i < recog_data.n_operands; i++)
+			    {
+			      /* Ignore the VPR and, if needed, the vuninit
+				 operand.  */
+			      if (insn_vpr_reg_operand == recog_data.operand[i]
+				  || (n_operands_diff == 2
+				      && !strcmp (recog_data.constraints[i],
+						  "0")))
+				continue;
+			      else
+				{
+				  arr[j] = recog_data.operand[i];
+				  j++;
+				}
+			    }
+
+			  /* Finally, emit the upredicated instruction.  */
+			  switch (j)
+			    {
+			      case 2:
+				emit_insn (GEN_FCN (new_icode) (arr[0],
+								arr[1]));
+				break;
+			      case 3:
+				emit_insn (GEN_FCN (new_icode) (arr[0], arr[1],
+								arr[2]));
+				break;
+			      default:
+				gcc_unreachable ();
+			    }
+			}
+		      /* If we can't identify the INSN as either being either
+			 for deletion or to re-map, then we don't know how to
+			 handle it, so fail the whole conversion.  */
+		      else
+			{
+			  transform_worked = false;
+			  break;
+			}
+		    }
+		  /* Instructions that dont's require the VPR can be carried
+		     over as-is.  */
+		  else
+		    emit_insn (PATTERN (insn));
+		}
+	    }
+	  seq = get_insns ();
+	  end_sequence ();
+
+	  if (transform_worked)
+	    {
+	      /* Re-write the entire BB contents with the transformed
+		 sequence.  */
+	      FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+		if (INSN_P (insn))
+		  delete_insn (insn);
+	      for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+		emit_insn_after (PATTERN (insn), BB_END (body));
+	      emit_jump_insn_after (PATTERN (insn), BB_END (body));
+	      return GEN_INT (decrementnum);
+	    }
+	}
+    }
+  /* Bail out: we can't use dlstp/letp, so return 1 to allow loop-doloop to try
+     the standard dls/le pair.  */
+  return GEN_INT (1);
+}
+
+/* Target hook to the number of elements to be processed by a dlstp/letp loop
+   into `count` to intialise the counter register.  The number of elements was
+   previously extracted from the vctp insn and placed into a USE rtx.
+   We only check that the doloop_end pattern successfully decrements by a
+   number other than -1 for a valid dlstp/letp loop.  No other checking is
+   needed as that was done previously.  */
+
+rtx
+arm_allow_elementwise_doloop (rtx count, rtx label, rtx doloop)
+{
+  if (doloop
+      && INTVAL (XEXP (SET_SRC (XVECEXP (PATTERN (doloop), 0, 1)), 1)) != -1
+      && ARITHMETIC_P (XEXP (count, 0)))
+    {
+      basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+      rtx_insn* insn;
+      FOR_BB_INSNS (body, insn)
+	{
+	  if (INSN_P (insn) && GET_CODE (PATTERN (insn)) == USE)
+	    {
+	      rtx num_elem_reg = copy_rtx (XEXP (PATTERN (insn), 0));
+	      delete_insn (insn);
+	      return num_elem_reg;
+	    }
+	}
+    }
+  else
+    return count;
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 29062cd6fb3c61c22417ef8fc25abd7819d2c034..aa2fdac22f3c7884c18ebf197b26a2d54da44c93 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -1464,7 +1464,9 @@
 
 (define_int_attr mode1 [(VCTP8Q "8") (VCTP16Q "16") (VCTP32Q "32")
 			(VCTP64Q "64") (VCTP8Q_M "8") (VCTP16Q_M "16")
-			(VCTP32Q_M "32") (VCTP64Q_M "64")])
+			(VCTP32Q_M "32") (VCTP64Q_M "64")
+			(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
 
 ;; Both kinds of return insn.
 (define_code_iterator RETURNS [return simple_return])
@@ -1773,6 +1775,8 @@
 (define_int_iterator UQRSHLLQ [UQRSHLL_64 UQRSHLL_48])
 (define_int_iterator SQRSHRLQ [SQRSHRL_64 SQRSHRL_48])
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index b1c8c1c569f31a6cb1bfdc16394047f02d6cddf4..3baddab5905a8e45bbf55f53e876a671d6d58c5e 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -10837,3 +10837,38 @@
     }
   DONE;
 })
+
+;; Originally expanded by 'predicated_doloop_end'.
+(define_insn "*predicated_doloop_end_internal"
+  [(set (pc)
+	(if_then_else
+	   (ge (plus:SI (reg:SI LR_REGNUM)
+			(match_operand:SI 0 "const_int_operand" ""))
+		(const_int 0))
+	 (label_ref (match_operand 1 "" ""))
+	 (pc)))
+   (set (reg:SI LR_REGNUM)
+	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (clobber (reg:CC CC_REGNUM))]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  {
+    if (get_attr_length (insn) == 4)
+      return "letp\t%|lr, %l1";
+    else
+      return "subs\t%|lr, #%0;bgt\t%l1";
+  }
+  [(set (attr "length")
+	(if_then_else
+	   (ltu (minus (pc) (match_dup 1)) (const_int 1024))
+	    (const_int 4)
+	    (const_int 6)))
+   (set_attr "type" "branch")])
+
+(define_insn "dlstp<mode1>_insn"
+  [
+    (set (reg:SI LR_REGNUM)
+	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
+	  DLSTP))
+  ]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "dlstp.<mode1>\t%|lr, %0")
\ No newline at end of file
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index b2309a5216550104f29a71f33506d421f79f5e2c..a21e2909872a618a8b8b2b69efb9a412fcc96a85 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1610,10 +1610,11 @@
 ;; knows what to generate.
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))      ; loop pseudo
-   (use (match_operand 1 "" ""))]     ; label
+   (use (match_operand 1 "" ""))      ; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1624,68 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands[1]))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatibe MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE && TARGET_THUMB2
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1],
+							       operands[2]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+
+	  /* Otherwise, try standard decrement-by-one dls/le looping.  */
+	  if (TARGET_THUMB2)
+	    insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							  GEN_INT (-1)));
+	  else
+	    insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	  cmp = XVECEXP (PATTERN (insn), 0, 0);
+	  cc_reg = SET_DEST (cmp);
+	  bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	  loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	  emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	  DONE;
+	}
+      else
+	FAIL;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1783,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 7748e78437943ca0cd0d8909330ea8d3b4948ae3..744e7ab5731b5010ce280a046c5a94f6f590c350 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -579,6 +579,10 @@
   VCTP16Q
   VCTP32Q
   VCTP64Q
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/config/ia64/ia64.md b/gcc/config/ia64/ia64.md
index 5d1d47da55b2f0a2ae001c3277467ae19513fbe1..5b24bd76ace5789ccc7e860c5872413bb661cf88 100644
--- a/gcc/config/ia64/ia64.md
+++ b/gcc/config/ia64/ia64.md
@@ -3956,7 +3956,8 @@
 
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))	; loop pseudo
-   (use (match_operand 1 "" ""))]	; label
+   (use (match_operand 1 "" ""))	; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
   ""
 {
   if (GET_MODE (operands[0]) != DImode)
diff --git a/gcc/config/pru/pru.md b/gcc/config/pru/pru.md
index bdc5ad79ba06008e9fbf2aade395b4195e1b2a76..990cdfe0974f1f61082b1fc6508c405cde32a254 100644
--- a/gcc/config/pru/pru.md
+++ b/gcc/config/pru/pru.md
@@ -1636,7 +1636,8 @@
 
 (define_expand "doloop_end"
   [(use (match_operand 0 "nonimmediate_operand"))
-   (use (label_ref (match_operand 1 "")))]
+   (use (label_ref (match_operand 1 "")))
+   (use (match_operand 2 "" ""))]     ; decrement constant
   "TARGET_OPT_LOOP"
 {
   if (GET_CODE (operands[0]) == REG && GET_MODE (operands[0]) == QImode)
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index ad5a4cf2ef83329fb72963c330e6cc42deb13dcc..fdaeb6a02ca06ffcfe87c41852fefa554707e3f0 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -13422,7 +13422,8 @@
 
 (define_expand "doloop_end"
   [(use (match_operand 0))	; loop pseudo
-   (use (match_operand 1))]	; label
+   (use (match_operand 1))	; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
   ""
 {
   if (GET_MODE (operands[0]) != Pmode)
diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index 962927c31126b3dfade6a9314bb54e24f1df5ead..bad317fdba73c7f8150a4cd779444ced45d6ee5d 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -9780,7 +9780,8 @@
 
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))        ; loop pseudo
-   (use (match_operand 1 "" ""))]       ; label
+   (use (match_operand 1 "" ""))        ; label
+   (use (match_operand 2 "" ""))]       ; decrement constant
   ""
 {
   if (GET_MODE (operands[0]) == SImode)
diff --git a/gcc/config/v850/v850.md b/gcc/config/v850/v850.md
index 6ca31e3f43f4e67777e3c1bdb93a474a21e883a7..8d0812abac7e84a29d77ada60e172d89d3bbe387 100644
--- a/gcc/config/v850/v850.md
+++ b/gcc/config/v850/v850.md
@@ -1434,7 +1434,8 @@
 
 (define_expand "doloop_end"
  [(use (match_operand 0 "" ""))        ; loop pseudo
-  (use (match_operand 1 "" ""))]       ; label
+  (use (match_operand 1 "" ""))        ; label
+  (use (match_operand 2 "" ""))]       ; decrement constant
   "TARGET_V850E3V5_UP && TARGET_LOOP"
   {
     rtx loop_cnt = operands[0];
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 110f8dfa0a9fc276e584ce4db2dfe2257409615e..7dde0b0096dfd9e996a394a2000d42e3f637c323 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11774,6 +11774,14 @@ loops, and will help ivopts to make some decisions.
 The default version of this hook returns false.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_ALLOW_ELEMENTWISE_DOLOOP (rtx @var{count}, rtx @var{label}, rtx @var{doloop})
+This target hook allows the target to support loop-doloop optimisations
+where the value that gets put into the loop counter register is not a
+pre-calculation of the number of iteration of the loop.  For instance,
+the value used can be the number of elements that the loop will process.
+The default version of this hook returns the same rtx it was given.
+@end deftypefn
+
 @deftypevr {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P
 Return true if the target supports hardware count register for decrement
 and branch.
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 501ddf147e40b68b757269f703f58d1091b9e06f..cc1a81f3957d51caebec98992a4119d86878cdb8 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7730,6 +7730,8 @@ to by @var{ce_info}.
 
 @hook TARGET_PREDICT_DOLOOP_P
 
+@hook TARGET_ALLOW_ELEMENTWISE_DOLOOP
+
 @hook TARGET_HAVE_COUNT_REG_DECR_P
 
 @hook TARGET_DOLOOP_COST_FOR_GENERIC
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 30b45c8071a709d9f0d03037c3876b532727e6f0..674eb429f1f5e915d67513ad6a4aa170444fc00d 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -n)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
      the loop counter register.  Some targets (IA-64) wrap the set of
      the loop counter in an if_then_else too.
 
-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+     2)  (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	{
 	  /* We expect the condition to be of the form (reg != 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 If n == 1, that is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which also for n == 1 is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
+	For the "elementwise" form where the decrement number isn't -1,
+	the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
+       }
 
     return condition;
    }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -720,7 +717,25 @@ doloop_optimize (class loop *loop)
   count = copy_rtx (desc->niter_expr);
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
-  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label,
+						 count);
+
+  /* Not all targets need to pre-calculate the number of the iterations of
+     the loop, they instead work by storing the number of elements in the
+     counter_reg and decrementing that.  Call the appropriate target hook to
+     change the value of count.  */
+  count = targetm.allow_elementwise_doloop (count, start_label, doloop_seq);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
@@ -737,7 +752,7 @@ doloop_optimize (class loop *loop)
       else
 	count = lowpart_subreg (word_mode, count, mode);
       PUT_MODE (doloop_reg, word_mode);
-      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label, count);
     }
   if (! doloop_seq)
     {
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index de8c0092f989efa9e7d1f8a2a12fcf4ffc4a6013..b77b79724262304b070c511de301d898bc3ca444 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -48,7 +48,7 @@ DEF_TARGET_INSN (casesi, (rtx x0, rtx x1, rtx x2, rtx x3, rtx x4))
 DEF_TARGET_INSN (check_stack, (rtx x0))
 DEF_TARGET_INSN (clear_cache, (rtx x0, rtx x1))
 DEF_TARGET_INSN (doloop_begin, (rtx x0, rtx x1))
-DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1))
+DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (eh_return, (rtx x0))
 DEF_TARGET_INSN (epilogue, (void))
 DEF_TARGET_INSN (exception_receiver, (void))
diff --git a/gcc/target.def b/gcc/target.def
index a3d3b04a1657bfa006b8e9ccdc65ead2bcc7fde7..1f1c07f406fbc75b44f1afe7a1ab313005be82b0 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4392,6 +4392,16 @@ The default version of this hook returns false.",
  bool, (class loop *loop),
  default_predict_doloop_p)
 
+DEFHOOK
+(allow_elementwise_doloop,
+ "This target hook allows the target to support loop-doloop optimisations\n\
+where the value that gets put into the loop counter register is not a\n\
+pre-calculation of the number of iteration of the loop.  For instance,\n\
+the value used can be the number of elements that the loop will process.\n\
+The default version of this hook returns the same rtx it was given.",
+ rtx, (rtx count, rtx label, rtx doloop),
+ default_allow_elementwise_doloop)
+
 DEFHOOKPOD
 (have_count_reg_decr_p,
  "Return true if the target supports hardware count register for decrement\n\
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index ecce55ebe797cedc940620e8d89816973a045d49..6f1f0747f300b774493316b2aa6a7863a69a085a 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -88,6 +88,7 @@ extern bool default_fixed_point_supported_p (void);
 extern bool default_has_ifunc_p (void);
 
 extern bool default_predict_doloop_p (class loop *);
+extern rtx default_allow_elementwise_doloop (rtx, rtx, rtx);
 extern machine_mode default_preferred_doloop_mode (machine_mode);
 extern const char * default_invalid_within_doloop (const rtx_insn *);
 
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index d17d393baedc16574ddecb86e868799a65736660..8a3b9e7571092df3b9acc90d1bc5b508110fe912 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -661,6 +661,12 @@ default_predict_doloop_p (class loop *loop ATTRIBUTE_UNUSED)
   return false;
 }
 
+rtx
+default_allow_elementwise_doloop (rtx count, rtx, rtx)
+{
+  return count;
+}
+
 /* By default, just use the input MODE itself.  */
 
 machine_mode
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c
new file mode 100644
index 0000000000000000000000000000000000000000..a61f02ed3a23e077fc6bd354f89832223df4acdc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard -mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c
new file mode 100644
index 0000000000000000000000000000000000000000..31a7264ae265a02d617ff31928cece9e37fbcc3f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard -mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c
new file mode 100644
index 0000000000000000000000000000000000000000..a09c01884da50e68de8ae5bab994e3c7d07963b6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard -mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (8, 0), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (8, 0), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c
new file mode 100644
index 0000000000000000000000000000000000000000..49fbd4c16a20af929e7cccd62f6a73d033afb7f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard -mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc89959b3147368980120700bbc3e85ecb..9a34a682160b9c0b0e9393438e7be336ed7d29c7 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,106 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c582c96a18ad417a3041e43d843613..e7fa3f155f926e71b00c9f294e7878e8f13766ff 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data32 (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus32 (a, b, c, N);
+  reset_data32 (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus32 (a, b, c, N);
+  reset_data32 (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus32 (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e8ae9e1cb57e41fa43a954b3390eec..b7f49cf70c98d78af0d7cc1c291bf83ef699863b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data32 (a1, b1, c1, N);
+  reset_data32 (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data32 (a1, b1, c1, N);
+  reset_data32 (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2022-11-11 17:40 [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops Stam Markianos-Wright
@ 2022-11-15 15:51 ` Andre Vieira (lists)
  2022-11-28 12:13   ` Stam Markianos-Wright
  0 siblings, 1 reply; 14+ messages in thread
From: Andre Vieira (lists) @ 2022-11-15 15:51 UTC (permalink / raw)
  To: Stam Markianos-Wright, gcc-patches


On 11/11/2022 17:40, Stam Markianos-Wright via Gcc-patches wrote:
> Hi all,
>
> This is the 2/2 patch that contains the functional changes needed
> for MVE Tail Predicated Low Overhead Loops.  See my previous email
> for a general introduction of MVE LOLs.
>
> This support is added through the already existing loop-doloop
> mechanisms that are used for non-MVE dls/le looping.
>
> Changes are:
>
> 1) Relax the loop-doloop mechanism in the mid-end to allow for
>    decrement numbers other that -1 and for `count` to be an
>    rtx containing the number of elements to be processed, rather
>    than an expression for calculating the number of iterations.
> 2) Add a `allow_elementwise_doloop` target hook. This allows the
>    target backend to manipulate the iteration count as it needs:
>    in our case to change it from a pre-calculation of the number
>    of iterations to the number of elements to be processed.
> 3) The doloop_end target-insn now had an additional parameter:
>    the `count` (note: this is before it gets modified to just be
>    the number of elements), so that the decrement value is
>    extracted from that parameter.
>
> And many things in the backend to implement the above optimisation:
>
> 4)  Appropriate changes to the define_expand of doloop_end and new
>     patterns for dlstp and letp.
> 5) `arm_attempt_dlstp_transform`: (called from the define_expand of
>     doloop_end) this function checks for the loop's suitability for
>     dlstp/letp transformation and then implements it, if possible.
> 6) `arm_mve_get_loop_unique_vctp`: A function that loops through
>     the loop contents and returns the vctp VPR-genereting operation
>     within the loop, if it is unique and there is exclusively one
>     vctp within the loop.
> 7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map
>    from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
>    to check an insn to see if it requires the VPR or not.
>
> No regressions on arm-none-eabi with various targets and on
> aarch64-none-elf. Thoughts on getting this into trunk?
>
> Thank you,
> Stam Markianos-Wright
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64.md: Add extra doloop_end arg.
>         * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
>         * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
>         (arm_mve_get_vctp_lanes): New.
>         (arm_get_required_vpr_reg): New.
>         (arm_mve_get_loop_unique_vctp): New.
>         (arm_attempt_dlstp_transform): New.
>         (arm_allow_elementwise_doloop): New.
>         * config/arm/iterators.md:
>         * config/arm/mve.md (*predicated_doloop_end_internal): New.
>         (dlstp<mode1>_insn): New.
>         * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>         * config/arm/unspecs.md: New unspecs.
>         * config/ia64/ia64.md: Add extra doloop_end arg.
>         * config/pru/pru.md: Add extra doloop_end arg.
>         * config/rs6000/rs6000.md: Add extra doloop_end arg.
>         * config/s390/s390.md: Add extra doloop_end arg.
>         * config/v850/v850.md: Add extra doloop_end arg.
>         * doc/tm.texi: Document new hook.
>         * doc/tm.texi.in: Likewise.
>         * loop-doloop.cc (doloop_condition_get): Relax conditions.
>         (doloop_optimize): Add support for elementwise LoLs.
>         * target-insns.def (doloop_end): Add extra arg.
>         * target.def (allow_elementwise_doloop): New hook.
>         * targhooks.cc (default_allow_elementwise_doloop): New.
>         * targhooks.h (default_allow_elementwise_doloop): New.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/arm/lob.h: Update framework.
>         * gcc.target/arm/lob1.c: Likewise.
>         * gcc.target/arm/lob6.c: Likewise.
>         * gcc.target/arm/dlstp-int16x8.c: New test.
>         * gcc.target/arm/dlstp-int32x4.c: New test.
>         * gcc.target/arm/dlstp-int64x2.c: New test.
>         * gcc.target/arm/dlstp-int8x16.c: New test.
>
>
> ### Inline copy of patch ###
>
> diff --git a/gcc/config/aarch64/aarch64.md 
> b/gcc/config/aarch64/aarch64.md
> index 
> f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 
> 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -7366,7 +7366,8 @@
>  ;; knows what to generate.
>  (define_expand "doloop_end"
>    [(use (match_operand 0 "" ""))      ; loop pseudo
> -   (use (match_operand 1 "" ""))]     ; label
> +   (use (match_operand 1 "" ""))      ; label
> +   (use (match_operand 2 "" ""))]     ; decrement constant
>    "optimize > 0 && flag_modulo_sched"
>  {
>    rtx s0;
> diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
> index 
> 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 
> 100644
> --- a/gcc/config/arm/arm-protos.h
> +++ b/gcc/config/arm/arm-protos.h
> @@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, 
> rtx *, rtx *, rtx *);
>  extern bool arm_q_bit_access (void);
>  extern bool arm_ge_bits_access (void);
>  extern bool arm_target_insn_ok_for_lob (rtx);
> -
> +extern rtx arm_attempt_dlstp_transform (rtx, rtx);
>  #ifdef RTX_CODE
>  enum reg_class
>  arm_mode_base_reg_class (machine_mode);
> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
> index 
> ee8f1babf8a1319e77e0db0fa55851c038048804..99e144d52c26597c64b982b3d4ae9a62a114cf18 
> 100644
> --- a/gcc/config/arm/arm.cc
> +++ b/gcc/config/arm/arm.cc
> @@ -470,6 +470,9 @@ static const struct attribute_spec 
> arm_attribute_table[] =
>  #undef TARGET_SCHED_REORDER
>  #define TARGET_SCHED_REORDER arm_sched_reorder
>
> +#undef TARGET_ALLOW_ELEMENTWISE_DOLOOP
> +#define TARGET_ALLOW_ELEMENTWISE_DOLOOP arm_allow_elementwise_doloop
> +
Just a nit but maybe keep the same naming scheme as the existing hook:
TARGET_CAN_USE_ELEMENTWISE_DOLOOP_P ?
> +
> +static rtx
> +arm_get_required_vpr_reg (rtx_insn *insn)
> +{
> +  bool requires_vpr;
> +
> +  extract_constrain_insn (insn);
> +  int n_operands = recog_data.n_operands;
> +  if (recog_data.n_alternatives == 0)
> +    return NULL_RTX;
> +
> +  /* Fill in recog_op_alt with information about the constraints of
> +     this insn.  */
> +  preprocess_constraints (insn);
> +
> +  for (int use = 0; use < n_operands; use++)
> +    {
> +      requires_vpr = true;
> +      /* Iterate through alternatives of operand "use" in 
> recog_op_alt and
> +       * identify if the operand is required to be the VPR.  */
Remove the * at the start of the new line.
> +      for (int alt1 = 0; alt1 < recog_data.n_alternatives; alt1++)
> +    {
> +      const operand_alternative *op_alt1
> +          = &recog_op_alt[alt1 * n_operands];
> +      /* Fetch the reg_class for each entry and check it against the
> +       * VPR_REG reg_class.  */
> +      if (alternative_class (op_alt1, use) != VPR_REG)
> +        requires_vpr = false;
> +    }
> +      /* If all alternatives of the insn require the VPR reg for this 
> operand,
> +     it means that either this is VPR-generating instruction, like a 
> vctp,
> +     vcmp, etc., or it is a VPT-predicated insruction.  Return the 
> subrtx
> +     of the VPR reg operand.  */
> +      if (requires_vpr)
> +    return recog_data.operand[use];
> +    }
> +  return NULL_RTX;
> +}
> +
> +/* Scan the basic block of a loop body for a vctp instruction. If 
> there is
> +   exactly one unique vctp instruction, return its rtx_insn *. */
> +
> +static rtx_insn *
> +arm_mve_get_loop_unique_vctp (basic_block bb)
> +{
> +  rtx_insn *insn = BB_HEAD (bb);
> +  rtx_insn *vctp_op = NULL;
> +
> +  /* Now scan through all the instruction patterns and
> +     pick out any MVE instructions.  */
> +  FOR_BB_INSNS (bb, insn)
> +    {
> +      if (INSN_P (insn))
> +    {
> +      /* First check if this is a vctp instruction.  There needs to be
> +         exactly one vctp instruction within the loop.  */
> +      if (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
> +        {
> +          /* If we already found one vctp instruction, then the
> +         loop is not consistent internally.  */
> +          if (vctp_op)
> +        return NULL;
> +
> +          vctp_op = insn;
> +        }
> +    }
> +    }
> +  return vctp_op;
> +}
> +
> +rtx
> +arm_attempt_dlstp_transform (rtx label, rtx count)
> +{
> +  int decrementnum;
> +  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
> +  rtx initial_compare;
> +  /* Doloop can only be done "elementwise" with predicated dlstp/letp
> +     when the iteration counter gets deprecated by the number of MVE
s/deprecated/decreased/ ? I think
> + lanes.  This can be exxtracted from the `count`, which is the 
> expression
s/exxtracted/extracted/
> +     used to calculate the number of iterations that the loop would 
> execute
> +     for a standard dls/le loop.  Since we only support cases where 
> this is a
> +     power of 2, we can assume that this expression arrives here as:
> +       (lshiftrt: (A) (const_int y))
> +     Then we can extract the decrementnum from y.  */
> +  if (GET_CODE (count) == LSHIFTRT && ARITHMETIC_P (XEXP (count, 0))
> +      && (decrementnum = (1 << (INTVAL (XEXP (count, 1)))))
Why are you calculating decrementnum inside the condition?
> +      /* There is one final condition that needs to be met for the 
> loop to be
> +     transformable: dlstp/letp will continue looping until there are
> +     elements still to process.  This can only work if the looping ends
> +     when the element counter reaches zero and not some other value
> +     (e.g. n > 0 works, not n > 1), or we can incorrectly end up running
> +     one additional iteration.  To by-pass any hoisting that the 
> compiler
> +     may have done with the `A` in `count` above, we can instead look up
> +     to the bb before the loop preheader: this should end with a 
> cmp+jump
> +     pair, where the cmp needs to be with (const_int 0).  */

I'm wondering whether it would be possible to subtract a non-zero const 
from count. But that might be dangerous...

Do you have an example/test case where you saw this happen?

> + && loop_preheader_edge (body->loop_father)->src->prev_bb
> +      && BB_END (loop_preheader_edge (body->loop_father)->src->prev_bb)
> +      && PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
> +                ->src->prev_bb))
> +      && INSN_P (PREV_INSN (BB_END (loop_preheader_edge 
> (body->loop_father)
> +                    ->src->prev_bb)))
> +      && (initial_compare
> +      = PATTERN (PREV_INSN (BB_END (loop_preheader_edge 
> (body->loop_father)
> +                        ->src->prev_bb))))
> +      && GET_CODE (initial_compare) == SET
> +      && cc_register (XEXP (initial_compare, 0), VOIDmode)
> +      && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
> +      && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
> +      && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0)
> +    {
> +      /* Find the vctp predicate generation inside the loop body BB.  */
> +      rtx_insn *vctp_insn = arm_mve_get_loop_unique_vctp (body);
> +
> +      /* If we have successfully found one exactly vctp 
> predicate-generating
> +     instruction within the loop and the number by which we deprecate 
> the
> +     loop counter in each iteration matches the number of lanes of the
> +     vctp instruction, we can attempt to turn this into a dlstp/letp 
> loop.
> +     */
> +      if (vctp_insn
> +      && decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
> +    {
I would exit early here, so you don't need to indent the rest of the 
code, by that I mean something like:

if (!vectp_insn
      || decrementnum != ...)
   return GEN_INT (1);

.. rest of the code.

> + rtx_insn *insn = 0;
> +      rtx_insn *cur_insn = 0;
> +      rtx_insn *seq;
> +      rtx vctp_vpr_generated = NULL_RTX;
> +      rtx insn_vpr_reg_operand = NULL_RTX;
> +      bool transform_worked = true;
Won't need transform_worked.
> +      int new_icode;
> +
> +      /* Scan through the insns in the loop bb and emit the 
> transformed bb
> +         insns to a sequence.  */
> +      start_sequence ();
> +      FOR_BB_INSNS (body, insn)
> +        {
> +          if (INSN_P (insn))
This also captures DEBUG_INSNs, which means passing -g disables this 
feature.
> +        {
> +          /* When we find the vctp instruction: This may be followed by
> +             a sign-extend insn to SImode.  If it is, then save the
> +             sign-extended REG into vctp_vpr_generated.  If there is no
> +             sign-extend, then store the raw output of the vctp.
> +             For any VPT-predicated instructions we need to ensure that
> +             the VPR they use is the same as the one given here and
> +             they often consume the output of a subreg of the SImode
> +             sign-extended VPR-reg.  As a result, comparing against the
> +             output of the sign-extend is more likely to succeed.
> +             This code also guarantees to us that the vctp comes before
> +             any instructions that use the VPR within the loop, for the
> +             dlstp/letp transform to succeed.  */
> +          if (insn == vctp_insn)
> +            {
> +              if (GET_CODE (XEXP (PATTERN (NEXT_INSN (insn)), 1))
> +                  == SIGN_EXTEND
> +              && GET_CODE (XEXP (
> +                 PATTERN (NEXT_INSN (NEXT_INSN (insn))), 1))
> +                 == SUBREG)
> +            vctp_vpr_generated
> +                = XEXP (PATTERN (NEXT_INSN (NEXT_INSN (insn))), 0);
> +              else
> +            vctp_vpr_generated = XEXP (PATTERN (insn), 0);
> +              /* Also emit a USE of the source register of the vctp.
> +             This holds the number of elements being processed
> +             by the loop.  This later gets stored into `count`.
> +             */
> +              emit_use (XVECEXP (XEXP (PATTERN (insn), 1), 0, 0));
What if we get here but don't end up creating a predicated do-loop? Will 
this use break something?
> +       continue;
> +            }
> +          /* If the insn pattern requires the use of the VPR, then it
Missing an is.
> +      a VPT-predicated instruction, so it will need to be
> +             transformed into the non-predicated version of the
> +             instruction.  */
But this comment seems misplace here.
> + else if ((insn_vpr_reg_operand
> +                = arm_get_required_vpr_reg (insn))
> +               != NULL_RTX)
> +            {
> +              /* If the VPR value is different to the one generated by
> +             the vctp, then fail the conversion.  */
> +              if (!rtx_equal_p (vctp_vpr_generated,
> +                    insn_vpr_reg_operand))
> +            {
> +              transform_worked = false;
> +              break;
return GEN_INT (1);
> +     }
> +              /* Also ensure that it's a valid recog-ed instruction with
> +             the mve_unpredicated_insn atrribute.  */
> +              else if (recog_memoized (insn) >= 0
> +                   && (new_icode
> +                   = get_attr_mve_unpredicated_insn (insn)))
> +            {
> +              extract_insn (insn);
> +              rtx arr[8];
> +              int j = 0;
> +
> +              /* When transforming a VPT-predicated instruction
> +                 into its unpredicated equivalent we need to drop
> +                 the VPR operand and we may need to also drop a
> +                 merge "vuninit" input operand, depending on the
> +                 instruction pattern.  Here ensure that we have at
> +                 most a two-operand difference between the two
> +                 instrunctions.  */
> +              int n_operands_diff
> +                  = recog_data.n_operands
> +                - insn_data[new_icode].n_operands;
> +              gcc_assert (n_operands_diff > 0
> +                      && n_operands_diff <= 2);
> +
> +              /* Then, loop through the operands of the predicated
> +                 instruction, and retain the ones that map to the
> +                 unpredicated instruction.  */
> +              for (int i = 0; i < recog_data.n_operands; i++)
> +                {
> +                  /* Ignore the VPR and, if needed, the vuninit
> +                 operand.  */
> +                  if (insn_vpr_reg_operand == recog_data.operand[i]
> +                  || (n_operands_diff == 2
> +                      && !strcmp (recog_data.constraints[i],
> +                          "0")))
> +                continue;
> +                  else
> +                {
> +                  arr[j] = recog_data.operand[i];
> +                  j++;
> +                }
> +                }
> +
> +              /* Finally, emit the upredicated instruction.  */
> +              switch (j)
> +                {
> +                  case 2:
> +                emit_insn (GEN_FCN (new_icode) (arr[0],
> +                                arr[1]));
> +                break;
> +                  case 3:
> +                emit_insn (GEN_FCN (new_icode) (arr[0], arr[1],
> +                                arr[2]));
> +                break;
> +                  default:
> +                gcc_unreachable ();
> +                }
> +            }
> +              /* If we can't identify the INSN as either being either
> +             for deletion or to re-map, then we don't know how to
> +             handle it, so fail the whole conversion.  */
> +              else
> +            {
> +              transform_worked = false;
> +              break;
use
return GEN_INT (1);
> +     }
> +            }
> +          /* Instructions that dont's require the VPR can be carried
> +             over as-is.  */
> +          else
> +            emit_insn (PATTERN (insn));
> +        }
> +        }
> +      seq = get_insns ();
> +      end_sequence ();
> +
> +      if (transform_worked)
> +        {
no need to check this, you can only get here if it worked.
> + /* Re-write the entire BB contents with the transformed
> +         sequence.  */
> +          FOR_BB_INSNS_SAFE (body, insn, cur_insn)
> +        if (INSN_P (insn))
> +          delete_insn (insn);
This will also delete DEBUG_INSN's! You'd probably want to delete only 
NONDEBUG_INSN_P (insn). I'm not an expert in how DEBUG_INSNs work but I 
suspect their order compared to non-debug insns are likely to be 
important, so really you'd want change how you 'transform' the BB and do 
inline insn replacement.
> + for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
> +        emit_insn_after (PATTERN (insn), BB_END (body));
> +          emit_jump_insn_after (PATTERN (insn), BB_END (body));
> +          return GEN_INT (decrementnum);
> +        }
> +    }
> +    }
> +  /* Bail out: we can't use dlstp/letp, so return 1 to allow 
> loop-doloop to try
> +     the standard dls/le pair.  */
> +  return GEN_INT (1);
> +}
>
Only reviewed until here, will look at the rest later.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2022-11-15 15:51 ` Andre Vieira (lists)
@ 2022-11-28 12:13   ` Stam Markianos-Wright
  0 siblings, 0 replies; 14+ messages in thread
From: Stam Markianos-Wright @ 2022-11-28 12:13 UTC (permalink / raw)
  To: Andre Vieira (lists), gcc-patches

[-- Attachment #1: Type: text/plain, Size: 24520 bytes --]


On 11/15/22 15:51, Andre Vieira (lists) wrote:
>
> On 11/11/2022 17:40, Stam Markianos-Wright via Gcc-patches wrote:
>> Hi all,
>>
>> This is the 2/2 patch that contains the functional changes needed
>> for MVE Tail Predicated Low Overhead Loops.  See my previous email
>> for a general introduction of MVE LOLs.
>>
>> This support is added through the already existing loop-doloop
>> mechanisms that are used for non-MVE dls/le looping.
>>
>> Changes are:
>>
>> 1) Relax the loop-doloop mechanism in the mid-end to allow for
>>    decrement numbers other that -1 and for `count` to be an
>>    rtx containing the number of elements to be processed, rather
>>    than an expression for calculating the number of iterations.
>> 2) Add a `allow_elementwise_doloop` target hook. This allows the
>>    target backend to manipulate the iteration count as it needs:
>>    in our case to change it from a pre-calculation of the number
>>    of iterations to the number of elements to be processed.
>> 3) The doloop_end target-insn now had an additional parameter:
>>    the `count` (note: this is before it gets modified to just be
>>    the number of elements), so that the decrement value is
>>    extracted from that parameter.
>>
>> And many things in the backend to implement the above optimisation:
>>
>> 4)  Appropriate changes to the define_expand of doloop_end and new
>>     patterns for dlstp and letp.
>> 5) `arm_attempt_dlstp_transform`: (called from the define_expand of
>>     doloop_end) this function checks for the loop's suitability for
>>     dlstp/letp transformation and then implements it, if possible.
>> 6) `arm_mve_get_loop_unique_vctp`: A function that loops through
>>     the loop contents and returns the vctp VPR-genereting operation
>>     within the loop, if it is unique and there is exclusively one
>>     vctp within the loop.
>> 7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map
>>    from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
>>    to check an insn to see if it requires the VPR or not.
>>
>> No regressions on arm-none-eabi with various targets and on
>> aarch64-none-elf. Thoughts on getting this into trunk?
>>
>> Thank you,
>> Stam Markianos-Wright
>>
>> gcc/ChangeLog:
>>
>>         * config/aarch64/aarch64.md: Add extra doloop_end arg.
>>         * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
>>         * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
>>         (arm_mve_get_vctp_lanes): New.
>>         (arm_get_required_vpr_reg): New.
>>         (arm_mve_get_loop_unique_vctp): New.
>>         (arm_attempt_dlstp_transform): New.
>>         (arm_allow_elementwise_doloop): New.
>>         * config/arm/iterators.md:
>>         * config/arm/mve.md (*predicated_doloop_end_internal): New.
>>         (dlstp<mode1>_insn): New.
>>         * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>>         * config/arm/unspecs.md: New unspecs.
>>         * config/ia64/ia64.md: Add extra doloop_end arg.
>>         * config/pru/pru.md: Add extra doloop_end arg.
>>         * config/rs6000/rs6000.md: Add extra doloop_end arg.
>>         * config/s390/s390.md: Add extra doloop_end arg.
>>         * config/v850/v850.md: Add extra doloop_end arg.
>>         * doc/tm.texi: Document new hook.
>>         * doc/tm.texi.in: Likewise.
>>         * loop-doloop.cc (doloop_condition_get): Relax conditions.
>>         (doloop_optimize): Add support for elementwise LoLs.
>>         * target-insns.def (doloop_end): Add extra arg.
>>         * target.def (allow_elementwise_doloop): New hook.
>>         * targhooks.cc (default_allow_elementwise_doloop): New.
>>         * targhooks.h (default_allow_elementwise_doloop): New.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/arm/lob.h: Update framework.
>>         * gcc.target/arm/lob1.c: Likewise.
>>         * gcc.target/arm/lob6.c: Likewise.
>>         * gcc.target/arm/dlstp-int16x8.c: New test.
>>         * gcc.target/arm/dlstp-int32x4.c: New test.
>>         * gcc.target/arm/dlstp-int64x2.c: New test.
>>         * gcc.target/arm/dlstp-int8x16.c: New test.
>>
>>
>> ### Inline copy of patch ###
>>
>> diff --git a/gcc/config/aarch64/aarch64.md 
>> b/gcc/config/aarch64/aarch64.md
>> index 
>> f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 
>> 100644
>> --- a/gcc/config/aarch64/aarch64.md
>> +++ b/gcc/config/aarch64/aarch64.md
>> @@ -7366,7 +7366,8 @@
>>  ;; knows what to generate.
>>  (define_expand "doloop_end"
>>    [(use (match_operand 0 "" ""))      ; loop pseudo
>> -   (use (match_operand 1 "" ""))]     ; label
>> +   (use (match_operand 1 "" ""))      ; label
>> +   (use (match_operand 2 "" ""))]     ; decrement constant
>>    "optimize > 0 && flag_modulo_sched"
>>  {
>>    rtx s0;
>> diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
>> index 
>> 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 
>> 100644
>> --- a/gcc/config/arm/arm-protos.h
>> +++ b/gcc/config/arm/arm-protos.h
>> @@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx 
>> *, rtx *, rtx *, rtx *);
>>  extern bool arm_q_bit_access (void);
>>  extern bool arm_ge_bits_access (void);
>>  extern bool arm_target_insn_ok_for_lob (rtx);
>> -
>> +extern rtx arm_attempt_dlstp_transform (rtx, rtx);
>>  #ifdef RTX_CODE
>>  enum reg_class
>>  arm_mode_base_reg_class (machine_mode);
>> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
>> index 
>> ee8f1babf8a1319e77e0db0fa55851c038048804..99e144d52c26597c64b982b3d4ae9a62a114cf18 
>> 100644
>> --- a/gcc/config/arm/arm.cc
>> +++ b/gcc/config/arm/arm.cc
>> @@ -470,6 +470,9 @@ static const struct attribute_spec 
>> arm_attribute_table[] =
>>  #undef TARGET_SCHED_REORDER
>>  #define TARGET_SCHED_REORDER arm_sched_reorder
>>
>> +#undef TARGET_ALLOW_ELEMENTWISE_DOLOOP
>> +#define TARGET_ALLOW_ELEMENTWISE_DOLOOP arm_allow_elementwise_doloop
>> +
> Just a nit but maybe keep the same naming scheme as the existing hook:
> TARGET_CAN_USE_ELEMENTWISE_DOLOOP_P ?

Done,

Also:

Thanks for taking time to review!

I've done a second draft as an attachment to this email. Let me know of any

further when you get to the second half of the patch.

>> +
>> +static rtx
>> +arm_get_required_vpr_reg (rtx_insn *insn)
>> +{
>> +  bool requires_vpr;
>> +
>> +  extract_constrain_insn (insn);
>> +  int n_operands = recog_data.n_operands;
>> +  if (recog_data.n_alternatives == 0)
>> +    return NULL_RTX;
>> +
>> +  /* Fill in recog_op_alt with information about the constraints of
>> +     this insn.  */
>> +  preprocess_constraints (insn);
>> +
>> +  for (int use = 0; use < n_operands; use++)
>> +    {
>> +      requires_vpr = true;
>> +      /* Iterate through alternatives of operand "use" in 
>> recog_op_alt and
>> +       * identify if the operand is required to be the VPR.  */
> Remove the * at the start of the new line.
Done.
>> +      for (int alt1 = 0; alt1 < recog_data.n_alternatives; alt1++)
>> +    {
>> +      const operand_alternative *op_alt1
>> +          = &recog_op_alt[alt1 * n_operands];
>> +      /* Fetch the reg_class for each entry and check it against the
>> +       * VPR_REG reg_class.  */
>> +      if (alternative_class (op_alt1, use) != VPR_REG)
>> +        requires_vpr = false;
>> +    }
>> +      /* If all alternatives of the insn require the VPR reg for 
>> this operand,
>> +     it means that either this is VPR-generating instruction, like a 
>> vctp,
>> +     vcmp, etc., or it is a VPT-predicated insruction.  Return the 
>> subrtx
>> +     of the VPR reg operand.  */
>> +      if (requires_vpr)
>> +    return recog_data.operand[use];
>> +    }
>> +  return NULL_RTX;
>> +}
>> +
>> +/* Scan the basic block of a loop body for a vctp instruction. If 
>> there is
>> +   exactly one unique vctp instruction, return its rtx_insn *. */
>> +
>> +static rtx_insn *
>> +arm_mve_get_loop_unique_vctp (basic_block bb)
>> +{
>> +  rtx_insn *insn = BB_HEAD (bb);
>> +  rtx_insn *vctp_op = NULL;
>> +
>> +  /* Now scan through all the instruction patterns and
>> +     pick out any MVE instructions.  */
>> +  FOR_BB_INSNS (bb, insn)
>> +    {
>> +      if (INSN_P (insn))
>> +    {
>> +      /* First check if this is a vctp instruction.  There needs to be
>> +         exactly one vctp instruction within the loop.  */
>> +      if (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
>> +        {
>> +          /* If we already found one vctp instruction, then the
>> +         loop is not consistent internally.  */
>> +          if (vctp_op)
>> +        return NULL;
>> +
>> +          vctp_op = insn;
>> +        }
>> +    }
>> +    }
>> +  return vctp_op;
>> +}
>> +
>> +rtx
>> +arm_attempt_dlstp_transform (rtx label, rtx count)
>> +{
>> +  int decrementnum;
>> +  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
>> +  rtx initial_compare;
>> +  /* Doloop can only be done "elementwise" with predicated dlstp/letp
>> +     when the iteration counter gets deprecated by the number of MVE
> s/deprecated/decreased/ ? I think

"decremented" was what I was going for, so changed it to that :)


>> + lanes.  This can be exxtracted from the `count`, which is the 
>> expression
> s/exxtracted/extracted/
Done.
>> +     used to calculate the number of iterations that the loop would 
>> execute
>> +     for a standard dls/le loop.  Since we only support cases where 
>> this is a
>> +     power of 2, we can assume that this expression arrives here as:
>> +       (lshiftrt: (A) (const_int y))
>> +     Then we can extract the decrementnum from y.  */
>> +  if (GET_CODE (count) == LSHIFTRT && ARITHMETIC_P (XEXP (count, 0))
>> +      && (decrementnum = (1 << (INTVAL (XEXP (count, 1)))))
> Why are you calculating decrementnum inside the condition?

Ahh ok so this is an artifact from a previous revision, where I also 
included the:

`decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn))`

within that condition, but since that is now done late, I can move this 
down.

>> +      /* There is one final condition that needs to be met for the 
>> loop to be
>> +     transformable: dlstp/letp will continue looping until there are
>> +     elements still to process.  This can only work if the looping ends
>> +     when the element counter reaches zero and not some other value
>> +     (e.g. n > 0 works, not n > 1), or we can incorrectly end up 
>> running
>> +     one additional iteration.  To by-pass any hoisting that the 
>> compiler
>> +     may have done with the `A` in `count` above, we can instead 
>> look up
>> +     to the bb before the loop preheader: this should end with a 
>> cmp+jump
>> +     pair, where the cmp needs to be with (const_int 0).  */
>
> I'm wondering whether it would be possible to subtract a non-zero 
> const from count. But that might be dangerous...
>
> Do you have an example/test case where you saw this happen?

So this can happen in situations like:

     while (n > 1)
     {
         mve_pred16_t p = vctp8q (n);
         uint8x16_t va = vldrbq_z_u8 (a, p);
         uint8x16_t vb = vldrbq_z_u8 (b, p);
         uint8x16_t vc = vaddq_x_u8 (va, vb, p);
         vstrbq_p_u8 (c, vc, p);
         n-=16;
         a+=16;
         b+=16;
         c+=16;
     }

IIUC you mean that we'd end up with something like:

         sub r3, r3, #1
         dlstp.8 lr, r3

to enable loops that aren't terminating at zero.

I believe the answer is `No`, because even though we would execute the 
correct

number of iterations, we would use an incorrect predicate value:

e.g. if n == r3 == 18:

we'd subtract a const 1: r3 == 17

Iteration 1: use all 16 lanes: Correct.
Iteration 2: use 1 lane: Incorrect -- from the user code it looks like 
we should use 2.


Because the number of iterations and the predicate value here are tied, 
I think we

can only safely transform loops that terminate with a `>0`


On standard dls/le loops we do already subtract the non-zero const.

>
>> + && loop_preheader_edge (body->loop_father)->src->prev_bb
>> +      && BB_END (loop_preheader_edge (body->loop_father)->src->prev_bb)
>> +      && PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
>> +                ->src->prev_bb))
>> +      && INSN_P (PREV_INSN (BB_END (loop_preheader_edge 
>> (body->loop_father)
>> +                    ->src->prev_bb)))
>> +      && (initial_compare
>> +      = PATTERN (PREV_INSN (BB_END (loop_preheader_edge 
>> (body->loop_father)
>> +                        ->src->prev_bb))))
>> +      && GET_CODE (initial_compare) == SET
>> +      && cc_register (XEXP (initial_compare, 0), VOIDmode)
>> +      && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
>> +      && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
>> +      && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0)
>> +    {
>> +      /* Find the vctp predicate generation inside the loop body 
>> BB.  */
>> +      rtx_insn *vctp_insn = arm_mve_get_loop_unique_vctp (body);
>> +
>> +      /* If we have successfully found one exactly vctp 
>> predicate-generating
>> +     instruction within the loop and the number by which we 
>> deprecate the
>> +     loop counter in each iteration matches the number of lanes of the
>> +     vctp instruction, we can attempt to turn this into a dlstp/letp 
>> loop.
>> +     */
>> +      if (vctp_insn
>> +      && decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
>> +    {
> I would exit early here, so you don't need to indent the rest of the 
> code, by that I mean something like:
>
> if (!vectp_insn
>      || decrementnum != ...)
>   return GEN_INT (1);
>
> .. rest of the code.
Done, thanks!
>
>> + rtx_insn *insn = 0;
>> +      rtx_insn *cur_insn = 0;
>> +      rtx_insn *seq;
>> +      rtx vctp_vpr_generated = NULL_RTX;
>> +      rtx insn_vpr_reg_operand = NULL_RTX;
>> +      bool transform_worked = true;
> Won't need transform_worked.
Done
>> +      int new_icode;
>> +
>> +      /* Scan through the insns in the loop bb and emit the 
>> transformed bb
>> +         insns to a sequence.  */
>> +      start_sequence ();
>> +      FOR_BB_INSNS (body, insn)
>> +        {
>> +          if (INSN_P (insn))
> This also captures DEBUG_INSNs, which means passing -g disables this 
> feature.


Fixed: description of this is later on

>> +        {
>> +          /* When we find the vctp instruction: This may be followed by
>> +             a sign-extend insn to SImode.  If it is, then save the
>> +             sign-extended REG into vctp_vpr_generated.  If there is no
>> +             sign-extend, then store the raw output of the vctp.
>> +             For any VPT-predicated instructions we need to ensure that
>> +             the VPR they use is the same as the one given here and
>> +             they often consume the output of a subreg of the SImode
>> +             sign-extended VPR-reg.  As a result, comparing against the
>> +             output of the sign-extend is more likely to succeed.
>> +             This code also guarantees to us that the vctp comes before
>> +             any instructions that use the VPR within the loop, for the
>> +             dlstp/letp transform to succeed.  */
>> +          if (insn == vctp_insn)
>> +            {
>> +              if (GET_CODE (XEXP (PATTERN (NEXT_INSN (insn)), 1))
>> +                  == SIGN_EXTEND
>> +              && GET_CODE (XEXP (
>> +                 PATTERN (NEXT_INSN (NEXT_INSN (insn))), 1))
>> +                 == SUBREG)
>> +            vctp_vpr_generated
>> +                = XEXP (PATTERN (NEXT_INSN (NEXT_INSN (insn))), 0);
>> +              else
>> +            vctp_vpr_generated = XEXP (PATTERN (insn), 0);
>> +              /* Also emit a USE of the source register of the vctp.
>> +             This holds the number of elements being processed
>> +             by the loop.  This later gets stored into `count`.
>> +             */
>> +              emit_use (XVECEXP (XEXP (PATTERN (insn), 1), 0, 0));
> What if we get here but don't end up creating a predicated do-loop? 
> Will this use break something?

Nope, it won't break anything because of the use of 
start_sequence/end_sequence. The use and all

the transform insns get emitted to the sequence, and then only if we 
have fully succeeded, we

will wipe all the contents of the `bb` and put in the sequence (incl. 
the use).


The `use` gets deleted later in `arm_allow_elementwise_doloop_p` and 
really it's just a dodgy way

for me to preserve the REG used in the vctp, so that later I can make it 
the `count` or the number

of elements to be processed by the loop (the `dlstp lr, REG` register)

>> +       continue;
>> +            }
>> +          /* If the insn pattern requires the use of the VPR, then it
> Missing an is.
Done
>> +      a VPT-predicated instruction, so it will need to be
>> +             transformed into the non-predicated version of the
>> +             instruction.  */
> But this comment seems misplace here.
Adjusted this :)
>> + else if ((insn_vpr_reg_operand
>> +                = arm_get_required_vpr_reg (insn))
>> +               != NULL_RTX)
>> +            {
>> +              /* If the VPR value is different to the one generated by
>> +             the vctp, then fail the conversion.  */
>> +              if (!rtx_equal_p (vctp_vpr_generated,
>> +                    insn_vpr_reg_operand))
>> +            {
>> +              transform_worked = false;
>> +              break;
> return GEN_INT (1);
Done -- but also needed an end_sequece ();
>> +     }
>> +              /* Also ensure that it's a valid recog-ed instruction 
>> with
>> +             the mve_unpredicated_insn atrribute.  */
>> +              else if (recog_memoized (insn) >= 0
>> +                   && (new_icode
>> +                   = get_attr_mve_unpredicated_insn (insn)))
>> +            {
>> +              extract_insn (insn);
>> +              rtx arr[8];
>> +              int j = 0;
>> +
>> +              /* When transforming a VPT-predicated instruction
>> +                 into its unpredicated equivalent we need to drop
>> +                 the VPR operand and we may need to also drop a
>> +                 merge "vuninit" input operand, depending on the
>> +                 instruction pattern.  Here ensure that we have at
>> +                 most a two-operand difference between the two
>> +                 instrunctions.  */
>> +              int n_operands_diff
>> +                  = recog_data.n_operands
>> +                - insn_data[new_icode].n_operands;
>> +              gcc_assert (n_operands_diff > 0
>> +                      && n_operands_diff <= 2);
>> +
>> +              /* Then, loop through the operands of the predicated
>> +                 instruction, and retain the ones that map to the
>> +                 unpredicated instruction.  */
>> +              for (int i = 0; i < recog_data.n_operands; i++)
>> +                {
>> +                  /* Ignore the VPR and, if needed, the vuninit
>> +                 operand.  */
>> +                  if (insn_vpr_reg_operand == recog_data.operand[i]
>> +                  || (n_operands_diff == 2
>> +                      && !strcmp (recog_data.constraints[i],
>> +                          "0")))
>> +                continue;
>> +                  else
>> +                {
>> +                  arr[j] = recog_data.operand[i];
>> +                  j++;
>> +                }
>> +                }
>> +
>> +              /* Finally, emit the upredicated instruction.  */
>> +              switch (j)
>> +                {
>> +                  case 2:
>> +                emit_insn (GEN_FCN (new_icode) (arr[0],
>> +                                arr[1]));
>> +                break;
>> +                  case 3:
>> +                emit_insn (GEN_FCN (new_icode) (arr[0], arr[1],
>> +                                arr[2]));
>> +                break;
>> +                  default:
>> +                gcc_unreachable ();
>> +                }
>> +            }
>> +              /* If we can't identify the INSN as either being either
>> +             for deletion or to re-map, then we don't know how to
>> +             handle it, so fail the whole conversion.  */
>> +              else
>> +            {
>> +              transform_worked = false;
>> +              break;
> use
> return GEN_INT (1);
Done.
>> +     }
>> +            }
>> +          /* Instructions that dont's require the VPR can be carried
>> +             over as-is.  */
>> +          else
>> +            emit_insn (PATTERN (insn));
>> +        }
>> +        }
>> +      seq = get_insns ();
>> +      end_sequence ();
>> +
>> +      if (transform_worked)
>> +        {
> no need to check this, you can only get here if it worked.
Done.
>> + /* Re-write the entire BB contents with the transformed
>> +         sequence.  */
>> +          FOR_BB_INSNS_SAFE (body, insn, cur_insn)
>> +        if (INSN_P (insn))
>> +          delete_insn (insn);
> This will also delete DEBUG_INSN's! You'd probably want to delete only 
> NONDEBUG_INSN_P (insn). I'm not an expert in how DEBUG_INSNs work but 
> I suspect their order compared to non-debug insns are likely to be 
> important, so really you'd want change how you 'transform' the BB and 
> do inline insn replacement.
You are correct, thanks for spotting this!

So I do want to also delete debug_insns, because what I want to do is 
replace all the

bb contents with the previous sequence.

Instead, though I need to make sure that DEBUG_INSNs get put into the 
sequence

correctly (which wasn't working).


I also found a similar situation with NOTEs (which are !INSN_Ps), if 
there were any

in the bb other than the NOTE_INSN_BASIC_BLOCK, they would get filtered 
up to

the start of the bb. This should also be fixed now.

>> + for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
>> +        emit_insn_after (PATTERN (insn), BB_END (body));
>> +          emit_jump_insn_after (PATTERN (insn), BB_END (body));
>> +          return GEN_INT (decrementnum);
>> +        }
>> +    }
>> +    }
>> +  /* Bail out: we can't use dlstp/letp, so return 1 to allow 
>> loop-doloop to try
>> +     the standard dls/le pair.  */
>> +  return GEN_INT (1);
>> +}
>>
> Only reviewed until here, will look at the rest later.

[-- Attachment #2: rb16111.patch --]
[-- Type: text/x-patch, Size: 52330 bytes --]

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f2e3d905dbb..7a6d24a8006 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -7366,7 +7366,8 @@
 ;; knows what to generate.
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))      ; loop pseudo
-   (use (match_operand 1 "" ""))]     ; label
+   (use (match_operand 1 "" ""))      ; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
   "optimize > 0 && flag_modulo_sched"
 {
   rtx s0;
diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md
index 458d3edf716..a4b3d1addbf 100644
--- a/gcc/config/arc/arc.md
+++ b/gcc/config/arc/arc.md
@@ -4985,12 +4985,13 @@ archs4x, archs4xd"
 		    (pc)))
 	      (set (match_dup 0) (plus:SI (match_dup 0) (const_int -1)))
 	      (unspec:SI [(const_int 0)] UNSPEC_ARC_LP)
-	      (clobber (match_dup 2))])]
+	      (clobber (match_dup 3))
+	      (match_operand 2 "" "")])]
   ""
 {
  if (GET_MODE (operands[0]) != SImode)
    FAIL;
- operands[2] = gen_rtx_SCRATCH (SImode);
+ operands[3] = gen_rtx_SCRATCH (SImode);
 })
 
 (define_insn "arc_lp"
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 550272facd1..7684620f0f4 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
 extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern rtx arm_attempt_dlstp_transform (rtx, rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index ee8f1babf8a..72d0187eb00 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -470,6 +470,9 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_SCHED_REORDER
 #define TARGET_SCHED_REORDER arm_sched_reorder
 
+#undef TARGET_ALLOW_ELEMENTWISE_DOLOOP_P
+#define TARGET_ALLOW_ELEMENTWISE_DOLOOP_P arm_allow_elementwise_doloop_p
+
 #undef TARGET_REGISTER_MOVE_COST
 #define TARGET_REGISTER_MOVE_COST arm_register_move_cost
 
@@ -34138,8 +34141,370 @@ arm_target_insn_ok_for_lob (rtx insn)
 
   return single_succ_p (bb)
     && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC)
+    {
+      switch (XINT (XEXP (x, 1), 1))
+	{
+	  case VCTP8Q:
+	    return 16;
+	  case VCTP16Q:
+	    return 8;
+	  case VCTP32Q:
+	    return 4;
+	  case VCTP64Q:
+	    return 2;
+	  default:
+	    break;
+	}
+    }
+  return 0;
+}
+
+/* Check if an insn requires the use of the VPR_REG, if it does, return the
+   sub-rtx of the matched operand.  If there are more than one operand (e.g. an
+   input operand and an output operand) that use VPR_REG, return the first
+   occurance, which is usually the output operand.  */
+
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn)
+{
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int use = 0; use < n_operands; use++)
+    {
+      requires_vpr = true;
+      /* Iterate through alternatives of operand "use" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt1 = 0; alt1 < recog_data.n_alternatives; alt1++)
+	{
+	  const operand_alternative *op_alt1
+	      = &recog_op_alt[alt1 * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	   * VPR_REG reg_class.  */
+	  if (alternative_class (op_alt1, use) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[use];
+    }
+  return NULL_RTX;
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   exactly one unique vctp instruction, return its rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_unique_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+  rtx_insn *vctp_op = NULL;
+
+  /* Now scan through all the instruction patterns and
+     pick out any MVE instructions.  */
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (INSN_P (insn))
+	{
+	  /* First check if this is a vctp instruction.  There needs to be
+	     exactly one vctp instruction within the loop.  */
+	  if (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	    {
+	      /* If we already found one vctp instruction, then the
+		 loop is not consistent internally.  */
+	      if (vctp_op)
+		return NULL;
+
+	      vctp_op = insn;
+	    }
+	}
+    }
+  return vctp_op;
+}
+
+rtx
+arm_attempt_dlstp_transform (rtx label, rtx count)
+{
+  int decrementnum;
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+  rtx initial_compare;
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp
+     when the iteration counter gets decremented by the number of MVE
+     lanes.  This can be extracted from the `count`, which is the expression
+     used to calculate the number of iterations that the loop would execute
+     for a standard dls/le loop.  Since we only support cases where this is a
+     power of 2, we can assume that this expression arrives here as:
+       (lshiftrt: (A) (const_int y))
+     Then we can extract the decrementnum from y.  */
+  if (GET_CODE (count) == LSHIFTRT && ARITHMETIC_P (XEXP (count, 0))
+      /* There is one final condition that needs to be met for the loop to be
+	 transformable: dlstp/letp will continue looping until there are
+	 elements still to process.  This can only work if the looping ends
+	 when the element counter reaches zero and not some other value
+	 (e.g. n > 0 works, not n > 1), or we can incorrectly end up running
+	 one additional iteration.  To by-pass any hoisting that the compiler
+	 may have done with the `A` in `count` above, we can instead look up
+	 to the bb before the loop preheader: this should end with a cmp+jump
+	 pair, where the cmp needs to be with (const_int 0).  */
+      && loop_preheader_edge (body->loop_father)->src->prev_bb
+      && BB_END (loop_preheader_edge (body->loop_father)->src->prev_bb)
+      && PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+				->src->prev_bb))
+      && INSN_P (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+					->src->prev_bb)))
+      && (initial_compare
+	  = PATTERN (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
+					    ->src->prev_bb))))
+      && GET_CODE (initial_compare) == SET
+      && cc_register (XEXP (initial_compare, 0), VOIDmode)
+      && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+      && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+      && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0)
+    {
+      /* Extract the integer decrement from the LSHIFTR condition.  */
+      decrementnum = (1 << (INTVAL (XEXP (count, 1))));
+      /* Find the vctp predicate generation inside the loop body BB.  */
+      rtx_insn *vctp_insn = arm_mve_get_loop_unique_vctp (body);
+      /* If we have successfully found one exactly vctp predicate-generating
+	 instruction within the loop and the number by which we deprecate the
+	 loop counter in each iteration matches the number of lanes of the
+	 vctp instruction, we can attempt to turn this into a dlstp/letp loop.
+	 */
+      if (!vctp_insn
+	  || decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return GEN_INT (1);
+
+      rtx_insn *insn = 0;
+      rtx_insn *cur_insn = 0;
+      rtx_insn *seq;
+      rtx vctp_vpr_generated = NULL_RTX;
+      rtx insn_vpr_reg_operand = NULL_RTX;
+      int new_icode;
+
+      /* Scan through the insns in the loop bb and emit the transformed bb
+	 insns to a sequence.  */
+      start_sequence ();
+      FOR_BB_INSNS (body, insn)
+	{
+	  if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	    continue;
+	  else if (NOTE_P (insn))
+	    emit_note ((enum insn_note) NOTE_KIND (insn));
+	  else if (!INSN_P (insn))
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	     /* When we find the vctp instruction: This may be followed by
+	     a sign-extend insn to SImode.  If it is, then save the
+	     sign-extended REG into vctp_vpr_generated.  If there is no
+	     sign-extend, then store the raw output of the vctp.
+	     For any VPT-predicated instructions we need to ensure that
+	     the VPR they use is the same as the one given here and
+	     they often consume the output of a subreg of the SImode
+	     sign-extended VPR-reg.  As a result, comparing against the
+	     output of the sign-extend is more likely to succeed.
+	     This code also guarantees to us that the vctp comes before
+	     any instructions that use the VPR within the loop, for the
+	     dlstp/letp transform to succeed.  */
+	  else if (insn == vctp_insn)
+	    {
+	      if (GET_CODE (
+		      XEXP (PATTERN (next_nonnote_nondebug_insn_bb (insn)), 1))
+		      == SIGN_EXTEND
+		  && GET_CODE (
+			 XEXP (PATTERN (next_nonnote_nondebug_insn_bb (
+				   next_nonnote_nondebug_insn_bb (insn))),
+			       1))
+			 == SUBREG)
+		vctp_vpr_generated
+		    = XEXP (PATTERN (next_nonnote_nondebug_insn_bb (
+				next_nonnote_nondebug_insn_bb (insn))),
+			    0);
+	      else
+		vctp_vpr_generated = XEXP (PATTERN (insn), 0);
+	      /* Also emit a USE of the source register of the vctp.
+		 This holds the number of elements being processed
+		 by the loop.  This later gets stored into `count`.
+		 */
+	      emit_use (XVECEXP (XEXP (PATTERN (insn), 1), 0, 0));
+	      continue;
+	    }
+	  /* If the insn pattern requires the use of the VPR, then it
+	     is a VPT-predicated instruction.  */
+	  else if ((insn_vpr_reg_operand = arm_get_required_vpr_reg (insn))
+		   != NULL_RTX)
+	    {
+	      /* If the VPR value is different to the one generated by
+		 the vctp, then fail the conversion.  */
+	      if (!rtx_equal_p (vctp_vpr_generated, insn_vpr_reg_operand))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	      /* If the insn does use the same VPR as the one generated
+		 by the vctp, it will need to be transformed into its
+		 non-predicated version.  Also ensure that it's a valid
+		 recog-ed instruction with the mve_unpredicated_insn
+		 atrribute.  */
+	      else if (recog_memoized (insn) >= 0
+		       && (new_icode = get_attr_mve_unpredicated_insn (insn)))
+		{
+		  extract_insn (insn);
+		  rtx arr[8];
+		  int j = 0;
+
+		  /* When transforming a VPT-predicated instruction
+		     into its unpredicated equivalent we need to drop
+		     the VPR operand and we may need to also drop a
+		     merge "vuninit" input operand, depending on the
+		     instruction pattern.  Here ensure that we have at
+		     most a two-operand difference between the two
+		     instrunctions.  */
+		  int n_operands_diff = recog_data.n_operands
+					- insn_data[new_icode].n_operands;
+		  gcc_assert (n_operands_diff > 0 && n_operands_diff <= 2);
+
+		  /* Then, loop through the operands of the predicated
+		     instruction, and retain the ones that map to the
+		     unpredicated instruction.  */
+		  for (int i = 0; i < recog_data.n_operands; i++)
+		    {
+		      /* Ignore the VPR and, if needed, the vuninit
+			 operand.  */
+		      if (insn_vpr_reg_operand == recog_data.operand[i]
+			  || (n_operands_diff == 2
+			      && !strcmp (recog_data.constraints[i], "0")))
+			continue;
+		      else
+			{
+			  arr[j] = recog_data.operand[i];
+			  j++;
+			}
+		    }
+
+		  /* Finally, emit the upredicated instruction.  */
+		  switch (j)
+		    {
+		      case 1:
+			emit_insn (GEN_FCN (new_icode) (arr[0]));
+			break;
+		      case 2:
+			emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+			break;
+		      case 3:
+			emit_insn (GEN_FCN (new_icode) (arr[0], arr[1],
+							arr[2]));
+			break;
+		      case 4:
+			emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+							arr[3]));
+			break;
+		      case 5:
+			emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+							arr[3], arr[4]));
+			break;
+		      case 6:
+			emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+							arr[3], arr[4],
+							arr[5]));
+			break;
+		      case 7:
+			emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+							arr[3], arr[4], arr[5],
+							arr[6]));
+			break;
+		      default:
+			gcc_unreachable ();
+		    }
+		}
+	      /* If we can't identify the INSN as either being either
+		 for deletion or to re-map, then we don't know how to
+		 handle it, so fail the whole conversion.  */
+	      else
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  /* Instructions that dont's require the VPR can be carried
+	     over as-is.  */
+	  else if (DEBUG_INSN_P (insn))
+	    emit_debug_insn (PATTERN (insn));
+	  else
+	    emit_insn (PATTERN (insn));
+	}
+      seq = get_insns ();
+      end_sequence ();
+
+      /* Re-write the entire BB contents with the transformed
+	 sequence.  */
+      FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+	if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+	  delete_insn (insn);
+      for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+	if (NOTE_P (insn))
+	  emit_note_after ((enum insn_note) NOTE_KIND (insn),  BB_END (body));
+	else if (DEBUG_INSN_P (insn))
+	  emit_debug_insn_after (PATTERN (insn), BB_END (body));
+	else
+	  emit_insn_after (PATTERN (insn), BB_END (body));
+
+      emit_jump_insn_after (PATTERN (insn), BB_END (body));
+      return GEN_INT (decrementnum);
+    }
+  /* Bail out: we can't use dlstp/letp, so return 1 to allow loop-doloop to try
+     the standard dls/le pair.  */
+  return GEN_INT (1);
+}
+
+/* Target hook to the number of elements to be processed by a dlstp/letp loop
+   into `count` to intialise the counter register.  The number of elements was
+   previously extracted from the vctp insn and placed into a USE rtx.
+   We only check that the doloop_end pattern successfully decrements by a
+   number other than -1 for a valid dlstp/letp loop.  No other checking is
+   needed as that was done previously.  */
+
+rtx
+arm_allow_elementwise_doloop_p (rtx count, rtx label, rtx doloop)
+{
+  if (doloop
+      && INTVAL (XEXP (SET_SRC (XVECEXP (PATTERN (doloop), 0, 1)), 1)) != -1
+      && ARITHMETIC_P (XEXP (count, 0)))
+    {
+      basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+      rtx_insn* insn;
+      FOR_BB_INSNS (body, insn)
+	{
+	  if (INSN_P (insn) && GET_CODE (PATTERN (insn)) == USE)
+	    {
+	      rtx num_elem_reg = copy_rtx (XEXP (PATTERN (insn), 0));
+	      delete_insn (insn);
+	      return num_elem_reg;
+	    }
+	}
+    }
+  return count;
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 29062cd6fb3..aa2fdac22f3 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -1464,7 +1464,9 @@
 
 (define_int_attr mode1 [(VCTP8Q "8") (VCTP16Q "16") (VCTP32Q "32")
 			(VCTP64Q "64") (VCTP8Q_M "8") (VCTP16Q_M "16")
-			(VCTP32Q_M "32") (VCTP64Q_M "64")])
+			(VCTP32Q_M "32") (VCTP64Q_M "64")
+			(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
 
 ;; Both kinds of return insn.
 (define_code_iterator RETURNS [return simple_return])
@@ -1773,6 +1775,8 @@
 (define_int_iterator UQRSHLLQ [UQRSHLL_64 UQRSHLL_48])
 (define_int_iterator SQRSHRLQ [SQRSHRL_64 SQRSHRL_48])
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index b1c8c1c569f..3baddab5905 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -10837,3 +10837,38 @@
     }
   DONE;
 })
+
+;; Originally expanded by 'predicated_doloop_end'.
+(define_insn "*predicated_doloop_end_internal"
+  [(set (pc)
+	(if_then_else
+	   (ge (plus:SI (reg:SI LR_REGNUM)
+			(match_operand:SI 0 "const_int_operand" ""))
+		(const_int 0))
+	 (label_ref (match_operand 1 "" ""))
+	 (pc)))
+   (set (reg:SI LR_REGNUM)
+	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (clobber (reg:CC CC_REGNUM))]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  {
+    if (get_attr_length (insn) == 4)
+      return "letp\t%|lr, %l1";
+    else
+      return "subs\t%|lr, #%0;bgt\t%l1";
+  }
+  [(set (attr "length")
+	(if_then_else
+	   (ltu (minus (pc) (match_dup 1)) (const_int 1024))
+	    (const_int 4)
+	    (const_int 6)))
+   (set_attr "type" "branch")])
+
+(define_insn "dlstp<mode1>_insn"
+  [
+    (set (reg:SI LR_REGNUM)
+	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
+	  DLSTP))
+  ]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "dlstp.<mode1>\t%|lr, %0")
\ No newline at end of file
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index b2309a52165..a21e2909872 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1610,10 +1610,11 @@
 ;; knows what to generate.
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))      ; loop pseudo
-   (use (match_operand 1 "" ""))]     ; label
+   (use (match_operand 1 "" ""))      ; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1624,68 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands[1]))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatibe MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE && TARGET_THUMB2
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1],
+							       operands[2]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+
+	  /* Otherwise, try standard decrement-by-one dls/le looping.  */
+	  if (TARGET_THUMB2)
+	    insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							  GEN_INT (-1)));
+	  else
+	    insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	  cmp = XVECEXP (PATTERN (insn), 0, 0);
+	  cc_reg = SET_DEST (cmp);
+	  bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	  loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	  emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	  DONE;
+	}
+      else
+	FAIL;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1783,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 7748e784379..744e7ab5731 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -579,6 +579,10 @@
   VCTP16Q
   VCTP32Q
   VCTP64Q
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/config/bfin/bfin.md b/gcc/config/bfin/bfin.md
index 56b24726bc2..3b31ee3a1ee 100644
--- a/gcc/config/bfin/bfin.md
+++ b/gcc/config/bfin/bfin.md
@@ -1959,7 +1959,8 @@
 		   (plus:SI (match_dup 0)
 			    (const_int -1)))
 	      (unspec [(const_int 0)] UNSPEC_LSETUP_END)
-	      (clobber (match_dup 2))
+	      (clobber (match_dup 3))
+	      (match_operand 2 "" "")
 	      (clobber (reg:BI REG_CC))])] ; match_scratch
   ""
 {
@@ -1967,7 +1968,7 @@
   if (GET_MODE (operands[0]) != SImode)
     FAIL;
   bfin_hardware_loop ();
-  operands[2] = gen_rtx_SCRATCH (SImode);
+  operands[3] = gen_rtx_SCRATCH (SImode);
 })
 
 (define_insn "loop_end"
diff --git a/gcc/config/c6x/c6x.md b/gcc/config/c6x/c6x.md
index 60110410d0b..de4f4a56d99 100644
--- a/gcc/config/c6x/c6x.md
+++ b/gcc/config/c6x/c6x.md
@@ -1429,13 +1429,14 @@
 	      (set (match_dup 0)
 		   (plus:SI (match_dup 0)
 			    (const_int -1)))
-	      (clobber (match_dup 2))])] ; match_scratch
+	      (clobber (match_dup 3)) ; match_scratch
+	      (match_operand 2 "" "")])]
   "TARGET_INSNS_64PLUS && optimize"
 {
   /* The loop optimizer doesn't check the predicates... */
   if (GET_MODE (operands[0]) != SImode)
     FAIL;
-  operands[2] = gen_rtx_SCRATCH (SImode);
+  operands[3] = gen_rtx_SCRATCH (SImode);
 })
 
 (define_insn "mvilc"
diff --git a/gcc/config/ia64/ia64.md b/gcc/config/ia64/ia64.md
index 5d1d47da55b..5b24bd76ace 100644
--- a/gcc/config/ia64/ia64.md
+++ b/gcc/config/ia64/ia64.md
@@ -3956,7 +3956,8 @@
 
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))	; loop pseudo
-   (use (match_operand 1 "" ""))]	; label
+   (use (match_operand 1 "" ""))	; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
   ""
 {
   if (GET_MODE (operands[0]) != DImode)
diff --git a/gcc/config/pdp11/pdp11.md b/gcc/config/pdp11/pdp11.md
index a46efc1ef78..aa359955c3c 100644
--- a/gcc/config/pdp11/pdp11.md
+++ b/gcc/config/pdp11/pdp11.md
@@ -332,7 +332,8 @@
 		    (pc)))
 	      (set (match_dup 0)
 		   (plus:HI (match_dup 0)
-			 (const_int -1)))])]
+			 (const_int -1)))
+	      (match_operand 2 "" "")])]
   "TARGET_40_PLUS"
   "{
     if (GET_MODE (operands[0]) != HImode)
diff --git a/gcc/config/pru/pru.md b/gcc/config/pru/pru.md
index bdc5ad79ba0..990cdfe0974 100644
--- a/gcc/config/pru/pru.md
+++ b/gcc/config/pru/pru.md
@@ -1636,7 +1636,8 @@
 
 (define_expand "doloop_end"
   [(use (match_operand 0 "nonimmediate_operand"))
-   (use (label_ref (match_operand 1 "")))]
+   (use (label_ref (match_operand 1 "")))
+   (use (match_operand 2 "" ""))]     ; decrement constant
   "TARGET_OPT_LOOP"
 {
   if (GET_CODE (operands[0]) == REG && GET_MODE (operands[0]) == QImode)
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index ad5a4cf2ef8..fdaeb6a02ca 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -13422,7 +13422,8 @@
 
 (define_expand "doloop_end"
   [(use (match_operand 0))	; loop pseudo
-   (use (match_operand 1))]	; label
+   (use (match_operand 1))	; label
+   (use (match_operand 2 "" ""))]     ; decrement constant
   ""
 {
   if (GET_MODE (operands[0]) != Pmode)
diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index 962927c3112..bad317fdba7 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -9780,7 +9780,8 @@
 
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))        ; loop pseudo
-   (use (match_operand 1 "" ""))]       ; label
+   (use (match_operand 1 "" ""))        ; label
+   (use (match_operand 2 "" ""))]       ; decrement constant
   ""
 {
   if (GET_MODE (operands[0]) == SImode)
diff --git a/gcc/config/sh/sh.md b/gcc/config/sh/sh.md
index 59a7b216433..5b8d74d6029 100644
--- a/gcc/config/sh/sh.md
+++ b/gcc/config/sh/sh.md
@@ -6404,7 +6404,8 @@
 				 (pc)))
 	      (set (match_dup 0)
 		   (plus:SI (match_dup 0) (const_int -1)))
-	      (clobber (reg:SI T_REG))])]
+	      (clobber (reg:SI T_REG))
+	      (match_operand 2 "" "")])]
   "TARGET_SH2"
 {
   if (GET_MODE (operands[0]) != SImode)
diff --git a/gcc/config/v850/v850.md b/gcc/config/v850/v850.md
index 6ca31e3f43f..8d0812abac7 100644
--- a/gcc/config/v850/v850.md
+++ b/gcc/config/v850/v850.md
@@ -1434,7 +1434,8 @@
 
 (define_expand "doloop_end"
  [(use (match_operand 0 "" ""))        ; loop pseudo
-  (use (match_operand 1 "" ""))]       ; label
+  (use (match_operand 1 "" ""))        ; label
+  (use (match_operand 2 "" ""))]       ; decrement constant
   "TARGET_V850E3V5_UP && TARGET_LOOP"
   {
     rtx loop_cnt = operands[0];
diff --git a/gcc/config/xtensa/xtensa.md b/gcc/config/xtensa/xtensa.md
index 608110c20bc..57a4e2ca353 100644
--- a/gcc/config/xtensa/xtensa.md
+++ b/gcc/config/xtensa/xtensa.md
@@ -2016,13 +2016,14 @@
                    (plus:SI (match_dup 0)
                             (const_int -1)))
               (unspec [(const_int 0)] UNSPEC_LSETUP_END)
-              (clobber (match_dup 2))])] ; match_scratch
+              (clobber (match_dup 3)) ; match_scratch
+	      (match_operand 2 "" "")])]
   "TARGET_LOOPS && optimize"
 {
   /* The loop optimizer doesn't check the predicates... */
   if (GET_MODE (operands[0]) != SImode)
     FAIL;
-  operands[2] = gen_rtx_SCRATCH (SImode);
+  operands[3] = gen_rtx_SCRATCH (SImode);
 })
 
 \f
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 110f8dfa0a9..a8cfcd7497d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11774,6 +11774,14 @@ loops, and will help ivopts to make some decisions.
 The default version of this hook returns false.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_ALLOW_ELEMENTWISE_DOLOOP_P (rtx @var{count}, rtx @var{label}, rtx @var{doloop})
+This target hook allows the target to support loop-doloop optimisations
+where the value that gets put into the loop counter register is not a
+pre-calculation of the number of iteration of the loop.  For instance,
+the value used can be the number of elements that the loop will process.
+The default version of this hook returns the same rtx it was given.
+@end deftypefn
+
 @deftypevr {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P
 Return true if the target supports hardware count register for decrement
 and branch.
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 501ddf147e4..024711cfdb9 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7730,6 +7730,8 @@ to by @var{ce_info}.
 
 @hook TARGET_PREDICT_DOLOOP_P
 
+@hook TARGET_ALLOW_ELEMENTWISE_DOLOOP_P
+
 @hook TARGET_HAVE_COUNT_REG_DECR_P
 
 @hook TARGET_DOLOOP_COST_FOR_GENERIC
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 30b45c8071a..cbbc662f16b 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -n)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
      the loop counter register.  Some targets (IA-64) wrap the set of
      the loop counter in an if_then_else too.
 
-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+     2)  (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	{
 	  /* We expect the condition to be of the form (reg != 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 If n == 1, that is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which also for n == 1 is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
+	For the "elementwise" form where the decrement number isn't -1,
+	the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
+       }
 
     return condition;
    }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -720,7 +717,25 @@ doloop_optimize (class loop *loop)
   count = copy_rtx (desc->niter_expr);
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
-  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label,
+						 count);
+
+  /* Not all targets need to pre-calculate the number of the iterations of
+     the loop, they instead work by storing the number of elements in the
+     counter_reg and decrementing that.  Call the appropriate target hook to
+     change the value of count.  */
+  count = targetm.allow_elementwise_doloop_p (count, start_label, doloop_seq);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
@@ -737,7 +752,7 @@ doloop_optimize (class loop *loop)
       else
 	count = lowpart_subreg (word_mode, count, mode);
       PUT_MODE (doloop_reg, word_mode);
-      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label, count);
     }
   if (! doloop_seq)
     {
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index de8c0092f98..b77b7972426 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -48,7 +48,7 @@ DEF_TARGET_INSN (casesi, (rtx x0, rtx x1, rtx x2, rtx x3, rtx x4))
 DEF_TARGET_INSN (check_stack, (rtx x0))
 DEF_TARGET_INSN (clear_cache, (rtx x0, rtx x1))
 DEF_TARGET_INSN (doloop_begin, (rtx x0, rtx x1))
-DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1))
+DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (eh_return, (rtx x0))
 DEF_TARGET_INSN (epilogue, (void))
 DEF_TARGET_INSN (exception_receiver, (void))
diff --git a/gcc/target.def b/gcc/target.def
index a3d3b04a165..e74724c8a13 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4392,6 +4392,16 @@ The default version of this hook returns false.",
  bool, (class loop *loop),
  default_predict_doloop_p)
 
+DEFHOOK
+(allow_elementwise_doloop_p,
+ "This target hook allows the target to support loop-doloop optimisations\n\
+where the value that gets put into the loop counter register is not a\n\
+pre-calculation of the number of iteration of the loop.  For instance,\n\
+the value used can be the number of elements that the loop will process.\n\
+The default version of this hook returns the same rtx it was given.",
+ rtx, (rtx count, rtx label, rtx doloop),
+ default_allow_elementwise_doloop_p)
+
 DEFHOOKPOD
 (have_count_reg_decr_p,
  "Return true if the target supports hardware count register for decrement\n\
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index d17d393baed..04d08056dd7 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -661,6 +661,12 @@ default_predict_doloop_p (class loop *loop ATTRIBUTE_UNUSED)
   return false;
 }
 
+rtx
+default_allow_elementwise_doloop_p (rtx count, rtx, rtx)
+{
+  return count;
+}
+
 /* By default, just use the input MODE itself.  */
 
 machine_mode
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index ecce55ebe79..f9aba2e0813 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -88,6 +88,7 @@ extern bool default_fixed_point_supported_p (void);
 extern bool default_has_ifunc_p (void);
 
 extern bool default_predict_doloop_p (class loop *);
+extern rtx default_allow_elementwise_doloop_p (rtx, rtx, rtx);
 extern machine_mode default_preferred_doloop_mode (machine_mode);
 extern const char * default_invalid_within_doloop (const rtx_insn *);
 
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c
new file mode 100644
index 00000000000..a61f02ed3a2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard -mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c
new file mode 100644
index 00000000000..31a7264ae26
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard -mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c
new file mode 100644
index 00000000000..a09c01884da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard -mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (8, 0), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (8, 0), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c
new file mode 100644
index 00000000000..49fbd4c16a2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-skip-if "avoid conflicting multilib options" { *-*-* } { "-marm" "-mcpu=*" } } */
+/* { dg-options "-march=armv8.1-m.main+fp.dp+mve.fp -mfloat-abi=hard -mfpu=auto -O3 --save-temps" } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp\t} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp\t" } } */
+/* { dg-final { scan-assembler-not "\tvpst\t" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-12-18 11:53 ` [PATCH 2/2] " Andre Vieira
@ 2023-12-20 16:54   ` Andre Vieira (lists)
  0 siblings, 0 replies; 14+ messages in thread
From: Andre Vieira (lists) @ 2023-12-20 16:54 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard.Earnshaw, Stam Markianos-Wright

[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]

Squashed the definition and changes to predicated_doloop_end_internal 
and dlstp*_insn into this patch to make sure the first patch builds 
independently

On 18/12/2023 11:53, Andre Vieira wrote:
> 
> Reworked Stam's patch after comments in:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640362.html
> 
> The original gcc ChangeLog remains unchanged, but I did split up some tests so
> here is the testsuite ChangeLog.
> 
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/arm/lob.h: Update framework.
> 	* gcc.target/arm/lob1.c: Likewise.
> 	* gcc.target/arm/lob6.c: Likewise.
> 	* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
> 	* gcc.target/arm/mve/dlstp-int16x8.c: New test.
> 	* gcc.target/arm/mve/dlstp-int16x8-run.c: New test.
> 	* gcc.target/arm/mve/dlstp-int32x4.c: New test.
> 	* gcc.target/arm/mve/dlstp-int32x4-run.c: New test.
> 	* gcc.target/arm/mve/dlstp-int64x2.c: New test.
> 	* gcc.target/arm/mve/dlstp-int64x2-run.c: New test.
> 	* gcc.target/arm/mve/dlstp-int8x16.c: New test.
> 	* gcc.target/arm/mve/dlstp-int8x16-run.c: New test.
> 	* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.
> 

[-- Attachment #2: 0002-arm-Add-support-for-MVE-Tail-Predicated-Low-Overhead_v2.patch --]
[-- Type: text/plain, Size: 107215 bytes --]

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8ddd647b212782a0454ee4fefc07257..4f164c547406c43219900c111401540c7ef9d7d1 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 0c0cb14a8a4f043357b8acd7042a9f9386af1eb1..1ee72bcb7ec4bea5feea8453ceef7702b0088a73 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1147 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility fuction: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+      && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+      && (XINT (SET_SRC (insn_set), 1) == VCTP
+	  || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (SET_SRC (insn_set));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (insn) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give one only scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  int n_operands = recog_data.n_operands;
+  extract_insn (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+      return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influnce the result of the operation.
+   Any new across-vector instructions to the MVE ISA will have to assessed for
+   inclusion to this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_set = single_set (insn);
+  if (!insn_set)
+    return false;
+  rtx unspec = SET_SRC (insn_set);
+  if (GET_CODE (unspec) != UNSPEC)
+    return false;
+  switch (XINT (unspec, 1))
+    {
+    case VADDVQ_U:
+    case VADDVQ_S:
+    case VADDVAQ_U:
+    case VADDVAQ_S:
+    case VMLADAVQ_U:
+    case VMLADAVQ_S:
+    case VMLADAVXQ_S:
+    case VMLADAVAQ_U:
+    case VMLADAVAQ_S:
+    case VMLADAVAXQ_S:
+    case VABAVQ_S:
+    case VABAVQ_U:
+    case VADDLVQ_S:
+    case VADDLVQ_U:
+    case VADDLVAQ_S:
+    case VADDLVAQ_U:
+    case VMAXVQ_U:
+    case VMAXAVQ_S:
+    case VMLALDAVQ_U:
+    case VMLALDAVXQ_U:
+    case VMLALDAVXQ_S:
+    case VMLALDAVQ_S:
+    case VMLALDAVAQ_S:
+    case VMLALDAVAQ_U:
+    case VMLALDAVAXQ_S:
+    case VMLALDAVAXQ_U:
+    case VMLSDAVQ_S:
+    case VMLSDAVXQ_S:
+    case VMLSDAVAXQ_S:
+    case VMLSDAVAQ_S:
+    case VMLSLDAVQ_S:
+    case VMLSLDAVXQ_S:
+    case VMLSLDAVAQ_S:
+    case VMLSLDAVAXQ_S:
+    case VRMLALDAVHXQ_S:
+    case VRMLALDAVHQ_U:
+    case VRMLALDAVHQ_S:
+    case VRMLALDAVHAQ_S:
+    case VRMLALDAVHAQ_U:
+    case VRMLALDAVHAXQ_S:
+    case VRMLSLDAVHQ_S:
+    case VRMLSLDAVHXQ_S:
+    case VRMLSLDAVHAQ_S:
+    case VRMLSLDAVHAXQ_S:
+      return true;
+    default:
+      break;
+    }
+  return false;
+}
+
+/* Scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where thy were DEF-ed, etc.) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map <rtx_insn *, bool> *safe_insn_map, rtx_insn *insn_in,
+   rtx vctp_vpr_generated)
+{
+
+  auto_vec<rtx_insn *> worklist;
+  worklist.safe_push (insn_in);
+
+  bool *temp = NULL;
+
+  while (worklist.length () > 0)
+    {
+      rtx_insn *insn = worklist.pop ();
+
+      if ((temp = safe_insn_map->get (insn)))
+	return *temp;
+
+      basic_block body = BLOCK_FOR_INSN (insn);
+
+      /* The circumstances under which an instruction is affected by "implicit
+	 predication" are as follows:
+	  * It is an UNPREDICATED_INSN_P:
+	    * That loads/stores from/to memory.
+	    * Where any one of its operands is an MVE vector from outside the
+	      loop body bb.
+	 Or:
+	  * Any of it's operands were affected earlier in the insn chain.  */
+      if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	  && (arm_is_mve_load_store_insn (insn)
+	      || (arm_is_mve_across_vector_insn (insn)
+		  && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+	{
+	  safe_insn_map->put (insn, true);
+	  return true;
+	}
+
+      df_ref insn_uses = NULL;
+      FOR_EACH_INSN_USE (insn_uses, insn)
+      {
+	/* If the operand is in the input reg set to the the basic block,
+	   (i.e. it has come from outside the loop!), consider it unsafe if:
+	     * It's being used in an unpredicated insn.
+	     * It is a predicable MVE vector.  */
+	if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	    && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	    && REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+	  {
+	    safe_insn_map->put (insn, true);
+	    return true;
+	  }
+
+	/* Scan backwards from the current INSN through the instruction chain
+	   until the start of the basic block.  */
+	for (rtx_insn *prev_insn = PREV_INSN (insn);
+	     prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	     prev_insn = PREV_INSN (prev_insn))
+	  {
+	    /* If a previous insn defines a register that INSN uses, then
+	       add to the worklist to check that insn's USEs.  If any of these
+	       insns return true as MVE_VPT_UNPREDICATED_INSN_Ps, then the
+	       whole chain is affected by the change in behaviour from being
+	       placed in dlstp/letp loop.  */
+	    df_ref prev_insn_defs = NULL;
+	    FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	    {
+	      if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+		  && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		       (insn, vctp_vpr_generated))
+		worklist.safe_push (prev_insn);
+	    }
+	  }
+      }
+    }
+  safe_insn_map->put (insn_in, false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.*/
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is an differently-predicated store to memory or it is any
+		 across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  if (!counter_max_last_def)
+    return false;
+  rtx counter_max_last_set = single_set (DF_REF_INSN (counter_max_last_def));
+  if (!counter_max_last_set)
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (REG_P (SET_SRC (counter_max_last_set)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, SET_SRC (counter_max_last_set), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (SET_SRC (counter_max_last_set)) == IF_THEN_ELSE
+      && REG_P (XEXP (SET_SRC (counter_max_last_set), 1))
+      && CONST_INT_P (XEXP (SET_SRC (counter_max_last_set), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (SET_SRC (counter_max_last_set), 1), vctp_step);
+
+  if (GET_CODE (SET_SRC (counter_max_last_set)) == ASHIFTRT
+      && CONST_INT_P (XEXP (SET_SRC (counter_max_last_set), 1))
+      && ((1 << INTVAL (XEXP (SET_SRC (counter_max_last_set), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  if (!empty_block_p (body->loop_father->latch))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg from the iv.  This decrementnum
+     is the number of lanes/elements it decrements from the remaining number of
+     lanes/elements to process in the loop, for this reason this is always a
+     negative number, but to simplify later checks we use it's absolute value.  */
+  int decrementnum = INTVAL (vctp_reg_iv.step);
+  if (decrementnum >= 0)
+    return NULL;
+  decrementnum = abs (decrementnum);
+
+  /* Find where both of those are modified in the loop body bb.  */
+  df_ref condcount_reg_set_df = df_bb_regno_only_def_find (body, REGNO (condcount));
+  df_ref vctp_reg_set_df = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+  if (!condcount_reg_set_df || !vctp_reg_set_df)
+    return NULL;
+  rtx condcount_reg_set = single_set (DF_REF_INSN (condcount_reg_set_df));
+  rtx vctp_reg_set = single_set (DF_REF_INSN (vctp_reg_set_df));
+  if (!condcount_reg_set || !vctp_reg_set)
+    return NULL;
+
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (GET_CODE (SET_SRC (vctp_reg_set)) != PLUS
+      || !REG_P (SET_DEST (vctp_reg_set))
+      || !REG_P (XEXP (SET_SRC (vctp_reg_set), 0))
+      || REGNO (SET_DEST (vctp_reg_set))
+	  != REGNO (XEXP (SET_SRC (vctp_reg_set), 0))
+      || !CONST_INT_P (XEXP (SET_SRC (vctp_reg_set), 1))
+      || INTVAL (XEXP (SET_SRC (vctp_reg_set), 1)) >= 0
+      || decrementnum != abs (INTVAL (XEXP (SET_SRC (vctp_reg_set), 1)))
+      || decrementnum != arm_mve_get_vctp_lanes (vctp_insn))
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (SET_SRC (condcount_reg_set)) == PLUS
+	  && CONST_INT_P (XEXP (SET_SRC (condcount_reg_set), 1))
+	  && INTVAL (XEXP (SET_SRC (condcount_reg_set), 1)) == 1)
+	{
+	    rtx counter_reg = SET_DEST (condcount_reg_set);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set_def = DF_REF_NEXT_REG (this_set);
+	    if (!last_set_def)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set_def);
+	    rtx last_set = single_set (last_set_insn);
+	    if (!last_set)
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = SET_SRC (last_set);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst, is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  rtx dec_set = single_set (dec_insn);
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (!dec_set
+      || !REG_P (SET_DEST (dec_set))
+      || !REG_P (XEXP (SET_SRC (dec_set), 0))
+      || !CONST_INT_P (XEXP (SET_SRC (dec_set), 1))
+      || REGNO (SET_DEST (dec_set))
+	  != REGNO (XEXP (SET_SRC (dec_set), 0))
+      || REGNO (SET_DEST (dec_set)) != REGNO (condcount))
+    return NULL;
+
+  decrementnum = INTVAL (XEXP (SET_SRC (dec_set), 1));
+
+  /* This decrementnum is the number of lanes/elements it decrements from the
+     remaining number of lanes/elements to process in the loop, for this reason
+     this is always a negative number, but to simplify later checks we use it's
+     absolute value.  */
+  if (decrementnum >= 0)
+    return NULL;
+  decrementnum = abs (decrementnum);
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (vctp_insn))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case makse sure that there is a only simple SET from one to the
+     other inside the loop.*/
+  else if (decrementnum == arm_mve_get_vctp_lanes (vctp_insn))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behild a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the VCTP insn is DEF'd inside the loop.  */
+	  rtx_insn *vctp_reg_insn
+	    = DF_REF_INSN (df_bb_regno_only_def_find (body, REGNO (vctp_reg)));
+	  rtx vctp_reg_set = single_set (vctp_reg_insn);
+	  /* This must just be a simple SET from the condcount.  */
+	  if (!vctp_reg_set
+	      || !REG_P (SET_DEST (vctp_reg_set))
+	      || !REG_P (SET_SRC (vctp_reg_set))
+	      || REGNO (SET_SRC (vctp_reg_set)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If its a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = single_set (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare
+	    && cc_register (SET_DEST (initial_compare), VOIDmode)
+	    && GET_CODE (SET_SRC (initial_compare)) == COMPARE
+	    && CONST_INT_P (XEXP (SET_SRC (initial_compare), 1))
+	    && INTVAL (XEXP (SET_SRC (initial_compare), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows its not negative)  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we need will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  It could be argued that turning large enough loops
+   into low-overhead loops would not show a signficant performance boost.
+   Howeer, in the case of tail predication we would still avoid using VPT/VPST
+   instructions inside the loop, and in either case using low-overhead loops
+   would not be detrimental, so we decided to not consider size, avoiding the
+   need of a heuristic to determine what an appropriate size boundary is.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     it is valid and as per arm_target_bb_ok_for_lob and the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to hadle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instrunctions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the upredicated instruction.  */
+  rtx_insn *new_insn;
+  switch (j)
+    {
+      case 1:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+						   arr[3]));
+	break;
+      case 5:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+						   arr[3], arr[4]));
+	break;
+      case 6:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+						   arr[3], arr[4], arr[5]));
+	break;
+      case 7:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+						   arr[3], arr[4], arr[5],
+						   arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  INSN_LOCATION (new_insn) = INSN_LOCATION (insn);
+  return true;
+}
+
+/* When a vctp insn is used, its out is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  rtx next_use1_set = single_set (next_use1);
+  if (next_use1_set
+      && GET_CODE (SET_SRC (next_use1_set)) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      rtx next_use2_set = single_set (next_use2);
+      if (next_use2_set
+	  && GET_CODE (SET_SRC (next_use2_set)) == SUBREG)
+	vctp_vpr_generated = SET_DEST (next_use2_set);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  Returns
+   rtx constant value to decrement from the total number of elements.  Return
+   (const_int 1) if we can't use tail predication and fallback to scalar
+   low-overhead loops.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return const1_rtx;
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return const1_rtx;
+
+  gcc_assert (single_set (vctp_insn));
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return const1_rtx;
+
+  /* decrementunum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (vctp_insn);
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map <rtx_insn *, bool> *safe_insn_map
+      = new hash_map <rtx_insn *, bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return const1_rtx;
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return const1_rtx;
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return const1_rtx;
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+	       (safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return const1_rtx;
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+
+  emit_insn_after (seq, BB_END (body));
+
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr
+    = XVECEXP (SET_SRC (PATTERN (vctp_insn)), 0, 0);
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 5ea2d9e866891bdb3dc73fcf6cbd6cdd2f989951..9398702cddd076a7eacf1ca6eac6c5a1fbd9a3d0 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2673,6 +2673,17 @@ (define_int_iterator MRRCI [VUNSPEC_MRRC VUNSPEC_MRRC2])
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+				 (DLSTP64 "64")])
+
+(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
+				 (LETP64 "2")])
+(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") (LETP32 "-4")
+				     (LETP64 "-2")])
+
+(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") (LETP32 "3")
+					 (LETP64 "1")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2916,6 +2927,10 @@ (define_int_iterator SQRSHRLQ [SQRSHRL_64 SQRSHRL_48])
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
+(define_int_iterator LETP [LETP8 LETP16 LETP32
+			   LETP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index b1862d7977e91605cd971e634105bed3fa6e75cb..5748e2333eb3a88d659892f2bcc72849bcf388b5 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6918,3 +6918,41 @@ (define_expand "@arm_mve_reinterpret<mode>"
       }
   }
 )
+
+;; Originally expanded by 'predicated_doloop_end'.
+;; In the rare situation where the branch is too far, we do also need to
+;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
+(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
+  [(set (pc)
+	(if_then_else
+	   (gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" "=r")
+				     (const_int <letp_num_lanes_neg>))]
+		LETP)
+		(const_int <letp_num_lanes_minus_1>))
+	 (match_operand 1 "" "")
+	 (pc)))
+   (set (match_dup 0)
+	(plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
+   (clobber (reg:CC CC_REGNUM))]
+  "TARGET_HAVE_MVE"
+  {
+    if (get_attr_length (insn) == 4)
+      return "letp\t%|lr, %l1";
+    else
+      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
+  }
+  [(set (attr "length")
+	(if_then_else
+	   (ltu (minus (pc) (match_dup 1)) (const_int 1024))
+	    (const_int 4)
+	    (const_int 6)))
+   (set_attr "type" "branch")])
+
+(define_insn "dlstp<dlstp_elemsize>_insn"
+  [
+    (set (reg:SI LR_REGNUM)
+	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
+	  DLSTP))
+  ]
+  "TARGET_HAVE_MVE"
+  "dlstp.<dlstp_elemsize>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7a67ddbf517bf22797bdaeeb96b94f..f2801cea36a34d326fd6f3a213e0e149c3e0784f 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@ (define_expand "doloop_end"
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,77 @@ (define_expand "doloop_end"
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatible MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      switch (INTVAL (decrement_num))
+		{
+		  case 2:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal2
+					    (s0, loc_ref));
+		    break;
+		  case 4:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal4
+					    (s0, loc_ref));
+		    break;
+		  case 8:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal8
+					    (s0, loc_ref));
+		    break;
+		  case 16:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal16
+					    (s0, loc_ref));
+		    break;
+		  default:
+		    gcc_unreachable ();
+		}
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1791,37 @@ (define_expand "doloop_begin"
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 4713ec840abae48ca70f418dbc0d4028ad4ad527..2d6f27c14f4a1e7db05b9684a8958a76a1c79ef2 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -583,6 +583,14 @@ (define_c_enum "unspec" [
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
+  LETP8
+  LETP16
+  LETP32
+  LETP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7cb97ea1606082e26e910472da5bcc1..4fcc14bf790d43e792b3c926fe1f80073d908c17 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076f1bcad24e9c50682e033e57f432f9..98623637f9c839c799222e99df2a7173a770b2ac 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9331b7124df900f73c9fc6fb3eb10b..d919207505c472c8a54a2c9c982a09061584177b 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,34 @@ doloop_condition_get (rtx_insn *doloop_pat)
      the loop counter in an if_then_else too.
 
      2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
-
+     3) (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		   (set (reg) (plus (reg) (const_int -1)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
+
+      The ARM target also supports a special case of a counter that decrements
+      by `n` and terminating in a GTU condition.  In that case, the compare and
+      branch are all part of one insn, containing an UNSPEC:
+
+      4) (parallel [
+	    (set (pc)
+		(if_then_else (gtu (unspec:SI [(plus:SI (reg:SI 14 lr)
+							(const_int -n))])
+				   (const_int n-1]))
+		    (label_ref)
+		    (pc)))
+	    (set (reg:SI 14 lr)
+		 (plus:SI (reg:SI 14 lr)
+			  (const_int -n)))
+     */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -143,7 +158,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1)
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -173,15 +188,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1))
+      || INTVAL (XEXP (inc_src, 1)) >= 0)
     return 0;
+  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
 
   /* Check for (set (pc) (if_then_else (condition)
                                        (label_ref (label))
@@ -196,60 +213,71 @@ doloop_condition_get (rtx_insn *doloop_pat)
   /* Extract loop termination condition.  */
   condition = XEXP (SET_SRC (cmp), 0);
 
-  /* We expect a GE or NE comparison with 0 or 1.  */
-  if ((GET_CODE (condition) != GE
-       && GET_CODE (condition) != NE)
-      || (XEXP (condition, 1) != const0_rtx
-          && XEXP (condition, 1) != const1_rtx))
+  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
+     dec_num - 1.  */
+  if (!((GET_CODE (condition) == GE
+	 || GET_CODE (condition) == NE)
+	&& (XEXP (condition, 1) == const0_rtx
+	    || XEXP (condition, 1) == const1_rtx ))
+      &&!(GET_CODE (condition) == GTU
+	  && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
     return 0;
 
-  if ((XEXP (condition, 0) == reg)
+  /* For the ARM special case of having a GTU: re-form the condition without
+     the unspec for the benefit of the middle-end.  */
+  if (GET_CODE (condition) == GTU)
+    {
+      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src,
+				  GEN_INT (dec_num - 1));
+      return condition;
+    }
+  else if ((XEXP (condition, 0) == reg)
       /* For the third case:  */  
       || ((cc_reg != NULL_RTX)
 	  && (XEXP (condition, 0) == cc_reg)
 	  && (reg_orig == reg))
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
-   {
+    {
      if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -1))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
+		   (set (reg) (plus (reg) (const_int -1)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) 
 
-        which is equivalent to the following:
+	which is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+	condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
 
     return condition;
-   }
+    }
 
   /* ??? If a machine uses a funny comparison, we could return a
      canonicalized form here.  */
@@ -507,6 +535,11 @@ doloop_modify (class loop *loop, class niter_desc *desc,
 	nonneg = 1;
       break;
 
+    case GTU:
+      /* The iteration count does not need incrementing for a GTU test.  */
+      increment_count = false;
+      break;
+
       /* Abort if an invalid doloop pattern has been generated.  */
     default:
       gcc_unreachable ();
@@ -529,6 +562,10 @@ doloop_modify (class loop *loop, class niter_desc *desc,
 
   if (desc->noloop_assumptions)
     {
+      /* The GTU case has only been implemented for the ARM target, where
+	 noloop_assumptions gets explicitly set to NULL for that case, so
+	 assert here for safety.  */
+      gcc_assert (GET_CODE (condition) != GTU);
       rtx ass = copy_rtx (desc->noloop_assumptions);
       basic_block preheader = loop_preheader_edge (loop)->src;
       basic_block set_zero = split_edge (loop_preheader_edge (loop));
@@ -642,7 +679,7 @@ doloop_optimize (class loop *loop)
 {
   scalar_int_mode mode;
   rtx doloop_reg;
-  rtx count;
+  rtx count = NULL_RTX;
   widest_int iterations, iterations_max;
   rtx_code_label *start_label;
   rtx condition;
@@ -685,17 +722,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,12 +742,25 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
 
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
+
+  count = copy_rtx (desc->niter_expr);
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
   if (! doloop_seq
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc89959b3147368980120700bbc3e85ecb..3941fe7a8b620e62a5f742722be1ba2d031f5a8d 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c582c96a18ad417a3041e43d843613..c8ce653a5c39fb1ffcf82a6e584d9a0467a130c0 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e8ae9e1cb57e41fa43a954b3390eec..4fe116e2c2be3748d1bb6da7bb9092db8f962abc 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 0000000000000000000000000000000000000000..5ddd994e53d55c7b4d05bfb858e6078ce7da4ce4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for loop format of decrementing to zero */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time the p1 will also change in every loop (still fine)  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..6966a3966046fce59bdabda639c048ed398cac20
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
@@ -0,0 +1,44 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+#include "dlstp-int16x8.c"
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 0000000000000000000000000000000000000000..33632c5f14dc6603d56934dfdd0072a980fbd01e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,31 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4-run.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..6833dddde92b7cf16a18d42c003ee5bd2b9da847
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4-run.c
@@ -0,0 +1,45 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include "dlstp-int32x4.c"
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 0000000000000000000000000000000000000000..5d09f784b7716c14e56086b7e66eb12b31772a45
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,31 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2-run.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..cc0b9ce7ee9a5a8400b18f539ff96b8e675414cb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2-run.c
@@ -0,0 +1,48 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include "dlstp-int64x2.c"
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 0000000000000000000000000000000000000000..21e882424ec3ba4e7a141eadb0f4e593146e81ad
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,28 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 0000000000000000000000000000000000000000..8ea181c82d45a008d60a66c1f9e9b289c5f05611
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 0000000000000000000000000000000000000000..f7c3e04f8831e6b6eb709c8f3b0a0a896313ca64
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,391 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <limits.h>
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* Using an unsigned number of elements to count down from, with a >0*/
+void test23 (int32_t *a, int32_t *b, int32_t *c, unsigned int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Using an unsigned number of elements to count up to, with a <n*/
+void test24 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 0; i < n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+
+/* Using an unsigned number of elements to count up to, with a <=n*/
+void test25 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 1; i <= n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i+1);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */
\ No newline at end of file

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-12-18 11:53 [PATCH 0/2] " Andre Vieira
@ 2023-12-18 11:53 ` Andre Vieira
  2023-12-20 16:54   ` Andre Vieira (lists)
  0 siblings, 1 reply; 14+ messages in thread
From: Andre Vieira @ 2023-12-18 11:53 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard.Earnshaw, Stam Markianos-Wright

[-- Attachment #1: Type: text/plain, Size: 44 bytes --]

This is a multi-part message in MIME format.

[-- Attachment #2: Type: text/plain, Size: 881 bytes --]


Reworked Stam's patch after comments in:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640362.html

The original gcc ChangeLog remains unchanged, but I did split up some tests so
here is the testsuite ChangeLog.


gcc/testsuite/ChangeLog:

	* gcc.target/arm/lob.h: Update framework.
	* gcc.target/arm/lob1.c: Likewise.
	* gcc.target/arm/lob6.c: Likewise.
	* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
	* gcc.target/arm/mve/dlstp-int16x8.c: New test.
	* gcc.target/arm/mve/dlstp-int16x8-run.c: New test.
	* gcc.target/arm/mve/dlstp-int32x4.c: New test.
	* gcc.target/arm/mve/dlstp-int32x4-run.c: New test.
	* gcc.target/arm/mve/dlstp-int64x2.c: New test.
	* gcc.target/arm/mve/dlstp-int64x2-run.c: New test.
	* gcc.target/arm/mve/dlstp-int8x16.c: New test.
	* gcc.target/arm/mve/dlstp-int8x16-run.c: New test.
	* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-arm-Add-support-for-MVE-Tail-Predicated-Low-Overhead.patch --]
[-- Type: text/x-patch; name="0002-arm-Add-support-for-MVE-Tail-Predicated-Low-Overhead.patch", Size: 107172 bytes --]

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8d..4f164c54740 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 0c0cb14a8a4..1ee72bcb7ec 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1147 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility fuction: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+      && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+      && (XINT (SET_SRC (insn_set), 1) == VCTP
+	  || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (SET_SRC (insn_set));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (insn) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give one only scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  int n_operands = recog_data.n_operands;
+  extract_insn (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+      return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influnce the result of the operation.
+   Any new across-vector instructions to the MVE ISA will have to assessed for
+   inclusion to this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_set = single_set (insn);
+  if (!insn_set)
+    return false;
+  rtx unspec = SET_SRC (insn_set);
+  if (GET_CODE (unspec) != UNSPEC)
+    return false;
+  switch (XINT (unspec, 1))
+    {
+    case VADDVQ_U:
+    case VADDVQ_S:
+    case VADDVAQ_U:
+    case VADDVAQ_S:
+    case VMLADAVQ_U:
+    case VMLADAVQ_S:
+    case VMLADAVXQ_S:
+    case VMLADAVAQ_U:
+    case VMLADAVAQ_S:
+    case VMLADAVAXQ_S:
+    case VABAVQ_S:
+    case VABAVQ_U:
+    case VADDLVQ_S:
+    case VADDLVQ_U:
+    case VADDLVAQ_S:
+    case VADDLVAQ_U:
+    case VMAXVQ_U:
+    case VMAXAVQ_S:
+    case VMLALDAVQ_U:
+    case VMLALDAVXQ_U:
+    case VMLALDAVXQ_S:
+    case VMLALDAVQ_S:
+    case VMLALDAVAQ_S:
+    case VMLALDAVAQ_U:
+    case VMLALDAVAXQ_S:
+    case VMLALDAVAXQ_U:
+    case VMLSDAVQ_S:
+    case VMLSDAVXQ_S:
+    case VMLSDAVAXQ_S:
+    case VMLSDAVAQ_S:
+    case VMLSLDAVQ_S:
+    case VMLSLDAVXQ_S:
+    case VMLSLDAVAQ_S:
+    case VMLSLDAVAXQ_S:
+    case VRMLALDAVHXQ_S:
+    case VRMLALDAVHQ_U:
+    case VRMLALDAVHQ_S:
+    case VRMLALDAVHAQ_S:
+    case VRMLALDAVHAQ_U:
+    case VRMLALDAVHAXQ_S:
+    case VRMLSLDAVHQ_S:
+    case VRMLSLDAVHXQ_S:
+    case VRMLSLDAVHAQ_S:
+    case VRMLSLDAVHAXQ_S:
+      return true;
+    default:
+      break;
+    }
+  return false;
+}
+
+/* Scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where thy were DEF-ed, etc.) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map <rtx_insn *, bool> *safe_insn_map, rtx_insn *insn_in,
+   rtx vctp_vpr_generated)
+{
+
+  auto_vec<rtx_insn *> worklist;
+  worklist.safe_push (insn_in);
+
+  bool *temp = NULL;
+
+  while (worklist.length () > 0)
+    {
+      rtx_insn *insn = worklist.pop ();
+
+      if ((temp = safe_insn_map->get (insn)))
+	return *temp;
+
+      basic_block body = BLOCK_FOR_INSN (insn);
+
+      /* The circumstances under which an instruction is affected by "implicit
+	 predication" are as follows:
+	  * It is an UNPREDICATED_INSN_P:
+	    * That loads/stores from/to memory.
+	    * Where any one of its operands is an MVE vector from outside the
+	      loop body bb.
+	 Or:
+	  * Any of it's operands were affected earlier in the insn chain.  */
+      if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	  && (arm_is_mve_load_store_insn (insn)
+	      || (arm_is_mve_across_vector_insn (insn)
+		  && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+	{
+	  safe_insn_map->put (insn, true);
+	  return true;
+	}
+
+      df_ref insn_uses = NULL;
+      FOR_EACH_INSN_USE (insn_uses, insn)
+      {
+	/* If the operand is in the input reg set to the the basic block,
+	   (i.e. it has come from outside the loop!), consider it unsafe if:
+	     * It's being used in an unpredicated insn.
+	     * It is a predicable MVE vector.  */
+	if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	    && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	    && REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+	  {
+	    safe_insn_map->put (insn, true);
+	    return true;
+	  }
+
+	/* Scan backwards from the current INSN through the instruction chain
+	   until the start of the basic block.  */
+	for (rtx_insn *prev_insn = PREV_INSN (insn);
+	     prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	     prev_insn = PREV_INSN (prev_insn))
+	  {
+	    /* If a previous insn defines a register that INSN uses, then
+	       add to the worklist to check that insn's USEs.  If any of these
+	       insns return true as MVE_VPT_UNPREDICATED_INSN_Ps, then the
+	       whole chain is affected by the change in behaviour from being
+	       placed in dlstp/letp loop.  */
+	    df_ref prev_insn_defs = NULL;
+	    FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	    {
+	      if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+		  && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		       (insn, vctp_vpr_generated))
+		worklist.safe_push (prev_insn);
+	    }
+	  }
+      }
+    }
+  safe_insn_map->put (insn_in, false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.*/
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is an differently-predicated store to memory or it is any
+		 across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  if (!counter_max_last_def)
+    return false;
+  rtx counter_max_last_set = single_set (DF_REF_INSN (counter_max_last_def));
+  if (!counter_max_last_set)
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (REG_P (SET_SRC (counter_max_last_set)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, SET_SRC (counter_max_last_set), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (SET_SRC (counter_max_last_set)) == IF_THEN_ELSE
+      && REG_P (XEXP (SET_SRC (counter_max_last_set), 1))
+      && CONST_INT_P (XEXP (SET_SRC (counter_max_last_set), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (SET_SRC (counter_max_last_set), 1), vctp_step);
+
+  if (GET_CODE (SET_SRC (counter_max_last_set)) == ASHIFTRT
+      && CONST_INT_P (XEXP (SET_SRC (counter_max_last_set), 1))
+      && ((1 << INTVAL (XEXP (SET_SRC (counter_max_last_set), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  if (!empty_block_p (body->loop_father->latch))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg from the iv.  This decrementnum
+     is the number of lanes/elements it decrements from the remaining number of
+     lanes/elements to process in the loop, for this reason this is always a
+     negative number, but to simplify later checks we use it's absolute value.  */
+  int decrementnum = INTVAL (vctp_reg_iv.step);
+  if (decrementnum >= 0)
+    return NULL;
+  decrementnum = abs (decrementnum);
+
+  /* Find where both of those are modified in the loop body bb.  */
+  df_ref condcount_reg_set_df = df_bb_regno_only_def_find (body, REGNO (condcount));
+  df_ref vctp_reg_set_df = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+  if (!condcount_reg_set_df || !vctp_reg_set_df)
+    return NULL;
+  rtx condcount_reg_set = single_set (DF_REF_INSN (condcount_reg_set_df));
+  rtx vctp_reg_set = single_set (DF_REF_INSN (vctp_reg_set_df));
+  if (!condcount_reg_set || !vctp_reg_set)
+    return NULL;
+
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (GET_CODE (SET_SRC (vctp_reg_set)) != PLUS
+      || !REG_P (SET_DEST (vctp_reg_set))
+      || !REG_P (XEXP (SET_SRC (vctp_reg_set), 0))
+      || REGNO (SET_DEST (vctp_reg_set))
+	  != REGNO (XEXP (SET_SRC (vctp_reg_set), 0))
+      || !CONST_INT_P (XEXP (SET_SRC (vctp_reg_set), 1))
+      || INTVAL (XEXP (SET_SRC (vctp_reg_set), 1)) >= 0
+      || decrementnum != abs (INTVAL (XEXP (SET_SRC (vctp_reg_set), 1)))
+      || decrementnum != arm_mve_get_vctp_lanes (vctp_insn))
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (SET_SRC (condcount_reg_set)) == PLUS
+	  && CONST_INT_P (XEXP (SET_SRC (condcount_reg_set), 1))
+	  && INTVAL (XEXP (SET_SRC (condcount_reg_set), 1)) == 1)
+	{
+	    rtx counter_reg = SET_DEST (condcount_reg_set);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set_def = DF_REF_NEXT_REG (this_set);
+	    if (!last_set_def)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set_def);
+	    rtx last_set = single_set (last_set_insn);
+	    if (!last_set)
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = SET_SRC (last_set);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst, is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  rtx dec_set = single_set (dec_insn);
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (!dec_set
+      || !REG_P (SET_DEST (dec_set))
+      || !REG_P (XEXP (SET_SRC (dec_set), 0))
+      || !CONST_INT_P (XEXP (SET_SRC (dec_set), 1))
+      || REGNO (SET_DEST (dec_set))
+	  != REGNO (XEXP (SET_SRC (dec_set), 0))
+      || REGNO (SET_DEST (dec_set)) != REGNO (condcount))
+    return NULL;
+
+  decrementnum = INTVAL (XEXP (SET_SRC (dec_set), 1));
+
+  /* This decrementnum is the number of lanes/elements it decrements from the
+     remaining number of lanes/elements to process in the loop, for this reason
+     this is always a negative number, but to simplify later checks we use it's
+     absolute value.  */
+  if (decrementnum >= 0)
+    return NULL;
+  decrementnum = abs (decrementnum);
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (vctp_insn))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case makse sure that there is a only simple SET from one to the
+     other inside the loop.*/
+  else if (decrementnum == arm_mve_get_vctp_lanes (vctp_insn))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behild a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the VCTP insn is DEF'd inside the loop.  */
+	  rtx_insn *vctp_reg_insn
+	    = DF_REF_INSN (df_bb_regno_only_def_find (body, REGNO (vctp_reg)));
+	  rtx vctp_reg_set = single_set (vctp_reg_insn);
+	  /* This must just be a simple SET from the condcount.  */
+	  if (!vctp_reg_set
+	      || !REG_P (SET_DEST (vctp_reg_set))
+	      || !REG_P (SET_SRC (vctp_reg_set))
+	      || REGNO (SET_SRC (vctp_reg_set)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If its a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = single_set (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare
+	    && cc_register (SET_DEST (initial_compare), VOIDmode)
+	    && GET_CODE (SET_SRC (initial_compare)) == COMPARE
+	    && CONST_INT_P (XEXP (SET_SRC (initial_compare), 1))
+	    && INTVAL (XEXP (SET_SRC (initial_compare), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows its not negative)  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we need will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  It could be argued that turning large enough loops
+   into low-overhead loops would not show a signficant performance boost.
+   Howeer, in the case of tail predication we would still avoid using VPT/VPST
+   instructions inside the loop, and in either case using low-overhead loops
+   would not be detrimental, so we decided to not consider size, avoiding the
+   need of a heuristic to determine what an appropriate size boundary is.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     it is valid and as per arm_target_bb_ok_for_lob and the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to hadle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instrunctions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the upredicated instruction.  */
+  rtx_insn *new_insn;
+  switch (j)
+    {
+      case 1:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+						   arr[3]));
+	break;
+      case 5:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+						   arr[3], arr[4]));
+	break;
+      case 6:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+						   arr[3], arr[4], arr[5]));
+	break;
+      case 7:
+	new_insn = emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+						   arr[3], arr[4], arr[5],
+						   arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  INSN_LOCATION (new_insn) = INSN_LOCATION (insn);
+  return true;
+}
+
+/* When a vctp insn is used, its out is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  rtx next_use1_set = single_set (next_use1);
+  if (next_use1_set
+      && GET_CODE (SET_SRC (next_use1_set)) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      rtx next_use2_set = single_set (next_use2);
+      if (next_use2_set
+	  && GET_CODE (SET_SRC (next_use2_set)) == SUBREG)
+	vctp_vpr_generated = SET_DEST (next_use2_set);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  Returns
+   rtx constant value to decrement from the total number of elements.  Return
+   (const_int 1) if we can't use tail predication and fallback to scalar
+   low-overhead loops.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return const1_rtx;
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return const1_rtx;
+
+  gcc_assert (single_set (vctp_insn));
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return const1_rtx;
+
+  /* decrementunum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (vctp_insn);
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map <rtx_insn *, bool> *safe_insn_map
+      = new hash_map <rtx_insn *, bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return const1_rtx;
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return const1_rtx;
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return const1_rtx;
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+	       (safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return const1_rtx;
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+
+  emit_insn_after (seq, BB_END (body));
+
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr
+    = XVECEXP (SET_SRC (PATTERN (vctp_insn)), 0, 0);
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 8efdebecc3c..da745288f26 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@ (define_attr "fpu" "none,vfp"
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 5ea2d9e8668..9398702cddd 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2673,6 +2673,17 @@ (define_int_iterator MRRCI [VUNSPEC_MRRC VUNSPEC_MRRC2])
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+				 (DLSTP64 "64")])
+
+(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
+				 (LETP64 "2")])
+(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") (LETP32 "-4")
+				     (LETP64 "-2")])
+
+(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") (LETP32 "3")
+					 (LETP64 "1")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2916,6 +2927,10 @@ (define_int_iterator SQRSHRLQ [SQRSHRL_64 SQRSHRL_48])
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
+(define_int_iterator LETP [LETP8 LETP16 LETP32
+			   LETP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 62df022ef19..5748e2333eb 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6922,23 +6922,24 @@ (define_expand "@arm_mve_reinterpret<mode>"
 ;; Originally expanded by 'predicated_doloop_end'.
 ;; In the rare situation where the branch is too far, we do also need to
 ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
-(define_insn "*predicated_doloop_end_internal"
+(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
   [(set (pc)
 	(if_then_else
-	   (ge (plus:SI (reg:SI LR_REGNUM)
-			(match_operand:SI 0 "const_int_operand" ""))
-		(const_int 0))
-	 (label_ref (match_operand 1 "" ""))
+	   (gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" "=r")
+				     (const_int <letp_num_lanes_neg>))]
+		LETP)
+		(const_int <letp_num_lanes_minus_1>))
+	 (match_operand 1 "" "")
 	 (pc)))
-   (set (reg:SI LR_REGNUM)
-	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (set (match_dup 0)
+	(plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
     else
-      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
+      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
   }
   [(set (attr "length")
 	(if_then_else
@@ -6947,11 +6948,11 @@ (define_insn "*predicated_doloop_end_internal"
 	    (const_int 6)))
    (set_attr "type" "branch")])
 
-(define_insn "dlstp<mode1>_insn"
+(define_insn "dlstp<dlstp_elemsize>_insn"
   [
     (set (reg:SI LR_REGNUM)
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
-  "dlstp.<mode1>\t%|lr, %0")
+  "TARGET_HAVE_MVE"
+  "dlstp.<dlstp_elemsize>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..f2801cea36a 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@ (define_expand "doloop_end"
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,77 @@ (define_expand "doloop_end"
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatible MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      switch (INTVAL (decrement_num))
+		{
+		  case 2:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal2
+					    (s0, loc_ref));
+		    break;
+		  case 4:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal4
+					    (s0, loc_ref));
+		    break;
+		  case 8:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal8
+					    (s0, loc_ref));
+		    break;
+		  case 16:
+		    insn = emit_jump_insn (gen_predicated_doloop_end_internal16
+					    (s0, loc_ref));
+		    break;
+		  default:
+		    gcc_unreachable ();
+		}
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1791,37 @@ (define_expand "doloop_begin"
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 4713ec840ab..2d6f27c14f4 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -583,6 +583,14 @@ (define_c_enum "unspec" [
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
+  LETP8
+  LETP16
+  LETP32
+  LETP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..d919207505c 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,34 @@ doloop_condition_get (rtx_insn *doloop_pat)
      the loop counter in an if_then_else too.
 
      2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
-
+     3) (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		   (set (reg) (plus (reg) (const_int -1)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
+
+      The ARM target also supports a special case of a counter that decrements
+      by `n` and terminating in a GTU condition.  In that case, the compare and
+      branch are all part of one insn, containing an UNSPEC:
+
+      4) (parallel [
+	    (set (pc)
+		(if_then_else (gtu (unspec:SI [(plus:SI (reg:SI 14 lr)
+							(const_int -n))])
+				   (const_int n-1]))
+		    (label_ref)
+		    (pc)))
+	    (set (reg:SI 14 lr)
+		 (plus:SI (reg:SI 14 lr)
+			  (const_int -n)))
+     */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -143,7 +158,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1)
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -173,15 +188,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1))
+      || INTVAL (XEXP (inc_src, 1)) >= 0)
     return 0;
+  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
 
   /* Check for (set (pc) (if_then_else (condition)
                                        (label_ref (label))
@@ -196,60 +213,71 @@ doloop_condition_get (rtx_insn *doloop_pat)
   /* Extract loop termination condition.  */
   condition = XEXP (SET_SRC (cmp), 0);
 
-  /* We expect a GE or NE comparison with 0 or 1.  */
-  if ((GET_CODE (condition) != GE
-       && GET_CODE (condition) != NE)
-      || (XEXP (condition, 1) != const0_rtx
-          && XEXP (condition, 1) != const1_rtx))
+  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
+     dec_num - 1.  */
+  if (!((GET_CODE (condition) == GE
+	 || GET_CODE (condition) == NE)
+	&& (XEXP (condition, 1) == const0_rtx
+	    || XEXP (condition, 1) == const1_rtx ))
+      &&!(GET_CODE (condition) == GTU
+	  && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
     return 0;
 
-  if ((XEXP (condition, 0) == reg)
+  /* For the ARM special case of having a GTU: re-form the condition without
+     the unspec for the benefit of the middle-end.  */
+  if (GET_CODE (condition) == GTU)
+    {
+      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src,
+				  GEN_INT (dec_num - 1));
+      return condition;
+    }
+  else if ((XEXP (condition, 0) == reg)
       /* For the third case:  */  
       || ((cc_reg != NULL_RTX)
 	  && (XEXP (condition, 0) == cc_reg)
 	  && (reg_orig == reg))
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
-   {
+    {
      if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -1))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
+		   (set (reg) (plus (reg) (const_int -1)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) 
 
-        which is equivalent to the following:
+	which is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+	condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
 
     return condition;
-   }
+    }
 
   /* ??? If a machine uses a funny comparison, we could return a
      canonicalized form here.  */
@@ -507,6 +535,11 @@ doloop_modify (class loop *loop, class niter_desc *desc,
 	nonneg = 1;
       break;
 
+    case GTU:
+      /* The iteration count does not need incrementing for a GTU test.  */
+      increment_count = false;
+      break;
+
       /* Abort if an invalid doloop pattern has been generated.  */
     default:
       gcc_unreachable ();
@@ -529,6 +562,10 @@ doloop_modify (class loop *loop, class niter_desc *desc,
 
   if (desc->noloop_assumptions)
     {
+      /* The GTU case has only been implemented for the ARM target, where
+	 noloop_assumptions gets explicitly set to NULL for that case, so
+	 assert here for safety.  */
+      gcc_assert (GET_CODE (condition) != GTU);
       rtx ass = copy_rtx (desc->noloop_assumptions);
       basic_block preheader = loop_preheader_edge (loop)->src;
       basic_block set_zero = split_edge (loop_preheader_edge (loop));
@@ -642,7 +679,7 @@ doloop_optimize (class loop *loop)
 {
   scalar_int_mode mode;
   rtx doloop_reg;
-  rtx count;
+  rtx count = NULL_RTX;
   widest_int iterations, iterations_max;
   rtx_code_label *start_label;
   rtx condition;
@@ -685,17 +722,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,12 +742,25 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
 
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
+
+  count = copy_rtx (desc->niter_expr);
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
   if (! doloop_seq
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..5ddd994e53d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for loop format of decrementing to zero */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time the p1 will also change in every loop (still fine)  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
new file mode 100644
index 00000000000..6966a396604
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
@@ -0,0 +1,44 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+#include "dlstp-int16x8.c"
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..33632c5f14d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,31 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4-run.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4-run.c
new file mode 100644
index 00000000000..6833dddde92
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4-run.c
@@ -0,0 +1,45 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include "dlstp-int32x4.c"
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..5d09f784b77
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,31 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2-run.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2-run.c
new file mode 100644
index 00000000000..cc0b9ce7ee9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2-run.c
@@ -0,0 +1,48 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include "dlstp-int64x2.c"
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..21e882424ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,28 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..8ea181c82d4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..f7c3e04f883
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,391 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <limits.h>
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* Using an unsigned number of elements to count down from, with a >0*/
+void test23 (int32_t *a, int32_t *b, int32_t *c, unsigned int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Using an unsigned number of elements to count up to, with a <n*/
+void test24 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 0; i < n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+
+/* Using an unsigned number of elements to count up to, with a <=n*/
+void test25 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 1; i <= n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i+1);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */
\ No newline at end of file

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
@ 2023-11-06 11:20 Stamatis Markianos-Wright
  0 siblings, 0 replies; 14+ messages in thread
From: Stamatis Markianos-Wright @ 2023-11-06 11:20 UTC (permalink / raw)
  To: gcc-patches
  Cc: Kyrylo Tkachov, Richard Earnshaw, richard.sandiford, ramana.gcc

[-- Attachment #1: Type: text/plain, Size: 358 bytes --]

Patch has already been mostly approved at:

https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630326.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634139.html
... But this latest revision includes:
* Changes based on the review from Richard Sandiford (thank you!)
* A rebase to latest trunk

So I'm hoping this is good to go in as-is :)

[-- Attachment #2: 2.patch --]
[-- Type: text/x-patch, Size: 107540 bytes --]

commit 1eb7b8c84d8687e493b65406ca7d4fe4cdb08e59
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other that -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Added a new df utility function: `df_bb_regno_only_def_find` that
       will return the DEF of a REG if it is DEF-ed only once within the
       basic block.
    
    And many things in the backend to implement the above optimisation:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail_Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp tranformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling)
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators,  unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       This all, in smoe way or another, are running checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions:
       *`arm_mve_get_vctp_lanes` to map
       from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
       to check an insn to see if it requires the VPR or not.
       * `arm_mve_get_loop_vctp`
       * `arm_mve_get_vctp_lanes`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_mve_is_across_vector_insn`
       * `arm_is_mve_load_store_insn`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
            (arm_target_bb_ok_for_lob): New.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_mve_is_across_vector_insn): New.
            (arm_is_mve_load_store_insn): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8d..4f164c54740 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 620ef7bfb2f..def2f0d6a58 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34472,19 +34478,1103 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility fuction: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give one only scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  int n_operands = recog_data.n_operands;
+  extract_insn (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+      return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influnce the result of the operation.
+   Any new across-vector instructions to the MVE ISA will have to assessed for
+   inclusion to this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_pattern = PATTERN (insn);
+  if (GET_CODE (insn_pattern) == SET
+      && GET_CODE (XEXP (insn_pattern, 1)) == UNSPEC
+      && (XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAXQ_S))
+    return true;
+  return false;
+}
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map <rtx_insn *, bool> *safe_insn_map, rtx_insn *insn_in,
+   rtx vctp_vpr_generated)
+{
+
+  auto_vec<rtx_insn *> worklist;
+  worklist.safe_push (insn_in);
+
+  bool *temp = NULL;
+
+  while (worklist.length () > 0)
+    {
+      rtx_insn *insn = worklist.pop ();
+
+      if ((temp = safe_insn_map->get (insn)))
+	return *temp;
+
+      basic_block body = BLOCK_FOR_INSN (insn);
+
+      /* The circumstances under which an instruction is affected by "implicit
+	 predication" are as follows:
+	  * It is an UNPREDICATED_INSN_P:
+	    * That loads/stores from/to memory.
+	    * Where any one of its operands is an MVE vector from outside the
+	      loop body bb.
+	 Or:
+	  * Any of it's operands, recursively backwards, are affected.  */
+      if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	  && (arm_is_mve_load_store_insn (insn)
+	      || (arm_is_mve_across_vector_insn (insn)
+		  && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+	{
+	  safe_insn_map->put (insn, true);
+	  return true;
+	}
+
+      df_ref insn_uses = NULL;
+      FOR_EACH_INSN_USE (insn_uses, insn)
+      {
+	/* If the operand is in the input reg set to the the basic block,
+	   (i.e. it has come from outside the loop!), consider it unsafe if:
+	     * It's being used in an unpredicated insn.
+	     * It is a predicable MVE vector.  */
+	if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	    && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	    && REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+	  {
+	    safe_insn_map->put (insn, true);
+	    return true;
+	  }
+
+	/* Scan backwards from the current INSN through the instruction chain
+	   until the start of the basic block.  */
+	for (rtx_insn *prev_insn = PREV_INSN (insn);
+	     prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	     prev_insn = PREV_INSN (prev_insn))
+	  {
+	    /* If a previous insn defines a register that INSN uses, then
+	       recurse in order to check that insn's USEs. If any of these
+	       insns return true as MVE_VPT_UNPREDICATED_INSN_Ps, then the
+	       whole chain is affected by the change in behaviour from being
+	       placed in dlstp/letp loop.  */
+	    df_ref prev_insn_defs = NULL;
+	    FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	    {
+	      if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+		  && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		       (insn, vctp_vpr_generated))
+		worklist.safe_push (prev_insn);
+	    }
+	  }
+      }
+    }
+  safe_insn_map->put (insn, false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.*/
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is an differently-predicated store to memory or it is any
+		 across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == IF_THEN_ELSE
+      && REG_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (XEXP (counter_max_last_set, 1), 1), vctp_step);
+
+  if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  if (!empty_block_p (body->loop_father->latch))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg from the iv.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+
+  /* Find where both of those are modified in the loop body bb.  */
+  df_ref condcount_reg_set_df = df_bb_regno_only_def_find (body, REGNO (condcount));
+  df_ref vctp_reg_set_df = df_bb_regno_only_def_find (body, REGNO (vctp_reg));
+  if (!condcount_reg_set_df || !vctp_reg_set_df)
+    return NULL;
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+  rtx_insn* vctp_reg_set = DF_REF_INSN (vctp_reg_set_df);
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (!(GET_CODE (XEXP (PATTERN (vctp_reg_set), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (vctp_reg_set), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 0))))
+    return NULL;
+  if (decrementnum != abs (INTVAL (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 1)))
+      || decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst, is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case makse sure that there is a only simple SET from one to the
+     other inside the loop.*/
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behild a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the VCTP insn is DEF'd inside the loop.  */
+	  rtx vctp_reg_set =
+		PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					(body, REGNO (vctp_reg))));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If its a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows its not negative)  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we need will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     it is valid and as per arm_target_bb_ok_for_lob and the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to hadle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instrunctions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the upredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its out is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementunum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map <rtx_insn *, bool> *safe_insn_map
+      = new hash_map <rtx_insn *, bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+	       (safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 8efdebecc3c..da745288f26 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 5ea2d9e8668..a6a7ff507a5 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2673,6 +2673,9 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2916,6 +2919,8 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 44a04b86cb5..93905583b18 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6933,7 +6933,7 @@
    (set (reg:SI LR_REGNUM)
 	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
@@ -6953,5 +6953,5 @@
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   "dlstp.<mode1>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..368d5138ca1 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,65 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatibe MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1779,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 4713ec840ab..12ae4c4f820 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -583,6 +583,10 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..6a72700a127 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
@@ -96,18 +96,19 @@ doloop_condition_get (rtx_insn *doloop_pat)
      the loop counter in an if_then_else too.
 
      2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-     Some targets (ARM) do the comparison before the branch, as in the
-     following form:
+     Some targets (ARM) do the comparison before the branch. The ARM target
+     also supports a counter that can decrement by `n`.  As such, the
+     following form is expected:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare (plus (reg) (const_int -n)) 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE/GE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +144,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -154,9 +155,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
         inc = PATTERN (prev_insn);
       if (GET_CODE (cmp) == SET && GET_CODE (SET_SRC (cmp)) == IF_THEN_ELSE)
 	{
-	  /* We expect the condition to be of the form (reg != 0)  */
+	  /* We expect the condition to be of the form (reg != 0)
+	     or (reg >= 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +214,48 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -1))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 That is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare (plus (reg) (const_int -n)) 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE/GE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which, for n == 1, is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases when n == 1.
 
+	For n > 1, the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
+       }
 
     return condition;
    }
@@ -642,7 +651,7 @@ doloop_optimize (class loop *loop)
 {
   scalar_int_mode mode;
   rtx doloop_reg;
-  rtx count;
+  rtx count = NULL_RTX;
   widest_int iterations, iterations_max;
   rtx_code_label *start_label;
   rtx condition;
@@ -685,17 +694,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,12 +714,25 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
 
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
+
+  count = copy_rtx (desc->niter_expr);
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
   if (! doloop_seq
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..5ddd994e53d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for loop format of decrementing to zero */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time the p1 will also change in every loop (still fine)  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0125a2a15fa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..06b960ad9ca
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..5a782dd7f74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..8ea181c82d4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..7c331c7895f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,388 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* Using an unsigned number of elements to count down from, with a >0*/
+void test23 (int32_t *a, int32_t *b, int32_t *c, unsigned int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Using an unsigned number of elements to count up to, with a <n*/
+void test24 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 0; i < n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+    }
+}
+
+
+/* Using an unsigned number of elements to count up to, with a <=n*/
+void test25 (uint8_t *a, uint8_t *b, uint8_t *c, unsigned int n)
+{
+    for (int i = 1; i <= n; i+=16)
+    {
+        mve_pred16_t p = vctp8q (n-i);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+    }
+}
+
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */
\ No newline at end of file

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
@ 2023-08-17 10:31 Stamatis Markianos-Wright
  0 siblings, 0 replies; 14+ messages in thread
From: Stamatis Markianos-Wright @ 2023-08-17 10:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: Kyrylo Tkachov, Richard Earnshaw

[-- Attachment #1: Type: text/plain, Size: 5489 bytes --]

Hi all,


This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.

Mid-end changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
    decrement numbers other that -1 and for `count` to be an
    rtx containing a simple REG (which in this case will contain
    the number of elements to be processed), rather
    than an expression for calculating the number of iterations.
2) Added a new df utility function: `df_bb_regno_only_def_find` that
    will return the DEF of a REG if it is DEF-ed only once within the
    basic block.

And many things in the backend to implement the above optimisation:

3)  Implement the `arm_predict_doloop_p` target hook to instruct the
     mid-end about Low Overhead Loops (MVE or not), as well as
     `arm_loop_unroll_adjust` which will prevent unrolling of any loops
     that are valid for becoming MVE Tail_Predicated Low Overhead Loops
     (unrolling can transform a loop in ways that invalidate the dlstp/
     letp tranformation logic and the benefit of the dlstp/letp loop
     would be considerably higher than that of unrolling)
4)  Appropriate changes to the define_expand of doloop_end, new
     patterns for dlstp and letp, new iterators,  unspecs, etc.
5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
    * `arm_mve_dlstp_check_dec_counter`
    * `arm_mve_dlstp_check_inc_counter`
    * `arm_mve_check_reg_origin_is_num_elems`
    * `arm_mve_check_df_chain_back_for_implic_predic`
    * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
    This all, in smoe way or another, are running checks on the loop
    structure in order to determine if the loop is valid for dlstp/letp
    transformation.
6) `arm_attempt_dlstp_transform`: (called from the define_expand of
     doloop_end) this function re-checks for the loop's suitability for
     dlstp/letp transformation and then implements it, if possible.
7) Various utility functions:
    *`arm_mve_get_vctp_lanes` to map
    from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
    to check an insn to see if it requires the VPR or not.
    * `arm_mve_get_loop_vctp`
    * `arm_mve_get_vctp_lanes`
    * `arm_emit_mve_unpredicated_insn_to_seq`
    * `arm_get_required_vpr_reg`
    * `arm_get_required_vpr_reg_param`
    * `arm_get_required_vpr_reg_ret_val`
    * `arm_mve_is_across_vector_insn`
    * `arm_is_mve_load_store_insn`
    * `arm_mve_vec_insn_is_predicated_with_this_predicate`
    * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

     * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
     (arm_target_bb_ok_for_lob): ...this
     (arm_attempt_dlstp_transform): New.
     * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
     (TARGET_PREDICT_DOLOOP_P): New.
     (arm_block_set_vect):
     (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
     (arm_target_bb_ok_for_lob): New.
     (arm_mve_get_vctp_lanes): New.
     (arm_get_required_vpr_reg): New.
     (arm_get_required_vpr_reg_param): New.
     (arm_get_required_vpr_reg_ret_val): New.
     (arm_mve_get_loop_vctp): New.
     (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
     (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
     (arm_mve_check_df_chain_back_for_implic_predic): New.
     (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
     (arm_mve_check_reg_origin_is_num_elems): New.
     (arm_mve_dlstp_check_inc_counter): New.
     (arm_mve_dlstp_check_dec_counter): New.
     (arm_mve_loop_valid_for_dlstp): New.
     (arm_mve_is_across_vector_insn): New.
     (arm_is_mve_load_store_insn): New.
     (arm_predict_doloop_p): New.
     (arm_loop_unroll_adjust): New.
     (arm_emit_mve_unpredicated_insn_to_seq): New.
     (arm_attempt_dlstp_transform): New.
         * config/arm/iterators.md (DLSTP): New.
         (mode1): Add DLSTP mappings.
         * config/arm/mve.md (*predicated_doloop_end_internal): New.
         (dlstp<mode1>_insn): New.
         * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
         * config/arm/unspecs.md: New unspecs.
     * df-core.cc (df_bb_regno_only_def_find): New.
     * df.h (df_bb_regno_only_def_find): New.
         * loop-doloop.cc (doloop_condition_get): Relax conditions.
         (doloop_optimize): Add support for elementwise LoLs.

gcc/testsuite/ChangeLog:

         * gcc.target/arm/lob.h: Update framework.
         * gcc.target/arm/lob1.c: Likewise.
         * gcc.target/arm/lob6.c: Likewise.
     * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
     * gcc.target/arm/mve/dlstp-int16x8.c: New test.
     * gcc.target/arm/mve/dlstp-int32x4.c: New test.
     * gcc.target/arm/mve/dlstp-int64x2.c: New test.
     * gcc.target/arm/mve/dlstp-int8x16.c: New test.
     * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

[-- Attachment #2: 2.patch --]
[-- Type: text/x-patch, Size: 105189 bytes --]

commit 8564dee09c1258c388094abd614f311e60723368
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other that -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Added a new df utility function: `df_bb_regno_only_def_find` that
       will return the DEF of a REG if it is DEF-ed only once within the
       basic block.
    
    And many things in the backend to implement the above optimisation:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail_Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp tranformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling)
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators,  unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       This all, in smoe way or another, are running checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions:
       *`arm_mve_get_vctp_lanes` to map
       from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
       to check an insn to see if it requires the VPR or not.
       * `arm_mve_get_loop_vctp`
       * `arm_mve_get_vctp_lanes`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_mve_is_across_vector_insn`
       * `arm_is_mve_load_store_insn`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
            (arm_target_bb_ok_for_lob): New.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_mve_is_across_vector_insn): New.
            (arm_is_mve_load_store_insn): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 77e76336e94..74186930f0b 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 6e933c80183..39d97ba5e4d 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34416,19 +34422,1096 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility fuction: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give one only scalar
+   output, except for vshlcq which gives a PARALLEL of a vector and a scalar
+   (one vector result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+	&& !arm_get_required_vpr_reg_ret_val (insn))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  int n_operands = recog_data.n_operands;
+  extract_insn (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+      return true;
+    }
+  return false;
+}
+
+/* When transforming an MVE intrinsic loop into an MVE Tail Predicated Low
+   Overhead Loop, there are a number of instructions that, if in their
+   unpredicated form, act across vector lanes, but are still safe to include
+   within the loop, despite the implicit predication added to the vector lanes.
+   This list has been compiled by carefully analyzing the instruction
+   pseudocode in the Arm-ARM.
+   All other across-vector instructions aren't allowed, because the addition
+   of implicit predication could influnce the result of the operation.
+   Any new across-vector instructions to the MVE ISA will have to assessed for
+   inclusion to this list.  */
+
+static bool
+arm_mve_is_allowed_unpredic_across_vector_insn (rtx_insn* insn)
+{
+  gcc_assert (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	      && arm_is_mve_across_vector_insn (insn));
+  rtx insn_pattern = PATTERN (insn);
+  if (GET_CODE (insn_pattern) == SET
+      && GET_CODE (XEXP (insn_pattern, 1)) == UNSPEC
+      && (XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLADAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VABAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDLVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VADDVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMAXAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLALDAVAXQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VMLSLDAVAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAQ_U
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLALDAVHAXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHXQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAQ_S
+	  || XINT (XEXP (insn_pattern, 1), 1) == VRMLSLDAVHAXQ_S))
+    return true;
+  return false;
+}
+
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   This function returns true if the insn is affected implicit predication
+   and false otherwise.
+   Having such implicit predication on an unpredicated insn wouldn't in itself
+   block tail predication, because the output of that insn might then be used
+   in a correctly predicated store insn, where the disabled lanes will be
+   ignored.  To verify this we later call:
+   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check the
+   DF chains forward to see if any implicitly-predicated operand gets used in
+   an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic
+  (hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map, rtx_insn *insn,
+   rtx vctp_vpr_generated)
+{
+  bool* temp = NULL;
+  if ((temp = safe_insn_map->get (INSN_UID (insn))))
+    return *temp;
+
+  basic_block body = BLOCK_FOR_INSN (insn);
+  /* The circumstances under which an instruction is affected by "implicit
+     predication" are as follows:
+      * It is an UNPREDICATED_INSN_P:
+	* That loads/stores from/to memory.
+	* Where any one of its operands is an MVE vector from outside the
+	  loop body bb.
+     Or:
+      * Any of it's operands, recursively backwards, are affected.  */
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      && (arm_is_mve_load_store_insn (insn)
+	  || (arm_is_mve_across_vector_insn (insn)
+	      && !arm_mve_is_allowed_unpredic_across_vector_insn (insn))))
+    {
+      safe_insn_map->put (INSN_UID (insn), true);
+      return true;
+    }
+
+  df_ref insn_uses = NULL;
+  FOR_EACH_INSN_USE (insn_uses, insn)
+  {
+    /* If the operand is in the input reg set to the the basic block,
+       (i.e. it has come from outside the loop!), consider it unsafe if:
+	 * It's being used in an unpredicated insn.
+	 * It is a predicable MVE vector.  */
+    if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	&& VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	&& REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+      {
+	safe_insn_map->put (INSN_UID (insn), true);
+	return true;
+      }
+    /* Scan backwards from the current INSN through the instruction chain
+       until the start of the basic block.  */
+    for (rtx_insn *prev_insn = PREV_INSN (insn);
+	 prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	 prev_insn = PREV_INSN (prev_insn))
+      {
+	/* If a previous insn defines a register that INSN uses, then recurse
+	   in order to check that insn's USEs.
+	   If any of these insns return true as MVE_VPT_UNPREDICATED_INSN_Ps,
+	   then the whole chain is affected by the change in behaviour from
+	   being placed in dlstp/letp loop.  */
+	df_ref prev_insn_defs = NULL;
+	FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	{
+	  if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+	      && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		   (insn, vctp_vpr_generated)
+	      && arm_mve_check_df_chain_back_for_implic_predic
+		  (safe_insn_map, prev_insn, vctp_vpr_generated))
+	    {
+	      safe_insn_map->put (INSN_UID (insn), true);
+	      return true;
+	    }
+	}
+      }
+  }
+  safe_insn_map->put (INSN_UID (insn), false);
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.
+   This function returns true if the insn is not suitable for
+   implicit predication and false otherwise.*/
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && (arm_is_mve_load_store_insn (insn, 2)
+	  || arm_is_mve_across_vector_insn (insn)))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is an differently-predicated store to memory or it is any
+		 across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+/* Helper function to `arm_mve_dlstp_check_inc_counter` and to
+   `arm_mve_dlstp_check_dec_counter`.  In the situations where the loop counter
+   is incrementing by 1 or decrementing by 1 in each iteration, ensure that the
+   target value or the initialisation value, respectively, was a calculation
+   of the number of iterations of the loop, which is expected to be an ASHIFTRT
+   by VCTP_STEP.  */
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+     constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && GET_CODE (XEXP (counter_max_last_set, 1)) == IF_THEN_ELSE
+      && REG_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb->next_bb, XEXP (XEXP (counter_max_last_set, 1), 1), vctp_step);
+
+  if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  if (NONDEBUG_INSN_P (dec_insn))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Find where both of those are modified in the loop body bb.  */
+  rtx condcount_reg_set
+	= PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+				 (body, REGNO (condcount))));
+  rtx vctp_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					    (body, REGNO (vctp_reg))));
+  if (!vctp_reg_set || !condcount_reg_set)
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+  /* Ensure it matches the number of lanes of the vctp instruction.  */
+  if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Helper function to `arm_mve_loop_valid_for_dlstp`.  In the case of a
+   counter that is decrementing, ensure that it is decrementing by the
+   right amount in each iteration and that the target condition is what
+   we expect.  */
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case makse sure that there is a only simple SET from one to the
+     other inside the loop.*/
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behild a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the VCTP insn is DEF'd inside the loop.  */
+	  rtx vctp_reg_set =
+		PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					(body, REGNO (vctp_reg))));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If its a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows its not negative)  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+/* Function to check a loop's structure to see if it is a valid candidate for
+   an MVE Tail Predicated Low-Overhead Loop.  Returns the loop's VCTP_INSN if
+   it is valid, or NULL if it isn't.  */
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we need will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     it is valid and as per arm_target_bb_ok_for_lob and the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to hadle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instrunctions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the upredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its out is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementunum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  hash_map<int_hash<int, -1, -2>, bool>* safe_insn_map
+					  = new hash_map<int_hash<int, -1, -2>,
+							 bool>;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+		(safe_insn_map, insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index ee931ad6ebd..70fade0d0da 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,11 @@
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (const_int 0))
 
 ; LENGTH of an instruction (in bytes)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 71e43539616..1401b59dc0b 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2660,6 +2660,9 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2903,6 +2906,8 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 87cbf6c1726..dc4b6301aaa 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6997,7 +6997,7 @@
    (set (reg:SI LR_REGNUM)
 	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   {
     if (get_attr_length (insn) == 4)
       return "letp\t%|lr, %l1";
@@ -7017,5 +7017,5 @@
 	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
 	  DLSTP))
   ]
-  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "TARGET_HAVE_MVE"
   "dlstp.<mode1>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..368d5138ca1 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,65 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatibe MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+	}
+
+	/* Otherwise, try standard decrement-by-one dls/le looping.  */
+	if (TARGET_THUMB2)
+	  insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							GEN_INT (-1)));
+	else
+	  insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	emit_jump_insn (gen_rtx_SET (pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+	DONE;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1779,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 6a5b1f8f623..7921bffc169 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -581,6 +581,10 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..f6dbd0515de 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -n)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
      the loop counter register.  Some targets (IA-64) wrap the set of
      the loop counter in an if_then_else too.
 
-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+     2)  (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	{
 	  /* We expect the condition to be of the form (reg != 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 If n == 1, that is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which also for n == 1 is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
+	For the "elementwise" form where the decrement number isn't -1,
+	the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
+       }
 
     return condition;
    }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,11 +713,24 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  count = copy_rtx (desc->niter_expr);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..5ddd994e53d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,561 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for loop format of decrementing to zero */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Iteration counter counting down from num_iter.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = num_iter; i > 0; i--)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time the p1 will also change in every loop (still fine)  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test13 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test15 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test17 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test22 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test23 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 23.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 167 } } */
+/* { dg-final { scan-assembler-times {\tletp} 167 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0cdffb312b3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..7ff789d7650
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..8065bd02469
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..552781001e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..c1c40c2fea7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,343 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+int test17 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vmaxvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+
+
+int test18 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test19 (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vminavq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int test20 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int res = 0;
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vminvq (res, va);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+uint8x16_t test21 (uint8_t *a, uint32_t *b, int n, uint8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        res = vshlcq_u8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+
+int8x16_t test22 (int8_t *a, int32_t *b, int n, int8x16_t res)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        int8x16_t va = vldrbq_z_s8 (a, p);
+        res = vshlcq_s8 (va, b, 1);
+        n-=16;
+        a+=16;
+    }
+    return res;
+}
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-06-23 16:25 ` Andre Vieira (lists)
@ 2023-07-05 16:41   ` Stamatis Markianos-Wright
  0 siblings, 0 replies; 14+ messages in thread
From: Stamatis Markianos-Wright @ 2023-07-05 16:41 UTC (permalink / raw)
  To: Andre Vieira (lists), gcc-patches
  Cc: Kyrylo Tkachov, Richard Earnshaw, ramana.gcc, nickc

[-- Attachment #1: Type: text/plain, Size: 13696 bytes --]


On 23/06/2023 17:25, Andre Vieira (lists) wrote:
> +  /* In order to find out if the loop is of type A or B above look 
> for the
> +     loop counter: it will either be incrementing by one per 
> iteration or
> +     it will be decrementing by num_of_lanes.  We can find the loop 
> counter
> +     in the condition at the end of the loop.  */
> +  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
> +  gcc_assert (cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
> +          && GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE);
>
> Not sure this should be an assert. If we do encounter a differently 
> formed loop, we should bail out of DLSTPing for now but we shouldn't ICE.

Hmm, good point. A lot of these checks I actually did have in as 
asserts, so that then I could try and answer the question "does this 
ever happen in all currently known dlstp/letp-compatible codebases" and 
if they are still asserts it means that the answer was "No".

I agree, though, for future safety it's best to make this a bail-out 
instead.

Done

>
>
> +  /* The loop latch has to be empty.  When compiling all the known 
> MVE LoLs in
> +     user applications, none of those with incrementing counters had 
> any real
> +     insns in the loop latch.  As such, this function has only been 
> tested with
> +     an empty latch and may misbehave or ICE if we somehow get here 
> with an
> +     increment in the latch, so, for sanity, error out early.  */
> +  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
> +  if (NONDEBUG_INSN_P (dec_insn))
> +    gcc_unreachable ();
>
> Similarly here I'd return false rather than gcc_unreachable ();
Same as above
Done
>
>
> +  /* Find where both of those are modified in the loop body bb. */
> +  rtx condcount_reg_set = PATTERN (DF_REF_INSN 
> (df_bb_regno_only_def_find
> +                         (body, REGNO (condcount))));
> Put = on newline, breaks it down nicer.

Done


>
> +      counter_orig_set = XEXP (PATTERN
> +                    (DF_REF_INSN
> +                      (DF_REF_NEXT_REG
> +                    (DF_REG_DEF_CHAIN
> +                     (REGNO
> +                       (XEXP (condcount_reg_set, 0)))))), 1);
>
> This makes me a bit nervous, can we be certain that the PATTERN of the 
> next insn that sets it is indeed a set. Heck can we even be sure 
> DF_REG_DEF_CHAIN returns a non-null, I can't imagine why not but maybe 
> there are some constructs it can't follow-up on? Might just be worth 
> checking these steps and bailing out.
>
Hmm, yes. The logic of it is:

* So  condcount_reg_set is a set to a REG. Get that REGNO and fetch the 
DEF_CHAIN...
* Then the "DF_REF_NEXT_REG" is the last df_ref in the chain that did a 
DEF to that REG
* Then get the INSN from that (DF_REF_INSN) --- it should be a SET
* Then get the PATTERN
* Then XEXP (.., 1) on the PATTERN will get the rhs of the SET

So the outcome is finding the previous value that was last set to that 
REGNO, regardless of it being in the same BB or not

But indeed it is unsafe that way.
Done some breakdown and some checking :)
>
>
> +      /* When we find the vctp instruction: This may be followed by
> +      a zero-extend insn to SImode.  If it is, then save the
> +      zero-extended REG into vctp_vpr_generated.  If there is no
> +      zero-extend, then store the raw output of the vctp.
> +      For any VPT-predicated instructions we need to ensure that
> +      the VPR they use is the same as the one given here and
> +      they often consume the output of a subreg of the SImode
> +      zero-extended VPR-reg.  As a result, comparing against the
> +      output of the zero-extend is more likely to succeed.
> +      This code also guarantees to us that the vctp comes before
> +      any instructions that use the VPR within the loop, for the
> +      dlstp/letp transform to succeed.  */
>
> Wrong comment indent after first line.

Done (but shifted this comment to a new location, because i split the 
logic searching for vctp_vpr_generated into a new function to be called 
separately from the FOR_BB_INSNS loop (this guarantees that we always 
have a value in `vctp_vpr_generated` when 
arm_mve_check_df_chain_back_for_implic_predic is called


>
> +  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
> +  if (!vctp_insn || !arm_mve_loop_valid_for_dlstp (body))
> +    return GEN_INT (1);
>
> arm_mve_loop_valid_for_dlstp already calls arm_mve_get_loop_vctp, 
> maybe have 'arm_mve_loop_valid_for_dlstp' return vctp_insn or null to 
> determine success or failure, avoids looping through the BB again.
>
> For the same reason I'd also pass vctp_insn down to 
> 'arm_mve_check_df_chain_back_for_implic_predic'.

Done

>
> +      if (GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
> +        {
> +          rtx_insn *next_use2 = NULL;
>
> Are we sure single_set can never return 0 here? Maybe worth an extra 
> check and bail out if it does?

True... I could imagine the VPR being used in e.g. an `adds` instruction 
that is a PARALLEL SET to the output and to CC_REG: that would ICE here

Added an additional check for this.

>
> +       /* If the insn pattern requires the use of the VPR value from the
> +      vctp as an input parameter.  */
> s/an an input parameter./as an input parameter for predication./
Done
>
> +      /* None of registers USE-d by the instruction need can be the VPR
> +         vctp_vpr_generated.  This blocks the optimisation if there any
> +         instructions that use the optimised-out VPR value in any way
> +         other than as a VPT block predicate.  */
>
> Reword this slightly to be less complex:
> This instruction USE-s the vctp_vpr_generated other than for 
> predication, this blocks the transformation as we are not allowed to 
> optimise the VPR value away.

Done


Note a couple of "extra" changes in this revision:
* As we discussed offline: Using "lctp" after the loop, instead of the 
"mrs,orr,and,msr" series when the branch is too far for "letp"
* Added "simple_loop_desc (body->loop_father)->noloop_assumptions = 
NULL_RTX" when the transform succeeds. This is as a safety to stop the 
middle-end from ever falling down that constant-number-of-terations 
logic (I never actually saw this happen, but it seemed reasonable to 
block this)
* Split out "arm_mve_get_vctp_vec_form" into a new function: this just 
helps with the read-ability of "arm_attempt_dlstp_transform"

I look forward to getting more comments :D

Cheers,
Stam

>
> Will continue reviewing next week :)
>
> On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
>>      Hi all,
>>
>>      This is the 2/2 patch that contains the functional changes needed
>>      for MVE Tail Predicated Low Overhead Loops.  See my previous email
>>      for a general introduction of MVE LOLs.
>>
>>      This support is added through the already existing loop-doloop
>>      mechanisms that are used for non-MVE dls/le looping.
>>
>>      Mid-end changes are:
>>
>>      1) Relax the loop-doloop mechanism in the mid-end to allow for
>>         decrement numbers other that -1 and for `count` to be an
>>         rtx containing a simple REG (which in this case will contain
>>         the number of elements to be processed), rather
>>         than an expression for calculating the number of iterations.
>>      2) Added a new df utility function: `df_bb_regno_only_def_find` 
>> that
>>         will return the DEF of a REG only if it is DEF-ed once within 
>> the
>>         basic block.
>>
>>      And many things in the backend to implement the above optimisation:
>>
>>      3)  Implement the `arm_predict_doloop_p` target hook to instruct 
>> the
>>          mid-end about Low Overhead Loops (MVE or not), as well as
>>          `arm_loop_unroll_adjust` which will prevent unrolling of any 
>> loops
>>          that are valid for becoming MVE Tail_Predicated Low Overhead 
>> Loops
>>          (unrolling can transform a loop in ways that invalidate the 
>> dlstp/
>>          letp tranformation logic and the benefit of the dlstp/letp loop
>>          would be considerably higher than that of unrolling)
>>      4)  Appropriate changes to the define_expand of doloop_end, new
>>          patterns for dlstp and letp, new iterators,  unspecs, etc.
>>      5) `arm_mve_loop_valid_for_dlstp` and a number of checking 
>> functions:
>>         * `arm_mve_dlstp_check_dec_counter`
>>         * `arm_mve_dlstp_check_inc_counter`
>>         * `arm_mve_check_reg_origin_is_num_elems`
>>         * `arm_mve_check_df_chain_back_for_implic_predic`
>>         * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
>>         This all, in smoe way or another, are running checks on the loop
>>         structure in order to determine if the loop is valid for 
>> dlstp/letp
>>         transformation.
>>      6) `arm_attempt_dlstp_transform`: (called from the define_expand of
>>          doloop_end) this function re-checks for the loop's 
>> suitability for
>>          dlstp/letp transformation and then implements it, if possible.
>>      7) Various utility functions:
>>         *`arm_mve_get_vctp_lanes` to map
>>         from vctp unspecs to number of lanes, and 
>> `arm_get_required_vpr_reg`
>>         to check an insn to see if it requires the VPR or not.
>>         * `arm_mve_get_loop_vctp`
>>         * `arm_mve_get_vctp_lanes`
>>         * `arm_emit_mve_unpredicated_insn_to_seq`
>>         * `arm_get_required_vpr_reg`
>>         * `arm_get_required_vpr_reg_param`
>>         * `arm_get_required_vpr_reg_ret_val`
>>         * `arm_mve_vec_insn_is_predicated_with_this_predicate`
>>         * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
>>
>>      No regressions on arm-none-eabi with various targets and on
>>      aarch64-none-elf. Thoughts on getting this into trunk?
>>
>>      Thank you,
>>      Stam Markianos-Wright
>>
>>      gcc/ChangeLog:
>>
>>              * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): 
>> Rename to...
>>              (arm_target_bb_ok_for_lob): ...this
>>              (arm_attempt_dlstp_transform): New.
>>              * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
>>              (TARGET_PREDICT_DOLOOP_P): New.
>>              (arm_block_set_vect):
>>              (arm_target_insn_ok_for_lob): Rename from 
>> arm_target_insn_ok_for_lob.
>>              (arm_target_bb_ok_for_lob): New.
>>              (arm_mve_get_vctp_lanes): New.
>>              (arm_get_required_vpr_reg): New.
>>              (arm_get_required_vpr_reg_param): New.
>>              (arm_get_required_vpr_reg_ret_val): New.
>>              (arm_mve_get_loop_vctp): New.
>> (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
>> (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
>>              (arm_mve_check_df_chain_back_for_implic_predic): New.
>> (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
>>              (arm_mve_check_reg_origin_is_num_elems): New.
>>              (arm_mve_dlstp_check_inc_counter): New.
>>              (arm_mve_dlstp_check_dec_counter): New.
>>              (arm_mve_loop_valid_for_dlstp): New.
>>              (arm_predict_doloop_p): New.
>>              (arm_loop_unroll_adjust): New.
>>              (arm_emit_mve_unpredicated_insn_to_seq): New.
>>              (arm_attempt_dlstp_transform): New.
>>              * config/arm/iterators.md (DLSTP): New.
>>              (mode1): Add DLSTP mappings.
>>              * config/arm/mve.md (*predicated_doloop_end_internal): New.
>>              (dlstp<mode1>_insn): New.
>>              * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>>              * config/arm/unspecs.md: New unspecs.
>>              * df-core.cc (df_bb_regno_only_def_find): New.
>>              * df.h (df_bb_regno_only_def_find): New.
>>              * loop-doloop.cc (doloop_condition_get): Relax conditions.
>>              (doloop_optimize): Add support for elementwise LoLs.
>>
>>      gcc/testsuite/ChangeLog:
>>
>>              * gcc.target/arm/lob.h: Update framework.
>>              * gcc.target/arm/lob1.c: Likewise.
>>              * gcc.target/arm/lob6.c: Likewise.
>>              * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
>>              * gcc.target/arm/mve/dlstp-int16x8.c: New test.
>>              * gcc.target/arm/mve/dlstp-int32x4.c: New test.
>>              * gcc.target/arm/mve/dlstp-int64x2.c: New test.
>>              * gcc.target/arm/mve/dlstp-int8x16.c: New test.
>>              * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

[-- Attachment #2: diff.txt --]
[-- Type: text/plain, Size: 97487 bytes --]

commit c1c92d4d03fee44744908cf774219d21546c1f11
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other that -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Added a new df utility function: `df_bb_regno_only_def_find` that
       will return the DEF of a REG if it is DEF-ed only once within the
       basic block.
    
    And many things in the backend to implement the above optimisation:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail_Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp tranformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling)
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators,  unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       This all, in smoe way or another, are running checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions:
       *`arm_mve_get_vctp_lanes` to map
       from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
       to check an insn to see if it requires the VPR or not.
       * `arm_mve_get_loop_vctp`
       * `arm_mve_get_vctp_lanes`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_mve_is_across_vector_insn`
       * `arm_is_mve_load_store_insn`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
            (arm_target_bb_ok_for_lob): New.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_mve_is_across_vector_insn): New.
            (arm_is_mve_load_store_insn): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 7d73c66a15d..6722dc4ed33 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -64,8 +64,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 38f0839de1c..a4e4abfe8f9 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34416,19 +34422,971 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
      supported for 'low over head loop' making sure that LE target is
      above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility fuction: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (XEXP (x, 1));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+						    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Utility function to identify if INSN is an MVE instruction that performs
+   some across-vector operation (and as a result does not align with normal
+   lane predication rules).  All such instructions give one only scalar
+   output, except for vshlcq which gives a PARALLEL of two scalar outputs
+   (one result and one carry output).  */
+
+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+    return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs))))
+      is_across_vector = true;
+
+  return is_across_vector;
+}
+
+/* Utility function to identify if INSN is an MVE load or store instruction.
+   * For TYPE == 0, check all operands.  If the function returns true,
+     INSN is a load or a store insn.
+   * For TYPE == 1, only check the input operands.  If the function returns
+     true, INSN is a load insn.
+   * For TYPE == 2, only check the output operands.  If the function returns
+     true, INSN is a store insn.  */
+
+static bool
+arm_is_mve_load_store_insn (rtx_insn* insn, int type = 0)
+{
+  int n_operands = recog_data.n_operands;
+  extract_insn (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+      if (mve_memory_operand (recog_data.operand[op],
+			      GET_MODE (recog_data.operand[op])))
+      return true;
+    }
+  return false;
+}
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   Having such implicit predication on an unpredicated insn isn't in itself
+   an error, because the output of that insn might then be used in a correctly
+   predicated store insn, where the disabled lanes will be ignored.  To verify
+   this we later call arm_mve_check_df_chain_fwd_for_implic_predic_impact,
+   which will check the DF chains forward to see if any implicitly-predicated
+   operand gets used in an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic (rtx_insn *insn,
+					       rtx vctp_vpr_generated)
+{
+  basic_block body = BLOCK_FOR_INSN (insn);
+  /* The circumstances under which an instruction is affected by "implicit
+     predication" are as follows:
+      * It is an UNPREDICATED_INSN_P:
+	* That loads/stores from/to memory.
+	* Where any one of its operands is an MVE vector from outside the
+	  loop body bb.
+     Or:
+      * Any of it's operands, recursively backwards, are affected.  */
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn) && arm_is_mve_load_store_insn (insn))
+    return true;
+
+  df_ref insn_uses = NULL;
+  FOR_EACH_INSN_USE (insn_uses, insn)
+  {
+    /* If the operand is in the input reg set to the the basic block,
+       (i.e. it has come from outside the loop!), consider it unsafe if:
+	 * It's being used in an unpredicated insn.
+	 * It is a predicable MVE vector.  */
+    if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	&& VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	&& REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
+      return true;
+
+    /* Scan backwards from the current INSN through the instruction chain
+       until the start of the basic block.  */
+    for (rtx_insn *prev_insn = PREV_INSN (insn);
+	 prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	 prev_insn = PREV_INSN (prev_insn))
+      {
+	/* If a previous insn defines a register that INSN uses, then recurse
+	   in order to check that insn's USEs.
+	   If any of these insns return true as MVE_VPT_UNPREDICATED_INSN_Ps,
+	   then the whole chain is affected by the change in behaviour from
+	   being placed in dlstp/letp loop.  */
+	df_ref prev_insn_defs = NULL;
+	FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	{
+	  if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+	      && !arm_mve_vec_insn_is_predicated_with_this_predicate
+		   (insn, vctp_vpr_generated)
+	      && arm_mve_check_df_chain_back_for_implic_predic
+		  (prev_insn, vctp_vpr_generated))
+	    return true;
+	}
+      }
+  }
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+   by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.  */
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && arm_is_mve_load_store_insn (insn, 2))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (next_use_insn != insn
+	      && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is an differently-predicated store to memory or it is any
+		 across-vector instruction.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		  || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		       (next_use_insn, vctp_vpr_generated)
+		     && (arm_is_mve_load_store_insn (next_use_insn, 2)
+			 || arm_is_mve_across_vector_insn (next_use_insn))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for correctness, error out early.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  if (NONDEBUG_INSN_P (dec_insn))
+    return NULL;
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return NULL;
+
+  /* Find where both of those are modified in the loop body bb.  */
+  rtx condcount_reg_set
+	= PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+				 (body, REGNO (condcount))));
+  rtx vctp_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					    (body, REGNO (vctp_reg))));
+  if (!vctp_reg_set || !condcount_reg_set)
+    return NULL;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	    rtx counter_reg = XEXP (condcount_reg_set, 0);
+	    /* Check that the counter did indeed start from zero.  */
+	    df_ref this_set = DF_REG_DEF_CHAIN (REGNO (counter_reg));
+	    if (!this_set)
+	      return NULL;
+	    df_ref last_set = DF_REF_NEXT_REG (this_set);
+	    if (!last_set)
+	      return NULL;
+	    rtx_insn* last_set_insn = DF_REF_INSN (last_set);
+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);
+	    if (!CONST_INT_P (counter_orig_set)
+		|| (INTVAL (counter_orig_set) != 0))
+	      return NULL;
+	    /* And finally check that the target value of the counter,
+	       condconst is of the correct shape.  */
+	    if (!arm_mve_check_reg_origin_is_num_elems (body, condconst,
+							vctp_reg_iv.step))
+	      return NULL;
+	}
+      else
+	return NULL;
+    }
+  else
+    return NULL;
+
+  /* Extract the decrementnum of the vctp reg.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+  /* Ensure it matches the number of lanes of the vctp instruction.  */
+  if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return NULL;
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+static rtx_insn*
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return NULL;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return NULL;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return NULL;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return NULL;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return NULL;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case makse sure that there is a only simple SET from one to the
+     other inside the loop.*/
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behild a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the VCTP insn is DEF'd inside the loop.  */
+	  rtx vctp_reg_set =
+		PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					(body, REGNO (vctp_reg))));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return NULL;
+	}
+    }
+  else
+    return NULL;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If its a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return NULL;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return NULL;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return NULL;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return NULL;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows its not negative)  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return NULL;
+    }
+
+  /* Everything looks valid.  */
+  return vctp_insn;
+}
+
+static rtx_insn*
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return NULL;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we need will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  if (!(cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	&& GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE))
+    return NULL;
+
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return NULL;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return NULL;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return NULL;
+
+  if (!REG_P (loop_counter))
+    return NULL;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return NULL;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return NULL;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     it is valid and as per arm_target_bb_ok_for_lob and the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_THUMB2 && TARGET_HAVE_LOB && TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->latch)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to hadle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instrunctions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the upredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* When a vctp insn is used, its out is often followed by
+   a zero-extend insn to SImode, which is then SUBREG'd into a
+   vector form of mode VALID_MVE_PRED_MODE: this vector form is
+   what is then used as an input to the instructions within the
+   loop.  Hence, store that vector form of the VPR reg into
+   vctp_vpr_generated, so that we can match it with instructions
+   in the loop to determine if they are predicated on this same
+   VPR.  If there is no zero-extend and subreg or it is otherwise
+   invalid, then return NULL to cancel the dlstp transform.  */
+
+static rtx
+arm_mve_get_vctp_vec_form (rtx_insn *insn)
+{
+  rtx vctp_vpr_generated = NULL_RTX;
+  rtx_insn *next_use1 = NULL;
+  df_ref use;
+  for (use
+	= DF_REG_USE_CHAIN
+	   (DF_REF_REGNO (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (insn))));
+       use; use = DF_REF_NEXT_REG (use))
+    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+      next_use1 = DF_REF_INSN (use);
+
+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+    {
+      rtx_insn *next_use2 = NULL;
+      for (use
+	    = DF_REG_USE_CHAIN
+	       (DF_REF_REGNO
+		 (DF_INSN_INFO_DEFS (DF_INSN_INFO_GET (next_use1))));
+	   use; use = DF_REF_NEXT_REG (use))
+	if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	  next_use2 = DF_REF_INSN (use);
+
+      if (single_set (next_use2)
+	  && GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+	vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+    }
+
+  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+    return NULL_RTX;
+
+  return vctp_vpr_generated;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_loop_valid_for_dlstp (body);
+  if (!vctp_insn)
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+
+  rtx vctp_vpr_generated = arm_mve_get_vctp_vec_form (vctp_insn);
+  if (!vctp_vpr_generated)
+    return GEN_INT (1);
+
+  /* decrementunum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes (PATTERN (vctp_insn));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: continue.  */
+      else if (insn == vctp_insn)
+	continue;
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter for predication.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate
+		(insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* If this instruction USE-s the vctp_vpr_generated other than for
+	     predication, this blocks the transformation as we are not allowed
+	     to optimise the VPR value away.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+		(insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  Also set noloop_assumptions
+     to NULL to stop the middle-end from making assumptions about the
+     number of iterations.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 9e77af55d60..2d9bea95b9a 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2598,6 +2598,9 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2840,6 +2843,8 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index ddac6cc5ee1..ba6c01d2939 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -7194,3 +7194,40 @@
       }
   }
 )
+
+;; Originally expanded by 'predicated_doloop_end'.
+;; In the rare situation where the branch is too far, we do also need to
+;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
+(define_insn "*predicated_doloop_end_internal"
+  [(set (pc)
+	(if_then_else
+	   (ge (plus:SI (reg:SI LR_REGNUM)
+			(match_operand:SI 0 "const_int_operand" ""))
+		(const_int 0))
+	 (label_ref (match_operand 1 "" ""))
+	 (pc)))
+   (set (reg:SI LR_REGNUM)
+	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (clobber (reg:CC CC_REGNUM))]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  {
+    if (get_attr_length (insn) == 4)
+      return "letp\t%|lr, %l1";
+    else
+      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
+  }
+  [(set (attr "length")
+	(if_then_else
+	   (ltu (minus (pc) (match_dup 1)) (const_int 1024))
+	    (const_int 4)
+	    (const_int 6)))
+   (set_attr "type" "branch")])
+
+(define_insn "dlstp<mode1>_insn"
+  [
+    (set (reg:SI LR_REGNUM)
+	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
+	  DLSTP))
+  ]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "dlstp.<mode1>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..2094e9c4641 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,67 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatibe MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE && TARGET_THUMB2
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+
+	  /* Otherwise, try standard decrement-by-one dls/le looping.  */
+	  if (TARGET_THUMB2)
+	    insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							  GEN_INT (-1)));
+	  else
+	    insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	  cmp = XVECEXP (PATTERN (insn), 0, 0);
+	  cc_reg = SET_DEST (cmp);
+	  bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	  loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	  emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	  DONE;
+	}
+      else
+	FAIL;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1781,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index dccda283573..f7c4f2b1b0e 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -581,6 +581,10 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..f6dbd0515de 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -n)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
      the loop counter register.  Some targets (IA-64) wrap the set of
      the loop counter in an if_then_else too.
 
-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+     2)  (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	{
 	  /* We expect the condition to be of the form (reg != 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 If n == 1, that is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which also for n == 1 is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
+	For the "elementwise" form where the decrement number isn't -1,
+	the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
+       }
 
     return condition;
    }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,11 +713,24 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  count = copy_rtx (desc->niter_expr);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..f66c20a884b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,537 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for loop format of decrementing to zero */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test5 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time the p1 will also chance in every loop (still fine)  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test13 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test15 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction in a valid way.
+   This tests that "vc" has correctly masked the risky "vb".  */
+uint16_t test17 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvq_u16 (vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction with a scalar from outside the loop.  */
+uint16_t test18 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_x_u16 (va, vb, p);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test19 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector predicated instruction in a valid way.  */
+uint16_t  test20 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res++;
+      res = vaddvaq_p_u16 (res, vb, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using vshlq in a valid way.  */
+uint16x8_t  test21 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t res = vdupq_n_u16 (0);
+  uint32_t carry = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_z_u16 (b, p);
+      res = vshlcq_u16 (vb, &carry, 1);
+      res = vshlcq_m_u16 (res, &carry, 3 , p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 16.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 165 } } */
+/* { dg-final { scan-assembler-times {\tletp} 165 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0cdffb312b3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..7ff789d7650
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..8065bd02469
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..552781001e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..e96322360ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,278 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an across-vector unpredicated instruction. "vb" is the risk.  */
+uint16_t test14 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      vb = vaddq_u16 (va, vb);
+      res = vaddvq_u16 (vb);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* Using an across-vector unpredicated instruction. "vc" is the risk. */
+uint16_t test15 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t vb = vldrhq_u16 (b);
+  uint16_t res = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      uint16x8_t vc = vaddq_u16 (va, vb);
+      res = vaddvaq_u16 (res, vc);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16_t test16 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16_t res =0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      uint16x8_t va = vldrhq_z_u16 (a, p);
+      res = vaddvaq_u16 (res, vb);
+      res = vaddvaq_p_u16 (res, va, p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+uint16x8_t test17 (uint16_t *a, uint16_t *b,  uint16_t *c, int n)
+{
+  uint16x8_t res = vdupq_n_u16 (0);
+  uint32_t carry = 0;
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      uint16x8_t vb = vldrhq_u16 (b);
+      res = vshlcq_u16 (vb, &carry, 1);
+      res = vshlcq_m_u16 (res, &carry, 1 , p);
+      c += 8;
+      a += 8;
+      b += 8;
+      n -= 8;
+    }
+  return res;
+}
+
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-06-23 10:23 ` Andre Vieira (lists)
@ 2023-07-05 16:13   ` Stamatis Markianos-Wright
  0 siblings, 0 replies; 14+ messages in thread
From: Stamatis Markianos-Wright @ 2023-07-05 16:13 UTC (permalink / raw)
  To: Andre Vieira (lists), gcc-patches
  Cc: Kyrylo Tkachov, Richard Earnshaw, ramana.gcc, nickc


On 23/06/2023 11:23, Andre Vieira (lists) wrote:
> +  if (insn != arm_mve_get_loop_vctp (body))
> +    {
>
> probably a good idea to invert the condition here and return false, 
> helps reducing the indenting in this function.

Done, thanks


>
>
> +    /* Starting from the current insn, scan backwards through the insn
> +       chain until BB_HEAD: "for each insn in the BB prior to the 
> current".
> +    */
>
> There's a trailing whitespace after insn, but also I'd rewrite this 
> bit. The "for each insn in the BB prior to the current" is superfluous 
> and even confusing to me. How about:
> "Scan backwards from the current INSN through the instruction chain 
> until the start of the basic block.  "
Yes, agreed, it wasn't very clear. Done.
>
>
>  I find 'that previous insn' to be confusing as you don't mention any 
> previous insn before. So how about something along the lines of:
> 'If a previous insn defines a register that INSN uses then return true 
> if...'
Done
>
>
> Do we need to check: 'insn != prev_insn' ? Any reason why you can't 
> start the loop with:
> 'for (rtx_insn *prev_insn = PREV_INSN (insn);'

True! Done.

>
> Now I also found a case where things might go wrong in:
> +        /* Look at all the DEFs of that previous insn: if one of them 
> is on
> +           the same REG as our current insn, then recurse in order to 
> check
> +           that insn's USEs.  If any of these insns return true as
> +           MVE_VPT_UNPREDICATED_INSN_Ps, then the whole chain is 
> affected
> +           by the change in behaviour from being placed in dlstp/letp 
> loop.
> +        */
> +        df_ref prev_insn_defs = NULL;
> +        FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
> +          {
> +        if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
> +            && insn != prev_insn
> +            && body == BLOCK_FOR_INSN (prev_insn)
> +            && !arm_mve_vec_insn_is_predicated_with_this_predicate
> +             (insn, vctp_vpr_generated)
> +            && arm_mve_check_df_chain_back_for_implic_predic
> +             (prev_insn, vctp_vpr_generated))
> +          return true;
> +          }
>
> The body == BLOCK_FOR_INSN (prev_insn) hinted me at it, if a def comes 
> from outside of the BB (so outside of the loop's body) then its by 
> definition unpredicated by vctp.  I think you want to check that if 
> prev_insn defines a register used by insn then return true if 
> prev_insn isn't in the same BB or has a chain that is not predicated, 
> i.e.: '!arm_mve_vec_insn_is_predicated_with_this_predicate (insn, 
> vctp_vpr_generated) && arm_mve_check_df_chain_back_for_implic_predic 
> prev_insn, vctp_vpr_generated))' you check body != BLOCK_FOR_INSN 
> (prev_insn)'

Yes, you're right, this is vulnerable here. A neater fix to this (I 
think?) is to make the above REGNO_REG_SET_P more generic, so that it 
covers all scalar values and scalar ops, as well.
Then it's a "if this insn in the loop has any input that originates 
outside the bb, then it's unsafe" check and the recursive loop backwards 
is only for the recursive "are any previous insns unsafe"


>
>
> I also found some other issues, this currently loloops:
>
> uint16_t  test (uint16_t *a, int n)
> {
>   uint16_t res =0;
>   while (n > 0)
>     {
>       mve_pred16_t p = vctp16q (n);
>       uint16x8_t va = vldrhq_u16 (a);
>       res = vaddvaq_u16 (res, va);
>       res = vaddvaq_p_u16 (res, va, p);
>       a += 8;
>       n -= 8;
>     }
>   return res;
> }
>
> But it shouldn't, this is because there's a lack of handling of across 
> vector instructions. Luckily in MVE all across vector instructions 
> have the side-effect that they write to a scalar register, even the 
> vshlcq instruction (it writes to a scalar carry output).

Added support for them (you were right, there was some special handling 
needed!)


>
> Did this lead me to find an ICE with:
>
> uint16x8_t  test (uint16_t *a, int n)
> {
>   uint16x8_t res = vdupq_n_u16 (0);
>   while (n > 0)
>     {
>       uint16_t carry = 0;
>       mve_pred16_t p = vctp16q (n);
>       uint16x8_t va = vldrhq_u16 (a);
>       res = vshlcq_u16 (va, &carry, 1);
>       res = vshlcq_m_u16 (res, &carry, 1 , p);
>       a += 8;
>       n -= 8;
>     }
>   return res;
> }
>
> This is because:
> +          /* If the USE is outside the loop body bb, or it is inside, 
> but
> +         is an unpredicated store to memory.  */
> +          if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
> +         || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
> +             (next_use_insn, vctp_vpr_generated)
> +            && mve_memory_operand
> +            (SET_DEST (single_set (next_use_insn)),
> +             GET_MODE (SET_DEST (single_set (next_use_insn))))))
> +        return true;
>
> Assumes single_set doesn't return 0.

Thanks! That is indeed correct.

Corrected this by having a utility function to scan insn operands and 
check against mve_memory_operand that supports any number of 
operands/SETs in the insn

>
> Let's deal with these issues and I'll continue to review.
>
> On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
>>      Hi all,
>>
>>      This is the 2/2 patch that contains the functional changes needed
>>      for MVE Tail Predicated Low Overhead Loops.  See my previous email
>>      for a general introduction of MVE LOLs.
>>
>>      This support is added through the already existing loop-doloop
>>      mechanisms that are used for non-MVE dls/le looping.
>>
>>      Mid-end changes are:
>>
>>      1) Relax the loop-doloop mechanism in the mid-end to allow for
>>         decrement numbers other that -1 and for `count` to be an
>>         rtx containing a simple REG (which in this case will contain
>>         the number of elements to be processed), rather
>>         than an expression for calculating the number of iterations.
>>      2) Added a new df utility function: `df_bb_regno_only_def_find` 
>> that
>>         will return the DEF of a REG only if it is DEF-ed once within 
>> the
>>         basic block.
>>
>>      And many things in the backend to implement the above optimisation:
>>
>>      3)  Implement the `arm_predict_doloop_p` target hook to instruct 
>> the
>>          mid-end about Low Overhead Loops (MVE or not), as well as
>>          `arm_loop_unroll_adjust` which will prevent unrolling of any 
>> loops
>>          that are valid for becoming MVE Tail_Predicated Low Overhead 
>> Loops
>>          (unrolling can transform a loop in ways that invalidate the 
>> dlstp/
>>          letp tranformation logic and the benefit of the dlstp/letp loop
>>          would be considerably higher than that of unrolling)
>>      4)  Appropriate changes to the define_expand of doloop_end, new
>>          patterns for dlstp and letp, new iterators,  unspecs, etc.
>>      5) `arm_mve_loop_valid_for_dlstp` and a number of checking 
>> functions:
>>         * `arm_mve_dlstp_check_dec_counter`
>>         * `arm_mve_dlstp_check_inc_counter`
>>         * `arm_mve_check_reg_origin_is_num_elems`
>>         * `arm_mve_check_df_chain_back_for_implic_predic`
>>         * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
>>         This all, in smoe way or another, are running checks on the loop
>>         structure in order to determine if the loop is valid for 
>> dlstp/letp
>>         transformation.
>>      6) `arm_attempt_dlstp_transform`: (called from the define_expand of
>>          doloop_end) this function re-checks for the loop's 
>> suitability for
>>          dlstp/letp transformation and then implements it, if possible.
>>      7) Various utility functions:
>>         *`arm_mve_get_vctp_lanes` to map
>>         from vctp unspecs to number of lanes, and 
>> `arm_get_required_vpr_reg`
>>         to check an insn to see if it requires the VPR or not.
>>         * `arm_mve_get_loop_vctp`
>>         * `arm_mve_get_vctp_lanes`
>>         * `arm_emit_mve_unpredicated_insn_to_seq`
>>         * `arm_get_required_vpr_reg`
>>         * `arm_get_required_vpr_reg_param`
>>         * `arm_get_required_vpr_reg_ret_val`
>>         * `arm_mve_vec_insn_is_predicated_with_this_predicate`
>>         * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
>>
>>      No regressions on arm-none-eabi with various targets and on
>>      aarch64-none-elf. Thoughts on getting this into trunk?
>>
>>      Thank you,
>>      Stam Markianos-Wright
>>
>>      gcc/ChangeLog:
>>
>>              * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): 
>> Rename to...
>>              (arm_target_bb_ok_for_lob): ...this
>>              (arm_attempt_dlstp_transform): New.
>>              * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
>>              (TARGET_PREDICT_DOLOOP_P): New.
>>              (arm_block_set_vect):
>>              (arm_target_insn_ok_for_lob): Rename from 
>> arm_target_insn_ok_for_lob.
>>              (arm_target_bb_ok_for_lob): New.
>>              (arm_mve_get_vctp_lanes): New.
>>              (arm_get_required_vpr_reg): New.
>>              (arm_get_required_vpr_reg_param): New.
>>              (arm_get_required_vpr_reg_ret_val): New.
>>              (arm_mve_get_loop_vctp): New.
>> (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
>> (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
>>              (arm_mve_check_df_chain_back_for_implic_predic): New.
>> (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
>>              (arm_mve_check_reg_origin_is_num_elems): New.
>>              (arm_mve_dlstp_check_inc_counter): New.
>>              (arm_mve_dlstp_check_dec_counter): New.
>>              (arm_mve_loop_valid_for_dlstp): New.
>>              (arm_predict_doloop_p): New.
>>              (arm_loop_unroll_adjust): New.
>>              (arm_emit_mve_unpredicated_insn_to_seq): New.
>>              (arm_attempt_dlstp_transform): New.
>>              * config/arm/iterators.md (DLSTP): New.
>>              (mode1): Add DLSTP mappings.
>>              * config/arm/mve.md (*predicated_doloop_end_internal): New.
>>              (dlstp<mode1>_insn): New.
>>              * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>>              * config/arm/unspecs.md: New unspecs.
>>              * df-core.cc (df_bb_regno_only_def_find): New.
>>              * df.h (df_bb_regno_only_def_find): New.
>>              * loop-doloop.cc (doloop_condition_get): Relax conditions.
>>              (doloop_optimize): Add support for elementwise LoLs.
>>
>>      gcc/testsuite/ChangeLog:
>>
>>              * gcc.target/arm/lob.h: Update framework.
>>              * gcc.target/arm/lob1.c: Likewise.
>>              * gcc.target/arm/lob6.c: Likewise.
>>              * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
>>              * gcc.target/arm/mve/dlstp-int16x8.c: New test.
>>              * gcc.target/arm/mve/dlstp-int32x4.c: New test.
>>              * gcc.target/arm/mve/dlstp-int64x2.c: New test.
>>              * gcc.target/arm/mve/dlstp-int8x16.c: New test.
>>              * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-06-22 15:54 ` Andre Vieira (lists)
@ 2023-07-05 16:11   ` Stamatis Markianos-Wright
  0 siblings, 0 replies; 14+ messages in thread
From: Stamatis Markianos-Wright @ 2023-07-05 16:11 UTC (permalink / raw)
  To: Andre Vieira (lists), gcc-patches
  Cc: Kyrylo Tkachov, Richard Earnshaw, ramana.gcc, nickc

Thank you Andre for reviewing! I'll attach the updated version of the 
patch to the third review email (your final one thus far ;)

On 22/06/2023 16:54, Andre Vieira (lists) wrote:
> Some comments below, all quite minor. I'll continue to review 
> tomorrow, I need a fresher brain for 
> arm_mve_check_df_chain_back_for_implic_predic  ;)
>
> +static int
> +arm_mve_get_vctp_lanes (rtx x)
> +{
> +  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
> +      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == 
> VCTP_M))
> +    {
> +      switch (GET_MODE (XEXP (x, 1)))
> +    {
> +      case V16BImode:
> +        return 16;
> +      case V8BImode:
> +        return 8;
> +      case V4BImode:
> +        return 4;
> +      case V2QImode:
> +        return 2;
> +      default:
> +        break;
> +    }
> +    }
> +  return 0;
> +}
>
> I think you can replace the switch with something along the lines of:
> machine_mode mode = GET_MODE (XEXP (x, 1));
> return VECTOR_MODE_P (mode) ? GET_MODE_NUNITS (mode) : 0;
>
Ah true, especially now that there are no HImode predicates!

I added an additional check of `&& VALID_MVE_PRED_MODE (mode)` as well, 
just to make sure we could never pick up V4SImode, etc. (although I'd 
never expect that to happen if `rtx x` came from a valid instruction)


>
> +/* Check if an insn requires the use of the VPR_REG, if it does, 
> return the
> +   sub-rtx of the VPR_REG.  The `type` argument controls whether
> +   this function should:
> +   * For type == 0, check all operands, including the OUT operands,
> +     and return the first occurance of the VPR_REG.
>
> s/occurance/occurrence/
Done
>
> +      bool requires_vpr;
> +  extract_constrain_insn (insn);
>
> indent of requires_vpr is off.
Done
>
> +      if (type == 1 && (recog_data.operand_type[op] == OP_OUT
> +            || recog_data.operand_type[op] == OP_INOUT))
> +    continue;
> +      else if (type == 2 && (recog_data.operand_type[op] == OP_IN
> +                 || recog_data.operand_type[op] == OP_INOUT))
> +    continue;
>
> Why skip INOUT? I guess this will become clear when I see the uses, 
> but I'm wondering whether 'only check the input operands.' is clear 
> enough. Maybe 'check operands that are input only.' would be more 
> accurate?
Oh! Thanks for spotting this. It also doesn't work with my comment at 
the top:
`(INOUT operands are considered both as input and output operands)`

It's been a long time since I wrote this piece, but it might be that I 
added this after realising that there are no insns with an OP_INOUT VPR 
reg. Since I don't think it's functional, I changed the code to align 
with the comment, instead.

>
> +      /* Fetch the reg_class for each entry and check it against the
> +       * VPR_REG reg_class.  */
>
> Remove leading * on the second line.
Damn auto-formatters ;)
Done
>
> +
> +/* Wrapper function of arm_get_required_vpr_reg with type == 1, so 
> return
> +   something only if the VPR reg is an input operand to the insn.  */
>
> When talking about a function parameter in comments capitalize (INSN) 
> the name. Same for:
Done
>
> +/* Wrapper function of arm_get_required_vpr_reg with type == 2, so 
> return
> +   something only if the VPR reg is the retrurn value, an output of, 
> or is
> +   clobbered by the insn.  */
>
> +/* Return true if an insn is an MVE instruction that VPT-predicable, 
> but in
> +   its unpredicated form, or if it is predicated, but on a predicate 
> other
> +   than vpr_reg.  */
>
> In this one also 'is a MVE instruction that is VPT-predicable' would 
> be better I think.
Oops, thanks for spotting. Done.
>
>
> On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
> >      Hi all,
> >
> >      This is the 2/2 patch that contains the functional changes needed
> >      for MVE Tail Predicated Low Overhead Loops.  See my previous email
> >      for a general introduction of MVE LOLs.
> >
> >      This support is added through the already existing loop-doloop
> >      mechanisms that are used for non-MVE dls/le looping.
> >
> >      Mid-end changes are:
> >
> >      1) Relax the loop-doloop mechanism in the mid-end to allow for
> >         decrement numbers other that -1 and for `count` to be an
> >         rtx containing a simple REG (which in this case will contain
> >         the number of elements to be processed), rather
> >         than an expression for calculating the number of iterations.
> >      2) Added a new df utility function: `df_bb_regno_only_def_find` 
> that
> >         will return the DEF of a REG only if it is DEF-ed once 
> within the
> >         basic block.
> >
> >      And many things in the backend to implement the above 
> optimisation:
> >
> >      3)  Implement the `arm_predict_doloop_p` target hook to 
> instruct the
> >          mid-end about Low Overhead Loops (MVE or not), as well as
> >          `arm_loop_unroll_adjust` which will prevent unrolling of 
> any loops
> >          that are valid for becoming MVE Tail_Predicated Low 
> Overhead Loops
> >          (unrolling can transform a loop in ways that invalidate the 
> dlstp/
> >          letp tranformation logic and the benefit of the dlstp/letp 
> loop
> >          would be considerably higher than that of unrolling)
> >      4)  Appropriate changes to the define_expand of doloop_end, new
> >          patterns for dlstp and letp, new iterators, unspecs, etc.
> >      5) `arm_mve_loop_valid_for_dlstp` and a number of checking 
> functions:
> >         * `arm_mve_dlstp_check_dec_counter`
> >         * `arm_mve_dlstp_check_inc_counter`
> >         * `arm_mve_check_reg_origin_is_num_elems`
> >         * `arm_mve_check_df_chain_back_for_implic_predic`
> >         * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
> >         This all, in smoe way or another, are running checks on the 
> loop
> >         structure in order to determine if the loop is valid for 
> dlstp/letp
> >         transformation.
> >      6) `arm_attempt_dlstp_transform`: (called from the 
> define_expand of
> >          doloop_end) this function re-checks for the loop's 
> suitability for
> >          dlstp/letp transformation and then implements it, if possible.
> >      7) Various utility functions:
> >         *`arm_mve_get_vctp_lanes` to map
> >         from vctp unspecs to number of lanes, and
> > `arm_get_required_vpr_reg`
> >         to check an insn to see if it requires the VPR or not.
> >         * `arm_mve_get_loop_vctp`
> >         * `arm_mve_get_vctp_lanes`
> >         * `arm_emit_mve_unpredicated_insn_to_seq`
> >         * `arm_get_required_vpr_reg`
> >         * `arm_get_required_vpr_reg_param`
> >         * `arm_get_required_vpr_reg_ret_val`
> >         * `arm_mve_vec_insn_is_predicated_with_this_predicate`
> >         * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
> >
> >      No regressions on arm-none-eabi with various targets and on
> >      aarch64-none-elf. Thoughts on getting this into trunk?
> >
> >      Thank you,
> >      Stam Markianos-Wright
> >
> >      gcc/ChangeLog:
> >
> >              * config/arm/arm-protos.h (arm_target_insn_ok_for_lob):
> > Rename to...
> >              (arm_target_bb_ok_for_lob): ...this
> >              (arm_attempt_dlstp_transform): New.
> >              * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
> >              (TARGET_PREDICT_DOLOOP_P): New.
> >              (arm_block_set_vect):
> >              (arm_target_insn_ok_for_lob): Rename from
> > arm_target_insn_ok_for_lob.
> >              (arm_target_bb_ok_for_lob): New.
> >              (arm_mve_get_vctp_lanes): New.
> >              (arm_get_required_vpr_reg): New.
> >              (arm_get_required_vpr_reg_param): New.
> >              (arm_get_required_vpr_reg_ret_val): New.
> >              (arm_mve_get_loop_vctp): New.
> > (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
> > (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
> >              (arm_mve_check_df_chain_back_for_implic_predic): New.
> > (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
> >              (arm_mve_check_reg_origin_is_num_elems): New.
> >              (arm_mve_dlstp_check_inc_counter): New.
> >              (arm_mve_dlstp_check_dec_counter): New.
> >              (arm_mve_loop_valid_for_dlstp): New.
> >              (arm_predict_doloop_p): New.
> >              (arm_loop_unroll_adjust): New.
> >              (arm_emit_mve_unpredicated_insn_to_seq): New.
> >              (arm_attempt_dlstp_transform): New.
> >              * config/arm/iterators.md (DLSTP): New.
> >              (mode1): Add DLSTP mappings.
> >              * config/arm/mve.md (*predicated_doloop_end_internal): 
> New.
> >              (dlstp<mode1>_insn): New.
> >              * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
> >              * config/arm/unspecs.md: New unspecs.
> >              * df-core.cc (df_bb_regno_only_def_find): New.
> >              * df.h (df_bb_regno_only_def_find): New.
> >              * loop-doloop.cc (doloop_condition_get): Relax conditions.
> >              (doloop_optimize): Add support for elementwise LoLs.
> >
> >      gcc/testsuite/ChangeLog:
> >
> >              * gcc.target/arm/lob.h: Update framework.
> >              * gcc.target/arm/lob1.c: Likewise.
> >              * gcc.target/arm/lob6.c: Likewise.
> >              * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
> >              * gcc.target/arm/mve/dlstp-int16x8.c: New test.
> >              * gcc.target/arm/mve/dlstp-int32x4.c: New test.
> >              * gcc.target/arm/mve/dlstp-int64x2.c: New test.
> >              * gcc.target/arm/mve/dlstp-int8x16.c: New test.
> >              * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-06-15 11:47 Stamatis Markianos-Wright
  2023-06-22 15:54 ` Andre Vieira (lists)
  2023-06-23 10:23 ` Andre Vieira (lists)
@ 2023-06-23 16:25 ` Andre Vieira (lists)
  2023-07-05 16:41   ` Stamatis Markianos-Wright
  2 siblings, 1 reply; 14+ messages in thread
From: Andre Vieira (lists) @ 2023-06-23 16:25 UTC (permalink / raw)
  To: Stamatis Markianos-Wright, gcc-patches
  Cc: Kyrylo Tkachov, Richard Earnshaw, ramana.gcc, nickc

+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  gcc_assert (cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	      && GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE);

Not sure this should be an assert. If we do encounter a differently 
formed loop, we should bail out of DLSTPing for now but we shouldn't ICE.


+  /* The loop latch has to be empty.  When compiling all the known MVE 
LoLs in
+     user applications, none of those with incrementing counters had 
any real
+     insns in the loop latch.  As such, this function has only been 
tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for sanity, error out early.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  if (NONDEBUG_INSN_P (dec_insn))
+    gcc_unreachable ();

Similarly here I'd return false rather than gcc_unreachable ();


+  /* Find where both of those are modified in the loop body bb.  */
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+						 (body, REGNO (condcount))));
Put = on newline, breaks it down nicer.

+	  counter_orig_set = XEXP (PATTERN
+				    (DF_REF_INSN
+				      (DF_REF_NEXT_REG
+					(DF_REG_DEF_CHAIN
+					 (REGNO
+					   (XEXP (condcount_reg_set, 0)))))), 1);

This makes me a bit nervous, can we be certain that the PATTERN of the 
next insn that sets it is indeed a set. Heck can we even be sure 
DF_REG_DEF_CHAIN returns a non-null, I can't imagine why not but maybe 
there are some constructs it can't follow-up on? Might just be worth 
checking these steps and bailing out.



+      /* When we find the vctp instruction: This may be followed by
+      a zero-extend insn to SImode.  If it is, then save the
+      zero-extended REG into vctp_vpr_generated.  If there is no
+      zero-extend, then store the raw output of the vctp.
+      For any VPT-predicated instructions we need to ensure that
+      the VPR they use is the same as the one given here and
+      they often consume the output of a subreg of the SImode
+      zero-extended VPR-reg.  As a result, comparing against the
+      output of the zero-extend is more likely to succeed.
+      This code also guarantees to us that the vctp comes before
+      any instructions that use the VPR within the loop, for the
+      dlstp/letp transform to succeed.  */

Wrong comment indent after first line.

+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn || !arm_mve_loop_valid_for_dlstp (body))
+    return GEN_INT (1);

arm_mve_loop_valid_for_dlstp already calls arm_mve_get_loop_vctp, maybe 
have 'arm_mve_loop_valid_for_dlstp' return vctp_insn or null to 
determine success or failure, avoids looping through the BB again.

For the same reason I'd also pass vctp_insn down to 
'arm_mve_check_df_chain_back_for_implic_predic'.

+	  if (GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+	    {
+	      rtx_insn *next_use2 = NULL;

Are we sure single_set can never return 0 here? Maybe worth an extra 
check and bail out if it does?

+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter.  */
s/an an input parameter./as an input parameter for predication./

+	  /* None of registers USE-d by the instruction need can be the VPR
+	     vctp_vpr_generated.  This blocks the optimisation if there any
+	     instructions that use the optimised-out VPR value in any way
+	     other than as a VPT block predicate.  */

Reword this slightly to be less complex:
This instruction USE-s the vctp_vpr_generated other than for 
predication, this blocks the transformation as we are not allowed to 
optimise the VPR value away.

Will continue reviewing next week :)

On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
>      Hi all,
> 
>      This is the 2/2 patch that contains the functional changes needed
>      for MVE Tail Predicated Low Overhead Loops.  See my previous email
>      for a general introduction of MVE LOLs.
> 
>      This support is added through the already existing loop-doloop
>      mechanisms that are used for non-MVE dls/le looping.
> 
>      Mid-end changes are:
> 
>      1) Relax the loop-doloop mechanism in the mid-end to allow for
>         decrement numbers other that -1 and for `count` to be an
>         rtx containing a simple REG (which in this case will contain
>         the number of elements to be processed), rather
>         than an expression for calculating the number of iterations.
>      2) Added a new df utility function: `df_bb_regno_only_def_find` that
>         will return the DEF of a REG only if it is DEF-ed once within the
>         basic block.
> 
>      And many things in the backend to implement the above optimisation:
> 
>      3)  Implement the `arm_predict_doloop_p` target hook to instruct the
>          mid-end about Low Overhead Loops (MVE or not), as well as
>          `arm_loop_unroll_adjust` which will prevent unrolling of any loops
>          that are valid for becoming MVE Tail_Predicated Low Overhead Loops
>          (unrolling can transform a loop in ways that invalidate the dlstp/
>          letp tranformation logic and the benefit of the dlstp/letp loop
>          would be considerably higher than that of unrolling)
>      4)  Appropriate changes to the define_expand of doloop_end, new
>          patterns for dlstp and letp, new iterators,  unspecs, etc.
>      5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
>         * `arm_mve_dlstp_check_dec_counter`
>         * `arm_mve_dlstp_check_inc_counter`
>         * `arm_mve_check_reg_origin_is_num_elems`
>         * `arm_mve_check_df_chain_back_for_implic_predic`
>         * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
>         This all, in smoe way or another, are running checks on the loop
>         structure in order to determine if the loop is valid for dlstp/letp
>         transformation.
>      6) `arm_attempt_dlstp_transform`: (called from the define_expand of
>          doloop_end) this function re-checks for the loop's suitability for
>          dlstp/letp transformation and then implements it, if possible.
>      7) Various utility functions:
>         *`arm_mve_get_vctp_lanes` to map
>         from vctp unspecs to number of lanes, and 
> `arm_get_required_vpr_reg`
>         to check an insn to see if it requires the VPR or not.
>         * `arm_mve_get_loop_vctp`
>         * `arm_mve_get_vctp_lanes`
>         * `arm_emit_mve_unpredicated_insn_to_seq`
>         * `arm_get_required_vpr_reg`
>         * `arm_get_required_vpr_reg_param`
>         * `arm_get_required_vpr_reg_ret_val`
>         * `arm_mve_vec_insn_is_predicated_with_this_predicate`
>         * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
> 
>      No regressions on arm-none-eabi with various targets and on
>      aarch64-none-elf. Thoughts on getting this into trunk?
> 
>      Thank you,
>      Stam Markianos-Wright
> 
>      gcc/ChangeLog:
> 
>              * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): 
> Rename to...
>              (arm_target_bb_ok_for_lob): ...this
>              (arm_attempt_dlstp_transform): New.
>              * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
>              (TARGET_PREDICT_DOLOOP_P): New.
>              (arm_block_set_vect):
>              (arm_target_insn_ok_for_lob): Rename from 
> arm_target_insn_ok_for_lob.
>              (arm_target_bb_ok_for_lob): New.
>              (arm_mve_get_vctp_lanes): New.
>              (arm_get_required_vpr_reg): New.
>              (arm_get_required_vpr_reg_param): New.
>              (arm_get_required_vpr_reg_ret_val): New.
>              (arm_mve_get_loop_vctp): New.
> (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
>              (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
>              (arm_mve_check_df_chain_back_for_implic_predic): New.
>              (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
>              (arm_mve_check_reg_origin_is_num_elems): New.
>              (arm_mve_dlstp_check_inc_counter): New.
>              (arm_mve_dlstp_check_dec_counter): New.
>              (arm_mve_loop_valid_for_dlstp): New.
>              (arm_predict_doloop_p): New.
>              (arm_loop_unroll_adjust): New.
>              (arm_emit_mve_unpredicated_insn_to_seq): New.
>              (arm_attempt_dlstp_transform): New.
>              * config/arm/iterators.md (DLSTP): New.
>              (mode1): Add DLSTP mappings.
>              * config/arm/mve.md (*predicated_doloop_end_internal): New.
>              (dlstp<mode1>_insn): New.
>              * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>              * config/arm/unspecs.md: New unspecs.
>              * df-core.cc (df_bb_regno_only_def_find): New.
>              * df.h (df_bb_regno_only_def_find): New.
>              * loop-doloop.cc (doloop_condition_get): Relax conditions.
>              (doloop_optimize): Add support for elementwise LoLs.
> 
>      gcc/testsuite/ChangeLog:
> 
>              * gcc.target/arm/lob.h: Update framework.
>              * gcc.target/arm/lob1.c: Likewise.
>              * gcc.target/arm/lob6.c: Likewise.
>              * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
>              * gcc.target/arm/mve/dlstp-int16x8.c: New test.
>              * gcc.target/arm/mve/dlstp-int32x4.c: New test.
>              * gcc.target/arm/mve/dlstp-int64x2.c: New test.
>              * gcc.target/arm/mve/dlstp-int8x16.c: New test.
>              * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-06-15 11:47 Stamatis Markianos-Wright
  2023-06-22 15:54 ` Andre Vieira (lists)
@ 2023-06-23 10:23 ` Andre Vieira (lists)
  2023-07-05 16:13   ` Stamatis Markianos-Wright
  2023-06-23 16:25 ` Andre Vieira (lists)
  2 siblings, 1 reply; 14+ messages in thread
From: Andre Vieira (lists) @ 2023-06-23 10:23 UTC (permalink / raw)
  To: Stamatis Markianos-Wright, gcc-patches
  Cc: Kyrylo Tkachov, Richard Earnshaw, ramana.gcc, nickc

+  if (insn != arm_mve_get_loop_vctp (body))
+    {

probably a good idea to invert the condition here and return false, 
helps reducing the indenting in this function.


+	/* Starting from the current insn, scan backwards through the insn
+	   chain until BB_HEAD: "for each insn in the BB prior to the current".
+	*/

There's a trailing whitespace after insn, but also I'd rewrite this bit. 
The "for each insn in the BB prior to the current" is superfluous and 
even confusing to me. How about:
"Scan backwards from the current INSN through the instruction chain 
until the start of the basic block.  "


  I find 'that previous insn' to be confusing as you don't mention any 
previous insn before. So how about something along the lines of:
'If a previous insn defines a register that INSN uses then return true 
if...'


Do we need to check: 'insn != prev_insn' ? Any reason why you can't 
start the loop with:
'for (rtx_insn *prev_insn = PREV_INSN (insn);'

Now I also found a case where things might go wrong in:
+	    /* Look at all the DEFs of that previous insn: if one of them is on
+	       the same REG as our current insn, then recurse in order to check
+	       that insn's USEs.  If any of these insns return true as
+	       MVE_VPT_UNPREDICATED_INSN_Ps, then the whole chain is affected
+	       by the change in behaviour from being placed in dlstp/letp loop.
+	    */
+	    df_ref prev_insn_defs = NULL;
+	    FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	      {
+		if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+		    && insn != prev_insn
+		    && body == BLOCK_FOR_INSN (prev_insn)
+		    && !arm_mve_vec_insn_is_predicated_with_this_predicate
+			 (insn, vctp_vpr_generated)
+		    && arm_mve_check_df_chain_back_for_implic_predic
+			 (prev_insn, vctp_vpr_generated))
+		  return true;
+	      }

The body == BLOCK_FOR_INSN (prev_insn) hinted me at it, if a def comes 
from outside of the BB (so outside of the loop's body) then its by 
definition unpredicated by vctp.  I think you want to check that if 
prev_insn defines a register used by insn then return true if prev_insn 
isn't in the same BB or has a chain that is not predicated, i.e.: 
'!arm_mve_vec_insn_is_predicated_with_this_predicate (insn, 
vctp_vpr_generated) && arm_mve_check_df_chain_back_for_implic_predic 
prev_insn, vctp_vpr_generated))' you check body != BLOCK_FOR_INSN 
(prev_insn)'


I also found some other issues, this currently loloops:

uint16_t  test (uint16_t *a, int n)
{
   uint16_t res =0;
   while (n > 0)
     {
       mve_pred16_t p = vctp16q (n);
       uint16x8_t va = vldrhq_u16 (a);
       res = vaddvaq_u16 (res, va);
       res = vaddvaq_p_u16 (res, va, p);
       a += 8;
       n -= 8;
     }
   return res;
}

But it shouldn't, this is because there's a lack of handling of across 
vector instructions. Luckily in MVE all across vector instructions have 
the side-effect that they write to a scalar register, even the vshlcq 
instruction (it writes to a scalar carry output).

Did this lead me to find an ICE with:

uint16x8_t  test (uint16_t *a, int n)
{
   uint16x8_t res = vdupq_n_u16 (0);
   while (n > 0)
     {
       uint16_t carry = 0;
       mve_pred16_t p = vctp16q (n);
       uint16x8_t va = vldrhq_u16 (a);
       res = vshlcq_u16 (va, &carry, 1);
       res = vshlcq_m_u16 (res, &carry, 1 , p);
       a += 8;
       n -= 8;
     }
   return res;
}

This is because:
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is an unpredicated store to memory.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		 || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		     (next_use_insn, vctp_vpr_generated)
+		    && mve_memory_operand
+			(SET_DEST (single_set (next_use_insn)),
+			 GET_MODE (SET_DEST (single_set (next_use_insn))))))
+		return true;

Assumes single_set doesn't return 0.

Let's deal with these issues and I'll continue to review.

On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
>      Hi all,
> 
>      This is the 2/2 patch that contains the functional changes needed
>      for MVE Tail Predicated Low Overhead Loops.  See my previous email
>      for a general introduction of MVE LOLs.
> 
>      This support is added through the already existing loop-doloop
>      mechanisms that are used for non-MVE dls/le looping.
> 
>      Mid-end changes are:
> 
>      1) Relax the loop-doloop mechanism in the mid-end to allow for
>         decrement numbers other that -1 and for `count` to be an
>         rtx containing a simple REG (which in this case will contain
>         the number of elements to be processed), rather
>         than an expression for calculating the number of iterations.
>      2) Added a new df utility function: `df_bb_regno_only_def_find` that
>         will return the DEF of a REG only if it is DEF-ed once within the
>         basic block.
> 
>      And many things in the backend to implement the above optimisation:
> 
>      3)  Implement the `arm_predict_doloop_p` target hook to instruct the
>          mid-end about Low Overhead Loops (MVE or not), as well as
>          `arm_loop_unroll_adjust` which will prevent unrolling of any loops
>          that are valid for becoming MVE Tail_Predicated Low Overhead Loops
>          (unrolling can transform a loop in ways that invalidate the dlstp/
>          letp tranformation logic and the benefit of the dlstp/letp loop
>          would be considerably higher than that of unrolling)
>      4)  Appropriate changes to the define_expand of doloop_end, new
>          patterns for dlstp and letp, new iterators,  unspecs, etc.
>      5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
>         * `arm_mve_dlstp_check_dec_counter`
>         * `arm_mve_dlstp_check_inc_counter`
>         * `arm_mve_check_reg_origin_is_num_elems`
>         * `arm_mve_check_df_chain_back_for_implic_predic`
>         * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
>         This all, in smoe way or another, are running checks on the loop
>         structure in order to determine if the loop is valid for dlstp/letp
>         transformation.
>      6) `arm_attempt_dlstp_transform`: (called from the define_expand of
>          doloop_end) this function re-checks for the loop's suitability for
>          dlstp/letp transformation and then implements it, if possible.
>      7) Various utility functions:
>         *`arm_mve_get_vctp_lanes` to map
>         from vctp unspecs to number of lanes, and 
> `arm_get_required_vpr_reg`
>         to check an insn to see if it requires the VPR or not.
>         * `arm_mve_get_loop_vctp`
>         * `arm_mve_get_vctp_lanes`
>         * `arm_emit_mve_unpredicated_insn_to_seq`
>         * `arm_get_required_vpr_reg`
>         * `arm_get_required_vpr_reg_param`
>         * `arm_get_required_vpr_reg_ret_val`
>         * `arm_mve_vec_insn_is_predicated_with_this_predicate`
>         * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
> 
>      No regressions on arm-none-eabi with various targets and on
>      aarch64-none-elf. Thoughts on getting this into trunk?
> 
>      Thank you,
>      Stam Markianos-Wright
> 
>      gcc/ChangeLog:
> 
>              * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): 
> Rename to...
>              (arm_target_bb_ok_for_lob): ...this
>              (arm_attempt_dlstp_transform): New.
>              * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
>              (TARGET_PREDICT_DOLOOP_P): New.
>              (arm_block_set_vect):
>              (arm_target_insn_ok_for_lob): Rename from 
> arm_target_insn_ok_for_lob.
>              (arm_target_bb_ok_for_lob): New.
>              (arm_mve_get_vctp_lanes): New.
>              (arm_get_required_vpr_reg): New.
>              (arm_get_required_vpr_reg_param): New.
>              (arm_get_required_vpr_reg_ret_val): New.
>              (arm_mve_get_loop_vctp): New.
> (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
>              (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
>              (arm_mve_check_df_chain_back_for_implic_predic): New.
>              (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
>              (arm_mve_check_reg_origin_is_num_elems): New.
>              (arm_mve_dlstp_check_inc_counter): New.
>              (arm_mve_dlstp_check_dec_counter): New.
>              (arm_mve_loop_valid_for_dlstp): New.
>              (arm_predict_doloop_p): New.
>              (arm_loop_unroll_adjust): New.
>              (arm_emit_mve_unpredicated_insn_to_seq): New.
>              (arm_attempt_dlstp_transform): New.
>              * config/arm/iterators.md (DLSTP): New.
>              (mode1): Add DLSTP mappings.
>              * config/arm/mve.md (*predicated_doloop_end_internal): New.
>              (dlstp<mode1>_insn): New.
>              * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>              * config/arm/unspecs.md: New unspecs.
>              * df-core.cc (df_bb_regno_only_def_find): New.
>              * df.h (df_bb_regno_only_def_find): New.
>              * loop-doloop.cc (doloop_condition_get): Relax conditions.
>              (doloop_optimize): Add support for elementwise LoLs.
> 
>      gcc/testsuite/ChangeLog:
> 
>              * gcc.target/arm/lob.h: Update framework.
>              * gcc.target/arm/lob1.c: Likewise.
>              * gcc.target/arm/lob6.c: Likewise.
>              * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
>              * gcc.target/arm/mve/dlstp-int16x8.c: New test.
>              * gcc.target/arm/mve/dlstp-int32x4.c: New test.
>              * gcc.target/arm/mve/dlstp-int64x2.c: New test.
>              * gcc.target/arm/mve/dlstp-int8x16.c: New test.
>              * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
  2023-06-15 11:47 Stamatis Markianos-Wright
@ 2023-06-22 15:54 ` Andre Vieira (lists)
  2023-07-05 16:11   ` Stamatis Markianos-Wright
  2023-06-23 10:23 ` Andre Vieira (lists)
  2023-06-23 16:25 ` Andre Vieira (lists)
  2 siblings, 1 reply; 14+ messages in thread
From: Andre Vieira (lists) @ 2023-06-22 15:54 UTC (permalink / raw)
  To: Stamatis Markianos-Wright, gcc-patches
  Cc: Kyrylo Tkachov, Richard Earnshaw, ramana.gcc, nickc

Some comments below, all quite minor. I'll continue to review tomorrow, 
I need a fresher brain for arm_mve_check_df_chain_back_for_implic_predic 
  ;)

+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == 
VCTP_M))
+    {
+      switch (GET_MODE (XEXP (x, 1)))
+	{
+	  case V16BImode:
+	    return 16;
+	  case V8BImode:
+	    return 8;
+	  case V4BImode:
+	    return 4;
+	  case V2QImode:
+	    return 2;
+	  default:
+	    break;
+	}
+    }
+  return 0;
+}

I think you can replace the switch with something along the lines of:
machine_mode mode = GET_MODE (XEXP (x, 1));
return VECTOR_MODE_P (mode) ? GET_MODE_NUNITS (mode) : 0;


+/* Check if an insn requires the use of the VPR_REG, if it does, return the
+   sub-rtx of the VPR_REG.  The `type` argument controls whether
+   this function should:
+   * For type == 0, check all operands, including the OUT operands,
+     and return the first occurance of the VPR_REG.

s/occurance/occurrence/

+	  bool requires_vpr;
+  extract_constrain_insn (insn);

indent of requires_vpr is off.

+      if (type == 1 && (recog_data.operand_type[op] == OP_OUT
+			|| recog_data.operand_type[op] == OP_INOUT))
+	continue;
+      else if (type == 2 && (recog_data.operand_type[op] == OP_IN
+			     || recog_data.operand_type[op] == OP_INOUT))
+	continue;

Why skip INOUT? I guess this will become clear when I see the uses, but 
I'm wondering whether 'only check the input operands.' is clear enough. 
Maybe 'check operands that are input only.' would be more accurate?

+	  /* Fetch the reg_class for each entry and check it against the
+	   * VPR_REG reg_class.  */

Remove leading * on the second line.

+
+/* Wrapper function of arm_get_required_vpr_reg with type == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */

When talking about a function parameter in comments capitalize (INSN) 
the name. Same for:

+/* Wrapper function of arm_get_required_vpr_reg with type == 2, so return
+   something only if the VPR reg is the retrurn value, an output of, or is
+   clobbered by the insn.  */

+/* Return true if an insn is an MVE instruction that VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than vpr_reg.  */

In this one also 'is a MVE instruction that is VPT-predicable' would be 
better I think.


On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
 >      Hi all,
 >
 >      This is the 2/2 patch that contains the functional changes needed
 >      for MVE Tail Predicated Low Overhead Loops.  See my previous email
 >      for a general introduction of MVE LOLs.
 >
 >      This support is added through the already existing loop-doloop
 >      mechanisms that are used for non-MVE dls/le looping.
 >
 >      Mid-end changes are:
 >
 >      1) Relax the loop-doloop mechanism in the mid-end to allow for
 >         decrement numbers other that -1 and for `count` to be an
 >         rtx containing a simple REG (which in this case will contain
 >         the number of elements to be processed), rather
 >         than an expression for calculating the number of iterations.
 >      2) Added a new df utility function: `df_bb_regno_only_def_find` that
 >         will return the DEF of a REG only if it is DEF-ed once within the
 >         basic block.
 >
 >      And many things in the backend to implement the above optimisation:
 >
 >      3)  Implement the `arm_predict_doloop_p` target hook to instruct the
 >          mid-end about Low Overhead Loops (MVE or not), as well as
 >          `arm_loop_unroll_adjust` which will prevent unrolling of any 
loops
 >          that are valid for becoming MVE Tail_Predicated Low Overhead 
Loops
 >          (unrolling can transform a loop in ways that invalidate the 
dlstp/
 >          letp tranformation logic and the benefit of the dlstp/letp loop
 >          would be considerably higher than that of unrolling)
 >      4)  Appropriate changes to the define_expand of doloop_end, new
 >          patterns for dlstp and letp, new iterators,  unspecs, etc.
 >      5) `arm_mve_loop_valid_for_dlstp` and a number of checking 
functions:
 >         * `arm_mve_dlstp_check_dec_counter`
 >         * `arm_mve_dlstp_check_inc_counter`
 >         * `arm_mve_check_reg_origin_is_num_elems`
 >         * `arm_mve_check_df_chain_back_for_implic_predic`
 >         * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
 >         This all, in smoe way or another, are running checks on the loop
 >         structure in order to determine if the loop is valid for 
dlstp/letp
 >         transformation.
 >      6) `arm_attempt_dlstp_transform`: (called from the define_expand of
 >          doloop_end) this function re-checks for the loop's 
suitability for
 >          dlstp/letp transformation and then implements it, if possible.
 >      7) Various utility functions:
 >         *`arm_mve_get_vctp_lanes` to map
 >         from vctp unspecs to number of lanes, and
 > `arm_get_required_vpr_reg`
 >         to check an insn to see if it requires the VPR or not.
 >         * `arm_mve_get_loop_vctp`
 >         * `arm_mve_get_vctp_lanes`
 >         * `arm_emit_mve_unpredicated_insn_to_seq`
 >         * `arm_get_required_vpr_reg`
 >         * `arm_get_required_vpr_reg_param`
 >         * `arm_get_required_vpr_reg_ret_val`
 >         * `arm_mve_vec_insn_is_predicated_with_this_predicate`
 >         * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
 >
 >      No regressions on arm-none-eabi with various targets and on
 >      aarch64-none-elf. Thoughts on getting this into trunk?
 >
 >      Thank you,
 >      Stam Markianos-Wright
 >
 >      gcc/ChangeLog:
 >
 >              * config/arm/arm-protos.h (arm_target_insn_ok_for_lob):
 > Rename to...
 >              (arm_target_bb_ok_for_lob): ...this
 >              (arm_attempt_dlstp_transform): New.
 >              * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
 >              (TARGET_PREDICT_DOLOOP_P): New.
 >              (arm_block_set_vect):
 >              (arm_target_insn_ok_for_lob): Rename from
 > arm_target_insn_ok_for_lob.
 >              (arm_target_bb_ok_for_lob): New.
 >              (arm_mve_get_vctp_lanes): New.
 >              (arm_get_required_vpr_reg): New.
 >              (arm_get_required_vpr_reg_param): New.
 >              (arm_get_required_vpr_reg_ret_val): New.
 >              (arm_mve_get_loop_vctp): New.
 > (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
 >              (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
 >              (arm_mve_check_df_chain_back_for_implic_predic): New.
 >              (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
 >              (arm_mve_check_reg_origin_is_num_elems): New.
 >              (arm_mve_dlstp_check_inc_counter): New.
 >              (arm_mve_dlstp_check_dec_counter): New.
 >              (arm_mve_loop_valid_for_dlstp): New.
 >              (arm_predict_doloop_p): New.
 >              (arm_loop_unroll_adjust): New.
 >              (arm_emit_mve_unpredicated_insn_to_seq): New.
 >              (arm_attempt_dlstp_transform): New.
 >              * config/arm/iterators.md (DLSTP): New.
 >              (mode1): Add DLSTP mappings.
 >              * config/arm/mve.md (*predicated_doloop_end_internal): New.
 >              (dlstp<mode1>_insn): New.
 >              * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
 >              * config/arm/unspecs.md: New unspecs.
 >              * df-core.cc (df_bb_regno_only_def_find): New.
 >              * df.h (df_bb_regno_only_def_find): New.
 >              * loop-doloop.cc (doloop_condition_get): Relax conditions.
 >              (doloop_optimize): Add support for elementwise LoLs.
 >
 >      gcc/testsuite/ChangeLog:
 >
 >              * gcc.target/arm/lob.h: Update framework.
 >              * gcc.target/arm/lob1.c: Likewise.
 >              * gcc.target/arm/lob6.c: Likewise.
 >              * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
 >              * gcc.target/arm/mve/dlstp-int16x8.c: New test.
 >              * gcc.target/arm/mve/dlstp-int32x4.c: New test.
 >              * gcc.target/arm/mve/dlstp-int64x2.c: New test.
 >              * gcc.target/arm/mve/dlstp-int8x16.c: New test.
 >              * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
@ 2023-06-15 11:47 Stamatis Markianos-Wright
  2023-06-22 15:54 ` Andre Vieira (lists)
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Stamatis Markianos-Wright @ 2023-06-15 11:47 UTC (permalink / raw)
  To: gcc-patches; +Cc: Kyrylo Tkachov, Richard Earnshaw, ramana.gcc, nickc

[-- Attachment #1: Type: text/plain, Size: 6356 bytes --]

     Hi all,

     This is the 2/2 patch that contains the functional changes needed
     for MVE Tail Predicated Low Overhead Loops.  See my previous email
     for a general introduction of MVE LOLs.

     This support is added through the already existing loop-doloop
     mechanisms that are used for non-MVE dls/le looping.

     Mid-end changes are:

     1) Relax the loop-doloop mechanism in the mid-end to allow for
        decrement numbers other that -1 and for `count` to be an
        rtx containing a simple REG (which in this case will contain
        the number of elements to be processed), rather
        than an expression for calculating the number of iterations.
     2) Added a new df utility function: `df_bb_regno_only_def_find` that
        will return the DEF of a REG only if it is DEF-ed once within the
        basic block.

     And many things in the backend to implement the above optimisation:

     3)  Implement the `arm_predict_doloop_p` target hook to instruct the
         mid-end about Low Overhead Loops (MVE or not), as well as
         `arm_loop_unroll_adjust` which will prevent unrolling of any loops
         that are valid for becoming MVE Tail_Predicated Low Overhead Loops
         (unrolling can transform a loop in ways that invalidate the dlstp/
         letp tranformation logic and the benefit of the dlstp/letp loop
         would be considerably higher than that of unrolling)
     4)  Appropriate changes to the define_expand of doloop_end, new
         patterns for dlstp and letp, new iterators,  unspecs, etc.
     5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
        * `arm_mve_dlstp_check_dec_counter`
        * `arm_mve_dlstp_check_inc_counter`
        * `arm_mve_check_reg_origin_is_num_elems`
        * `arm_mve_check_df_chain_back_for_implic_predic`
        * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
        This all, in smoe way or another, are running checks on the loop
        structure in order to determine if the loop is valid for dlstp/letp
        transformation.
     6) `arm_attempt_dlstp_transform`: (called from the define_expand of
         doloop_end) this function re-checks for the loop's suitability for
         dlstp/letp transformation and then implements it, if possible.
     7) Various utility functions:
        *`arm_mve_get_vctp_lanes` to map
        from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
        to check an insn to see if it requires the VPR or not.
        * `arm_mve_get_loop_vctp`
        * `arm_mve_get_vctp_lanes`
        * `arm_emit_mve_unpredicated_insn_to_seq`
        * `arm_get_required_vpr_reg`
        * `arm_get_required_vpr_reg_param`
        * `arm_get_required_vpr_reg_ret_val`
        * `arm_mve_vec_insn_is_predicated_with_this_predicate`
        * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`

     No regressions on arm-none-eabi with various targets and on
     aarch64-none-elf. Thoughts on getting this into trunk?

     Thank you,
     Stam Markianos-Wright

     gcc/ChangeLog:

             * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): 
Rename to...
             (arm_target_bb_ok_for_lob): ...this
             (arm_attempt_dlstp_transform): New.
             * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
             (TARGET_PREDICT_DOLOOP_P): New.
             (arm_block_set_vect):
             (arm_target_insn_ok_for_lob): Rename from 
arm_target_insn_ok_for_lob.
             (arm_target_bb_ok_for_lob): New.
             (arm_mve_get_vctp_lanes): New.
             (arm_get_required_vpr_reg): New.
             (arm_get_required_vpr_reg_param): New.
             (arm_get_required_vpr_reg_ret_val): New.
             (arm_mve_get_loop_vctp): New.
(arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
             (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
             (arm_mve_check_df_chain_back_for_implic_predic): New.
             (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
             (arm_mve_check_reg_origin_is_num_elems): New.
             (arm_mve_dlstp_check_inc_counter): New.
             (arm_mve_dlstp_check_dec_counter): New.
             (arm_mve_loop_valid_for_dlstp): New.
             (arm_predict_doloop_p): New.
             (arm_loop_unroll_adjust): New.
             (arm_emit_mve_unpredicated_insn_to_seq): New.
             (arm_attempt_dlstp_transform): New.
             * config/arm/iterators.md (DLSTP): New.
             (mode1): Add DLSTP mappings.
             * config/arm/mve.md (*predicated_doloop_end_internal): New.
             (dlstp<mode1>_insn): New.
             * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
             * config/arm/unspecs.md: New unspecs.
             * df-core.cc (df_bb_regno_only_def_find): New.
             * df.h (df_bb_regno_only_def_find): New.
             * loop-doloop.cc (doloop_condition_get): Relax conditions.
             (doloop_optimize): Add support for elementwise LoLs.

     gcc/testsuite/ChangeLog:

             * gcc.target/arm/lob.h: Update framework.
             * gcc.target/arm/lob1.c: Likewise.
             * gcc.target/arm/lob6.c: Likewise.
             * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
             * gcc.target/arm/mve/dlstp-int16x8.c: New test.
             * gcc.target/arm/mve/dlstp-int32x4.c: New test.
             * gcc.target/arm/mve/dlstp-int64x2.c: New test.
             * gcc.target/arm/mve/dlstp-int8x16.c: New test.
             * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

[-- Attachment #2: 2.patch --]
[-- Type: text/x-patch, Size: 91971 bytes --]

commit 4f6b2c886257c4e265fe255145f2ecb17510501b
Author: Stam Markianos-Wright <stam.markianos-wright@arm.com>
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops
    
    This is the 2/2 patch that contains the functional changes needed
    for MVE Tail Predicated Low Overhead Loops.  See my previous email
    for a general introduction of MVE LOLs.
    
    This support is added through the already existing loop-doloop
    mechanisms that are used for non-MVE dls/le looping.
    
    Mid-end changes are:
    
    1) Relax the loop-doloop mechanism in the mid-end to allow for
       decrement numbers other that -1 and for `count` to be an
       rtx containing a simple REG (which in this case will contain
       the number of elements to be processed), rather
       than an expression for calculating the number of iterations.
    2) Added a new df utility function: `df_bb_regno_only_def_find` that
       will return the DEF of a REG only if it is DEF-ed once within the
       basic block.
    
    And many things in the backend to implement the above optimisation:
    
    3)  Implement the `arm_predict_doloop_p` target hook to instruct the
        mid-end about Low Overhead Loops (MVE or not), as well as
        `arm_loop_unroll_adjust` which will prevent unrolling of any loops
        that are valid for becoming MVE Tail_Predicated Low Overhead Loops
        (unrolling can transform a loop in ways that invalidate the dlstp/
        letp tranformation logic and the benefit of the dlstp/letp loop
        would be considerably higher than that of unrolling)
    4)  Appropriate changes to the define_expand of doloop_end, new
        patterns for dlstp and letp, new iterators,  unspecs, etc.
    5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
       * `arm_mve_dlstp_check_dec_counter`
       * `arm_mve_dlstp_check_inc_counter`
       * `arm_mve_check_reg_origin_is_num_elems`
       * `arm_mve_check_df_chain_back_for_implic_predic`
       * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
       This all, in smoe way or another, are running checks on the loop
       structure in order to determine if the loop is valid for dlstp/letp
       transformation.
    6) `arm_attempt_dlstp_transform`: (called from the define_expand of
        doloop_end) this function re-checks for the loop's suitability for
        dlstp/letp transformation and then implements it, if possible.
    7) Various utility functions:
       *`arm_mve_get_vctp_lanes` to map
       from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
       to check an insn to see if it requires the VPR or not.
       * `arm_mve_get_loop_vctp`
       * `arm_mve_get_vctp_lanes`
       * `arm_emit_mve_unpredicated_insn_to_seq`
       * `arm_get_required_vpr_reg`
       * `arm_get_required_vpr_reg_param`
       * `arm_get_required_vpr_reg_ret_val`
       * `arm_mve_vec_insn_is_predicated_with_this_predicate`
       * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
    
    No regressions on arm-none-eabi with various targets and on
    aarch64-none-elf. Thoughts on getting this into trunk?
    
    Thank you,
    Stam Markianos-Wright
    
    gcc/ChangeLog:
    
            * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
            (arm_target_bb_ok_for_lob): ...this
            (arm_attempt_dlstp_transform): New.
            * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
            (TARGET_PREDICT_DOLOOP_P): New.
            (arm_block_set_vect):
            (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
            (arm_target_bb_ok_for_lob): New.
            (arm_mve_get_vctp_lanes): New.
            (arm_get_required_vpr_reg): New.
            (arm_get_required_vpr_reg_param): New.
            (arm_get_required_vpr_reg_ret_val): New.
            (arm_mve_get_loop_vctp): New.
            (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
            (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
            (arm_mve_check_df_chain_back_for_implic_predic): New.
            (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
            (arm_mve_check_reg_origin_is_num_elems): New.
            (arm_mve_dlstp_check_inc_counter): New.
            (arm_mve_dlstp_check_dec_counter): New.
            (arm_mve_loop_valid_for_dlstp): New.
            (arm_predict_doloop_p): New.
            (arm_loop_unroll_adjust): New.
            (arm_emit_mve_unpredicated_insn_to_seq): New.
            (arm_attempt_dlstp_transform): New.
            * config/arm/iterators.md (DLSTP): New.
            (mode1): Add DLSTP mappings.
            * config/arm/mve.md (*predicated_doloop_end_internal): New.
            (dlstp<mode1>_insn): New.
            * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
            * config/arm/unspecs.md: New unspecs.
            * df-core.cc (df_bb_regno_only_def_find): New.
            * df.h (df_bb_regno_only_def_find): New.
            * loop-doloop.cc (doloop_condition_get): Relax conditions.
            (doloop_optimize): Add support for elementwise LoLs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/arm/lob.h: Update framework.
            * gcc.target/arm/lob1.c: Likewise.
            * gcc.target/arm/lob6.c: Likewise.
            * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
            * gcc.target/arm/mve/dlstp-int16x8.c: New test.
            * gcc.target/arm/mve/dlstp-int32x4.c: New test.
            * gcc.target/arm/mve/dlstp-int64x2.c: New test.
            * gcc.target/arm/mve/dlstp-int8x16.c: New test.
            * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 61fcd671437..b4f6538c991 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -64,8 +64,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index c3e731b8982..760635da914 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34416,9 +34422,8 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
@@ -34427,8 +34432,923 @@ arm_target_insn_ok_for_lob (rtx insn)
 
   return single_succ_p (bb)
     && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility fuction: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      switch (GET_MODE (XEXP (x, 1)))
+	{
+	  case V16BImode:
+	    return 16;
+	  case V8BImode:
+	    return 8;
+	  case V4BImode:
+	    return 4;
+	  case V2QImode:
+	    return 2;
+	  default:
+	    break;
+	}
+    }
+  return 0;
+}
+
+/* Check if an insn requires the use of the VPR_REG, if it does, return the
+   sub-rtx of the VPR_REG.  The `type` argument controls whether
+   this function should:
+   * For type == 0, check all operands, including the OUT operands,
+     and return the first occurance of the VPR_REG.
+   * For type == 1, only check the input operands.
+   * For type == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+	  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && (recog_data.operand_type[op] == OP_OUT
+			|| recog_data.operand_type[op] == OP_INOUT))
+	continue;
+      else if (type == 2 && (recog_data.operand_type[op] == OP_IN
+			     || recog_data.operand_type[op] == OP_INOUT))
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	      = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	   * VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with type == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with type == 2, so return
+   something only if the VPR reg is the retrurn value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+	  && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+	  && !arm_get_required_vpr_reg_param (insn))
+	return insn;
+  return NULL;
+}
+
+/* Return true if an insn is an MVE instruction that VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than vpr_reg.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+	  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+	  && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+							  rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   Having such implicit predication on an unpredicated insn isn't in itself
+   an error, because the output of that insn might then be used in a correctly
+   predicated store insn, where the disabled lanes will be ignored.  To verify
+   this we later call arm_mve_check_df_chain_fwd_for_implic_predic_impact,
+   which will check the DF chains forward to see if any implicitly-predicated
+   operand gets used in an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic (rtx_insn *insn,
+					       rtx vctp_vpr_generated)
+{
+  basic_block body = BLOCK_FOR_INSN (insn);
+  /* If we've traced back to the loop vctp, then this operand must be the VPR
+     reg and is safe.  */
+  if (insn != arm_mve_get_loop_vctp (body))
+    {
+      /* The circumstances under which an instruction is affected by "implicit
+	 predication" are as follows:
+	  * It is an UNPREDICATED_INSN_P:
+	    * That loads/stores from/to memory.
+	    * Where any one of its operands are MVE vectors from outside the
+	      loop body bb.
+	Or:
+	  * Any of it's operands, recursively backwards, are affected.  */
+      if (MVE_VPT_UNPREDICATED_INSN_P (insn))
+	{
+	  /* Find if this is a load or a store insn.  */
+	  extract_insn (insn);
+	  int n_operands = recog_data.n_operands;
+	  for (int op = 0; op < n_operands; op++)
+	    if (mve_memory_operand (recog_data.operand[op],
+				      GET_MODE (recog_data.operand[op])))
+	      return true;
+	}
+
+      df_ref insn_uses = NULL;
+      FOR_EACH_INSN_USE (insn_uses, insn)
+	{
+	  /* If the insn is unpredicated, the operand is in the input reg set
+	     to the the basic block and is an MVE vector, consider it unsafe.
+	  */
+	if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+	    && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+	    && REGNO_REG_SET_P (df_get_live_in (body), DF_REF_REGNO
+							(insn_uses)))
+	  return true;
+
+	/* Starting from the current insn, scan backwards through the insn 
+	   chain until BB_HEAD: "for each insn in the BB prior to the current".
+	*/
+	rtx_insn *prev_insn = NULL;
+	for (prev_insn = insn;
+	     prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+	     prev_insn = PREV_INSN (prev_insn))
+	  {
+	    /* Look at all the DEFs of that previous insn: if one of them is on
+	       the same REG as our current insn, then recurse in order to check
+	       that insn's USEs.  If any of these insns return true as
+	       MVE_VPT_UNPREDICATED_INSN_Ps, then the whole chain is affected
+	       by the change in behaviour from being placed in dlstp/letp loop.
+	    */
+	    df_ref prev_insn_defs = NULL;
+	    FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+	      {
+		if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+		    && insn != prev_insn
+		    && body == BLOCK_FOR_INSN (prev_insn)
+		    && !arm_mve_vec_insn_is_predicated_with_this_predicate
+			 (insn, vctp_vpr_generated)
+		    && arm_mve_check_df_chain_back_for_implic_predic
+			 (prev_insn, vctp_vpr_generated))
+		  return true;
+	      }
+	  }
+      }
+    }
+  return false;
+}
+
+/* If we have identified that the current DEF will be modified
+    by such implicit predication, scan through all the
+   insns that USE it and bail out if any one is outside the
+   current basic block (i.e. the reg is live after the loop)
+   or if any are store insns that are unpredicated or using a
+   predicate other than the loop VPR.  */
+
+static bool
+arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn,
+						     rtx vctp_vpr_generated)
+{
+  /* If this insn is indeed an unpredicated store to memory, bail out.  */
+  if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+	(insn, vctp_vpr_generated)
+      && mve_memory_operand (SET_DEST (single_set (insn)),
+			     GET_MODE (SET_DEST (single_set (insn)))))
+    return true;
+
+  /* Next, scan forward to the various USEs of the DEFs in this insn.  */
+  df_ref insn_def = NULL;
+  FOR_EACH_INSN_DEF (insn_def, insn)
+    {
+      for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use;
+	   use = DF_REF_NEXT_REG (use))
+	{
+	  rtx_insn *next_use_insn = DF_REF_INSN (use);
+	  if (insn != next_use_insn && NONDEBUG_INSN_P (next_use_insn))
+	    {
+	      /* If the USE is outside the loop body bb, or it is inside, but
+		 is an unpredicated store to memory.  */
+	      if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+		 || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+		     (next_use_insn, vctp_vpr_generated)
+		    && mve_memory_operand
+			(SET_DEST (single_set (next_use_insn)),
+			 GET_MODE (SET_DEST (single_set (next_use_insn))))))
+		return true;
+	    }
+	}
+    }
+  return false;
+}
+
+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+     Now just show that the max value of the counter came from an
+     appropriate ASHIFRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+	 && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+    pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  rtx counter_max_last_set;
+  if (counter_max_last_def)
+    counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def));
+  else
+    return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (GET_CODE (counter_max_last_set) == SET
+      && REG_P (XEXP (counter_max_last_set, 1)))
+    return arm_mve_check_reg_origin_is_num_elems
+	     (pre_loop_bb, XEXP (counter_max_last_set, 1), vctp_step);
+
+  if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT
+      && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1))
+      && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1)))
+	   == abs (INTVAL (vctp_step))))
+    return true;
+
+  return false;
+}
+
+/* If we have identified the loop to have an incrementing counter, we need to
+   make sure that it increments by 1 and that the loop is structured correctly:
+    * The counter starts from 0
+    * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes
+    * The vctp insn uses a reg that decrements appropriately in each iteration.
+*/
+
+static bool
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* The loop latch has to be empty.  When compiling all the known MVE LoLs in
+     user applications, none of those with incrementing counters had any real
+     insns in the loop latch.  As such, this function has only been tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for sanity, error out early.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  if (NONDEBUG_INSN_P (dec_insn))
+    gcc_unreachable ();
+
+  class rtx_iv vctp_reg_iv;
+  /* For loops of type B) the loop counter is independent of the decrement
+     of the reg used in the vctp_insn. So run iv analysis on that reg.  This
+     has to succeed for such loops to be supported.  */
+  if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+      vctp_reg, &vctp_reg_iv))
+    return false;
+
+  /* Find where both of those are modified in the loop body bb.  */
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+						 (body, REGNO (condcount))));
+  rtx vctp_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					    (body, REGNO (vctp_reg))));
+  if (!vctp_reg_set || !condcount_reg_set)
+    return false;
+
+  if (REG_P (condcount) && REG_P (condconst))
+    {
+      /* First we need to prove that the loop is going 0..condconst with an
+	 inc of 1 in each iteration.  */
+      if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1))
+	  && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1)
+	{
+	  /* Check that the counter did indeed start from zero.  */
+	  rtx counter_orig_set;
+	  counter_orig_set = XEXP (PATTERN
+				    (DF_REF_INSN
+				      (DF_REF_NEXT_REG
+					(DF_REG_DEF_CHAIN
+					 (REGNO
+					   (XEXP (condcount_reg_set, 0)))))), 1);
+	  if (!CONST_INT_P (counter_orig_set)
+	      || (INTVAL (counter_orig_set) != 0))
+	    return false;
+	  /* And finally check that the target value of the counter, condconst
+	     is of the correct shape.  */
+	  if (!arm_mve_check_reg_origin_is_num_elems (body, condconst, vctp_reg_iv.step))
+	    return false;
+	}
+      else
+	return false;
+    }
+  else
+    return false;
+
+  /* Extract the decrementnum of the vctp reg.  */
+  int decrementnum = abs (INTVAL (vctp_reg_iv.step));
+  /* Ensure it matches the number of lanes of the vctp instruction.  */
+  if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    return false;
+
+  /* Everything looks valid.  */
+  return true;
+}
+
+static bool
+arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
+{
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  class rtx_iv vctp_reg_iv;
+  int decrementnum;
+  /* For decrementing loops of type A), the counter is usually present in the
+     loop latch.  Here we simply need to verify that this counter is the same
+     reg that is also used in the vctp_insn and that it is not otherwise
+     modified.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  /* If not in the loop latch, try to find the decrement in the loop body.  */
+  if (!NONDEBUG_INSN_P (dec_insn))
+  {
+    df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount));
+    /* If we haven't been able to find the decrement, bail out.  */
+    if (!temp)
+      return false;
+    dec_insn = DF_REF_INSN (temp);
+  }
+
+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))
+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));
+  else
+    return false;
+
+  /* Ok, so we now know the loop decrement.  If it is a 1, then we need to
+     look at the loop vctp_reg and verify that it also decrements correctly.
+     Then, we need to establish that the starting value of the loop decrement
+     originates from the starting value of the vctp decrement.  */
+  if (decrementnum == 1)
+    {
+      class rtx_iv vctp_reg_iv;
+      /* The loop counter is found to be independent of the decrement
+	 of the reg used in the vctp_insn, again.  Ensure that IV analysis
+	 succeeds and check the step.  */
+      if (!iv_analyze (vctp_insn, as_a<scalar_int_mode> (GET_MODE (vctp_reg)),
+		       vctp_reg, &vctp_reg_iv))
+	return false;
+      /* Ensure it matches the number of lanes of the vctp instruction.  */
+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+	return false;
+      if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step))
+	return false;
+    }
+  /* If the decrements are the same, then the situation is simple: either they
+     are also the same reg, which is safe, or they are different registers, in
+     which case makse sure that there is a only simple SET from one to the
+     other inside the loop.*/
+  else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
+    {
+      if (REGNO (condcount) != REGNO (vctp_reg))
+	{
+	  /* It wasn't the same reg, but it could be behild a
+	     (set (vctp_reg) (condcount)), so instead find where
+	     the VCTP insn is DEF'd inside the loop.  */
+	  rtx vctp_reg_set =
+		PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+					(body, REGNO (vctp_reg))));
+	  /* This must just be a simple SET from the condcount.  */
+	  if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1))
+	      || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount))
+	    return false;
+	}
+    }
+  else
+    return false;
+
+  /* We now only need to find out that the loop terminates with a LE
+     zero condition.  If condconst is a const_int, then this is easy.
+     If its a REG, look at the last condition+jump in a bb before
+     the loop, because that usually will have a branch jumping over
+     the loop body.  */
+  if (CONST_INT_P (condconst)
+      && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body))
+	   && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE
+	   && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE
+	       ||GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT)))
+    return false;
+  else if (REG_P (condconst))
+    {
+      basic_block pre_loop_bb = body;
+      while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb)
+	     && !JUMP_P (BB_END (pre_loop_bb->prev_bb)))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      if (pre_loop_bb && BB_END (pre_loop_bb))
+	pre_loop_bb = pre_loop_bb->prev_bb;
+      else
+	return false;
+      rtx initial_compare = NULL_RTX;
+      if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))
+	    && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)))))
+	return false;
+      else
+	initial_compare
+	    = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)));
+      if (!(initial_compare && GET_CODE (initial_compare) == SET
+	    && cc_register (XEXP (initial_compare, 0), VOIDmode)
+	    && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
+	    && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
+	    && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0))
+	return false;
+
+      /* Usually this is a LE condition, but it can also just be a GT or an EQ
+	 condition (if the value is unsigned or the compiler knows its not negative)  */
+      rtx_insn *loop_jumpover = BB_END (pre_loop_bb);
+      if (!(JUMP_P (loop_jumpover)
+	    && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE
+	    && (GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == LE
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT
+		|| GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ)))
+	return false;
+    }
+
+  /* Everything looks valid.  */
+  return true;
+}
+
+static bool
+arm_mve_loop_valid_for_dlstp (basic_block body)
+{
+  /* Doloop can only be done "elementwise" with predicated dlstp/letp if it
+     contains a VCTP on the number of elements processed by the loop.
+     Find the VCTP predicate generation inside the loop body BB.  */
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn)
+    return false;
+
+  /* There are only two types of loops that can be turned into dlstp/letp
+     loops:
+      A) Loops of the form:
+	  while (num_of_elem > 0)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+      B) Loops of the form:
+	  int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes
+	  for (i = 0; i < num_of_iters; i++)
+	    {
+	      p = vctp<size> (num_of_elem)
+	      n -= num_of_lanes;
+	    }
+
+    Then, depending on the type of loop above we need will need to do
+    different sets of checks.  */
+  iv_analysis_loop_init (body->loop_father);
+
+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  gcc_assert (cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	      && GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE);
+  /* The operands in the condition:  Try to identify which one is the
+     constant and which is the counter and run IV analysis on the latter.  */
+  rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0);
+  rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1);
+
+  rtx loop_cond_constant;
+  rtx loop_counter;
+  class rtx_iv cond_counter_iv, cond_temp_iv;
+
+  if (CONST_INT_P (cond_arg_1))
+    {
+      /* cond_arg_1 is the constant and cond_arg_2 is the counter.  */
+      loop_cond_constant = cond_arg_1;
+      loop_counter = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_counter_iv);
+    }
+  else if (CONST_INT_P (cond_arg_2))
+    {
+      /* cond_arg_2 is the constant and cond_arg_1 is the counter.  */
+      loop_cond_constant = cond_arg_2;
+      loop_counter = cond_arg_1;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+    }
+  else if (REG_P (cond_arg_1) && REG_P (cond_arg_2))
+    {
+      /* If both operands to the compare are REGs, we can safely
+	 run IV analysis on both and then determine which is the
+	 constant by looking at the step.
+	 First assume cond_arg_1 is the counter.  */
+      loop_counter = cond_arg_1;
+      loop_cond_constant = cond_arg_2;
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_1)),
+		  cond_arg_1, &cond_counter_iv);
+      iv_analyze (loop_cond, as_a<scalar_int_mode> (GET_MODE (cond_arg_2)),
+		  cond_arg_2, &cond_temp_iv);
+
+      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+	return false;
+      /* Look at the steps and swap around the rtx's if needed.  Error out if
+	 one of them cannot be identified as constant.  */
+      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+	return false;
+      if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0)
+	{
+	  loop_counter = cond_arg_2;
+	  loop_cond_constant = cond_arg_1;
+	  cond_counter_iv = cond_temp_iv;
+	}
+    }
+  else
+    return false;
+
+  if (!REG_P (loop_counter))
+    return false;
+  if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant)))
+    return false;
+
+  /* Now we have extracted the IV step of the loop counter, call the
+     appropriate checking function.  */
+  if (INTVAL (cond_counter_iv.step) > 0)
+    return arm_mve_dlstp_check_inc_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else if (INTVAL (cond_counter_iv.step) < 0)
+    return arm_mve_dlstp_check_dec_counter (body, vctp_insn,
+					    loop_cond_constant, loop_counter);
+  else
+    return false;
+}
+
+/* Predict whether the given loop in gimple will be transformed in the RTL
+   doloop_optimize pass.  */
+
+static bool
+arm_predict_doloop_p (struct loop *loop)
+{
+  gcc_assert (loop);
+  /* On arm, targetm.can_use_doloop_p is actually
+     can_use_doloop_if_innermost.  Ensure the loop is innermost,
+     it is valid and as per arm_target_bb_ok_for_lob and the
+     correct architecture flags are enabled.  */
+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " target architecture or optimisation flags.\n");
+      return false;
+    }
+  else if (loop->inner != NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop nesting.\n");
+      return false;
+    }
+  else if (!arm_target_bb_ok_for_lob (loop->header->next_bb))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Predict doloop failure due to"
+			    " loop bb complexity.\n");
+      return false;
+    }
+
+  return true;
+}
+
+/* Implement targetm.loop_unroll_adjust.  Use this to block unrolling of loops
+   that may later be turned into MVE Tail Predicated Low Overhead Loops.  The
+   performance benefit of an MVE LoL is likely to be much higher than that of
+   the unrolling.  */
+
+unsigned
+arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
+{
+  if (TARGET_THUMB2 && TARGET_HAVE_LOB && TARGET_HAVE_MVE
+      && arm_target_bb_ok_for_lob (loop->header->next_bb)
+      && arm_mve_loop_valid_for_dlstp (loop->header))
+    return 0;
+  else
+    return nunroll;
+}
+
+/* Function to hadle emitting a VPT-unpredicated version of a VPT-predicated
+   insn to a sequence.  */
+
+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+  rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn);
+  int new_icode = get_attr_mve_unpredicated_insn (insn);
+  if (!in_sequence_p ()
+      || !MVE_VPT_PREDICATED_INSN_P (insn)
+      || (!insn_vpr_reg_operand)
+      || (!new_icode))
+    return false;
+
+  extract_insn (insn);
+  rtx arr[8];
+  int j = 0;
+
+  /* When transforming a VPT-predicated instruction
+     into its unpredicated equivalent we need to drop
+     the VPR operand and we may need to also drop a
+     merge "vuninit" input operand, depending on the
+     instruction pattern.  Here ensure that we have at
+     most a two-operand difference between the two
+     instrunctions.  */
+  int n_operands_diff
+      = recog_data.n_operands - insn_data[new_icode].n_operands;
+  if (!(n_operands_diff > 0 && n_operands_diff <= 2))
+    return false;
+
+  /* Then, loop through the operands of the predicated
+     instruction, and retain the ones that map to the
+     unpredicated instruction.  */
+  for (int i = 0; i < recog_data.n_operands; i++)
+    {
+      /* Ignore the VPR and, if needed, the vuninit
+	 operand.  */
+      if (insn_vpr_reg_operand == recog_data.operand[i]
+	  || (n_operands_diff == 2
+	      && !strcmp (recog_data.constraints[i], "0")))
+	continue;
+      else
+	{
+	  arr[j] = recog_data.operand[i];
+	  j++;
+	}
+    }
+
+  /* Finally, emit the upredicated instruction.  */
+  switch (j)
+    {
+      case 1:
+	emit_insn (GEN_FCN (new_icode) (arr[0]));
+	break;
+      case 2:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1]));
+	break;
+      case 3:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2]));
+	break;
+      case 4:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2],
+					arr[3]));
+	break;
+      case 5:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4]));
+	break;
+      case 6:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5]));
+	break;
+      case 7:
+	emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3],
+					arr[4], arr[5], arr[6]));
+	break;
+      default:
+	gcc_unreachable ();
+    }
+  return true;
+}
+
+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)
+{
+  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
+
+  /* Ensure that the bb is within a loop that has all required metadata.  */
+  if (!body->loop_father || !body->loop_father->header
+      || !body->loop_father->simple_loop_desc)
+    return GEN_INT (1);
+
+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn || !arm_mve_loop_valid_for_dlstp (body))
+    return GEN_INT (1);
+  rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0);
+  /* decrementunum is already known to be valid at this point.  */
+  int decrementnum = arm_mve_get_vctp_lanes
+		      (PATTERN (arm_mve_get_loop_vctp (body)));
+
+  rtx_insn *insn = 0;
+  rtx_insn *cur_insn = 0;
+  rtx_insn *seq;
+  rtx vctp_vpr_generated = NULL_RTX;
+
+  /* Scan through the insns in the loop bb and emit the transformed bb
+     insns to a sequence.  */
+  start_sequence ();
+  FOR_BB_INSNS (body, insn)
+    {
+      if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))
+	continue;
+      else if (NOTE_P (insn))
+	emit_note ((enum insn_note)NOTE_KIND (insn));
+      else if (DEBUG_INSN_P (insn))
+	emit_debug_insn (PATTERN (insn));
+      else if (!INSN_P (insn))
+	{
+	  end_sequence ();
+	  return GEN_INT (1);
+	}
+      /* When we find the vctp instruction: This may be followed by
+      a zero-extend insn to SImode.  If it is, then save the
+      zero-extended REG into vctp_vpr_generated.  If there is no
+      zero-extend, then store the raw output of the vctp.
+      For any VPT-predicated instructions we need to ensure that
+      the VPR they use is the same as the one given here and
+      they often consume the output of a subreg of the SImode
+      zero-extended VPR-reg.  As a result, comparing against the
+      output of the zero-extend is more likely to succeed.
+      This code also guarantees to us that the vctp comes before
+      any instructions that use the VPR within the loop, for the
+      dlstp/letp transform to succeed.  */
+      else if (insn == vctp_insn)
+	{
+	  rtx_insn *next_use1 = NULL;
+	  df_ref use;
+	  for (use = DF_REG_USE_CHAIN
+		      (DF_REF_REGNO (DF_INSN_INFO_DEFS
+				      (DF_INSN_INFO_GET (insn))));
+	       use; use = DF_REF_NEXT_REG (use))
+	    if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+	      next_use1 = DF_REF_INSN (use);
+
+	  if (GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+	    {
+	      rtx_insn *next_use2 = NULL;
+	      for (use = DF_REG_USE_CHAIN
+			  (DF_REF_REGNO (DF_INSN_INFO_DEFS
+					 (DF_INSN_INFO_GET (next_use1))));
+		   use; use = DF_REF_NEXT_REG (use))
+		if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use)))
+		  next_use2 = DF_REF_INSN (use);
+
+	      if (GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG)
+		vctp_vpr_generated = XEXP (PATTERN (next_use2), 0);
+	    }
+
+	  if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated)
+	      || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated)))
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	  continue;
+	}
+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter.  */
+      else if (arm_mve_vec_insn_is_predicated_with_this_predicate (insn, vctp_vpr_generated))
+	{
+	  bool success = arm_emit_mve_unpredicated_insn_to_seq (insn);
+	  if (!success)
+	    {
+	      end_sequence ();
+	      return GEN_INT (1);
+	    }
+	}
+      /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to
+	 make sure that it is still valid within the dlstp/letp loop.  */
+      else
+	{
+	  /* None of registers USE-d by the instruction need can be the VPR
+	     vctp_vpr_generated.  This blocks the optimisation if there any
+	     instructions that use the optimised-out VPR value in any way
+	     other than as a VPT block predicate.  */
+	  df_ref insn_uses = NULL;
+	  FOR_EACH_INSN_USE (insn_uses, insn)
+	  {
+	    if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	      {
+		end_sequence ();
+		return GEN_INT (1);
+	      }
+	  }
+	  /* If within the loop we have an MVE vector instruction that is
+	     unpredicated, the dlstp/letp looping will add implicit
+	     predication to it.  This will result in a change in behaviour
+	     of the instruction, so we need to find out if any instructions
+	     that feed into the current instruction were implicitly
+	     predicated.  */
+	  if (arm_mve_check_df_chain_back_for_implic_predic
+		(insn, vctp_vpr_generated))
+	    {
+	      if (arm_mve_check_df_chain_fwd_for_implic_predic_impact
+		    (insn, vctp_vpr_generated))
+		{
+		  end_sequence ();
+		  return GEN_INT (1);
+		}
+	    }
+	  emit_insn (PATTERN (insn));
+	}
+    }
+  seq = get_insns ();
+  end_sequence ();
+
+  /* Re-write the entire BB contents with the transformed
+     sequence.  */
+  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
+    if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)))
+      delete_insn (insn);
+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+  emit_jump_insn_after (PATTERN (insn), BB_END (body));
+  /* The transformation has succeeded, so now modify the "count"
+     (a.k.a. niter_expr) for the middle-end.  */
+  simple_loop_desc (body->loop_father)->niter_expr = vctp_reg;
+  return GEN_INT (decrementnum);
 }
 
 #if CHECKING_P
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 597c1dae640..835caa06fad 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2599,6 +2599,9 @@
 (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
+(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
+			(DLSTP64 "64")])
+
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
 			   (UNSPEC_DOT_U "u8")
 			   (UNSPEC_DOT_US "s8")
@@ -2841,6 +2844,8 @@
 (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U])
 (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S])
 (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
+(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
+				   DLSTP64])
 
 ;; Define iterators for VCMLA operations
 (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 74b8af8d57e..a81720f4aa2 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -7194,3 +7194,41 @@
       }
   }
 )
+
+;; Originally expanded by 'predicated_doloop_end'.
+;; In the rare situation where the branch is too far, we do also need to
+;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
+(define_insn "*predicated_doloop_end_internal"
+  [(set (pc)
+	(if_then_else
+	   (ge (plus:SI (reg:SI LR_REGNUM)
+			(match_operand:SI 0 "const_int_operand" ""))
+		(const_int 0))
+	 (label_ref (match_operand 1 "" ""))
+	 (pc)))
+   (set (reg:SI LR_REGNUM)
+	(plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
+   (clobber (reg:CC CC_REGNUM))
+   (clobber (match_scratch:SI 2 "=r"))]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  {
+    if (get_attr_length (insn) == 4)
+      return "letp\t%|lr, %l1";
+    else
+      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tvmrs\t%2, FPSCR\n\torr\t%2, %2, #0x40000\n\tand\t%2, %2, 0xFFFCFFFF\n\tvmsr\tFPSCR, %2;";
+  }
+  [(set (attr "length")
+	(if_then_else
+	   (ltu (minus (pc) (match_dup 1)) (const_int 1024))
+	    (const_int 4)
+	    (const_int 6)))
+   (set_attr "type" "branch")])
+
+(define_insn "dlstp<mode1>_insn"
+  [
+    (set (reg:SI LR_REGNUM)
+	 (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
+	  DLSTP))
+  ]
+  "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2"
+  "dlstp.<mode1>\t%|lr, %0")
diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index e1e013befa7..2094e9c4641 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1613,7 +1613,7 @@
    (use (match_operand 1 "" ""))]     ; label
   "TARGET_32BIT"
   "
- {
+{
    /* Currently SMS relies on the do-loop pattern to recognize loops
       where (1) the control part consists of all insns defining and/or
       using a certain 'count' register and (2) the loop count can be
@@ -1623,41 +1623,67 @@
 
       Also used to implement the low over head loops feature, which is part of
       the Armv8.1-M Mainline Low Overhead Branch (LOB) extension.  */
-   if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-
-     /* Low over head loop instructions require the first operand to be LR.  */
-     if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1]))
-       s0 = gen_rtx_REG (SImode, LR_REGNUM);
-
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
-     emit_jump_insn (gen_rtx_SET (pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }
- else
-   FAIL;
- }")
+  if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB))
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+      rtx decrement_num;
+
+      if (GET_MODE (operands[0]) != SImode)
+	FAIL;
+
+      s0 = operands[0];
+
+       if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1])))
+	{
+	  s0 = gen_rtx_REG (SImode, LR_REGNUM);
+
+	  /* If we have a compatibe MVE target, try and analyse the loop
+	     contents to determine if we can use predicated dlstp/letp
+	     looping.  */
+	  if (TARGET_HAVE_MVE && TARGET_THUMB2
+	      && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
+	      && (INTVAL (decrement_num) != 1))
+	    {
+	      insn = emit_insn
+		      (gen_thumb2_addsi3_compare0
+			  (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
+	      cmp = XVECEXP (PATTERN (insn), 0, 0);
+	      cc_reg = SET_DEST (cmp);
+	      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
+	      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	      emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	      DONE;
+	    }
+
+	  /* Otherwise, try standard decrement-by-one dls/le looping.  */
+	  if (TARGET_THUMB2)
+	    insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0,
+							  GEN_INT (-1)));
+	  else
+	    insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+
+	  cmp = XVECEXP (PATTERN (insn), 0, 0);
+	  cc_reg = SET_DEST (cmp);
+	  bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	  loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+	  emit_jump_insn (gen_rtx_SET (pc_rtx,
+				       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							     loc_ref, pc_rtx)));
+	  DONE;
+	}
+      else
+	FAIL;
+    }
+  else
+    FAIL;
+}")
 
 (define_insn "*clear_apsr"
   [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR)
@@ -1755,7 +1781,37 @@
   {
     if (REGNO (operands[0]) == LR_REGNUM)
       {
-	emit_insn (gen_dls_insn (operands[0]));
+	/* Pick out the number by which we are decrementing the loop counter
+	   in every iteration.  If it's > 1, then use dlstp.  */
+	int const_int_dec_num
+	     = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1),
+				  1),
+			    1)));
+	switch (const_int_dec_num)
+	  {
+	    case 16:
+	      emit_insn (gen_dlstp8_insn (operands[0]));
+	      break;
+
+	    case 8:
+	      emit_insn (gen_dlstp16_insn (operands[0]));
+	      break;
+
+	    case 4:
+	      emit_insn (gen_dlstp32_insn (operands[0]));
+	      break;
+
+	    case 2:
+	      emit_insn (gen_dlstp64_insn (operands[0]));
+	      break;
+
+	    case 1:
+	      emit_insn (gen_dls_insn (operands[0]));
+	      break;
+
+	    default:
+	      gcc_unreachable ();
+	  }
 	DONE;
       }
     else
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index dccda283573..f7c4f2b1b0e 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -581,6 +581,10 @@
   VADDLVQ_U
   VCTP
   VCTP_M
+  DLSTP8
+  DLSTP16
+  DLSTP32
+  DLSTP64
   VPNOT
   VCREATEQ_F
   VCVTQ_N_TO_F_S
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index d4812b04a7c..4fcc14bf790 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
    DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 402657a7076..98623637f9c 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..f6dbd0515de 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat)
      forms:
 
      1)  (parallel [(set (pc) (if_then_else (condition)
-	  			            (label_ref (label))
-				            (pc)))
-	             (set (reg) (plus (reg) (const_int -1)))
-	             (additional clobbers and uses)])
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -n)))
+		     (additional clobbers and uses)])
 
      The branch must be the first entry of the parallel (also required
      by jump.cc), and the second entry of the parallel must be a set of
      the loop counter register.  Some targets (IA-64) wrap the set of
      the loop counter in an if_then_else too.
 
-     2)  (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-	                         (label_ref (label))
-			         (pc))).  
+     2)  (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
      Some targets (ARM) do the comparison before the branch, as in the
      following form:
 
-     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) */
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0)))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	      || GET_CODE (cmp_arg1) != PLUS)
 	    return 0;
 	  reg_orig = XEXP (cmp_arg1, 0);
-	  if (XEXP (cmp_arg1, 1) != GEN_INT (-1) 
+	  if (!CONST_INT_P (XEXP (cmp_arg1, 1))
 	      || !REG_P (reg_orig))
 	    return 0;
 	  cc_reg = SET_DEST (cmp_orig);
@@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat)
 	{
 	  /* We expect the condition to be of the form (reg != 0)  */
 	  cond = XEXP (SET_SRC (cmp), 0);
-	  if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+	  if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+	      || XEXP (cond, 1) != const0_rtx)
 	    return 0;
 	}
     }
@@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))
     return 0;
 
   /* Check for (set (pc) (if_then_else (condition)
@@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
    {
-     if (GET_CODE (pattern) != PARALLEL)
      /*  For the second form we expect:
 
-         (set (reg) (plus (reg) (const_int -1))
-         (set (pc) (if_then_else (reg != 0)
-                                 (label_ref (label))
-                                 (pc))).
+	 (set (reg) (plus (reg) (const_int -n))
+	 (set (pc) (if_then_else (reg != 0)
+				 (label_ref (label))
+				 (pc))).
 
-         is equivalent to the following:
+	 If n == 1, that is equivalent to the following:
 
-         (parallel [(set (pc) (if_then_else (reg != 1)
-                                            (label_ref (label))
-                                            (pc)))
-                     (set (reg) (plus (reg) (const_int -1)))
-                     (additional clobbers and uses)])
+	 (parallel [(set (pc) (if_then_else (reg != 1)
+					    (label_ref (label))
+					    (pc)))
+		     (set (reg) (plus (reg) (const_int -1)))
+		     (additional clobbers and uses)])
 
-        For the third form we expect:
+	For the third form we expect:
 
-        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
-                   (set (reg) (plus (reg) (const_int -1)))])
-        (set (pc) (if_then_else (cc == NE)
-                                (label_ref (label))
-                                (pc))) 
+	(parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0))
+		   (set (reg) (plus (reg) (const_int -n)))])
+	(set (pc) (if_then_else (cc == NE)
+				(label_ref (label))
+				(pc)))
 
-        which is equivalent to the following:
+	Which also for n == 1 is equivalent to the following:
 
-        (parallel [(set (cc) (compare (reg,  1))
-                   (set (reg) (plus (reg) (const_int -1)))
-                   (set (pc) (if_then_else (NE == cc)
-                                           (label_ref (label))
-                                           (pc))))])
+	(parallel [(set (cc) (compare (reg,  1))
+		   (set (reg) (plus (reg) (const_int -1)))
+		   (set (pc) (if_then_else (NE == cc)
+					   (label_ref (label))
+					   (pc))))])
 
-        So we return the second form instead for the two cases.
+	So we return the second form instead for the two cases.
 
+	For the "elementwise" form where the decrement number isn't -1,
+	the final value may be exceeded, so use GE instead of NE.
      */
-        condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
+     if (GET_CODE (pattern) != PARALLEL)
+       {
+	if (INTVAL (XEXP (inc_src, 1)) != -1)
+	  condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
+	else
+	  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
+       }
 
     return condition;
    }
@@ -685,17 +693,6 @@ doloop_optimize (class loop *loop)
       return false;
     }
 
-  max_cost
-    = COSTS_N_INSNS (param_max_iterations_computation_cost);
-  if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop))
-      > max_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file,
-		 "Doloop: number of iterations too costly to compute.\n");
-      return false;
-    }
-
   if (desc->const_iter)
     iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode),
 				   UNSIGNED);
@@ -716,11 +713,24 @@ doloop_optimize (class loop *loop)
 
   /* Generate looping insn.  If the pattern FAILs then give up trying
      to modify the loop since there is some aspect the back-end does
-     not like.  */
-  count = copy_rtx (desc->niter_expr);
+     not like.  If this succeeds, there is a chance that the loop
+     desc->niter_expr has been altered by the backend, so only extract
+     that data after the gen_doloop_end.  */
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
   rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  count = copy_rtx (desc->niter_expr);
+
+  max_cost
+    = COSTS_N_INSNS (param_max_iterations_computation_cost);
+  if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop))
+      > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "Doloop: number of iterations too costly to compute.\n");
+      return false;
+    }
 
   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h
index feaae7cc899..3941fe7a8b6 100644
--- a/gcc/testsuite/gcc.target/arm/lob.h
+++ b/gcc/testsuite/gcc.target/arm/lob.h
@@ -1,15 +1,131 @@
 #include <string.h>
-
+#include <stdint.h>
 /* Common code for lob tests.  */
 
 #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" )
 
-#define N 10000
+#define N 100
+
+static void
+reset_data (int *a, int *b, int *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (b, -1, x * sizeof (*b));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+reset_data64 (int64_t *a, int64_t *c, int x)
+{
+  memset (a, -1, x * sizeof (*a));
+  memset (c, 0, x * sizeof (*c));
+}
+
+static void
+check_plus (int *a, int *b, int *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
+
+static void
+check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x)
+{
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != (a[i] + b[i])) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
+}
 
 static void
-reset_data (int *a, int *b, int *c)
+check_memcpy64 (int64_t *a, int64_t *c, int x)
 {
-  memset (a, -1, N * sizeof (*a));
-  memset (b, -1, N * sizeof (*b));
-  memset (c, -1, N * sizeof (*c));
+  for (int i = 0; i < N; i++)
+    {
+      NO_LOB;
+      if (i < x)
+	{
+	  if (c[i] != a[i]) abort ();
+	}
+      else
+	{
+	  if (c[i] != 0) abort ();
+	}
+    }
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c
index ba5c82cd55c..c8ce653a5c3 100644
--- a/gcc/testsuite/gcc.target/arm/lob1.c
+++ b/gcc/testsuite/gcc.target/arm/lob1.c
@@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c)
     } while (i < N);
 }
 
-void
-check (int *a, int *b, int *c)
-{
-  for (int i = 0; i < N; i++)
-    {
-      NO_LOB;
-      if (c[i] != a[i] + b[i])
-	abort ();
-    }
-}
-
 int
 main (void)
 {
-  reset_data (a, b, c);
+  reset_data (a, b, c, N);
   loop1 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop2 (a, b ,c);
-  check (a, b ,c);
-  reset_data (a, b, c);
+  check_plus (a, b, c, N);
+  reset_data (a, b, c, N);
   loop3 (a, b ,c);
-  check (a, b ,c);
+  check_plus (a, b, c, N);
 
   return 0;
 }
diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c
index 17b6124295e..4fe116e2c2b 100644
--- a/gcc/testsuite/gcc.target/arm/lob6.c
+++ b/gcc/testsuite/gcc.target/arm/lob6.c
@@ -79,14 +79,14 @@ check (void)
 int
 main (void)
 {
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop1 (a1, b1, c1);
   ref1 (a2, b2, c2);
   check ();
 
-  reset_data (a1, b1, c1);
-  reset_data (a2, b2, c2);
+  reset_data (a1, b1, c1, N);
+  reset_data (a2, b2, c2, N);
   loop2 (a1, b1, c1);
   ref2 (a2, b2, c2);
   check ();
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
new file mode 100644
index 00000000000..fe573ebb749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c
@@ -0,0 +1,442 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+#define IMM 5
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x)
+
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)				\
+void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      b += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED)
+
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x)
+
+#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED)	\
+void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a,  TYPE##BITS##_t *c, int n)	\
+{											\
+  while (n > 0)										\
+    {											\
+      mve_pred16_t p = vctp##BITS##q (n);						\
+      TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p);		\
+      TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p);		\
+      vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p);					\
+      c += LANES;									\
+      a += LANES;									\
+      n -= LANES;									\
+    }											\
+}
+
+#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED)	\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED)
+
+#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED)			\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED)				\
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m)
+
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m)
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m)
+
+/* Now test some more configurations.  */
+
+/* Using a >=1 condition.  */
+void test1 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n >= 1)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+/* Test a for loop format of decrementing to zero */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i > 0; i-= 4)
+    {
+        mve_pred16_t p = vctp32q (i);
+        int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+        vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i++)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_z_u8 (b, p);
+        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        n-=16;
+    }
+}
+
+/* Using an unpredicated arithmetic instruction within the loop.  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+        mve_pred16_t p = vctp8q (n);
+        uint8x16_t va = vldrbq_z_u8 (a, p);
+        uint8x16_t vb = vldrbq_u8 (b);
+	/* Is affected by implicit predication, because vb also
+	came from an unpredicated load, but there is no functional
+	problem, because the result is used in a predicated store.  */ 
+        uint8x16_t vc = vaddq_u8 (va, vb);
+        uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+        vstrbq_p_u8 (c, vc, p);
+        vstrbq_p_u8 (d, vd, p);
+        n-=16;
+    }
+}
+
+/* Using a different VPR value for one instruction in the loop.  */
+void test5 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using another VPR value in the loop, with a vctp.
+   The doloop logic will always try to do the transform on the first
+   vctp it encounters, so this is still expected to work.  */
+void test6 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp,
+   but this time the p1 will also chance in every loop (still fine)  */
+void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q (g);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+      g++;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vctp_m
+   that is independent of the loop vctp VPR.  */
+void test8 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p2 = vctp32q_m (n, p1);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop,
+   with a vctp_m that is tied to the base vctp VPR.  This
+   is still fine, because the vctp_m will be transformed
+   into a vctp and be implicitly predicated.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      mve_pred16_t p1 = vctp32q_m (n, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vb);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p1);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Generating and using a different VPR value in the loop, with a vcmp_m 
+   that is tied to the base vctp VPR (same as above, this will be turned
+   into a vcmp and be implicitly predicated).  */
+void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p2);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is valid, because all the inputs to the unpredicated
+   op are correctly predicated.  */
+uint8_t test13 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Same as above, but with another scalar op between the unpredicated op and
+   the scalar op outside the loop.  */
+uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       sum += vaddvq_u8 (vc);
+       sum += g;
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test15 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_s32 (b);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a predicated vcmp to generate a new predicate value in the
+   loop and then using it in a predicated store insn.  */
+void test16 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* The final number of DLSTPs currently is calculated by the number of
+  `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 16.  */
+/* { dg-final { scan-assembler-times {\tdlstp} 160 } } */
+/* { dg-final { scan-assembler-times {\tletp} 160 } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
new file mode 100644
index 00000000000..0cdffb312b3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp16q (n);
+      int16x8_t va = vldrhq_z_s16 (a, p);
+      int16x8_t vb = vldrhq_z_s16 (b, p);
+      int16x8_t vc = vaddq_x_s16 (va, vb, p);
+      vstrhq_p_s16 (c, vc, p);
+      c+=8;
+      a+=8;
+      b+=8;
+      n-=8;
+    }
+}
+
+int main ()
+{
+  int i;
+  int16_t temp1[N];
+  int16_t temp2[N];
+  int16_t temp3[N];
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus16 (temp1, temp2, temp3, 0);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus16 (temp1, temp2, temp3, 1);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 7);
+  check_plus16 (temp1, temp2, temp3, 7);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus16 (temp1, temp2, temp3, 8);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus16 (temp1, temp2, temp3, 9);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus16 (temp1, temp2, temp3, 16);
+
+  reset_data16 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus16 (temp1, temp2, temp3, 17);
+
+  reset_data16 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
new file mode 100644
index 00000000000..7ff789d7650
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c+=4;
+      a+=4;
+      b+=4;
+      n-=4;
+    }
+}
+
+int main ()
+{
+  int i;
+  int32_t temp1[N];
+  int32_t temp2[N];
+  int32_t temp3[N];
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus32 (temp1, temp2, temp3, 0);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus32 (temp1, temp2, temp3, 1);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 3);
+  check_plus32 (temp1, temp2, temp3, 3);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 4);
+  check_plus32 (temp1, temp2, temp3, 4);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 5);
+  check_plus32 (temp1, temp2, temp3, 5);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 8);
+  check_plus32 (temp1, temp2, temp3, 8);
+
+  reset_data32 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 9);
+  check_plus32 (temp1, temp2, temp3, 9);
+
+  reset_data32 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
new file mode 100644
index 00000000000..8065bd02469
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp64q (n);
+      int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p);
+      vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p);
+      c+=2;
+      a+=2;
+      n-=2;
+    }
+}
+
+int main ()
+{
+  int i;
+  int64_t temp1[N];
+  int64_t temp3[N];
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 0);
+  check_memcpy64 (temp1, temp3, 0);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 1);
+  check_memcpy64 (temp1, temp3, 1);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 2);
+  check_memcpy64 (temp1, temp3, 2);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 3);
+  check_memcpy64 (temp1, temp3, 3);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 4);
+  check_memcpy64 (temp1, temp3, 4);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 5);
+  check_memcpy64 (temp1, temp3, 5);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 6);
+  check_memcpy64 (temp1, temp3, 6);
+
+  reset_data64  (temp1, temp3, N);
+  test (temp1, temp3, 7);
+  check_memcpy64 (temp1, temp3, 7);
+
+  reset_data64  (temp1, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
new file mode 100644
index 00000000000..552781001e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
@@ -0,0 +1,68 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "../lob.h"
+
+void  __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp8q (n);
+      int8x16_t va = vldrbq_z_s8 (a, p);
+      int8x16_t vb = vldrbq_z_s8 (b, p);
+      int8x16_t vc = vaddq_x_s8 (va, vb, p);
+      vstrbq_p_s8 (c, vc, p);
+      c+=16;
+      a+=16;
+      b+=16;
+      n-=16;
+    }
+}
+
+int main ()
+{
+  int i;
+  int8_t temp1[N];
+  int8_t temp2[N];
+  int8_t temp3[N];
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 0);
+  check_plus8 (temp1, temp2, temp3, 0);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 1);
+  check_plus8 (temp1, temp2, temp3, 1);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 15);
+  check_plus8 (temp1, temp2, temp3, 15);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 16);
+  check_plus8 (temp1, temp2, temp3, 16);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 17);
+  check_plus8 (temp1, temp2, temp3, 17);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 32);
+  check_plus8 (temp1, temp2, temp3, 32);
+
+  reset_data8 (temp1, temp2, temp3, N);
+  test (temp1, temp2, temp3, 33);
+  check_plus8 (temp1, temp2, temp3, 33);
+
+  reset_data8 (temp1, temp2, temp3, N);
+}
+
+/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */
+/* { dg-final { scan-assembler-times {\tletp} 1 } } */
+/* { dg-final { scan-assembler-not "\tvctp" } } */
+/* { dg-final { scan-assembler-not "\tvpst" } } */
+/* { dg-final { scan-assembler-not "p0" } } */
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
new file mode 100644
index 00000000000..70e374dce81
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -0,0 +1,203 @@
+/* { dg-do compile { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-options "-O3 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
+#include <arm_mve.h>
+
+/* Terminating on a non-zero number of elements.  */
+void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n > 1)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Terminating on n >= 0.  */
+void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    while (n >= 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Similar, terminating on a non-zero number of elements, but in a for loop
+   format.  */
+int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7};
+void test2 (int32_t *b, int num_elems)
+{
+    for (int i = num_elems; i >= 2; i-= 4)
+    {
+       mve_pred16_t p = vctp32q (i);
+       int32x4_t va = vldrwq_z_s32 (&(a[i]), p);
+       vstrwq_p_s32 (b + i, va, p);
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a non-zero starting num.  */
+void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 1; i < num_iter; i++)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Iteration counter counting up to num_iter, with a larger increment  */
+void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
+{
+    int num_iter = (n + 15)/16;
+    for (int i = 0; i < num_iter; i+=2)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_x_u8 (va, vb, p);
+       vstrbq_p_u8 (c, vc, p);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store instruction within the loop.  */
+void test5 (uint8_t *a, uint8_t *b, uint8_t *c,  uint8_t *d, int n)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       uint8x16_t vd = vaddq_x_u8 (va, vb, p);
+       vstrbq_u8 (d, vd);
+       n -= 16;
+    }
+}
+
+/* Using an unpredicated store outside the loop.  */
+void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_z_u8 (b, p);
+       uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p);
+       vx = vaddq_u8 (vx, vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    vstrbq_u8 (c, vx);
+}
+
+/* Using a VPR that gets modified within the loop.  */
+void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p++;
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using a VPR that gets re-generated within the loop.  */
+void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n);
+  while (n > 0)
+    {
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      p = vctp32q (n);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using vctp32q_m instead of vctp32q.  */
+void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q_m (n, p0);
+      int32x4_t va = vldrwq_z_s32 (a, p);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_x_s32 (va, vb, p);
+      vstrwq_p_s32 (c, vc, p);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+
+/* Using an unpredicated op with a scalar output, where the result is valid
+   outside the bb.  This is invalid, because one of the inputs to the
+   unpredicated op is also unpredicated.  */
+uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx)
+{
+    uint8_t sum = 0;
+    while (n > 0)
+    {
+       mve_pred16_t p = vctp8q (n);
+       uint8x16_t va = vldrbq_z_u8 (a, p);
+       uint8x16_t vb = vldrbq_u8 (b);
+       uint8x16_t vc = vaddq_u8 (va, vb);
+       sum += vaddvq_u8 (vc);
+       a += 16;
+       b += 16;
+       n -= 16;
+    }
+    return sum;
+}
+
+/* Using an unpredicated vcmp to generate a new predicate value in the
+   loop and then using that VPR to predicate a store insn.  */
+void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n)
+{
+  while (n > 0)
+    {
+      mve_pred16_t p = vctp32q (n);
+      int32x4_t va = vldrwq_s32 (a);
+      int32x4_t vb = vldrwq_z_s32 (b, p);
+      int32x4_t vc = vaddq_s32 (va, vb);
+      mve_pred16_t p1 = vcmpeqq_s32 (va, vc);
+      vstrwq_p_s32 (c, vc, p1);
+      c += 4;
+      a += 4;
+      b += 4;
+      n -= 4;
+    }
+}
+/* { dg-final { scan-assembler-not "\tdlstp" } } */
+/* { dg-final { scan-assembler-not "\tletp" } } */

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-12-20 16:54 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-11 17:40 [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops Stam Markianos-Wright
2022-11-15 15:51 ` Andre Vieira (lists)
2022-11-28 12:13   ` Stam Markianos-Wright
2023-06-15 11:47 Stamatis Markianos-Wright
2023-06-22 15:54 ` Andre Vieira (lists)
2023-07-05 16:11   ` Stamatis Markianos-Wright
2023-06-23 10:23 ` Andre Vieira (lists)
2023-07-05 16:13   ` Stamatis Markianos-Wright
2023-06-23 16:25 ` Andre Vieira (lists)
2023-07-05 16:41   ` Stamatis Markianos-Wright
2023-08-17 10:31 Stamatis Markianos-Wright
2023-11-06 11:20 Stamatis Markianos-Wright
2023-12-18 11:53 [PATCH 0/2] " Andre Vieira
2023-12-18 11:53 ` [PATCH 2/2] " Andre Vieira
2023-12-20 16:54   ` Andre Vieira (lists)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).