Message-ID: <083e6ad9-1cf0-42f6-ab05-6b37854d1b19@foss.arm.com>
Date: Tue, 12 Dec 2023 17:56:22 +0000
Subject: Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
To: Stamatis Markianos-Wright, "Andre Vieira (lists)", Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw, Richard Sandiford
, Kyrylo Tkachov
From: Richard Earnshaw
In-Reply-To: <8000eb76-0abf-45db-92c1-d307ac8fccfc@arm.com>

On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:
> Hi Andre,
>
> Thanks for the comments, see latest revision attached.
>
> On 27/11/2023 12:47, Andre Vieira (lists) wrote:
>> Hi Stam,
>>
>> Just some comments.
>>
>> +/* Recursively scan through the DF chain backwards within the basic block and
>> +   determine if any of the USEs of the original insn (or the USEs of the insns
>>
>> s/Recursively scan/Scan/ as you no longer recurse; thanks for that, by the way :)
>>
>> +   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
>>
>> Remove "recursively" here too, for the same reason.
>>
>> +      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
>> +    return NULL;
>> +      /* Look at the steps and swap around the rtx's if needed. Error out if
>> +     one of them cannot be identified as constant.  */
>> +      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
>> +    return NULL;
>>
>> Move the comment above the first "if", as the erroring out it talks about happens there.
> Done
>
>> +      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
>>
>> Space after 'insn_note)'.
>>
>> @@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>    if (! REG_P (reg))
>>      return 0;
>>
>> -  /* Check if something = (plus (reg) (const_int -1)).
>> +  /* Check if something = (plus (reg) (const_int -n)).
>>       On IA-64, this decrement is wrapped in an if_then_else.  */
>>    inc_src = SET_SRC (inc);
>>    if (GET_CODE (inc_src) == IF_THEN_ELSE)
>>      inc_src = XEXP (inc_src, 1);
>>    if (GET_CODE (inc_src) != PLUS
>>        || XEXP (inc_src, 0) != reg
>> -      || XEXP (inc_src, 1) != constm1_rtx)
>> +      || !CONST_INT_P (XEXP (inc_src, 1)))
>>
>> Do we ever check that inc_src is negative? We used to check that it was
>> -1; now we only check that it's a constant, but not a negative one, so I
>> suspect this needs a:
>>
>>   || INTVAL (XEXP (inc_src, 1)) >= 0
>
> Good point. Done
>
>> @@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
>>      case GE:
>>        /* Currently only GE tests against zero are supported.  */
>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>> -
>> +      /* FALLTHRU */
>> +    case GTU:
>>        noloop = constm1_rtx;
>>
>> I spent a very long time staring at this trying to understand why
>> noloop = constm1_rtx for GTU, where I thought it should have been (count
>> & (n-1)). For the current use of doloop it doesn't matter, because ARM
>> is the only target using it and you set desc->noloop_assumptions to
>> null_rtx in 'arm_attempt_dlstp_transform', so noloop is never used.
>> However, if a different target accepts this GTU pattern then this
>> target-agnostic code will do the wrong thing.  I suggest we either:
>>  - set noloop to what we think might be the correct value, which if
>>    you ask me should be 'count & (XEXP (condition, 1))',
>>  - or add a gcc_assert (GET_CODE (condition) != GTU); under the
>>    if (desc->noloop_assumption); part and document why.
>> I have a slight preference for the assert, given that otherwise we are
>> adding code that we can't test.
>
> Yea, that's true tbh. I've done the latter, but also separated out the
> "case GTU:" and added a comment, so that it's more clear that the noloop
> things aren't used in the only implemented GTU case (Arm).
>
> Thank you :)
>
>> LGTM otherwise (but I don't have the power to approve this ;)).
>>
>> Kind regards,
>> Andre
>> ________________________________________
>> From: Stamatis Markianos-Wright
>> Sent: Thursday, November 16, 2023 11:36 AM
>> To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw;
>> Richard Sandiford; Kyrylo Tkachov
>> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated
>> Low Overhead Loops
>>
>> Pinging back to the top of reviewers' inboxes due to worry about Stage 1
>> End in a few days :)
>>
>> See the last email for the latest version of the 2/2 patch. The 1/2
>> patch is A-OK from Kyrill's earlier target-backend review.
>>
>> On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>>>
>>> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>>>
>>>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>>>> Stamatis Markianos-Wright writes:
>>>>>>> One of the main reasons for reading the arm bits was to try to answer
>>>>>>> the question: if we switch to a downcounting loop with a GE condition,
>>>>>>> how do we make sure that the start value is not a large unsigned
>>>>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>>>>> originally counted up in steps of N and used an LTU condition,
>>>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>>>> But the loop might never iterate if we start counting down from
>>>>>>> most values in that range.
>>>>>>>
>>>>>>> Does the patch handle that?
>>>>>> So AFAICT this is actually handled in the generic code in
>>>>>> `doloop_valid_p`:
>>>>>>
>>>>>> These kinds of loops fail because they are "desc->infinite", so no
>>>>>> loop-doloop conversion is attempted at all (even for standard
>>>>>> dls/le loops).
>>>>>>
>>>>>> Thanks to that check I haven't been able to trigger anything like the
>>>>>> behaviour you describe. Do you think the doloop_valid_p checks are
>>>>>> robust enough?
>>>>> The loops I was thinking of are provably not infinite though. E.g.:
>>>>>
>>>>>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>>>      ...
>>>>>
>>>>> is known to terminate.  And doloop conversion is safe with the normal
>>>>> count-down-by-1 approach, so I don't think current code would need
>>>>> to reject it.  I.e. a conversion to:
>>>>>
>>>>>    unsigned int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while (--i != ~0U);
>>>>>
>>>>> would be safe, but a conversion to:
>>>>>
>>>>>    int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while ((i -= step, i > 0));
>>>>>
>>>>> wouldn't, because the loop body would only be executed once.
>>>>>
>>>>> I'm only going off the name "infinite" though :)  It's possible that
>>>>> it has more connotations than that.
>>>>>
>>>>> Thanks,
>>>>> Richard
>>>>
>>>> Ack, yep, I see what you mean now, and yep, that kind of loop does
>>>> indeed pass through doloop_valid_p.
>>>>
>>>> Interestingly, in the v8-M Arm ARM this is done with:
>>>>
>>>> ```
>>>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>>>> // This does not check whether a loop is currently active.
>>>> // If the PE were in a loop, would this be the last one?
>>>> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>>> ```
>>>>
>>>> So architecturally the asm we output would be OK (except maybe the
>>>> "branch too far subs;bgt;lctp" fallback at
>>>> `predicated_doloop_end_internal` (maybe that should be `bhi`))...
But >>>> now GE: isn't looking like an accurate representation of this >>>> operation in the compiler. >>>> >>>> I'm wondering if I should try to make >>>> `predicated_doloop_end_internal` contain a comparison along the lines >>>> of: >>>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int >>>> num_lanes_minus_1)) >>>> >>>> I'll give that a try :) >>>> >>>> The only reason I'd chosen to go with GE earlier, tbh, was because of >>>> the existing handling of GE in loop-doloop.cc >>>> >>>> Let me know if any other ideas come to your mind! >>>> >>>> >>>> Cheers, >>>> >>>> Stam >>> >>> >>> It looks like I've had success with the below (diff to previous patch), >>> trimmed a bit to only the functionally interesting things:: >>> >>> >>> >>> >>> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md >>> index 368d5138ca1..54dd4ee564b 100644 >>> --- a/gcc/config/arm/thumb2.md >>> +++ b/gcc/config/arm/thumb2.md >>> @@ -1649,16 +1649,28 @@ >>>            && (decrement_num = arm_attempt_dlstp_transform >>> (operands[1])) >>>            && (INTVAL (decrement_num) != 1)) >>>          { >>> -          insn = emit_insn >>> -              (gen_thumb2_addsi3_compare0 >>> -              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num))))); >>> -          cmp = XVECEXP (PATTERN (insn), 0, 0); >>> -          cc_reg = SET_DEST (cmp); >>> -          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx); >>>            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]); >>> -          emit_jump_insn (gen_rtx_SET (pc_rtx, >>> -                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp, >>> -                                 loc_ref, pc_rtx))); >>> +          switch (INTVAL (decrement_num)) >>> +        { >>> +          case 2: >>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal2 >>> +                        (s0, loc_ref)); >>> +            break; >>> +          case 4: >>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal4 >>> +          
              (s0, loc_ref)); >>> +            break; >>> +          case 8: >>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal8 >>> +                        (s0, loc_ref)); >>> +            break; >>> +          case 16: >>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal16 >>> +                        (s0, loc_ref)); >>> +            break; >>> +          default: >>> +            gcc_unreachable (); >>> +        } >>>            DONE; >>>          } >>>      } >>> >>> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md >>> index 93905583b18..c083f965fa9 100644 >>> --- a/gcc/config/arm/mve.md >>> +++ b/gcc/config/arm/mve.md >>> @@ -6922,23 +6922,24 @@ >>>  ;; Originally expanded by 'predicated_doloop_end'. >>>  ;; In the rare situation where the branch is too far, we do also >>> need to >>>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration. >>> -(define_insn "*predicated_doloop_end_internal" >>> +(define_insn "predicated_doloop_end_internal" >>>    [(set (pc) >>>      (if_then_else >>> -       (ge (plus:SI (reg:SI LR_REGNUM) >>> -            (match_operand:SI 0 "const_int_operand" "")) >>> -        (const_int 0)) >>> -     (label_ref (match_operand 1 "" "")) >>> +       (gtu (unspec:SI [(plus:SI (match_operand:SI 0 >>> "s_register_operand" "=r") >>> +                     (const_int ))] >>> +        LETP) >>> +        (const_int )) >>> +     (match_operand 1 "" "") >>>       (pc))) >>> -   (set (reg:SI LR_REGNUM) >>> -    (plus:SI (reg:SI LR_REGNUM) (match_dup 0))) >>> +   (set (match_dup 0) >>> +    (plus:SI (match_dup 0) (const_int ))) >>>     (clobber (reg:CC CC_REGNUM))] >>>    "TARGET_HAVE_MVE" >>>    { >>>      if (get_attr_length (insn) == 4) >>>        return "letp\t%|lr, %l1"; >>>      else >>> -      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp"; >>> +      return "subs\t%|lr, #\n\tbhi\t%l1\n\tlctp"; >>>    } >>>    [(set (attr "length") >>>      (if_then_else >>> @@ -6947,11 +6948,11 @@ 
>>>          (const_int 6))) >>>     (set_attr "type" "branch")]) >>> >>> -(define_insn "dlstp_insn" >>> +(define_insn "dlstp_insn" >>>    [ >>>      (set (reg:SI LR_REGNUM) >>>       (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")] >>>        DLSTP)) >>>    ] >>>    "TARGET_HAVE_MVE" >>> -  "dlstp.\t%|lr, %0") >>> +  "dlstp.\t%|lr, %0") >>> >>> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc >>> index 6a72700a127..47fdef989b4 100644 >>> --- a/gcc/loop-doloop.cc >>> +++ b/gcc/loop-doloop.cc >>> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat) >>>        || XEXP (inc_src, 0) != reg >>>        || !CONST_INT_P (XEXP (inc_src, 1))) >>>      return 0; >>> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1))); >>> >>>    /* Check for (set (pc) (if_then_else (condition) >>>                                         (label_ref (label)) >>> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat) >>>    /* Extract loop termination condition.  */ >>>    condition = XEXP (SET_SRC (cmp), 0); >>> >>> -  /* We expect a GE or NE comparison with 0 or 1.  */ >>> -  if ((GET_CODE (condition) != GE >>> -       && GET_CODE (condition) != NE) >>> -      || (XEXP (condition, 1) != const0_rtx >>> -          && XEXP (condition, 1) != const1_rtx)) >>> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison >>> with >>> +     dec_num - 1.  */ >>> +  if (!((GET_CODE (condition) == GE >>> +     || GET_CODE (condition) == NE) >>> +    && (XEXP (condition, 1) == const0_rtx >>> +        || XEXP (condition, 1) == const1_rtx )) >>> +      &&!(GET_CODE (condition) == GTU >>> +      && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1)))) >>>      return 0; >>> >>> -  if ((XEXP (condition, 0) == reg) >>> +  /* For the ARM special case of having a GTU: re-form the condition >>> without >>> +     the unspec for the benefit of the middle-end.  
*/ >>> +  if (GET_CODE (condition) == GTU) >>> +    { >>> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT >>> (dec_num - 1)); >>> +      return condition; >>> +    } >>> +  else if ((XEXP (condition, 0) == reg) >>>        /* For the third case:  */ >>>        || ((cc_reg != NULL_RTX) >>>        && (XEXP (condition, 0) == cc_reg) >>>        && (reg_orig == reg)) >>> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat) >>>                         (label_ref (label)) >>>                         (pc))))]) >>> >>> -    So we return the second form instead for the two cases when n == 1. >>> - >>> -    For n > 1, the final value may be exceeded, so use GE instead of >>> NE. >>> +    So we return the second form instead for the two cases. >>>       */ >>> -     if (GET_CODE (pattern) != PARALLEL) >>> -       { >>> -    if (INTVAL (XEXP (inc_src, 1)) != -1) >>> -      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx); >>> -    else >>> -      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);; >>> -       } >>> - >>> +    condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx); >>>      return condition; >>> -   } >>> +    } >>> >>>    /* ??? If a machine uses a funny comparison, we could return a >>>       canonicalized form here.  */ >>> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc >>> *desc, >>>      case GE: >>>        /* Currently only GE tests against zero are supported. 
*/ >>>        gcc_assert (XEXP (condition, 1) == const0_rtx); >>> - >>> +      /* FALLTHRU */ >>> +    case GTU: >>>        noloop = constm1_rtx; >>> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md >>> index a6a7ff507a5..9398702cddd 100644 >>> --- a/gcc/config/arm/iterators.md >>> +++ b/gcc/config/arm/iterators.md >>> @@ -2673,8 +2673,16 @@ >>>  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")]) >>>  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")]) >>> >>> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32") >>> -            (DLSTP64 "64")]) >>> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 >>> "32") >>> +                 (DLSTP64 "64")]) >>> + >>> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4") >>> +                 (LETP64 "2")]) >>> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") >>> (LETP32 "-4") >>> +                     (LETP64 "-2")]) >>> + >>> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") >>> (LETP32 "3") >>> +                     (LETP64 "1")]) >>> >>>  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8") >>>                 (UNSPEC_DOT_U "u8") >>> @@ -2921,6 +2929,8 @@ >>>  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S]) >>>  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32 >>>                     DLSTP64]) >>> +(define_int_iterator LETP [LETP8 LETP16 LETP32 >>> +               LETP64]) >>> >>>  ;; Define iterators for VCMLA operations >>>  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA >>>        /* The iteration count does not need incrementing for a GE >>> test.  
*/
>>> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
>>> index 12ae4c4f820..2d6f27c14f4 100644
>>> --- a/gcc/config/arm/unspecs.md
>>> +++ b/gcc/config/arm/unspecs.md
>>> @@ -587,6 +587,10 @@
>>>    DLSTP16
>>>    DLSTP32
>>>    DLSTP64
>>> +  LETP8
>>> +  LETP16
>>> +  LETP32
>>> +  LETP64
>>>    VPNOT
>>>    VCREATEQ_F
>>>    VCVTQ_N_TO_F_S
>>>
>>> I've attached the whole [2/2] patch diff with this change and
>>> the required comment changes in doloop_condition_get.
>>> WDYT?
>>>
>>> Thanks,
>>>
>>> Stam

[I'm still working through this patch, but there are a number of things
which clearly need addressing, so I'll stop at this point today.]

+arm_predict_doloop_p (struct loop *loop)
+{

Is it feasible to add something here to check the overall size of the
loop, so that we don't try to convert loops that are clearly too big?

+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+                                 rtx condconst, rtx condcount)
...
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+  rtx_insn* vctp_reg_set = DF_REF_INSN (vctp_reg_set_df);
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (!(GET_CODE (XEXP (PATTERN (vctp_reg_set), 1)) == PLUS
+        && REGNO (XEXP (PATTERN (vctp_reg_set), 0))
+           == REGNO (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 0))))

You seem to be assuming that the pattern of the insn you've found will
always be of the form (set x y), but that's unsafe.  When scanning RTL
you must first check that you really do have a genuine single set.  The
easiest way to do that is to call single_set (insn), which returns the
SET rtl if the insn is a single set, and NULL otherwise; it is also able
to handle irrelevant clobbers that might be added for register
allocation purposes.

Also, rather than using XEXP, you should use SET_SRC and SET_DEST when
looking at the arms of a SET operation, for better clarity.
+  if (!single_set (last_set_insn))
+    return NULL;
+  rtx counter_orig_set;
+  counter_orig_set = XEXP (PATTERN (last_set_insn), 1);

There's a similar problem here, in that single_set returns a valid value
for something like (parallel [(set a b) (clobber scratch_reg)]), so
looking directly at the pattern of the insn is wrong.  Instead you
should use the value returned by single_set for further analysis.

+      /* Next, ensure that it is a PLUS of the form:
+         (set (reg a) (plus (reg a) (const_int)))
+         where (reg a) is the same as condcount.  */
+      if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+          && REGNO (XEXP (PATTERN (dec_insn), 0))
+             == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+          && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))

and again, you need to validate that dec_insn is a set first.

There are several other cases where you need to check for a SET as well,
but I won't mention any more here.

Can this code be run before register allocation?  If so, there's a risk
that we will have different pseudos for the source and dest operands
here, but expect register allocation to tie them back together;
something like:

  t1 = count
loop_head:
  ...
  t2 = t1
  ...
  t1 = t2 + const
  if (t1 < end) goto loop_head;

Register allocation would be expected to eliminate the t2 = t1 insn by
allocating the same physical register here.

+  decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));

Why the use of abs()?  I don't see where we validate that the direction
really matches our expectation.

+  if (abs (INTVAL (vctp_reg_iv.step))
+      != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))

And again, we only look at the absolute value.  Shouldn't we be
validating the sign here against the sign we ignored earlier?

+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))

Again, the test for TARGET_32BIT seems redundant.
+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)

Don't call single_set twice; just save the result of the first call and
use that.  Perhaps:

  rtx next_use1_set = single_set (next_use1);
  if (next_use1_set && ...)

+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)

Please describe what is returned in the comment.  I'm guessing it's some
form of iteration count, but why return 1 on failure?

+  return GEN_INT (1);

It's more efficient to write "return const1_rtx;".  In fact, it looks
like this function always returns a CONST_INT and is only ever called
from one place in thumb2.md, where we only ever look at the integer
value.  So why not make it return a HOST_WIDE_INT in the first place?

+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{

I think this function needs to also copy across the INSN_LOCATION
information from the insn it is rewriting; that way we keep any
diagnostic information and debug information more accurate.  Something
like:

+    }
+  INSN_LOCATION (new_insn) = INSN_LOCATION (insn);
+  return true;

will be enough if we never get more than one insn emitted from the
emit_insn sequence above (emit_insn returns new_insn, which you'll need
to save).  If you need something more complex, then perhaps you could
use emit_insn_after_setloc (GEN_FCN..., get_last_insn (),
INSN_LOCATION (insn));

+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));

I'm not sure I follow why you can't replace this entire loop with

  emit_insn_after (seq, BB_END (body));

which should do all of the above for you.
But there's another problem here:

+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));

Notes have data (see emit_note_copy in emit-rtl.cc).  If you can't use
that function, you'll need to copy the note data manually in the same
way as emit_note_copy does (or add emit_note_copy_after () as a new
function in emit-rtl.cc).  I also note that you're already copying a
note in arm_attempt_dlstp_transform, and that also drops the note's
data, but that really can use emit_note_copy ().

thumb2.md:

+      /* If we have a compatibe MVE target, try and analyse the loop

Typo: compatible

+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+
...
+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */

Please do not mix scan-assembler tests with execution tests.  The latter
require specific hardware to run, while the former do not; that means
these tests get run by far fewer configurations of the compiler.
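One conventional way to honour that split (the file names and loop placeholder below are only a suggestion, not part of the patch) is a compile-only test carrying the scan-assembler checks, plus a separate execution test gated on the hardware:

```
/* dlstp-int16x8-compile.c: runs wherever MVE code generation is available.  */
/* { dg-do compile { target { arm*-*-* } } } */
/* { dg-require-effective-target arm_v8_1m_mve_ok } */
/* { dg-options "-O2" } */
/* { dg-add-options arm_v8_1m_mve } */
/* ... loop under test ... */
/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */

/* dlstp-int16x8-run.c: only runs where MVE hardware is present.  */
/* { dg-do run { target { arm*-*-* } } } */
/* { dg-require-effective-target arm_v8_1m_mve_ok } */
/* { dg-require-effective-target arm_mve_hw } */
/* { dg-options "-O2" } */
/* { dg-add-options arm_v8_1m_mve } */
/* ... same loop, plus a main () that checks the computed results ... */
```

That way the codegen check still runs under cross-compilation testing even when no MVE board is attached.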