From: "Andre Vieira (lists)"
Date: Mon, 27 Nov 2023 12:47:46 +0000
Subject: Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
To: Stam Markianos-Wright, Stamatis Markianos-Wright via Gcc-patches, Richard Earnshaw, Richard Sandiford, Kyrylo Tkachov

Hi Stam,

Just some comments.

+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns

s/Recursively scan/Scan/ as you no longer recurse, thanks for that by the way :)

+   where they were DEF-ed, etc., recursively) were affected by implicit VPT

Remove "recursively" for the same reasons.

+  if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+    return NULL;
+  /* Look at the steps and swap around the rtx's if needed.  Error out if
+     one of them cannot be identified as constant.  */
+  if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+    return NULL;

Move the comment above the first if, as the erroring out it talks about happens there.

+  emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));

Space after 'insn_note)'.

@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
     return 0;

-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
      On IA-64, this decrement is wrapped in an if_then_else.
     */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
     inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
       || XEXP (inc_src, 0) != reg
-      || XEXP (inc_src, 1) != constm1_rtx)
+      || !CONST_INT_P (XEXP (inc_src, 1)))

Do we ever check that inc_src is negative? We used to check that it was -1; now we only check that it is a constant, but not a negative one, so I suspect this needs a:

  || INTVAL (XEXP (inc_src, 1)) >= 0

@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
     case GE:
       /* Currently only GE tests against zero are supported.  */
       gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+      /* FALLTHRU */
+    case GTU:
       noloop = constm1_rtx;

I spent a very long time staring at this, trying to understand why noloop = constm1_rtx for GTU, where I thought it should have been (count & (n-1)). For the current use of doloop it doesn't matter, because ARM is the only target using it and you set desc->noloop_assumptions to null_rtx in 'arm_attempt_dlstp_transform', so noloop is never used. However, if a different target accepts this GTU pattern then this target-agnostic code will do the wrong thing. I suggest we either:
- set noloop to what we think might be the correct value, which if you ask me should be 'count & (XEXP (condition, 1))',
- or add a gcc_assert (GET_CODE (condition) != GTU); under the if (desc->noloop_assumptions); part and document why.

I have a slight preference for the assert, given that otherwise we are adding code that we can't test.

LGTM otherwise (but I don't have the power to approve this ;)).
Kind regards,
Andre

________________________________________
From: Stamatis Markianos-Wright
Sent: Thursday, November 16, 2023 11:36 AM
To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; Richard Sandiford; Kyrylo Tkachov
Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

Pinging back to the top of reviewers' inboxes due to worry about Stage 1 end in a few days :)

See the last email for the latest version of the 2/2 patch. The 1/2 patch is A-OK from Kyrill's earlier target-backend review.

On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>
> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>
>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>> Stamatis Markianos-Wright writes:
>>>>> One of the main reasons for reading the arm bits was to try to answer
>>>>> the question: if we switch to a downcounting loop with a GE condition,
>>>>> how do we make sure that the start value is not a large unsigned
>>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>>> originally counted up in steps of N and used an LTU condition,
>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>> But the loop might never iterate if we start counting down from
>>>>> most values in that range.
>>>>>
>>>>> Does the patch handle that?
>>>> So AFAICT this is actually handled in the generic code in
>>>> `doloop_valid_p`:
>>>>
>>>> This kind of loop fails because it is "desc->infinite", so no
>>>> loop-doloop conversion is attempted at all (even for standard
>>>> dls/le loops)
>>>>
>>>> Thanks to that check I haven't been able to trigger anything like the
>>>> behaviour you describe, do you think the doloop_valid_p checks are
>>>> robust enough?
>>> The loops I was thinking of are provably not infinite though.  E.g.:
>>>
>>>   for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>     ...
>>>
>>> is known to terminate.
>>> And doloop conversion is safe with the normal
>>> count-down-by-1 approach, so I don't think current code would need
>>> to reject it.  I.e. a conversion to:
>>>
>>>   unsigned int i = UINT_MAX - 101;
>>>   do
>>>     ...
>>>   while (--i != ~0U);
>>>
>>> would be safe, but a conversion to:
>>>
>>>   int i = UINT_MAX - 101;
>>>   do
>>>     ...
>>>   while ((i -= step, i > 0));
>>>
>>> wouldn't, because the loop body would only be executed once.
>>>
>>> I'm only going off the name "infinite" though :)  It's possible that
>>> it has more connotations than that.
>>>
>>> Thanks,
>>> Richard
>>
>> Ack, yep, I see what you mean now, and yep, that kind of loop does
>> indeed pass through doloop_valid_p
>>
>> Interestingly, in the v8-M Arm ARM this is done with:
>>
>> ```
>>
>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>>   // This does not check whether a loop is currently active.
>>   // If the PE were in a loop, would this be the last one?
>>   return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>
>> ```
>>
>> So architecturally the asm we output would be ok (except maybe the
>> "branch too far subs;bgt;lctp" fallback at
>> `predicated_doloop_end_internal` (maybe that should be `bhi`))... But
>> now GE: isn't looking like an accurate representation of this
>> operation in the compiler.
>>
>> I'm wondering if I should try to make
>> `predicated_doloop_end_internal` contain a comparison along the lines
>> of:
>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int num_lanes_minus_1))
>>
>> I'll give that a try :)
>>
>> The only reason I'd chosen to go with GE earlier, tbh, was because of
>> the existing handling of GE in loop-doloop.cc
>>
>> Let me know if any other ideas come to your mind!
>>
>> Cheers,
>>
>> Stam
>
>
> It looks like I've had success with the below (diff to previous patch),
> trimmed a bit to only the functionally interesting things:
>
> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
> index 368d5138ca1..54dd4ee564b 100644
> --- a/gcc/config/arm/thumb2.md
> +++ b/gcc/config/arm/thumb2.md
> @@ -1649,16 +1649,28 @@
>            && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
>            && (INTVAL (decrement_num) != 1))
>          {
> -          insn = emit_insn
> -                  (gen_thumb2_addsi3_compare0
> -                    (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
> -          cmp = XVECEXP (PATTERN (insn), 0, 0);
> -          cc_reg = SET_DEST (cmp);
> -          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
>            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
> -          emit_jump_insn (gen_rtx_SET (pc_rtx,
> -                                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
> -                                                             loc_ref, pc_rtx)));
> +          switch (INTVAL (decrement_num))
> +            {
> +              case 2:
> +                insn = emit_jump_insn (gen_predicated_doloop_end_internal2
> +                                        (s0, loc_ref));
> +                break;
> +              case 4:
> +                insn = emit_jump_insn (gen_predicated_doloop_end_internal4
> +                                        (s0, loc_ref));
> +                break;
> +              case 8:
> +                insn = emit_jump_insn (gen_predicated_doloop_end_internal8
> +                                        (s0, loc_ref));
> +                break;
> +              case 16:
> +                insn = emit_jump_insn (gen_predicated_doloop_end_internal16
> +                                        (s0, loc_ref));
> +                break;
> +              default:
> +                gcc_unreachable ();
> +            }
>            DONE;
>          }
>      }
>
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 93905583b18..c083f965fa9 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -6922,23 +6922,24 @@
>  ;; Originally expanded by 'predicated_doloop_end'.
>  ;; In the rare situation where the branch is too far, we do also need to
>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
> -(define_insn "*predicated_doloop_end_internal"
> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>    [(set (pc)
>          (if_then_else
> -          (ge (plus:SI (reg:SI LR_REGNUM)
> -                       (match_operand:SI 0 "const_int_operand" ""))
> -              (const_int 0))
> -          (label_ref (match_operand 1 "" ""))
> +          (gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" "=r")
> +                                    (const_int <letp_num_lanes_neg>))]
> +                LETP)
> +               (const_int <letp_num_lanes_minus_1>))
> +          (match_operand 1 "" "")
>            (pc)))
> -   (set (reg:SI LR_REGNUM)
> -        (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
> +   (set (match_dup 0)
> +        (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
>     (clobber (reg:CC CC_REGNUM))]
>    "TARGET_HAVE_MVE"
>  {
>    if (get_attr_length (insn) == 4)
>      return "letp\t%|lr, %l1";
>    else
> -    return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
> +    return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
>  }
>    [(set (attr "length")
>       (if_then_else
> @@ -6947,11 +6948,11 @@
>         (const_int 6)))
>    (set_attr "type" "branch")])
>
> -(define_insn "dlstp<mode1>_insn"
> +(define_insn "dlstp<dlstp_elemsize>_insn"
>    [
>     (set (reg:SI LR_REGNUM)
>          (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
>           DLSTP))
>    ]
>    "TARGET_HAVE_MVE"
> -  "dlstp.<mode1>\t%|lr, %0")
> +  "dlstp.<dlstp_elemsize>\t%|lr, %0")
>
> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
> index 6a72700a127..47fdef989b4 100644
> --- a/gcc/loop-doloop.cc
> +++ b/gcc/loop-doloop.cc
> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>        || XEXP (inc_src, 0) != reg
>        || !CONST_INT_P (XEXP (inc_src, 1)))
>      return 0;
> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>
>    /* Check for (set (pc) (if_then_else (condition)
>                                         (label_ref (label))
> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
>    /* Extract loop termination condition.  */
>    condition = XEXP (SET_SRC (cmp), 0);
>
> -  /* We expect a GE or NE comparison with 0 or 1.
> -     */
> -  if ((GET_CODE (condition) != GE
> -       && GET_CODE (condition) != NE)
> -      || (XEXP (condition, 1) != const0_rtx
> -          && XEXP (condition, 1) != const1_rtx))
> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison with
> +     dec_num - 1.  */
> +  if (!((GET_CODE (condition) == GE
> +         || GET_CODE (condition) == NE)
> +        && (XEXP (condition, 1) == const0_rtx
> +            || XEXP (condition, 1) == const1_rtx))
> +      && !(GET_CODE (condition) == GTU
> +           && INTVAL (XEXP (condition, 1)) == (dec_num - 1)))
>      return 0;
>
> -  if ((XEXP (condition, 0) == reg)
> +  /* For the ARM special case of having a GTU: re-form the condition without
> +     the unspec for the benefit of the middle-end.  */
> +  if (GET_CODE (condition) == GTU)
> +    {
> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src,
> +                                  GEN_INT (dec_num - 1));
> +      return condition;
> +    }
> +  else if ((XEXP (condition, 0) == reg)
>       /* For the third case:  */
>       || ((cc_reg != NULL_RTX)
>           && (XEXP (condition, 0) == cc_reg)
>           && (reg_orig == reg))
> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
>                                   (label_ref (label))
>                                   (pc))))])
>
> -    So we return the second form instead for the two cases when n == 1.
> -
> -    For n > 1, the final value may be exceeded, so use GE instead of NE.
> +    So we return the second form instead for the two cases.
>   */
> -  if (GET_CODE (pattern) != PARALLEL)
> -    {
> -      if (INTVAL (XEXP (inc_src, 1)) != -1)
> -       condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
> -      else
> -       condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
> -    }
> -
> +  condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>    return condition;
> -  }
> +  }
>
>  /* ??? If a machine uses a funny comparison, we could return a
>     canonicalized form here.  */
> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
>      case GE:
>        /* Currently only GE tests against zero are supported.
>         */
>        gcc_assert (XEXP (condition, 1) == const0_rtx);
> -
> +      /* FALLTHRU */
> +    case GTU:
>        noloop = constm1_rtx;
>        /* The iteration count does not need incrementing for a GE
>           test.  */
> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
> index a6a7ff507a5..9398702cddd 100644
> --- a/gcc/config/arm/iterators.md
> +++ b/gcc/config/arm/iterators.md
> @@ -2673,8 +2673,16 @@
>  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
>  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
>
> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
> -                        (DLSTP64 "64")])
> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
> +                                 (DLSTP64 "64")])
> +
> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
> +                                 (LETP64 "2")])
> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8") (LETP32 "-4")
> +                                     (LETP64 "-2")])
> +
> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7") (LETP32 "3")
> +                                         (LETP64 "1")])
>
>  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
>                             (UNSPEC_DOT_U "u8")
> @@ -2921,6 +2929,8 @@
>  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
>  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
>                              DLSTP64])
> +(define_int_iterator LETP [LETP8 LETP16 LETP32
> +                           LETP64])
>
>  ;; Define iterators for VCMLA operations
>  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
>
> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
> index 12ae4c4f820..2d6f27c14f4 100644
> --- a/gcc/config/arm/unspecs.md
> +++ b/gcc/config/arm/unspecs.md
> @@ -587,6 +587,10 @@
>    DLSTP16
>    DLSTP32
>    DLSTP64
> +  LETP8
> +  LETP16
> +  LETP32
> +  LETP64
>    VPNOT
>    VCREATEQ_F
>    VCVTQ_N_TO_F_S
>
>
> I've attached the whole [2/2] patch diff with this change and
> the required comment changes in doloop_condition_get.
> WDYT?
>
>
> Thanks,
>
> Stam