From: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
To: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>,
Stamatis Markianos-Wright via Gcc-patches
<gcc-patches@gcc.gnu.org>,
Richard Earnshaw <Richard.Earnshaw@arm.com>,
Richard Sandiford <Richard.Sandiford@arm.com>,
Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
Subject: Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
Date: Thu, 7 Dec 2023 18:08:25 +0000 [thread overview]
Message-ID: <8bd09ac9-7d8c-463a-82f0-87a5ec5b0ae1@arm.com> (raw)
In-Reply-To: <8000eb76-0abf-45db-92c1-d307ac8fccfc@arm.com>
Thanks for addressing my comments. I have reviewed this and the other
patch and they LGTM. However, I do not have approval rights, so you
will need an OK from a maintainer.
Thanks for doing this :)
Andre
On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:
> Hi Andre,
>
> Thanks for the comments, see latest revision attached.
>
> On 27/11/2023 12:47, Andre Vieira (lists) wrote:
>> Hi Stam,
>>
>> Just some comments.
>>
>> +/* Recursively scan through the DF chain backwards within the basic
>> block and
>> + determine if any of the USEs of the original insn (or the USEs of
>> the insns
>> + where they were DEF-ed, etc., recursively) were affected by implicit VPT
>>
>> s/Recursively scan/Scan/ as you no longer recurse, thanks for that by
>> the way :)
>> Also remove "recursively" for the same reason.
>>
>> + if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P
>> (cond_temp_iv.step))
>> + return NULL;
>> + /* Look at the steps and swap around the rtx's if needed. Error
>> out if
>> + one of them cannot be identified as constant. */
>> + if (INTVAL (cond_counter_iv.step) != 0 && INTVAL
>> (cond_temp_iv.step) != 0)
>> + return NULL;
>>
>> Move this comment above the previous if, since that is where the
>> erroring out it describes actually happens.
> Done
>>
>> + emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
>> space after 'insn_note)'
>>
>> @@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>> if (! REG_P (reg))
>> return 0;
>> - /* Check if something = (plus (reg) (const_int -1)).
>> + /* Check if something = (plus (reg) (const_int -n)).
>> On IA-64, this decrement is wrapped in an if_then_else. */
>> inc_src = SET_SRC (inc);
>> if (GET_CODE (inc_src) == IF_THEN_ELSE)
>> inc_src = XEXP (inc_src, 1);
>> if (GET_CODE (inc_src) != PLUS
>> || XEXP (inc_src, 0) != reg
>> - || XEXP (inc_src, 1) != constm1_rtx)
>> + || !CONST_INT_P (XEXP (inc_src, 1)))
>>
>> Do we ever check that the increment is negative? We used to check if it
>> was -1; now we only check that it's a constant, but not a negative one,
>> so I suspect this needs a:
>> || INTVAL (XEXP (inc_src, 1)) >= 0
> Good point. Done
>>
>> @@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc
>> *desc,
>> case GE:
>> /* Currently only GE tests against zero are supported. */
>> gcc_assert (XEXP (condition, 1) == const0_rtx);
>> -
>> + /* FALLTHRU */
>> + case GTU:
>> noloop = constm1_rtx;
>>
>> I spent a very long time staring at this trying to understand why
>> noloop = constm1_rtx for GTU, where I thought it should've been (count
>> & (n-1)). For the current use of doloop it doesn't matter because ARM
>> is the only target using it and you set desc->noloop_assumptions to
>> null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used.
>> However, if a different target accepts this GTU pattern then this
>> target agnostic code will do the wrong thing. I suggest we either:
>> - set noloop to what we think might be the correct value, which if
>> you ask me should be 'count & (XEXP (condition, 1))',
>> - or add a gcc_assert (GET_CODE (condition) != GTU); under the if
>> (desc->noloop_assumptions) part and document why. I have a slight
>> preference for the assert, given that otherwise we are adding code
>> that we can't test.
>
> Yea, that's true tbh. I've done the latter, but also separated out the
> "case GTU:" and added a comment, so that it's clearer that the noloop
> things aren't used in the only implemented GTU case (Arm).
>
> Thank you :)
>
>>
>> LGTM otherwise (but I don't have the power to approve this ;)).
>>
>> Kind regards,
>> Andre
>> ________________________________________
>> From: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>
>> Sent: Thursday, November 16, 2023 11:36 AM
>> To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw;
>> Richard Sandiford; Kyrylo Tkachov
>> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated
>> Low Overhead Loops
>>
>> Pinging back to the top of reviewers' inboxes due to worry about Stage 1
>> End in a few days :)
>>
>>
>> See the last email for the latest version of the 2/2 patch. The 1/2
>> patch is A-Ok from Kyrill's earlier target-backend review.
>>
>>
>> On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>>>
>>> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>>>
>>>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>>>> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>>>>>> One of the main reasons for reading the arm bits was to try to
>>>>>>> answer
>>>>>>> the question: if we switch to a downcounting loop with a GE
>>>>>>> condition,
>>>>>>> how do we make sure that the start value is not a large unsigned
>>>>>>> number that is interpreted as negative by GE? E.g. if the loop
>>>>>>> originally counted up in steps of N and used an LTU condition,
>>>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>>>> But the loop might never iterate if we start counting down from
>>>>>>> most values in that range.
>>>>>>>
>>>>>>> Does the patch handle that?
>>>>>> So AFAICT this is actually handled in the generic code in
>>>>>> `doloop_valid_p`:
>>>>>>
>>>>>> Loops of this kind fail because they are "desc->infinite", so no
>>>>>> loop-doloop conversion is attempted at all (even for standard
>>>>>> dls/le loops).
>>>>>>
>>>>>> Thanks to that check I haven't been able to trigger anything like the
>>>>>> behaviour you describe. Do you think the doloop_valid_p checks are
>>>>>> robust enough?
>>>>> The loops I was thinking of are provably not infinite though. E.g.:
>>>>>
>>>>> for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>>> ...
>>>>>
>>>>> is known to terminate. And doloop conversion is safe with the normal
>>>>> count-down-by-1 approach, so I don't think the current code would need
>>>>> to reject it. I.e. a conversion to:
>>>>>
>>>>> unsigned int i = UINT_MAX - 101;
>>>>> do
>>>>> ...
>>>>> while (--i != ~0U);
>>>>>
>>>>> would be safe, but a conversion to:
>>>>>
>>>>> int i = UINT_MAX - 101;
>>>>> do
>>>>> ...
>>>>> while ((i -= step, i > 0));
>>>>>
>>>>> wouldn't, because the loop body would only be executed once.
>>>>>
>>>>> I'm only going off the name "infinite" though :) It's possible that
>>>>> it has more connotations than that.
>>>>>
>>>>> Thanks,
>>>>> Richard
>>>>
>>>> Ack, yep, I see what you mean now, and yep, that kind of loop does
>>>> indeed pass through doloop_valid_p.
>>>>
>>>> Interestingly, in the v8-M Arm ARM this is done with:
>>>>
>>>> ```
>>>>
>>>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>>>> // This does not check whether a loop is currently active.
>>>> // If the PE were in a loop, would this be the last one?
>>>> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>>>
>>>> ```
>>>>
>>>> So architecturally the asm we output would be OK (except maybe the
>>>> "branch too far subs;bgt;lctp" fallback at
>>>> `predicated_doloop_end_internal`; maybe that should be `bhi`)... But
>>>> now GE isn't looking like an accurate representation of this
>>>> operation in the compiler.
>>>>
>>>> I'm wondering if I should try to make
>>>> `predicated_doloop_end_internal` contain a comparison along the lines
>>>> of:
>>>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int
>>>> num_lanes_minus_1))
>>>>
>>>> I'll give that a try :)
>>>>
>>>> The only reason I'd chosen to go with GE earlier, tbh, was because of
>>>> the existing handling of GE in loop-doloop.cc
>>>>
>>>> Let me know if any other ideas come to your mind!
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Stam
>>>
>>>
>>> It looks like I've had success with the below (diff to the previous
>>> patch), trimmed a bit to only the functionally interesting things:
>>>
>>>
>>>
>>>
>>> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
>>> index 368d5138ca1..54dd4ee564b 100644
>>> --- a/gcc/config/arm/thumb2.md
>>> +++ b/gcc/config/arm/thumb2.md
>>> @@ -1649,16 +1649,28 @@
>>> && (decrement_num = arm_attempt_dlstp_transform
>>> (operands[1]))
>>> && (INTVAL (decrement_num) != 1))
>>> {
>>> - insn = emit_insn
>>> - (gen_thumb2_addsi3_compare0
>>> - (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
>>> - cmp = XVECEXP (PATTERN (insn), 0, 0);
>>> - cc_reg = SET_DEST (cmp);
>>> - bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
>>> loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
>>> - emit_jump_insn (gen_rtx_SET (pc_rtx,
>>> - gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
>>> - loc_ref, pc_rtx)));
>>> + switch (INTVAL (decrement_num))
>>> + {
>>> + case 2:
>>> + insn = emit_jump_insn (gen_predicated_doloop_end_internal2
>>> + (s0, loc_ref));
>>> + break;
>>> + case 4:
>>> + insn = emit_jump_insn (gen_predicated_doloop_end_internal4
>>> + (s0, loc_ref));
>>> + break;
>>> + case 8:
>>> + insn = emit_jump_insn (gen_predicated_doloop_end_internal8
>>> + (s0, loc_ref));
>>> + break;
>>> + case 16:
>>> + insn = emit_jump_insn (gen_predicated_doloop_end_internal16
>>> + (s0, loc_ref));
>>> + break;
>>> + default:
>>> + gcc_unreachable ();
>>> + }
>>> DONE;
>>> }
>>> }
>>>
>>> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
>>> index 93905583b18..c083f965fa9 100644
>>> --- a/gcc/config/arm/mve.md
>>> +++ b/gcc/config/arm/mve.md
>>> @@ -6922,23 +6922,24 @@
>>> ;; Originally expanded by 'predicated_doloop_end'.
>>> ;; In the rare situation where the branch is too far, we do also
>>> need to
>>> ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
>>> -(define_insn "*predicated_doloop_end_internal"
>>> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>>> [(set (pc)
>>> (if_then_else
>>> - (ge (plus:SI (reg:SI LR_REGNUM)
>>> - (match_operand:SI 0 "const_int_operand" ""))
>>> - (const_int 0))
>>> - (label_ref (match_operand 1 "" ""))
>>> + (gtu (unspec:SI [(plus:SI (match_operand:SI 0
>>> "s_register_operand" "=r")
>>> + (const_int <letp_num_lanes_neg>))]
>>> + LETP)
>>> + (const_int <letp_num_lanes_minus_1>))
>>> + (match_operand 1 "" "")
>>> (pc)))
>>> - (set (reg:SI LR_REGNUM)
>>> - (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
>>> + (set (match_dup 0)
>>> + (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
>>> (clobber (reg:CC CC_REGNUM))]
>>> "TARGET_HAVE_MVE"
>>> {
>>> if (get_attr_length (insn) == 4)
>>> return "letp\t%|lr, %l1";
>>> else
>>> - return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
>>> + return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
>>> }
>>> [(set (attr "length")
>>> (if_then_else
>>> @@ -6947,11 +6948,11 @@
>>> (const_int 6)))
>>> (set_attr "type" "branch")])
>>>
>>> -(define_insn "dlstp<mode1>_insn"
>>> +(define_insn "dlstp<dlstp_elemsize>_insn"
>>> [
>>> (set (reg:SI LR_REGNUM)
>>> (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
>>> DLSTP))
>>> ]
>>> "TARGET_HAVE_MVE"
>>> - "dlstp.<mode1>\t%|lr, %0")
>>> + "dlstp.<dlstp_elemsize>\t%|lr, %0")
>>>
>>> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
>>> index 6a72700a127..47fdef989b4 100644
>>> --- a/gcc/loop-doloop.cc
>>> +++ b/gcc/loop-doloop.cc
>>> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>> || XEXP (inc_src, 0) != reg
>>> || !CONST_INT_P (XEXP (inc_src, 1)))
>>> return 0;
>>> + int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>>>
>>> /* Check for (set (pc) (if_then_else (condition)
>>> (label_ref (label))
>>> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>> /* Extract loop termination condition. */
>>> condition = XEXP (SET_SRC (cmp), 0);
>>>
>>> - /* We expect a GE or NE comparison with 0 or 1. */
>>> - if ((GET_CODE (condition) != GE
>>> - && GET_CODE (condition) != NE)
>>> - || (XEXP (condition, 1) != const0_rtx
>>> - && XEXP (condition, 1) != const1_rtx))
>>> + /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison
>>> with
>>> + dec_num - 1. */
>>> + if (!((GET_CODE (condition) == GE
>>> + || GET_CODE (condition) == NE)
>>> + && (XEXP (condition, 1) == const0_rtx
>>> + || XEXP (condition, 1) == const1_rtx ))
>>> + &&!(GET_CODE (condition) == GTU
>>> + && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
>>> return 0;
>>>
>>> - if ((XEXP (condition, 0) == reg)
>>> + /* For the ARM special case of having a GTU: re-form the condition
>>> without
>>> + the unspec for the benefit of the middle-end. */
>>> + if (GET_CODE (condition) == GTU)
>>> + {
>>> + condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT
>>> (dec_num - 1));
>>> + return condition;
>>> + }
>>> + else if ((XEXP (condition, 0) == reg)
>>> /* For the third case: */
>>> || ((cc_reg != NULL_RTX)
>>> && (XEXP (condition, 0) == cc_reg)
>>> && (reg_orig == reg))
>>> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>> (label_ref (label))
>>> (pc))))])
>>>
>>> - So we return the second form instead for the two cases when n == 1.
>>> -
>>> - For n > 1, the final value may be exceeded, so use GE instead of
>>> NE.
>>> + So we return the second form instead for the two cases.
>>> */
>>> - if (GET_CODE (pattern) != PARALLEL)
>>> - {
>>> - if (INTVAL (XEXP (inc_src, 1)) != -1)
>>> - condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
>>> - else
>>> - condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
>>> - }
>>> -
>>> + condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>>> return condition;
>>> - }
>>> + }
>>>
>>> /* ??? If a machine uses a funny comparison, we could return a
>>> canonicalized form here. */
>>> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc
>>> *desc,
>>> case GE:
>>> /* Currently only GE tests against zero are supported. */
>>> gcc_assert (XEXP (condition, 1) == const0_rtx);
>>> -
>>> + /* FALLTHRU */
>>> + case GTU:
>>> noloop = constm1_rtx;
>>> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
>>> index a6a7ff507a5..9398702cddd 100644
>>> --- a/gcc/config/arm/iterators.md
>>> +++ b/gcc/config/arm/iterators.md
>>> @@ -2673,8 +2673,16 @@
>>> (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
>>> (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
>>>
>>> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
>>> - (DLSTP64 "64")])
>>> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32
>>> "32")
>>> + (DLSTP64 "64")])
>>> +
>>> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
>>> + (LETP64 "2")])
>>> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8")
>>> (LETP32 "-4")
>>> + (LETP64 "-2")])
>>> +
>>> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7")
>>> (LETP32 "3")
>>> + (LETP64 "1")])
>>>
>>> (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
>>> (UNSPEC_DOT_U "u8")
>>> @@ -2921,6 +2929,8 @@
>>> (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
>>> (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
>>> DLSTP64])
>>> +(define_int_iterator LETP [LETP8 LETP16 LETP32
>>> + LETP64])
>>>
>>> ;; Define iterators for VCMLA operations
>>> (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
>>> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
>>> index 12ae4c4f820..2d6f27c14f4 100644
>>> --- a/gcc/config/arm/unspecs.md
>>> +++ b/gcc/config/arm/unspecs.md
>>> @@ -587,6 +587,10 @@
>>> DLSTP16
>>> DLSTP32
>>> DLSTP64
>>> + LETP8
>>> + LETP16
>>> + LETP32
>>> + LETP64
>>> VPNOT
>>> VCREATEQ_F
>>> VCVTQ_N_TO_F_S
>>>
>>>
>>> I've attached the whole [2/2] patch diff with this change and
>>> the required comment changes in doloop_condition_get.
>>> WDYT?
>>>
>>>
>>> Thanks,
>>>
>>> Stam
>>>
>>>
>>>>
>>>>
Thread overview: 17+ messages
2023-08-17 10:31 [PATCH " Stamatis Markianos-Wright
2023-09-06 17:19 ` [PING][PATCH " Stamatis Markianos-Wright
2023-09-14 12:10 ` Kyrylo Tkachov
2023-09-28 12:51 ` Andre Vieira (lists)
2023-10-11 11:34 ` Stamatis Markianos-Wright
2023-10-23 10:16 ` Andre Vieira (lists)
2023-10-24 15:11 ` Richard Sandiford
2023-11-06 11:03 ` Stamatis Markianos-Wright
2023-11-06 11:24 ` Richard Sandiford
2023-11-06 17:29 ` Stamatis Markianos-Wright
2023-11-10 12:41 ` Stamatis Markianos-Wright
2023-11-16 11:36 ` Stamatis Markianos-Wright
2023-11-27 12:47 ` Andre Vieira (lists)
2023-11-30 12:55 ` Stamatis Markianos-Wright
2023-12-07 18:08 ` Andre Vieira (lists) [this message]
2023-12-09 18:31 ` Richard Sandiford
2023-12-12 17:56 ` Richard Earnshaw