From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=Pdy4=CL=arm.com=andre.simoesdiasvieira@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id 8B0FE3858C36
	for <gcc-patches@gcc.gnu.org>; Fri, 23 Jun 2023 16:25:25 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 8B0FE3858C36
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 24CCB1042;
	Fri, 23 Jun 2023 09:26:09 -0700 (PDT)
Received: from [10.57.76.213] (unknown [10.57.76.213])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 3F6EC3F64C;
	Fri, 23 Jun 2023 09:25:24 -0700 (PDT)
Message-ID: <ca23f624-6b10-a8db-dac1-b1185410f4a6@arm.com>
Date: Fri, 23 Jun 2023 17:25:19 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Subject: Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead
 Loops
Content-Language: en-US
To: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>,
 "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Cc: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>,
 Richard Earnshaw <Richard.Earnshaw@arm.com>, ramana.gcc@gmail.com,
 "nickc@redhat.com" <nickc@redhat.com>
References: <cada1fc1-dd8f-60b1-ecc2-f89262d458d4@arm.com>
From: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
In-Reply-To: <cada1fc1-dd8f-60b1-ecc2-f89262d458d4@arm.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-6.8 required=5.0 tests=BAYES_00,BODY_8BITS,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,NICE_REPLY_A,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

+  /* In order to find out if the loop is of type A or B above look for the
+     loop counter: it will either be incrementing by one per iteration or
+     it will be decrementing by num_of_lanes.  We can find the loop counter
+     in the condition at the end of the loop.  */
+  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body));
+  gcc_assert (cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode)
+	      && GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE);

Not sure this should be an assert. If we do encounter a differently 
formed loop, we should bail out of DLSTPing for now but we shouldn't ICE.


+  /* The loop latch has to be empty.  When compiling all the known MVE 
LoLs in
+     user applications, none of those with incrementing counters had 
any real
+     insns in the loop latch.  As such, this function has only been 
tested with
+     an empty latch and may misbehave or ICE if we somehow get here with an
+     increment in the latch, so, for sanity, error out early.  */
+  rtx_insn *dec_insn = BB_END (body->loop_father->latch);
+  if (NONDEBUG_INSN_P (dec_insn))
+    gcc_unreachable ();

Similarly here I'd return false rather than gcc_unreachable ();


+  /* Find where both of those are modified in the loop body bb.  */
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find
+						 (body, REGNO (condcount))));
Put = on newline, breaks it down nicer.

+	  counter_orig_set = XEXP (PATTERN
+				    (DF_REF_INSN
+				      (DF_REF_NEXT_REG
+					(DF_REG_DEF_CHAIN
+					 (REGNO
+					   (XEXP (condcount_reg_set, 0)))))), 1);

This makes me a bit nervous, can we be certain that the PATTERN of the 
next insn that sets it is indeed a set. Heck can we even be sure 
DF_REG_DEF_CHAIN returns a non-null, I can't imagine why not but maybe 
there are some constructs it can't follow-up on? Might just be worth 
checking these steps and bailing out.


+      /* When we find the vctp instruction: This may be followed by
+      a zero-extend insn to SImode.  If it is, then save the
+      zero-extended REG into vctp_vpr_generated.  If there is no
+      zero-extend, then store the raw output of the vctp.
+      For any VPT-predicated instructions we need to ensure that
+      the VPR they use is the same as the one given here and
+      they often consume the output of a subreg of the SImode
+      zero-extended VPR-reg.  As a result, comparing against the
+      output of the zero-extend is more likely to succeed.
+      This code also guarantees to us that the vctp comes before
+      any instructions that use the VPR within the loop, for the
+      dlstp/letp transform to succeed.  */

Wrong comment indent after first line.

+  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body);
+  if (!vctp_insn || !arm_mve_loop_valid_for_dlstp (body))
+    return GEN_INT (1);

arm_mve_loop_valid_for_dlstp already calls arm_mve_get_loop_vctp, maybe 
have 'arm_mve_loop_valid_for_dlstp' return vctp_insn or null to 
determine success or failure, avoids looping through the BB again.

For the same reason I'd also pass vctp_insn down to 
'arm_mve_check_df_chain_back_for_implic_predic'.

+	  if (GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)
+	    {
+	      rtx_insn *next_use2 = NULL;

Are we sure single_set can never return 0 here? Maybe worth an extra 
check and bail out if it does?

+       /* If the insn pattern requires the use of the VPR value from the
+	  vctp as an input parameter.  */
s/an an input parameter./as an input parameter for predication./

+	  /* None of registers USE-d by the instruction need can be the VPR
+	     vctp_vpr_generated.  This blocks the optimisation if there any
+	     instructions that use the optimised-out VPR value in any way
+	     other than as a VPT block predicate.  */

Reword this slightly to be less complex:
This instruction USE-s the vctp_vpr_generated other than for 
predication, this blocks the transformation as we are not allowed to 
optimise the VPR value away.

Will continue reviewing next week :)

On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
>      Hi all,
> 
>      This is the 2/2 patch that contains the functional changes needed
>      for MVE Tail Predicated Low Overhead Loops.  See my previous email
>      for a general introduction of MVE LOLs.
> 
>      This support is added through the already existing loop-doloop
>      mechanisms that are used for non-MVE dls/le looping.
> 
>      Mid-end changes are:
> 
>      1) Relax the loop-doloop mechanism in the mid-end to allow for
>         decrement numbers other that -1 and for `count` to be an
>         rtx containing a simple REG (which in this case will contain
>         the number of elements to be processed), rather
>         than an expression for calculating the number of iterations.
>      2) Added a new df utility function: `df_bb_regno_only_def_find` that
>         will return the DEF of a REG only if it is DEF-ed once within the
>         basic block.
> 
>      And many things in the backend to implement the above optimisation:
> 
>      3)  Implement the `arm_predict_doloop_p` target hook to instruct the
>          mid-end about Low Overhead Loops (MVE or not), as well as
>          `arm_loop_unroll_adjust` which will prevent unrolling of any loops
>          that are valid for becoming MVE Tail_Predicated Low Overhead Loops
>          (unrolling can transform a loop in ways that invalidate the dlstp/
>          letp tranformation logic and the benefit of the dlstp/letp loop
>          would be considerably higher than that of unrolling)
>      4)  Appropriate changes to the define_expand of doloop_end, new
>          patterns for dlstp and letp, new iterators,  unspecs, etc.
>      5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
>         * `arm_mve_dlstp_check_dec_counter`
>         * `arm_mve_dlstp_check_inc_counter`
>         * `arm_mve_check_reg_origin_is_num_elems`
>         * `arm_mve_check_df_chain_back_for_implic_predic`
>         * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
>         This all, in smoe way or another, are running checks on the loop
>         structure in order to determine if the loop is valid for dlstp/letp
>         transformation.
>      6) `arm_attempt_dlstp_transform`: (called from the define_expand of
>          doloop_end) this function re-checks for the loop's suitability for
>          dlstp/letp transformation and then implements it, if possible.
>      7) Various utility functions:
>         *`arm_mve_get_vctp_lanes` to map
>         from vctp unspecs to number of lanes, and 
> `arm_get_required_vpr_reg`
>         to check an insn to see if it requires the VPR or not.
>         * `arm_mve_get_loop_vctp`
>         * `arm_mve_get_vctp_lanes`
>         * `arm_emit_mve_unpredicated_insn_to_seq`
>         * `arm_get_required_vpr_reg`
>         * `arm_get_required_vpr_reg_param`
>         * `arm_get_required_vpr_reg_ret_val`
>         * `arm_mve_vec_insn_is_predicated_with_this_predicate`
>         * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
> 
>      No regressions on arm-none-eabi with various targets and on
>      aarch64-none-elf. Thoughts on getting this into trunk?
> 
>      Thank you,
>      Stam Markianos-Wright
> 
>      gcc/ChangeLog:
> 
>              * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): 
> Rename to...
>              (arm_target_bb_ok_for_lob): ...this
>              (arm_attempt_dlstp_transform): New.
>              * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
>              (TARGET_PREDICT_DOLOOP_P): New.
>              (arm_block_set_vect):
>              (arm_target_insn_ok_for_lob): Rename from 
> arm_target_insn_ok_for_lob.
>              (arm_target_bb_ok_for_lob): New.
>              (arm_mve_get_vctp_lanes): New.
>              (arm_get_required_vpr_reg): New.
>              (arm_get_required_vpr_reg_param): New.
>              (arm_get_required_vpr_reg_ret_val): New.
>              (arm_mve_get_loop_vctp): New.
> (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
>              (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
>              (arm_mve_check_df_chain_back_for_implic_predic): New.
>              (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
>              (arm_mve_check_reg_origin_is_num_elems): New.
>              (arm_mve_dlstp_check_inc_counter): New.
>              (arm_mve_dlstp_check_dec_counter): New.
>              (arm_mve_loop_valid_for_dlstp): New.
>              (arm_predict_doloop_p): New.
>              (arm_loop_unroll_adjust): New.
>              (arm_emit_mve_unpredicated_insn_to_seq): New.
>              (arm_attempt_dlstp_transform): New.
>              * config/arm/iterators.md (DLSTP): New.
>              (mode1): Add DLSTP mappings.
>              * config/arm/mve.md (*predicated_doloop_end_internal): New.
>              (dlstp<mode1>_insn): New.
>              * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>              * config/arm/unspecs.md: New unspecs.
>              * df-core.cc (df_bb_regno_only_def_find): New.
>              * df.h (df_bb_regno_only_def_find): New.
>              * loop-doloop.cc (doloop_condition_get): Relax conditions.
>              (doloop_optimize): Add support for elementwise LoLs.
> 
>      gcc/testsuite/ChangeLog:
> 
>              * gcc.target/arm/lob.h: Update framework.
>              * gcc.target/arm/lob1.c: Likewise.
>              * gcc.target/arm/lob6.c: Likewise.
>              * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
>              * gcc.target/arm/mve/dlstp-int16x8.c: New test.
>              * gcc.target/arm/mve/dlstp-int32x4.c: New test.
>              * gcc.target/arm/mve/dlstp-int64x2.c: New test.
>              * gcc.target/arm/mve/dlstp-int8x16.c: New test.
>              * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.