From: "Andre Vieira (lists)"
To: Stam Markianos-Wright, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
Date: Tue, 15 Nov 2022 15:51:34 +0000
In-Reply-To: <212ceca7-21f7-7d99-9543-9e39d9056aba@arm.com>

On 11/11/2022 17:40, Stam Markianos-Wright via Gcc-patches wrote:
> Hi all,
>
> This is the 2/2 patch that contains the functional changes needed
> for MVE Tail Predicated Low Overhead Loops.  See my previous email
> for a general introduction to MVE LOLs.
>
> This support is added through the already existing loop-doloop
> mechanisms that are used for non-MVE dls/le looping.
>
> Changes are:
>
> 1) Relax the loop-doloop mechanism in the mid-end to allow for
>    decrement numbers other than -1 and for `count` to be an
>    rtx containing the number of elements to be processed, rather
>    than an expression for calculating the number of iterations.
> 2) Add an `allow_elementwise_doloop` target hook.  This allows the
>    target backend to manipulate the iteration count as it needs:
>    in our case to change it from a pre-calculation of the number
>    of iterations to the number of elements to be processed.
> 3) The doloop_end target-insn now has an additional parameter:
>    the `count` (note: this is before it gets modified to just be
>    the number of elements), so that the decrement value is
>    extracted from that parameter.
>
> And many things in the backend to implement the above optimisation:
>
> 4) Appropriate changes to the define_expand of doloop_end and new
>    patterns for dlstp and letp.
> 5) `arm_attempt_dlstp_transform`: (called from the define_expand of
>    doloop_end) this function checks the loop's suitability for a
>    dlstp/letp transformation and then implements it, if possible.
> 6) `arm_mve_get_loop_unique_vctp`: A function that loops through
>    the loop contents and returns the vctp VPR-generating operation
>    within the loop, if there is exactly one vctp in the loop.
> 7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map
>    from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
>    to check an insn to see if it requires the VPR or not.
>
> No regressions on arm-none-eabi with various targets and on
> aarch64-none-elf. Thoughts on getting this into trunk?
>
> Thank you,
> Stam Markianos-Wright
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64.md: Add extra doloop_end arg.
>         * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
>         * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
>         (arm_mve_get_vctp_lanes): New.
>         (arm_get_required_vpr_reg): New.
>         (arm_mve_get_loop_unique_vctp): New.
>         (arm_attempt_dlstp_transform): New.
>         (arm_allow_elementwise_doloop): New.
>         * config/arm/iterators.md:
>         * config/arm/mve.md (*predicated_doloop_end_internal): New.
>         (dlstp_insn): New.
>         * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>         * config/arm/unspecs.md: New unspecs.
>         * config/ia64/ia64.md: Add extra doloop_end arg.
>         * config/pru/pru.md: Add extra doloop_end arg.
>         * config/rs6000/rs6000.md: Add extra doloop_end arg.
>         * config/s390/s390.md: Add extra doloop_end arg.
>         * config/v850/v850.md: Add extra doloop_end arg.
>         * doc/tm.texi: Document new hook.
>         * doc/tm.texi.in: Likewise.
>         * loop-doloop.cc (doloop_condition_get): Relax conditions.
>         (doloop_optimize): Add support for elementwise LoLs.
>         * target-insns.def (doloop_end): Add extra arg.
>         * target.def (allow_elementwise_doloop): New hook.
>         * targhooks.cc (default_allow_elementwise_doloop): New.
>         * targhooks.h (default_allow_elementwise_doloop): New.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/arm/lob.h: Update framework.
>         * gcc.target/arm/lob1.c: Likewise.
>         * gcc.target/arm/lob6.c: Likewise.
>         * gcc.target/arm/dlstp-int16x8.c: New test.
>         * gcc.target/arm/dlstp-int32x4.c: New test.
>         * gcc.target/arm/dlstp-int64x2.c: New test.
>         * gcc.target/arm/dlstp-int8x16.c: New test.
>
>
> ### Inline copy of patch ###
>
> diff --git a/gcc/config/aarch64/aarch64.md
> b/gcc/config/aarch64/aarch64.md
> index
> f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369
> 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -7366,7 +7366,8 @@
>  ;; knows what to generate.
>  (define_expand "doloop_end"
>    [(use (match_operand 0 "" ""))      ; loop pseudo
> -   (use (match_operand 1 "" ""))]     ; label
> +   (use (match_operand 1 "" ""))      ; label
> +   (use (match_operand 2 "" ""))]     ; decrement constant
>    "optimize > 0 && flag_modulo_sched"
>  {
>    rtx s0;
> diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
> index
> 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34
> 100644
> --- a/gcc/config/arm/arm-protos.h
> +++ b/gcc/config/arm/arm-protos.h
> @@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *,
> rtx *, rtx *, rtx *);
>  extern bool arm_q_bit_access (void);
>  extern bool arm_ge_bits_access (void);
>  extern bool arm_target_insn_ok_for_lob (rtx);
> -
> +extern rtx arm_attempt_dlstp_transform (rtx, rtx);
>  #ifdef RTX_CODE
>  enum reg_class
>  arm_mode_base_reg_class (machine_mode);
> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
> index
> ee8f1babf8a1319e77e0db0fa55851c038048804..99e144d52c26597c64b982b3d4ae9a62a114cf18
> 100644
> --- a/gcc/config/arm/arm.cc
> +++ b/gcc/config/arm/arm.cc
> @@ -470,6 +470,9 @@ static const struct attribute_spec
> arm_attribute_table[] =
>  #undef TARGET_SCHED_REORDER
>  #define TARGET_SCHED_REORDER arm_sched_reorder
>
> +#undef TARGET_ALLOW_ELEMENTWISE_DOLOOP
> +#define TARGET_ALLOW_ELEMENTWISE_DOLOOP arm_allow_elementwise_doloop
> +

Just a nit, but maybe keep the same naming scheme as the existing hook:
TARGET_CAN_USE_ELEMENTWISE_DOLOOP_P?

> +
> +static rtx
> +arm_get_required_vpr_reg (rtx_insn *insn)
> +{
> +  bool requires_vpr;
> +
> +  extract_constrain_insn (insn);
> +  int n_operands = recog_data.n_operands;
> +  if (recog_data.n_alternatives == 0)
> +    return NULL_RTX;
> +
> +  /* Fill in recog_op_alt with information about the constraints of
> +     this insn.  */
> +  preprocess_constraints (insn);
> +
> +  for (int use = 0; use < n_operands; use++)
> +    {
> +      requires_vpr = true;
> +      /* Iterate through alternatives of operand "use" in recog_op_alt and
> +       * identify if the operand is required to be the VPR.  */

Remove the * at the start of the new line.

> +      for (int alt1 = 0; alt1 < recog_data.n_alternatives; alt1++)
> +    {
> +      const operand_alternative *op_alt1
> +          = &recog_op_alt[alt1 * n_operands];
> +      /* Fetch the reg_class for each entry and check it against the
> +       * VPR_REG reg_class.  */
> +      if (alternative_class (op_alt1, use) != VPR_REG)
> +        requires_vpr = false;
> +    }
> +      /* If all alternatives of the insn require the VPR reg for this operand,
> +     it means that either this is a VPR-generating instruction, like a vctp,
> +     vcmp, etc., or it is a VPT-predicated instruction.  Return the subrtx
> +     of the VPR reg operand.  */
> +      if (requires_vpr)
> +    return recog_data.operand[use];
> +    }
> +  return NULL_RTX;
> +}
> +
> +/* Scan the basic block of a loop body for a vctp instruction.  If there is
> +   exactly one unique vctp instruction, return its rtx_insn *.  */
> +
> +static rtx_insn *
> +arm_mve_get_loop_unique_vctp (basic_block bb)
> +{
> +  rtx_insn *insn = BB_HEAD (bb);
> +  rtx_insn *vctp_op = NULL;
> +
> +  /* Now scan through all the instruction patterns and
> +     pick out any MVE instructions.  */
> +  FOR_BB_INSNS (bb, insn)
> +    {
> +      if (INSN_P (insn))
> +    {
> +      /* First check if this is a vctp instruction.  There needs to be
> +         exactly one vctp instruction within the loop.  */
> +      if (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
> +        {
> +          /* If we already found one vctp instruction, then the
> +         loop is not consistent internally.  */
> +          if (vctp_op)
> +        return NULL;
> +
> +          vctp_op = insn;
> +        }
> +    }
> +    }
> +  return vctp_op;
> +}
> +
> +rtx
> +arm_attempt_dlstp_transform (rtx label, rtx count)
> +{
> +  int decrementnum;
> +  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
> +  rtx initial_compare;
> +  /* Doloop can only be done "elementwise" with predicated dlstp/letp
> +     when the iteration counter gets deprecated by the number of MVE

s/deprecated/decreased/ ? I think

> +     lanes.  This can be exxtracted from the `count`, which is the expression

s/exxtracted/extracted/

> +     used to calculate the number of iterations that the loop would execute
> +     for a standard dls/le loop.  Since we only support cases where this is a
> +     power of 2, we can assume that this expression arrives here as:
> +       (lshiftrt: (A) (const_int y))
> +     Then we can extract the decrementnum from y.  */
> +  if (GET_CODE (count) == LSHIFTRT && ARITHMETIC_P (XEXP (count, 0))
> +      && (decrementnum = (1 << (INTVAL (XEXP (count, 1)))))

Why are you calculating decrementnum inside the condition?

> +      /* There is one final condition that needs to be met for the loop to be
> +     transformable: dlstp/letp will continue looping until there are
> +     elements still to process.  This can only work if the looping ends
> +     when the element counter reaches zero and not some other value
> +     (e.g. n > 0 works, not n > 1), or we can incorrectly end up running
> +     one additional iteration.  To by-pass any hoisting that the compiler
> +     may have done with the `A` in `count` above, we can instead look up
> +     to the bb before the loop preheader: this should end with a cmp+jump
> +     pair, where the cmp needs to be with (const_int 0).  */

I'm wondering whether it would be possible to subtract a non-zero const
from count.  But that might be dangerous... Do you have an example/test
case where you saw this happen?

> +      && loop_preheader_edge (body->loop_father)->src->prev_bb
> +      && BB_END (loop_preheader_edge (body->loop_father)->src->prev_bb)
> +      && PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
> +                ->src->prev_bb))
> +      && INSN_P (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
> +                    ->src->prev_bb)))
> +      && (initial_compare
> +      = PATTERN (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
> +                        ->src->prev_bb))))
> +      && GET_CODE (initial_compare) == SET
> +      && cc_register (XEXP (initial_compare, 0), VOIDmode)
> +      && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
> +      && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
> +      && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0)
> +    {
> +      /* Find the vctp predicate generation inside the loop body BB.  */
> +      rtx_insn *vctp_insn = arm_mve_get_loop_unique_vctp (body);
> +
> +      /* If we have successfully found exactly one vctp predicate-generating
> +     instruction within the loop and the number by which we deprecate the
> +     loop counter in each iteration matches the number of lanes of the
> +     vctp instruction, we can attempt to turn this into a dlstp/letp
> +     loop.  */
> +      if (vctp_insn
> +      && decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
> +    {

I would exit early here, so you don't need to indent the rest of the
code.  By that I mean something like:

  if (!vctp_insn
      || decrementnum != ...)
    return GEN_INT (1);

  ... rest of the code.

> +      rtx_insn *insn = 0;
> +      rtx_insn *cur_insn = 0;
> +      rtx_insn *seq;
> +      rtx vctp_vpr_generated = NULL_RTX;
> +      rtx insn_vpr_reg_operand = NULL_RTX;
> +      bool transform_worked = true;

Won't need transform_worked.

> +      int new_icode;
> +
> +      /* Scan through the insns in the loop bb and emit the transformed bb
> +         insns to a sequence.  */
> +      start_sequence ();
> +      FOR_BB_INSNS (body, insn)
> +        {
> +          if (INSN_P (insn))

This also captures DEBUG_INSNs, which means passing -g disables this
feature.

> +        {
> +          /* When we find the vctp instruction: This may be followed by
> +             a sign-extend insn to SImode.  If it is, then save the
> +             sign-extended REG into vctp_vpr_generated.  If there is no
> +             sign-extend, then store the raw output of the vctp.
> +             For any VPT-predicated instructions we need to ensure that
> +             the VPR they use is the same as the one given here and
> +             they often consume the output of a subreg of the SImode
> +             sign-extended VPR-reg.  As a result, comparing against the
> +             output of the sign-extend is more likely to succeed.
> +             This code also guarantees to us that the vctp comes before
> +             any instructions that use the VPR within the loop, for the
> +             dlstp/letp transform to succeed.  */
> +          if (insn == vctp_insn)
> +            {
> +              if (GET_CODE (XEXP (PATTERN (NEXT_INSN (insn)), 1))
> +                  == SIGN_EXTEND
> +              && GET_CODE (XEXP (
> +                 PATTERN (NEXT_INSN (NEXT_INSN (insn))), 1))
> +                 == SUBREG)
> +            vctp_vpr_generated
> +                = XEXP (PATTERN (NEXT_INSN (NEXT_INSN (insn))), 0);
> +              else
> +            vctp_vpr_generated = XEXP (PATTERN (insn), 0);
> +              /* Also emit a USE of the source register of the vctp.
> +             This holds the number of elements being processed
> +             by the loop.  This later gets stored into `count`.  */
> +              emit_use (XVECEXP (XEXP (PATTERN (insn), 1), 0, 0));

What if we get here but don't end up creating a predicated do-loop?
Will this use break something?

> +              continue;
> +            }
> +          /* If the insn pattern requires the use of the VPR, then it

Missing an is.

> +             a VPT-predicated instruction, so it will need to be
> +             transformed into the non-predicated version of the
> +             instruction.  */

But this comment seems misplaced here.

> +          else if ((insn_vpr_reg_operand
> +                = arm_get_required_vpr_reg (insn))
> +               != NULL_RTX)
> +            {
> +              /* If the VPR value is different to the one generated by
> +             the vctp, then fail the conversion.  */
> +              if (!rtx_equal_p (vctp_vpr_generated,
> +                    insn_vpr_reg_operand))
> +            {
> +              transform_worked = false;
> +              break;

return GEN_INT (1);

> +            }
> +              /* Also ensure that it's a valid recog-ed instruction with
> +             the mve_unpredicated_insn attribute.  */
> +              else if (recog_memoized (insn) >= 0
> +                   && (new_icode
> +                   = get_attr_mve_unpredicated_insn (insn)))
> +            {
> +              extract_insn (insn);
> +              rtx arr[8];
> +              int j = 0;
> +
> +              /* When transforming a VPT-predicated instruction
> +                 into its unpredicated equivalent we need to drop
> +                 the VPR operand and we may need to also drop a
> +                 merge "vuninit" input operand, depending on the
> +                 instruction pattern.  Here ensure that we have at
> +                 most a two-operand difference between the two
> +                 instructions.  */
> +              int n_operands_diff
> +                  = recog_data.n_operands
> +                - insn_data[new_icode].n_operands;
> +              gcc_assert (n_operands_diff > 0
> +                      && n_operands_diff <= 2);
> +
> +              /* Then, loop through the operands of the predicated
> +                 instruction, and retain the ones that map to the
> +                 unpredicated instruction.  */
> +              for (int i = 0; i < recog_data.n_operands; i++)
> +                {
> +                  /* Ignore the VPR and, if needed, the vuninit
> +                 operand.  */
> +                  if (insn_vpr_reg_operand == recog_data.operand[i]
> +                  || (n_operands_diff == 2
> +                      && !strcmp (recog_data.constraints[i],
> +                          "0")))
> +                continue;
> +                  else
> +                {
> +                  arr[j] = recog_data.operand[i];
> +                  j++;
> +                }
> +                }
> +
> +              /* Finally, emit the unpredicated instruction.  */
> +              switch (j)
> +                {
> +                  case 2:
> +                emit_insn (GEN_FCN (new_icode) (arr[0],
> +                                arr[1]));
> +                break;
> +                  case 3:
> +                emit_insn (GEN_FCN (new_icode) (arr[0], arr[1],
> +                                arr[2]));
> +                break;
> +                  default:
> +                gcc_unreachable ();
> +                }
> +            }
> +              /* If we can't identify the INSN as being either
> +             for deletion or to re-map, then we don't know how to
> +             handle it, so fail the whole conversion.  */
> +              else
> +            {
> +              transform_worked = false;
> +              break;

Use return GEN_INT (1); here as well.

> +            }
> +            }
> +          /* Instructions that don't require the VPR can be carried
> +             over as-is.  */
> +          else
> +            emit_insn (PATTERN (insn));
> +        }
> +        }
> +      seq = get_insns ();
> +      end_sequence ();
> +
> +      if (transform_worked)
> +        {

No need to check this, you can only get here if it worked.

> +          /* Re-write the entire BB contents with the transformed
> +         sequence.  */
> +          FOR_BB_INSNS_SAFE (body, insn, cur_insn)
> +        if (INSN_P (insn))
> +          delete_insn (insn);

This will also delete DEBUG_INSNs!  You'd probably want to delete only
insns for which NONDEBUG_INSN_P (insn) is true.

I'm not an expert in how DEBUG_INSNs work, but I suspect their order
relative to the non-debug insns is likely to be important, so really
you'd want to change how you 'transform' the BB and do inline insn
replacement.
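
To illustrate what I mean by inline insn replacement, something along
these lines (completely untested, just to show the shape;
arm_mve_build_unpredicated here is a made-up stand-in for the
operand-remapping logic you already have above):

  /* Walk the BB once and swap each VPT-predicated insn for its
     unpredicated form in place, leaving DEBUG_INSNs where they are so
     their position relative to the real insns is preserved.  */
  rtx_insn *insn, *cur_insn;
  FOR_BB_INSNS_SAFE (body, insn, cur_insn)
    {
      if (!NONDEBUG_INSN_P (insn))
        continue;  /* Keep debug insns untouched.  */
      rtx unpred_pat = arm_mve_build_unpredicated (insn); /* made up */
      if (unpred_pat)
        {
          emit_insn_after (unpred_pat, insn);
          delete_insn (insn);  /* Only the old real insn goes away.  */
        }
    }

That would also let you drop the start_sequence/end_sequence dance.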
> +          for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
> +        emit_insn_after (PATTERN (insn), BB_END (body));
> +          emit_jump_insn_after (PATTERN (insn), BB_END (body));
> +          return GEN_INT (decrementnum);
> +        }
> +    }
> +    }
> +  /* Bail out: we can't use dlstp/letp, so return 1 to allow loop-doloop to try
> +     the standard dls/le pair.  */
> +  return GEN_INT (1);
> +}

Only reviewed until here, will look at the rest later.
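
P.S. To spell out the early-exit shape I suggested above a bit more
concretely, I had something like this in mind (untested, purely to show
the structure):

  /* Right after looking for the vctp: bail out once, up front, instead
     of carrying a transform_worked flag around.  */
  rtx_insn *vctp_insn = arm_mve_get_loop_unique_vctp (body);
  if (!vctp_insn
      || decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
    /* No unique vctp, or a lane-count mismatch: return 1 so that
       loop-doloop falls back to a standard dls/le loop.  */
    return GEN_INT (1);

and then a plain return GEN_INT (1); at the other failure points,
instead of the transform_worked = false; break; pairs.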