On 11/15/22 15:51, Andre Vieira (lists) wrote:
>
> On 11/11/2022 17:40, Stam Markianos-Wright via Gcc-patches wrote:
>> Hi all,
>>
>> This is the 2/2 patch that contains the functional changes needed
>> for MVE Tail Predicated Low Overhead Loops.  See my previous email
>> for a general introduction of MVE LOLs.
>>
>> This support is added through the already existing loop-doloop
>> mechanisms that are used for non-MVE dls/le looping.
>>
>> The changes are:
>>
>> 1) Relax the loop-doloop mechanism in the mid-end to allow for
>>    decrement numbers other than -1 and for `count` to be an
>>    rtx containing the number of elements to be processed, rather
>>    than an expression for calculating the number of iterations.
>> 2) Add an `allow_elementwise_doloop` target hook.  This allows the
>>    target backend to manipulate the iteration count as it needs:
>>    in our case to change it from a pre-calculation of the number
>>    of iterations to the number of elements to be processed.
>> 3) The doloop_end target-insn now takes an additional parameter:
>>    the `count` (note: this is before it gets modified to just be
>>    the number of elements), so that the decrement value can be
>>    extracted from that parameter.
>>
>> And many things in the backend to implement the above optimisation
>> (a before/after sketch follows the list):
>>
>> 4) Appropriate changes to the define_expand of doloop_end and new
>>    patterns for dlstp and letp.
>> 5) `arm_attempt_dlstp_transform`: (called from the define_expand of
>>    doloop_end) this function checks the loop's suitability for the
>>    dlstp/letp transformation and implements it, if possible.
>> 6) `arm_mve_get_loop_unique_vctp`: a function that loops through
>>    the loop contents and returns the VPR-generating vctp operation
>>    within the loop, provided there is exactly one.
>> 7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map
>>    from vctp unspecs to the number of lanes, and
>>    `arm_get_required_vpr_reg` to check whether an insn requires
>>    the VPR or not.
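>> As a rough before/after sketch (my own simplified example, with
>> hypothetical register allocation and labels -- the new dlstp tests
>> are the real reference), take a predicated intrinsics loop:
>>
>>     #include <arm_mve.h>
>>
>>     void
>>     vadd8 (uint8_t *a, uint8_t *b, uint8_t *c, int n)
>>     {
>>       while (n > 0)
>>         {
>>           mve_pred16_t p = vctp8q (n);  /* predicate the n remaining lanes */
>>           uint8x16_t va = vldrbq_z_u8 (a, p);
>>           uint8x16_t vb = vldrbq_z_u8 (b, p);
>>           vstrbq_p_u8 (c, vaddq_x_u8 (va, vb, p), p);
>>           n -= 16; a += 16; b += 16; c += 16;
>>         }
>>     }
>>
>> Today this becomes a dls/le loop that keeps the explicit vctp/VPT
>> predication in the body; with this patch it can instead become:
>>
>>             dlstp.8  lr, r3        @ r3 holds n, the element count
>>     .L1:
>>             vldrb.8  q0, [r0], #16
>>             vldrb.8  q1, [r1], #16
>>             vadd.i8  q0, q0, q1
>>             vstrb.8  q0, [r2], #16
>>             letp     lr, .L1       @ implicitly predicates the tail iteration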
>>
>> No regressions on arm-none-eabi with various targets and on
>> aarch64-none-elf.  Thoughts on getting this into trunk?
>>
>> Thank you,
>> Stam Markianos-Wright
>>
>> gcc/ChangeLog:
>>
>>         * config/aarch64/aarch64.md: Add extra doloop_end arg.
>>         * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
>>         * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
>>         (arm_mve_get_vctp_lanes): New.
>>         (arm_get_required_vpr_reg): New.
>>         (arm_mve_get_loop_unique_vctp): New.
>>         (arm_attempt_dlstp_transform): New.
>>         (arm_allow_elementwise_doloop): New.
>>         * config/arm/iterators.md:
>>         * config/arm/mve.md (*predicated_doloop_end_internal): New.
>>         (dlstp_insn): New.
>>         * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
>>         * config/arm/unspecs.md: New unspecs.
>>         * config/ia64/ia64.md: Add extra doloop_end arg.
>>         * config/pru/pru.md: Add extra doloop_end arg.
>>         * config/rs6000/rs6000.md: Add extra doloop_end arg.
>>         * config/s390/s390.md: Add extra doloop_end arg.
>>         * config/v850/v850.md: Add extra doloop_end arg.
>>         * doc/tm.texi: Document new hook.
>>         * doc/tm.texi.in: Likewise.
>>         * loop-doloop.cc (doloop_condition_get): Relax conditions.
>>         (doloop_optimize): Add support for elementwise LoLs.
>>         * target-insns.def (doloop_end): Add extra arg.
>>         * target.def (allow_elementwise_doloop): New hook.
>>         * targhooks.cc (default_allow_elementwise_doloop): New.
>>         * targhooks.h (default_allow_elementwise_doloop): New.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/arm/lob.h: Update framework.
>>         * gcc.target/arm/lob1.c: Likewise.
>>         * gcc.target/arm/lob6.c: Likewise.
>>         * gcc.target/arm/dlstp-int16x8.c: New test.
>>         * gcc.target/arm/dlstp-int32x4.c: New test.
>>         * gcc.target/arm/dlstp-int64x2.c: New test.
>>         * gcc.target/arm/dlstp-int8x16.c: New test.
>>
>>
>> ### Inline copy of patch ###
>>
>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>> index f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 100644
>> --- a/gcc/config/aarch64/aarch64.md
>> +++ b/gcc/config/aarch64/aarch64.md
>> @@ -7366,7 +7366,8 @@
>>  ;; knows what to generate.
>>  (define_expand "doloop_end"
>>    [(use (match_operand 0 "" ""))      ; loop pseudo
>> -   (use (match_operand 1 "" ""))]     ; label
>> +   (use (match_operand 1 "" ""))      ; label
>> +   (use (match_operand 2 "" ""))]     ; decrement constant
>>    "optimize > 0 && flag_modulo_sched"
>>  {
>>    rtx s0;
>> diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
>> index 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 100644
>> --- a/gcc/config/arm/arm-protos.h
>> +++ b/gcc/config/arm/arm-protos.h
>> @@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
>>  extern bool arm_q_bit_access (void);
>>  extern bool arm_ge_bits_access (void);
>>  extern bool arm_target_insn_ok_for_lob (rtx);
>> -
>> +extern rtx arm_attempt_dlstp_transform (rtx, rtx);
>>  #ifdef RTX_CODE
>>  enum reg_class
>>  arm_mode_base_reg_class (machine_mode);
>> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
>> index ee8f1babf8a1319e77e0db0fa55851c038048804..99e144d52c26597c64b982b3d4ae9a62a114cf18 100644
>> --- a/gcc/config/arm/arm.cc
>> +++ b/gcc/config/arm/arm.cc
>> @@ -470,6 +470,9 @@ static const struct attribute_spec arm_attribute_table[] =
>>  #undef TARGET_SCHED_REORDER
>>  #define TARGET_SCHED_REORDER arm_sched_reorder
>>
>> +#undef TARGET_ALLOW_ELEMENTWISE_DOLOOP
>> +#define TARGET_ALLOW_ELEMENTWISE_DOLOOP arm_allow_elementwise_doloop
>> +

> Just a nit, but maybe keep the same naming scheme as the existing hook:
> TARGET_CAN_USE_ELEMENTWISE_DOLOOP_P?

Done.  Also: thanks for taking the time to review!  I've attached a
second draft to this email.  Let me know of any further comments when
you get to the second half of the patch.

>> +
>> +static rtx
>> +arm_get_required_vpr_reg (rtx_insn *insn)
>> +{
>> +  bool requires_vpr;
>> +
>> +  extract_constrain_insn (insn);
>> +  int n_operands = recog_data.n_operands;
>> +  if (recog_data.n_alternatives == 0)
>> +    return NULL_RTX;
>> +
>> +  /* Fill in recog_op_alt with information about the constraints of
>> +     this insn.  */
>> +  preprocess_constraints (insn);
>> +
>> +  for (int use = 0; use < n_operands; use++)
>> +    {
>> +      requires_vpr = true;
>> +      /* Iterate through alternatives of operand "use" in recog_op_alt and
>> +       * identify if the operand is required to be the VPR.  */

> Remove the * at the start of the new line.

Done.
>> +      for (int alt1 = 0; alt1 < recog_data.n_alternatives; alt1++)
>> +        {
>> +          const operand_alternative *op_alt1
>> +              = &recog_op_alt[alt1 * n_operands];
>> +          /* Fetch the reg_class for each entry and check it against the
>> +             VPR_REG reg_class.  */
>> +          if (alternative_class (op_alt1, use) != VPR_REG)
>> +            requires_vpr = false;
>> +        }
>> +      /* If all alternatives of the insn require the VPR reg for this
>> +         operand, it means that either this is a VPR-generating
>> +         instruction, like a vctp, vcmp, etc., or it is a VPT-predicated
>> +         instruction.  Return the subrtx of the VPR reg operand.  */
>> +      if (requires_vpr)
>> +        return recog_data.operand[use];
>> +    }
>> +  return NULL_RTX;
>> +}
>> +
>> +/* Scan the basic block of a loop body for a vctp instruction.  If there is
>> +   exactly one unique vctp instruction, return its rtx_insn *.  */
>> +
>> +static rtx_insn *
>> +arm_mve_get_loop_unique_vctp (basic_block bb)
>> +{
>> +  rtx_insn *insn = BB_HEAD (bb);
>> +  rtx_insn *vctp_op = NULL;
>> +
>> +  /* Now scan through all the instruction patterns and
>> +     pick out any MVE instructions.  */
>> +  FOR_BB_INSNS (bb, insn)
>> +    {
>> +      if (INSN_P (insn))
>> +        {
>> +          /* First check if this is a vctp instruction.  There needs to be
>> +             exactly one vctp instruction within the loop.  */
>> +          if (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
>> +            {
>> +              /* If we already found one vctp instruction, then the
>> +                 loop is not consistent internally.  */
>> +              if (vctp_op)
>> +                return NULL;
>> +
>> +              vctp_op = insn;
>> +            }
>> +        }
>> +    }
>> +  return vctp_op;
>> +}
>> +
>> +rtx
>> +arm_attempt_dlstp_transform (rtx label, rtx count)
>> +{
>> +  int decrementnum;
>> +  basic_block body = BLOCK_FOR_INSN (label)->prev_bb;
>> +  rtx initial_compare;
>> +  /* Doloop can only be done "elementwise" with predicated dlstp/letp
>> +     when the iteration counter gets deprecated by the number of MVE

> s/deprecated/decreased/ ?

I think "decremented" was what I was going for, so changed it to that :)

>> +     lanes.  This can be exxtracted from the `count`, which is the expression

> s/exxtracted/extracted/

Done.

>> +     used to calculate the number of iterations that the loop would execute
>> +     for a standard dls/le loop.  Since we only support cases where this is a
>> +     power of 2, we can assume that this expression arrives here as:
>> +       (lshiftrt: (A) (const_int y))
>> +     Then we can extract the decrementnum from y.  */
>> +  if (GET_CODE (count) == LSHIFTRT && ARITHMETIC_P (XEXP (count, 0))
>> +      && (decrementnum = (1 << (INTVAL (XEXP (count, 1)))))

> Why are you calculating decrementnum inside the condition?

Ahh, ok, so this is an artifact from a previous revision, where I also
included the
`decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn))`
within that condition, but since that is now done later, I can move
this down.
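For concreteness, here's my own illustration of how decrementnum falls
out of `count` (the exact shape of `A` is an assumption on my part; it
depends on what the midend emitted):

    /* Illustration only, not patch code: a 16-lane (vctp8q) loop over n
       elements.  The dls/le iteration count is ceil (n / 16), arriving
       as RTL of the shape
         (lshiftrt (plus (reg n) (const_int 15)) (const_int 4)).  */
    int n = 18;                        /* elements to process */
    int y = 4;                         /* from (const_int y) */
    int iterations = (n + 15) >> y;    /* dls/le would run 2 iterations */
    int decrementnum = 1 << y;         /* 16; must match the vctp8q lanes */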
>> +      /* There is one final condition that needs to be met for the loop
>> +         to be transformable: dlstp/letp will continue looping while
>> +         there are still elements to process.  This can only work if the
>> +         looping ends when the element counter reaches zero and not some
>> +         other value (e.g. n > 0 works, not n > 1), or we can incorrectly
>> +         end up running one additional iteration.  To by-pass any
>> +         hoisting that the compiler may have done with the `A` in `count`
>> +         above, we can instead look up to the bb before the loop
>> +         preheader: this should end with a cmp+jump pair, where the cmp
>> +         needs to be with (const_int 0).  */

> I'm wondering whether it would be possible to subtract a non-zero
> const from count.  But that might be dangerous...
>
> Do you have an example/test case where you saw this happen?

So this can happen in situations like:

    while (n > 1)
    {
        mve_pred16_t p = vctp8q (n);
        uint8x16_t va = vldrbq_z_u8 (a, p);
        uint8x16_t vb = vldrbq_z_u8 (b, p);
        uint8x16_t vc = vaddq_x_u8 (va, vb, p);
        vstrbq_p_u8 (c, vc, p);
        n -= 16;
        a += 16;
        b += 16;
        c += 16;
    }

IIUC you mean that we'd end up with something like:

        sub r3, r3, #1
        dlstp.8 lr, r3

to enable loops that aren't terminating at zero.  I believe the answer
is "No", because even though we would execute the correct number of
iterations, we would use an incorrect predicate value.  E.g. if
n == r3 == 18, we'd subtract a const 1, giving r3 == 17:

    Iteration 1: use all 16 lanes: correct.
    Iteration 2: use 1 lane: incorrect -- from the user code it looks
    like we should use 2.

Because the number of iterations and the predicate value here are tied,
I think we can only safely transform loops that terminate with a `> 0`
condition.  On standard dls/le loops we do already subtract the
non-zero const.

>> +      && loop_preheader_edge (body->loop_father)->src->prev_bb
>> +      && BB_END (loop_preheader_edge (body->loop_father)->src->prev_bb)
>> +      && PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
>> +                            ->src->prev_bb))
>> +      && INSN_P (PREV_INSN (BB_END (loop_preheader_edge (body->loop_father)
>> +                                    ->src->prev_bb)))
>> +      && (initial_compare
>> +            = PATTERN (PREV_INSN (BB_END (loop_preheader_edge
>> +                                            (body->loop_father)
>> +                                          ->src->prev_bb))))
>> +      && GET_CODE (initial_compare) == SET
>> +      && cc_register (XEXP (initial_compare, 0), VOIDmode)
>> +      && GET_CODE (XEXP (initial_compare, 1)) == COMPARE
>> +      && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1))
>> +      && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0)
>> +    {
>> +      /* Find the vctp predicate generation inside the loop body BB.  */
>> +      rtx_insn *vctp_insn = arm_mve_get_loop_unique_vctp (body);
>> +
>> +      /* If we have successfully found exactly one vctp
>> +         predicate-generating instruction within the loop and the number
>> +         by which we deprecate the loop counter in each iteration matches
>> +         the number of lanes of the vctp instruction, we can attempt to
>> +         turn this into a dlstp/letp loop.  */
>> +      if (vctp_insn
>> +          && decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))
>> +        {

> I would exit early here, so you don't need to indent the rest of the
> code, by that I mean something like:
>
> if (!vctp_insn
>     || decrementnum != ...)
>   return GEN_INT (1);
>
> ... rest of the code.

Done, thanks!

>> +          rtx_insn *insn = 0;
>> +          rtx_insn *cur_insn = 0;
>> +          rtx_insn *seq;
>> +          rtx vctp_vpr_generated = NULL_RTX;
>> +          rtx insn_vpr_reg_operand = NULL_RTX;
>> +          bool transform_worked = true;

> Won't need transform_worked.

Done.

>> +          int new_icode;
>> +
>> +          /* Scan through the insns in the loop bb and emit the
>> +             transformed bb insns to a sequence.  */
>> +          start_sequence ();
>> +          FOR_BB_INSNS (body, insn)
>> +            {
>> +              if (INSN_P (insn))

> This also captures DEBUG_INSNs, which means passing -g disables this
> feature.

Fixed: a description of this is further down.

>> +                {
>> +                  /* When we find the vctp instruction: This may be
>> +                     followed by a sign-extend insn to SImode.  If it is,
>> +                     then save the sign-extended REG into
>> +                     vctp_vpr_generated.  If there is no sign-extend,
>> +                     then store the raw output of the vctp.
>> +                     For any VPT-predicated instructions we need to ensure
>> +                     that the VPR they use is the same as the one given
>> +                     here and they often consume the output of a subreg of
>> +                     the SImode sign-extended VPR-reg.  As a result,
>> +                     comparing against the output of the sign-extend is
>> +                     more likely to succeed.  This code also guarantees
>> +                     that the vctp comes before any instructions that use
>> +                     the VPR within the loop, for the dlstp/letp transform
>> +                     to succeed.  */
>> +                  if (insn == vctp_insn)
>> +                    {
>> +                      if (GET_CODE (XEXP (PATTERN (NEXT_INSN (insn)), 1))
>> +                            == SIGN_EXTEND
>> +                          && GET_CODE (XEXP (
>> +                               PATTERN (NEXT_INSN (NEXT_INSN (insn))), 1))
>> +                               == SUBREG)
>> +                        vctp_vpr_generated
>> +                          = XEXP (PATTERN (NEXT_INSN (NEXT_INSN (insn))), 0);
>> +                      else
>> +                        vctp_vpr_generated = XEXP (PATTERN (insn), 0);
>> +                      /* Also emit a USE of the source register of the
>> +                         vctp.  This holds the number of elements being
>> +                         processed by the loop.  This later gets stored
>> +                         into `count`.  */
>> +                      emit_use (XVECEXP (XEXP (PATTERN (insn), 1), 0, 0));

> What if we get here but don't end up creating a predicated do-loop?
> Will this use break something?

Nope, it won't break anything, because of the use of
start_sequence/end_sequence.  The use and all the transform insns get
emitted to the sequence, and then only if we have fully succeeded do we
wipe all the contents of the `bb` and put in the sequence (incl. the
use).  The `use` gets deleted later in `arm_allow_elementwise_doloop_p`
and really it's just a dodgy way for me to preserve the REG used in the
vctp, so that later I can make it the `count`, i.e. the number of
elements to be processed by the loop (the `dlstp lr, REG` register).
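In other words, the overall pattern is (my sketch, simplified;
transform_one_insn and replace_bb_contents are hypothetical helpers
standing in for the real logic):

    /* Nothing is emitted into the function until the whole loop body
       has been successfully transformed, so a stray USE in a discarded
       sequence is harmless.  */
    static rtx
    attempt_transform_sketch (basic_block body)
    {
      rtx_insn *insn;
      start_sequence ();
      FOR_BB_INSNS (body, insn)
        if (!transform_one_insn (insn))
          {
            /* Throw away everything emitted so far, including the USE,
               and fall back to a standard dls/le loop.  */
            end_sequence ();
            return GEN_INT (1);
          }
      rtx_insn *seq = get_insns ();
      end_sequence ();
      /* Only now does the real basic block get touched.  */
      replace_bb_contents (body, seq);
      return GEN_INT (16);
    }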
>> +                      continue;
>> +                    }
>> +                  /* If the insn pattern requires the use of the VPR, then it

> Missing an is.

Done.

>> +                     a VPT-predicated instruction, so it will need to be
>> +                     transformed into the non-predicated version of the
>> +                     instruction.  */

> But this comment seems misplaced here.

Adjusted this :)

>> +                  else if ((insn_vpr_reg_operand
>> +                              = arm_get_required_vpr_reg (insn))
>> +                           != NULL_RTX)
>> +                    {
>> +                      /* If the VPR value is different to the one generated
>> +                         by the vctp, then fail the conversion.  */
>> +                      if (!rtx_equal_p (vctp_vpr_generated,
>> +                                        insn_vpr_reg_operand))
>> +                        {
>> +                          transform_worked = false;
>> +                          break;

> return GEN_INT (1);

Done -- but this also needed an end_sequence ();

>> +                        }
>> +                      /* Also ensure that it's a valid recog-ed instruction
>> +                         with the mve_unpredicated_insn attribute.  */
>> +                      else if (recog_memoized (insn) >= 0
>> +                               && (new_icode
>> +                                     = get_attr_mve_unpredicated_insn (insn)))
>> +                        {
>> +                          extract_insn (insn);
>> +                          rtx arr[8];
>> +                          int j = 0;
>> +
>> +                          /* When transforming a VPT-predicated instruction
>> +                             into its unpredicated equivalent we need to
>> +                             drop the VPR operand and we may need to also
>> +                             drop a merge "vuninit" input operand,
>> +                             depending on the instruction pattern.  Here
>> +                             ensure that we have at most a two-operand
>> +                             difference between the two instructions.  */
>> +                          int n_operands_diff
>> +                              = recog_data.n_operands
>> +                                - insn_data[new_icode].n_operands;
>> +                          gcc_assert (n_operands_diff > 0
>> +                                      && n_operands_diff <= 2);
>> +
>> +                          /* Then, loop through the operands of the
>> +                             predicated instruction, and retain the ones
>> +                             that map to the unpredicated instruction.  */
>> +                          for (int i = 0; i < recog_data.n_operands; i++)
>> +                            {
>> +                              /* Ignore the VPR and, if needed, the vuninit
>> +                                 operand.  */
>> +                              if (insn_vpr_reg_operand
>> +                                    == recog_data.operand[i]
>> +                                  || (n_operands_diff == 2
>> +                                      && !strcmp (recog_data.constraints[i],
>> +                                                  "0")))
>> +                                continue;
>> +                              else
>> +                                {
>> +                                  arr[j] = recog_data.operand[i];
>> +                                  j++;
>> +                                }
>> +                            }
>> +
>> +                          /* Finally, emit the unpredicated
>> +                             instruction.  */
>> +                          switch (j)
>> +                            {
>> +                              case 2:
>> +                                emit_insn (GEN_FCN (new_icode) (arr[0],
>> +                                                                arr[1]));
>> +                                break;
>> +                              case 3:
>> +                                emit_insn (GEN_FCN (new_icode) (arr[0],
>> +                                                                arr[1],
>> +                                                                arr[2]));
>> +                                break;
>> +                              default:
>> +                                gcc_unreachable ();
>> +                            }
>> +                        }
>> +                      /* If we can't identify the INSN as either being
>> +                         for deletion or to re-map, then we don't know how
>> +                         to handle it, so fail the whole conversion.  */
>> +                      else
>> +                        {
>> +                          transform_worked = false;
>> +                          break;

> Use
> return GEN_INT (1);

Done.

>> +                        }
>> +                    }
>> +                  /* Instructions that don't require the VPR can be
>> +                     carried over as-is.  */
>> +                  else
>> +                    emit_insn (PATTERN (insn));
>> +                }
>> +            }
>> +          seq = get_insns ();
>> +          end_sequence ();
>> +
>> +          if (transform_worked)
>> +            {

> No need to check this, you can only get here if it worked.

Done.
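As an aside, at the instruction level the re-mapping above amounts to
dropping the explicit predication, which the implicit tail predication
of the dlstp/letp loop makes redundant.  Roughly (hand-written asm for
illustration, not compiler output):

    @ VPT-predicated form inside a dls/le loop, with the VPR set up
    @ by a vctp:
            vpst
            vaddt.i8  q0, q1, q2

    @ Unpredicated equivalent emitted inside the dlstp/letp loop, where
    @ the loop itself masks the final iteration:
            vadd.i8   q0, q1, q2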
>> +              /* Re-write the entire BB contents with the transformed
>> +                 sequence.  */
>> +              FOR_BB_INSNS_SAFE (body, insn, cur_insn)
>> +                if (INSN_P (insn))
>> +                  delete_insn (insn);

> This will also delete DEBUG_INSNs!  You'd probably want to delete only
> NONDEBUG_INSN_P (insn).  I'm not an expert in how DEBUG_INSNs work, but
> I suspect their order compared to non-debug insns is likely to be
> important, so really you'd want to change how you 'transform' the BB
> and do inline insn replacement.

You are correct, thanks for spotting this!  So I do want to also delete
DEBUG_INSNs, because what I want to do is replace all the bb contents
with the previous sequence.  Instead, though, I need to make sure that
DEBUG_INSNs get put into the sequence correctly (which wasn't working).
I also found a similar situation with NOTEs (which are !INSN_P): if
there were any in the bb other than the NOTE_INSN_BASIC_BLOCK, they
would get filtered up to the start of the bb.  This should also be
fixed now.

>> +              for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
>> +                emit_insn_after (PATTERN (insn), BB_END (body));
>> +              emit_jump_insn_after (PATTERN (insn), BB_END (body));
>> +              return GEN_INT (decrementnum);
>> +            }
>> +        }
>> +    }
>> +  /* Bail out: we can't use dlstp/letp, so return 1 to allow loop-doloop
>> +     to try the standard dls/le pair.  */
>> +  return GEN_INT (1);
>> +}

> Only reviewed until here, will look at the rest later.