On 23/06/2023 17:25, Andre Vieira (lists) wrote: > +  /* In order to find out if the loop is of type A or B above look > for the > +     loop counter: it will either be incrementing by one per > iteration or > +     it will be decrementing by num_of_lanes.  We can find the loop > counter > +     in the condition at the end of the loop.  */ > +  rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body)); > +  gcc_assert (cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode) > +          && GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE); > > Not sure this should be an assert. If we do encounter a differently > formed loop, we should bail out of DLSTPing for now but we shouldn't ICE. Hmm, good point. A lot of these checks I actually did have in as asserts, so that then I could try and answer the question "does this ever happen in all currently known dlstp/letp-compatible codebases" and if they are still asserts it means that the answer was "No". I agree, though, for future safety it's best to make this a bail-out instead. Done > > > +  /* The loop latch has to be empty.  When compiling all the known > MVE LoLs in > +     user applications, none of those with incrementing counters had > any real > +     insns in the loop latch.  As such, this function has only been > tested with > +     an empty latch and may misbehave or ICE if we somehow get here > with an > +     increment in the latch, so, for sanity, error out early.  */ > +  rtx_insn *dec_insn = BB_END (body->loop_father->latch); > +  if (NONDEBUG_INSN_P (dec_insn)) > +    gcc_unreachable (); > > Similarly here I'd return false rather than gcc_unreachable (); Same as above Done > > > +  /* Find where both of those are modified in the loop body bb. */ > +  rtx condcount_reg_set = PATTERN (DF_REF_INSN > (df_bb_regno_only_def_find > +                         (body, REGNO (condcount)))); > Put = on newline, breaks it down nicer. Done > > +      counter_orig_set = XEXP (PATTERN > +                    (DF_REF_INSN > +                      (DF_REF_NEXT_REG > +                    (DF_REG_DEF_CHAIN > +                     (REGNO > +                       (XEXP (condcount_reg_set, 0)))))), 1); > > This makes me a bit nervous, can we be certain that the PATTERN of the > next insn that sets it is indeed a set. Heck can we even be sure > DF_REG_DEF_CHAIN returns a non-null, I can't imagine why not but maybe > there are some constructs it can't follow-up on? Might just be worth > checking these steps and bailing out. > Hmm, yes. The logic of it is: * So  condcount_reg_set is a set to a REG. Get that REGNO and fetch the DEF_CHAIN... * Then the "DF_REF_NEXT_REG" is the last df_ref in the chain that did a DEF to that REG * Then get the INSN from that (DF_REF_INSN) --- it should be a SET * Then get the PATTERN * Then XEXP (.., 1) on the PATTERN will get the rhs of the SET So the outcome is finding the previous value that was last set to that REGNO, regardless of it being in the same BB or not But indeed it is unsafe that way. Done some breakdown and some checking :) > > > +      /* When we find the vctp instruction: This may be followed by > +      a zero-extend insn to SImode.  If it is, then save the > +      zero-extended REG into vctp_vpr_generated.  If there is no > +      zero-extend, then store the raw output of the vctp. > +      For any VPT-predicated instructions we need to ensure that > +      the VPR they use is the same as the one given here and > +      they often consume the output of a subreg of the SImode > +      zero-extended VPR-reg.  As a result, comparing against the > +      output of the zero-extend is more likely to succeed. > +      This code also guarantees to us that the vctp comes before > +      any instructions that use the VPR within the loop, for the > +      dlstp/letp transform to succeed.  */ > > Wrong comment indent after first line. Done (but shifted this comment to a new location, because i split the logic searching for vctp_vpr_generated into a new function to be called separately from the FOR_BB_INSNS loop (this guarantees that we always have a value in `vctp_vpr_generated` when arm_mve_check_df_chain_back_for_implic_predic is called > > +  rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body); > +  if (!vctp_insn || !arm_mve_loop_valid_for_dlstp (body)) > +    return GEN_INT (1); > > arm_mve_loop_valid_for_dlstp already calls arm_mve_get_loop_vctp, > maybe have 'arm_mve_loop_valid_for_dlstp' return vctp_insn or null to > determine success or failure, avoids looping through the BB again. > > For the same reason I'd also pass vctp_insn down to > 'arm_mve_check_df_chain_back_for_implic_predic'. Done > > +      if (GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND) > +        { > +          rtx_insn *next_use2 = NULL; > > Are we sure single_set can never return 0 here? Maybe worth an extra > check and bail out if it does? True... I could imagine the VPR being used in e.g. an `adds` instruction that is a PARALLEL SET to the output and to CC_REG: that would ICE here Added an additional check for this. > > +       /* If the insn pattern requires the use of the VPR value from the > +      vctp as an input parameter.  */ > s/an an input parameter./as an input parameter for predication./ Done > > +      /* None of registers USE-d by the instruction need can be the VPR > +         vctp_vpr_generated.  This blocks the optimisation if there any > +         instructions that use the optimised-out VPR value in any way > +         other than as a VPT block predicate.  */ > > Reword this slightly to be less complex: > This instruction USE-s the vctp_vpr_generated other than for > predication, this blocks the transformation as we are not allowed to > optimise the VPR value away. Done Note a couple of "extra" changes in this revision: * As we discussed offline: Using "lctp" after the loop, instead of the "mrs,orr,and,msr" series when the branch is too far for "letp" * Added "simple_loop_desc (body->loop_father)->noloop_assumptions = NULL_RTX" when the transform succeeds. This is as a safety to stop the middle-end from ever falling down that constant-number-of-terations logic (I never actually saw this happen, but it seemed reasonable to block this) * Split out "arm_mve_get_vctp_vec_form" into a new function: this just helps with the read-ability of "arm_attempt_dlstp_transform" I look forward to getting more comments :D Cheers, Stam > > Will continue reviewing next week :) > > On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote: >>      Hi all, >> >>      This is the 2/2 patch that contains the functional changes needed >>      for MVE Tail Predicated Low Overhead Loops.  See my previous email >>      for a general introduction of MVE LOLs. >> >>      This support is added through the already existing loop-doloop >>      mechanisms that are used for non-MVE dls/le looping. >> >>      Mid-end changes are: >> >>      1) Relax the loop-doloop mechanism in the mid-end to allow for >>         decrement numbers other that -1 and for `count` to be an >>         rtx containing a simple REG (which in this case will contain >>         the number of elements to be processed), rather >>         than an expression for calculating the number of iterations. >>      2) Added a new df utility function: `df_bb_regno_only_def_find` >> that >>         will return the DEF of a REG only if it is DEF-ed once within >> the >>         basic block. >> >>      And many things in the backend to implement the above optimisation: >> >>      3)  Implement the `arm_predict_doloop_p` target hook to instruct >> the >>          mid-end about Low Overhead Loops (MVE or not), as well as >>          `arm_loop_unroll_adjust` which will prevent unrolling of any >> loops >>          that are valid for becoming MVE Tail_Predicated Low Overhead >> Loops >>          (unrolling can transform a loop in ways that invalidate the >> dlstp/ >>          letp tranformation logic and the benefit of the dlstp/letp loop >>          would be considerably higher than that of unrolling) >>      4)  Appropriate changes to the define_expand of doloop_end, new >>          patterns for dlstp and letp, new iterators,  unspecs, etc. >>      5) `arm_mve_loop_valid_for_dlstp` and a number of checking >> functions: >>         * `arm_mve_dlstp_check_dec_counter` >>         * `arm_mve_dlstp_check_inc_counter` >>         * `arm_mve_check_reg_origin_is_num_elems` >>         * `arm_mve_check_df_chain_back_for_implic_predic` >>         * `arm_mve_check_df_chain_fwd_for_implic_predic_impact` >>         This all, in smoe way or another, are running checks on the loop >>         structure in order to determine if the loop is valid for >> dlstp/letp >>         transformation. >>      6) `arm_attempt_dlstp_transform`: (called from the define_expand of >>          doloop_end) this function re-checks for the loop's >> suitability for >>          dlstp/letp transformation and then implements it, if possible. >>      7) Various utility functions: >>         *`arm_mve_get_vctp_lanes` to map >>         from vctp unspecs to number of lanes, and >> `arm_get_required_vpr_reg` >>         to check an insn to see if it requires the VPR or not. >>         * `arm_mve_get_loop_vctp` >>         * `arm_mve_get_vctp_lanes` >>         * `arm_emit_mve_unpredicated_insn_to_seq` >>         * `arm_get_required_vpr_reg` >>         * `arm_get_required_vpr_reg_param` >>         * `arm_get_required_vpr_reg_ret_val` >>         * `arm_mve_vec_insn_is_predicated_with_this_predicate` >>         * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate` >> >>      No regressions on arm-none-eabi with various targets and on >>      aarch64-none-elf. Thoughts on getting this into trunk? >> >>      Thank you, >>      Stam Markianos-Wright >> >>      gcc/ChangeLog: >> >>              * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): >> Rename to... >>              (arm_target_bb_ok_for_lob): ...this >>              (arm_attempt_dlstp_transform): New. >>              * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New. >>              (TARGET_PREDICT_DOLOOP_P): New. >>              (arm_block_set_vect): >>              (arm_target_insn_ok_for_lob): Rename from >> arm_target_insn_ok_for_lob. >>              (arm_target_bb_ok_for_lob): New. >>              (arm_mve_get_vctp_lanes): New. >>              (arm_get_required_vpr_reg): New. >>              (arm_get_required_vpr_reg_param): New. >>              (arm_get_required_vpr_reg_ret_val): New. >>              (arm_mve_get_loop_vctp): New. >> (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New. >> (arm_mve_vec_insn_is_predicated_with_this_predicate): New. >>              (arm_mve_check_df_chain_back_for_implic_predic): New. >> (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New. >>              (arm_mve_check_reg_origin_is_num_elems): New. >>              (arm_mve_dlstp_check_inc_counter): New. >>              (arm_mve_dlstp_check_dec_counter): New. >>              (arm_mve_loop_valid_for_dlstp): New. >>              (arm_predict_doloop_p): New. >>              (arm_loop_unroll_adjust): New. >>              (arm_emit_mve_unpredicated_insn_to_seq): New. >>              (arm_attempt_dlstp_transform): New. >>              * config/arm/iterators.md (DLSTP): New. >>              (mode1): Add DLSTP mappings. >>              * config/arm/mve.md (*predicated_doloop_end_internal): New. >>              (dlstp_insn): New. >>              * config/arm/thumb2.md (doloop_end): Update for MVE LOLs. >>              * config/arm/unspecs.md: New unspecs. >>              * df-core.cc (df_bb_regno_only_def_find): New. >>              * df.h (df_bb_regno_only_def_find): New. >>              * loop-doloop.cc (doloop_condition_get): Relax conditions. >>              (doloop_optimize): Add support for elementwise LoLs. >> >>      gcc/testsuite/ChangeLog: >> >>              * gcc.target/arm/lob.h: Update framework. >>              * gcc.target/arm/lob1.c: Likewise. >>              * gcc.target/arm/lob6.c: Likewise. >>              * gcc.target/arm/mve/dlstp-compile-asm.c: New test. >>              * gcc.target/arm/mve/dlstp-int16x8.c: New test. >>              * gcc.target/arm/mve/dlstp-int32x4.c: New test. >>              * gcc.target/arm/mve/dlstp-int64x2.c: New test. >>              * gcc.target/arm/mve/dlstp-int8x16.c: New test. >>              * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.