From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 8BE163858D20 for ; Fri, 23 Jun 2023 10:23:32 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 8BE163858D20 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 261E61042; Fri, 23 Jun 2023 03:24:16 -0700 (PDT) Received: from [10.57.76.213] (unknown [10.57.76.213]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 41A513F64C; Fri, 23 Jun 2023 03:23:31 -0700 (PDT) Message-ID: <7dbe3fde-5cdb-cd2d-cda4-16f02385906f@arm.com> Date: Fri, 23 Jun 2023 11:23:26 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0 Subject: Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops Content-Language: en-US To: Stamatis Markianos-Wright , "gcc-patches@gcc.gnu.org" Cc: Kyrylo Tkachov , Richard Earnshaw , ramana.gcc@gmail.com, "nickc@redhat.com" References: From: "Andre Vieira (lists)" In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-6.8 required=5.0 tests=BAYES_00,BODY_8BITS,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,NICE_REPLY_A,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: + if (insn != arm_mve_get_loop_vctp (body)) + { probably a good idea to invert the condition here and return false, helps reducing the indenting in this function. + /* Starting from the current insn, scan backwards through the insn + chain until BB_HEAD: "for each insn in the BB prior to the current". + */ There's a trailing whitespace after insn, but also I'd rewrite this bit. The "for each insn in the BB prior to the current" is superfluous and even confusing to me. How about: "Scan backwards from the current INSN through the instruction chain until the start of the basic block. " I find 'that previous insn' to be confusing as you don't mention any previous insn before. So how about something along the lines of: 'If a previous insn defines a register that INSN uses then return true if...' Do we need to check: 'insn != prev_insn' ? Any reason why you can't start the loop with: 'for (rtx_insn *prev_insn = PREV_INSN (insn);' Now I also found a case where things might go wrong in: + /* Look at all the DEFs of that previous insn: if one of them is on + the same REG as our current insn, then recurse in order to check + that insn's USEs. If any of these insns return true as + MVE_VPT_UNPREDICATED_INSN_Ps, then the whole chain is affected + by the change in behaviour from being placed in dlstp/letp loop. + */ + df_ref prev_insn_defs = NULL; + FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn) + { + if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs) + && insn != prev_insn + && body == BLOCK_FOR_INSN (prev_insn) + && !arm_mve_vec_insn_is_predicated_with_this_predicate + (insn, vctp_vpr_generated) + && arm_mve_check_df_chain_back_for_implic_predic + (prev_insn, vctp_vpr_generated)) + return true; + } The body == BLOCK_FOR_INSN (prev_insn) hinted me at it, if a def comes from outside of the BB (so outside of the loop's body) then its by definition unpredicated by vctp. I think you want to check that if prev_insn defines a register used by insn then return true if prev_insn isn't in the same BB or has a chain that is not predicated, i.e.: '!arm_mve_vec_insn_is_predicated_with_this_predicate (insn, vctp_vpr_generated) && arm_mve_check_df_chain_back_for_implic_predic prev_insn, vctp_vpr_generated))' you check body != BLOCK_FOR_INSN (prev_insn)' I also found some other issues, this currently loloops: uint16_t test (uint16_t *a, int n) { uint16_t res =0; while (n > 0) { mve_pred16_t p = vctp16q (n); uint16x8_t va = vldrhq_u16 (a); res = vaddvaq_u16 (res, va); res = vaddvaq_p_u16 (res, va, p); a += 8; n -= 8; } return res; } But it shouldn't, this is because there's a lack of handling of across vector instructions. Luckily in MVE all across vector instructions have the side-effect that they write to a scalar register, even the vshlcq instruction (it writes to a scalar carry output). Did this lead me to find an ICE with: uint16x8_t test (uint16_t *a, int n) { uint16x8_t res = vdupq_n_u16 (0); while (n > 0) { uint16_t carry = 0; mve_pred16_t p = vctp16q (n); uint16x8_t va = vldrhq_u16 (a); res = vshlcq_u16 (va, &carry, 1); res = vshlcq_m_u16 (res, &carry, 1 , p); a += 8; n -= 8; } return res; } This is because: + /* If the USE is outside the loop body bb, or it is inside, but + is an unpredicated store to memory. */ + if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn) + || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate + (next_use_insn, vctp_vpr_generated) + && mve_memory_operand + (SET_DEST (single_set (next_use_insn)), + GET_MODE (SET_DEST (single_set (next_use_insn)))))) + return true; Assumes single_set doesn't return 0. Let's deal with these issues and I'll continue to review. On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote: >     Hi all, > >     This is the 2/2 patch that contains the functional changes needed >     for MVE Tail Predicated Low Overhead Loops.  See my previous email >     for a general introduction of MVE LOLs. > >     This support is added through the already existing loop-doloop >     mechanisms that are used for non-MVE dls/le looping. > >     Mid-end changes are: > >     1) Relax the loop-doloop mechanism in the mid-end to allow for >        decrement numbers other that -1 and for `count` to be an >        rtx containing a simple REG (which in this case will contain >        the number of elements to be processed), rather >        than an expression for calculating the number of iterations. >     2) Added a new df utility function: `df_bb_regno_only_def_find` that >        will return the DEF of a REG only if it is DEF-ed once within the >        basic block. > >     And many things in the backend to implement the above optimisation: > >     3)  Implement the `arm_predict_doloop_p` target hook to instruct the >         mid-end about Low Overhead Loops (MVE or not), as well as >         `arm_loop_unroll_adjust` which will prevent unrolling of any loops >         that are valid for becoming MVE Tail_Predicated Low Overhead Loops >         (unrolling can transform a loop in ways that invalidate the dlstp/ >         letp tranformation logic and the benefit of the dlstp/letp loop >         would be considerably higher than that of unrolling) >     4)  Appropriate changes to the define_expand of doloop_end, new >         patterns for dlstp and letp, new iterators,  unspecs, etc. >     5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions: >        * `arm_mve_dlstp_check_dec_counter` >        * `arm_mve_dlstp_check_inc_counter` >        * `arm_mve_check_reg_origin_is_num_elems` >        * `arm_mve_check_df_chain_back_for_implic_predic` >        * `arm_mve_check_df_chain_fwd_for_implic_predic_impact` >        This all, in smoe way or another, are running checks on the loop >        structure in order to determine if the loop is valid for dlstp/letp >        transformation. >     6) `arm_attempt_dlstp_transform`: (called from the define_expand of >         doloop_end) this function re-checks for the loop's suitability for >         dlstp/letp transformation and then implements it, if possible. >     7) Various utility functions: >        *`arm_mve_get_vctp_lanes` to map >        from vctp unspecs to number of lanes, and > `arm_get_required_vpr_reg` >        to check an insn to see if it requires the VPR or not. >        * `arm_mve_get_loop_vctp` >        * `arm_mve_get_vctp_lanes` >        * `arm_emit_mve_unpredicated_insn_to_seq` >        * `arm_get_required_vpr_reg` >        * `arm_get_required_vpr_reg_param` >        * `arm_get_required_vpr_reg_ret_val` >        * `arm_mve_vec_insn_is_predicated_with_this_predicate` >        * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate` > >     No regressions on arm-none-eabi with various targets and on >     aarch64-none-elf. Thoughts on getting this into trunk? > >     Thank you, >     Stam Markianos-Wright > >     gcc/ChangeLog: > >             * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): > Rename to... >             (arm_target_bb_ok_for_lob): ...this >             (arm_attempt_dlstp_transform): New. >             * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New. >             (TARGET_PREDICT_DOLOOP_P): New. >             (arm_block_set_vect): >             (arm_target_insn_ok_for_lob): Rename from > arm_target_insn_ok_for_lob. >             (arm_target_bb_ok_for_lob): New. >             (arm_mve_get_vctp_lanes): New. >             (arm_get_required_vpr_reg): New. >             (arm_get_required_vpr_reg_param): New. >             (arm_get_required_vpr_reg_ret_val): New. >             (arm_mve_get_loop_vctp): New. > (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New. >             (arm_mve_vec_insn_is_predicated_with_this_predicate): New. >             (arm_mve_check_df_chain_back_for_implic_predic): New. >             (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New. >             (arm_mve_check_reg_origin_is_num_elems): New. >             (arm_mve_dlstp_check_inc_counter): New. >             (arm_mve_dlstp_check_dec_counter): New. >             (arm_mve_loop_valid_for_dlstp): New. >             (arm_predict_doloop_p): New. >             (arm_loop_unroll_adjust): New. >             (arm_emit_mve_unpredicated_insn_to_seq): New. >             (arm_attempt_dlstp_transform): New. >             * config/arm/iterators.md (DLSTP): New. >             (mode1): Add DLSTP mappings. >             * config/arm/mve.md (*predicated_doloop_end_internal): New. >             (dlstp_insn): New. >             * config/arm/thumb2.md (doloop_end): Update for MVE LOLs. >             * config/arm/unspecs.md: New unspecs. >             * df-core.cc (df_bb_regno_only_def_find): New. >             * df.h (df_bb_regno_only_def_find): New. >             * loop-doloop.cc (doloop_condition_get): Relax conditions. >             (doloop_optimize): Add support for elementwise LoLs. > >     gcc/testsuite/ChangeLog: > >             * gcc.target/arm/lob.h: Update framework. >             * gcc.target/arm/lob1.c: Likewise. >             * gcc.target/arm/lob6.c: Likewise. >             * gcc.target/arm/mve/dlstp-compile-asm.c: New test. >             * gcc.target/arm/mve/dlstp-int16x8.c: New test. >             * gcc.target/arm/mve/dlstp-int32x4.c: New test. >             * gcc.target/arm/mve/dlstp-int64x2.c: New test. >             * gcc.target/arm/mve/dlstp-int8x16.c: New test. >             * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.