>> In other words, why is this different from what >>vect_set_loop_controls_directly would do? Oh, I see. You are confused that why I do not make multiple-rgroup vec_trunk handling inside "vect_set_loop_controls_directly". Well. Frankly, I just replicate the handling of ARM SVE: unsigned int nmasks = i + 1; if (use_masks_p && (nmasks & 1) == 0) { rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1]; if (!half_rgc->controls.is_empty () && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc)) continue; } /* Try to use permutes to define the masks in DEST_RGM using the masks in SRC_RGM, given that the former has twice as many masks as the latter. Return true on success, adding any new statements to SEQ. */ static bool vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm, rgroup_controls *src_rgm) { tree src_masktype = src_rgm->type; tree dest_masktype = dest_rgm->type; machine_mode src_mode = TYPE_MODE (src_masktype); insn_code icode1, icode2; if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter && (icode1 = optab_handler (vec_unpacku_hi_optab, src_mode)) != CODE_FOR_nothing && (icode2 = optab_handler (vec_unpacku_lo_optab, src_mode)) != CODE_FOR_nothing) { /* Unpacking the source masks gives at least as many mask bits as we need. We can then VIEW_CONVERT any excess bits away. */ machine_mode dest_mode = insn_data[icode1].operand[0].mode; gcc_assert (dest_mode == insn_data[icode2].operand[0].mode); tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode); for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i) { tree src = src_rgm->controls[i / 2]; tree dest = dest_rgm->controls[i]; tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1) ? VEC_UNPACK_HI_EXPR : VEC_UNPACK_LO_EXPR); gassign *stmt; if (dest_masktype == unpack_masktype) stmt = gimple_build_assign (dest, code, src); else { tree temp = make_ssa_name (unpack_masktype); stmt = gimple_build_assign (temp, code, src); gimple_seq_add_stmt (seq, stmt); stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR, build1 (VIEW_CONVERT_EXPR, dest_masktype, temp)); } gimple_seq_add_stmt (seq, stmt); } return true; } vec_perm_indices indices[2]; if (dest_masktype == src_masktype && interleave_supported_p (&indices[0], src_masktype, 0) && interleave_supported_p (&indices[1], src_masktype, 1)) { /* The destination requires twice as many mask bits as the source, so we can use interleaving permutes to double up the number of bits. */ tree masks[2]; for (unsigned int i = 0; i < 2; ++i) masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]); for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i) { tree src = src_rgm->controls[i / 2]; tree dest = dest_rgm->controls[i]; gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR, src, src, masks[i & 1]); gimple_seq_add_stmt (seq, stmt); } return true; } return false; } I know this is just optimization for ARM SVE with sub_rgc (int16) is half size of rgc (int8). But when I just copy the codes from ARM SVE and make it general for all cases (int8 <-> int64). They all work well and codegen is good. If you don't like this way, would you mind give me some suggestions? Thanks. juzhe.zhong@rivai.ai From: Richard Sandiford Date: 2023-05-24 20:41 To: 钟居哲 CC: gcc-patches; rguenther Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support Sorry, I realised later that I had an implicit assumption here: if there are multiple rgroups, it's better to have a single IV for the smallest rgroup and scale that up to bigger rgroups. E.g. if the loop control IV is taken from an N-control rgroup and has a step S, an N*M-control rgroup would be based on M*S. Of course, it's also OK to create multiple IVs if you prefer. It's just a question of which approach gives the best output in practice. Another way of going from an N-control rgroup ("G1") to an N*M-control rgroup ("G2") would be to reuse all N controls from G1. E.g. the first M controls in G2 would come from G1[0], the next M from G1[1], etc. That might lower the longest dependency chain. But whatever we do, it doesn't feel like max_nscalars_per_iter should be part of the decision. (I realise it will be part of the decision for the follow-on SELECT_IV patch. But that's because we require the number of elements processed in each iteration to be a multiple of max_nscalars_per_iter, and AIUI SELECT_IV wouldn't guarantee that. max_nscalars_per_iter shouldn't matter for the current patch though.) 钟居哲 writes: > Hi, Richard. It's quite complicated for me and I am not sure whether I can catch up with you. > So I will rather split the work step by step to implement the decrement IV > > For the first step you mentioned: > >>> (1) In vect_set_loop_condition_partial_vectors, for the first iteration of: > > >> FOR_EACH_VEC_ELT (*controls, i, rgc) > >> if (!rgc->controls.is_empty ()) > >>> call vect_set_loop_controls_directly. That is: > >>> >> /* See whether zero-based IV would ever generate all-false masks >>> or zero length before wrapping around. */ >>> bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc); >>> > /* Set up all controls for this group. */ >>> test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo, > >> &preheader_seq, > >> &header_seq, > >> loop_cond_gsi, rgc, > >> niters, niters_skip, > >> might_wrap_p); > >>> needs to be an "if" that (for LOOP_VINFO_USING_DECREMENTING_IV_P) >>> is only executed on the first iteration. > > Is it correct like this? > > FOR_EACH_VEC_ELT (*controls, i, rgc) > if (!rgc->controls.is_empty ()) > { > /* First try using permutes. This adds a single vector > instruction to the loop for each mask, but needs no extra > loop invariants or IVs. */ > unsigned int nmasks = i + 1; > if (use_masks_p && (nmasks & 1) == 0) > { > rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1]; > if (!half_rgc->controls.is_empty () > && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc)) > continue; > } > > /* See whether zero-based IV would ever generate all-false masks > or zero length before wrapping around. */ > bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc); > > /* Set up all controls for this group. */ > test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo, > &preheader_seq, > &header_seq, > loop_cond_gsi, rgc, > niters, niters_skip, > might_wrap_p); > > /* Decrement IV only run vect_set_loop_controls_directly once. */ > if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)) > break; > } I meant something like: FOR_EACH_VEC_ELT (*controls, i, rgc) if (!rgc->controls.is_empty ()) { /* First try using permutes. This adds a single vector instruction to the loop for each mask, but needs no extra loop invariants or IVs. */ unsigned int nmasks = i + 1; if (use_masks_p && (nmasks & 1) == 0) { rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1]; if (!half_rgc->controls.is_empty () && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc)) continue; } if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) || !LOOP_VINFO_DECREMENTING_IV_STEP (loop_info)) { /* See whether zero-based IV would ever generate all-false masks or zero length before wrapping around. */ bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc); /* Set up all controls for this group. */ test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo, &preheader_seq, &header_seq, loop_cond_gsi, rgc, niters, niters_skip, might_wrap_p); } if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) && rgc->controls.length () > 1) ...use vect_adjust_loop_lens_control... } where LOOP_VINFO_DECREMENTING_IV_STEP (loop_info) is "S" from my previous review. vect_set_loop_controls_directly would then set LOOP_VINFO_DECREMENTING_IV_STEP but would not call vect_adjust_loop_lens_control. But like I say, this is all based on the assumption that we should have a single IV and scale it up for later rgroups. If you'd prefer separate IVs then that's fine. But then I think it's less clear why we have: > + if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) > + && rgc->max_nscalars_per_iter == 1 > + && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0]) > + { > + /* Multiple rgroup (non-SLP): > + ... > + _38 = (unsigned long) n_12(D); > + ... > + # ivtmp_38 = PHI > + ... > + _40 = MIN_EXPR ; > + loop_len_21 = MIN_EXPR <_40, POLY_INT_CST [2, 2]>; > + _41 = _40 - loop_len_21; > + loop_len_20 = MIN_EXPR <_41, POLY_INT_CST [2, 2]>; > + _42 = _40 - loop_len_20; > + loop_len_19 = MIN_EXPR <_42, POLY_INT_CST [2, 2]>; > + _43 = _40 - loop_len_19; > + loop_len_16 = MIN_EXPR <_43, POLY_INT_CST [2, 2]>; > + ... > + vect__4.8_15 = .LEN_LOAD (_6, 64B, loop_len_21, 0); > + ... > + vect__4.9_8 = .LEN_LOAD (_13, 64B, loop_len_20, 0); > + ... > + vect__4.10_28 = .LEN_LOAD (_46, 64B, loop_len_19, 0); > + ... > + vect__4.11_30 = .LEN_LOAD (_49, 64B, loop_len_16, 0); > + vect__7.13_31 = VEC_PACK_TRUNC_EXPR ; > + vect__7.13_32 = VEC_PACK_TRUNC_EXPR <...>; > + vect__7.12_33 = VEC_PACK_TRUNC_EXPR <...>; > + ... > + .LEN_STORE (_14, 16B, _40, vect__7.12_33, 0); > + ivtmp_39 = ivtmp_38 - _40; > + ... > + if (ivtmp_39 != 0) > + goto ; [92.31%] > + else > + goto ; [7.69%] > + */ > + rgroup_controls *sub_rgc > + = &(*controls)[nmasks / rgc->controls.length () - 1]; > + if (!sub_rgc->controls.is_empty ()) > + { > + tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo); > + vect_adjust_loop_lens_control (iv_type, &header_seq, rgc, > + sub_rgc, NULL_TREE); > + continue; > + } > + } In other words, why is this different from what vect_set_loop_controls_directly would do? Thanks, Richard