Hi, Richard. Thanks for the comments. >> I think it would be worth saying in more detail why we only use SELECT_VL >> for single rgroups. I assume the reason is to simplify the pointer IV >> updates. Is that right? Yes. >> The difficulty is >> that an rgroup that controls N vector loads (say) would need N pointer >> updates by variable amounts. But I'm not 100% sure whether we're >> avoiding that situation because it's difficult to code, or because >> it's inefficient. Or maybe we're avoiding it because it doesn't >> fit well with the later RVV vsetvl pass. I don't want to use SELECT_VL pattern for multiple rgroups since it's really ineffecient and also not fit well with latter RVV vsetvl PASS. (I have a draft in my downstream, the loop body becomes very ugly with a lot of instructions to adjust IVs). Since we define SELECT_VL as a flexible pattern that doesn't have the side effect set vector length, we need much more scalar operations to adjust the pointer IVs and it cause more vsetvli than just using MIN. Also, it changes a lot in middle-end and make middle-end codes too ugly and no benefits I see so far. Current approach (MIN), I think the current codegen is good (even though may not be perfect). Besides, LLVM only can handle one vector length (in GCC, we call multiple rgroup). I think RVV GCC is already in a good shape now in case of loop control. I'd rather support all RVV auto-vectorization features soon and focus on optimizing loopVectorizer (For example, I known GCC in TSVC has 46 fails, fixing failed vectorization case will improve much more, I think it can also improve ARM. ). Address comments and will send the next patch, really appreciate it. Thanks. juzhe.zhong@rivai.ai From: Richard Sandiford Date: 2023-06-05 05:59 To: juzhe.zhong CC: gcc-patches; rguenther Subject: Re: [PATCH] VECT: Add SELECT_VL support Sorry for the slow review. I don't know the IV-related parts well enough to review those properly, but they looked reasonable to me. Hopefully Richi can comment. I'm curious though. For: > + tree step = vect_dr_behavior (vinfo, dr_info)->step; > + > + [...] > + poly_uint64 bytesize = GET_MODE_SIZE (element_mode (aggr_type)); > + /* Since the outcome of .SELECT_VL is element size, we should adjust > + it into bytesize so that it can be used in address pointer variable > + amount IVs adjustment. */ > + tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len, > + build_int_cst (len_type, bytesize)); > + if (tree_int_cst_sgn (step) == -1) > + tmp = fold_build1 (NEGATE_EXPR, len_type, tmp); Could you not just multiply loop_len by step, probably written as: build_int_cst (len_type, wi::to_widest (step)) avoiding the NEGATE_EXPR and bytesize calculation? step should represent the step of the original scalar IV, so doing that feels more direct. The loop-control bits look good to me apart from one hunk: > @@ -2737,6 +2738,14 @@ start_over: > LOOP_VINFO_VECT_FACTOR (loop_vinfo)))) > LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true; > > + /* If we're using decrement IV and SELECT_VL is supported by the target. > + Use output of SELECT_VL to adjust IV of loop control and data reference. > + Note: We only use SELECT_VL on single-rgroup control. */ > + if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) > + && LOOP_VINFO_LENS (loop_vinfo).length () == 1 > + && !slp) > + LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true; > + > /* If we're vectorizing an epilogue loop, the vectorized loop either needs > to be able to handle fewer than VF scalars, or needs to have a lower VF > than the main loop. */ This test also needs to check that the target implements the select_vl optab for the chosen iv type. You can check that using direct_internal_fn_supported_p. We should also check that LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1, since the IV update multiplies by the size in bytes. I think it would be worth saying in more detail why we only use SELECT_VL for single rgroups. I assume the reason is to simplify the pointer IV updates. Is that right? That is: the multiple length controls that are currently generated from a MIN_EXPR IV should also work with a SELECT_VL IV. The difficulty is that an rgroup that controls N vector loads (say) would need N pointer updates by variable amounts. But I'm not 100% sure whether we're avoiding that situation because it's difficult to code, or because it's inefficient. Or maybe we're avoiding it because it doesn't fit well with the later RVV vsetvl pass. Thanks, Richard