Thank you, Richard.

>> I think that already works for them (could be misremembering).
>> However, IIUC, they have no special instruction to calculate the
>> length (unlike for RVV), and so it's open-coded using vect_get_len.

Yeah, the current flow (min, sub, and then min in vect_get_len) is working for IBM.
But I wonder whether switching the current length-loop-control flow to the WHILE_LEN pattern in this patch could improve their performance.

>> (1) How easy would it be to express WHILE_LEN in normal gimple?
>> I haven't thought about this at all, so the answer might be
>> "very hard". But it reminds me a little of UQDEC on AArch64,
>> which we open-code using MAX_EXPR and MINUS_EXPR (see
>> vect_set_loop_controls_directly).

>> I'm not saying WHILE_LEN is the same operation, just that it seems
>> like it might be open-codeable in a similar way.

>> Even if we can open-code it, we'd still need some way for the
>> target to select the "RVV way" from the "s390/PowerPC way".

WHILE_LEN, as I define it in the doc, is

  operand0 = MIN (operand1, operand2)

operand1 is the residual number of scalar elements that still need to be updated.
operand2 is the vectorization factor (vf) for a single rgroup; for multiple rgroups, operand2 = vf * nitems_per_ctrl.

Do you mean that such a pattern is not well expressed, so we need to replace it with normal tree codes (MIN or MAX) and let the RISC-V backend optimize them into vsetvl?
Sorry, maybe I am not on the same page.

>> (2) What effect does using a variable IV step (the result of
>> the WHILE_LEN) have on ivopts? I remember experimenting with
>> something similar once (can't remember the context) and not
>> having a constant step prevented ivopts from making good
>> addressing-mode choices.

Thank you so much for pointing this out. Currently, a variable IV step with n decreasing down to 0 works fine in the RISC-V downstream GCC, and we haven't found any issues related to addressing-mode selection.

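For illustration, the loop shape being discussed (a length computed as MIN (remaining, vf), used both as the per-iteration element count and as the variable IV step, with the remaining count decreasing to 0) can be modelled in scalar C. This is only a rough sketch under assumptions: while_len and add_arrays are hypothetical names invented for this example, not GCC internals or RVV intrinsics, and the inner loop merely stands in for one length-predicated vector operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for WHILE_LEN: operand0 = MIN (operand1, operand2),
   where operand1 is the number of remaining scalar elements and
   operand2 is the vectorization factor.  */
static size_t while_len(size_t remaining, size_t vf)
{
    return remaining < vf ? remaining : vf;
}

/* Scalar model of a length-controlled loop with a decrementing IV.
   Returns the number of "vector" iterations that were executed.  */
size_t add_arrays(int *dst, const int *a, const int *b, size_t n, size_t vf)
{
    assert(vf > 0);
    size_t iters = 0;
    size_t remaining = n;

    while (remaining > 0) {
        size_t len = while_len(remaining, vf);

        /* Models one vector add predicated by length LEN.  */
        for (size_t i = 0; i < len; i++)
            dst[i] = a[i] + b[i];

        /* The pointer step is LEN, so it is not a compile-time constant.  */
        dst += len;
        a += len;
        b += len;

        /* Decrement the IV; it reaches exactly 0 on the last iteration.  */
        remaining -= len;
        iters++;
    }
    return iters;
}
```

With n = 10 and vf = 4, the model runs three iterations with lengths 4, 4 and 2, so only the last iteration is partial and no separate epilogue is needed.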
I think I must have missed something. Would you mind giving me some hints so that I can study ivopts and find out which cases may generate inferior codegen for a variable IV step?

Thank you so much.

juzhe.zhong@rivai.ai

From: Richard Sandiford
Date: 2023-04-12 19:17
To: Richard Biener
CC: juzhe.zhong@rivai.ai; gcc-patches; jeffreyalaw; rdapp; linkw
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

Richard Biener writes:
> On Wed, 12 Apr 2023, juzhe.zhong@rivai.ai wrote:
>
>> >> Thanks for the detailed explanation. Just to clarify - with RVV
>> >> there's only a single mask register, v0.t, or did you want to
>> >> say an instruction can only specify a single mask register?
>>
>> RVV has 32 (v0~v31) vector registers in total.
>> We can store a vector data value or a mask value in any of them.
>> We also have mask-logic instructions, for example mask-and between any vector registers.
>>
>> However, any vector operation, for example vadd.vv, can only be predicated by v0 (in asm it is v0.t), which is the first vector register.
>> We can't predicate vadd.vv with v1 - v31.
>>
>> So, you can imagine that every time we want to use a mask to predicate a vector operation, we should always first store the mask value
>> into v0.
>>
>> So, we can write an intrinsic sequence like this:
>>
>> vmseq v0,v8,v9 (store mask value to v0)
>> vmslt v1,v10,v11 (store mask value to v1)
>> vmand v0,v0,v1
>> vadd.vv ...v0.t (predicate mask should always be v0).
>
> Ah, I see - that explains it well.
>
>> >> ARM SVE would have a loop control mask and a separate mask
>> >> for the if (cond[i]) which would be combined with a mask-and
>> >> instruction to a third mask which is then used on the
>> >> predicated instructions.
>>
>> Yeah, I know it. The ARM SVE way is more elegant than what RVV does.
>> However, for RVV, we can't follow this flow.
>> We don't have a "whilelo" instruction to generate a loop control mask.
>
> Yep.
> Similar for AVX512 where I have to use a vector compare. I'm
> currently using
>
>  { 0, 1, 2 ... } < { remaining_len, remaining_len, ... }
>
> and careful updating of remaining_len (we know it will either
> be adjusted by the full constant vector length or updated to zero).
>
>> We can only do loop control with the length generated by vsetvl.
>> And we can only use "v0" as the mask to predicate vadd.vv, and a mask value can only be generated by comparison or mask-logical instructions.
>>
>> >> PowerPC and s390x might be able to use WHILE_LEN as well (though
>> >> they only have LEN variants of loads and stores) - of course
>> >> only "simulating it". For the fixed-vector-length ISAs the
>> >> predicated vector loop IMHO makes most sense for the epilogue to
>> >> handle low-trip loops better.
>>
>> Yeah, I wonder how they do the flow control (if (cond[i])).
>> For RVV, you can imagine I will need to add a pattern LEN_MASK_LOAD/LEN_MASK_STORE (length generated by WHILE_LEN and mask generated by comparison).
>>
>> I think we can CC IBM folks to see whether we can make WHILE_LEN work
>> for both IBM and RVV?
>
> I've CCed them. Adding WHILE_LEN support to rs6000/s390x would be
> mainly the "easy" way to get len-masked (epilog) loop support.

I think that already works for them (could be misremembering).
However, IIUC, they have no special instruction to calculate the
length (unlike for RVV), and so it's open-coded using vect_get_len.

I suppose my two questions are:

(1) How easy would it be to express WHILE_LEN in normal gimple?
    I haven't thought about this at all, so the answer might be
    "very hard". But it reminds me a little of UQDEC on AArch64,
    which we open-code using MAX_EXPR and MINUS_EXPR (see
    vect_set_loop_controls_directly).

    I'm not saying WHILE_LEN is the same operation, just that it seems
    like it might be open-codeable in a similar way.

    Even if we can open-code it, we'd still need some way for the
    target to select the "RVV way" from the "s390/PowerPC way".

(2) What effect does using a variable IV step (the result of
    the WHILE_LEN) have on ivopts? I remember experimenting with
    something similar once (can't remember the context) and not
    having a constant step prevented ivopts from making good
    addressing-mode choices.

Thanks,
Richard
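
As a side note on question (1), the relationship between a MIN-based length and a MAX/MINUS-style saturating decrement can be checked with a small scalar model. This is only a sketch under assumptions: it treats the UQDEC-like update as "MAX (remaining, vf) - vf" and the WHILE_LEN-like update as "remaining - MIN (remaining, vf)", and all function names are hypothetical, chosen for this example rather than taken from GCC.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }
static size_t max_sz(size_t x, size_t y) { return x > y ? x : y; }

/* Decrement driven by a MIN-based length:
   len = MIN (remaining, vf); remaining -= len.  */
size_t next_remaining_min(size_t remaining, size_t vf)
{
    return remaining - min_sz(remaining, vf);
}

/* Saturating decrement built only from MAX and MINUS:
   remaining = MAX (remaining, vf) - vf.  */
size_t next_remaining_max_minus(size_t remaining, size_t vf)
{
    return max_sz(remaining, vf) - vf;
}

/* Returns 1 if both forms give the same next value for every
   remaining count in [0, limit], and 0 otherwise.  */
int forms_agree(size_t limit, size_t vf)
{
    assert(vf > 0);
    for (size_t r = 0; r <= limit; r++)
        if (next_remaining_min(r, vf) != next_remaining_max_minus(r, vf))
            return 0;
    return 1;
}
```

Under these assumptions the two updates produce the same remaining count, which suggests the decrement itself is expressible with ordinary MIN/MAX/MINUS operations. It says nothing about how a target would then choose between an "RVV way" and an "s390/PowerPC way", which is the open part of the question.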