Thank you, Richard.

>> But the issue is the same in the reverse with WHILE_LEN, no?
>> WHILE_LEN just computes a scalar value - you seem to suggest
>> there's a hidden side-effect of "coalescing" the result with
>> a hardware vector length register?  I don't think that's good design.

No, I don't plan to suggest there's a hidden side-effect of "coalescing"
the result with a hardware vector length register.

Today I read the RVV ISA carefully again, and I realize that this patch is
not correct for every conforming RVV hardware implementation.

According to the RVV ISA, a vsetvli instruction

  vsetvli vl, avl, vtype

sets vl subject to the following constraints:

  vl = AVL                      if AVL ≤ VLMAX
  ceil(AVL / 2) ≤ vl ≤ VLMAX    if AVL < (2 * VLMAX)
  vl = VLMAX                    if AVL ≥ (2 * VLMAX)
  Deterministic on any given implementation for same input AVL and VLMAX values

The second constraint means the result of vsetvli is not necessarily VLMAX
(the maximum number of elements a CPU of a given vector length can update).
So for a vsetvli instruction (vsetvli vl, avl, vtype), the "vl" value can
vary among different RVV CPUs, depending on the implementation of the
downstream RVV hardware.  Now I think I should fix this patch, since it is
not suitable for all hardware.  The RVV ISA says:

  For example, this permits an implementation to set vl = ceil(AVL / 2)
  for VLMAX < AVL < 2*VLMAX in order to evenly distribute work over the
  last two iterations of a stripmine loop.

So we can have the two following RVV CPUs.  Suppose the number of elements
that need to be updated is 10 (int32_t), and the vector length is 256 bits
(at most 8 INT32 elements per iteration).  Two iterations are needed, and
the number of elements processed in each iteration depends on the hardware
implementation.  The following two implementations are both legal for the
RVV standard:

RVV CPU 1:
  1st iteration: update 5 elements (this satisfies the constraint
  ceil(AVL/2) ≤ vl ≤ VLMAX), i.e. vl = ceil(AVL/2) = 5.
  2nd iteration: update the remaining 5 elements.

RVV CPU 2:
  1st iteration: update 8 elements, i.e. vl = VLMAX = 8.
  2nd iteration: update the remaining 2 elements.

Both CPUs are legal according to the RVV specification.  It's obvious that
this patch is correct for RVV CPU 2 but incorrect for RVV CPU 1, since the
current flow of this patch is as follows:

+   _19 = (unsigned long) n_5(D);
+   ...
+
+ <bb 3>:
+   ...
+   # ivtmp_20 = PHI <_19, ivtmp_21>
+   ...
+   _22 = .WHILE_LEN (ivtmp_20, vf);
+   ...
+   LEN_LOAD (addr, _22);
+   ...
+   addr = addr + vf;
+   ivtmp_21 = ivtmp_20 - _22;
+   ...
+   if (ivtmp_21 != 0)
+     goto <bb 3>; [75.00%]
+   else
+     goto <bb 4>; [25.00%]
+
+ <bb 4>:
+   return;

Here _22, the output of WHILE_LEN, is only used in

  ivtmp_21 = ivtmp_20 - _22;

which is the saturating-to-zero subtraction, while

  addr = addr + vf;

just keeps adding vf to the address inside the loop.  Such a sequence is OK
for most RVV CPUs so far, I think.  However, for future compatibility, we
should make the WHILE_LEN output the step of the address IV as well.

So we should abandon the current addressing scheme of just adding vf, i.e.
replace "addr = addr + vf".  Instead, we should do:

  _22 = .WHILE_LEN (ivtmp_20, vf);
  ...
  LEN_LOAD (addr, _22);
  tmp = _22 * 4;   (assuming INT32 elements: convert the element count in
                    _22 into a byte offset for the address calculation)
  addr = addr + tmp;
  ...

That is, the result of WHILE_LEN is not only used for

  remain = remain - len;

but also in the address calculation:

  tmp = len * (element byte size);
  addr = addr + tmp;

Then this flow is correct for all RVV CPUs.
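To make the difference concrete, here is a minimal scalar C sketch of the
proposed flow.  It is not part of the patch; emulated_vsetvl and the loop
shape are hypothetical illustrations of the two conforming hardware
behaviours described above:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of a conforming vsetvli result: any value in
   [ceil(avl/2), vlmax] is legal when vlmax < avl < 2*vlmax.  The
   "even_split" flag models RVV CPU 1, otherwise RVV CPU 2.  */
static size_t
emulated_vsetvl (size_t avl, size_t vlmax, int even_split)
{
  if (avl <= vlmax)
    return avl;
  if (avl >= 2 * vlmax)
    return vlmax;
  return even_split ? (avl + 1) / 2 : vlmax;  /* ceil(avl/2) or VLMAX.  */
}

/* Stripmine loop in the proposed form: the per-iteration length (the
   WHILE_LEN result) drives both the remaining-count decrement and the
   address bump, so the loop is correct whichever legal vl the hardware
   chooses.  */
void
vadd_int32 (int32_t *dst, const int32_t *a, const int32_t *b,
            size_t n, size_t vlmax, int even_split)
{
  size_t remain = n;
  while (remain != 0)
    {
      size_t vl = emulated_vsetvl (remain, vlmax, even_split);
      for (size_t i = 0; i < vl; i++)   /* Stands in for LEN_LOAD/add/LEN_STORE.  */
        dst[i] = a[i] + b[i];
      dst += vl;                        /* addr = addr + vl * element size.  */
      a += vl;
      b += vl;
      remain -= vl;                     /* remain = remain - len.  */
    }
}

With even_split = 1 the sketch behaves like RVV CPU 1 (5 + 5 elements for
n = 10, vlmax = 8); with even_split = 0 it behaves like RVV CPU 2 (8 + 2).
The result is the same either way, because every pointer is bumped by the
length actually processed rather than by the fixed vf.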
This flow is essentially the same as the example in the RVV ISA:
https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s

I think I need to change this patch as described above to make it suitable
for all RVV CPUs.  But I am not sure whether the GCC community will accept
this flow, so I am proposing it now before I implement it.  I didn't notice
this issue earlier because my downstream RVV hardware and the open-source
simulator both generate "vl" = VLMAX.  (Sorry about that.)

Expecting any suggestions and comments.  Thank you so much.


juzhe.zhong@rivai.ai

From: Richard Biener
Date: 2023-04-13 14:47
To: 钟居哲
CC: richard.sandiford; gcc-patches; Jeff Law; rdapp; linkw; kito.cheng
Subject: Re: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
On Wed, 12 Apr 2023, 钟居哲 wrote:

> >> It's not so much that we need to do that.  But normally it's only worth
> >> adding internal functions if they do something that is too complicated
> >> to express in simple gimple arithmetic.  The UQDEC case I mentioned:
>
> >>   z = MAX (x, y) - y
>
> >> fell into the "simple arithmetic" category for me.  We could have added
> >> an ifn for unsigned saturating decrement, but it didn't seem complicated
> >> enough to merit its own ifn.
>
> Ah, I know your concern.  I should admit that WHILE_LEN is a simple
> arithmetic operation which just takes its result from
>
>   min (remain, vf).
>
> The possible solution is to just use MIN_EXPR (remain, vf).
> Then, add special handling in the umin_optab pattern to recognize "vf"
> in the backend, and finally generate vsetvl in the RISC-V backend.
>
> The "vf" should be recognized when the operand of umin is a
> const_int/const_poly_int operand.  Otherwise, just generate a scalar
> umin instruction.
>
> However, there is a case where I can't tell whether umin should generate
> vsetvl or umin.  It is the following case:
>
> int32_t foo (int32_t a)
> {
>   return min (a, 4);
> }
>
> In this case I should generate:
>
>   li a1,4
>   umin a1,a0,a1
>
> instead of generating vsetvl.
>
> However, in this case:
>
> void foo (int32_t *a...)
>   for (int i = 0; i < n; i++)
>     a[i] = b[i] + c[i];
>
> with -mriscv-vector-bits=128 (which means each vector can handle 4 INT32),
> the VF will be 4 too.  If we use MIN_EXPR instead of WHILE_LEN:
>
> ...
> len = MIN_EXPR (n, 4)
> v = len_load (len)
> ...
>
> In this case, MIN_EXPR should emit vsetvl.
>
> It's hard for me to tell the difference between these 2 cases...

But the issue is the same in the reverse with WHILE_LEN, no?
WHILE_LEN just computes a scalar value - you seem to suggest
there's a hidden side-effect of "coalescing" the result with
a hardware vector length register?  I don't think that's good design.

IMHO tieing the scalar result with the uses has to be done where
you emit the other vsetvl instructions.

One convenient thing we have with WHILE_LEN is that it is a key
for the vectorizer to query target capabilities (and preferences).
But of course collecting whether stmts can be vectorized with
length and/or with mask would be better.

Richard.

> CC RISC-V port backend maintainer: Kito.
>
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Sandiford
> Date: 2023-04-12 20:24
> To: juzhe.zhong@rivai.ai
> CC: rguenther; gcc-patches; jeffreyalaw; rdapp; linkw
> Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
> "juzhe.zhong@rivai.ai" writes:
> >>> I think that already works for them (could be misremembering).
> >>> However, IIUC, they have no special instruction to calculate the
> >>> length (unlike for RVV), and so it's open-coded using vect_get_len.
> >
> > Yeah, the current flow using min, sub, and then min in vect_get_len
> > is working for IBM.  But I wonder whether switching the current flow of
> > length-loop-control to the WHILE_LEN pattern of this patch can improve
> > their performance.
> >
> >>> (1) How easy would it be to express WHILE_LEN in normal gimple?
> >>>     I haven't thought about this at all, so the answer might be
> >>>     "very hard".  But it reminds me a little of UQDEC on AArch64,
> >>>     which we open-code using MAX_EXPR and MINUS_EXPR (see
> >>>     vect_set_loop_controls_directly).
> >
> >> I'm not saying WHILE_LEN is the same operation, just that it seems
> >> like it might be open-codeable in a similar way.
> >
> >> Even if we can open-code it, we'd still need some way for the
> >> target to select the "RVV way" from the "s390/PowerPC way".
> >
> > WHILE_LEN as I define it in the doc is
> >
> >   operand0 = MIN (operand1, operand2)
> >
> > operand1 is the residual number of scalar elements that need to be updated.
> > operand2 is the vectorization factor (vf) for a single rgroup; if there are
> > multiple rgroups, operand2 = vf * nitems_per_ctrl.
> >
> > You mean such a pattern is not well expressed, so we need to replace it
> > with normal tree code (MIN or MAX) and let the RISC-V backend optimize it
> > into vsetvl?  Sorry, maybe I am not on the same page.
>
> It's not so much that we need to do that.  But normally it's only worth
> adding internal functions if they do something that is too complicated
> to express in simple gimple arithmetic.  The UQDEC case I mentioned:
>
>   z = MAX (x, y) - y
>
> fell into the "simple arithmetic" category for me.  We could have added
> an ifn for unsigned saturating decrement, but it didn't seem complicated
> enough to merit its own ifn.
>
> >>> (2) What effect does using a variable IV step (the result of
> >>>     the WHILE_LEN) have on ivopts?  I remember experimenting with
> >>>     something similar once (can't remember the context) and not
> >>>     having a constant step prevented ivopts from making good
> >>>     addressing-mode choices.
> >
> > Thank you so much for pointing this out.  Currently, the variable IV step
> > and decreasing n down to 0 work fine for the RISC-V downstream GCC, and
> > we didn't find issues related to addressing-mode choices.
>
> OK, that's good.  Sounds like it isn't a problem then.
>
> > I think I must have missed something; would you mind giving me some hints
> > so that I can study ivopts to find out which cases may generate inferior
> > codegen for a variable IV step?
>
> I think AArch64 was sensitive to this because (a) the vectoriser creates
> separate IVs for each base address and (b) for SVE, we instead want
> invariant base addresses that are indexed by the loop control IV.
> Like Richard says, if the loop control IV isn't a SCEV, ivopts isn't
> able to use it and so (b) fails.
>
> Thanks,
> Richard
>

--
Richard Biener
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)