Thanks Kewen.

The current flow in this patch is, as you said:

  ...
  len = WHILE_LEN (n, vf);
  ...
  v = len_load (addr, len);
  ...
  addr = addr + vf (in bytes);
  ...

This patch simply keeps advancing the address by the vectorization factor (converted to bytes). For example, if the vector length is 512 bits, the patch updates the address as addr = addr + 64.

However, after reading the RVV ISA more closely today, I think it would be more appropriate to update the address as

  addr = addr + (len * 4)

if len is the number of INT32 elements, where len is the result computed by WHILE_LEN.

I assume that for the IBM targets it is better to update the address IV by directly adding the whole register byte size, since I think the second way (addr = addr + (len * 4)) is too RVV-specific and won't be suitable for IBM. Is that right? If so, I will keep the flow of this patch (and not change it to addr = addr + (len * 4)) and see what else I need to do for IBM. I would rather do that in the RISC-V backend port.

>> I tried
>> to compile the above source files on Power, the former can adopt doloop
>> optimization but the latter fails to.

Do you mean GCC cannot do hardware loop optimization when the loop-control IV is variable?

juzhe.zhong@rivai.ai

From: Kewen.Lin
Date: 2023-04-13 15:29
To: 钟居哲
CC: gcc-patches; Jeff Law; rdapp; richard.sandiford; rguenther
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

Hi Juzhe,

on 2023/4/12 21:22, 钟居哲 wrote:
> Thanks Kewen.
>
> It seems that this proposal WHILE_LEN can help s390 when using --param vect-partial-vector-usage=2 compile option.
>

Yeah, IMHO, the previous sequence vs. the proposed sequence are like:

int foo (int *__restrict a, int *__restrict b, int n)
{
  if (n <= 0)
    return 0;

  int iv = 0;
  int len = MIN (n, 16);
  int sum = 0;
  do
    {
      sum += a[len] + b[len];
      iv += 16;
      int n1 = MIN (n, iv);  // line A
      int n2 = n - n1;
      len = MIN (n2, 16);
    }
  while (n > iv);

  return sum;
}

vs.
int foo (int *__restrict a, int *__restrict b, int n)
{
  if (n <= 0)
    return 0;

  int len;
  int sum = 0;
  do
    {
      len = MIN (n, 16);
      sum += a[len] + b[len];
      n -= len;
    }
  while (n > 0);

  return sum;
}

It at least saves one MIN (at line A) and one length preparation in the last iteration (which is useless since the loop ends). But I think the concern that this proposed IV isn't recognized as a simple IV may remain. I tried to compile the above source files on Power: the former can adopt the doloop optimization but the latter fails to.

> Would you mind applying this patch && supporting WHILE_LEN in the s390 backend and testing it to see the overall benefits for s390
> as well as the correctness of this sequence ?

Sure, if all of you think this approach and this revision are good enough to go forward for this kind of evaluation, I'm happy to give it a shot, but only for rs6000. ;-) I noticed that there are some discussions on withdrawing this WHILE_LEN and using MIN_EXPR instead, so I'll stay tuned.

Btw, we currently adopt vectors with length only in the epilogues rather than in the main vectorized loops, because of the non-trivial extra costs of length preparation compared with just using the normal vector load/store (all lanes), so we don't care much about the performance with --param vect-partial-vector-usage=2. Even if this new proposal can optimize the length preparation for --param vect-partial-vector-usage=2, the extra costs of length preparation are still unavoidable (MIN, shifting, one more GPR used), so we would still stay with the default --param vect-partial-vector-usage=1 (which can't benefit from this new proposal).

BR,
Kewen
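
As a rough illustration of the address-update question raised at the top of this thread (addr = addr + vf bytes vs. addr = addr + (len * 4)), here is a minimal scalar sketch in C. It is not code from the patch. The names VF, len_min, len_balanced and sum_with_len are hypothetical. len_balanced is an assumed length rule, loosely modeled on my understanding that RVV may return fewer than vf elements before the final iteration, and is only there to show when the two update schemes diverge.

```c
#include <assert.h>
#include <stddef.h>

#define VF 16 /* elements per vector; hypothetical, matches the 16 used above */

typedef int (*len_fn) (int n, int vf);

/* Models len = MIN (n, vf), as in the proposed sequence.  */
static int
len_min (int n, int vf)
{
  return n < vf ? n : vf;
}

/* Assumed rule (illustration only): when vf < n < 2 * vf, split the
   remaining elements roughly evenly over the last two iterations.  */
static int
len_balanced (int n, int vf)
{
  if (n > vf && n < 2 * vf)
    return (n + 1) / 2;
  return n < vf ? n : vf;
}

/* Scalar stand-in for a length-controlled vector reduction.
   step_by_len != 0: advance the offset by len elements.
   step_by_len == 0: advance the offset by VF elements (fixed step).  */
static int
sum_with_len (const int *a, int n, len_fn get_len, int step_by_len)
{
  int sum = 0;
  size_t off = 0;
  while (n > 0)
    {
      int len = get_len (n, VF);
      assert (len > 0 && len <= n); /* guarantees progress */
      for (int i = 0; i < len; i++)
        sum += a[off + i];
      off += step_by_len ? (size_t) len : (size_t) VF;
      n -= len;
    }
  return sum;
}
```

With len = MIN (n, vf), len is smaller than vf only in the final iteration, so both update schemes read the same elements. Under the assumed balanced rule, the fixed vf step skips elements and reads past the intended range, so the caller's buffer must be larger than n for the sketch to stay in bounds. That is the case where advancing by len becomes necessary.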