>> It's not so much that we need to do that.  But normally it's only worth
>> adding internal functions if they do something that is too complicated
>> to express in simple gimple arithmetic.  The UQDEC case I mentioned:

>>    z = MAX (x, y) - y

>> fell into the "simple arithmetic" category for me.  We could have added
>> an ifn for unsigned saturating decrement, but it didn't seem complicated
>> enough to merit its own ifn.

Ah, I known your concern. I should admit that WHILE_LEN is a simple arithmetic operation
which is just taking result from

min (remain,vf).

The possible solution is to just use MIN_EXPR (remain,vf).
Then, add speciall handling in umin_optab pattern to recognize "vf" in the backend.
Finally generate vsetvl in RISC-V backend.

The "vf" should be recognized as the operand of umin should be const_int/const_poly_int operand.
Otherwise, just generate umin scalar instruction..

However, there is a case that I can't recognize umin should generate vsetvl or umin. Is this following case:
void foo (int32_t a)
{
  return min (a, 4);
}

In this case I should generate:
li a1,4
umin a1,a0,a1

instead of generating vsetvl

However, in this case:

void foo (int32_t *a...)
for (int i = 0; i < n; i++)
  a[i] = b[i] + c[i];

with -mriscv-vector-bits=128 (which means each vector can handle 4 INT32)
Then the VF will be 4 too. If we also MIN_EXPR instead WHILE_LEN:

...
len = MIN_EXPR (n,4)
v = len_load (len)
....
...

In this case, MIN_EXPR should emit vsetvl.

It's hard for me to tell the difference between these 2 cases...

CC RISC-V port backend maintainer: Kito.



juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-04-12 20:24
To: juzhe.zhong\@rivai.ai
CC: rguenther; gcc-patches; jeffreyalaw; rdapp; linkw
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
"juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
>>> I think that already works for them (could be misremembering).
>>> However, IIUC, they have no special instruction to calculate the
>>> length (unlike for RVV), and so it's open-coded using vect_get_len.
>
> Yeah, the current flow using min, sub, and then min in vect_get_len
> is working for IBM. But I wonder whether switching the current flow of
> length-loop-control into the WHILE_LEN pattern that this patch can improve
> their performance.
>
>>> (1) How easy would it be to express WHILE_LEN in normal gimple?
>>>     I haven't thought about this at all, so the answer might be
>>>     "very hard".  But it reminds me a little of UQDEC on AArch64,
>>>     which we open-code using MAX_EXPR and MINUS_EXPR (see
>  >>    vect_set_loop_controls_directly).
>
>   >>   I'm not saying WHILE_LEN is the same operation, just that it seems
>   >>   like it might be open-codeable in a similar way.
>
>  >>    Even if we can open-code it, we'd still need some way for the
>   >>   target to select the "RVV way" from the "s390/PowerPC way".
>
> WHILE_LEN in doc I define is
> operand0 = MIN (operand1, operand2)operand1 is the residual number of scalar elements need to be updated.operand2 is vectorization factor (vf) for single rgroup.         if multiple rgroup operan2 = vf * nitems_per_ctrl.You mean such pattern is not well expressed so we need to replace it with normaltree code (MIN OR MAX). And let RISC-V backend to optimize them into vsetvl ?Sorry, maybe I am not on the same page.
 
It's not so much that we need to do that.  But normally it's only worth
adding internal functions if they do something that is too complicated
to express in simple gimple arithmetic.  The UQDEC case I mentioned:
 
   z = MAX (x, y) - y
 
fell into the "simple arithmetic" category for me.  We could have added
an ifn for unsigned saturating decrement, but it didn't seem complicated
enough to merit its own ifn.
 
>>> (2) What effect does using a variable IV step (the result of
>>> the WHILE_LEN) have on ivopts?  I remember experimenting with
>>> something similar once (can't remember the context) and not
>>> having a constant step prevented ivopts from making good
>>> addresing-mode choices.
>
> Thank you so much for pointing out this. Currently, varialble IV step and decreasing n down to 0 
> works fine for RISC-V downstream GCC and we didn't find issues related addressing-mode choosing.
 
OK, that's good.  Sounds like it isn't a problem then.
 
> I think I must missed something, would you mind giving me some hints so that I can study on ivopts
> to find out which case may generate inferior codegens for varialble IV step?
 
I think AArch64 was sensitive to this because (a) the vectoriser creates
separate IVs for each base address and (b) for SVE, we instead want
invariant base addresses that are indexed by the loop control IV.
Like Richard says, if the loop control IV isn't a SCEV, ivopts isn't
able to use it and so (b) fails.
 
Thanks,
Richard