Hi, Richi. I have sent the first splitted patch (only add ifn and optabs) as you suggested. https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621874.html Could you take a look at it? After this patch is approved, I will send the second patch (Support them into vectorizer) next. Thanks! juzhe.zhong@rivai.ai From: Richard Biener Date: 2023-06-15 17:52 To: Robin Dapp CC: juzhe.zhong@rivai.ai; gcc-patches; richard.sandiford; krebbel; uweigand; linkw Subject: Re: [PATCH V2] VECT: Support LEN_MASK_ LOAD/STORE to support flow control for length loop control On Thu, 15 Jun 2023, Robin Dapp wrote: > > Meh, PoP is now behind a paywall, trying to get through ... I wonder > > if there's a nice online html documenting the s390 len_load/store > > instructions to better understand the need for the bias. > > https://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf > > Look for vector load with length (store). The length operand specifies > the highest bytes to load instead of the actual length. Hmm. It indeed cannot represent len == 0, so you are making sure that never happens? Because when it is actually zero you are going to get -1 here? At least I don't see the bias operand used at all: ; Implement len_load/len_store optabs with vll/vstl. (define_expand "len_load_v16qi" [(match_operand:V16QI 0 "register_operand") (match_operand:V16QI 1 "memory_operand") (match_operand:QI 2 "register_operand") (match_operand:QI 3 "vll_bias_operand") ] "TARGET_VX && TARGET_64BIT" { rtx mem = adjust_address (operands[1], BLKmode, 0); rtx len = gen_reg_rtx (SImode); emit_move_insn (len, gen_rtx_ZERO_EXTEND (SImode, operands[2])); emit_insn (gen_vllv16qi (operands[0], len, mem)); DONE; }) the docs of len_load say " @cindex @code{len_load_@var{m}} instruction pattern @item @samp{len_load_@var{m}} Load (operand 2 - operand 3) elements from vector memory operand 1 into vector register operand 0, setting the other elements of operand 0 to undefined values. Operands 0 and 1 have mode @var{m}, which must be a vector mode. Operand 2 has whichever integer mode the target prefers. Operand 3 conceptually has mode @code{QI}. Operand 2 can be a variable or a constant amount. Operand 3 specifies a constant bias: it is either a constant 0 or a constant -1. The predicate on operand 3 must only accept the bias values that the target actually supports. GCC handles a bias of 0 more efficiently than a bias of -1. If (operand 2 - operand 3) exceeds the number of elements in mode @var{m}, the behavior is undefined. If the target prefers the length to be measured in bytes rather than elements, it should only implement this pattern for vectors of @code{QI} elements." the minus in 'operand 2 - operand 3' should be a plus if the bias is really zero or -1. I suppose 'If (operand 2 - operand 3) exceeds the number of elements in mode @var{m}, the behavior is undefined.' means that the vectorizer has to make sure the biased element count never underflows? That is, for a loop like void foo (double *x, float *y, int n) { for (int i = 0; i < n; ++i) y[i] = x[i]; } you should get x1 = len_load (...); x2 = len_load (...); y = VEC_PACK_TRUNC_EXPR len_store (..., y); but then the x2 load can end up with a len of zero and thus trap (since you will load either a full vector or the first byte of it). I see you do /* If the backend requires a bias of -1 for LEN_LOAD, we must not emit len_loads with a length of zero. In order to avoid that we prohibit more than one loop length here. */ if (partial_load_bias == -1 && LOOP_VINFO_LENS (loop_vinfo).length () > 1) return false; that's quite conservative. I think you can do better when the loads are aligned, reading an extra byte when ignoring the bias is OK and you at least know the very first element is used. For stores you would need to emit compare&jump for all but the first store of a group though ... That said, I'm still not seeing where you actually apply the bias. Richard.