>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode))); No, I think it will make us miss some optimization. For example, for poly value [16,16] maybe_ge ([16,16], 65536) which makes us missed merge optimization but we definitely can do merge optimization. >> Also add a comment that the non-constant case is handled by >> shuffle_decompress_patterns in case we have a HImode vector twice the >> size that can hold our indices. Ok. >> Comment should go inside the if branch. Ok. >>Also add this as comment: >>As the mask is a simple {0, 1, ...} pattern and the length is known we can >>store it in a scalar register and broadcast it to a mask register. Ok. >>I would have hoped that a simple >> v.quick_push (gen_int_mode (0b01010101, QImode)); >>suffices but that will probably clash if there are more than >>two npatterns. No, we definitely can not use this. more details you can see the current test vmerge-*.c . We have various patterns: E.g. 0, nunits + 1, nunits+ 2, ... it is 011 nunits, 1, 2 it 100. .... Many different kinds of patterns can be used vmerge optimization. juzhe.zhong@rivai.ai From: Robin Dapp Date: 2023-12-15 19:14 To: Juzhe-Zhong; gcc-patches CC: rdapp.gcc; kito.cheng; kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization Hi Juzhe, in general looks OK. > + /* We need to use precomputed mask for such situation and such mask > + can only be computed in compile-time known size modes. */ > + if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 && maybe_ge (vec_len, 256) > + && !vec_len.is_constant ()) > + return false; > + We could make this a separate variable like: bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode))); Also add a comment that the non-constant case is handled by shuffle_decompress_patterns in case we have a HImode vector twice the size that can hold our indices. > /* MASK = SELECTOR < NUNTIS ? 1 : 0. */ Comment should go inside the if branch. > + if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 256)) > + } > + else > + { > + /* For EEW8 and NUNITS may be larger than 255, we can't use vmsltu > + directly to generate the selector mask, instead, we can only use > + precomputed mask. I find that comment a bit misleading as it's not the vmsltu itself but rather that the indices cannot be held. > + > + E.g. selector = <0, 257, 2, 259> for EEW8 vector with NUNITS = 256, we > + don't have a QImode scalar register to hold larger than 255. */ We also cannot hold that in a vector QImode register, and, since there is no larger HI mode vector we cannot create a larger selector. Also add this as comment: As the mask is a simple {0, 1, ...} pattern and the length is known we can store it in a scalar register and broadcast it to a mask register. > + gcc_assert (vec_len.is_constant ()); > + int size = CEIL (GET_MODE_NUNITS (mask_mode).to_constant (), 8); > + machine_mode mode = get_vector_mode (QImode, size).require (); > + rtx tmp = gen_reg_rtx (mode); > + rvv_builder v (mode, 1, size); > + for (int i = 0; i < vec_len.to_constant () / 8; i++) > + { > + uint8_t value = 0; > + for (int j = 0; j < 8; j++) > + { > + int index = i * 8 + j; > + if (known_lt (d->perm[index], 256)) > + value |= 1 << j; > + } I would have hoped that a simple v.quick_push (gen_int_mode (0b01010101, QImode)); suffices but that will probably clash if there are more than two npatterns. Regards Robin