RVV also doesn't have a two-input permutation instructions (unlike ARM SVE has tbl instructions) and RVV needs about 4 instructions to handle this permutation, it still improve performance a lot. I think backend should handle this. Because this first-order recurrence loop vectorizer always generates the special permuation index = [vl-1, vl, vl+1,. ......] (This index sequence pattern is just following LLVM). If the backend doesn't want this permuation happens, just recognize this index pattern and disable it. juzhe.zhong@rivai.ai From: Andrew Stubbs Date: 2022-10-10 21:57 To: Richard Biener; gcc-patches@gcc.gnu.org CC: richard.sandiford@arm.com; juzhe.zhong@rivai.ai Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences On 10/10/2022 12:03, Richard Biener wrote: > The following picks up the prototype by Ju-Zhe Zhong for vectorizing > first order recurrences. That solves two TSVC missed optimization PRs. > > There's a new scalar cycle def kind, vect_first_order_recurrence > and it's handling of the backedge value vectorization is complicated > by the fact that the vectorized value isn't the PHI but instead > a (series of) permute(s) shifting in the recurring value from the > previous iteration. I've implemented this by creating both the > single vectorized PHI and the series of permutes when vectorizing > the scalar PHI but leave the backedge values in both unassigned. > The backedge values are (for the testcases) computed by a load > which is also the place after which the permutes are inserted. > That placement also restricts the cases we can handle (without > resorting to code motion). > > I added both costing and SLP handling though SLP handling is > restricted to the case where a single vectorized PHI is enough. > > Missing is epilogue handling - while prologue peeling would > be handled transparently by adjusting iv_phi_p the epilogue > case doesn't work with just inserting a scalar LC PHI since > that a) keeps the scalar load live and b) that loads is the > wrong one, it has to be the last, much like when we'd vectorize > the LC PHI as live operation. Unfortunately LIVE > compute/analysis happens too early before we decide on > peeling. When using fully masked loop vectorization the > vect-recurr-6.c works as expected though. > > I have tested this on x86_64 for now, but since epilogue > handling is missing there's probably no practical cases. > My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c > just fine but I didn't feel like running SPEC within SDE nor > is the WHILE_ULT patch complete enough. Builds of SPEC 2k7 > with fully masked loops succeed (minus three cases of > PR107096, caused by my WHILE_ULT prototype). > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > Testing with SVE, GCN or RVV appreciated, ideas how to cleanly > handle epilogues welcome. The testcases all produce correct code on GCN and pass the execution tests. The code isn't terribly optimal because we don't have a two-input permutation instruction, so we permute each half separately and vec_merge the results. In this case the first vector is always a no-op permutation so that's wasted cycles. We'd really want a vector rotate and write-lane (or the other way around). I think the special-case permutations can be recognised and coded into the backend, but I don't know if we can easily tell that the first vector is just a bunch of duplicates, when it's not constant. Andrew