Hi Richard,
This is a reworking of the patch to extend fold_vec_perm to handle VLA vectors.
The attached patch unifies handling of VLS and VLA vector_csts, while
using fallback code for ctors.

For VLS vectors, the patch ignores the underlying encoding and
uses npatterns = nelts and nelts_per_pattern = 1.
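
To illustrate what this means for a fixed-length vector, here is a small
Python sketch (not code from the patch). The helpers decode and vls_view are
hypothetical names; decode follows my understanding of how an
(npatterns, nelts_per_pattern) encoding expands, and vls_view simply treats
every element as its own single-element pattern.

```python
def decode(encoded, npatterns, nelts_per_pattern, nelts):
    """Expand an (npatterns, nelts_per_pattern) encoding to nelts elements."""
    out = []
    for i in range(nelts):
        pattern, count = i % npatterns, i // npatterns
        if count < nelts_per_pattern:
            out.append(encoded[i])  # element is stored explicitly
            continue
        last = encoded[(nelts_per_pattern - 1) * npatterns + pattern]
        if nelts_per_pattern == 3:
            # Stepped pattern: continue with the step between the last
            # two encoded elements of this pattern.
            prev = encoded[(nelts_per_pattern - 2) * npatterns + pattern]
            out.append(last + (count - 2) * (last - prev))
        else:
            out.append(last)  # duplicate the last encoded element
    return out


def vls_view(elements):
    """VLS view: npatterns = nelts, nelts_per_pattern = 1."""
    return list(elements), len(elements), 1


fixed = [7, 1, 5, 3]
encoded, npatterns, nelts_per_pattern = vls_view(fixed)
roundtrip = decode(encoded, npatterns, nelts_per_pattern, len(fixed))
```

With this view the decode step never extrapolates, so the original encoding
of the fixed-length vector does not matter.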

For VLA vectors, if sel has a stepped sequence, then it
only chooses elements from a particular pattern of a particular
input vector.

To make things simpler, the patch imposes the following constraints:
(a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
(b) The step size for a stepped sequence is a power of 2, and a
      multiple of the npatterns of the chosen input vector.
(c) The runtime vector length of sel is a multiple of sel_npatterns.
     So, we don't handle sel.length = 2 + 2x and npatterns = 4.
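
The three constraints can be sketched as a single predicate. This is an
illustrative Python sketch, not the check in the patch: constraints_ok and
is_pow2 are hypothetical names, and a runtime length c0 + c1*x is modelled as
the pair (c0, c1). For (c), the length is a multiple of sel_npatterns for
every x exactly when both coefficients are.

```python
def is_pow2(n):
    return n > 0 and (n & (n - 1)) == 0


def constraints_ok(op0_npatterns, op1_npatterns, sel_npatterns,
                   sel_len, step=None, chosen_npatterns=None):
    """Check (a), (b), (c); step is None when sel has no stepped sequence."""
    # (a) all npatterns are powers of 2
    if not all(is_pow2(n)
               for n in (op0_npatterns, op1_npatterns, sel_npatterns)):
        return False
    # (b) step is a power of 2 and a multiple of the chosen input's npatterns
    if step is not None:
        if not is_pow2(step) or step % chosen_npatterns != 0:
            return False
    # (c) sel_len = c0 + c1*x is a multiple of sel_npatterns for all x
    c0, c1 = sel_len
    return c0 % sel_npatterns == 0 and c1 % sel_npatterns == 0


# The example below: npatterns = 2 everywhere, len = 16 + 16x, step 2.
ok_example = constraints_ok(2, 2, 2, (16, 16), step=2, chosen_npatterns=2)
# The rejected case from (c): sel.length = 2 + 2x, npatterns = 4.
rejected = constraints_ok(2, 2, 4, (2, 2))
```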

E.g.:
op0, op1: npatterns = 2, nelts_per_pattern = 3
op0_len = op1_len = 16 + 16x.
sel = { 0, 0, 2, 0, 4, 0, ... }
npatterns = 2, nelts_per_pattern = 3.

For the pattern {0, 2, 4, ...}, let:
a1 = 2 (the second element, where the stepped part starts)
S = step size = 2

Let Esel denote the number of elements per pattern in sel at runtime.
Esel = (16 + 16x) / npatterns_sel
     = (16 + 16x) / 2
     = (8 + 8x)

So, the last element of the pattern is:
ae = a1 + (Esel - 2) * S
   = 2 + (8 + 8x - 2) * 2
   = 14 + 16x

a1 /trunc op0_len = 2 / (16 + 16x) = 0
ae /trunc op0_len = (14 + 16x) / (16 + 16x) = 0
Since both quotients are equal (to 0), we select elements from op0.

Since the step size (S) is a multiple of npatterns(op0), we select
all elements from the same pattern of op0.
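
The arithmetic above can be replayed with a small Python sketch (again an
illustration, not the patch). A value c0 + c1*x is modelled as the pair
(c0, c1); poly_add, poly_scale, poly_eval and last_index are hypothetical
helpers. The quotients are checked by evaluating at a few concrete x, which
is a sanity check rather than a proof.

```python
def poly_add(a, b):
    return (a[0] + b[0], a[1] + b[1])


def poly_scale(a, k):
    return (a[0] * k, a[1] * k)


def poly_eval(a, x):
    return a[0] + a[1] * x


def last_index(a1, step, esel):
    """ae = a1 + (Esel - 2) * S, with a1 and Esel as (c0, c1) pairs."""
    return poly_add(a1, poly_scale(poly_add(esel, (-2, 0)), step))


op0_len = (16, 16)                     # 16 + 16x
sel_npatterns = 2
esel = (op0_len[0] // sel_npatterns,
        op0_len[1] // sel_npatterns)   # 8 + 8x
a1 = (2, 0)
ae = last_index(a1, 2, esel)           # 14 + 16x

# Quotients of a1 and ae by op0_len for a few runtime values of x.
quotients = {poly_eval(p, x) // poly_eval(op0_len, x)
             for p in (a1, ae) for x in range(8)}
```

Every sampled quotient is 0, which matches the conclusion that the whole
stepped sequence stays inside op0.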

res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
              = max (2, max (2, 2))
              = 2

res_nelts_per_pattern = max (op0_nelts_per_pattern,
                             max (op1_nelts_per_pattern,
                                  sel_nelts_per_pattern))
                      = max (3, max (3, 3))
                      = 3

So res has encoding with npatterns = 2, nelts_per_pattern = 3.
res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
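
As a cross-check of this result, the permute can be expanded at concrete
vector lengths. The Python sketch below is only an illustration: vec_perm and
expand_example are hypothetical helpers, elements are represented by their
names as strings, and sel indices are assumed to be in range for the
concatenation of op0 and op1.

```python
def vec_perm(op0, op1, sel):
    """res[i] = (op0 ++ op1)[sel[i]], assuming in-range indices."""
    both = op0 + op1
    return [both[s] for s in sel]


def expand_example(x):
    nelts = 16 + 16 * x
    op0 = [f"op0[{i}]" for i in range(nelts)]
    op1 = [f"op1[{i}]" for i in range(nelts)]
    # sel = { 0, 0, 2, 0, 4, 0, ... }: even positions hold the stepped
    # pattern {0, 2, 4, ...}, odd positions hold the constant pattern {0}.
    sel = [i if i % 2 == 0 else 0 for i in range(nelts)]
    return vec_perm(op0, op1, sel)


res_npatterns = max(2, max(2, 2))
res_nelts_per_pattern = max(3, max(3, 3))
# Leading npatterns * nelts_per_pattern elements at the smallest length.
res_encoded = expand_example(0)[:res_npatterns * res_nelts_per_pattern]
```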

Unfortunately, this results in an issue when the index is a poly_int_cst:
For example,
op0, op1: npatterns = 1, nelts_per_pattern = 3
op0_len = op1_len = 4 + 4x

sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1

In this case,
a1 = 5 + 4x
S = (6 + 4x) - (5 + 4x) = 1
Esel = 4 + 4x

ae = a1 + (Esel - 2) * S
   = (5 + 4x) + (4 + 4x - 2) * 1
   = 7 + 8x

IIUC, 7 + 8x will always be the index of the last element of op1?
if x = 0, len = 4, 7 + 8x = 7
if x = 1, len = 8, 7 + 8x = 15, etc.
So the stepped sequence will always choose elements
from op1 regardless of vector length in the above case?

However,
ae /trunc op0_len
= (7 + 8x) / (4 + 4x)
which is not defined because 7/4 != 8/4,
so we return NULL_TREE. I suppose the expected result would be:
res: { op1[0], op1[1], op1[2], ... }?
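
The mismatch can be made concrete with a small Python sketch. Both helpers
are hypothetical: quotient_by_coefficient models only the coefficient-wise
comparison described above (it is not a copy of the poly_int code), and
quotient_by_sampling evaluates the division at a few runtime values of x,
which is evidence rather than a proof. Values c0 + c1*x are modelled as
pairs (c0, c1).

```python
def quotient_by_coefficient(a, b):
    """Defined only if both coefficients give the same truncated quotient."""
    q0, q1 = a[0] // b[0], a[1] // b[1]
    return q0 if q0 == q1 else None


def quotient_by_sampling(a, b, xs=range(16)):
    """Quotient shared by all sampled x, or None if the samples disagree."""
    qs = {(a[0] + a[1] * x) // (b[0] + b[1] * x) for x in xs}
    return qs.pop() if len(qs) == 1 else None


op0_len = (4, 4)   # 4 + 4x
a1 = (5, 4)        # 5 + 4x
ae = (7, 8)        # 7 + 8x

a1_coeff = quotient_by_coefficient(a1, op0_len)
ae_coeff = quotient_by_coefficient(ae, op0_len)
ae_sampled = quotient_by_sampling(ae, op0_len)
```

Here a1_coeff is 1 and ae_coeff is None (7/4 = 1 but 8/4 = 2), while every
sampled x gives ae / op0_len = 1, i.e. an index into op1.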

The patch passes bootstrap+test on aarch64-linux-gnu with and without SVE,
and on x86_64-unknown-linux-gnu.
I would be grateful for suggestions on how to proceed.

Thanks,
Prathamesh