Hi,

For the following test case:

int16x8_t foo(int16_t x, int16_t y)
{
  return (int16x8_t) { x, y, x, y, x, y, x, y };
}

Code-gen at -O3:

foo:
        dup     v0.8h, w0
        ins     v0.h[1], w1
        ins     v0.h[3], w1
        ins     v0.h[5], w1
        ins     v0.h[7], w1
        ret

For 16 elements, this results in 8 ins instructions, which might not be optimal.
I guess the above code-gen would be equivalent to the following?

        dup     v0.8h, w0
        dup     v1.8h, w1
        zip1    v0.8h, v0.8h, v1.8h

I have attached a patch to do the same if the number of elements is >= 8,
which should possibly be better than the current code-gen?

The patch passes bootstrap+test on aarch64-linux-gnu.
Does the patch look OK?

Thanks,
Prathamesh
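
P.S. As a sanity check on the equivalence asked about above, here is a small portable C model of the two instruction sequences. It is only a sketch: the helper names (model_dup, model_ins, model_zip1, sequences_agree, zip1_lane) are made up for illustration and are not part of the patch or of GCC, and the lane semantics are my reading of the dup, ins and zip1 instructions rather than anything the compiler emits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8 /* number of 16-bit lanes in a .8h vector */

/* Model of "dup Vd.8h, Wn": broadcast one scalar into every lane. */
static void model_dup(int16_t dst[LANES], int16_t v)
{
    for (int i = 0; i < LANES; i++)
        dst[i] = v;
}

/* Model of "ins Vd.h[lane], Wn": overwrite a single lane. */
static void model_ins(int16_t dst[LANES], int lane, int16_t v)
{
    assert(lane >= 0 && lane < LANES);
    dst[lane] = v;
}

/* Model of "zip1 Vd.8h, Vn.8h, Vm.8h": interleave the low halves
   of a and b, so dst = { a0, b0, a1, b1, a2, b2, a3, b3 }.  */
static void model_zip1(int16_t dst[LANES], const int16_t a[LANES],
                       const int16_t b[LANES])
{
    int16_t tmp[LANES]; /* temporary, so dst may alias a or b */
    for (int i = 0; i < LANES / 2; i++) {
        tmp[2 * i] = a[i];
        tmp[2 * i + 1] = b[i];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Current code-gen: one dup followed by four ins. */
static void seq_dup_ins(int16_t out[LANES], int16_t x, int16_t y)
{
    model_dup(out, x);
    for (int lane = 1; lane < LANES; lane += 2)
        model_ins(out, lane, y);
}

/* Proposed code-gen: two dups followed by one zip1. */
static void seq_dup_zip1(int16_t out[LANES], int16_t x, int16_t y)
{
    int16_t v0[LANES], v1[LANES];
    model_dup(v0, x);
    model_dup(v1, y);
    model_zip1(out, v0, v1);
}

/* Returns 1 if both sequences produce the same vector for (x, y). */
static int sequences_agree(int16_t x, int16_t y)
{
    int16_t a[LANES], b[LANES];
    seq_dup_ins(a, x, y);
    seq_dup_zip1(b, x, y);
    return memcmp(a, b, sizeof a) == 0;
}

/* Returns one lane of the dup + dup + zip1 result. */
static int16_t zip1_lane(int16_t x, int16_t y, int lane)
{
    int16_t out[LANES];
    seq_dup_zip1(out, x, y);
    return out[lane];
}
```

Under this model, both sequences give { x, y, x, y, x, y, x, y } for any x and y. It says nothing about latency or throughput, which is the actual question for the patch.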