The fix here (as noted in https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01345.html) changes the vector indices present in the RTL on big-endian for float vec_unpacks to be the same as for integer vec_unpacks. This appears consistent with the usage of VEC_UNPACK_(FLOAT_)?EXPR in tree-vect-stmts.c, which uses a different EXPR for the same half of the vector depending on endianness.

I was not able to construct a testcase where the RTL here mattered (i.e. where the RTL was constant-folded, but the tree had not been), but the correctness can be seen from a testcase:

  double d[4];
  void
  bar (float *f)
  {
    for (int i = 0; i < 4; i++)
      d[i] = f[i];
  }

which used to produce as final RTL (-O3)

(insn:TI 8 10 12 (set (reg:V2DF 33 v1 [orig:78 vect__9.19 ] [78])
        (float_extend:V2DF (vec_select:V2SF (reg:V4SF 32 v0 [orig:77 MEM[(float *)f_6(D)] ] [77])
                (parallel [
                        (const_int 2 [0x2])
                        (const_int 3 [0x3])
                    ])))) test.c:40 1274 {vec_unpacks_hi_v4sf}
     (expr_list:REG_EQUIV (mem/c:V2DF (reg/f:DI 0 x0 [79]) [2 MEM[(double *)&d]+0 S16 A64])
        (nil)))
(insn:TI 12 8 11 (set (reg:V2DF 32 v0 [orig:81 vect__9.19 ] [81])
        (float_extend:V2DF (vec_select:V2SF (reg:V4SF 32 v0 [orig:77 MEM[(float *)f_6(D)] ] [77])
                (parallel [
                        (const_int 0 [0])
                        (const_int 1 [0x1])
                    ])))) test.c:40 1272 {vec_unpacks_lo_v4sf}
     (expr_list:REG_EQUIV (mem/c:V2DF (plus:DI (reg/f:DI 0 x0 [79])
                (const_int 16 [0x10])) [2 MEM[(double *)&d + 16B]+0 S16 A64])
        (nil)))
(insn:TI 11 12 15 (set (mem/c:V2DF (reg/f:DI 0 x0 [79]) [2 MEM[(double *)&d]+0 S16 A64])
        (reg:V2DF 33 v1 [orig:78 vect__9.19 ] [78])) test.c:40 808 {*aarch64_simd_movv2df}
     (expr_list:REG_DEAD (reg:V2DF 33 v1 [orig:78 vect__9.19 ] [78])
        (nil)))
(insn:TI 15 11 22 (set (mem/c:V2DF (plus:DI (reg/f:DI 0 x0 [79])
                (const_int 16 [0x10])) [2 MEM[(double *)&d + 16B]+0 S16 A64])
        (reg:V2DF 32 v0 [orig:81 vect__9.19 ] [81])) test.c:40 808 {*aarch64_simd_movv2df}
     (expr_list:REG_DEAD (reg:V2DF 32 v0 [orig:81 vect__9.19 ] [81])

i.e.
apparently storing vector elements 2 and 3 to the address of d, and elems 0+1 to address (d+16). Of course this was flipped back again to be correct at assembly time, but following this patch the RTL indices are also correct (elems 0+1 to address d, elems 2+3 to address d+16).

gcc/ChangeLog:

        * config/aarch64/aarch64-simd.md (aarch64_simd_vec_unpacks_lo_,
        aarch64_simd_vec_unpacks_hi_): New insn.
        (vec_unpacks_lo_v4sf, vec_unpacks_hi_v4sf): Delete insn.
        (vec_unpacks_lo_, vec_unpacks_hi_): New expand.
        (aarch64_float_extend_lo_v2df): Rename to...
        (aarch64_float_extend_lo_): this, using VDF and so adding V4SF.

        * config/aarch64/aarch64-simd-builtins.def (vec_unpacks_hi): Add v8hf.
        (float_extend_lo): Add v4sf.

        * config/aarch64/arm_neon.h (vcvt_f32_f16, vcvt_high_f32_f16): New.

        * config/aarch64/iterators.md (VQ_HSF): New iterator.
        (VWIDE, Vwtype, Vhalftype): Add V8HF, V4SF.
        (Vwide): New mode_attr.
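
For reference, the element placement described above (elems 0+1 to address d, elems 2+3 to address d+16) can also be checked at run time. The following is only a sketch and is not part of the patch: bar is the loop from the testcase above, while check_bar, the input values and the noinline attribute are assumptions added for illustration (noinline is there to stop the conversion being folded away at -O3). Since the generated assembly was already correct before the patch, this is expected to pass both before and after; it guards the observable behaviour rather than the RTL indices.

```c
double d[4];

/* The loop from the testcase above.  The noinline attribute is an
   addition for this sketch, so that the conversion is not
   constant-folded into the caller.  */
__attribute__ ((noinline)) void
bar (float *f)
{
  for (int i = 0; i < 4; i++)
    d[i] = f[i];
}

/* Hypothetical helper: returns the number of elements of d that do not
   hold the widened value of the corresponding element of f.  */
int
check_bar (void)
{
  float f[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  int errors = 0;

  bar (f);
  for (int i = 0; i < 4; i++)
    if (d[i] != (double) f[i])
      errors++;
  return errors;
}
```

Calling check_bar () from main and aborting on a non-zero result would give an execution test; built with -O3 for a big-endian AArch64 target, it exercises the vec_unpacks_lo/hi path shown in the RTL above.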