Nowadays, just storing the (big-endian-corrected) vector element to the address generates exactly the same assembler for all cases except {float,int,uint}64x1_t, where

    st1 {v0.d}[0], [x0]

becomes

    str d0, [x0]

This is not a problem, and the change will be much better for optimization through the midend, as well as making use of previous improvements in error reporting.

Also move the /* vst1q */ comment, which was a couple of intrinsics too late.

gcc/ChangeLog:

	* config/aarch64/arm_neon.h (vst1_lane_f32, vst1_lane_f64,
	vst1_lane_p8, vst1_lane_p16, vst1_lane_s8, vst1_lane_s16,
	vst1_lane_s32, vst1_lane_s64, vst1_lane_u8, vst1_lane_u16,
	vst1_lane_u32, vst1_lane_u64, vst1q_lane_f32, vst1q_lane_f64,
	vst1q_lane_p8, vst1q_lane_p16, vst1q_lane_s8, vst1q_lane_s16,
	vst1q_lane_s32, vst1q_lane_s64, vst1q_lane_u8, vst1q_lane_u16,
	vst1q_lane_u32, vst1q_lane_u64): Reimplement with pointer
	dereference and __aarch64_vget_lane_any.

Cross-tested check-gcc on aarch64-none-elf and aarch64_be-none-elf.

Ok for trunk?

Cheers, Alan
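
For reference, a minimal sketch of the new pattern, shown here for vst1_lane_f32 only (the exact patch text may differ slightly; __aarch64_vget_lane_any is the existing arm_neon.h helper that applies the big-endian lane-index correction):

    __extension__ static __inline void __attribute__ ((__always_inline__))
    vst1_lane_f32 (float32_t *__a, float32x2_t __b, const int __lane)
    {
      /* Extract the requested lane (big-endian corrected) and store it
         through the pointer, so the midend sees a plain scalar store.  */
      *__a = __aarch64_vget_lane_any (__b, __lane);
    }

Each of the other vst1_lane/vst1q_lane variants follows the same shape with its own element and vector types.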