Nowadays, just storing the (big-endian-corrected) vector element to the address generates exactly the same assembler for all cases except {float,int,uint}64x1_t, where

    st1 {v0.d}[0], [x0]

becomes

    str d0, [x0]

This is not a problem, and the change will be much better for optimization through the midend, as well as making use of previous improvements in error reporting.

Also move the /* vst1q */ comment, which was a couple of intrinsics too late.

gcc/ChangeLog:

	* config/aarch64/arm_neon.h (vst1_lane_f32, vst1_lane_f64,
	vst1_lane_p8, vst1_lane_p16, vst1_lane_s8, vst1_lane_s16,
	vst1_lane_s32, vst1_lane_s64, vst1_lane_u8, vst1_lane_u16,
	vst1_lane_u32, vst1_lane_u64, vst1q_lane_f32, vst1q_lane_f64,
	vst1q_lane_p8, vst1q_lane_p16, vst1q_lane_s8, vst1q_lane_s16,
	vst1q_lane_s32, vst1q_lane_s64, vst1q_lane_u8, vst1q_lane_u16,
	vst1q_lane_u32, vst1q_lane_u64): Reimplement with pointer
	dereference and __aarch64_vget_lane_any.

Cross-tested check-gcc on aarch64-none-elf and aarch64_be-none-elf.

Ok for trunk?

Cheers, Alan
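
For reference, a minimal sketch of the new pattern, shown here for vst1_lane_f32 only (the exact patch text may differ slightly; __aarch64_vget_lane_any is the existing arm_neon.h helper that applies the big-endian lane-index correction):

    __extension__ static __inline void __attribute__ ((__always_inline__))
    vst1_lane_f32 (float32_t *__a, float32x2_t __b, const int __lane)
    {
      /* Extract the requested lane (big-endian corrected) and store it
         through the pointer, so the midend sees a plain scalar store.  */
      *__a = __aarch64_vget_lane_any (__b, __lane);
    }

Each of the other vst1_lane/vst1q_lane variants follows the same shape with its own element and vector types.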