From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 483 invoked by alias); 13 Jan 2015 18:25:11 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 468 invoked by uid 89); 13 Jan 2015 18:25:10 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2 X-HELO: service87.mimecast.com Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 13 Jan 2015 18:25:07 +0000 Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by service87.mimecast.com; Tue, 13 Jan 2015 18:25:04 +0000 Received: from [10.1.209.51] ([10.1.255.212]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Tue, 13 Jan 2015 18:25:02 +0000 Message-ID: <54B562F4.9070803@arm.com> Date: Tue, 13 Jan 2015 18:39:00 -0000 From: Alan Lawrence User-Agent: Thunderbird 2.0.0.24 (X11/20101213) MIME-Version: 1.0 To: "gcc-patches@gcc.gnu.org" CC: Marcus Shawcroft Subject: [PATCH][AArch64 Intrinsics] Replace temporary assembler for vst1_lane X-MC-Unique: 115011318250400301 Content-Type: multipart/mixed; boundary="------------070303020901050204090804" X-IsSubscribed: yes X-SW-Source: 2015-01/txt/msg00914.txt.bz2 This is a multi-part message in MIME format. --------------070303020901050204090804 Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: quoted-printable Content-length: 1052 Nowadays, just storing the (bigendian-corrected) vector element to the addr= ess,=20 generates exactly the same assembler for all cases except=20 {float,int,uint}64x1_t, where st1 {v0.d}[0], [x0] becomes str d0, [x0] This is not a problem, and the change will be much better for optimization= =20 through the midend, as well as making use of previous improvements in error= =20 reporting. Also move the /* vst1q */ comment, which was a couple intrinsics too late. gcc/ChangeLog: * config/aarch64/arm_neon.h (vst1_lane_f32, vst1_lane_f64, vst1_lane_p8, vst1_lane_p16, vst1_lane_s8, vst1_lane_s16, vst1_lane_s32, vst1_lane_s64, vst1_lane_u8, vst1_lane_u16, vst1_lane_u32, vst1_lane_u64, vst1q_lane_f32, vst1q_lane_f64, vst1q_lane_p8, vst1q_lane_p16, vst1q_lane_s8, vst1q_lane_s16, vst1q_lane_s32, vst1q_lane_s64, vst1q_lane_u8, vst1q_lane_u16, vst1q_lane_u32, vst1q_lane_u64): Reimplement with pointer dereference and __aarch64_vget_lane_any. Cross-tested check-gcc on aarch64-none-elf and aarch64_be-none-elf. Ok for trunk? Cheers, Alan= --------------070303020901050204090804 Content-Type: text/x-patch; name=vst1_lane.patch Content-Transfer-Encoding: quoted-printable Content-Disposition: inline; filename="vst1_lane.patch" Content-length: 22717 commit 926aec661699e52f617f16068075ef0242a43609 Author: Alan Lawrence Date: Thu Dec 11 17:29:54 2014 +0000 Replace temporary inline assembler for vst1_lane, move /* vst1q */ comm= ent. =20=20=20=20 Note for (float|u?int)64x1 vectors, st1 {v0.d}[0], [x0] becomes str d0, [x0] diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h index 3d1bcd5..980490f 100644 --- a/gcc/config/aarch64/arm_neon.h +++ b/gcc/config/aarch64/arm_neon.h @@ -10304,272 +10304,6 @@ vrsqrtss_f32 (float32_t a, float32_t b) result; \ }) =20 -#define vst1_lane_f32(a, b, c) \ - __extension__ \ - ({ \ - float32x2_t b_ =3D (b); \ - float32_t * a_ =3D (a); \ - __asm__ ("st1 {%1.s}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_f64(a, b, c) \ - __extension__ \ - ({ \ - float64x1_t b_ =3D (b); \ - float64_t * a_ =3D (a); \ - __asm__ ("st1 {%1.d}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_p8(a, b, c) \ - __extension__ \ - ({ \ - poly8x8_t b_ =3D (b); \ - poly8_t * a_ =3D (a); \ - __asm__ ("st1 {%1.b}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_p16(a, b, c) \ - __extension__ \ - ({ \ - poly16x4_t b_ =3D (b); \ - poly16_t * a_ =3D (a); \ - __asm__ ("st1 {%1.h}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_s8(a, b, c) \ - __extension__ \ - ({ \ - int8x8_t b_ =3D (b); \ - int8_t * a_ =3D (a); \ - __asm__ ("st1 {%1.b}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_s16(a, b, c) \ - __extension__ \ - ({ \ - int16x4_t b_ =3D (b); \ - int16_t * a_ =3D (a); \ - __asm__ ("st1 {%1.h}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_s32(a, b, c) \ - __extension__ \ - ({ \ - int32x2_t b_ =3D (b); \ - int32_t * a_ =3D (a); \ - __asm__ ("st1 {%1.s}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_s64(a, b, c) \ - __extension__ \ - ({ \ - int64x1_t b_ =3D (b); \ - int64_t * a_ =3D (a); \ - __asm__ ("st1 {%1.d}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_u8(a, b, c) \ - __extension__ \ - ({ \ - uint8x8_t b_ =3D (b); \ - uint8_t * a_ =3D (a); \ - __asm__ ("st1 {%1.b}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_u16(a, b, c) \ - __extension__ \ - ({ \ - uint16x4_t b_ =3D (b); \ - uint16_t * a_ =3D (a); \ - __asm__ ("st1 {%1.h}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_u32(a, b, c) \ - __extension__ \ - ({ \ - uint32x2_t b_ =3D (b); \ - uint32_t * a_ =3D (a); \ - __asm__ ("st1 {%1.s}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1_lane_u64(a, b, c) \ - __extension__ \ - ({ \ - uint64x1_t b_ =3D (b); \ - uint64_t * a_ =3D (a); \ - __asm__ ("st1 {%1.d}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - - -#define vst1q_lane_f32(a, b, c) \ - __extension__ \ - ({ \ - float32x4_t b_ =3D (b); \ - float32_t * a_ =3D (a); \ - __asm__ ("st1 {%1.s}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_f64(a, b, c) \ - __extension__ \ - ({ \ - float64x2_t b_ =3D (b); \ - float64_t * a_ =3D (a); \ - __asm__ ("st1 {%1.d}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_p8(a, b, c) \ - __extension__ \ - ({ \ - poly8x16_t b_ =3D (b); \ - poly8_t * a_ =3D (a); \ - __asm__ ("st1 {%1.b}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_p16(a, b, c) \ - __extension__ \ - ({ \ - poly16x8_t b_ =3D (b); \ - poly16_t * a_ =3D (a); \ - __asm__ ("st1 {%1.h}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_s8(a, b, c) \ - __extension__ \ - ({ \ - int8x16_t b_ =3D (b); \ - int8_t * a_ =3D (a); \ - __asm__ ("st1 {%1.b}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_s16(a, b, c) \ - __extension__ \ - ({ \ - int16x8_t b_ =3D (b); \ - int16_t * a_ =3D (a); \ - __asm__ ("st1 {%1.h}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_s32(a, b, c) \ - __extension__ \ - ({ \ - int32x4_t b_ =3D (b); \ - int32_t * a_ =3D (a); \ - __asm__ ("st1 {%1.s}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_s64(a, b, c) \ - __extension__ \ - ({ \ - int64x2_t b_ =3D (b); \ - int64_t * a_ =3D (a); \ - __asm__ ("st1 {%1.d}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_u8(a, b, c) \ - __extension__ \ - ({ \ - uint8x16_t b_ =3D (b); \ - uint8_t * a_ =3D (a); \ - __asm__ ("st1 {%1.b}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_u16(a, b, c) \ - __extension__ \ - ({ \ - uint16x8_t b_ =3D (b); \ - uint16_t * a_ =3D (a); \ - __asm__ ("st1 {%1.h}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_u32(a, b, c) \ - __extension__ \ - ({ \ - uint32x4_t b_ =3D (b); \ - uint32_t * a_ =3D (a); \ - __asm__ ("st1 {%1.s}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - -#define vst1q_lane_u64(a, b, c) \ - __extension__ \ - ({ \ - uint64x2_t b_ =3D (b); \ - uint64_t * a_ =3D (a); \ - __asm__ ("st1 {%1.d}[%2],[%0]" \ - : \ - : "r"(a_), "w"(b_), "i"(c) \ - : "memory"); \ - }) - - __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__)) vtst_p8 (poly8x8_t a, poly8x8_t b) { @@ -22498,6 +22232,8 @@ vst1_u64 (uint64_t *a, uint64x1_t b) *a =3D b[0]; } =20 +/* vst1q */ + __extension__ static __inline void __attribute__ ((__always_inline__)) vst1q_f32 (float32_t *a, float32x4_t b) { @@ -22510,8 +22246,6 @@ vst1q_f64 (float64_t *a, float64x2_t b) __builtin_aarch64_st1v2df ((__builtin_aarch64_simd_df *) a, b); } =20 -/* vst1q */ - __extension__ static __inline void __attribute__ ((__always_inline__)) vst1q_p8 (poly8_t *a, poly8x16_t b) { @@ -22578,6 +22312,154 @@ vst1q_u64 (uint64_t *a, uint64x2_t b) (int64x2_t) b); } =20 +/* vst1_lane */ + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_f32 (float32_t *__a, float32x2_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_f64 (float64_t *__a, float64x1_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_p8 (poly8_t *__a, poly8x8_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_p16 (poly16_t *__a, poly16x4_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_s8 (int8_t *__a, int8x8_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_s16 (int16_t *__a, int16x4_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_s32 (int32_t *__a, int32x2_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_s64 (int64_t *__a, int64x1_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_u8 (uint8_t *__a, uint8x8_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_u16 (uint16_t *__a, uint16x4_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_u32 (uint32_t *__a, uint32x2_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1_lane_u64 (uint64_t *__a, uint64x1_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +/* vst1q_lane */ + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_f32 (float32_t *__a, float32x4_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_f64 (float64_t *__a, float64x2_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_p8 (poly8_t *__a, poly8x16_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_p16 (poly16_t *__a, poly16x8_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_s8 (int8_t *__a, int8x16_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_s16 (int16_t *__a, int16x8_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_s32 (int32_t *__a, int32x4_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_s64 (int64_t *__a, int64x2_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_u8 (uint8_t *__a, uint8x16_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_u16 (uint16_t *__a, uint16x8_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_u32 (uint32_t *__a, uint32x4_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + +__extension__ static __inline void __attribute__ ((__always_inline__)) +vst1q_lane_u64 (uint64_t *__a, uint64x2_t __b, const int __lane) +{ + *__a =3D __aarch64_vget_lane_any (__b, __lane); +} + /* vstn */ =20 __extension__ static __inline void= --------------070303020901050204090804--