* [PATCH 0/3][AArch64]More intrinsics/builtins improvements
@ 2014-11-14 10:46 Alan Lawrence
2014-11-14 10:46 ` [PATCH 2/3][AArch64] Extend aarch64_simd_vec_set pattern, replace asm for vld1_lane Alan Lawrence
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Alan Lawrence @ 2014-11-14 10:46 UTC (permalink / raw)
To: gcc-patches
These three patches are logically independent, but all on a common theme, and I've
tested them all together:
bootstrapped + check-gcc on aarch64-none-elf
cross-tested check-gcc on aarch64_be-none-elf
Ok for trunk?
* [PATCH 2/3][AArch64] Extend aarch64_simd_vec_set pattern, replace asm for vld1_lane
2014-11-14 10:46 [PATCH 0/3][AArch64]More intrinsics/builtins improvements Alan Lawrence
@ 2014-11-14 10:46 ` Alan Lawrence
2014-11-17 17:36 ` Marcus Shawcroft
2014-11-14 10:46 ` [PATCH 1/3][AArch64]Replace __builtin_aarch64_createv1df with a cast, cleanup Alan Lawrence
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Alan Lawrence @ 2014-11-14 10:46 UTC (permalink / raw)
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 1478 bytes --]
The vld1_lane intrinsic is currently implemented using inline asm. This patch
replaces that with a load and a straightforward use of vset_lane (this gives us
correct big-endian lane-flipping in a simple manner).
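For reference, here is a minimal sketch (a hypothetical helper, not part of the
patch) of the lane-number mapping that the vset_lane path and the pattern's
ENDIAN_LANE_N call both implement:

/* Architectural lane i of an N-element vector corresponds to GCC vector
   index i on little-endian, but to (N - 1 - i) on big-endian.  */
static inline int
gcc_lane_number (int arch_lane, int nelts)
{
#ifdef __AARCH64EB__
  return nelts - 1 - arch_lane;
#else
  return arch_lane;
#endif
}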
Naively, the vset_lane approach would produce assembler along the lines of (for vld1_lane_u8):
ldrb w0, [x0]
ins v0.b[5], w0
Hence, the patch also extends the aarch64_simd_vec_set pattern, adding a variant
that reads from a memory operand, producing the expected:
ld1 {v0.b}[5], [x0]
...and thus we'll also get that assembler from a programmer writing natively in
GCC vector extensions and not using intrinsics :).
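As a hedged illustration (hypothetical function and type names, not from the
patch), such native vector-extension code might look like:

#include <stdint.h>

typedef uint8_t vec8x8 __attribute__ ((vector_size (8)));

vec8x8
set_lane5_from_mem (vec8x8 v, const uint8_t *p)
{
  v[5] = *p;   /* With the new pattern variant: ld1 {v0.b}[5], [x0].  */
  return v;
}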
I've also added a testcase, as existing tests in aarch64 and advsimd-intrinsics
seemed only to cover vld{2,3,4}_lane, not vld1_lane.
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (aarch64_simd_vec_set<mode>): Add
variant reading from memory and assembling to ld1.
* config/aarch64/arm_neon.h (vld1_lane_f32, vld1_lane_f64, vld1_lane_p8,
vld1_lane_p16, vld1_lane_s8, vld1_lane_s16, vld1_lane_s32,
vld1_lane_s64, vld1_lane_u8, vld1_lane_u16, vld1_lane_u32,
vld1_lane_u64, vld1q_lane_f32, vld1q_lane_f64, vld1q_lane_p8,
vld1q_lane_p16, vld1q_lane_s8, vld1q_lane_s16, vld1q_lane_s32,
vld1q_lane_s64, vld1q_lane_u8, vld1q_lane_u16, vld1q_lane_u32,
vld1q_lane_u64): Replace asm with vset_lane and pointer dereference.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/vld1_lane.c: New test.
[-- Attachment #2: vld1_lane.patch --]
[-- Type: text/x-patch; name=vld1_lane.patch, Size: 31509 bytes --]
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 00b59d3a352325e77632daa9723f3df4850cf922..b77a4f831c44df9df8fac609216ee3c501e0e54a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -455,12 +455,12 @@
)
(define_insn "aarch64_simd_vec_set<mode>"
- [(set (match_operand:VQ_S 0 "register_operand" "=w,w")
+ [(set (match_operand:VQ_S 0 "register_operand" "=w,w,w")
(vec_merge:VQ_S
(vec_duplicate:VQ_S
- (match_operand:<VEL> 1 "register_operand" "r,w"))
- (match_operand:VQ_S 3 "register_operand" "0,0")
- (match_operand:SI 2 "immediate_operand" "i,i")))]
+ (match_operand:<VEL> 1 "aarch64_simd_general_operand" "r,w,Utv"))
+ (match_operand:VQ_S 3 "register_operand" "0,0,0")
+ (match_operand:SI 2 "immediate_operand" "i,i,i")))]
"TARGET_SIMD"
{
int elt = ENDIAN_LANE_N (<MODE>mode, exact_log2 (INTVAL (operands[2])));
@@ -471,11 +471,13 @@
return "ins\\t%0.<Vetype>[%p2], %w1";
case 1:
return "ins\\t%0.<Vetype>[%p2], %1.<Vetype>[0]";
+ case 2:
+ return "ld1\\t{%0.<Vetype>}[%p2], %1";
default:
gcc_unreachable ();
}
}
- [(set_attr "type" "neon_from_gp<q>, neon_ins<q>")]
+ [(set_attr "type" "neon_from_gp<q>, neon_ins<q>, neon_load1_1reg<q>")]
)
(define_insn "aarch64_simd_lshr<mode>"
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 4a0d718642f8a3cb56281a70435b1b6445ee35be..f036f7c0ba2733a822661027b815e7c3654db1bc 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -628,7 +628,7 @@ typedef struct poly16x8x4_t
#define __aarch64_vdupq_laneq_u64(__a, __b) \
__aarch64_vdup_lane_any (u64, q, q, __a, __b)
-/* vset_lane internal macro. */
+/* vset_lane and vld1_lane internal macro. */
#ifdef __AARCH64EB__
/* For big-endian, GCC's vector indices are the opposite way around
@@ -6275,162 +6275,6 @@ vld1_dup_u64 (const uint64_t * a)
return result;
}
-#define vld1_lane_f32(a, b, c) \
- __extension__ \
- ({ \
- float32x2_t b_ = (b); \
- const float32_t * a_ = (a); \
- float32x2_t result; \
- __asm__ ("ld1 {%0.s}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_f64(a, b, c) \
- __extension__ \
- ({ \
- float64x1_t b_ = (b); \
- const float64_t * a_ = (a); \
- float64x1_t result; \
- __asm__ ("ld1 {%0.d}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_p8(a, b, c) \
- __extension__ \
- ({ \
- poly8x8_t b_ = (b); \
- const poly8_t * a_ = (a); \
- poly8x8_t result; \
- __asm__ ("ld1 {%0.b}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_p16(a, b, c) \
- __extension__ \
- ({ \
- poly16x4_t b_ = (b); \
- const poly16_t * a_ = (a); \
- poly16x4_t result; \
- __asm__ ("ld1 {%0.h}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_s8(a, b, c) \
- __extension__ \
- ({ \
- int8x8_t b_ = (b); \
- const int8_t * a_ = (a); \
- int8x8_t result; \
- __asm__ ("ld1 {%0.b}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_s16(a, b, c) \
- __extension__ \
- ({ \
- int16x4_t b_ = (b); \
- const int16_t * a_ = (a); \
- int16x4_t result; \
- __asm__ ("ld1 {%0.h}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_s32(a, b, c) \
- __extension__ \
- ({ \
- int32x2_t b_ = (b); \
- const int32_t * a_ = (a); \
- int32x2_t result; \
- __asm__ ("ld1 {%0.s}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_s64(a, b, c) \
- __extension__ \
- ({ \
- int64x1_t b_ = (b); \
- const int64_t * a_ = (a); \
- int64x1_t result; \
- __asm__ ("ld1 {%0.d}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_u8(a, b, c) \
- __extension__ \
- ({ \
- uint8x8_t b_ = (b); \
- const uint8_t * a_ = (a); \
- uint8x8_t result; \
- __asm__ ("ld1 {%0.b}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_u16(a, b, c) \
- __extension__ \
- ({ \
- uint16x4_t b_ = (b); \
- const uint16_t * a_ = (a); \
- uint16x4_t result; \
- __asm__ ("ld1 {%0.h}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_u32(a, b, c) \
- __extension__ \
- ({ \
- uint32x2_t b_ = (b); \
- const uint32_t * a_ = (a); \
- uint32x2_t result; \
- __asm__ ("ld1 {%0.s}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1_lane_u64(a, b, c) \
- __extension__ \
- ({ \
- uint64x1_t b_ = (b); \
- const uint64_t * a_ = (a); \
- uint64x1_t result; \
- __asm__ ("ld1 {%0.d}[%1], %2" \
- : "=w"(result) \
- : "i" (c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vld1q_dup_f32 (const float32_t * a)
{
@@ -6563,162 +6407,6 @@ vld1q_dup_u64 (const uint64_t * a)
return result;
}
-#define vld1q_lane_f32(a, b, c) \
- __extension__ \
- ({ \
- float32x4_t b_ = (b); \
- const float32_t * a_ = (a); \
- float32x4_t result; \
- __asm__ ("ld1 {%0.s}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_f64(a, b, c) \
- __extension__ \
- ({ \
- float64x2_t b_ = (b); \
- const float64_t * a_ = (a); \
- float64x2_t result; \
- __asm__ ("ld1 {%0.d}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_p8(a, b, c) \
- __extension__ \
- ({ \
- poly8x16_t b_ = (b); \
- const poly8_t * a_ = (a); \
- poly8x16_t result; \
- __asm__ ("ld1 {%0.b}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_p16(a, b, c) \
- __extension__ \
- ({ \
- poly16x8_t b_ = (b); \
- const poly16_t * a_ = (a); \
- poly16x8_t result; \
- __asm__ ("ld1 {%0.h}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_s8(a, b, c) \
- __extension__ \
- ({ \
- int8x16_t b_ = (b); \
- const int8_t * a_ = (a); \
- int8x16_t result; \
- __asm__ ("ld1 {%0.b}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_s16(a, b, c) \
- __extension__ \
- ({ \
- int16x8_t b_ = (b); \
- const int16_t * a_ = (a); \
- int16x8_t result; \
- __asm__ ("ld1 {%0.h}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_s32(a, b, c) \
- __extension__ \
- ({ \
- int32x4_t b_ = (b); \
- const int32_t * a_ = (a); \
- int32x4_t result; \
- __asm__ ("ld1 {%0.s}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_s64(a, b, c) \
- __extension__ \
- ({ \
- int64x2_t b_ = (b); \
- const int64_t * a_ = (a); \
- int64x2_t result; \
- __asm__ ("ld1 {%0.d}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_u8(a, b, c) \
- __extension__ \
- ({ \
- uint8x16_t b_ = (b); \
- const uint8_t * a_ = (a); \
- uint8x16_t result; \
- __asm__ ("ld1 {%0.b}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_u16(a, b, c) \
- __extension__ \
- ({ \
- uint16x8_t b_ = (b); \
- const uint16_t * a_ = (a); \
- uint16x8_t result; \
- __asm__ ("ld1 {%0.h}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_u32(a, b, c) \
- __extension__ \
- ({ \
- uint32x4_t b_ = (b); \
- const uint32_t * a_ = (a); \
- uint32x4_t result; \
- __asm__ ("ld1 {%0.s}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
-#define vld1q_lane_u64(a, b, c) \
- __extension__ \
- ({ \
- uint64x2_t b_ = (b); \
- const uint64_t * a_ = (a); \
- uint64x2_t result; \
- __asm__ ("ld1 {%0.d}[%1], %2" \
- : "=w"(result) \
- : "i"(c), "Utv"(*a_), "0"(b_) \
- : /* No clobbers */); \
- result; \
- })
-
__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
vmla_n_f32 (float32x2_t a, float32x2_t b, float32_t c)
{
@@ -16454,6 +16142,154 @@ vld1q_u64 (const uint64_t *a)
__builtin_aarch64_ld1v2di ((const __builtin_aarch64_simd_di *) a);
}
+/* vld1_lane */
+
+__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
+vld1_lane_f32 (const float32_t *__src, float32x2_t __vec, const int __lane)
+{
+ return vset_lane_f32 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline float64x1_t __attribute__ ((__always_inline__))
+vld1_lane_f64 (const float64_t *__src, float64x1_t __vec, const int __lane)
+{
+ return vset_lane_f64 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
+vld1_lane_p8 (const poly8_t *__src, poly8x8_t __vec, const int __lane)
+{
+ return vset_lane_p8 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline poly16x4_t __attribute__ ((__always_inline__))
+vld1_lane_p16 (const poly16_t *__src, poly16x4_t __vec, const int __lane)
+{
+ return vset_lane_p16 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
+vld1_lane_s8 (const int8_t *__src, int8x8_t __vec, const int __lane)
+{
+ return vset_lane_s8 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
+vld1_lane_s16 (const int16_t *__src, int16x4_t __vec, const int __lane)
+{
+ return vset_lane_s16 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
+vld1_lane_s32 (const int32_t *__src, int32x2_t __vec, const int __lane)
+{
+ return vset_lane_s32 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
+vld1_lane_s64 (const int64_t *__src, int64x1_t __vec, const int __lane)
+{
+ return vset_lane_s64 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
+vld1_lane_u8 (const uint8_t *__src, uint8x8_t __vec, const int __lane)
+{
+ return vset_lane_u8 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
+vld1_lane_u16 (const uint16_t *__src, uint16x4_t __vec, const int __lane)
+{
+ return vset_lane_u16 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
+vld1_lane_u32 (const uint32_t *__src, uint32x2_t __vec, const int __lane)
+{
+ return vset_lane_u32 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
+vld1_lane_u64 (const uint64_t *__src, uint64x1_t __vec, const int __lane)
+{
+ return vset_lane_u64 (*__src, __vec, __lane);
+}
+
+/* vld1q_lane */
+
+__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
+vld1q_lane_f32 (const float32_t *__src, float32x4_t __vec, const int __lane)
+{
+ return vsetq_lane_f32 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
+vld1q_lane_f64 (const float64_t *__src, float64x2_t __vec, const int __lane)
+{
+ return vsetq_lane_f64 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
+vld1q_lane_p8 (const poly8_t *__src, poly8x16_t __vec, const int __lane)
+{
+ return vsetq_lane_p8 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline poly16x8_t __attribute__ ((__always_inline__))
+vld1q_lane_p16 (const poly16_t *__src, poly16x8_t __vec, const int __lane)
+{
+ return vsetq_lane_p16 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
+vld1q_lane_s8 (const int8_t *__src, int8x16_t __vec, const int __lane)
+{
+ return vsetq_lane_s8 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
+vld1q_lane_s16 (const int16_t *__src, int16x8_t __vec, const int __lane)
+{
+ return vsetq_lane_s16 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
+vld1q_lane_s32 (const int32_t *__src, int32x4_t __vec, const int __lane)
+{
+ return vsetq_lane_s32 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
+vld1q_lane_s64 (const int64_t *__src, int64x2_t __vec, const int __lane)
+{
+ return vsetq_lane_s64 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
+vld1q_lane_u8 (const uint8_t *__src, uint8x16_t __vec, const int __lane)
+{
+ return vsetq_lane_u8 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
+vld1q_lane_u16 (const uint16_t *__src, uint16x8_t __vec, const int __lane)
+{
+ return vsetq_lane_u16 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
+vld1q_lane_u32 (const uint32_t *__src, uint32x4_t __vec, const int __lane)
+{
+ return vsetq_lane_u32 (*__src, __vec, __lane);
+}
+
+__extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
+vld1q_lane_u64 (const uint64_t *__src, uint64x2_t __vec, const int __lane)
+{
+ return vsetq_lane_u64 (*__src, __vec, __lane);
+}
+
/* vldn */
__extension__ static __inline int64x1x2_t __attribute__ ((__always_inline__))
diff --git a/gcc/testsuite/gcc.target/aarch64/vld1_lane.c b/gcc/testsuite/gcc.target/aarch64/vld1_lane.c
new file mode 100644
index 0000000000000000000000000000000000000000..c2445f8df53034027051722155a40161b86574bb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vld1_lane.c
@@ -0,0 +1,85 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -fno-inline" } */
+
+#include <arm_neon.h>
+
+extern void abort (void);
+
+#define VARIANTS(VARIANT) \
+VARIANT (uint8, , 8, _u8, 5) \
+VARIANT (uint16, , 4, _u16, 3) \
+VARIANT (uint32, , 2, _u32, 1) \
+VARIANT (uint64, , 1, _u64, 0) \
+VARIANT (int8, , 8, _s8, 3) \
+VARIANT (int16, , 4, _s16, 2) \
+VARIANT (int32, , 2, _s32, 0) \
+VARIANT (int64, , 1, _s64, 0) \
+VARIANT (poly8, , 8, _p8, 7) \
+VARIANT (poly16, , 4, _p16, 2) \
+VARIANT (float32, , 2, _f32, 1) \
+VARIANT (float64, , 1, _f64, 0) \
+VARIANT (uint8, q, 16, _u8, 13) \
+VARIANT (uint16, q, 8, _u16, 5) \
+VARIANT (uint32, q, 4, _u32, 1) \
+VARIANT (uint64, q, 2, _u64, 0) \
+VARIANT (int8, q, 16, _s8, 15) \
+VARIANT (int16, q, 8, _s16, 3) \
+VARIANT (int32, q, 4, _s32, 1) \
+VARIANT (int64, q, 2, _s64, 1) \
+VARIANT (poly8, q, 16, _p8, 7) \
+VARIANT (poly16, q, 8, _p16, 4) \
+VARIANT (float32, q, 4, _f32, 2)\
+VARIANT (float64, q, 2, _f64, 1)
+
+#define TESTMETH(BASE, Q, ELTS, SUFFIX, LANE) \
+__attribute__((noinline)) BASE##x##ELTS##_t \
+wrap_vld1##Q##_lane##SUFFIX (const BASE##_t *load, \
+ BASE##x##ELTS##_t vec) \
+{ return vld1##Q##_lane##SUFFIX (load, vec, LANE); } \
+int \
+test_vld1##Q##_lane##SUFFIX (const BASE##_t *data, \
+ const BASE##_t *overwrite) \
+{ \
+ BASE##_t out[ELTS]; \
+ int j; \
+ BASE##x##ELTS##_t in = vld1##Q##SUFFIX (data); \
+ in = wrap_vld1##Q##_lane##SUFFIX (overwrite, in); \
+ vst1##Q##SUFFIX (out, in); \
+ for (j = 0; j < ELTS; j++) \
+ if (out[j] != (j == LANE ? *overwrite : data[j])) \
+ return 1; \
+ return 0; \
+}
+
+
+VARIANTS (TESTMETH)
+
+#define CHECK(BASE, Q, ELTS, SUFFIX, LANE) \
+ if (test_vld1##Q##_lane##SUFFIX ((const BASE##_t *)orig_data, \
+ BASE##_data) != 0) \
+ abort ();
+
+int
+main (int argc, char **argv)
+{
+ /* Original data for all vector formats. */
+ uint64_t orig_data[2] = {0x1234567890abcdefULL, 0x13579bdf02468aceULL};
+
+ /* Data with which vldN_lane will overwrite some of previous. */
+ uint8_t uint8_data[4] = { 7, 11, 13, 17 };
+ uint16_t uint16_data[4] = { 257, 263, 269, 271 };
+ uint32_t uint32_data[4] = { 65537, 65539, 65543, 65551 };
+ uint64_t uint64_data[4] = { 0xdeadbeefcafebabeULL, 0x0123456789abcdefULL,
+ 0xfedcba9876543210LL, 0xdeadbabecafebeefLL };
+ int8_t int8_data[4] = { -1, 3, -5, 7 };
+ int16_t int16_data[4] = { 257, -259, 261, -263 };
+ int32_t int32_data[4] = { 123456789, -987654321, -135792468, 975318642 };
+ int64_t *int64_data = (int64_t *)uint64_data;
+ poly8_t poly8_data[4] = { 0, 7, 13, 18, };
+ poly16_t poly16_data[4] = { 11111, 2222, 333, 44 };
+ float32_t float32_data[4] = { 3.14159, 2.718, 1.414, 100.0 };
+ float64_t float64_data[4] = { 1.010010001, 12345.6789, -9876.54321, 1.618 };
+
+ VARIANTS (CHECK);
+ return 0;
+}
* [PATCH 1/3][AArch64]Replace __builtin_aarch64_createv1df with a cast, cleanup
2014-11-14 10:46 [PATCH 0/3][AArch64]More intrinsics/builtins improvements Alan Lawrence
2014-11-14 10:46 ` [PATCH 2/3][AArch64] Extend aarch64_simd_vec_set pattern, replace asm for vld1_lane Alan Lawrence
@ 2014-11-14 10:46 ` Alan Lawrence
2014-11-17 17:35 ` Marcus Shawcroft
2014-11-14 10:50 ` [PATCH 3/3][AArch64]Replace temporary assembler for vld1_dup Alan Lawrence
2014-11-17 2:55 ` [PATCH 0/3][AArch64]More intrinsics/builtins improvements Yangfei (Felix)
3 siblings, 1 reply; 10+ messages in thread
From: Alan Lawrence @ 2014-11-14 10:46 UTC (permalink / raw)
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 1167 bytes --]
Now that float64x1_t is a vector, casting to it from a uint64_t causes the bit
pattern to be reinterpreted, just as vcreate_f64 should. (Previously, when
float64x1_t was still a scalar, casting caused a value conversion.) Hence, replace
the __builtin with a cast. None of the other variants of the aarch64_create
pattern were used, so remove the pattern and its associated machinery.
We also have to inhibit optimization in some testcases, as the midend can see
through casts, whereas it couldn't see through builtins ;).
The affected intrinsics are all covered by the tests gcc.target/aarch64/vrnd_f64_1.c,
vreinterpret_f64_1.c and vget_high_1.c.
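For illustration, a minimal sketch (hypothetical function and values) of the
cast semantics the patch relies on:

#include <arm_neon.h>
#include <stdint.h>

void
cast_semantics (void)
{
  uint64_t bits = 0x3ff0000000000000ULL;  /* IEEE-754 bit pattern of 1.0.  */
  float64x1_t v = (float64x1_t) bits;     /* Vector cast: reinterprets bits,
                                             so lane 0 == 1.0.  */
  double d = (double) bits;               /* Scalar cast: converts the value,
                                             giving ~4.61e18.  */
  (void) v; (void) d;
}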
gcc/ChangeLog:
* config/aarch64/aarch64-builtins.c (TYPES_CREATE): Remove.
* config/aarch64/aarch64-simd-builtins.def (create): Remove.
* config/aarch64/aarch64-simd.md (aarch64_create<mode>): Remove.
* config/aarch64/arm_neon.h (vcreate_f64, vreinterpret_f64_s64,
vreinterpret_f64_u64): Replace __builtin_aarch64_createv1df with C casts.
* config/aarch64/iterators.md (VD1): Remove.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/simd/vfma_f64.c: Add asm volatile memory.
* gcc.target/aarch64/simd/vfms_f64.c: Likewise.
[-- Attachment #2: 1_vcreate.patch --]
[-- Type: text/x-patch; name=1_vcreate.patch, Size: 5723 bytes --]
diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index 527445c5c7788bc37f41d9c3428f59a18410a93a..c130f80b869304087205e21aa644d76c06749309 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -136,7 +136,6 @@ static enum aarch64_type_qualifiers
aarch64_types_unopu_qualifiers[SIMD_MAX_BUILTIN_ARGS]
= { qualifier_unsigned, qualifier_unsigned };
#define TYPES_UNOPU (aarch64_types_unopu_qualifiers)
-#define TYPES_CREATE (aarch64_types_unop_qualifiers)
static enum aarch64_type_qualifiers
aarch64_types_binop_qualifiers[SIMD_MAX_BUILTIN_ARGS]
= { qualifier_none, qualifier_none, qualifier_maybe_immediate };
diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index 62b7f3357d12f2a4a483588e3ccf027c3f957c20..8cdb9609520a227f33008efa9201d7771e241755 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -39,7 +39,6 @@
1-9 - CODE_FOR_<name><mode><1-9>
10 - CODE_FOR_<name><mode>. */
- BUILTIN_VD1 (CREATE, create, 0)
BUILTIN_VDC (COMBINE, combine, 0)
BUILTIN_VB (BINOP, pmul, 0)
BUILTIN_VDQF (UNOP, sqrt, 2)
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index ef196e4b6fb39c0d2fd9ebfee76abab8369b1e92..00b59d3a352325e77632daa9723f3df4850cf922 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2315,16 +2315,6 @@
;; Patterns for AArch64 SIMD Intrinsics.
-(define_expand "aarch64_create<mode>"
- [(match_operand:VD1 0 "register_operand" "")
- (match_operand:DI 1 "general_operand" "")]
- "TARGET_SIMD"
-{
- rtx src = gen_lowpart (<MODE>mode, operands[1]);
- emit_move_insn (operands[0], src);
- DONE;
-})
-
;; Lane extraction with sign extension to general purpose register.
(define_insn "*aarch64_get_lane_extend<GPI:mode><VDQQH:mode>"
[(set (match_operand:GPI 0 "register_operand" "=r")
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 0ec1a24a52d81a6f8a2d45c0a931e771972d5eef..4a0d718642f8a3cb56281a70435b1b6445ee35be 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -2662,7 +2662,7 @@ vcreate_u64 (uint64_t __a)
__extension__ static __inline float64x1_t __attribute__ ((__always_inline__))
vcreate_f64 (uint64_t __a)
{
- return __builtin_aarch64_createv1df (__a);
+ return (float64x1_t) __a;
}
__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
@@ -3262,7 +3262,7 @@ vreinterpret_f64_s32 (int32x2_t __a)
__extension__ static __inline float64x1_t __attribute__((__always_inline__))
vreinterpret_f64_s64 (int64x1_t __a)
{
- return __builtin_aarch64_createv1df ((uint64_t) vget_lane_s64 (__a, 0));
+ return (float64x1_t) __a;
}
__extension__ static __inline float64x1_t __attribute__((__always_inline__))
@@ -3286,7 +3286,7 @@ vreinterpret_f64_u32 (uint32x2_t __a)
__extension__ static __inline float64x1_t __attribute__((__always_inline__))
vreinterpret_f64_u64 (uint64x1_t __a)
{
- return __builtin_aarch64_createv1df (vget_lane_u64 (__a, 0));
+ return (float64x1_t) __a;
}
__extension__ static __inline float64x2_t __attribute__((__always_inline__))
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 74c71fcc8047f221f28cedaba8fca80995576cc7..c5abc3af79405fa4cd5ab2fd6f9e756b5907a3ae 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -147,9 +147,6 @@
;; Double vector modes for combines.
(define_mode_iterator VDIC [V8QI V4HI V2SI])
-;; Double vector modes inc V1DF
-(define_mode_iterator VD1 [V8QI V4HI V2SI V2SF V1DF])
-
;; Vector modes except double int.
(define_mode_iterator VDQIF [V8QI V16QI V4HI V8HI V2SI V4SI V2SF V4SF V2DF])
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vfma_f64.c b/gcc/testsuite/gcc.target/aarch64/simd/vfma_f64.c
index 272b79ceb395a8f44d76ea1ac833a9cd64210711..8083d2c0d68cd9da07ff8235f6cd5c7549d0571a 100644
--- a/gcc/testsuite/gcc.target/aarch64/simd/vfma_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/simd/vfma_f64.c
@@ -7,6 +7,10 @@
#define EPS 1.0e-15
+#define INHIB_OPT(x) asm volatile ("mov %d0, %1.d[0]" \
+ : "=w"(x) \
+ : "w"(x) \
+ : /* No clobbers. */);
extern void abort (void);
@@ -24,6 +28,10 @@ main (void)
arg2 = vcreate_f64 (0x3fa88480812d6670ULL);
arg3 = vcreate_f64 (0x3fd5791ae2a92572ULL);
+ INHIB_OPT (arg1);
+ INHIB_OPT (arg2);
+ INHIB_OPT (arg3);
+
expected = 0.6280448184360076;
actual = vget_lane_f64 (vfma_f64 (arg1, arg2, arg3), 0);
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vfms_f64.c b/gcc/testsuite/gcc.target/aarch64/simd/vfms_f64.c
index f6e1f77886dd3af7374305a45b0055e9b2e3997d..ede140d75386faf91a038f3b59b7686d0a568ab2 100644
--- a/gcc/testsuite/gcc.target/aarch64/simd/vfms_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/simd/vfms_f64.c
@@ -7,6 +7,10 @@
#define EPS 1.0e-15
+#define INHIB_OPT(x) asm volatile ("mov %d0, %1.d[0]" \
+ : "=w"(x) \
+ : "w"(x) \
+ : /* No clobbers. */);
extern void abort (void);
@@ -24,6 +28,10 @@ main (void)
arg2 = vcreate_f64 (0x3fe6b78680fa29ceULL);
arg3 = vcreate_f64 (0x3feea3cbf921fbe0ULL);
+ INHIB_OPT (arg1);
+ INHIB_OPT (arg2);
+ INHIB_OPT (arg3);
+
expected = 4.4964705746355915e-2;
actual = vget_lane_f64 (vfms_f64 (arg1, arg2, arg3), 0);
* [PATCH 3/3][AArch64]Replace temporary assembler for vld1_dup
2014-11-14 10:46 [PATCH 0/3][AArch64]More intrinsics/builtins improvements Alan Lawrence
2014-11-14 10:46 ` [PATCH 2/3][AArch64] Extend aarch64_simd_vec_set pattern, replace asm for vld1_lane Alan Lawrence
2014-11-14 10:46 ` [PATCH 1/3][AArch64]Replace __builtin_aarch64_createv1df with a cast, cleanup Alan Lawrence
@ 2014-11-14 10:50 ` Alan Lawrence
2014-11-17 17:54 ` Marcus Shawcroft
2014-11-17 2:55 ` [PATCH 0/3][AArch64]More intrinsics/builtins improvements Yangfei (Felix)
3 siblings, 1 reply; 10+ messages in thread
From: Alan Lawrence @ 2014-11-14 10:50 UTC (permalink / raw)
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 759 bytes --]
This patch replaces the inline asm for vld1_dup intrinsics with a vdup_n_ and a
load from the pointer. The existing *aarch64_simd_ld1r<mode> insn, combiner,
etc., are quite capable of generating the expected single ld1r instruction from
this. (I've verified by inspecting assembler output.)
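As a hedged sketch (hypothetical function, not part of the patch), a typical
use now boils down to:

#include <arm_neon.h>

float32x2_t
dup_from_mem (const float32_t *p)
{
  return vld1_dup_f32 (p);   /* Expected to combine to: ld1r {v0.2s}, [x0].  */
}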
gcc/ChangeLog:
* config/aarch64/arm_neon.h (vld1_dup_f32, vld1_dup_f64, vld1_dup_p8,
vld1_dup_p16, vld1_dup_s8, vld1_dup_s16, vld1_dup_s32, vld1_dup_s64,
vld1_dup_u8, vld1_dup_u16, vld1_dup_u32, vld1_dup_u64, vld1q_dup_f32,
vld1q_dup_f64, vld1q_dup_p8, vld1q_dup_p16, vld1q_dup_s8, vld1q_dup_s16,
vld1q_dup_s32, vld1q_dup_s64, vld1q_dup_u8, vld1q_dup_u16,
vld1q_dup_u32, vld1q_dup_u64): Replace inline asm with vdup_n_ and
pointer dereference.
[-- Attachment #2: vld1_dup.patch --]
[-- Type: text/x-patch; name=vld1_dup.patch, Size: 10920 bytes --]
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index f036f7c0ba2733a822661027b815e7c3654db1bc..61a3bd3ab59c427522087f10ddd5679d6d46019d 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -6144,270 +6144,6 @@ vhsubq_u32 (uint32x4_t a, uint32x4_t b)
}
__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
-vld1_dup_f32 (const float32_t * a)
-{
- float32x2_t result;
- __asm__ ("ld1r {%0.2s}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline float64x1_t __attribute__ ((__always_inline__))
-vld1_dup_f64 (const float64_t * a)
-{
- float64x1_t result;
- __asm__ ("ld1r {%0.1d}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
-vld1_dup_p8 (const poly8_t * a)
-{
- poly8x8_t result;
- __asm__ ("ld1r {%0.8b}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline poly16x4_t __attribute__ ((__always_inline__))
-vld1_dup_p16 (const poly16_t * a)
-{
- poly16x4_t result;
- __asm__ ("ld1r {%0.4h}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vld1_dup_s8 (const int8_t * a)
-{
- int8x8_t result;
- __asm__ ("ld1r {%0.8b}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
-vld1_dup_s16 (const int16_t * a)
-{
- int16x4_t result;
- __asm__ ("ld1r {%0.4h}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
-vld1_dup_s32 (const int32_t * a)
-{
- int32x2_t result;
- __asm__ ("ld1r {%0.2s}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
-vld1_dup_s64 (const int64_t * a)
-{
- int64x1_t result;
- __asm__ ("ld1r {%0.1d}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
-vld1_dup_u8 (const uint8_t * a)
-{
- uint8x8_t result;
- __asm__ ("ld1r {%0.8b}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
-vld1_dup_u16 (const uint16_t * a)
-{
- uint16x4_t result;
- __asm__ ("ld1r {%0.4h}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
-vld1_dup_u32 (const uint32_t * a)
-{
- uint32x2_t result;
- __asm__ ("ld1r {%0.2s}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
-vld1_dup_u64 (const uint64_t * a)
-{
- uint64x1_t result;
- __asm__ ("ld1r {%0.1d}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
-vld1q_dup_f32 (const float32_t * a)
-{
- float32x4_t result;
- __asm__ ("ld1r {%0.4s}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
-vld1q_dup_f64 (const float64_t * a)
-{
- float64x2_t result;
- __asm__ ("ld1r {%0.2d}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
-vld1q_dup_p8 (const poly8_t * a)
-{
- poly8x16_t result;
- __asm__ ("ld1r {%0.16b}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline poly16x8_t __attribute__ ((__always_inline__))
-vld1q_dup_p16 (const poly16_t * a)
-{
- poly16x8_t result;
- __asm__ ("ld1r {%0.8h}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vld1q_dup_s8 (const int8_t * a)
-{
- int8x16_t result;
- __asm__ ("ld1r {%0.16b}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
-vld1q_dup_s16 (const int16_t * a)
-{
- int16x8_t result;
- __asm__ ("ld1r {%0.8h}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
-vld1q_dup_s32 (const int32_t * a)
-{
- int32x4_t result;
- __asm__ ("ld1r {%0.4s}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
-vld1q_dup_s64 (const int64_t * a)
-{
- int64x2_t result;
- __asm__ ("ld1r {%0.2d}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
-vld1q_dup_u8 (const uint8_t * a)
-{
- uint8x16_t result;
- __asm__ ("ld1r {%0.16b}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
-vld1q_dup_u16 (const uint16_t * a)
-{
- uint16x8_t result;
- __asm__ ("ld1r {%0.8h}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
-vld1q_dup_u32 (const uint32_t * a)
-{
- uint32x4_t result;
- __asm__ ("ld1r {%0.4s}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
-vld1q_dup_u64 (const uint64_t * a)
-{
- uint64x2_t result;
- __asm__ ("ld1r {%0.2d}, %1"
- : "=w"(result)
- : "Utv"(*a)
- : /* No clobbers */);
- return result;
-}
-
-__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
vmla_n_f32 (float32x2_t a, float32x2_t b, float32_t c)
{
float32x2_t result;
@@ -16142,6 +15878,154 @@ vld1q_u64 (const uint64_t *a)
__builtin_aarch64_ld1v2di ((const __builtin_aarch64_simd_di *) a);
}
+/* vld1_dup */
+
+__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
+vld1_dup_f32 (const float32_t* __a)
+{
+ return vdup_n_f32 (*__a);
+}
+
+__extension__ static __inline float64x1_t __attribute__ ((__always_inline__))
+vld1_dup_f64 (const float64_t* __a)
+{
+ return vdup_n_f64 (*__a);
+}
+
+__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
+vld1_dup_p8 (const poly8_t* __a)
+{
+ return vdup_n_p8 (*__a);
+}
+
+__extension__ static __inline poly16x4_t __attribute__ ((__always_inline__))
+vld1_dup_p16 (const poly16_t* __a)
+{
+ return vdup_n_p16 (*__a);
+}
+
+__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
+vld1_dup_s8 (const int8_t* __a)
+{
+ return vdup_n_s8 (*__a);
+}
+
+__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
+vld1_dup_s16 (const int16_t* __a)
+{
+ return vdup_n_s16 (*__a);
+}
+
+__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
+vld1_dup_s32 (const int32_t* __a)
+{
+ return vdup_n_s32 (*__a);
+}
+
+__extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
+vld1_dup_s64 (const int64_t* __a)
+{
+ return vdup_n_s64 (*__a);
+}
+
+__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
+vld1_dup_u8 (const uint8_t* __a)
+{
+ return vdup_n_u8 (*__a);
+}
+
+__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
+vld1_dup_u16 (const uint16_t* __a)
+{
+ return vdup_n_u16 (*__a);
+}
+
+__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
+vld1_dup_u32 (const uint32_t* __a)
+{
+ return vdup_n_u32 (*__a);
+}
+
+__extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
+vld1_dup_u64 (const uint64_t* __a)
+{
+ return vdup_n_u64 (*__a);
+}
+
+/* vld1q_dup */
+
+__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
+vld1q_dup_f32 (const float32_t* __a)
+{
+ return vdupq_n_f32 (*__a);
+}
+
+__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
+vld1q_dup_f64 (const float64_t* __a)
+{
+ return vdupq_n_f64 (*__a);
+}
+
+__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
+vld1q_dup_p8 (const poly8_t* __a)
+{
+ return vdupq_n_p8 (*__a);
+}
+
+__extension__ static __inline poly16x8_t __attribute__ ((__always_inline__))
+vld1q_dup_p16 (const poly16_t* __a)
+{
+ return vdupq_n_p16 (*__a);
+}
+
+__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
+vld1q_dup_s8 (const int8_t* __a)
+{
+ return vdupq_n_s8 (*__a);
+}
+
+__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
+vld1q_dup_s16 (const int16_t* __a)
+{
+ return vdupq_n_s16 (*__a);
+}
+
+__extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
+vld1q_dup_s32 (const int32_t* __a)
+{
+ return vdupq_n_s32 (*__a);
+}
+
+__extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
+vld1q_dup_s64 (const int64_t* __a)
+{
+ return vdupq_n_s64 (*__a);
+}
+
+__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
+vld1q_dup_u8 (const uint8_t* __a)
+{
+ return vdupq_n_u8 (*__a);
+}
+
+__extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
+vld1q_dup_u16 (const uint16_t* __a)
+{
+ return vdupq_n_u16 (*__a);
+}
+
+__extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
+vld1q_dup_u32 (const uint32_t* __a)
+{
+ return vdupq_n_u32 (*__a);
+}
+
+__extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
+vld1q_dup_u64 (const uint64_t* __a)
+{
+ return vdupq_n_u64 (*__a);
+}
+
/* vld1_lane */
__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
* Re: [PATCH 0/3][AArch64]More intrinsics/builtins improvements
2014-11-14 10:46 [PATCH 0/3][AArch64]More intrinsics/builtins improvements Alan Lawrence
` (2 preceding siblings ...)
2014-11-14 10:50 ` [PATCH 3/3][AArch64]Replace temporary assembler for vld1_dup Alan Lawrence
@ 2014-11-17 2:55 ` Yangfei (Felix)
2014-11-17 12:03 ` Alan Lawrence
3 siblings, 1 reply; 10+ messages in thread
From: Yangfei (Felix) @ 2014-11-17 2:55 UTC (permalink / raw)
To: Alan Lawrence, gcc-patches
> These three are logically independent, but all on a common theme, and I've
> tested them all together by
>
> bootstrapped + check-gcc on aarch64-none-elf cross-tested check-gcc on
> aarch64_be-none-elf
>
> Ok for trunk?
Hi Alan,
It seems that we are duplicating the work for the vld1_dup part. (Refer to my message: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01462.html)
I have a plan to improve these intrinsics/builtins: vrsubhnX, vrsqrtX, vqrdmulX, vqmovX, vqdmulhqX, vqdmulhX, vpminX, vpmaxX, vpaddX, vpadaX
vmvnX, vmulxX, vmovnX, vmlsX, vhsubX, vcvtX, vcopyX, vaddlvX, vabX, vfmX, vrecpeX, vcntX, vclsX
And work for these intrinsics is in progress: vfmX, vrecpeX, vhsubX, vcntX, vclsX
Please let me know if you guys want to work on any of them. Thanks.
* Re: [PATCH 0/3][AArch64]More intrinsics/builtins improvements
2014-11-17 2:55 ` [PATCH 0/3][AArch64]More intrinsics/builtins improvements Yangfei (Felix)
@ 2014-11-17 12:03 ` Alan Lawrence
2014-11-18 9:13 ` Yangfei (Felix)
0 siblings, 1 reply; 10+ messages in thread
From: Alan Lawrence @ 2014-11-17 12:03 UTC (permalink / raw)
To: Yangfei (Felix); +Cc: gcc-patches
Ah, sorry for the duplication of effort. And thanks for the heads-up about
upcoming work! I don't think I have any plans for any of those others at the moment.
In the case of vld1_dup, however, I'm going to argue that my approach
(https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01718.html) is better, in that a
builtin is opaque (blocks optimization) for the midend, whereas gcc vector
extensions (as in vdup_n_...) allows the midend to perform constant-folding,
etc. Does that make sense?
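For instance, a minimal sketch (hypothetical function) of the constant-folding
point:

#include <arm_neon.h>

static const float32_t one = 1.0f;

float32x2_t
dup_known_constant (void)
{
  /* With the vector-extension implementation, the midend may fold this
     whole load to a constant vector {1.0f, 1.0f}; an opaque builtin
     would block that folding.  */
  return vld1_dup_f32 (&one);
}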
--Alan
Yangfei (Felix) wrote:
>> These three are logically independent, but all on a common theme, and I've
>> tested them all together by
>>
>> bootstrapped + check-gcc on aarch64-none-elf cross-tested check-gcc on
>> aarch64_be-none-elf
>>
>> Ok for trunk?
>
>
> Hi Alan,
>
> It seems that we are duplicating the work for the vld1_dup part. (Refer to my message: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01462.html)
> I have a plan to improve these intrinsics/builtins: vrsubhnX, vrsqrtX, vqrdmulX, vqmovX, vqdmulhqX, vqdmulhX, vpminX, vpmaxX, vpaddX, vpadaX
> vmvnX, vmulxX, vmovnX, vmlsX, vhsubX, vcvtX, vcopyX, vaddlvX, vabX, vfmX, vrecpeX, vcntX, vclsX
> And work for these intrinsics is in progress: vfmX, vrecpeX, vhsubX, vcntX, vclsX
> Please let me know if you guys want to work on any of them. Thanks.
>
* Re: [PATCH 1/3][AArch64]Replace __builtin_aarch64_createv1df with a cast, cleanup
2014-11-14 10:46 ` [PATCH 1/3][AArch64]Replace __builtin_aarch64_createv1df with a cast, cleanup Alan Lawrence
@ 2014-11-17 17:35 ` Marcus Shawcroft
0 siblings, 0 replies; 10+ messages in thread
From: Marcus Shawcroft @ 2014-11-17 17:35 UTC (permalink / raw)
To: Alan Lawrence; +Cc: gcc-patches
On 14 November 2014 10:45, Alan Lawrence <alan.lawrence@arm.com> wrote:
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-builtins.c (TYPES_CREATE): Remove.
> * config/aarch64/aarch64-simd-builtins.def (create): Remove.
> * config/aarch64/aarch64-simd.md (aarch64_create<mode>): Remove.
> * config/aarch64/arm_neon.h (vcreate_f64, vreinterpret_f64_s64,
> vreinterpret_f64_u64): Replace __builtin_aarch64_createv1df with C
> casts.
> * config/aarch64/iterators.md (VD1): Remove.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/simd/vfma_f64.c: Add asm volatile memory.
> * gcc.target/aarch64/simd/vfms_f64.c: Likewise.
OK /Marcus
* Re: [PATCH 2/3][AArch64] Extend aarch64_simd_vec_set pattern, replace asm for vld1_lane
2014-11-14 10:46 ` [PATCH 2/3][AArch64] Extend aarch64_simd_vec_set pattern, replace asm for vld1_lane Alan Lawrence
@ 2014-11-17 17:36 ` Marcus Shawcroft
0 siblings, 0 replies; 10+ messages in thread
From: Marcus Shawcroft @ 2014-11-17 17:36 UTC (permalink / raw)
To: Alan Lawrence; +Cc: gcc-patches
On 14 November 2014 10:46, Alan Lawrence <alan.lawrence@arm.com> wrote:
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-simd.md (aarch64_simd_vec_set<mode>): Add
> variant reading from memory and assembling to ld1.
>
> * config/aarch64/arm_neon.h (vld1_lane_f32, vld1_lane_f64,
> vld1_lane_p8,
> vld1_lane_p16, vld1_lane_s8, vld1_lane_s16, vld1_lane_s32,
> vld1_lane_s64, vld1_lane_u8, vld1_lane_u16, vld1_lane_u32,
> vld1_lane_u64, vld1q_lane_f32, vld1q_lane_f64, vld1q_lane_p8,
> vld1q_lane_p16, vld1q_lane_s8, vld1q_lane_s16, vld1q_lane_s32,
> vld1q_lane_s64, vld1q_lane_u8, vld1q_lane_u16, vld1q_lane_u32,
> vld1q_lane_u64): Replace asm with vset_lane and pointer dereference.
>
> gcc/testsuite/ChangeLog:
>
> gcc.target/aarch64/vld1_lane.c: New test.
OK /Marcus
* Re: [PATCH 3/3][AArch64]Replace temporary assembler for vld1_dup
2014-11-14 10:50 ` [PATCH 3/3][AArch64]Replace temporary assembler for vld1_dup Alan Lawrence
@ 2014-11-17 17:54 ` Marcus Shawcroft
0 siblings, 0 replies; 10+ messages in thread
From: Marcus Shawcroft @ 2014-11-17 17:54 UTC (permalink / raw)
To: Alan Lawrence; +Cc: gcc-patches
On 14 November 2014 10:46, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This patch replaces the inline asm for vld1_dup intrinsics with a vdup_n_
> and a load from the pointer. The existing *aarch64_simd_ld1r<mode> insn,
> combiner, etc., are quite capable of generating the expected single ld1r
> instruction from this. (I've verified by inspecting assembler output.)
>
> gcc/ChangeLog:
>
> * config/aarch64/arm_neon.h (vld1_dup_f32, vld1_dup_f64,
> vld1_dup_p8,
> vld1_dup_p16, vld1_dup_s8, vld1_dup_s16, vld1_dup_s32, vld1_dup_s64,
> vld1_dup_u8, vld1_dup_u16, vld1_dup_u32, vld1_dup_u64,
> vld1q_dup_f32,
> vld1q_dup_f64, vld1q_dup_p8, vld1q_dup_p16, vld1q_dup_s8,
> vld1q_dup_s16,
> vld1q_dup_s32, vld1q_dup_s64, vld1q_dup_u8, vld1q_dup_u16,
> vld1q_dup_u32, vld1q_dup_u64): Replace inline asm with vdup_n_ and
> pointer dereference.
OK /Marcus
* Re: [PATCH 0/3][AArch64]More intrinsics/builtins improvements
2014-11-17 12:03 ` Alan Lawrence
@ 2014-11-18 9:13 ` Yangfei (Felix)
0 siblings, 0 replies; 10+ messages in thread
From: Yangfei (Felix) @ 2014-11-18 9:13 UTC (permalink / raw)
To: Alan Lawrence, gcc-patches
Yeah, I agree that your approach is better. I missed this point. Thanks.
>
> Ah, sorry for the duplication of effort. And thanks for the heads-up about
> upcoming work! I don't think I have any plans for any of those others at the
> moment.
>
> In the case of vld1_dup, however, I'm going to argue that my approach
> (https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01718.html) is better, in that a
> builtin is opaque (blocks optimization) for the midend, whereas gcc vector
> extensions (as in vdup_n_...) allows the midend to perform constant-folding, etc.
> Does that make sense?
>
> --Alan
>
> Yangfei (Felix) wrote:
> >> These three are logically independent, but all on a common theme, and
> >> I've tested them all together by
> >>
> >> bootstrapped + check-gcc on aarch64-none-elf cross-tested check-gcc
> >> on aarch64_be-none-elf
> >>
> >> Ok for trunk?
> >
> >
> > Hi Alan,
> >
> > It seems that we are duplicating the work for the vld1_dup part. (Refer to
> > my message: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01462.html)
> > I have a plan to improve these intrinsics/builtins: vrsubhnX, vrsqrtX,
> > vqrdmulX, vqmovX, vqdmulhqX, vqdmulhX, vpminX, vpmaxX, vpaddX, vpadaX,
> > vmvnX, vmulxX, vmovnX, vmlsX, vhsubX, vcvtX, vcopyX, vaddlvX, vabX, vfmX,
> > vrecpeX, vcntX, vclsX
> > And work for these intrinsics is in progress: vfmX, vrecpeX, vhsubX,
> > vcntX, vclsX
> > Please let me know if you guys want to work on any of them. Thanks.
> >
>