* [PATCH][AArch64 Intrinsics] Replace temporary assembler for vst1_lane
@ 2015-01-13 18:39 Alan Lawrence
2015-01-30 13:25 ` Alan Lawrence
0 siblings, 1 reply; 4+ messages in thread
From: Alan Lawrence @ 2015-01-13 18:39 UTC (permalink / raw)
To: gcc-patches; +Cc: Marcus Shawcroft
[-- Attachment #1: Type: text/plain, Size: 1064 bytes --]
Nowadays, just storing the (bigendian-corrected) vector element to the address,
generates exactly the same assembler for all cases except
{float,int,uint}64x1_t, where
st1 {v0.d}[0], [x0]
becomes
str d0, [x0]
This is not a problem, and the change will be much better for optimization
through the midend, as well as making use of previous improvements in error
reporting.
Also move the /* vst1q */ comment, which was a couple intrinsics too late.
gcc/ChangeLog:
* config/aarch64/arm_neon.h (vst1_lane_f32, vst1_lane_f64,
vst1_lane_p8, vst1_lane_p16, vst1_lane_s8, vst1_lane_s16,
vst1_lane_s32, vst1_lane_s64, vst1_lane_u8, vst1_lane_u16,
vst1_lane_u32, vst1_lane_u64, vst1q_lane_f32, vst1q_lane_f64,
vst1q_lane_p8, vst1q_lane_p16, vst1q_lane_s8, vst1q_lane_s16,
vst1q_lane_s32, vst1q_lane_s64, vst1q_lane_u8, vst1q_lane_u16,
vst1q_lane_u32, vst1q_lane_u64): Reimplement with pointer dereference
and __aarch64_vget_lane_any.
Cross-tested check-gcc on aarch64-none-elf and aarch64_be-none-elf.
Ok for trunk?
Cheers, Alan
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: vst1_lane.patch --]
[-- Type: text/x-patch; name=vst1_lane.patch, Size: 23009 bytes --]
commit 926aec661699e52f617f16068075ef0242a43609
Author: Alan Lawrence <alan.lawrence@arm.com>
Date: Thu Dec 11 17:29:54 2014 +0000
Replace temporary inline assembler for vst1_lane, move /* vst1q */ comment.
Note for (float|u?int)64x1 vectors,
st1 {v0.d}[0], [x0]
becomes
str d0, [x0]
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 3d1bcd5..980490f 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -10304,272 +10304,6 @@ vrsqrtss_f32 (float32_t a, float32_t b)
result; \
})
-#define vst1_lane_f32(a, b, c) \
- __extension__ \
- ({ \
- float32x2_t b_ = (b); \
- float32_t * a_ = (a); \
- __asm__ ("st1 {%1.s}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_f64(a, b, c) \
- __extension__ \
- ({ \
- float64x1_t b_ = (b); \
- float64_t * a_ = (a); \
- __asm__ ("st1 {%1.d}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_p8(a, b, c) \
- __extension__ \
- ({ \
- poly8x8_t b_ = (b); \
- poly8_t * a_ = (a); \
- __asm__ ("st1 {%1.b}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_p16(a, b, c) \
- __extension__ \
- ({ \
- poly16x4_t b_ = (b); \
- poly16_t * a_ = (a); \
- __asm__ ("st1 {%1.h}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_s8(a, b, c) \
- __extension__ \
- ({ \
- int8x8_t b_ = (b); \
- int8_t * a_ = (a); \
- __asm__ ("st1 {%1.b}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_s16(a, b, c) \
- __extension__ \
- ({ \
- int16x4_t b_ = (b); \
- int16_t * a_ = (a); \
- __asm__ ("st1 {%1.h}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_s32(a, b, c) \
- __extension__ \
- ({ \
- int32x2_t b_ = (b); \
- int32_t * a_ = (a); \
- __asm__ ("st1 {%1.s}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_s64(a, b, c) \
- __extension__ \
- ({ \
- int64x1_t b_ = (b); \
- int64_t * a_ = (a); \
- __asm__ ("st1 {%1.d}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_u8(a, b, c) \
- __extension__ \
- ({ \
- uint8x8_t b_ = (b); \
- uint8_t * a_ = (a); \
- __asm__ ("st1 {%1.b}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_u16(a, b, c) \
- __extension__ \
- ({ \
- uint16x4_t b_ = (b); \
- uint16_t * a_ = (a); \
- __asm__ ("st1 {%1.h}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_u32(a, b, c) \
- __extension__ \
- ({ \
- uint32x2_t b_ = (b); \
- uint32_t * a_ = (a); \
- __asm__ ("st1 {%1.s}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1_lane_u64(a, b, c) \
- __extension__ \
- ({ \
- uint64x1_t b_ = (b); \
- uint64_t * a_ = (a); \
- __asm__ ("st1 {%1.d}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-
-#define vst1q_lane_f32(a, b, c) \
- __extension__ \
- ({ \
- float32x4_t b_ = (b); \
- float32_t * a_ = (a); \
- __asm__ ("st1 {%1.s}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_f64(a, b, c) \
- __extension__ \
- ({ \
- float64x2_t b_ = (b); \
- float64_t * a_ = (a); \
- __asm__ ("st1 {%1.d}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_p8(a, b, c) \
- __extension__ \
- ({ \
- poly8x16_t b_ = (b); \
- poly8_t * a_ = (a); \
- __asm__ ("st1 {%1.b}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_p16(a, b, c) \
- __extension__ \
- ({ \
- poly16x8_t b_ = (b); \
- poly16_t * a_ = (a); \
- __asm__ ("st1 {%1.h}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_s8(a, b, c) \
- __extension__ \
- ({ \
- int8x16_t b_ = (b); \
- int8_t * a_ = (a); \
- __asm__ ("st1 {%1.b}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_s16(a, b, c) \
- __extension__ \
- ({ \
- int16x8_t b_ = (b); \
- int16_t * a_ = (a); \
- __asm__ ("st1 {%1.h}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_s32(a, b, c) \
- __extension__ \
- ({ \
- int32x4_t b_ = (b); \
- int32_t * a_ = (a); \
- __asm__ ("st1 {%1.s}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_s64(a, b, c) \
- __extension__ \
- ({ \
- int64x2_t b_ = (b); \
- int64_t * a_ = (a); \
- __asm__ ("st1 {%1.d}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_u8(a, b, c) \
- __extension__ \
- ({ \
- uint8x16_t b_ = (b); \
- uint8_t * a_ = (a); \
- __asm__ ("st1 {%1.b}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_u16(a, b, c) \
- __extension__ \
- ({ \
- uint16x8_t b_ = (b); \
- uint16_t * a_ = (a); \
- __asm__ ("st1 {%1.h}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_u32(a, b, c) \
- __extension__ \
- ({ \
- uint32x4_t b_ = (b); \
- uint32_t * a_ = (a); \
- __asm__ ("st1 {%1.s}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-#define vst1q_lane_u64(a, b, c) \
- __extension__ \
- ({ \
- uint64x2_t b_ = (b); \
- uint64_t * a_ = (a); \
- __asm__ ("st1 {%1.d}[%2],[%0]" \
- : \
- : "r"(a_), "w"(b_), "i"(c) \
- : "memory"); \
- })
-
-
__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
vtst_p8 (poly8x8_t a, poly8x8_t b)
{
@@ -22498,6 +22232,8 @@ vst1_u64 (uint64_t *a, uint64x1_t b)
*a = b[0];
}
+/* vst1q */
+
__extension__ static __inline void __attribute__ ((__always_inline__))
vst1q_f32 (float32_t *a, float32x4_t b)
{
@@ -22510,8 +22246,6 @@ vst1q_f64 (float64_t *a, float64x2_t b)
__builtin_aarch64_st1v2df ((__builtin_aarch64_simd_df *) a, b);
}
-/* vst1q */
-
__extension__ static __inline void __attribute__ ((__always_inline__))
vst1q_p8 (poly8_t *a, poly8x16_t b)
{
@@ -22578,6 +22312,154 @@ vst1q_u64 (uint64_t *a, uint64x2_t b)
(int64x2_t) b);
}
+/* vst1_lane */
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_f32 (float32_t *__a, float32x2_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_f64 (float64_t *__a, float64x1_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_p8 (poly8_t *__a, poly8x8_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_p16 (poly16_t *__a, poly16x4_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_s8 (int8_t *__a, int8x8_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_s16 (int16_t *__a, int16x4_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_s32 (int32_t *__a, int32x2_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_s64 (int64_t *__a, int64x1_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_u8 (uint8_t *__a, uint8x8_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_u16 (uint16_t *__a, uint16x4_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_u32 (uint32_t *__a, uint32x2_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1_lane_u64 (uint64_t *__a, uint64x1_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+/* vst1q_lane */
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_f32 (float32_t *__a, float32x4_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_f64 (float64_t *__a, float64x2_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_p8 (poly8_t *__a, poly8x16_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_p16 (poly16_t *__a, poly16x8_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_s8 (int8_t *__a, int8x16_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_s16 (int16_t *__a, int16x8_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_s32 (int32_t *__a, int32x4_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_s64 (int64_t *__a, int64x2_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_u8 (uint8_t *__a, uint8x16_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_u16 (uint16_t *__a, uint16x8_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_u32 (uint32_t *__a, uint32x4_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
+__extension__ static __inline void __attribute__ ((__always_inline__))
+vst1q_lane_u64 (uint64_t *__a, uint64x2_t __b, const int __lane)
+{
+ *__a = __aarch64_vget_lane_any (__b, __lane);
+}
+
/* vstn */
__extension__ static __inline void
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH][AArch64 Intrinsics] Replace temporary assembler for vst1_lane
2015-01-13 18:39 [PATCH][AArch64 Intrinsics] Replace temporary assembler for vst1_lane Alan Lawrence
@ 2015-01-30 13:25 ` Alan Lawrence
2015-01-30 15:49 ` Marcus Shawcroft
0 siblings, 1 reply; 4+ messages in thread
From: Alan Lawrence @ 2015-01-30 13:25 UTC (permalink / raw)
To: Marcus Shawcroft; +Cc: gcc-patches
This was posted towards the end of stage 3, a few days before stage 4
started. Is it now too late to "ping" ?
--Alan
Alan Lawrence wrote:
> Nowadays, just storing the (bigendian-corrected) vector element to the address,
> generates exactly the same assembler for all cases except
> {float,int,uint}64x1_t, where
> st1 {v0.d}[0], [x0]
> becomes
> str d0, [x0]
>
> This is not a problem, and the change will be much better for optimization
> through the midend, as well as making use of previous improvements in error
> reporting.
>
> Also move the /* vst1q */ comment, which was a couple intrinsics too late.
>
> gcc/ChangeLog:
>
> * config/aarch64/arm_neon.h (vst1_lane_f32, vst1_lane_f64,
> vst1_lane_p8, vst1_lane_p16, vst1_lane_s8, vst1_lane_s16,
> vst1_lane_s32, vst1_lane_s64, vst1_lane_u8, vst1_lane_u16,
> vst1_lane_u32, vst1_lane_u64, vst1q_lane_f32, vst1q_lane_f64,
> vst1q_lane_p8, vst1q_lane_p16, vst1q_lane_s8, vst1q_lane_s16,
> vst1q_lane_s32, vst1q_lane_s64, vst1q_lane_u8, vst1q_lane_u16,
> vst1q_lane_u32, vst1q_lane_u64): Reimplement with pointer dereference
> and __aarch64_vget_lane_any.
>
> Cross-tested check-gcc on aarch64-none-elf and aarch64_be-none-elf.
>
> Ok for trunk?
>
> Cheers, Alan
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH][AArch64 Intrinsics] Replace temporary assembler for vst1_lane
2015-01-30 13:25 ` Alan Lawrence
@ 2015-01-30 15:49 ` Marcus Shawcroft
2015-04-14 14:04 ` Alan Lawrence
0 siblings, 1 reply; 4+ messages in thread
From: Marcus Shawcroft @ 2015-01-30 15:49 UTC (permalink / raw)
To: Alan Lawrence, gcc-patches
On 30 January 2015 at 12:09, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This was posted towards the end of stage 3, a few days before stage 4
> started. Is it now too late to "ping" ?
>
> --Alan
>> gcc/ChangeLog:
>>
>> * config/aarch64/arm_neon.h (vst1_lane_f32, vst1_lane_f64,
>> vst1_lane_p8, vst1_lane_p16, vst1_lane_s8, vst1_lane_s16,
>> vst1_lane_s32, vst1_lane_s64, vst1_lane_u8, vst1_lane_u16,
>> vst1_lane_u32, vst1_lane_u64, vst1q_lane_f32, vst1q_lane_f64,
>> vst1q_lane_p8, vst1q_lane_p16, vst1q_lane_s8, vst1q_lane_s16,
>> vst1q_lane_s32, vst1q_lane_s64, vst1q_lane_u8, vst1q_lane_u16,
>> vst1q_lane_u32, vst1q_lane_u64): Reimplement with pointer
>> dereference
>> and __aarch64_vget_lane_any.
Please don't top post.
This is OK but wait for stage-1 to open before you commit it please.
/Marcus
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH][AArch64 Intrinsics] Replace temporary assembler for vst1_lane
2015-01-30 15:49 ` Marcus Shawcroft
@ 2015-04-14 14:04 ` Alan Lawrence
0 siblings, 0 replies; 4+ messages in thread
From: Alan Lawrence @ 2015-04-14 14:04 UTC (permalink / raw)
To: Marcus Shawcroft; +Cc: gcc-patches
Marcus Shawcroft wrote:
> On 30 January 2015 at 12:09, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> This was posted towards the end of stage 3, a few days before stage 4
>> started. Is it now too late to "ping" ?
>>
>> --Alan
>
>>> gcc/ChangeLog:
>>>
>>> * config/aarch64/arm_neon.h (vst1_lane_f32, vst1_lane_f64,
>>> vst1_lane_p8, vst1_lane_p16, vst1_lane_s8, vst1_lane_s16,
>>> vst1_lane_s32, vst1_lane_s64, vst1_lane_u8, vst1_lane_u16,
>>> vst1_lane_u32, vst1_lane_u64, vst1q_lane_f32, vst1q_lane_f64,
>>> vst1q_lane_p8, vst1q_lane_p16, vst1q_lane_s8, vst1q_lane_s16,
>>> vst1q_lane_s32, vst1q_lane_s64, vst1q_lane_u8, vst1q_lane_u16,
>>> vst1q_lane_u32, vst1q_lane_u64): Reimplement with pointer
>>> dereference
>>> and __aarch64_vget_lane_any.
>
> Please don't top post.
>
> This is OK but wait for stage-1 to open before you commit it please.
> /Marcus
>
Now pushed as r222092.
--Alan
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2015-04-14 14:04 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-13 18:39 [PATCH][AArch64 Intrinsics] Replace temporary assembler for vst1_lane Alan Lawrence
2015-01-30 13:25 ` Alan Lawrence
2015-01-30 15:49 ` Marcus Shawcroft
2015-04-14 14:04 ` Alan Lawrence
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).