public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [Patch ARM 0/3] Neon intrinsics TLC - Replace intrinsics with GNU C implementations where possible and remove dead code.
@ 2014-04-28 10:44 Ramana Radhakrishnan
  2014-04-28 10:45 ` [Patch ARM 1/3] Neon intrinsics TLC : Replace intrinsics with GNU C implementations where possible Ramana Radhakrishnan
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Ramana Radhakrishnan @ 2014-04-28 10:44 UTC (permalink / raw)
  To: gcc-patches@gcc.gnu.org; +Cc: Christophe Lyon

Hi,

	I was investigating a performance issue with Neon intrinsics and 
realized this needed to happen.

	Patch 1/3 does this. I've special-cased the -ffast-math case for the 
_f32 intrinsics to prevent the auto-vectorizer from coming along and 
vectorizing addv2sf and addv4sf type operations, which we don't want to 
happen by default. Patch 1/3 causes apparent "regressions" in the rather 
ineffective Neon intrinsics tests that we currently carry, soon 
(hopefully) to be replaced by Christophe Lyon's rewrite that is being 
reviewed. On the whole I deem this patch stack safe to go in if 
necessary. These "regressions" are at -O0 with the vbic and vorn 
intrinsics, which no longer get combined; so be it.

	This then left us in the happy position of being able to delete code, 
but I was worried about LTO streaming, as these "builtins" are 
essentially streamed out in LTO object code format. However, since we 
make no promises about LTO compatibility across releases, that is 
safe. I structured the dead code elimination as Patch 2/3, which will 
be committed separately in case folks want to backport Patch 1/3 on 
its own and assure their users of LTO compatibility within a release 
branch (if that even works :) ).

	Patch 3/3 removes the ML used to generate the Neon intrinsics, removes 
the generator documentation, and updates the comments in the files to 
show that these are now hand-crafted rather than auto-generated. We've 
had these for many years now and I think it's time we got rid of them. 
Not everyone groks ML, and it doesn't help that only one or two folks 
can actually do this properly every time. Given these bottlenecks, and 
given that the intrinsics are pretty stable now, there is no point in 
retaining the generator interface; I'd rather get rid of it. The only 
bits left are neon-schedgen.ml, neon.ml and neon-testgen.ml. I think we 
can safely remove neon-testgen.ml once Christophe's testsuite is done, 
and we'll probably just have to carry neon-schedgen.ml / neon.ml as 
they still generate the Neon scheduling descriptions for both the A8 
and A9.

	The patch stack was caught up in the C++ type info mess recently. I've 
tested this with a cross arm-linux-gnueabihf testsuite run and it 
looks OK modulo the issues mentioned for Patch 1/3. I've deliberately 
resisted deleting the entire gcc.target/arm/neon directory and 
neon-testgen.ml, in the hope that Christophe's testsuite will do the 
honours at that point :). Given that we're in stage 1, and that I 
think we're getting somewhere with clyon's testsuite, I feel it is 
reasonably practical to just carry the noise from these extra 
failures. Christophe and I will test-drive his testsuite work in this 
space with these patches to see how the conversion process works and 
whether there are any issues.

If there are issues I'm happy to hear about them.

Will apply to trunk in a couple of days if no regressions with clyon's 
testsuite for these intrinsics.


regards
Ramana
-- 
Ramana Radhakrishnan
Principal Engineer
ARM Ltd.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Patch ARM 1/3] Neon intrinsics TLC :  Replace intrinsics with GNU C implementations where possible.
  2014-04-28 10:44 [Patch ARM 0/3] Neon intrinsics TLC - Replace intrinsics with GNU C implementations where possible and remove dead code Ramana Radhakrishnan
@ 2014-04-28 10:45 ` Ramana Radhakrishnan
  2014-04-28 11:52   ` Julian Brown
  2014-04-29 13:58   ` Christophe Lyon
  2014-04-28 10:47 ` [Patch ARM 2/3] Remove dead code from backend Ramana Radhakrishnan
  2014-04-28 10:52 ` [Patch ARM 3/3] Neon intrinsics TLC - Remove unneeded ML " Ramana Radhakrishnan
  2 siblings, 2 replies; 9+ messages in thread
From: Ramana Radhakrishnan @ 2014-04-28 10:45 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: gcc-patches, gcc-patches, Christophe Lyon

[-- Attachment #1: Type: text/plain, Size: 4302 bytes --]

I've special-cased the -ffast-math case for the _f32 intrinsics to 
prevent the auto-vectorizer from coming along and vectorizing addv2sf 
and addv4sf type operations, which we don't want to happen by default.
Patch 1/3 causes apparent "regressions" in the rather ineffective Neon 
intrinsics tests that we currently carry, soon (hopefully) to be 
replaced by Christophe Lyon's rewrite that is being reviewed. On the 
whole I deem this patch stack safe to go in if necessary. These 
"regressions" are at -O0 with the vbic and vorn intrinsics, which no 
longer get combined; so be it.


Given that we're in stage 1, and that I think we're getting somewhere 
with clyon's testsuite, I feel it is reasonably practical to just 
carry the noise from these extra failures. Christophe and I will 
test-drive his testsuite work in this space with these patches to see 
how the conversion process works and whether there are any issues.


<DATE>  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

	* config/arm/arm_neon.h (vadd_s8): Replace with GNU C implementation.
	(vadd_s16): Likewise.
	(vadd_s32): Likewise.
	(vadd_f32): Likewise.
	(vadd_u8): Likewise.
	(vadd_u16): Likewise.
	(vadd_u32): Likewise.
	(vadd_s64): Likewise.
	(vadd_u64): Likewise.
	(vaddq_s8): Likewise.
	(vaddq_s16): Likewise.
	(vaddq_s32): Likewise.
	(vaddq_s64): Likewise.
	(vaddq_f32): Likewise.
	(vaddq_u8): Likewise.
	(vaddq_u16): Likewise.
	(vaddq_u32): Likewise.
	(vaddq_u64): Likewise.
	(vmul_s8): Likewise.
	(vmul_s16): Likewise.
	(vmul_s32): Likewise.
	(vmul_f32): Likewise.
	(vmul_u8): Likewise.
	(vmul_u16): Likewise.
	(vmul_u32): Likewise.
	(vmul_p8): Likewise.
	(vmulq_s8): Likewise.
	(vmulq_s16): Likewise.
	(vmulq_s32): Likewise.
	(vmulq_f32): Likewise.
	(vmulq_u8): Likewise.
	(vmulq_u16): Likewise.
	(vmulq_u32): Likewise.
	(vsub_s8): Likewise.
	(vsub_s16): Likewise.
	(vsub_s32): Likewise.
	(vsub_f32): Likewise.
	(vsub_u8): Likewise.
	(vsub_u16): Likewise.
	(vsub_u32): Likewise.
	(vsub_s64): Likewise.
	(vsub_u64): Likewise.
	(vsubq_s8): Likewise.
	(vsubq_s16): Likewise.
	(vsubq_s32): Likewise.
	(vsubq_s64): Likewise.
	(vsubq_f32): Likewise.
	(vsubq_u8): Likewise.
	(vsubq_u16): Likewise.
	(vsubq_u32): Likewise.
	(vsubq_u64): Likewise.
	(vand_s8): Likewise.
	(vand_s16): Likewise.
	(vand_s32): Likewise.
	(vand_u8): Likewise.
	(vand_u16): Likewise.
	(vand_u32): Likewise.
	(vand_s64): Likewise.
	(vand_u64): Likewise.
	(vandq_s8): Likewise.
	(vandq_s16): Likewise.
	(vandq_s32): Likewise.
	(vandq_s64): Likewise.
	(vandq_u8): Likewise.
	(vandq_u16): Likewise.
	(vandq_u32): Likewise.
	(vandq_u64): Likewise.
	(vorr_s8): Likewise.
	(vorr_s16): Likewise.
	(vorr_s32): Likewise.
	(vorr_u8): Likewise.
	(vorr_u16): Likewise.
	(vorr_u32): Likewise.
	(vorr_s64): Likewise.
	(vorr_u64): Likewise.
	(vorrq_s8): Likewise.
	(vorrq_s16): Likewise.
	(vorrq_s32): Likewise.
	(vorrq_s64): Likewise.
	(vorrq_u8): Likewise.
	(vorrq_u16): Likewise.
	(vorrq_u32): Likewise.
	(vorrq_u64): Likewise.
	(veor_s8): Likewise.
	(veor_s16): Likewise.
	(veor_s32): Likewise.
	(veor_u8): Likewise.
	(veor_u16): Likewise.
	(veor_u32): Likewise.
	(veor_s64): Likewise.
	(veor_u64): Likewise.
	(veorq_s8): Likewise.
	(veorq_s16): Likewise.
	(veorq_s32): Likewise.
	(veorq_s64): Likewise.
	(veorq_u8): Likewise.
	(veorq_u16): Likewise.
	(veorq_u32): Likewise.
	(veorq_u64): Likewise.
	(vbic_s8): Likewise.
	(vbic_s16): Likewise.
	(vbic_s32): Likewise.
	(vbic_u8): Likewise.
	(vbic_u16): Likewise.
	(vbic_u32): Likewise.
	(vbic_s64): Likewise.
	(vbic_u64): Likewise.
	(vbicq_s8): Likewise.
	(vbicq_s16): Likewise.
	(vbicq_s32): Likewise.
	(vbicq_s64): Likewise.
	(vbicq_u8): Likewise.
	(vbicq_u16): Likewise.
	(vbicq_u32): Likewise.
	(vbicq_u64): Likewise.
	(vorn_s8): Likewise.
	(vorn_s16): Likewise.
	(vorn_s32): Likewise.
	(vorn_u8): Likewise.
	(vorn_u16): Likewise.
	(vorn_u32): Likewise.
	(vorn_s64): Likewise.
	(vorn_u64): Likewise.
	(vornq_s8): Likewise.
	(vornq_s16): Likewise.
	(vornq_s32): Likewise.
	(vornq_s64): Likewise.
	(vornq_u8): Likewise.
	(vornq_u16): Likewise.
	(vornq_u32): Likewise.
	(vornq_u64): Likewise.



-- 
Ramana Radhakrishnan
Principal Engineer
ARM Ltd.

[-- Attachment #2: p2.txt --]
[-- Type: text/plain, Size: 31676 bytes --]

commit e0f07fa0f3602af6197e90e9277bfe4d169aecfd
Author: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Date:   Thu Apr 24 15:59:17 2014 +0100

    Replace with GNU C implementation.

diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index 37a6e61..479ec2c 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -453,114 +453,121 @@ typedef struct poly64x2x4_t
 } poly64x2x4_t;
 #endif
 
-
-
+/* vadd  */
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vadd_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vaddv8qi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vadd_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vaddv4hi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vadd_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vaddv2si (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
 vadd_f32 (float32x2_t __a, float32x2_t __b)
 {
-  return (float32x2_t)__builtin_neon_vaddv2sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a + __b;
+#else
+  return (float32x2_t) __builtin_neon_vaddv2sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vadd_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vaddv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vadd_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vaddv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vadd_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vaddv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vadd_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vadddi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vadd_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vadddi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vaddq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vaddv16qi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vaddq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vaddv8hi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vaddq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vaddv4si (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vaddq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vaddv2di (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
 vaddq_f32 (float32x4_t __a, float32x4_t __b)
 {
-  return (float32x4_t)__builtin_neon_vaddv4sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a + __b;
+#else
+  return (float32x4_t) __builtin_neon_vaddv4sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vaddq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vaddv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vaddq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vaddv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vaddq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vaddv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vaddq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vaddv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
@@ -950,91 +957,100 @@ vraddhn_u64 (uint64x2_t __a, uint64x2_t __b)
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vmul_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vmulv8qi (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vmul_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vmulv4hi (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vmul_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vmulv2si (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
 vmul_f32 (float32x2_t __a, float32x2_t __b)
 {
-  return (float32x2_t)__builtin_neon_vmulv2sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a * __b;
+#else
+  return (float32x2_t) __builtin_neon_vmulv2sf (__a, __b, 3);
+#endif
+
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vmul_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vmulv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a * __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vmul_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vmulv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a * __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vmul_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vmulv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
-}
-
-__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
-vmul_p8 (poly8x8_t __a, poly8x8_t __b)
-{
-  return (poly8x8_t)__builtin_neon_vmulv8qi ((int8x8_t) __a, (int8x8_t) __b, 2);
+  return __a * __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vmulq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vmulv16qi (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vmulq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vmulv8hi (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vmulq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vmulv4si (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
 vmulq_f32 (float32x4_t __a, float32x4_t __b)
 {
-  return (float32x4_t)__builtin_neon_vmulv4sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a * __b;
+#else
+  return (float32x4_t) __builtin_neon_vmulv4sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vmulq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vmulv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a * __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vmulq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vmulv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a * __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vmulq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vmulv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a * __b;
+}
+
+__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
+vmul_p8 (poly8x8_t __a, poly8x8_t __b)
+{
+  return (poly8x8_t)__builtin_neon_vmulv8qi ((int8x8_t) __a, (int8x8_t) __b, 2);
 }
 
 __extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
@@ -1521,112 +1537,121 @@ vrndq_f32 (float32x4_t __a)
 }
 
 #endif
+
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vsub_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vsubv8qi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vsub_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vsubv4hi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vsub_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vsubv2si (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
 vsub_f32 (float32x2_t __a, float32x2_t __b)
 {
-  return (float32x2_t)__builtin_neon_vsubv2sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a - __b;
+#else
+  return (float32x2_t) __builtin_neon_vsubv2sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vsub_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vsubv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vsub_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vsubv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vsub_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vsubv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vsub_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vsubdi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vsub_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vsubdi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vsubq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vsubv16qi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vsubq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vsubv8hi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vsubq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vsubv4si (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vsubq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vsubv2di (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
 vsubq_f32 (float32x4_t __a, float32x4_t __b)
 {
-  return (float32x4_t)__builtin_neon_vsubv4sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a - __b;
+#else
+  return (float32x4_t) __builtin_neon_vsubv4sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vsubq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vsubv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vsubq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vsubv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vsubq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vsubv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vsubq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vsubv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
@@ -10907,484 +10932,483 @@ vst4q_lane_p16 (poly16_t * __a, poly16x8x4_t __b, const int __c)
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vand_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vandv8qi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vand_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vandv4hi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vand_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vandv2si (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vand_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vandv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vand_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vandv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vand_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vandv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vand_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vanddi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vand_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vanddi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vandq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vandv16qi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vandq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vandv8hi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vandq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vandv4si (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vandq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vandv2di (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vandq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vandv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vandq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vandv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vandq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vandv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vandq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vandv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vorr_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vorrv8qi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vorr_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vorrv4hi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vorr_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vorrv2si (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vorr_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vorrv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vorr_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vorrv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vorr_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vorrv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vorr_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vorrdi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vorr_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vorrdi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vorrq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vorrv16qi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vorrq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vorrv8hi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vorrq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vorrv4si (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vorrq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vorrv2di (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vorrq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vorrv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vorrq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vorrv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vorrq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vorrv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vorrq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vorrv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 veor_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_veorv8qi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 veor_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_veorv4hi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 veor_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_veorv2si (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 veor_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_veorv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 veor_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_veorv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 veor_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_veorv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 veor_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_veordi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 veor_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_veordi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 veorq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_veorv16qi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 veorq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_veorv8hi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 veorq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_veorv4si (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 veorq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_veorv2di (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 veorq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_veorv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 veorq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_veorv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 veorq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_veorv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 veorq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_veorv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vbic_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vbicv8qi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vbic_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vbicv4hi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vbic_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vbicv2si (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vbic_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vbicv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vbic_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vbicv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vbic_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vbicv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vbic_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vbicdi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vbic_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vbicdi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vbicq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vbicv16qi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vbicq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vbicv8hi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vbicq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vbicv4si (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vbicq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vbicv2di (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vbicq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vbicv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vbicq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vbicv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vbicq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vbicv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vbicq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vbicv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vorn_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vornv8qi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vorn_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vornv4hi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vorn_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vornv2si (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vorn_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vornv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vorn_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vornv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vorn_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vornv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vorn_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vorndi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vorn_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vorndi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vornq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vornv16qi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vornq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vornv8hi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vornq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vornv4si (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vornq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vornv2di (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vornq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vornv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vornq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vornv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vornq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vornv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vornq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vornv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a | ~__b;
 }
 
-
 __extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
 vreinterpret_p8_p16 (poly16x4_t __a)
 {

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Patch ARM 2/3] Remove dead code from backend.
  2014-04-28 10:44 [Patch ARM 0/3] Neon intrinsics TLC - Replace intrinsics with GNU C implementations where possible and remove dead code Ramana Radhakrishnan
  2014-04-28 10:45 ` [Patch ARM 1/3] Neon intrinsics TLC : Replace intrinsics with GNU C implementations where possible Ramana Radhakrishnan
@ 2014-04-28 10:47 ` Ramana Radhakrishnan
  2014-04-28 10:52 ` [Patch ARM 3/3] Neon intrinsics TLC - Remove unneeded ML " Ramana Radhakrishnan
  2 siblings, 0 replies; 9+ messages in thread
From: Ramana Radhakrishnan @ 2014-04-28 10:47 UTC (permalink / raw)
  To: gcc-patches, gcc-patches; +Cc: Christophe Lyon

[-- Attachment #1: Type: text/plain, Size: 991 bytes --]

This then left us in the happy position of being able to delete code,
but I was worried about LTO streaming, as these "builtins" are
essentially streamed out in LTO object code format. However, since we
make no promises about LTO compatibility across releases, that is safe;
still, I structured the dead code elimination as Patch 2/3.

This will be committed separately in case folks want to backport Patch
1/3 on its own and want to assure their users of LTO compatibility
within a release branch (if that even works).



<DATE>  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

	* config/arm/arm_neon_builtins.def (vadd, vsub): Only define
	the v2sf and v4sf versions.
	(vand, vorr, veor, vorn, vbic): Remove.
	* config/arm/neon.md (neon_vadd, neon_vsub, neon_vadd_unspec,
	neon_vsub_unspec): Adjust iterator.
	(neon_vorr, neon_vand, neon_vbic, neon_veor, neon_vorn): Remove.

-- 
Ramana Radhakrishnan
Principal Engineer
ARM Ltd.

[-- Attachment #2: p3.txt --]
[-- Type: text/plain, Size: 6426 bytes --]

commit dad8586bd8c799ad26b0c7ee6e1837b50b9ef9a3
Author: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Date:   Thu Apr 24 16:00:08 2014 +0100

    Remove dead code.

diff --git a/gcc/config/arm/arm_neon_builtins.def b/gcc/config/arm/arm_neon_builtins.def
index a00951a..85215b5 100644
--- a/gcc/config/arm/arm_neon_builtins.def
+++ b/gcc/config/arm/arm_neon_builtins.def
@@ -18,8 +18,7 @@
    along with GCC; see the file COPYING3.  If not see
    <http://www.gnu.org/licenses/>.  */
 
-VAR10 (BINOP, vadd,
-	v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di),
+VAR2 (BINOP, vadd, v2sf, v4sf),
 VAR3 (BINOP, vaddl, v8qi, v4hi, v2si),
 VAR3 (BINOP, vaddw, v8qi, v4hi, v2si),
 VAR6 (BINOP, vhadd, v8qi, v4hi, v2si, v16qi, v8hi, v4si),
@@ -54,7 +53,7 @@ VAR8 (SHIFTIMM, vqshl_n, v8qi, v4hi, v2si, di, v16qi, v8hi, v4si, v2di),
 VAR8 (SHIFTIMM, vqshlu_n, v8qi, v4hi, v2si, di, v16qi, v8hi, v4si, v2di),
 VAR3 (SHIFTIMM, vshll_n, v8qi, v4hi, v2si),
 VAR8 (SHIFTACC, vsra_n, v8qi, v4hi, v2si, di, v16qi, v8hi, v4si, v2di),
-VAR10 (BINOP, vsub, v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di),
+VAR2 (BINOP, vsub, v2sf, v4sf),
 VAR3 (BINOP, vsubl, v8qi, v4hi, v2si),
 VAR3 (BINOP, vsubw, v8qi, v4hi, v2si),
 VAR8 (BINOP, vqsub, v8qi, v4hi, v2si, di, v16qi, v8hi, v4si, v2di),
@@ -199,14 +198,4 @@ VAR5 (LOADSTRUCT, vld4_dup, v8qi, v4hi, v2si, v2sf, di),
 VAR9 (STORESTRUCT, vst4,
 	v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf),
 VAR7 (STORESTRUCTLANE, vst4_lane,
-	v8qi, v4hi, v2si, v2sf, v8hi, v4si, v4sf),
-VAR10 (LOGICBINOP, vand,
-	 v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di),
-VAR10 (LOGICBINOP, vorr,
-	 v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di),
-VAR10 (BINOP, veor,
-	 v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di),
-VAR10 (LOGICBINOP, vbic,
-	 v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di),
-VAR10 (LOGICBINOP, vorn,
-	 v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di)
+	v8qi, v4hi, v2si, v2sf, v8hi, v4si, v4sf)
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index aad420c..9ac393b 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -1842,9 +1842,9 @@
 ; good for plain vadd, vaddq.
 
 (define_expand "neon_vadd<mode>"
-  [(match_operand:VDQX 0 "s_register_operand" "=w")
-   (match_operand:VDQX 1 "s_register_operand" "w")
-   (match_operand:VDQX 2 "s_register_operand" "w")
+  [(match_operand:VCVTF 0 "s_register_operand" "=w")
+   (match_operand:VCVTF 1 "s_register_operand" "w")
+   (match_operand:VCVTF 2 "s_register_operand" "w")
    (match_operand:SI 3 "immediate_operand" "i")]
   "TARGET_NEON"
 {
@@ -1869,9 +1869,9 @@
 ; Used for intrinsics when flag_unsafe_math_optimizations is false.
 
 (define_insn "neon_vadd<mode>_unspec"
-  [(set (match_operand:VDQX 0 "s_register_operand" "=w")
-        (unspec:VDQX [(match_operand:VDQX 1 "s_register_operand" "w")
-		      (match_operand:VDQX 2 "s_register_operand" "w")]
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (unspec:VCVTF [(match_operand:VCVTF 1 "s_register_operand" "w")
+		      (match_operand:VCVTF 2 "s_register_operand" "w")]
                      UNSPEC_VADD))]
   "TARGET_NEON"
   "vadd.<V_if_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
@@ -2132,9 +2132,9 @@
 )
 
 (define_expand "neon_vsub<mode>"
-  [(match_operand:VDQX 0 "s_register_operand" "=w")
-   (match_operand:VDQX 1 "s_register_operand" "w")
-   (match_operand:VDQX 2 "s_register_operand" "w")
+  [(match_operand:VCVTF 0 "s_register_operand" "=w")
+   (match_operand:VCVTF 1 "s_register_operand" "w")
+   (match_operand:VCVTF 2 "s_register_operand" "w")
    (match_operand:SI 3 "immediate_operand" "i")]
   "TARGET_NEON"
 {
@@ -2149,9 +2149,9 @@
 ; Used for intrinsics when flag_unsafe_math_optimizations is false.
 
 (define_insn "neon_vsub<mode>_unspec"
-  [(set (match_operand:VDQX 0 "s_register_operand" "=w")
-        (unspec:VDQX [(match_operand:VDQX 1 "s_register_operand" "w")
-		      (match_operand:VDQX 2 "s_register_operand" "w")]
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (unspec:VCVTF [(match_operand:VCVTF 1 "s_register_operand" "w")
+		      (match_operand:VCVTF 2 "s_register_operand" "w")]
                      UNSPEC_VSUB))]
   "TARGET_NEON"
   "vsub.<V_if_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
@@ -5357,61 +5357,6 @@
   [(set_attr "type" "neon_store4_4reg<q>")]
 )
 
-(define_expand "neon_vand<mode>"
-  [(match_operand:VDQX 0 "s_register_operand" "")
-   (match_operand:VDQX 1 "s_register_operand" "")
-   (match_operand:VDQX 2 "neon_inv_logic_op2" "")
-   (match_operand:SI 3 "immediate_operand" "")]
-  "TARGET_NEON"
-{
-  emit_insn (gen_and<mode>3 (operands[0], operands[1], operands[2]));
-  DONE;
-})
-
-(define_expand "neon_vorr<mode>"
-  [(match_operand:VDQX 0 "s_register_operand" "")
-   (match_operand:VDQX 1 "s_register_operand" "")
-   (match_operand:VDQX 2 "neon_logic_op2" "")
-   (match_operand:SI 3 "immediate_operand" "")]
-  "TARGET_NEON"
-{
-  emit_insn (gen_ior<mode>3 (operands[0], operands[1], operands[2]));
-  DONE;
-})
-
-(define_expand "neon_veor<mode>"
-  [(match_operand:VDQX 0 "s_register_operand" "")
-   (match_operand:VDQX 1 "s_register_operand" "")
-   (match_operand:VDQX 2 "s_register_operand" "")
-   (match_operand:SI 3 "immediate_operand" "")]
-  "TARGET_NEON"
-{
-  emit_insn (gen_xor<mode>3 (operands[0], operands[1], operands[2]));
-  DONE;
-})
-
-(define_expand "neon_vbic<mode>"
-  [(match_operand:VDQX 0 "s_register_operand" "")
-   (match_operand:VDQX 1 "s_register_operand" "")
-   (match_operand:VDQX 2 "neon_logic_op2" "")
-   (match_operand:SI 3 "immediate_operand" "")]
-  "TARGET_NEON"
-{
-  emit_insn (gen_bic<mode>3_neon (operands[0], operands[1], operands[2]));
-  DONE;
-})
-
-(define_expand "neon_vorn<mode>"
-  [(match_operand:VDQX 0 "s_register_operand" "")
-   (match_operand:VDQX 1 "s_register_operand" "")
-   (match_operand:VDQX 2 "neon_inv_logic_op2" "")
-   (match_operand:SI 3 "immediate_operand" "")]
-  "TARGET_NEON"
-{
-  emit_insn (gen_orn<mode>3_neon (operands[0], operands[1], operands[2]));
-  DONE;
-})
-
 (define_insn "neon_vec_unpack<US>_lo_<mode>"
   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
         (SE:<V_unpack> (vec_select:<V_HALF>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Patch ARM 3/3] Neon intrinsics TLC - Remove unneeded ML from backend.
  2014-04-28 10:44 [Patch ARM 0/3] Neon intrinsics TLC - Replace intrinsics with GNU C implementations where possible and remove dead code Ramana Radhakrishnan
  2014-04-28 10:45 ` [Patch ARM 1/3] Neon intrinsics TLC : Replace intrinsics with GNU C implementations where possible Ramana Radhakrishnan
  2014-04-28 10:47 ` [Patch ARM 2/3] Remove dead code from backend Ramana Radhakrishnan
@ 2014-04-28 10:52 ` Ramana Radhakrishnan
  2014-04-28 11:01   ` Ramana Radhakrishnan
  2 siblings, 1 reply; 9+ messages in thread
From: Ramana Radhakrishnan @ 2014-04-28 10:52 UTC (permalink / raw)
  To: gcc-patches, gcc-patches; +Cc: Christophe Lyon

[-- Attachment #1: Type: text/plain, Size: 1130 bytes --]

Patch 3/3 removes the ML used to generate the Neon intrinsics and their
documentation, and updates the comments in the files to note that these
are now hand-crafted rather than auto-generated. We've had these for
many years now and I think it's time we got rid of this. Not everyone
groks ML, and it doesn't help that only one or two folks can actually do
this properly every time. Instead of keeping these bottlenecks, and given
that the intrinsics are pretty stable now, there's no point in
retaining the generator interface. I'd rather get rid of them. The only
bits left are neon-schedgen.ml, neon.ml and neon-testgen.ml. I think we
can safely remove neon-testgen.ml once Christophe's testsuite is done,
and we'll probably just have to carry neon-schedgen.ml / neon.ml as they
still generate the neon descriptions for both a8 and a9.


<DATE>  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

	* config/arm/arm_neon.h: Update comment.
	* config/arm/neon-docgen.ml: Delete.
	* config/arm/neon-gen.ml: Delete.
	* doc/arm-neon-intrinsics.texi: Update comment.

-- 
Ramana Radhakrishnan
Principal Engineer
ARM Ltd.

[-- Attachment #2: p4.txt --]
[-- Type: text/plain, Size: 40235 bytes --]

From 9382d4c22ceb555fc74d8c90c75e6ce47faaffe0 Mon Sep 17 00:00:00 2001
From: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Date: Thu, 24 Apr 2014 10:11:48 +0100
Subject: [PATCH 3/3] We have now reached the point where both neon-gen.ml
 and neon-docgen.ml are obsolete and are a pain to maintain given the
 number of bespoke hand-crafted changes to arm_neon.h.

    Given this, there is no point in keeping them in the source tree.

    neon-testgen.ml is on its last legs and if clyon's work in getting
    the neon execute tests in is completed, we will remove all of
    gcc.target/arm/neon and neon-testgen.ml.

    Ramana
---
 gcc/config/arm/arm_neon.h        |   3 +-
 gcc/config/arm/neon-docgen.ml    | 424 -------------------------------
 gcc/config/arm/neon-gen.ml       | 520 ---------------------------------------
 gcc/doc/arm-neon-intrinsics.texi |   2 -
 4 files changed, 1 insertion(+), 948 deletions(-)
 delete mode 100644 gcc/config/arm/neon-docgen.ml
 delete mode 100644 gcc/config/arm/neon-gen.ml

diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index e146369..564e46b 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -1,5 +1,4 @@
-/* ARM NEON intrinsics include file. This file is generated automatically
-   using neon-gen.ml.  Please do not edit manually.
+/* ARM NEON intrinsics include file.
 
    Copyright (C) 2006-2014 Free Software Foundation, Inc.
    Contributed by CodeSourcery.
diff --git a/gcc/config/arm/neon-docgen.ml b/gcc/config/arm/neon-docgen.ml
deleted file mode 100644
index 5788a53..0000000
--- a/gcc/config/arm/neon-docgen.ml
+++ /dev/null
@@ -1,424 +0,0 @@
-(* ARM NEON documentation generator.
-
-   Copyright (C) 2006-2014 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
-
-   This file is part of GCC.
-
-   GCC is free software; you can redistribute it and/or modify it under
-   the terms of the GNU General Public License as published by the Free
-   Software Foundation; either version 3, or (at your option) any later
-   version.
-
-   GCC is distributed in the hope that it will be useful, but WITHOUT ANY
-   WARRANTY; without even the implied warranty of MERCHANTABILITY or
-   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
-   for more details.
-
-   You should have received a copy of the GNU General Public License
-   along with GCC; see the file COPYING3.  If not see
-   <http://www.gnu.org/licenses/>.
-
-   This is an O'Caml program.  The O'Caml compiler is available from:
-
-     http://caml.inria.fr/
-
-   Or from your favourite OS's friendly packaging system. Tested with version
-   3.09.2, though other versions will probably work too.
-
-   Compile with:
-     ocamlc -c neon.ml
-     ocamlc -o neon-docgen neon.cmo neon-docgen.ml
-
-   Run with:
-     /path/to/neon-docgen /path/to/gcc/doc/arm-neon-intrinsics.texi
-*)
-
-open Neon
-
-(* The combined "ops" and "reinterp" table.  *)
-let ops_reinterp = reinterp @ ops
-
-(* Helper functions for extracting things from the "ops" table.  *)
-let single_opcode desired_opcode () =
-  List.fold_left (fun got_so_far ->
-                  fun row ->
-                    match row with
-                      (opcode, _, _, _, _, _) ->
-                        if opcode = desired_opcode then row :: got_so_far
-                                                   else got_so_far
-                 ) [] ops_reinterp
-
-let multiple_opcodes desired_opcodes () =
-  List.fold_left (fun got_so_far ->
-                  fun desired_opcode ->
-                    (single_opcode desired_opcode ()) @ got_so_far)
-                 [] desired_opcodes
-
-let ldx_opcode number () =
-  List.fold_left (fun got_so_far ->
-                  fun row ->
-                    match row with
-                      (opcode, _, _, _, _, _) ->
-                        match opcode with
-                          Vldx n | Vldx_lane n | Vldx_dup n when n = number ->
-                            row :: got_so_far
-                          | _ -> got_so_far
-                 ) [] ops_reinterp
-
-let stx_opcode number () =
-  List.fold_left (fun got_so_far ->
-                  fun row ->
-                    match row with
-                      (opcode, _, _, _, _, _) ->
-                        match opcode with
-                          Vstx n | Vstx_lane n when n = number ->
-                            row :: got_so_far
-                          | _ -> got_so_far
-                 ) [] ops_reinterp
-
-let tbl_opcode () =
-  List.fold_left (fun got_so_far ->
-                  fun row ->
-                    match row with
-                      (opcode, _, _, _, _, _) ->
-                        match opcode with
-                          Vtbl _ -> row :: got_so_far
-                          | _ -> got_so_far
-                 ) [] ops_reinterp
-
-let tbx_opcode () =
-  List.fold_left (fun got_so_far ->
-                  fun row ->
-                    match row with
-                      (opcode, _, _, _, _, _) ->
-                        match opcode with
-                          Vtbx _ -> row :: got_so_far
-                          | _ -> got_so_far
-                 ) [] ops_reinterp
-
-(* The groups of intrinsics.  *)
-let intrinsic_groups =
-  [ "Addition", single_opcode Vadd;
-    "Multiplication", single_opcode Vmul;
-    "Multiply-accumulate", single_opcode Vmla;
-    "Multiply-subtract", single_opcode Vmls;
-    "Fused-multiply-accumulate", single_opcode Vfma;
-    "Fused-multiply-subtract", single_opcode Vfms;
-    "Round to integral (to nearest, ties to even)", single_opcode Vrintn;
-    "Round to integral (to nearest, ties away from zero)", single_opcode Vrinta;
-    "Round to integral (towards +Inf)", single_opcode Vrintp;
-    "Round to integral (towards -Inf)", single_opcode Vrintm;
-    "Round to integral (towards 0)", single_opcode Vrintz;
-    "Subtraction", single_opcode Vsub;
-    "Comparison (equal-to)", single_opcode Vceq;
-    "Comparison (greater-than-or-equal-to)", single_opcode Vcge;
-    "Comparison (less-than-or-equal-to)", single_opcode Vcle;
-    "Comparison (greater-than)", single_opcode Vcgt;
-    "Comparison (less-than)", single_opcode Vclt;
-    "Comparison (absolute greater-than-or-equal-to)", single_opcode Vcage;
-    "Comparison (absolute less-than-or-equal-to)", single_opcode Vcale;
-    "Comparison (absolute greater-than)", single_opcode Vcagt;
-    "Comparison (absolute less-than)", single_opcode Vcalt;
-    "Test bits", single_opcode Vtst;
-    "Absolute difference", single_opcode Vabd;
-    "Absolute difference and accumulate", single_opcode Vaba;
-    "Maximum", single_opcode Vmax;
-    "Minimum", single_opcode Vmin;
-    "Pairwise add", single_opcode Vpadd;
-    "Pairwise add, single_opcode widen and accumulate", single_opcode Vpada;
-    "Folding maximum", single_opcode Vpmax;
-    "Folding minimum", single_opcode Vpmin;
-    "Reciprocal step", multiple_opcodes [Vrecps; Vrsqrts];
-    "Vector shift left", single_opcode Vshl;
-    "Vector shift left by constant", single_opcode Vshl_n;
-    "Vector shift right by constant", single_opcode Vshr_n;
-    "Vector shift right by constant and accumulate", single_opcode Vsra_n;
-    "Vector shift right and insert", single_opcode Vsri;
-    "Vector shift left and insert", single_opcode Vsli;
-    "Absolute value", single_opcode Vabs;
-    "Negation", single_opcode Vneg;
-    "Bitwise not", single_opcode Vmvn;
-    "Count leading sign bits", single_opcode Vcls;
-    "Count leading zeros", single_opcode Vclz;
-    "Count number of set bits", single_opcode Vcnt;
-    "Reciprocal estimate", single_opcode Vrecpe;
-    "Reciprocal square-root estimate", single_opcode Vrsqrte;
-    "Get lanes from a vector", single_opcode Vget_lane;
-    "Set lanes in a vector", single_opcode Vset_lane;
-    "Create vector from literal bit pattern", single_opcode Vcreate;
-    "Set all lanes to the same value",
-      multiple_opcodes [Vdup_n; Vmov_n; Vdup_lane];
-    "Combining vectors", single_opcode Vcombine;
-    "Splitting vectors", multiple_opcodes [Vget_high; Vget_low];
-    "Conversions", multiple_opcodes [Vcvt; Vcvt_n];
-    "Move, single_opcode narrowing", single_opcode Vmovn;
-    "Move, single_opcode long", single_opcode Vmovl;
-    "Table lookup", tbl_opcode;
-    "Extended table lookup", tbx_opcode;
-    "Multiply, lane", single_opcode Vmul_lane;
-    "Long multiply, lane", single_opcode Vmull_lane;
-    "Saturating doubling long multiply, lane", single_opcode Vqdmull_lane;
-    "Saturating doubling multiply high, lane", single_opcode Vqdmulh_lane;
-    "Multiply-accumulate, lane", single_opcode Vmla_lane;
-    "Multiply-subtract, lane", single_opcode Vmls_lane;
-    "Vector multiply by scalar", single_opcode Vmul_n;
-    "Vector long multiply by scalar", single_opcode Vmull_n;
-    "Vector saturating doubling long multiply by scalar",
-      single_opcode Vqdmull_n;
-    "Vector saturating doubling multiply high by scalar",
-      single_opcode Vqdmulh_n;
-    "Vector multiply-accumulate by scalar", single_opcode Vmla_n;
-    "Vector multiply-subtract by scalar", single_opcode Vmls_n;
-    "Vector extract", single_opcode Vext;
-    "Reverse elements", multiple_opcodes [Vrev64; Vrev32; Vrev16];
-    "Bit selection", single_opcode Vbsl;
-    "Transpose elements", single_opcode Vtrn;
-    "Zip elements", single_opcode Vzip;
-    "Unzip elements", single_opcode Vuzp;
-    "Element/structure loads, VLD1 variants", ldx_opcode 1;
-    "Element/structure stores, VST1 variants", stx_opcode 1;
-    "Element/structure loads, VLD2 variants", ldx_opcode 2;
-    "Element/structure stores, VST2 variants", stx_opcode 2;
-    "Element/structure loads, VLD3 variants", ldx_opcode 3;
-    "Element/structure stores, VST3 variants", stx_opcode 3;
-    "Element/structure loads, VLD4 variants", ldx_opcode 4;
-    "Element/structure stores, VST4 variants", stx_opcode 4;
-    "Logical operations (AND)", single_opcode Vand;
-    "Logical operations (OR)", single_opcode Vorr;
-    "Logical operations (exclusive OR)", single_opcode Veor;
-    "Logical operations (AND-NOT)", single_opcode Vbic;
-    "Logical operations (OR-NOT)", single_opcode Vorn;
-    "Reinterpret casts", single_opcode Vreinterp ]
-
-(* Given an intrinsic shape, produce a string to document the corresponding
-   operand shapes.  *)
-let rec analyze_shape shape =
-  let rec n_things n thing =
-    match n with
-      0 -> []
-    | n -> thing :: (n_things (n - 1) thing)
-  in
-  let rec analyze_shape_elt reg_no elt =
-    match elt with
-      Dreg -> "@var{d" ^ (string_of_int reg_no) ^ "}"
-    | Qreg -> "@var{q" ^ (string_of_int reg_no) ^ "}"
-    | Corereg -> "@var{r" ^ (string_of_int reg_no) ^ "}"
-    | Immed -> "#@var{0}"
-    | VecArray (1, elt) ->
-        let elt_regexp = analyze_shape_elt 0 elt in
-          "@{" ^ elt_regexp ^ "@}"
-    | VecArray (n, elt) ->
-      let rec f m =
-        match m with
-          0 -> []
-        | m -> (analyze_shape_elt (m - 1) elt) :: (f (m - 1))
-      in
-      let ops = List.rev (f n) in
-        "@{" ^ (commas (fun x -> x) ops "") ^ "@}"
-    | (PtrTo elt | CstPtrTo elt) ->
-      "[" ^ (analyze_shape_elt reg_no elt) ^ "]"
-    | Element_of_dreg -> (analyze_shape_elt reg_no Dreg) ^ "[@var{0}]"
-    | Element_of_qreg -> (analyze_shape_elt reg_no Qreg) ^ "[@var{0}]"
-    | All_elements_of_dreg -> (analyze_shape_elt reg_no Dreg) ^ "[]"
-    | Alternatives alts -> (analyze_shape_elt reg_no (List.hd alts))
-  in
-    match shape with
-      All (n, elt) -> commas (analyze_shape_elt 0) (n_things n elt) ""
-    | Long -> (analyze_shape_elt 0 Qreg) ^ ", " ^ (analyze_shape_elt 0 Dreg) ^
-              ", " ^ (analyze_shape_elt 0 Dreg)
-    | Long_noreg elt -> (analyze_shape_elt 0 elt) ^ ", " ^
-              (analyze_shape_elt 0 elt)
-    | Wide -> (analyze_shape_elt 0 Qreg) ^ ", " ^ (analyze_shape_elt 0 Qreg) ^
-              ", " ^ (analyze_shape_elt 0 Dreg)
-    | Wide_noreg elt -> analyze_shape (Long_noreg elt)
-    | Narrow -> (analyze_shape_elt 0 Dreg) ^ ", " ^ (analyze_shape_elt 0 Qreg) ^
-                ", " ^ (analyze_shape_elt 0 Qreg)
-    | Use_operands elts -> commas (analyze_shape_elt 0) (Array.to_list elts) ""
-    | By_scalar Dreg ->
-        analyze_shape (Use_operands [| Dreg; Dreg; Element_of_dreg |])
-    | By_scalar Qreg ->
-        analyze_shape (Use_operands [| Qreg; Qreg; Element_of_dreg |])
-    | By_scalar _ -> assert false
-    | Wide_lane ->
-        analyze_shape (Use_operands [| Qreg; Dreg; Element_of_dreg |])
-    | Wide_scalar ->
-        analyze_shape (Use_operands [| Qreg; Dreg; Element_of_dreg |])
-    | Pair_result elt ->
-      let elt_regexp = analyze_shape_elt 0 elt in
-      let elt_regexp' = analyze_shape_elt 1 elt in
-        elt_regexp ^ ", " ^ elt_regexp'
-    | Unary_scalar _ -> "FIXME Unary_scalar"
-    | Binary_imm elt -> analyze_shape (Use_operands [| elt; elt; Immed |])
-    | Narrow_imm -> analyze_shape (Use_operands [| Dreg; Qreg; Immed |])
-    | Long_imm -> analyze_shape (Use_operands [| Qreg; Dreg; Immed |])
-
-(* Document a single intrinsic.  *)
-let describe_intrinsic first chan
-                       (elt_ty, (_, features, shape, name, munge, _)) =
-  let c_arity, new_elt_ty = munge shape elt_ty in
-  let c_types = strings_of_arity c_arity in
-  Printf.fprintf chan "@itemize @bullet\n";
-  let item_code = if first then "@item" else "@itemx" in
-    Printf.fprintf chan "%s %s %s_%s (" item_code (List.hd c_types)
-                   (intrinsic_name name) (string_of_elt elt_ty);
-    Printf.fprintf chan "%s)\n" (commas (fun ty -> ty) (List.tl c_types) "");
-    if not (List.exists (fun feature -> feature = No_op) features) then
-    begin
-      let print_one_insn name =
-        Printf.fprintf chan "@code{";
-        let no_suffix = (new_elt_ty = NoElts) in
-        let name_with_suffix =
-          if no_suffix then name
-          else name ^ "." ^ (string_of_elt_dots new_elt_ty)
-        in
-        let possible_operands = analyze_all_shapes features shape
-                                                   analyze_shape
-        in
-	let rec print_one_possible_operand op =
-	  Printf.fprintf chan "%s %s}" name_with_suffix op
-        in
-          (* If the intrinsic expands to multiple instructions, we assume
-             they are all of the same form.  *)
-          print_one_possible_operand (List.hd possible_operands)
-      in
-      let rec print_insns names =
-        match names with
-          [] -> ()
-        | [name] -> print_one_insn name
-        | name::names -> (print_one_insn name;
-                          Printf.fprintf chan " @emph{or} ";
-                          print_insns names)
-      in
-      let insn_names = get_insn_names features name in
-        Printf.fprintf chan "@*@emph{Form of expected instruction(s):} ";
-        print_insns insn_names;
-        Printf.fprintf chan "\n"
-    end;
-    Printf.fprintf chan "@end itemize\n";
-    Printf.fprintf chan "\n\n"
-
-(* Document a group of intrinsics.  *)
-let document_group chan (group_title, group_extractor) =
-  (* Extract the rows in question from the ops table and then turn them
-     into a list of intrinsics.  *)
-  let intrinsics =
-    List.fold_left (fun got_so_far ->
-                    fun row ->
-                      match row with
-                        (_, _, _, _, _, elt_tys) ->
-                          List.fold_left (fun got_so_far' ->
-                                          fun elt_ty ->
-                                            (elt_ty, row) :: got_so_far')
-                                         got_so_far elt_tys
-                   ) [] (group_extractor ())
-  in
-    (* Emit the title for this group.  *)
-    Printf.fprintf chan "@subsubsection %s\n\n" group_title;
-    (* Emit a description of each intrinsic.  *)
-    List.iter (describe_intrinsic true chan) intrinsics;
-    (* Close this group.  *)
-    Printf.fprintf chan "\n\n"
-
-let gnu_header chan =
-  List.iter (fun s -> Printf.fprintf chan "%s\n" s) [
-  "@c Copyright (C) 2006-2014 Free Software Foundation, Inc.";
-  "@c This is part of the GCC manual.";
-  "@c For copying conditions, see the file gcc.texi.";
-  "";
-  "@c This file is generated automatically using gcc/config/arm/neon-docgen.ml";
-  "@c Please do not edit manually."]
-
-let crypto_doc =
-"
-@itemize @bullet
-@item poly128_t vldrq_p128(poly128_t const *)
-@end itemize
-
-@itemize @bullet
-@item void vstrq_p128(poly128_t *, poly128_t)
-@end itemize
-
-@itemize @bullet
-@item uint64x1_t vceq_p64 (poly64x1_t, poly64x1_t)
-@end itemize
-
-@itemize @bullet
-@item uint64x1_t vtst_p64 (poly64x1_t, poly64x1_t)
-@end itemize
-
-@itemize @bullet
-@item uint32_t vsha1h_u32 (uint32_t)
-@*@emph{Form of expected instruction(s):} @code{sha1h.32 @var{q0}, @var{q1}}
-@end itemize
-
-@itemize @bullet
-@item uint32x4_t vsha1cq_u32 (uint32x4_t, uint32_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha1c.32 @var{q0}, @var{q1}, @var{q2}}
-@end itemize
-
-@itemize @bullet
-@item uint32x4_t vsha1pq_u32 (uint32x4_t, uint32_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha1p.32 @var{q0}, @var{q1}, @var{q2}}
-@end itemize
-
-@itemize @bullet
-@item uint32x4_t vsha1mq_u32 (uint32x4_t, uint32_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha1m.32 @var{q0}, @var{q1}, @var{q2}}
-@end itemize
-
-@itemize @bullet
-@item uint32x4_t vsha1su0q_u32 (uint32x4_t, uint32x4_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha1su0.32 @var{q0}, @var{q1}, @var{q2}}
-@end itemize
-
-@itemize @bullet
-@item uint32x4_t vsha1su1q_u32 (uint32x4_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha1su1.32 @var{q0}, @var{q1}, @var{q2}}
-@end itemize
-
-@itemize @bullet
-@item uint32x4_t vsha256hq_u32 (uint32x4_t, uint32x4_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha256h.32 @var{q0}, @var{q1}, @var{q2}}
-@end itemize
- 
-@itemize @bullet
-@item uint32x4_t vsha256h2q_u32 (uint32x4_t, uint32x4_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha256h2.32 @var{q0}, @var{q1}, @var{q2}}
-@end itemize
- 
-@itemize @bullet
-@item uint32x4_t vsha256su0q_u32 (uint32x4_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha256su0.32 @var{q0}, @var{q1}}
-@end itemize
- 
-@itemize @bullet
-@item uint32x4_t vsha256su1q_u32 (uint32x4_t, uint32x4_t, uint32x4_t)
-@*@emph{Form of expected instruction(s):} @code{sha256su1.32 @var{q0}, @var{q1}, @var{q2}}
-@end itemize
-
-@itemize @bullet
-@item poly128_t vmull_p64 (poly64_t a, poly64_t b)
-@*@emph{Form of expected instruction(s):} @code{vmull.p64 @var{q0}, @var{d1}, @var{d2}}
-@end itemize
-
-@itemize @bullet
-@item poly128_t vmull_high_p64 (poly64x2_t a, poly64x2_t b)
-@*@emph{Form of expected instruction(s):} @code{vmull.p64 @var{q0}, @var{d1}, @var{d2}}
-@end itemize
-"
-
-(* Program entry point.  *)
-let _ =
-  if Array.length Sys.argv <> 2 then
-    failwith "Usage: neon-docgen <output filename>"
-  else
-  let file = Sys.argv.(1) in
-    try
-      let chan = open_out file in
-        gnu_header chan;
-        List.iter (document_group chan) intrinsic_groups;
-        Printf.fprintf chan "%s\n" crypto_doc;
-        close_out chan
-    with Sys_error sys ->
-      failwith ("Could not create output file " ^ file ^ ": " ^ sys)
diff --git a/gcc/config/arm/neon-gen.ml b/gcc/config/arm/neon-gen.ml
deleted file mode 100644
index f3dd86b..0000000
--- a/gcc/config/arm/neon-gen.ml
+++ /dev/null
@@ -1,520 +0,0 @@
-(* Auto-generate ARM Neon intrinsics header file.
-   Copyright (C) 2006-2014 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
-
-   This file is part of GCC.
-
-   GCC is free software; you can redistribute it and/or modify it under
-   the terms of the GNU General Public License as published by the Free
-   Software Foundation; either version 3, or (at your option) any later
-   version.
-
-   GCC is distributed in the hope that it will be useful, but WITHOUT ANY
-   WARRANTY; without even the implied warranty of MERCHANTABILITY or
-   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
-   for more details.
-
-   You should have received a copy of the GNU General Public License
-   along with GCC; see the file COPYING3.  If not see
-   <http://www.gnu.org/licenses/>.
-
-   This is an O'Caml program.  The O'Caml compiler is available from:
-
-     http://caml.inria.fr/
-
-   Or from your favourite OS's friendly packaging system. Tested with version
-   3.09.2, though other versions will probably work too.
-
-   Compile with:
-     ocamlc -c neon.ml
-     ocamlc -o neon-gen neon.cmo neon-gen.ml
-
-   Run with:
-     ./neon-gen > arm_neon.h
-*)
-
-open Neon
-
-(* The format codes used in the following functions are documented at:
-     http://caml.inria.fr/pub/docs/manual-ocaml/libref/Format.html\
-     #6_printflikefunctionsforprettyprinting
-   (one line, remove the backslash.)
-*)
-
-(* Following functions can be used to approximate GNU indentation style.  *)
-let start_function () =
-  Format.printf "@[<v 0>";
-  ref 0
-
-let end_function nesting =
-  match !nesting with
-    0 -> Format.printf "@;@;@]"
-  | _ -> failwith ("Bad nesting (ending function at level "
-                   ^ (string_of_int !nesting) ^ ")")
-
-let open_braceblock nesting =
-  begin match !nesting with
-    0 -> Format.printf "@,@<0>{@[<v 2>@,"
-  | _ -> Format.printf "@,@[<v 2>  @<0>{@[<v 2>@,"
-  end;
-  incr nesting
-
-let close_braceblock nesting =
-  decr nesting;
-  match !nesting with
-    0 -> Format.printf "@]@,@<0>}"
-  | _ -> Format.printf "@]@,@<0>}@]"
-
-let print_function arity fnname body =
-  let ffmt = start_function () in
-  Format.printf "__extension__ static __inline ";
-  let inl = "__attribute__ ((__always_inline__))" in
-  begin match arity with
-    Arity0 ret ->
-      Format.printf "%s %s@,%s (void)" (string_of_vectype ret) inl fnname
-  | Arity1 (ret, arg0) ->
-      Format.printf "%s %s@,%s (%s __a)" (string_of_vectype ret) inl fnname
-                                        (string_of_vectype arg0)
-  | Arity2 (ret, arg0, arg1) ->
-      Format.printf "%s %s@,%s (%s __a, %s __b)"
-        (string_of_vectype ret) inl fnname (string_of_vectype arg0)
-	(string_of_vectype arg1)
-  | Arity3 (ret, arg0, arg1, arg2) ->
-      Format.printf "%s %s@,%s (%s __a, %s __b, %s __c)"
-        (string_of_vectype ret) inl fnname (string_of_vectype arg0)
-	(string_of_vectype arg1) (string_of_vectype arg2)
-  | Arity4 (ret, arg0, arg1, arg2, arg3) ->
-      Format.printf "%s %s@,%s (%s __a, %s __b, %s __c, %s __d)"
-        (string_of_vectype ret) inl fnname (string_of_vectype arg0)
-	(string_of_vectype arg1) (string_of_vectype arg2)
-        (string_of_vectype arg3)
-  end;
-  open_braceblock ffmt;
-  let rec print_lines = function
-    []       -> ()
-  | "" :: lines -> print_lines lines
-  | [line] -> Format.printf "%s" line
-  | line::lines -> Format.printf "%s@," line ; print_lines lines in
-  print_lines body;
-  close_braceblock ffmt;
-  end_function ffmt
-
-let union_string num elts base =
-  let itype = inttype_for_array num elts in
-  let iname = string_of_inttype itype
-  and sname = string_of_vectype (T_arrayof (num, elts)) in
-  Printf.sprintf "union { %s __i; %s __o; } %s" sname iname base
-
-let rec signed_ctype = function
-    T_uint8x8 | T_poly8x8 -> T_int8x8
-  | T_uint8x16 | T_poly8x16 -> T_int8x16
-  | T_uint16x4 | T_poly16x4 -> T_int16x4
-  | T_uint16x8 | T_poly16x8 -> T_int16x8
-  | T_uint32x2 -> T_int32x2
-  | T_uint32x4 -> T_int32x4
-  | T_uint64x1 -> T_int64x1
-  | T_uint64x2 -> T_int64x2
-  | T_poly64x2 -> T_int64x2
-  (* Cast to types defined by mode in arm.c, not random types pulled in from
-     the <stdint.h> header in use. This fixes incompatible pointer errors when
-     compiling with C++.  *)
-  | T_uint8 | T_int8 -> T_intQI
-  | T_uint16 | T_int16 -> T_intHI
-  | T_uint32 | T_int32 -> T_intSI
-  | T_uint64 | T_int64 -> T_intDI
-  | T_float16 -> T_floatHF
-  | T_float32 -> T_floatSF
-  | T_poly8 -> T_intQI
-  | T_poly16 -> T_intHI
-  | T_poly64 -> T_intDI
-  | T_poly128 -> T_intTI
-  | T_arrayof (n, elt) -> T_arrayof (n, signed_ctype elt)
-  | T_ptrto elt -> T_ptrto (signed_ctype elt)
-  | T_const elt -> T_const (signed_ctype elt)
-  | x -> x
-
-let add_cast ctype cval =
-  let stype = signed_ctype ctype in
-  if ctype <> stype then
-    Printf.sprintf "(%s) %s" (string_of_vectype stype) cval
-  else
-    cval
-
-let cast_for_return to_ty = "(" ^ (string_of_vectype to_ty) ^ ")"
-
-(* Return a tuple of a list of declarations to go at the start of the function,
-   and a list of statements needed to return THING.  *)
-let return arity thing =
-  match arity with
-    Arity0 (ret) | Arity1 (ret, _) | Arity2 (ret, _, _) | Arity3 (ret, _, _, _)
-  | Arity4 (ret, _, _, _, _) ->
-      begin match ret with
-	T_arrayof (num, vec) ->
-          let uname = union_string num vec "__rv" in
-          [uname ^ ";"], ["__rv.__o = " ^ thing ^ ";"; "return __rv.__i;"]
-      | T_void ->
-	  [], [thing ^ ";"]
-      | _ ->
-	  [], ["return " ^ (cast_for_return ret) ^ thing ^ ";"]
-      end
-
-let mask_shape_for_shuffle = function
-    All (num, reg) -> All (num, reg)
-  | Pair_result reg -> All (2, reg)
-  | _ -> failwith "mask_for_shuffle"
-
-let mask_elems shuffle shape elttype part =
-  let elem_size = elt_width elttype in
-  let num_elems =
-    match regmap shape 0 with
-      Dreg -> 64 / elem_size
-    | Qreg -> 128 / elem_size
-    | _ -> failwith "mask_elems" in
-  shuffle elem_size num_elems part
-
-(* Return a tuple of a list of declarations and a list of statements needed
-   to implement an intrinsic using __builtin_shuffle.  SHUFFLE is a function
-   which returns a list of elements suitable for using as a mask.  *)
-
-let shuffle_fn shuffle shape arity elttype =
-  let mshape = mask_shape_for_shuffle shape in
-  let masktype = type_for_elt mshape (unsigned_of_elt elttype) 0 in
-  let masktype_str = string_of_vectype masktype in
-  let shuffle_res = type_for_elt mshape elttype 0 in
-  let shuffle_res_str = string_of_vectype shuffle_res in
-  match arity with
-    Arity0 (ret) | Arity1 (ret, _) | Arity2 (ret, _, _) | Arity3 (ret, _, _, _)
-  | Arity4 (ret, _, _, _, _) ->
-      begin match ret with
-        T_arrayof (num, vec) ->
-	  let elems1 = mask_elems shuffle mshape elttype `lo
-	  and elems2 = mask_elems shuffle mshape elttype `hi in
-	  let mask1 = (String.concat ", " (List.map string_of_int elems1))
-	  and mask2 = (String.concat ", " (List.map string_of_int elems2)) in
-	  let shuf1 = Printf.sprintf
-	    "__rv.val[0] = (%s) __builtin_shuffle (__a, __b, (%s) { %s });"
-	    shuffle_res_str masktype_str mask1
-	  and shuf2 = Printf.sprintf
-	    "__rv.val[1] = (%s) __builtin_shuffle (__a, __b, (%s) { %s });"
-	    shuffle_res_str masktype_str mask2 in
-	  [Printf.sprintf "%s __rv;" (string_of_vectype ret);],
-	  [shuf1; shuf2; "return __rv;"]
-      | _ ->
-          let elems = mask_elems shuffle mshape elttype `lo in
-          let mask =  (String.concat ", " (List.map string_of_int elems)) in
-	  let shuf = Printf.sprintf
-	    "return (%s) __builtin_shuffle (__a, (%s) { %s });" shuffle_res_str masktype_str mask in
-	  [""],
-	  [shuf]
-      end
-
-let rec element_type ctype =
-  match ctype with
-    T_arrayof (_, v) -> element_type v
-  | _ -> ctype
-
-let params ps =
-  let pdecls = ref [] in
-  let ptype t p =
-    match t with
-      T_arrayof (num, elts) ->
-        let uname = union_string num elts (p ^ "u") in
-        let decl = Printf.sprintf "%s = { %s };" uname p in
-        pdecls := decl :: !pdecls;
-        p ^ "u.__o"
-    | _ -> add_cast t p in
-  let plist = match ps with
-    Arity0 _ -> []
-  | Arity1 (_, t1) -> [ptype t1 "__a"]
-  | Arity2 (_, t1, t2) -> [ptype t1 "__a"; ptype t2 "__b"]
-  | Arity3 (_, t1, t2, t3) -> [ptype t1 "__a"; ptype t2 "__b"; ptype t3 "__c"]
-  | Arity4 (_, t1, t2, t3, t4) ->
-      [ptype t1 "__a"; ptype t2 "__b"; ptype t3 "__c"; ptype t4 "__d"] in
-  !pdecls, plist
-
-let modify_params features plist =
-  let is_flipped =
-    List.exists (function Flipped _ -> true | _ -> false) features in
-  if is_flipped then
-    match plist with
-      [ a; b ] -> [ b; a ]
-    | _ ->
-      failwith ("Don't know how to flip args " ^ (String.concat ", " plist))
-  else
-    plist
-
-(* !!! Decide whether to add an extra information word based on the shape
-   form.  *)
-let extra_word shape features paramlist bits =
-  let use_word =
-    match shape with
-      All _ | Long | Long_noreg _ | Wide | Wide_noreg _ | Narrow
-    | By_scalar _ | Wide_scalar | Wide_lane | Binary_imm _ | Long_imm
-    | Narrow_imm -> true
-    | _ -> List.mem InfoWord features
-  in
-    if use_word then
-      paramlist @ [string_of_int bits]
-    else
-      paramlist
-
-(* Bit 0 represents signed (1) vs unsigned (0), or float (1) vs poly (0).
-   Bit 1 represents floats & polynomials (1), or ordinary integers (0).
-   Bit 2 represents rounding (1) vs none (0).  *)
-let infoword_value elttype features =
-  let bits01 =
-    match elt_class elttype with
-      Signed | ConvClass (Signed, _) | ConvClass (_, Signed) -> 0b001
-    | Poly -> 0b010
-    | Float -> 0b011
-    | _ -> 0b000
-  and rounding_bit = if List.mem Rounding features then 0b100 else 0b000 in
-  bits01 lor rounding_bit
-
-(* "Cast" type operations will throw an exception in mode_of_elt (actually in
-   elt_width, called from there). Deal with that here, and generate a suffix
-   with multiple modes (<to><from>).  *)
-let rec mode_suffix elttype shape =
-  try
-    let mode = mode_of_elt elttype shape in
-    string_of_mode mode
-  with MixedMode (dst, src) ->
-    let dstmode = mode_of_elt ~argpos:0 dst shape
-    and srcmode = mode_of_elt ~argpos:1 src shape in
-    string_of_mode dstmode ^ string_of_mode srcmode
-
-let get_shuffle features =
-  try
-    match List.find (function Use_shuffle _ -> true | _ -> false) features with
-      Use_shuffle fn -> Some fn
-    | _ -> None
-  with Not_found -> None
-
-let print_feature_test_start features =
-  try
-    match List.find (fun feature ->
-                       match feature with Requires_feature _ -> true
-                                        | Requires_arch _ -> true
-                                        | Requires_FP_bit _ -> true
-                                        | _ -> false)
-                     features with
-      Requires_feature feature ->
-        Format.printf "#ifdef __ARM_FEATURE_%s@\n" feature
-    | Requires_arch arch ->
-        Format.printf "#if __ARM_ARCH >= %d@\n" arch
-    | Requires_FP_bit bit ->
-        Format.printf "#if ((__ARM_FP & 0x%X) != 0)@\n"
-                      (1 lsl bit)
-    | _ -> assert false
-  with Not_found -> assert true
-
-let print_feature_test_end features =
-  let feature =
-    List.exists (function Requires_feature _ -> true
-                          | Requires_arch _ -> true
-                          | Requires_FP_bit _ -> true
-                          |  _ -> false) features in
-  if feature then Format.printf "#endif@\n"
-
-
-let print_variant opcode features shape name (ctype, asmtype, elttype) =
-  let bits = infoword_value elttype features in
-  let modesuf = mode_suffix elttype shape in
-  let pdecls, paramlist = params ctype in
-  let rdecls, stmts =
-    match get_shuffle features with
-      Some shuffle -> shuffle_fn shuffle shape ctype elttype
-    | None ->
-	let paramlist' = modify_params features paramlist in
-	let paramlist'' = extra_word shape features paramlist' bits in
-	let parstr = String.concat ", " paramlist'' in
-	let builtin = Printf.sprintf "__builtin_neon_%s%s (%s)"
-                	(builtin_name features name) modesuf parstr in
-	return ctype builtin in
-  let body = pdecls @ rdecls @ stmts
-  and fnname = (intrinsic_name name) ^ "_" ^ (string_of_elt elttype) in
-  begin
-    print_feature_test_start features;
-    print_function ctype fnname body;
-    print_feature_test_end features;
-  end
-
-(* When this function processes the element types in the ops table, it rewrites
-   them in a list of tuples (a,b,c):
-     a : C type as an "arity", e.g. Arity1 (T_poly8x8, T_poly8x8)
-     b : Asm type : a single, processed element type, e.g. P16. This is the
-         type which should be attached to the asm opcode.
-     c : Variant type : the unprocessed type for this variant (e.g. in add
-         instructions which don't care about the sign, b might be i16 and c
-         might be s16.)
-*)
-
-let print_op (opcode, features, shape, name, munge, types) =
-  let sorted_types = List.sort compare types in
-  let munged_types = List.map
-    (fun elt -> let c, asm = munge shape elt in c, asm, elt) sorted_types in
-  List.iter
-    (fun variant -> print_variant opcode features shape name variant)
-    munged_types
-
-let print_ops ops =
-  List.iter print_op ops
-
-(* Output type definitions. Table entries are:
-     cbase : "C" name for the type.
-     abase : "ARM" base name for the type (i.e. int in int8x8_t).
-     esize : element size.
-     enum : element count.
-     alevel: architecture level at which available.
-*)
-
-type fpulevel = CRYPTO | ALL
-
-let deftypes () =
-  let typeinfo = [
-    (* Doubleword vector types.  *)
-    "__builtin_neon_qi", "int", 8, 8, ALL;
-    "__builtin_neon_hi", "int", 16, 4, ALL;
-    "__builtin_neon_si", "int", 32, 2, ALL;
-    "__builtin_neon_di", "int", 64, 1, ALL;
-    "__builtin_neon_hf", "float", 16, 4, ALL;
-    "__builtin_neon_sf", "float", 32, 2, ALL;
-    "__builtin_neon_poly8", "poly", 8, 8, ALL;
-    "__builtin_neon_poly16", "poly", 16, 4, ALL;
-    "__builtin_neon_poly64", "poly", 64, 1, CRYPTO;
-    "__builtin_neon_uqi", "uint", 8, 8, ALL;
-    "__builtin_neon_uhi", "uint", 16, 4, ALL;
-    "__builtin_neon_usi", "uint", 32, 2, ALL;
-    "__builtin_neon_udi", "uint", 64, 1, ALL;
-
-    (* Quadword vector types.  *)
-    "__builtin_neon_qi", "int", 8, 16, ALL;
-    "__builtin_neon_hi", "int", 16, 8, ALL;
-    "__builtin_neon_si", "int", 32, 4, ALL;
-    "__builtin_neon_di", "int", 64, 2, ALL;
-    "__builtin_neon_sf", "float", 32, 4, ALL;
-    "__builtin_neon_poly8", "poly", 8, 16, ALL;
-    "__builtin_neon_poly16", "poly", 16, 8, ALL;
-    "__builtin_neon_poly64", "poly", 64, 2, CRYPTO;
-    "__builtin_neon_uqi", "uint", 8, 16, ALL;
-    "__builtin_neon_uhi", "uint", 16, 8, ALL;
-    "__builtin_neon_usi", "uint", 32, 4, ALL;
-    "__builtin_neon_udi", "uint", 64, 2, ALL
-  ] in
-  List.iter
-    (fun (cbase, abase, esize, enum, fpulevel) ->
-      let attr =
-        match enum with
-          1 -> ""
-        | _ -> Printf.sprintf "\t__attribute__ ((__vector_size__ (%d)))"
-                              (esize * enum / 8) in
-      if fpulevel == CRYPTO then
-        Format.printf "#ifdef __ARM_FEATURE_CRYPTO\n";
-      Format.printf "typedef %s %s%dx%d_t%s;@\n" cbase abase esize enum attr;
-      if fpulevel == CRYPTO then
-        Format.printf "#endif\n";)
-    typeinfo;
-  Format.print_newline ();
-  (* Extra types not in <stdint.h>.  *)
-  Format.printf "typedef float float32_t;\n";
-  Format.printf "typedef __builtin_neon_poly8 poly8_t;\n";
-  Format.printf "typedef __builtin_neon_poly16 poly16_t;\n";
-  Format.printf "#ifdef __ARM_FEATURE_CRYPTO\n";
-  Format.printf "typedef __builtin_neon_poly64 poly64_t;\n";
-  Format.printf "typedef __builtin_neon_poly128 poly128_t;\n";
-  Format.printf "#endif\n"
-
-(* Output structs containing arrays, for load & store instructions etc.
-   poly128_t is deliberately not included here because it has no array types
-   defined for it.  *)
-
-let arrtypes () =
-  let typeinfo = [
-    "int", 8, ALL;    "int", 16, ALL;
-    "int", 32, ALL;   "int", 64, ALL;
-    "uint", 8, ALL;   "uint", 16, ALL;
-    "uint", 32, ALL;  "uint", 64, ALL;
-    "float", 32, ALL; "poly", 8, ALL;
-    "poly", 16, ALL; "poly", 64, CRYPTO
-  ] in
-  let writestruct elname elsize regsize arrsize fpulevel =
-    let elnum = regsize / elsize in
-    let structname =
-      Printf.sprintf "%s%dx%dx%d_t" elname elsize elnum arrsize in
-    let sfmt = start_function () in
-    Format.printf "%stypedef struct %s"
-      (if fpulevel == CRYPTO then "#ifdef __ARM_FEATURE_CRYPTO\n" else "") structname;
-    open_braceblock sfmt;
-    Format.printf "%s%dx%d_t val[%d];" elname elsize elnum arrsize;
-    close_braceblock sfmt;
-    Format.printf " %s;%s" structname (if fpulevel == CRYPTO then "\n#endif\n" else "");
-    end_function sfmt;
-  in
-    for n = 2 to 4 do
-      List.iter
-        (fun (elname, elsize, alevel) ->
-          writestruct elname elsize 64 n alevel;
-          writestruct elname elsize 128 n alevel)
-        typeinfo
-    done
-
-let print_lines = List.iter (fun s -> Format.printf "%s@\n" s)
-
-(* Do it.  *)
-
-let _ =
-  print_lines [
-"/* ARM NEON intrinsics include file. This file is generated automatically";
-"   using neon-gen.ml.  Please do not edit manually.";
-"";
-"   Copyright (C) 2006-2014 Free Software Foundation, Inc.";
-"   Contributed by CodeSourcery.";
-"";
-"   This file is part of GCC.";
-"";
-"   GCC is free software; you can redistribute it and/or modify it";
-"   under the terms of the GNU General Public License as published";
-"   by the Free Software Foundation; either version 3, or (at your";
-"   option) any later version.";
-"";
-"   GCC is distributed in the hope that it will be useful, but WITHOUT";
-"   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY";
-"   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public";
-"   License for more details.";
-"";
-"   Under Section 7 of GPL version 3, you are granted additional";
-"   permissions described in the GCC Runtime Library Exception, version";
-"   3.1, as published by the Free Software Foundation.";
-"";
-"   You should have received a copy of the GNU General Public License and";
-"   a copy of the GCC Runtime Library Exception along with this program;";
-"   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see";
-"   <http://www.gnu.org/licenses/>.  */";
-"";
-"#ifndef _GCC_ARM_NEON_H";
-"#define _GCC_ARM_NEON_H 1";
-"";
-"#ifndef __ARM_NEON__";
-"#error You must enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use arm_neon.h";
-"#else";
-"";
-"#ifdef __cplusplus";
-"extern \"C\" {";
-"#endif";
-"";
-"#include <stdint.h>";
-""];
-  deftypes ();
-  arrtypes ();
-  Format.print_newline ();
-  print_ops ops;
-  Format.print_newline ();
-  print_ops reinterp;
-  print_ops reinterpq;
-  Format.printf "%s" crypto_intrinsics;
-  print_lines [
-"#ifdef __cplusplus";
-"}";
-"#endif";
-"#endif";
-"#endif"]
diff --git a/gcc/doc/arm-neon-intrinsics.texi b/gcc/doc/arm-neon-intrinsics.texi
index 67f84e0..56987e4 100644
--- a/gcc/doc/arm-neon-intrinsics.texi
+++ b/gcc/doc/arm-neon-intrinsics.texi
@@ -2,8 +2,6 @@
 @c This is part of the GCC manual.
 @c For copying conditions, see the file gcc.texi.
 
-@c This file is generated automatically using gcc/config/arm/neon-docgen.ml
-@c Please do not edit manually.
 @subsubsection Addition
 
 @itemize @bullet
-- 
1.8.3

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Patch ARM 3/3] Neon intrinsics TLC - Remove unneeded ML from backend.
  2014-04-28 10:52 ` [Patch ARM 3/3] Neon intrinsics TLC - Remove unneeded ML " Ramana Radhakrishnan
@ 2014-04-28 11:01   ` Ramana Radhakrishnan
  0 siblings, 0 replies; 9+ messages in thread
From: Ramana Radhakrishnan @ 2014-04-28 11:01 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: gcc-patches, Christophe Lyon

On 04/28/14 11:48, Ramana Radhakrishnan wrote:
> Patch 3/3 removes the ML to generate Neon intrinsics and the
> documentation and updates the comments in the files to show that these
> are now hand crafted rather than auto-generated. We've had these for
> many years now and I think it's time we got rid of this. Not everyone
> groks ML, and it doesn't help that only one or two folks can actually do
> this properly every time. Instead of having these bottlenecks and given
> the fact that the intrinsics are pretty stable now, there's no point in
> retaining the generator interface. I'd rather get rid of them. The only
> bit left is neon-schedgen.ml, neon.ml and neon-testgen.ml. I think we
> can safely remove neon-testgen.ml once Christophe's testsuite is done
> and we'll probably just have to carry neon-schedgen.ml / neon.ml as it
> still generates the neon descriptions for both a8 and a9.

James just pointed out that he killed neon-schedgen.ml last year (my 
memory was wrong, or I must have been looking in the wrong directory). 
So neon.ml can go as well once neon-testgen.ml dies.

regards
Ramana



>
>
> <DATE>  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
>
> 	* config/arm/arm_neon.h: Update comment.
> 	* config/arm/neon-docgen.ml: Delete.
> 	* config/arm/neon-gen.ml: Delete.
> 	* doc/arm-neon-intrinsics.texi: Update comment.
>


-- 
Ramana Radhakrishnan
Principal Engineer
ARM Ltd.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Patch ARM 1/3] Neon intrinsics TLC :  Replace intrinsics with GNU C implementations where possible.
  2014-04-28 10:45 ` [Patch ARM 1/3] Neon intrinsics TLC : Replace intrinsics with GNU C implementations where possible Ramana Radhakrishnan
@ 2014-04-28 11:52   ` Julian Brown
  2014-04-28 13:04     ` Ramana Radhakrishnan
  2014-04-29 13:58   ` Christophe Lyon
  1 sibling, 1 reply; 9+ messages in thread
From: Julian Brown @ 2014-04-28 11:52 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: gcc-patches, Christophe Lyon

On Mon, 28 Apr 2014 11:44:01 +0100
Ramana Radhakrishnan <ramrad01@arm.com> wrote:

> I've special cased the ffast-math case for the _f32 intrinsics to 
> prevent the auto-vectorizer from coming along and vectorizing addv2sf 
> and addv4sf type operations which we don't want to happen by default.
> Patch 1/3 causes apparent "regressions" in the rather ineffective
> neon intrinsics tests that we currently carry soon hopefully to be
> replaced by Christophe Lyon's rewrite that is being reviewed. On the
> whole I deem this patch stack to be safe to go in if necessary. These
> "regressions" are for -O0 with the vbic and vorn intrinsics which
> don't now get combined and well, so be it.
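
[For context, a minimal sketch of what such a GNU C reimplementation
looks like. This is illustrative only, not the patch text: the type
declarations mirror how arm_neon.h defines vector types via
__attribute__ ((vector_size)), and the #ifdef __FAST_MATH__ structure
reflects the special-casing described above, with a plain-arithmetic
stand-in for the target builtin so the sketch compiles anywhere.]

```c
#include <stdint.h>

/* GCC generic vector types, declared the same way arm_neon.h
   declares them.  */
typedef int32_t int32x2_t __attribute__ ((vector_size (8)));
typedef float float32x2_t __attribute__ ((vector_size (8)));

/* Integer add: plain GNU C vector arithmetic, no builtin needed.  */
__extension__ static __inline int32x2_t
__attribute__ ((__always_inline__))
vadd_s32 (int32x2_t __a, int32x2_t __b)
{
  return __a + __b;
}

/* Float add: GNU C arithmetic only under -ffast-math, so the
   auto-vectorizer cannot introduce v2sf float operations by default.
   Both branches here use plain arithmetic so the sketch compiles on
   any target; in the real header the #else branch would call the
   __builtin_neon_* builtin instead.  */
__extension__ static __inline float32x2_t
__attribute__ ((__always_inline__))
vadd_f32 (float32x2_t __a, float32x2_t __b)
{
#ifdef __FAST_MATH__
  return __a + __b;            /* GNU C; vectorizable.  */
#else
  return __a + __b;            /* Stand-in for the target builtin.  */
#endif
}
```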

I think reimplementing these intrinsics in C is a mistake if we ever
hope to make big-endian mode work properly, and "fixing" the generated
header file by bypassing the generator makes it harder to accurately
perform the sweeping changes that will probably be necessary to do that.
Recall e.g. the discussion around:

http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00161.html

Generally (though in this case it's merely an implementation detail)
the NEON intrinsics and GCC's generic vector support cannot be expected
to interwork properly (because of incompatible lane ordering). Of
course we get away with it in little-endian mode though, and I guess
the bridge has already been crossed by earlier patches.
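
[The lane-ordering concern can be made concrete with a hedged sketch
using GCC's generic __builtin_shuffle, the same primitive the deleted
shuffle_fn generator emitted. The function name is hypothetical; the
point is that the mask indices use GCC's element numbering, which
matches NEON lane numbering only on little-endian targets.]

```c
#include <stdint.h>

typedef int32_t int32x2_t __attribute__ ((vector_size (8)));

/* A vrev64_s32-style lane swap expressed with GCC's generic shuffle.
   On a big-endian target, GCC element 0 and NEON lane 0 need not name
   the same element, which is why intrinsics and generic vectors do
   not interwork there.  */
static inline int32x2_t
swap_lanes (int32x2_t __a)
{
  const int32x2_t __mask = { 1, 0 };
  return __builtin_shuffle (__a, __mask);
}
```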

Of course it's possible nobody actually wants to use big-endian NEON,
in which case it's probably time to declared it unsupported?

Julian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Patch ARM 1/3] Neon intrinsics TLC :  Replace intrinsics with GNU C implementations where possible.
  2014-04-28 11:52   ` Julian Brown
@ 2014-04-28 13:04     ` Ramana Radhakrishnan
  2014-05-07 15:15       ` Richard Earnshaw
  0 siblings, 1 reply; 9+ messages in thread
From: Ramana Radhakrishnan @ 2014-04-28 13:04 UTC (permalink / raw)
  To: Julian Brown; +Cc: gcc-patches, Christophe Lyon


On Mon, Apr 28, 2014 at 12:44 PM, Julian Brown <julian@codesourcery.com> 
wrote:
 > On Mon, 28 Apr 2014 11:44:01 +0100
 > Ramana Radhakrishnan <ramrad01@arm.com> wrote:
 >
 >> I've special cased the ffast-math case for the _f32 intrinsics to
 >> prevent the auto-vectorizer from coming along and vectorizing addv2sf
 >> and addv4sf type operations which we don't want to happen by default.
 >> Patch 1/3 causes apparent "regressions" in the rather ineffective
 >> neon intrinsics tests that we currently carry soon hopefully to be
 >> replaced by Christophe Lyon's rewrite that is being reviewed. On the
 >> whole I deem this patch stack to be safe to go in if necessary. These
 >> "regressions" are for -O0 with the vbic and vorn intrinsics which
 >> don't now get combined and well, so be it.
 >
 > I think reimplementing these intrinsics in C is a mistake if we ever
 > hope to make big-endian mode work properly, and "fixing" the generated
 > header file by bypassing the generator makes it harder to accurately
 > perform the sweeping changes that will probably be necessary to do that.


 > Recall e.g. the discussion around:

 >
 > http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00161.html

Well, it would help if the generator were written in a better language 
than ML :). While I don't mind a different language in the backend once 
in a while, the problem is that every time anyone needs to make a 
change to this file, we spend far more time relearning ML than actually 
doing the change :(.

 >
 > Generally (though in this case it's merely an implementation detail)
 > the NEON intrinsics and GCC's generic vector support cannot be expected
 > to interwork properly (because of incompatible lane ordering). Of
 > course we get away with it in little-endian mode though, and I guess
 > the bridge has already been crossed by earlier patches.

Please note that I have been very careful about doing only those 
operations that will not be afflicted by big endian. I am not touching 
any of the lane-wise intrinsics or intrinsics that touch lane numbers. 
It is the intrinsics that have explicit lane numbering that have the 
issue, not the intrinsics I have touched. What's being done here is 
similar to how these particular intrinsics have been dealt with in the 
AArch64 backend, and we don't see any issues with these intrinsics in 
big-endian mode there. I do not expect these intrinsics to be any more 
broken in big-endian than they currently are, with this patch or this 
set of patches.
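
[The distinction drawn above can be sketched as follows. These are
hypothetical illustrations, not patch text: a whole-vector operation
treats every element identically, so its result does not depend on how
lanes are numbered, whereas a lane-indexed access bakes element
numbering into its meaning, which is where the big-endian concerns
live.]

```c
#include <stdint.h>

typedef uint32_t uint32x2_t __attribute__ ((vector_size (8)));

/* Whole-vector operation: elementwise, lane-order independent.  This
   is the class of intrinsic the patch rewrites as GNU C.  */
static inline uint32x2_t
vand_u32_sketch (uint32x2_t __a, uint32x2_t __b)
{
  return __a & __b;
}

/* Lane-indexed access: the meaning of "lane 1" depends on element
   numbering.  Intrinsics of this form are deliberately left
   untouched by the patch.  */
static inline uint32_t
vget_lane_u32_sketch (uint32x2_t __a, int __lane)
{
  return __a[__lane];
}
```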

What specifically are you worried about with Patch 1/3 with respect to 
big endian in this case? I agree that there may be issues with the 
specific "lane" extraction and vector lane numbering extensions that GCC 
has in big-endian mode vs. Neon intrinsics, but otherwise this change 
should *not* cause any issues in that space.

What specifically are you worried about with this patch, other than 
losing the ability to auto-generate these intrinsics? As it stands, the 
patch touches only those intrinsics that operate on the entire vector 
and have no dependence at all on lane numbering.

regards
Ramana

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Patch ARM 1/3] Neon intrinsics TLC : Replace intrinsics with GNU C implementations where possible.
  2014-04-28 10:45 ` [Patch ARM 1/3] Neon intrinsics TLC : Replace intrinsics with GNU C implementations where possible Ramana Radhakrishnan
  2014-04-28 11:52   ` Julian Brown
@ 2014-04-29 13:58   ` Christophe Lyon
  1 sibling, 0 replies; 9+ messages in thread
From: Christophe Lyon @ 2014-04-29 13:58 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: gcc-patches

Hi Ramana,

FWIW, I have run the current set of my tests, which cover all your
changes except vmul, and I have noticed no regressions.

Christophe.


2014-04-28 12:44 GMT+02:00 Ramana Radhakrishnan <ramrad01@arm.com>:
> I've special cased the ffast-math case for the _f32 intrinsics to prevent
> the auto-vectorizer from coming along and vectorizing addv2sf and addv4sf
> type operations which we don't want to happen by default.
> Patch 1/3 causes apparent "regressions" in the rather ineffective neon
> intrinsics tests that we currently carry, soon hopefully to be replaced by
> Christophe Lyon's rewrite that is under review. On the whole I deem this
> patch stack to be safe to go in if necessary. These "regressions" are for
> -O0 with the vbic and vorn intrinsics which
> don't now get combined and well, so be it.
>
>
> Given that we're in stage 1 and that I think we're getting somewhere
> with clyon's testsuite, I feel it is reasonably practical to just
> carry the noise from these extra failures. Christophe and I will
> test-drive his testsuite work in this space with these patches to see how
> the conversion process works and whether there are any issues with them.
>
>
> <DATE>  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
>
>         * config/arm_neon.h (vadd_s8): GNU C implementation.
>         (vadd_s16): Likewise.
>         (vadd_s32): Likewise.
>         (vadd_f32): Likewise.
>         (vadd_u8): Likewise.
>         (vadd_u16): Likewise.
>         (vadd_u32): Likewise.
>         (vadd_s64): Likewise.
>         (vadd_u64): Likewise.
>         (vaddq_s8): Likewise.
>         (vaddq_s16): Likewise.
>         (vaddq_s32): Likewise.
>         (vaddq_s64): Likewise.
>         (vaddq_f32): Likewise.
>         (vaddq_u8): Likewise.
>         (vaddq_u16): Likewise.
>         (vaddq_u32): Likewise.
>         (vaddq_u64): Likewise.
>         (vmul_s8): Likewise.
>         (vmul_s16): Likewise.
>         (vmul_s32): Likewise.
>         (vmul_f32): Likewise.
>         (vmul_u8): Likewise.
>         (vmul_u16): Likewise.
>         (vmul_u32): Likewise.
>         (vmul_p8): Likewise.
>         (vmulq_s8): Likewise.
>         (vmulq_s16): Likewise.
>         (vmulq_s32): Likewise.
>         (vmulq_f32): Likewise.
>         (vmulq_u8): Likewise.
>         (vmulq_u16): Likewise.
>         (vmulq_u32): Likewise.
>         (vsub_s8): Likewise.
>         (vsub_s16): Likewise.
>         (vsub_s32): Likewise.
>         (vsub_f32): Likewise.
>         (vsub_u8): Likewise.
>         (vsub_u16): Likewise.
>         (vsub_u32): Likewise.
>         (vsub_s64): Likewise.
>         (vsub_u64): Likewise.
>         (vsubq_s8): Likewise.
>         (vsubq_s16): Likewise.
>         (vsubq_s32): Likewise.
>         (vsubq_s64): Likewise.
>         (vsubq_f32): Likewise.
>         (vsubq_u8): Likewise.
>         (vsubq_u16): Likewise.
>         (vsubq_u32): Likewise.
>         (vsubq_u64): Likewise.
>         (vand_s8): Likewise.
>         (vand_s16): Likewise.
>         (vand_s32): Likewise.
>         (vand_u8): Likewise.
>         (vand_u16): Likewise.
>         (vand_u32): Likewise.
>         (vand_s64): Likewise.
>         (vand_u64): Likewise.
>         (vandq_s8): Likewise.
>         (vandq_s16): Likewise.
>         (vandq_s32): Likewise.
>         (vandq_s64): Likewise.
>         (vandq_u8): Likewise.
>         (vandq_u16): Likewise.
>         (vandq_u32): Likewise.
>         (vandq_u64): Likewise.
>         (vorr_s8): Likewise.
>         (vorr_s16): Likewise.
>         (vorr_s32): Likewise.
>         (vorr_u8): Likewise.
>         (vorr_u16): Likewise.
>         (vorr_u32): Likewise.
>         (vorr_s64): Likewise.
>         (vorr_u64): Likewise.
>         (vorrq_s8): Likewise.
>         (vorrq_s16): Likewise.
>         (vorrq_s32): Likewise.
>         (vorrq_s64): Likewise.
>         (vorrq_u8): Likewise.
>         (vorrq_u16): Likewise.
>         (vorrq_u32): Likewise.
>         (vorrq_u64): Likewise.
>         (veor_s8): Likewise.
>         (veor_s16): Likewise.
>         (veor_s32): Likewise.
>         (veor_u8): Likewise.
>         (veor_u16): Likewise.
>         (veor_u32): Likewise.
>         (veor_s64): Likewise.
>         (veor_u64): Likewise.
>         (veorq_s8): Likewise.
>         (veorq_s16): Likewise.
>         (veorq_s32): Likewise.
>         (veorq_s64): Likewise.
>         (veorq_u8): Likewise.
>         (veorq_u16): Likewise.
>         (veorq_u32): Likewise.
>         (veorq_u64): Likewise.
>         (vbic_s8): Likewise.
>         (vbic_s16): Likewise.
>         (vbic_s32): Likewise.
>         (vbic_u8): Likewise.
>         (vbic_u16): Likewise.
>         (vbic_u32): Likewise.
>         (vbic_s64): Likewise.
>         (vbic_u64): Likewise.
>         (vbicq_s8): Likewise.
>         (vbicq_s16): Likewise.
>         (vbicq_s32): Likewise.
>         (vbicq_s64): Likewise.
>         (vbicq_u8): Likewise.
>         (vbicq_u16): Likewise.
>         (vbicq_u32): Likewise.
>         (vbicq_u64): Likewise.
>         (vorn_s8): Likewise.
>         (vorn_s16): Likewise.
>         (vorn_s32): Likewise.
>         (vorn_u8): Likewise.
>         (vorn_u16): Likewise.
>         (vorn_u32): Likewise.
>         (vorn_s64): Likewise.
>         (vorn_u64): Likewise.
>         (vornq_s8): Likewise.
>         (vornq_s16): Likewise.
>         (vornq_s32): Likewise.
>         (vornq_s64): Likewise.
>         (vornq_u8): Likewise.
>         (vornq_u16): Likewise.
>         (vornq_u32): Likewise.
>         (vornq_u64): Likewise.
>
>
>
> --
> Ramana Radhakrishnan
> Principal Engineer
> ARM Ltd.


* Re: [Patch ARM 1/3] Neon intrinsics TLC :  Replace intrinsics with GNU C implementations where possible.
  2014-04-28 13:04     ` Ramana Radhakrishnan
@ 2014-05-07 15:15       ` Richard Earnshaw
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Earnshaw @ 2014-05-07 15:15 UTC (permalink / raw)
  To: Ramana Radhakrishnan; +Cc: Julian Brown, gcc-patches, Christophe Lyon

On 28/04/14 14:01, Ramana Radhakrishnan wrote:
> 
> On Mon, Apr 28, 2014 at 12:44 PM, Julian Brown <julian@codesourcery.com> 
> wrote:
>  > On Mon, 28 Apr 2014 11:44:01 +0100
>  > Ramana Radhakrishnan <ramrad01@arm.com> wrote:
>  >
>  >> I've special cased the ffast-math case for the _f32 intrinsics to
>  >> prevent the auto-vectorizer from coming along and vectorizing addv2sf
>  >> and addv4sf type operations which we don't want to happen by default.
>  >> Patch 1/3 causes apparent "regressions" in the rather ineffective
>  >> neon intrinsics tests that we currently carry soon hopefully to be
>  >> replaced by Christophe Lyon's rewrite that is being reviewed. On the
>  >> whole I deem this patch stack to be safe to go in if necessary. These
>  >> "regressions" are for -O0 with the vbic and vorn intrinsics which
>  >> don't now get combined and well, so be it.
>  >
>  > I think reimplementing these intrinsics in C is a mistake if we ever
>  > hope to make big-endian mode work properly, and "fixing" the generated
>  > header file by bypassing the generator makes it harder to accurately
>  > perform the sweeping changes that will probably be necessary to do that.
> 
> 
>  > Recall e.g. the discussion around:
> 
>  >
>  > http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00161.html
> 
> Well, it would help if the generator were written in a better language 
> than ML :) . While I don't mind a different language in the backend once 
> in a while, the problem is that every time anyone needs to change this 
> file, we spend far more time relearning ML than actually making the 
> change :(.
> 

I agree: it's time the ML files went.  They're an impediment to
maintenance these days.

When the ML description was added, it did three things: generated
arm_neon.h, generated the testsuite and generated a pipeline description
for Cortex-A8.  As we've progressed, the second and third of these have
gone away (or at least, are about to in the case of the testsuite),
leaving only the arm_neon.h generation.  I don't see any real merit in
having that file generated from the ML description; we might as well
maintain the existing code directly, which gives more people the chance
to actively work on fixing issues there without having to learn ML
first.

R.



end of thread, other threads:[~2014-05-07 15:15 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-28 10:44 [Patch ARM 0/3] Neon intrinsics TLC - Replace intrinsics with GNU C implementations where possible and remove dead code Ramana Radhakrishnan
2014-04-28 10:45 ` [Patch ARM 1/3] Neon intrinsics TLC : Replace intrinsics with GNU C implementations where possible Ramana Radhakrishnan
2014-04-28 11:52   ` Julian Brown
2014-04-28 13:04     ` Ramana Radhakrishnan
2014-05-07 15:15       ` Richard Earnshaw
2014-04-29 13:58   ` Christophe Lyon
2014-04-28 10:47 ` [Patch ARM 2/3] Remove dead code from backend Ramana Radhakrishnan
2014-04-28 10:52 ` [Patch ARM 3/3] Neon intrinsics TLC - Remove unneeded ML " Ramana Radhakrishnan
2014-04-28 11:01   ` Ramana Radhakrishnan
