public inbox for gcc-patches@gcc.gnu.org
* [PATCH 8/9][GCC][Arm] Add autovectorization support for complex multiplication and addition
@ 2018-11-11 10:28 Tamar Christina
  2018-11-14 12:27 ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Tamar Christina @ 2018-11-11 10:28 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Ramana Radhakrishnan, Richard Earnshaw, nickc, Kyrylo Tkachov

[-- Attachment #1: Type: text/plain, Size: 4600 bytes --]

Hi All,

This patch adds the expander support for autovectorization of complex number operations
such as complex addition with a rotation along the Argand plane.  It also adds support for complex FMA.

The instructions are described in the ArmARM [1] and are available from Armv8.3-a onwards.
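
The kind of loop this is about would be roughly the following (the actual testcase is not
included in this mail; the signature, element count and use of _Complex here are only a
sketch inferred from the generated code below, built with something like -O3 -ffast-math
-march=armv8.3-a):

#include <complex.h>

#define N 200

void
f90 (float _Complex *a, float _Complex *b, float _Complex *c)
{
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i] * I;	/* b[i] rotated by 90 degrees.  */
}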

Concretely, this generates

f90:
	add	ip, r1, #15
	add	r3, r0, #15
	sub	r3, r3, r2
	sub	ip, ip, r2
	cmp	ip, #30
	cmphi	r3, #30
	add	r3, r0, #1600
	bls	.L5
.L3:
	vld1.32	{q8}, [r0]!
	vld1.32	{q9}, [r1]!
	vcadd.f32	q8, q8, q9, #90
	vst1.32	{q8}, [r2]!
	cmp	r0, r3
	bne	.L3
	bx	lr
.L5:
	vld1.32	{d16}, [r0]!
	vld1.32	{d17}, [r1]!
	vcadd.f32	d16, d16, d17, #90
	vst1.32	{d16}, [r2]!
	cmp	r0, r3
	bne	.L5
	bx	lr



now instead of

f90:
	add	ip, r1, #31
	add	r3, r0, #31
	sub	r3, r3, r2
	sub	ip, ip, r2
	cmp	ip, #62
	cmphi	r3, #62
	add	r3, r0, #1600
	bls	.L2
.L3:
	vld2.32	{d20-d23}, [r0]!
	vld2.32	{d24-d27}, [r1]!
	cmp	r0, r3
	vsub.f32	q8, q10, q13
	vadd.f32	q9, q12, q11
	vst2.32	{d16-d19}, [r2]!
	bne	.L3
	bx	lr
.L2:
	vldr	d19, .L10
.L5:
	vld1.32	{d16}, [r1]!
	vld1.32	{d18}, [r0]!
	vrev64.32	d16, d16
	cmp	r0, r3
	vsub.f32	d17, d18, d16
	vadd.f32	d16, d16, d18
	vswp	d16, d17
	vtbl.8	d16, {d16, d17}, d19
	vst1.32	{d16}, [r2]!
	bne	.L5
	bx	lr
.L11:
	.align	3
.L10:
	.byte	0
	.byte	1
	.byte	2
	.byte	3
	.byte	12
	.byte	13
	.byte	14
	.byte	15


for complex additions with a 90 degree rotation along the Argand plane.
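
Per complex element, what the vcadd instruction computes is roughly the following (a
scalar model of the Arm ARM semantics; the struct and helper names are only illustrative):

typedef struct { float re, im; } cf;

/* vcadd #90: res = a + b * I.  */
static cf cadd_rot90 (cf a, cf b)
{
  return (cf) { a.re - b.im, a.im + b.re };
}

/* vcadd #270: res = a - b * I.  */
static cf cadd_rot270 (cf a, cf b)
{
  return (cf) { a.re + b.im, a.im - b.re };
}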

[1] https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile

Bootstrap and regtest on aarch64-none-linux-gnu, arm-none-gnueabihf and x86_64-pc-linux-gnu
are still ongoing, but the previous patch showed no regressions.

The instructions have also been tested on aarch64-none-elf and arm-none-eabi on an Armv8.3-a model
with -march=armv8.3-a+fp16, and all tests pass.

Ok for trunk?

Thanks,
Tamar

gcc/ChangeLog:

2018-11-11  Tamar Christina  <tamar.christina@arm.com>

	* config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
	* config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4): New.
	(arm_option_reconfigure_globals): Use them.
	* config/arm/iterators.md (VDF, VQ_HSF): New.
	(VCADD, VCMLA): New.
	(VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
	* config/arm/neon.md (neon_vcadd<rot><mode>, fcadd<rot><mode>3,
	neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
	* config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
	UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180, UNSPEC_VCMLA270): New.

gcc/testsuite/ChangeLog:

2018-11-11  Tamar Christina  <tamar.christina@arm.com>

        * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c: Add Arm support.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c: Likewise.
        * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c: Likewise.

-- 

[-- Attachment #2: rb10283.patch --]
[-- Type: text/x-diff; name="rb10283.patch", Size: 24692 bytes --]

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 8472312487b50d93edf47cd31660efd6bc043387..f9209445ee3b644f37dbf672b0b16d049222929d 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -219,6 +219,9 @@ extern tree arm_fp16_type_node;
 					isa_bit_dotprod)		\
 			&& arm_arch8_2)
 
+/* Supports the Armv8.3-a Complex number AdvSIMD extensions.  */
+#define TARGET_COMPLEX (TARGET_NEON && arm_arch8_3)
+
 /* FPU supports the floating point FP16 instructions for ARMv8.2-A
    and later.  */
 #define TARGET_VFP_FP16INST \
@@ -441,6 +444,12 @@ extern int arm_arch8_1;
 /* Nonzero if this chip supports the ARM Architecture 8.2 extensions.  */
 extern int arm_arch8_2;
 
+/* Nonzero if this chip supports the ARM Architecture 8.3 extensions.  */
+extern int arm_arch8_3;
+
+/* Nonzero if this chip supports the ARM Architecture 8.4 extensions.  */
+extern int arm_arch8_4;
+
 /* Nonzero if this chip supports the FP16 instructions extension of ARM
    Architecture 8.2.  */
 extern int arm_fp16_inst;
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 79502606b632e6a187732c8b3be118df8bde149a..cbcbeeb6e076bb8f632e5c31dd751937af4514f5 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -895,6 +895,12 @@ int arm_arch8_1 = 0;
 /* Nonzero if this chip supports the ARM Architecture 8.2 extensions.  */
 int arm_arch8_2 = 0;
 
+/* Nonzero if this chip supports the ARM Architecture 8.3 extensions.  */
+int arm_arch8_3 = 0;
+
+/* Nonzero if this chip supports the ARM Architecture 8.4 extensions.  */
+int arm_arch8_4 = 0;
+
 /* Nonzero if this chip supports the FP16 instructions extension of ARM
    Architecture 8.2.  */
 int arm_fp16_inst = 0;
@@ -3649,6 +3655,8 @@ arm_option_reconfigure_globals (void)
   arm_arch8 = bitmap_bit_p (arm_active_target.isa, isa_bit_armv8);
   arm_arch8_1 = bitmap_bit_p (arm_active_target.isa, isa_bit_armv8_1);
   arm_arch8_2 = bitmap_bit_p (arm_active_target.isa, isa_bit_armv8_2);
+  arm_arch8_3 = bitmap_bit_p (arm_active_target.isa, isa_bit_armv8_3);
+  arm_arch8_4 = bitmap_bit_p (arm_active_target.isa, isa_bit_armv8_4);
   arm_arch_thumb1 = bitmap_bit_p (arm_active_target.isa, isa_bit_thumb);
   arm_arch_thumb2 = bitmap_bit_p (arm_active_target.isa, isa_bit_thumb2);
   arm_arch_xscale = bitmap_bit_p (arm_active_target.isa, isa_bit_xscale);
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 0b2d42399d22ba89a976e39bef6182d31173c1ef..4c3a0169d7b0f9613a5f36ed5dac0dfe31f54442 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -123,6 +123,12 @@
 (define_mode_iterator VF [(V4HF "TARGET_NEON_FP16INST")
 			   (V8HF "TARGET_NEON_FP16INST") V2SF V4SF])
 
+;; Double vector modes.
+(define_mode_iterator VDF [V2SF V4HF])
+
+;; Quad vector Float modes with half/single elements.
+(define_mode_iterator VQ_HSF [V8HF V4SF])
+
 ;; All supported vector modes (except those with 64-bit integer elements).
 (define_mode_iterator VDQW [V8QI V16QI V4HI V8HI V2SI V4SI V2SF V4SF])
 
@@ -423,6 +429,9 @@
 
 (define_int_iterator VFMLHALVES [UNSPEC_VFML_LO UNSPEC_VFML_HI])
 
+(define_int_iterator VCADD [UNSPEC_VCADD90 UNSPEC_VCADD270])
+(define_int_iterator VCMLA [UNSPEC_VCMLA UNSPEC_VCMLA90 UNSPEC_VCMLA180 UNSPEC_VCMLA270])
+
 ;;----------------------------------------------------------------------------
 ;; Mode attributes
 ;;----------------------------------------------------------------------------
@@ -741,7 +750,7 @@
 (define_mode_attr F_constraint [(SF "t") (DF "w")])
 (define_mode_attr vfp_type [(SF "s") (DF "d")])
 (define_mode_attr vfp_double_cond [(SF "") (DF "&& TARGET_VFP_DOUBLE")])
-(define_mode_attr VF_constraint [(V2SF "t") (V4SF "w")])
+(define_mode_attr VF_constraint [(V4HF "t") (V8HF "t") (V2SF "t") (V4SF "w")])
 
 ;; Mode attribute used to build the "type" attribute.
 (define_mode_attr q [(V8QI "") (V16QI "_q")
@@ -989,6 +998,23 @@
                           (UNSPEC_SHA1SU0 "V4SI") (UNSPEC_SHA256H "V4SI")
                           (UNSPEC_SHA256H2 "V4SI") (UNSPEC_SHA256SU1 "V4SI")])
 
+(define_int_attr rot [(UNSPEC_VCADD90 "90")
+		      (UNSPEC_VCADD270 "270")
+		      (UNSPEC_VCMLA "0")
+		      (UNSPEC_VCMLA90 "90")
+		      (UNSPEC_VCMLA180 "180")
+		      (UNSPEC_VCMLA270 "270")])
+
+(define_int_attr rotsplit1 [(UNSPEC_VCMLA "0")
+			    (UNSPEC_VCMLA90 "0")
+			    (UNSPEC_VCMLA180 "180")
+			    (UNSPEC_VCMLA270 "180")])
+
+(define_int_attr rotsplit2 [(UNSPEC_VCMLA "90")
+			    (UNSPEC_VCMLA90 "270")
+			    (UNSPEC_VCMLA180 "270")
+			    (UNSPEC_VCMLA270 "90")])
+
 ;; Both kinds of return insn.
 (define_code_iterator RETURNS [return simple_return])
 (define_code_attr return_str [(return "") (simple_return "simple_")])
@@ -1051,4 +1077,4 @@
 (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
 
 (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
-			   (UNSPEC_DOT_U "u8")])
+			   (UNSPEC_DOT_U "u8")])
\ No newline at end of file
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 5aeee4b08c168c5060d2156edfcba40cb25b5f6f..f50075bf5ffb6be6db1975087da0b468ab05a8a2 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -3425,6 +3425,62 @@
   DONE;
 })
 
+;; The vcadd and vcmla patterns are kept as UNSPECs explicitly because their
+;; usage needs to guarantee that the source vectors are contiguous.  It would
+;; be wrong to describe the operation without being able to describe the
+;; permute that is also required, but even if that is done the permute would
+;; have been created as a LOAD_LANES, which means the values in the registers
+;; are in the wrong order.
+(define_insn "neon_vcadd<rot><mode>"
+  [(set (match_operand:VF 0 "register_operand" "=w")
+	(unspec:VF [(match_operand:VF 1 "register_operand" "w")
+		    (match_operand:VF 2 "register_operand" "w")]
+		    VCADD))]
+  "TARGET_COMPLEX"
+  "vcadd.<V_s_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2, #<rot>"
+  [(set_attr "type" "neon_fcadd")]
+)
+
+(define_expand "fcadd<rot><mode>3"
+  [(set (match_operand:VF 0 "register_operand")
+	(unspec:VF [(match_operand:VF 1 "register_operand")
+		    (match_operand:VF 2 "register_operand")]
+		    VCADD))]
+  "TARGET_COMPLEX"
+{
+  emit_insn (gen_neon_vcadd<rot><mode> (operands[0], operands[1], operands[2]));
+  DONE;
+})
+
+(define_insn "neon_vcmla<rot><mode>"
+  [(set (match_operand:VF 0 "register_operand" "=w")
+	(plus:VF (match_operand:VF 1 "register_operand" "0")
+		 (unspec:VF [(match_operand:VF 2 "register_operand" "w")
+			     (match_operand:VF 3 "register_operand" "w")]
+			     VCMLA)))]
+  "TARGET_COMPLEX"
+  "vcmla.<V_s_elem>\t%<V_reg>0, %<V_reg>2, %<V_reg>3, #<rot>"
+  [(set_attr "type" "neon_fcmla")]
+)
+
+;; The complex mla operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "fcmla<rot><mode>4"
+  [(set (match_operand:VF 0 "register_operand")
+	(plus:VF (match_operand:VF 1 "register_operand")
+		 (unspec:VF [(match_operand:VF 2 "register_operand")
+			     (match_operand:VF 3 "register_operand")]
+			     VCMLA)))]
+  "TARGET_COMPLEX"
+{
+  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0], operands[1],
+					      operands[2], operands[3]));
+  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0], operands[0],
+					      operands[2], operands[3]));
+  DONE;
+})
+
 ;; These instructions map to the __builtins for the Dot Product operations.
 (define_insn "neon_<sup>dot<vsi2qi>"
   [(set (match_operand:VCVTI 0 "register_operand" "=w")
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 19416736ef91fc2ac24352c30b6202c87510a3d2..9140b8cdfcf7263fc57908c7663fc8f0399b1b81 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -415,4 +415,10 @@
   UNSPEC_DOT_U
   UNSPEC_VFML_LO
   UNSPEC_VFML_HI
+  UNSPEC_VCADD90
+  UNSPEC_VCADD270
+  UNSPEC_VCMLA
+  UNSPEC_VCMLA90
+  UNSPEC_VCMLA180
+  UNSPEC_VCMLA270
 ])
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c
index 627f2e78daee9c4a4f86c2071080b4114820c209..997d9065504a9a16d3ea1316f7ea4208b3516c55 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c
@@ -26,3 +26,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.2d, v[0-9]+\.2d, v[0-9]+\.2d, #90} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcadd\.} { target { arm*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c
index d94becdb69386c08c07f8b763aea3fa050e6644c..8ab2aa75e261e0d885fb8042c09b6e42284dea85 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c
@@ -26,4 +26,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s, #90} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcadd\.f32\tq[0-9]+, q[0-9]+, q[0-9]+, #(?:0|90)} 2 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c
index df33c313847ac2f519deb8346c3167b015913299..8002d4efa003bb8af6a6592334e7749da336875e 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c
@@ -27,4 +27,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h, #90} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcadd\.f16\tq[0-9]+, q[0-9]+, q[0-9]+, #90} 1 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c
index bbd4d004177328023eb6270c7b38cd9ae0357c60..601d6886a4c999d010ca2e8a5babad066d5fa0a5 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c
@@ -26,4 +26,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.2d, v[0-9]+\.2d, v[0-9]+\.2d, #270} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcadd\.} { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c
index 8f719d072af591f624f89ef458c0f1552cb38a15..f7851bc7304bf671f3e14bb08e7dc434e867a29c 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c
@@ -26,4 +26,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s, #270} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcadd\.f32\tq[0-9]+, q[0-9]+, q[0-9]+, #270} 1 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c
index 3f665ba31e2e2debaa7b78f9dce907a8f94951e7..02172be3647852cd3a959a6b1aef82e3a4c5f28d 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c
@@ -27,4 +27,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h, #270} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcadd\.f16\tq[0-9]+, q[0-9]+, q[0-9]+, #270} 1 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c
index a8c596645ccfb60c89942312620c31636c3f3f40..aebe0b8bdeee25d7ae6e387b006de9413ecbc13e 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c
@@ -29,4 +29,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.2d, v[0-9]+\.2d, v[0-9]+\.2d, #90} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcadd\.} { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c
index abed2b1e824f010feb959d6491df1b7fcb943d23..891e9874d2d66b9849809c3f7ca3c31044256f99 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c
@@ -27,4 +27,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s, #90} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcadd\.f32\tq[0-9]+, q[0-9]+, q[0-9]+, #90} 1 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c
index e097d79f9867be30b59118578b00540b60fe2b2c..871d64a9bab0b08433f55eedd890146058526a1c 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c
@@ -28,4 +28,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h, #90} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcadd\.f32\tq[0-9]+, q[0-9]+, q[0-9]+, #90} 1 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c
index fb4b08bed42452e0a5e9c6315a6317b60c0b5d5d..7c9278945fc28e1350ef8ac9a4ddfdac56da14c9 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c
@@ -27,4 +27,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.2d, v[0-9]+\.2d, v[0-9]+\.2d, #270} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcadd\.} { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c
index 71cc559d1d8c1c9ad0a7a8274024f1954cca04af..a431fc82155c5eccf02cf4b66313caf989777084 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c
@@ -27,4 +27,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s, #270} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcadd\.f32\tq[0-9]+, q[0-9]+, q[0-9]+, #270} 1 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c
index a964b29622b0e2715268c7a03a894ac48e82e98b..6e1b04d4088b9dd503f33aa6e85196a61db0ee5c 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c
@@ -28,4 +28,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcadd\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h, #270} 1 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcadd\.f16\tq[0-9]+, q[0-9]+, q[0-9]+, #270} 1 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c
index 87a2d25e06fa5ab0b59bc8c9ccbf629f533f0e10..6b5baf013ce285cfa0a28cb9128d839d6ad3d4eb 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c
@@ -29,3 +29,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcmla\tv[0-9]+\.2d, v[0-9]+\.2d, v[0-9]+\.2d, #(?:0|90)} 2 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcmla\.} { target { arm*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c
index 8703902944b3e65e2f3dba659fc9425bea4cdd27..2d6fc3354ad5b32c4d636efbeeefdc756d7d2b7a 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c
@@ -27,4 +27,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcmla\tv[0-9]+\.2d, v[0-9]+\.2d, v[0-9]+\.2d, #(?:180|270)} 2 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcmla\.} { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c
index 10833bde24e24d944f2de068f492ac4505b4c012..f4ce831705b09288ef2ca52c26a26fef1d8cca20 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c
@@ -27,4 +27,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcmla\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s, #(?:180|270)} 2 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcmla\.f32\tq[0-9]+, q[0-9]+, q[0-9]+, #(?:180|270)} 2 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c
index 2699577fd7b5a4e565dea702d96e91e0beded0f2..7a6aed992322753dc928f1db9689f58f02702745 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c
@@ -28,4 +28,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcmla\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h, #(?:180|270)} 2 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcmla\.f16\tq[0-9]+, q[0-9]+, q[0-9]+, #(?:180|270)} 2 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c
index 7bdca9ad5033c99ea7035a7e1e0d411b64db1b73..70198d0eb52cf1be2c3df4c99ae5868d7abafd38 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c
@@ -27,4 +27,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcmla\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s, #(?:0|90)} 2 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcmla\.f32\tq[0-9]+, q[0-9]+, q[0-9]+, #(?:0|90)} 2 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c
index 1c7584d4f949cf448cb1da4219f5dd9db14206cb..ccc4a8723b28f81de0ee93abeff6d8a09e841260 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c
@@ -27,4 +27,4 @@ int main()
 }
 
 /* { dg-final { scan-assembler-not {fcmla} { target { aarch64*-*-* } } } } */
-/* { dg-final { scan-assembler-not {vcmla} { target { arm*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcmla\.} { target { arm*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c
index eb5cd4ec760031d0ab81d8dcce224e94c234c764..b9748e3674f3594369d81f8587f5d5a424c13562 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c
@@ -27,4 +27,4 @@ int main()
 }
 
 /* { dg-final { scan-assembler-not {fcmla} { target { aarch64*-*-* } } } } */
-/* { dg-final { scan-assembler-not {vcmla} { target { arm*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcmla\.} { target { arm*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c
index 8f1585c14cc0637a17d335b93e9198b1549f753d..09e489ffcd302b4bdba2148c3a11529344df2a11 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c
@@ -28,4 +28,4 @@ int main()
 }
 
 /* { dg-final { scan-assembler-not {fcmla} { target { aarch64*-*-* } } } } */
-/* { dg-final { scan-assembler-not {vcmla} { target { arm*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcmla\.} { target { arm*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c
index bd8ead4d58884836bbc13005c59804a26d24617e..2259587237b510149a8369761f6b3b92d1d79cb2 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c
@@ -28,4 +28,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-times {fcmla\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h, #(?:0|90)} 2 { target { aarch64*-*-* } } } } */
+/* { dg-final { scan-assembler-times {vcmla\.f16\tq[0-9]+, q[0-9]+, q[0-9]+, #(?:0|90)} 2 { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c
index fafb8b576b5d3d44ca09c4050be220c2195c0eba..acc3fad76791d7038c3f96d5333b68ce9af99468 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c
@@ -29,5 +29,5 @@ int main()
 }
 
 /* { dg-final { scan-assembler-not {fcmla} { target { aarch64*-*-* } } } } */
-/* { dg-final { scan-assembler-not {vcmla} { target { arm*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcmla\.} { target { arm*-*-* } } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c
index 68d395a6075fb19ad79f4f4d41db41bfa4227931..d913a192bce0b9d059297c4c2024814a59dabd0e 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c
@@ -27,4 +27,4 @@ int main()
 }
 
 /* { dg-final { scan-assembler-not {fcmla} { target { aarch64*-*-* } } } } */
-/* { dg-final { scan-assembler-not {vcmla} { target { arm*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcmla\.} { target { arm*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c
index acb566df44e1ce21a7039ae90af9bba7c8e49c50..08a77a8f8215db944d8d0438b310ce32f68a57ba 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c
@@ -28,4 +28,4 @@ int main()
 }
 
 /* { dg-final { scan-assembler-not {fcmla} { target { aarch64*-*-* } } } } */
-/* { dg-final { scan-assembler-not {vcmla} { target { arm*-*-* } } } } */
+/* { dg-final { scan-assembler-not {vcmla\.} { target { arm*-*-* } } } } */



* Re: [PATCH 8/9][GCC][Arm] Add autovectorization support for complex multiplication and addition
  2018-11-11 10:28 [PATCH 8/9][GCC][Arm] Add autovectorization support for complex multiplication and addition Tamar Christina
@ 2018-11-14 12:27 ` Richard Biener
  2018-11-14 15:47   ` Tamar Christina
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2018-11-14 12:27 UTC (permalink / raw)
  To: Tamar.Christina
  Cc: GCC Patches, nd, Ramana Radhakrishnan, Richard Earnshaw,
	Nick Clifton, Kyrylo.Tkachov

On Sun, Nov 11, 2018 at 11:28 AM Tamar Christina
<Tamar.Christina@arm.com> wrote:
>
> Hi All,
>
> This patch adds the expander support for supporting autovectorization of complex number operations
> such as Complex addition with a rotation along the Argand plane.  This also adds support for complex
> FMA.
>
> The instructions are described in the ArmARM [1] and are available from Armv8.3-a onwards.
>
> Concretely, this generates
>
> f90:
>         add     ip, r1, #15
>         add     r3, r0, #15
>         sub     r3, r3, r2
>         sub     ip, ip, r2
>         cmp     ip, #30
>         cmphi   r3, #30
>         add     r3, r0, #1600
>         bls     .L5
> .L3:
>         vld1.32 {q8}, [r0]!
>         vld1.32 {q9}, [r1]!
>         vcadd.f32       q8, q8, q9, #90
>         vst1.32 {q8}, [r2]!
>         cmp     r0, r3
>         bne     .L3
>         bx      lr
> .L5:
>         vld1.32 {d16}, [r0]!
>         vld1.32 {d17}, [r1]!
>         vcadd.f32       d16, d16, d17, #90
>         vst1.32 {d16}, [r2]!
>         cmp     r0, r3
>         bne     .L5
>         bx      lr
>
>
>
> now instead of
>
> f90:
>         add     ip, r1, #31
>         add     r3, r0, #31
>         sub     r3, r3, r2
>         sub     ip, ip, r2
>         cmp     ip, #62
>         cmphi   r3, #62
>         add     r3, r0, #1600
>         bls     .L2
> .L3:
>         vld2.32 {d20-d23}, [r0]!
>         vld2.32 {d24-d27}, [r1]!
>         cmp     r0, r3
>         vsub.f32        q8, q10, q13
>         vadd.f32        q9, q12, q11
>         vst2.32 {d16-d19}, [r2]!
>         bne     .L3
>         bx      lr
> .L2:
>         vldr    d19, .L10
> .L5:
>         vld1.32 {d16}, [r1]!
>         vld1.32 {d18}, [r0]!
>         vrev64.32       d16, d16
>         cmp     r0, r3
>         vsub.f32        d17, d18, d16
>         vadd.f32        d16, d16, d18
>         vswp    d16, d17
>         vtbl.8  d16, {d16, d17}, d19
>         vst1.32 {d16}, [r2]!
>         bne     .L5
>         bx      lr
> .L11:
>         .align  3
> .L10:
>         .byte   0
>         .byte   1
>         .byte   2
>         .byte   3
>         .byte   12
>         .byte   13
>         .byte   14
>         .byte   15
>
>
> For complex additions with a 90* rotation along the Argand plane.
>
> [1] https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
>
> Bootstrap and Regtest on aarch64-none-linux-gnu, arm-none-gnueabihf and x86_64-pc-linux-gnu
> are still on going but previous patch showed no regressions.
>
> The instructions have also been tested on aarch64-none-elf and arm-none-eabi on a Armv8.3-a model
> and -march=Armv8.3-a+fp16 and all tests pass.
>
> Ok for trunk?

+;; The complex mla operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "fcmla<rot><mode>4"
+  [(set (match_operand:VF 0 "register_operand")
+       (plus:VF (match_operand:VF 1 "register_operand")
+                (unspec:VF [(match_operand:VF 2 "register_operand")
+                            (match_operand:VF 3 "register_operand")]
+                            VCMLA)))]
+  "TARGET_COMPLEX"
+{
+  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0], operands[1],
+                                             operands[2], operands[3]));
+  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0], operands[0],
+                                             operands[2], operands[3]));
+  DONE;
+})

What are the two halves?  Why hide this from the vectorizer if you go down to this
level of detail and expose the rotation to it?

+;; The vcadd and vcmla patterns are made UNSPEC for the explicitly due to the
+;; fact that their usage need to guarantee that the source vectors are
+;; contiguous.  It would be wrong to describe the operation without being able
+;; to describe the permute that is also required, but even if that is done
+;; the permute would have been created as a LOAD_LANES which means the values
+;; in the registers are in the wrong order.

Hmm, it's totally non-obvious to me how this relates to loads or what a
"non-contiguous" register would be?  That is, once you make this an unspec
combine will never be able to synthesize this from intrinsics code that
doesn't use this form.

+(define_insn "neon_vcadd<rot><mode>"
+  [(set (match_operand:VF 0 "register_operand" "=w")
+       (unspec:VF [(match_operand:VF 1 "register_operand" "w")
+                   (match_operand:VF 2 "register_operand" "w")]
+                   VCADD))]


> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
>
>         * config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
>         * config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4): New.
>         (arm_option_reconfigure_globals): Use them.
>         * config/arm/iterators.md (VDF, VQ_HSF): New.
>         (VCADD, VCMLA): New.
>         (VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
>         * config/arm/neon.md (neon_vcadd<rot><mode>, fcadd<rot><mode>3,
>         neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
>         * config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
>         UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180, UNSPEC_VCMLA270): New.
>
> gcc/testsuite/ChangeLog:
>
> 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
>
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c: Add Arm support.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c: Likewise.
>
> --


* RE: [PATCH 8/9][GCC][Arm] Add autovectorization support for complex multiplication and addition
  2018-11-14 12:27 ` Richard Biener
@ 2018-11-14 15:47   ` Tamar Christina
  2018-11-15 12:42     ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Tamar Christina @ 2018-11-14 15:47 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, nd, Ramana Radhakrishnan, Richard Earnshaw, nickc,
	Kyrylo Tkachov

Hi Richard,

> > Ok for trunk?
> 
> +;; The complex mla operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does
> +the ;; remainder.  Because of this, expand early.
> +(define_expand "fcmla<rot><mode>4"
> +  [(set (match_operand:VF 0 "register_operand")
> +       (plus:VF (match_operand:VF 1 "register_operand")
> +                (unspec:VF [(match_operand:VF 2 "register_operand")
> +                            (match_operand:VF 3 "register_operand")]
> +                            VCMLA)))]
> +  "TARGET_COMPLEX"
> +{
> +  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0],
> operands[1],
> +                                             operands[2],
> +operands[3]));
> +  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0],
> operands[0],
> +                                             operands[2],
> +operands[3]));
> +  DONE;
> +})
> 
> What's the two halves?  Why hide this from the vectorizer if you go down all
> to the detail and expose the rotation to it?
> 

The two halves are an implementation detail of the instruction in Armv8.3-a.  As far as the
vectorizer is concerned, all you want to do is an FMA rotating one of the operands by 0 or 180 degrees.

Also note that the "rotations" in these instructions aren't exactly the same as what would fall under rotation of a complex number,
as each instruction can only do half of the final computation you want.

In the ISA these instructions have to be used in a pair, where rotations determine
the operation you want to perform. E.g. a rotation of #0 followed by #90 makes it a multiply and accumulate.

A rotation of #180 followed by #90 makes this a vector complex subtract, which is intended to be used by the first call
using a register cleared with 0 (it becomes an "FMS" essentially if you don't clear the register).
Each "rotation" determines what operation is done and which parts of the complex number are used.  You change the
"rotations" and the grouping of the instructions to get different operations.

I did not expose this to the vectorizer as it seems very ISA-specific.
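
Concretely, a scalar sketch of that pairing (the per-rotation semantics here are
paraphrased from the Arm ARM description of VCMLA, and the helper name is made up):

#include <complex.h>

/* One complex lane: the #0 half followed by the #90 half, with a and b the
   two multiplicand operands and acc the accumulator.  */
static float _Complex
vcmla_0_then_90 (float _Complex acc, float _Complex a, float _Complex b)
{
  float re = crealf (acc), im = cimagf (acc);

  /* vcmla ..., #0  */
  re += crealf (a) * crealf (b);
  im += crealf (a) * cimagf (b);

  /* vcmla ..., #90  */
  re -= cimagf (a) * cimagf (b);
  im += cimagf (a) * crealf (b);

  /* Together: acc + a * b, i.e. a complex multiply-accumulate.  */
  return re + im * I;
}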

> +;; The vcadd and vcmla patterns are made UNSPEC for the explicitly due
> +to the ;; fact that their usage need to guarantee that the source
> +vectors are ;; contiguous.  It would be wrong to describe the operation
> +without being able ;; to describe the permute that is also required,
> +but even if that is done ;; the permute would have been created as a
> +LOAD_LANES which means the values ;; in the registers are in the wrong
> order.
> 
> Hmm, it's totally non-obvious to me how this relates to loads or what a "non-
> contiguous"
> register would be?  That is, once you make this an unspec combine will never
> be able to synthesize this from intrinsics code that doesn't use this form.
> 
> +(define_insn "neon_vcadd<rot><mode>"
> +  [(set (match_operand:VF 0 "register_operand" "=w")
> +       (unspec:VF [(match_operand:VF 1 "register_operand" "w")
> +                   (match_operand:VF 2 "register_operand" "w")]
> +                   VCADD))]
> 

Yes, that's my goal: if operand1 and operand2 are loaded by instructions that
would have permuted the values in the register, then the instruction doesn't work.

The instruction does the permute itself, so it expects the values to have been loaded
using a simple load and not a LOAD_LANES.  So I intend to prevent combine from
recognizing the operation for that reason.  For the ADD, combine can be used, but then you'd
have to match the load and store since you have to change these; for the rest you'll run
afoul of combine's 5 instruction limit.

Kind Regards,
Tamar

> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
> >
> >         * config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
> >         * config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4):
> New.
> >         (arm_option_reconfigure_globals): Use them.
> >         * config/arm/iterators.md (VDF, VQ_HSF): New.
> >         (VCADD, VCMLA): New.
> >         (VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
> >         * config/arm/neon.md (neon_vcadd<rot><mode>,
> fcadd<rot><mode>3,
> >         neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
> >         * config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
> >         UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180,
> UNSPEC_VCMLA270): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
> >
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c: Add Arm
> support.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c:
> Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c:
> Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c:
> Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c:
> Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c:
> Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c:
> Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c:
> Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c:
> Likewise.
> >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c:
> Likewise.
> >
> > --


* Re: [PATCH 8/9][GCC][Arm] Add autovectorization support for complex multiplication and addition
  2018-11-14 15:47   ` Tamar Christina
@ 2018-11-15 12:42     ` Richard Biener
  2018-11-15 12:48       ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2018-11-15 12:42 UTC (permalink / raw)
  To: Tamar.Christina
  Cc: GCC Patches, nd, Ramana Radhakrishnan, Richard Earnshaw,
	Nick Clifton, Kyrylo.Tkachov

On Wed, Nov 14, 2018 at 4:47 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
>
> Hi Richard,
>
> > > Ok for trunk?
> >
> > +;; The complex mla operations always need to expand to two instructions.
> > +;; The first operation does half the computation and the second does
> > +the ;; remainder.  Because of this, expand early.
> > +(define_expand "fcmla<rot><mode>4"
> > +  [(set (match_operand:VF 0 "register_operand")
> > +       (plus:VF (match_operand:VF 1 "register_operand")
> > +                (unspec:VF [(match_operand:VF 2 "register_operand")
> > +                            (match_operand:VF 3 "register_operand")]
> > +                            VCMLA)))]
> > +  "TARGET_COMPLEX"
> > +{
> > +  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0],
> > operands[1],
> > +                                             operands[2],
> > +operands[3]));
> > +  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0],
> > operands[0],
> > +                                             operands[2],
> > +operands[3]));
> > +  DONE;
> > +})
> >
> > What's the two halves?  Why hide this from the vectorizer if you go down all
> > to the detail and expose the rotation to it?
> >
>
> The two halves are an implementation detail of the instruction in Armv8.3-a. As far as the
> Vectorizer is concerned all you want to do, is an FMA rotating one of the operands by 0 or 180 degrees.
>
> Also note that the "rotations" in these instructions aren't exactly the same as what would fall under rotation of a complex number,
> as each instruction can only do half of the final computation you want.
>
> In the ISA these instructions have to be used in a pair, where rotations determine
> the operation you want to perform. E.g. a rotation of #0 followed by #90 makes it a multiply and accumulate.
>
> A rotation of #180 followed by #90 makes this a vector complex subtract, which is intended to be used by the first call
> using a register cleared with 0 (It becomes an "FMS" essentially if you don't clear the register).
> Each "rotation" determine what operation is done and using which parts of the complex number. You change the
> "rotations" and the grouping of the instructions to get different operations.
>
> I did not expose this to the vectorizer as It seems very ISA specific.
>
> > +;; The vcadd and vcmla patterns are made UNSPEC for the explicitly due
> > +to the ;; fact that their usage need to guarantee that the source
> > +vectors are ;; contiguous.  It would be wrong to describe the operation
> > +without being able ;; to describe the permute that is also required,
> > +but even if that is done ;; the permute would have been created as a
> > +LOAD_LANES which means the values ;; in the registers are in the wrong
> > order.
> >
> > Hmm, it's totally non-obvious to me how this relates to loads or what a "non-
> > contiguous"
> > register would be?  That is, once you make this an unspec combine will never
> > be able to synthesize this from intrinsics code that doesn't use this form.
> >
> > +(define_insn "neon_vcadd<rot><mode>"
> > +  [(set (match_operand:VF 0 "register_operand" "=w")
> > +       (unspec:VF [(match_operand:VF 1 "register_operand" "w")
> > +                   (match_operand:VF 2 "register_operand" "w")]
> > +                   VCADD))]
> >
>
> Yes that's my goal, as if operand1 and operand2 are loaded by instructions that
> would have permuted the values in the register then the instruction doesn't work.
>
> The instruction does the permute itself, so it expects the values to have been loaded
> using a simple load and not a LOAD_LANES. So I am intended to prevent combine from
> recognizing the operation for that reason.

But LOAD_LANES is used differently and the ISA probably doesn't really care how
you set up the register inputs.  You of course have to put in the correct
values but how they get there doesn't matter.  So I don't see how combine can
mess things up here.

>  For the ADD combine can be used but then you'd
> have to match the load and store since you have to change these, for the rest you'll run far afoul
> of combine's 5 instruction limit.

Why do you need to change these?  You assume the vectorizer vectorizes using
interleaving - yes, in that case all hope is lost.  I assume the vectorizer
will end up doing SLP with the existing TWO_OPERATORS support, thus for
complex subtraction you'll see (A and B being complex vectors)

   add = A + B;
   sub = A - B;
   resultAcomplex_minusB = vec_merge (add, sub, 1)

basically the vectorizer will perform operations twice and then blend the two
results.  The add/sub + blend needs to be recognized by combine
(x86 does this for the vaddsub instructions which were designed to handle
complex subtraction and parts of the multiply).

For complex multiplication you'll see the pieces your ISA supports.

  mul1 = A * B
  mul2 = A * rot(B)  (rotation will be a shuffle)
  add = mul1 + mul2
  sub = mul1 - mul2
  result = blend (add, sub, ...)

as usual the combiner is helped by intermediate combiner patterns
(in this case modeling your ISA's intermediate steps probably already helps).
The x86 ISA also has fmaddsub/fmsubadd instructions but without the
embedded rotation, which you have to do explicitly.  For example the
vectorizer generates for a simple complex FMA loop

_Complex double x[1024];
_Complex double y[1024];
_Complex double z[1024];

void foo ()
{
  for (int i = 0; i < 1024; ++i)
    x[i] += y[i] * z[i];
}

  <bb 3> [local count: 1063004407]:
  # ivtmp.34_6 = PHI <0(2), ivtmp.34_12(3)>
  vect__6.5_49 = MEM[symbol: x, index: ivtmp.34_6, offset: 0B];
  vect__13.11_42 = MEM[symbol: y, index: ivtmp.34_6, offset: 0B];
  vect__13.12_41 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 0, 0 }>;
  vect__13.17_36 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 1, 1 }>;
  vect__11.8_46 = MEM[symbol: z, index: ivtmp.34_6, offset: 0B];
  vect__11.21_32 = VEC_PERM_EXPR <vect__11.8_46, vect__11.8_46, { 1, 0 }>;
  vect__17.13_40 = vect__13.12_41 * vect__11.8_46;
  vect__18.22_31 = vect__11.21_32 * vect__13.17_36;
  vect__21.23_30 = vect__17.13_40 - vect__18.22_31;
  vect__21.24_29 = vect__18.22_31 + vect__17.13_40;
  _28 = VEC_PERM_EXPR <vect__21.23_30, vect__21.24_29, { 0, 3 }>;
  vect__23.25_27 = _28 + vect__6.5_49;
  MEM[symbol: x, index: ivtmp.34_6, offset: 0B] = vect__23.25_27;
  ivtmp.34_12 = ivtmp.34_6 + 16;
  if (ivtmp.34_12 != 16384)
    goto <bb 3>; [99.00%]

which before combine looks like

    8: r92:V2DF=[r82:DI+`y']
   10: r93:V2DF=[r82:DI+`z']
   12: r97:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
   13: r90:V2DF=r97:V2DF*r93:V2DF
   14: r98:V2DF=vec_select(r93:V2DF,parallel)
   15: r99:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
   16: r87:V2DF=r98:V2DF*r99:V2DF
   18: r101:V2DF=r90:V2DF-r87:V2DF
   19: r102:V2DF=r87:V2DF+r90:V2DF
   20: r103:V2DF=vec_merge(r101:V2DF,r102:V2DF,0x1)
   22: r105:V2DF=r103:V2DF+[r82:DI+`x']
   23: [r82:DI+`x']=r105:V2DF

I assume you can combine the multiplications with the selects
(the selects might be something else for you - that's somewhat target dependent)
into your half-way operations with the embedded rotates.

Richard.

>
> Kind Regards,
> Tamar
>
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
> > >
> > >         * config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
> > >         * config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4):
> > New.
> > >         (arm_option_reconfigure_globals): Use them.
> > >         * config/arm/iterators.md (VDF, VQ_HSF): New.
> > >         (VCADD, VCMLA): New.
> > >         (VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
> > >         * config/arm/neon.md (neon_vcadd<rot><mode>,
> > fcadd<rot><mode>3,
> > >         neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
> > >         * config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
> > >         UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180,
> > UNSPEC_VCMLA270): New.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
> > >
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c: Add Arm
> > support.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c:
> > Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c:
> > Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c:
> > Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c:
> > Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c:
> > Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c:
> > Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c:
> > Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c:
> > Likewise.
> > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c:
> > Likewise.
> > >
> > > --


* Re: [PATCH 8/9][GCC][Arm] Add autovectorization support for complex multiplication and addition
  2018-11-15 12:42     ` Richard Biener
@ 2018-11-15 12:48       ` Richard Biener
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Biener @ 2018-11-15 12:48 UTC (permalink / raw)
  To: Tamar.Christina
  Cc: GCC Patches, nd, Ramana Radhakrishnan, Richard Earnshaw,
	Nick Clifton, Kyrylo.Tkachov

On Thu, Nov 15, 2018 at 1:42 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Nov 14, 2018 at 4:47 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
> >
> > Hi Richard,
> >
> > > > Ok for trunk?
> > >
> > > +;; The complex mla operations always need to expand to two instructions.
> > > +;; The first operation does half the computation and the second does
> > > +the ;; remainder.  Because of this, expand early.
> > > +(define_expand "fcmla<rot><mode>4"
> > > +  [(set (match_operand:VF 0 "register_operand")
> > > +       (plus:VF (match_operand:VF 1 "register_operand")
> > > +                (unspec:VF [(match_operand:VF 2 "register_operand")
> > > +                            (match_operand:VF 3 "register_operand")]
> > > +                            VCMLA)))]
> > > +  "TARGET_COMPLEX"
> > > +{
> > > +  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0],
> > > operands[1],
> > > +                                             operands[2],
> > > +operands[3]));
> > > +  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0],
> > > operands[0],
> > > +                                             operands[2],
> > > +operands[3]));
> > > +  DONE;
> > > +})
> > >
> > > What's the two halves?  Why hide this from the vectorizer if you go down all
> > > to the detail and expose the rotation to it?
> > >
> >
> > The two halves are an implementation detail of the instruction in Armv8.3-a.  As far as the
> > vectorizer is concerned, all you want to do is an FMA rotating one of the operands by 0 or 180 degrees.
> >
> > Also note that the "rotations" in these instructions aren't exactly the same as what would fall under rotation of a complex number,
> > as each instruction can only do half of the final computation you want.
> >
> > In the ISA these instructions have to be used in a pair, where rotations determine
> > the operation you want to perform. E.g. a rotation of #0 followed by #90 makes it a multiply and accumulate.
> >
> > A rotation of #180 followed by #90 makes this a vector complex subtract, which is intended to be used
> > with the first call using a register cleared to 0 (it becomes an "FMS" essentially if you don't clear the register).
> > Each "rotation" determines what operation is done and which parts of the complex number are used. You change the
> > "rotations" and the grouping of the instructions to get different operations.
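> >
> > Roughly, in scalar terms, the usual #0 followed by #90 pairing for d += a * b
> > works out to something like the sketch below (even lane = real, odd lane =
> > imaginary; the helper names are made up for illustration only):
> >
> >   /* First half (#0): only the real part of 'a' contributes.  */
> >   static void cmla_rot0 (float *d_re, float *d_im,
> >                          float a_re, float b_re, float b_im)
> >   {
> >     *d_re += a_re * b_re;
> >     *d_im += a_re * b_im;
> >   }
> >
> >   /* Second half (#90): the imaginary part of 'a' completes the product,
> >      so that overall d_re += a_re*b_re - a_im*b_im and
> >      d_im += a_re*b_im + a_im*b_re.  */
> >   static void cmla_rot90 (float *d_re, float *d_im,
> >                           float a_im, float b_re, float b_im)
> >   {
> >     *d_re += -a_im * b_im;
> >     *d_im +=  a_im * b_re;
> >   }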
> >
> > I did not expose this to the vectorizer as it seems very ISA-specific.
> >
> > > +;; The vcadd and vcmla patterns are made UNSPEC explicitly due to the
> > > +;; fact that their usage needs to guarantee that the source vectors are
> > > +;; contiguous.  It would be wrong to describe the operation without being
> > > +;; able to describe the permute that is also required, but even if that
> > > +;; is done the permute would have been created as a LOAD_LANES which means
> > > +;; the values in the registers are in the wrong order.
> > >
> > > Hmm, it's totally non-obvious to me how this relates to loads or what a
> > > "non-contiguous" register would be?  That is, once you make this an unspec,
> > > combine will never be able to synthesize this from intrinsics code that
> > > doesn't use this form.
> > >
> > > +(define_insn "neon_vcadd<rot><mode>"
> > > +  [(set (match_operand:VF 0 "register_operand" "=w")
> > > +       (unspec:VF [(match_operand:VF 1 "register_operand" "w")
> > > +                   (match_operand:VF 2 "register_operand" "w")]
> > > +                   VCADD))]
> > >
> >
> > Yes, that's my goal: if operand1 and operand2 are loaded by instructions that
> > would have permuted the values in the register, then the instruction doesn't work.
> >
> > The instruction does the permute itself, so it expects the values to have been loaded
> > using a simple load and not a LOAD_LANES. So I intend to prevent combine from
> > recognizing the operation for that reason.
>
> But LOAD_LANES is used differently and the ISA probably doesn't really care how
> you set up the register inputs.  You of course have to put in the
> correct values but
> how they get there doesn't matter.  So I don't see how combine can
> mess things up here.
>
> >  For the ADD, combine can be used, but then you'd
> > have to match the load and store since you have to change these; for the rest you'll run far afoul
> > of combine's 5-instruction limit.
>
> Why do you need to change these?  You assume the vectorizer vectorizes using
> interleaving - yes, in that case all hope is lost.  I assume the
> vectorizer will end up
> doing SLP with the existing TWO_OPERATORS support

You might be bitten by the fact that you tuned the vectorizer to always prefer
load/store-lanes over SLP when there are permutations.  You could lift
that a bit, allowing rotation/projection permutes as they occur with complex
arithmetic.

> , thus for complex subtraction
> you'll see (A and B being complex vectors)
>
>    add = A + B;
>    sub = A - B;
>    resultAcomplex_minusB = vec_merge (add, sub, 1)
>
> basically the vectorizer will perform operations twice and then blend the two
> results.  The add/sub + blend needs to be recognized by combine
> (x86 does this for the vaddsub instructions which were designed to handle
> complex subtraction and parts of the multiply).
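>
> For the rotate-by-90 addition, for example, that add/sub + blend shape works
> out per complex element to roughly the following (a scalar sketch with
> made-up names; the lane swap of B is the permute the sketch above leaves
> implicit):
>
>   static void cadd_rot90 (const double a[2], const double b[2], double r[2])
>   {
>     double brev[2] = { b[1], b[0] };               /* swap b's lanes */
>     double add[2] = { a[0] + brev[0], a[1] + brev[1] };
>     double sub[2] = { a[0] - brev[0], a[1] - brev[1] };
>     /* Blend: lane 0 from the subtract, lane 1 from the add,
>        giving (a.re - b.im, a.im + b.re).  */
>     r[0] = sub[0];
>     r[1] = add[1];
>   }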
>
> For complex multiplication you'll see the pieces your ISA supports.
>
>   mul1 = A * B
>   mul2 = A * rot(B)  (rotation will be a shuffle)
>   add = mul1 + mul2
>   sub = mul1 - mul2
>   result = blend (add, sub, ...)
>
> as usual the combiner is helped by intermediate combine patterns
> (in this case modeling your ISA's intermediate steps probably already helps).
> The x86 ISA also has fmaddsub/fmsubadd instructions, but without the
> embedded rotation, which you then have to do explicitly.  For example the
> vectorizer generates for a simple complex FMA loop
>
> _Complex double x[1024];
> _Complex double y[1024];
> _Complex double z[1024];
>
> void foo ()
> {
>   for (int i = 0; i < 1024; ++i)
>     x[i] += y[i] * z[i];
> }
>
>   <bb 3> [local count: 1063004407]:
>   # ivtmp.34_6 = PHI <0(2), ivtmp.34_12(3)>
>   vect__6.5_49 = MEM[symbol: x, index: ivtmp.34_6, offset: 0B];
>   vect__13.11_42 = MEM[symbol: y, index: ivtmp.34_6, offset: 0B];
>   vect__13.12_41 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 0, 0 }>;
>   vect__13.17_36 = VEC_PERM_EXPR <vect__13.11_42, vect__13.11_42, { 1, 1 }>;
>   vect__11.8_46 = MEM[symbol: z, index: ivtmp.34_6, offset: 0B];
>   vect__11.21_32 = VEC_PERM_EXPR <vect__11.8_46, vect__11.8_46, { 1, 0 }>;
>   vect__17.13_40 = vect__13.12_41 * vect__11.8_46;
>   vect__18.22_31 = vect__11.21_32 * vect__13.17_36;
>   vect__21.23_30 = vect__17.13_40 - vect__18.22_31;
>   vect__21.24_29 = vect__18.22_31 + vect__17.13_40;
>   _28 = VEC_PERM_EXPR <vect__21.23_30, vect__21.24_29, { 0, 3 }>;
>   vect__23.25_27 = _28 + vect__6.5_49;
>   MEM[symbol: x, index: ivtmp.34_6, offset: 0B] = vect__23.25_27;
>   ivtmp.34_12 = ivtmp.34_6 + 16;
>   if (ivtmp.34_12 != 16384)
>     goto <bb 3>; [99.00%]
>
> which before combine looks like
>
>     8: r92:V2DF=[r82:DI+`y']
>    10: r93:V2DF=[r82:DI+`z']
>    12: r97:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
>    13: r90:V2DF=r97:V2DF*r93:V2DF
>    14: r98:V2DF=vec_select(r93:V2DF,parallel)
>    15: r99:V2DF=vec_select(vec_concat(r92:V2DF,r92:V2DF),parallel)
>    16: r87:V2DF=r98:V2DF*r99:V2DF
>    18: r101:V2DF=r90:V2DF-r87:V2DF
>    19: r102:V2DF=r87:V2DF+r90:V2DF
>    20: r103:V2DF=vec_merge(r101:V2DF,r102:V2DF,0x1)
>    22: r105:V2DF=r103:V2DF+[r82:DI+`x']
>    23: [r82:DI+`x']=r105:V2DF
>
> I assume you can combine the multiplications with the selects
> (the selects might be something else for you - that's somewhat target dependent)
> into your half-way operations with the embedded rotates.
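>
> Concretely, assuming insn 12 is the dup of y's real lane and insn 15 the dup
> of its imaginary lane (as in the GIMPLE above), the pieces map onto the two
> half-operations roughly like this (scalar sketch, made-up names):
>
>   /* insns 12-13 plus the accumulate into x: the #0-style half.  */
>   static void half_rot0 (double *x_re, double *x_im,
>                          double y_re, double z_re, double z_im)
>   {
>     *x_re += y_re * z_re;
>     *x_im += y_re * z_im;
>   }
>
>   /* insns 14-16 plus the sub/add/blend: the #90-style half.  */
>   static void half_rot90 (double *x_re, double *x_im,
>                           double y_im, double z_re, double z_im)
>   {
>     *x_re += -y_im * z_im;
>     *x_im +=  y_im * z_re;
>   }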
>
> Richard.
>
> >
> > Kind Regards,
> > Tamar
> >
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
> > > >
> > > >         * config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
> > > >         * config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4): New.
> > > >         (arm_option_reconfigure_globals): Use them.
> > > >         * config/arm/iterators.md (VDF, VQ_HSF): New.
> > > >         (VCADD, VCMLA): New.
> > > >         (VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
> > > >         * config/arm/neon.md (neon_vcadd<rot><mode>, fcadd<rot><mode>3,
> > > >         neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
> > > >         * config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
> > > >         UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180, UNSPEC_VCMLA270): New.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
> > > >
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c: Add Arm support.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c: Likewise.
> > > >         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c: Likewise.
> > > >
> > > > --


end of thread, other threads:[~2018-11-15 12:48 UTC | newest]

Thread overview: 5+ messages
2018-11-11 10:28 [PATCH 8/9][GCC][Arm] Add autovectorization support for complex multiplication and addition Tamar Christina
2018-11-14 12:27 ` Richard Biener
2018-11-14 15:47   ` Tamar Christina
2018-11-15 12:42     ` Richard Biener
2018-11-15 12:48       ` Richard Biener
