public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.
@ 2020-12-10 16:59 Tamar Christina
  2020-12-14 11:01 ` Kyrylo Tkachov
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Tamar Christina @ 2020-12-10 16:59 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 2197 bytes --]

Hi All,

This adds implementations of the optabs for complex operations.  With this, the
following C code:

  void f90 (float complex a[restrict N], float complex b[restrict N],
	    float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] + (b[i] * I);
  }

generates

  f90:
	  mov     x3, 0
	  .p2align 3,,7
  .L2:
	  ldr     q0, [x0, x3]
	  ldr     q1, [x1, x3]
	  fcadd   v0.4s, v0.4s, v1.4s, #90
	  str     q0, [x2, x3]
	  add     x3, x3, 16
	  cmp     x3, 1600
	  bne     .L2
	  ret

instead of

  f90:
	  add     x3, x1, 1600
	  .p2align 3,,7
  .L2:
	  ld2     {v4.4s - v5.4s}, [x0], 32
	  ld2     {v2.4s - v3.4s}, [x1], 32
	  fsub    v0.4s, v4.4s, v3.4s
	  fadd    v1.4s, v5.4s, v2.4s
	  st2     {v0.4s - v1.4s}, [x2], 32
	  cmp     x3, x1
	  bne     .L2
	  ret
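
The single fcadd replaces the old fsub/fadd pair because FCADD #90
operates directly on interleaved (real, imaginary) lanes.  As a scalar
model of its per-lane semantics (an illustrative sketch of the
instruction's behaviour, not compiler code):

  typedef struct { float re, im; } cplx;

  /* What FCADD #90 computes for each complex element: a + b * I.  */
  static cplx
  fcadd_rot90 (cplx a, cplx b)
  {
    cplx r;
    r.re = a.re - b.im;  /* the old "fsub v0.4s, v4.4s, v3.4s" */
    r.im = a.im + b.re;  /* the old "fadd v1.4s, v5.4s, v2.4s" */
    return r;
  }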

It defines a new iterator VALL_ARITH which contains the types for which we can
do general arithmetic (it excludes bfloat16).

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
Checked with armv8-a+sve2+fp16 and no issues.  Note that due to a mid-end
limitation, SLP for SVE currently fails for some permutes.  The tests have
these marked as XFAIL.  I do intend to fix this soon.

Matching tests for these are in the mid-end patches.
Note that the mid-end patches are still being respun and I may need to
change the order of some parameters, but no other change is expected, and I
would like to decrease the size of future patches.  As such:

Ok for master?

Thanks,
Tamar


gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (cadd<rot><mode>3,
	cml<fcmac1><rot_op><mode>4, cmul<rot_op><mode>3): New.
	* config/aarch64/iterators.md (VALL_ARITH, UNSPEC_FCMUL,
	UNSPEC_FCMUL180, UNSPEC_FCMLS, UNSPEC_FCMLS180, UNSPEC_CMLS,
	UNSPEC_CMLS180, UNSPEC_CMUL, UNSPEC_CMUL180, FCMLA_OP, FCMUL_OP, rot_op,
	rotsplit1, rotsplit2, fcmac1, sve_rot1, sve_rot2, SVE2_INT_CMLA_OP,
	SVE2_INT_CMUL_OP, SVE2_INT_CADD_OP): New.
	(rot): Add UNSPEC_FCMLS, UNSPEC_FCMUL, UNSPEC_FCMUL180.
	* config/aarch64/aarch64-sve.md (cadd<rot><mode>3,
	cml<fcmac1><rot_op><mode>4, cmul<rot_op><mode>3): New.
	* config/aarch64/aarch64-sve2.md (cadd<rot><mode>3,
	cml<fcmac1><rot_op><mode>4, cmul<rot_op><mode>3): New.

-- 

[-- Attachment #2: rb13907.patch --]
[-- Type: text/x-diff, Size: 18378 bytes --]

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 68baf416045178b0ebcfeb8de2d201f625f1c317..1aa74beeee154e054f2a01f8843dfed218fe850b 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -449,6 +449,14 @@ (define_insn "aarch64_fcadd<rot><mode>"
   [(set_attr "type" "neon_fcadd")]
 )
 
+(define_expand "cadd<rot><mode>3"
+  [(set (match_operand:VHSDF 0 "register_operand")
+	(unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
+		       (match_operand:VHSDF 2 "register_operand")]
+		       FCADD))]
+  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
+)
+
 (define_insn "aarch64_fcmla<rot><mode>"
   [(set (match_operand:VHSDF 0 "register_operand" "=w")
 	(plus:VHSDF (match_operand:VHSDF 1 "register_operand" "0")
@@ -508,6 +516,47 @@ (define_insn "aarch64_fcmlaq_lane<rot><mode>"
   [(set_attr "type" "neon_fcmla")]
 )
 
+;; The complex mla/mls operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cml<fcmac1><rot_op><mode>4"
+  [(set (match_operand:VHSDF 0 "register_operand")
+	(plus:VHSDF (match_operand:VHSDF 1 "register_operand")
+		    (unspec:VHSDF [(match_operand:VHSDF 2 "register_operand")
+				   (match_operand:VHSDF 3 "register_operand")]
+				   FCMLA_OP)))]
+  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
+{
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (tmp, operands[1],
+						 operands[2], operands[3]));
+  emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0], tmp,
+						 operands[2], operands[3]));
+  DONE;
+})
+
+;; The complex mul operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cmul<rot_op><mode>3"
+  [(set (match_operand:VHSDF 0 "register_operand")
+	(unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
+		       (match_operand:VHSDF 2 "register_operand")]
+		       FCMUL_OP))]
+  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
+{
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  rtx res1 = gen_reg_rtx (<MODE>mode);
+  emit_move_insn (tmp, CONST0_RTX (<MODE>mode));
+  emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (res1, tmp,
+						 operands[1], operands[2]));
+  emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0], res1,
+						 operands[1], operands[2]));
+  DONE;
+})
+
+
+
 ;; These instructions map to the __builtins for the Dot Product operations.
 (define_insn "aarch64_<sur>dot<vsi2qi>"
   [(set (match_operand:VS 0 "register_operand" "=w")
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 6359c40bdecda6c126bd70bef66561dd1da44dc9..7d27a84016d687cb6c019f98b99a7aacf8b3a031 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -5480,6 +5480,20 @@ (define_expand "@cond_<optab><mode>"
   "TARGET_SVE"
 )
 
+;; Predicated FCADD with a ptrue; unpredicated optab for the auto-vectorizer.
+(define_expand "@cadd<rot><mode>3"
+  [(set (match_operand:SVE_FULL_F 0 "register_operand")
+	(unspec:SVE_FULL_F
+	  [(match_dup 3)
+	   (const_int SVE_RELAXED_GP)
+	   (match_operand:SVE_FULL_F 1 "register_operand")
+	   (match_operand:SVE_FULL_F 2 "register_operand")]
+	  SVE_COND_FCADD))]
+  "TARGET_SVE"
+{
+  operands[3] = aarch64_ptrue_reg (<VPRED>mode);
+})
+
 ;; Predicated FCADD, merging with the first input.
 (define_insn_and_rewrite "*cond_<optab><mode>_2_relaxed"
   [(set (match_operand:SVE_FULL_F 0 "register_operand" "=w, ?&w")
@@ -7152,6 +7166,64 @@ (define_insn "@aarch64_pred_<optab><mode>"
   [(set_attr "movprfx" "*,yes")]
 )
 
+;; Unpredicated optab pattern for the auto-vectorizer.
+;; The complex mla/mls operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cml<fcmac1><rot_op><mode>4"
+  [(set (match_operand:SVE_FULL_F 0 "register_operand")
+	(unspec:SVE_FULL_F
+	  [(match_dup 4)
+	   (match_dup 5)
+	   (match_operand:SVE_FULL_F 1 "register_operand")
+	   (match_operand:SVE_FULL_F 2 "register_operand")
+	   (match_operand:SVE_FULL_F 3 "register_operand")]
+	  FCMLA_OP))]
+  "TARGET_SVE && !BYTES_BIG_ENDIAN"
+{
+  operands[4] = aarch64_ptrue_reg (<VPRED>mode);
+  operands[5] = gen_int_mode (SVE_RELAXED_GP, SImode);
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn (
+    gen_aarch64_pred_fcmla<sve_rot1><mode> (tmp, operands[4],
+					    operands[1], operands[2],
+					    operands[3], operands[5]));
+  emit_insn (
+    gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], operands[4],
+					    tmp, operands[2],
+					    operands[3], operands[5]));
+  DONE;
+})
+
+;; Unpredicated optab pattern for the auto-vectorizer.
+;; The complex mul operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cmul<rot_op><mode>3"
+  [(set (match_operand:SVE_FULL_F 0 "register_operand")
+	(unspec:SVE_FULL_F
+	  [(match_dup 3)
+	   (match_dup 4)
+	   (match_operand:SVE_FULL_F 1 "register_operand")
+	   (match_operand:SVE_FULL_F 2 "register_operand")
+	   (match_dup 5)]
+	  FCMUL_OP))]
+  "TARGET_SVE && !BYTES_BIG_ENDIAN"
+{
+  operands[3] = aarch64_ptrue_reg (<VPRED>mode);
+  operands[4] = gen_int_mode (SVE_RELAXED_GP, SImode);
+  operands[5] = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn (
+    gen_aarch64_pred_fcmla<sve_rot1><mode> (tmp, operands[3], operands[1],
+					    operands[2], operands[5], operands[4]));
+  emit_insn (
+    gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], operands[3], operands[1],
+					    operands[2], tmp,
+					    operands[4]));
+  DONE;
+})
+
 ;; Predicated FCMLA with merging.
 (define_expand "@cond_<optab><mode>"
   [(set (match_operand:SVE_FULL_F 0 "register_operand")
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 772c35079c9441448534471fba4dba622322b8fc..58594f985e5a98a188f32d96c6f71c9f4e0a6f05 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -1799,6 +1799,16 @@ (define_insn "@aarch64_sve_<optab><mode>"
   [(set_attr "movprfx" "*,yes")]
 )
 
+;; Unpredicated optab pattern for the auto-vectorizer.
+(define_expand "cadd<rot><mode>3"
+  [(set (match_operand:SVE_FULL_I 0 "register_operand")
+	(unspec:SVE_FULL_I
+	  [(match_operand:SVE_FULL_I 1 "register_operand")
+	   (match_operand:SVE_FULL_I 2 "register_operand")]
+	  SVE2_INT_CADD_OP))]
+  "TARGET_SVE2 && !BYTES_BIG_ENDIAN"
+)
+
 ;; -------------------------------------------------------------------------
 ;; ---- [INT] Complex ternary operations
 ;; -------------------------------------------------------------------------
@@ -1838,6 +1848,49 @@ (define_insn "@aarch64_<optab>_lane_<mode>"
   [(set_attr "movprfx" "*,yes")]
 )
 
+;; Unpredicated optab pattern for the auto-vectorizer.
+;; The complex mla/mls operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cml<fcmac1><rot_op><mode>4"
+  [(set (match_operand:SVE_FULL_I 0 "register_operand")
+	(plus:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand")
+	  (unspec:SVE_FULL_I
+	    [(match_operand:SVE_FULL_I 2 "register_operand")
+	     (match_operand:SVE_FULL_I 3 "register_operand")]
+	    SVE2_INT_CMLA_OP)))]
+  "TARGET_SVE2 && !BYTES_BIG_ENDIAN"
+{
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, operands[1],
+						   operands[2], operands[3]));
+  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
+						   operands[2], operands[3]));
+  DONE;
+})
+
+;; Unpredicated optab pattern for the auto-vectorizer.
+;; The complex mul operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cmul<rot_op><mode>3"
+  [(set (match_operand:SVE_FULL_I 0 "register_operand")
+	(unspec:SVE_FULL_I
+	  [(match_operand:SVE_FULL_I 1 "register_operand")
+	   (match_operand:SVE_FULL_I 2 "register_operand")
+	   (match_dup 3)]
+	  SVE2_INT_CMUL_OP))]
+  "TARGET_SVE2 && !BYTES_BIG_ENDIAN"
+{
+  operands[3] = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, operands[3],
+						   operands[1], operands[2]));
+  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
+						   operands[1], operands[2]));
+  DONE;
+})
+
 ;; -------------------------------------------------------------------------
 ;; ---- [INT] Complex dot product
 ;; -------------------------------------------------------------------------
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index fb1426b7752890848cb49722ef7442d96cb1408b..dd88e63f4e3a60ffe0d0276f13c6068161511cb9 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -182,6 +182,11 @@ (define_mode_iterator V2F [V2SF V2DF])
 ;; All Advanced SIMD modes on which we support any arithmetic operations.
 (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF V2DF])
 
+;; All Advanced SIMD modes suitable for performing arithmetics.
+(define_mode_iterator VALL_ARITH [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
+				  (V4HF "TARGET_SIMD_F16INST") (V8HF "TARGET_SIMD_F16INST")
+				  V2SF V4SF V2DF])
+
 ;; All Advanced SIMD modes suitable for moving, loading, and storing.
 (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
 				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
@@ -708,6 +713,10 @@ (define_c_enum "unspec"
     UNSPEC_FCMLA90	; Used in aarch64-simd.md.
     UNSPEC_FCMLA180	; Used in aarch64-simd.md.
     UNSPEC_FCMLA270	; Used in aarch64-simd.md.
+    UNSPEC_FCMUL	; Used in aarch64-simd.md.
+    UNSPEC_FCMUL180	; Used in aarch64-simd.md.
+    UNSPEC_FCMLS	; Used in aarch64-simd.md.
+    UNSPEC_FCMLS180	; Used in aarch64-simd.md.
     UNSPEC_ASRD		; Used in aarch64-sve.md.
     UNSPEC_ADCLB	; Used in aarch64-sve2.md.
     UNSPEC_ADCLT	; Used in aarch64-sve2.md.
@@ -726,6 +735,10 @@ (define_c_enum "unspec"
     UNSPEC_CMLA180	; Used in aarch64-sve2.md.
     UNSPEC_CMLA270	; Used in aarch64-sve2.md.
     UNSPEC_CMLA90	; Used in aarch64-sve2.md.
+    UNSPEC_CMLS		; Used in aarch64-sve2.md.
+    UNSPEC_CMLS180	; Used in aarch64-sve2.md.
+    UNSPEC_CMUL		; Used in aarch64-sve2.md.
+    UNSPEC_CMUL180	; Used in aarch64-sve2.md.
     UNSPEC_COND_FCVTLT	; Used in aarch64-sve2.md.
     UNSPEC_COND_FCVTNT	; Used in aarch64-sve2.md.
     UNSPEC_COND_FCVTX	; Used in aarch64-sve2.md.
@@ -2598,6 +2611,23 @@ (define_int_iterator SVE2_INT_CMLA [UNSPEC_CMLA
 				    UNSPEC_SQRDCMLAH180
 				    UNSPEC_SQRDCMLAH270])
 
+;; Unlike the normal CMLA instructions these represent the actual operation
+;; to be performed.  They will always need to be expanded into a sequence
+;; consisting of multiple CMLA instructions.
+(define_int_iterator SVE2_INT_CMLA_OP [UNSPEC_CMLA
+				       UNSPEC_CMLA180
+				       UNSPEC_CMLS])
+
+;; Unlike the normal CMLA instructions these represent the actual operation
+;; to be performed.  They will always need to be expanded into a sequence
+;; consisting of multiple CMLA instructions.
+(define_int_iterator SVE2_INT_CMUL_OP [UNSPEC_CMUL
+				       UNSPEC_CMUL180])
+
+;; Same as SVE2_INT_CADD but excluding the saturating instructions.
+(define_int_iterator SVE2_INT_CADD_OP [UNSPEC_CADD90
+				       UNSPEC_CADD270])
+
 (define_int_iterator SVE2_INT_CDOT [UNSPEC_CDOT
 				    UNSPEC_CDOT90
 				    UNSPEC_CDOT180
@@ -2708,6 +2738,14 @@ (define_int_iterator FMMLA [UNSPEC_FMMLA])
 (define_int_iterator BF_MLA [UNSPEC_BFMLALB
 			     UNSPEC_BFMLALT])
 
+(define_int_iterator FCMLA_OP [UNSPEC_FCMLA
+			       UNSPEC_FCMLA180
+			       UNSPEC_FCMLS
+			       UNSPEC_FCMLS180])
+
+(define_int_iterator FCMUL_OP [UNSPEC_FCMUL
+			       UNSPEC_FCMUL180])
+
 ;; Iterators for atomic operations.
 
 (define_int_iterator ATOMIC_LDOP
@@ -3403,6 +3441,7 @@ (define_int_attr rot [(UNSPEC_CADD90 "90")
 		      (UNSPEC_CMLA270 "270")
 		      (UNSPEC_FCADD90 "90")
 		      (UNSPEC_FCADD270 "270")
+		      (UNSPEC_FCMLS "0")
 		      (UNSPEC_FCMLA "0")
 		      (UNSPEC_FCMLA90 "90")
 		      (UNSPEC_FCMLA180 "180")
@@ -3418,7 +3457,85 @@ (define_int_attr rot [(UNSPEC_CADD90 "90")
 		      (UNSPEC_COND_FCMLA "0")
 		      (UNSPEC_COND_FCMLA90 "90")
 		      (UNSPEC_COND_FCMLA180 "180")
-		      (UNSPEC_COND_FCMLA270 "270")])
+		      (UNSPEC_COND_FCMLA270 "270")
+		      (UNSPEC_FCMUL "0")
+		      (UNSPEC_FCMUL180 "180")])
+
+;; A conjugate is a negation of the imaginary component.
+;; The numbers in the unspecs are the rotation component of the instruction,
+;; e.g. FCMLS180 means use the instruction with #180.
+;; The iterator is used to produce the right name mangling for the function.
+;;
+;; The rotation value does not directly correlate to a rotation along the
+;; Argand plane, as the instructions only perform half the computation.
+;;
+;; For the implementation we treat any rotation by 0 as normal and 180 as
+;; conjugate.  This is only for implementing the vectorizer patterns.
+(define_int_attr rot_op [(UNSPEC_FCMLS "")
+			 (UNSPEC_FCMLS180 "_conj")
+			 (UNSPEC_FCMLA "")
+			 (UNSPEC_FCMLA180 "_conj")
+			 (UNSPEC_FCMUL "")
+			 (UNSPEC_FCMUL180 "_conj")
+			 (UNSPEC_CMLS "")
+			 (UNSPEC_CMLA "")
+			 (UNSPEC_CMLA180 "_conj")
+			 (UNSPEC_CMUL "")
+			 (UNSPEC_CMUL180 "_conj")])
+
+;; When performed on a real complex number, the complex operations require
+;; two instructions to perform the operation; e.g. complex multiplication
+;; requires two FCMLA instructions with particular rotation values.
+;;
+;; These values can be looked up in rotsplit1 and rotsplit2.  As an example,
+;; FCMUL needs the first instruction to use #0 and the second #90.
+(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
+			    (UNSPEC_FCMLA180 "0")
+			    (UNSPEC_FCMUL "0")
+			    (UNSPEC_FCMUL180 "0")
+			    (UNSPEC_FCMLS "270")
+			    (UNSPEC_FCMLS180 "90")])
+
+(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
+			    (UNSPEC_FCMLA180 "270")
+			    (UNSPEC_FCMUL "90")
+			    (UNSPEC_FCMUL180 "270")
+			    (UNSPEC_FCMLS "180")
+			    (UNSPEC_FCMLS180 "180")])
+
+;; SVE has slightly different naming from NEON, so we have to split these
+;; iterators.
+(define_int_attr sve_rot1 [(UNSPEC_FCMLA "")
+			   (UNSPEC_FCMLA180 "")
+			   (UNSPEC_FCMUL "")
+			   (UNSPEC_FCMUL180 "")
+			   (UNSPEC_FCMLS "270")
+			   (UNSPEC_FCMLS180 "90")
+			   (UNSPEC_CMLA "")
+			   (UNSPEC_CMLA180 "")
+			   (UNSPEC_CMUL "")
+			   (UNSPEC_CMUL180 "")
+			   (UNSPEC_CMLS "270")
+			   (UNSPEC_CMLS180 "90")])
+
+(define_int_attr sve_rot2 [(UNSPEC_FCMLA "90")
+			   (UNSPEC_FCMLA180 "270")
+			   (UNSPEC_FCMUL "90")
+			   (UNSPEC_FCMUL180 "270")
+			   (UNSPEC_FCMLS "180")
+			   (UNSPEC_FCMLS180 "180")
+			   (UNSPEC_CMLA "90")
+			   (UNSPEC_CMLA180 "270")
+			   (UNSPEC_CMUL "90")
+			   (UNSPEC_CMUL180 "270")
+			   (UNSPEC_CMLS "180")
+			   (UNSPEC_CMLS180 "180")])
+
+
+(define_int_attr fcmac1 [(UNSPEC_FCMLA "a") (UNSPEC_FCMLA180 "a")
+			 (UNSPEC_FCMLS "s") (UNSPEC_FCMLS180 "s")
+			 (UNSPEC_CMLA "a") (UNSPEC_CMLA180 "a")
+			 (UNSPEC_CMLS "s") (UNSPEC_CMLS180 "s")])
 
 (define_int_attr sve_fmla_op [(UNSPEC_COND_FMLA "fmla")
 			      (UNSPEC_COND_FMLS "fmls")
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 592af35f038f48b5f4ac622a0ed944ffc2a140f2..43e1ebd87cf69e716474bb6ee9bcdd405523d8da 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -712,7 +712,7 @@ (define_mode_attr v_cmp_result [(V8QI "v8qi") (V16QI "v16qi")
 				(DI   "di")   (V2DI  "v2di")
 				(V2SF "v2si") (V4SF  "v4si")])
 
-;; Get element type from double-width mode, for operations where we 
+;; Get element type from double-width mode, for operations where we
 ;; don't care about signedness.
 (define_mode_attr V_if_elem [(V8QI "i8")  (V16QI "i8")
 			     (V4HI "i16") (V8HI  "i16")
@@ -1180,7 +1180,49 @@ (define_int_attr rot [(UNSPEC_VCADD90 "90")
 		      (UNSPEC_VCMLA "0")
 		      (UNSPEC_VCMLA90 "90")
 		      (UNSPEC_VCMLA180 "180")
-		      (UNSPEC_VCMLA270 "270")])
+		      (UNSPEC_VCMLA270 "270")
+		      (UNSPEC_VCMUL "0")
+		      (UNSPEC_VCMUL180 "180")])
+
+;; A conjugate is a negation of the imaginary component.
+;; The numbers in the unspecs are the rotation component of the instruction,
+;; e.g. FCMLS180 means use the instruction with #180.
+;; The iterator is used to produce the right name mangling for the function.
+;;
+;; The rotation value does not directly correlate to a rotation along the
+;; Argand plane, as the instructions only perform half the computation.
+;;
+;; For the implementation we treat any rotation by 0 as normal and 180 as
+;; conjugate.  This is only for implementing the vectorizer patterns.
+(define_int_attr rot_op [(UNSPEC_VCMLS "")
+			 (UNSPEC_VCMLS180 "_conj")
+			 (UNSPEC_VCMLA "")
+			 (UNSPEC_VCMLA180 "_conj")
+			 (UNSPEC_VCMUL "")
+			 (UNSPEC_VCMUL180 "_conj")])
+
+;; When performed on a real complex number, the complex operations require
+;; two instructions to perform the operation; e.g. complex multiplication
+;; requires two VCMLA instructions with particular rotation values.
+;;
+;; These values can be looked up in rotsplit1 and rotsplit2.  As an example,
+;; FCMUL needs the first instruction to use #0 and the second #90.
+(define_int_attr rotsplit1 [(UNSPEC_VCMLA "0")
+			    (UNSPEC_VCMLA180 "0")
+			    (UNSPEC_VCMUL "0")
+			    (UNSPEC_VCMUL180 "0")
+			    (UNSPEC_VCMLS "270")
+			    (UNSPEC_VCMLS180 "90")])
+
+(define_int_attr rotsplit2 [(UNSPEC_VCMLA "90")
+			    (UNSPEC_VCMLA180 "270")
+			    (UNSPEC_VCMUL "90")
+			    (UNSPEC_VCMUL180 "270")
+			    (UNSPEC_VCMLS "180")
+			    (UNSPEC_VCMLS180 "180")])
+
+(define_int_attr fcmac1 [(UNSPEC_VCMLA "a") (UNSPEC_VCMLA180 "a")
+			 (UNSPEC_VCMLS "s") (UNSPEC_VCMLS180 "s")])
 
 (define_int_attr simd32_op [(UNSPEC_QADD8 "qadd8") (UNSPEC_QSUB8 "qsub8")
 			    (UNSPEC_SHADD8 "shadd8") (UNSPEC_SHSUB8 "shsub8")


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.
  2020-12-10 16:59 [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA Tamar Christina
@ 2020-12-14 11:01 ` Kyrylo Tkachov
  2020-12-14 13:30 ` Richard Sandiford
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Kyrylo Tkachov @ 2020-12-14 11:01 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: 10 December 2020 17:00
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for
> Complex Addition, Multiply and FMA.
> 
> Hi All,
> 
> This adds implementations of the optabs for complex operations.  With this,
> the
> following C code:
> 
>   void f90 (float complex a[restrict N], float complex b[restrict N],
> 	    float complex c[restrict N])
>   {
>     for (int i=0; i < N; i++)
>       c[i] = a[i] + (b[i] * I);
>   }
> 
> generates
> 
>   f90:
> 	  mov     x3, 0
> 	  .p2align 3,,7
>   .L2:
> 	  ldr     q0, [x0, x3]
> 	  ldr     q1, [x1, x3]
> 	  fcadd   v0.4s, v0.4s, v1.4s, #90
> 	  str     q0, [x2, x3]
> 	  add     x3, x3, 16
> 	  cmp     x3, 1600
> 	  bne     .L2
> 	  ret
> 
> instead of
> 
>   f90:
> 	  add     x3, x1, 1600
> 	  .p2align 3,,7
>   .L2:
> 	  ld2     {v4.4s - v5.4s}, [x0], 32
> 	  ld2     {v2.4s - v3.4s}, [x1], 32
> 	  fsub    v0.4s, v4.4s, v3.4s
> 	  fadd    v1.4s, v5.4s, v2.4s
> 	  st2     {v0.4s - v1.4s}, [x2], 32
> 	  cmp     x3, x1
> 	  bne     .L2
> 	  ret
> 
> It defines a new iterator VALL_ARITH which contains the types for which we
> can do general arithmetic (it excludes bfloat16).
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> Checked with armv8-a+sve2+fp16 and no issues.  Note that due to a mid-end
> limitation, SLP for SVE currently fails for some permutes.  The tests have
> these marked as XFAIL.  I do intend to fix this soon.
> 
> Matching tests for these are in the mid-end patches.
> Note that the mid-end patches are still being respun and I may need to
> change the order of some parameters, but no other change is expected, and I
> would like to decrease the size of future patches.  As such:
> 
> Ok for master?

Ok. The patterns look clean to me from a back-end perspective.
Thanks,
Kyrill

> 
> Thanks,
> Tamar
> 
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md (cadd<rot><mode>3,
> 	cml<fcmac1><rot_op><mode>4, cmul<rot_op><mode>3): New.
> 	* config/aarch64/iterators.md (VALL_ARITH, UNSPEC_FCMUL,
> 	UNSPEC_FCMUL180, UNSPEC_FCMLS, UNSPEC_FCMLS180,
> UNSPEC_CMLS,
> 	UNSPEC_CMLS180, UNSPEC_CMUL, UNSPEC_CMUL180, FCMLA_OP,
> FCMUL_OP, rot_op,
> 	rotsplit1, rotsplit2, fcmac1, sve_rot1, sve_rot2, SVE2_INT_CMLA_OP,
> 	SVE2_INT_CMUL_OP, SVE2_INT_CADD_OP): New.
> 	(rot): Add UNSPEC_FCMLS, UNSPEC_FCMUL, UNSPEC_FCMUL180.
> 	* config/aarch64/aarch64-sve.md (cadd<rot><mode>3,
> 	cml<fcmac1><rot_op><mode>4, cmul<rot_op><mode>3): New.
> 	* config/aarch64/aarch64-sve2.md (cadd<rot><mode>3,
> 	cml<fcmac1><rot_op><mode>4, cmul<rot_op><mode>3): New.
> 
> --

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.
  2020-12-10 16:59 [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA Tamar Christina
  2020-12-14 11:01 ` Kyrylo Tkachov
@ 2020-12-14 13:30 ` Richard Sandiford
  2020-12-14 15:50   ` Tamar Christina
  2021-01-15 15:30 ` [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Multiply, FMS " Tamar Christina
  2021-01-15 17:31 ` Richard Sandiford
  3 siblings, 1 reply; 9+ messages in thread
From: Richard Sandiford @ 2020-12-14 13:30 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov

Rearranging slightly…

> @@ -708,6 +713,10 @@ (define_c_enum "unspec"
>      UNSPEC_FCMLA90	; Used in aarch64-simd.md.
>      UNSPEC_FCMLA180	; Used in aarch64-simd.md.
>      UNSPEC_FCMLA270	; Used in aarch64-simd.md.
> +    UNSPEC_FCMUL	; Used in aarch64-simd.md.
> +    UNSPEC_FCMUL180	; Used in aarch64-simd.md.
> +    UNSPEC_FCMLS	; Used in aarch64-simd.md.
> +    UNSPEC_FCMLS180	; Used in aarch64-simd.md.
>      UNSPEC_ASRD		; Used in aarch64-sve.md.
>      UNSPEC_ADCLB	; Used in aarch64-sve2.md.
>      UNSPEC_ADCLT	; Used in aarch64-sve2.md.
> @@ -726,6 +735,10 @@ (define_c_enum "unspec"
>      UNSPEC_CMLA180	; Used in aarch64-sve2.md.
>      UNSPEC_CMLA270	; Used in aarch64-sve2.md.
>      UNSPEC_CMLA90	; Used in aarch64-sve2.md.
> +    UNSPEC_CMLS		; Used in aarch64-sve2.md.
> +    UNSPEC_CMLS180	; Used in aarch64-sve2.md.
> +    UNSPEC_CMUL		; Used in aarch64-sve2.md.
> +    UNSPEC_CMUL180	; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTLT	; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTNT	; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTX	; Used in aarch64-sve2.md.
> […]
> @@ -3418,7 +3457,85 @@ (define_int_attr rot [(UNSPEC_CADD90 "90")
>  		      (UNSPEC_COND_FCMLA "0")
>  		      (UNSPEC_COND_FCMLA90 "90")
>  		      (UNSPEC_COND_FCMLA180 "180")
> -		      (UNSPEC_COND_FCMLA270 "270")])
> +		      (UNSPEC_COND_FCMLA270 "270")
> +		      (UNSPEC_FCMUL "0")
> +		      (UNSPEC_FCMUL180 "180")])
> +
> +;; A conjucate is a negation of the imaginary component
> +;; The number in the inspecs are the rotation component of the instruction, e.g

unspecs

> +;; FCMLS180 means use the instruction with #180.
> +;; The iterator is used to produce the right name mangling for the function.
> +;;
> +;; The rotation value does not directly correlate to a rotation along the argant
> +;; plane as the instructions only perform half the computation.
> +;;
> +;; For the implementation we threat any rotation by 0 as normal and 180 as
> +;; conjucate.  This is only for implementing the vectorizer patterns.
> +(define_int_attr rot_op [(UNSPEC_FCMLS "")
> +			 (UNSPEC_FCMLS180 "_conj")
> +			 (UNSPEC_FCMLA "")
> +			 (UNSPEC_FCMLA180 "_conj")
> +			 (UNSPEC_FCMUL "")
> +			 (UNSPEC_FCMUL180 "_conj")
> +			 (UNSPEC_CMLS "")
> +			 (UNSPEC_CMLA "")
> +			 (UNSPEC_CMLA180 "_conj")
> +			 (UNSPEC_CMUL "")
> +			 (UNSPEC_CMUL180 "_conj")])
> +

Realise this is being awkward, sorry, but: I think it would be clearer
to use unspecs with _CONJ in the name for the conjugated forms, rather
than explain away how 180 is being used.

All four rotations make sense as a pair (i.e. as a full complex
multiplication), rather than just as a standalone instruction:

  FMLA    = a + b * c
  FMLA90  = a + b * c * i
  FMLA180 = a - b * c
  FMLA270 = a - b * c * i

I.e. FMLA180 == FMLS and FMLA270 == “FMLS90” (a made-up term).
“FMLS180” is equivalent to FMLA.

We can then additionally conjugate “b” in each of the four forms above.

At the moment, FMLA90 and FMLA270/FMLS90 (and their conjugate forms)
aren't exposed as optabs, but that could change in future.

So I think we should avoid introducing FMLS and FMLS180 and instead use
(name in patch -> suggested replacement):

  FMLA -> FMLA
  FMLS -> FMLA180
  FMLA180 -> FMLA_CONJ
  FMLS180 -> FMLA180_CONJ

where _CONJ has the effect of adding 180 degrees to the second
rotation amount.

Then:

> +;; The complex operations when performed on a real complex number require two
> +;; instructions to perform the operation. e.g. complex multiplication requires
> +;; two FCMUL with a particular rotation value.
> +;;
> +;; These values can be looked up in rotsplit1 and rotsplit2.  as an example
> +;; FCMUL needs the first instruction to use #0 and the second #90.
> +(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
> +			    (UNSPEC_FCMLA180 "0")
> +			    (UNSPEC_FCMUL "0")
> +			    (UNSPEC_FCMUL180 "0")
> +			    (UNSPEC_FCMLS "270")
> +			    (UNSPEC_FCMLS180 "90")])
> +
> +(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
> +			    (UNSPEC_FCMLA180 "270")
> +			    (UNSPEC_FCMUL "90")
> +			    (UNSPEC_FCMUL180 "270")
> +			    (UNSPEC_FCMLS "180")
> +			    (UNSPEC_FCMLS180 "180")])

would become something like:

(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
			    (UNSPEC_FCMLA_CONJ "0")
			    (UNSPEC_FCMLA180 "180")
			    (UNSPEC_FCMLA180_CONJ "180")
			    (UNSPEC_FCMUL "0")
			    (UNSPEC_FCMUL_CONJ "0")])

(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
			    (UNSPEC_FCMLA_CONJ "270")
			    (UNSPEC_FCMLA180 "270")
			    (UNSPEC_FCMLA180_CONJ "90")
			    (UNSPEC_FCMUL "90")
			    (UNSPEC_FCMUL_CONJ "270")])

IMO the mapping between unspec and rotation is then more obvious:
adding _CONJ adds 180 to rotsplit2, but doesn't change rotsplit1.

(The patch uses 270 as the first rotation for FMLS and 180 as the second
rotation, instead of the other way around.  That would be OK too if it's
better for some reason, but I think it would be worth a comment if so.)
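
To make the mapping concrete, here is a scalar model of the four FCMLA
rotations (my sketch of the instruction behaviour described in the Arm
ARM; acc accumulates the partial products):

  typedef struct { float re, im; } cplx;

  static cplx
  fcmla (cplx acc, cplx a, cplx b, int rot)
  {
    switch (rot)
      {
      case 0:   acc.re += a.re * b.re; acc.im += a.re * b.im; break;
      case 90:  acc.re -= a.im * b.im; acc.im += a.im * b.re; break;
      case 180: acc.re -= a.re * b.re; acc.im -= a.re * b.im; break;
      case 270: acc.re += a.im * b.im; acc.im -= a.im * b.re; break;
      }
    return acc;
  }

Chaining rotations (#0, #90) accumulates a * b, (#180, #270) accumulates
-(a * b), and (#0, #270) accumulates the product with one multiplicand
conjugated, which is why conjugation only ever changes the second
rotation of the pair.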

> […]
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mla/mls operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cml<fcmac1><rot_op><mode>4"
> +  [(set (match_operand:SVE_FULL_F 0 "register_operand")
> +	(unspec:SVE_FULL_F
> +	  [(match_dup 4)
> +	   (match_dup 5)
> +	   (match_operand:SVE_FULL_F 1 "register_operand")
> +	   (match_operand:SVE_FULL_F 2 "register_operand")
> +	   (match_operand:SVE_FULL_F 3 "register_operand")]
> +	  FCMLA_OP))]
> +  "TARGET_SVE && !BYTES_BIG_ENDIAN"
> +{
> +  operands[4] = aarch64_ptrue_reg (<VPRED>mode);
> +  operands[5] = gen_int_mode (SVE_RELAXED_GP, SImode);
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (
> +    gen_aarch64_pred_fcmla<sve_rot1><mode> (tmp, operands[4],
> +					    operands[1], operands[2],
> +					    operands[3], operands[5]));
> +  emit_insn (
> +    gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], operands[4],
> +					    tmp, operands[2],
> +					    operands[3], operands[5]));

Very minor, sorry, but: should be no “(”s at the end of a line:

  emit_insn
    (gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], operands[4],
					     tmp, operands[2],
					     operands[3], operands[5]));

Same for the rest of the patch.

Are you sure the operands above are the right way round though?
The pattern is:

;; Predicated FCMLA.
(define_insn "@aarch64_pred_<optab><mode>"
  [(set (match_operand:SVE_FULL_F 0 "register_operand" "=w, ?&w")
	(unspec:SVE_FULL_F
	  [(match_operand:<VPRED> 1 "register_operand" "Upl, Upl")
	   (match_operand:SI 5 "aarch64_sve_gp_strictness")
	   (match_operand:SVE_FULL_F 2 "register_operand" "w, w")
	   (match_operand:SVE_FULL_F 3 "register_operand" "w, w")
	   (match_operand:SVE_FULL_F 4 "register_operand" "0, w")]
	  SVE_COND_FCMLA))]
  "TARGET_SVE"
  "@
   fcmla\t%0.<Vetype>, %1/m, %2.<Vetype>, %3.<Vetype>, #<rot>
   movprfx\t%0, %4\;fcmla\t%0.<Vetype>, %1/m, %2.<Vetype>, %3.<Vetype>, #<rot>"
  [(set_attr "movprfx" "*,yes")]
)

where operand 4 is the accumulator.  The code above seems to be passing
the accumulator as operand 2 instead.

I agree this is confusing.  The current SVE_COND_FCMLA order is the same
as the (fma …) rtx code (and thus the same as the FMLA patterns), which
seemed appropriate given that it really is a fused operation.

> +  DONE;
> +})
> +
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mul operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cmul<rot_op><mode>3"
> +  [(set (match_operand:SVE_FULL_F 0 "register_operand")
> +	(unspec:SVE_FULL_F
> +	  [(match_dup 3)
> +	   (match_dup 4)
> +	   (match_operand:SVE_FULL_F 1 "register_operand")
> +	   (match_operand:SVE_FULL_F 2 "register_operand")
> +	   (match_dup 5)]
> +	  FCMUL_OP))]
> +  "TARGET_SVE && !BYTES_BIG_ENDIAN"
> +{
> +  operands[3] = aarch64_ptrue_reg (<VPRED>mode);
> +  operands[4] = gen_int_mode (SVE_RELAXED_GP, SImode);
> +  operands[5] = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));

I can see that it makes sense to use (match_dup 4) and (match_dup 5)
in the FCMLA_OP expander above, so that the unspec has the same number
of operands as it does in “real” FCMLA patterns.  It doesn't matter in
practice, since the unspec is never generated, but I agree it's less
confusing.

But in this case, FCMUL_OP is a purely synthetic thing, so I think it
would be better to use local variables instead, like for the
Advanced SIMD patterns:

  rtx ptrue = aarch64_ptrue_reg (<VPRED>mode);
  rtx ptrue_hint = gen_int_mode (SVE_RELAXED_GP, SImode);
  rtx zero = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));

> […]
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mla/mls operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cml<fcmac1><rot_op><mode>4"
> +  [(set (match_operand:SVE_FULL_I 0 "register_operand")
> +	(plus:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand")
> +	  (unspec:SVE_FULL_I
> +	    [(match_operand:SVE_FULL_I 2 "register_operand")
> +	     (match_operand:SVE_FULL_I 3 "register_operand")]
> +	    SVE2_INT_CMLA_OP)))]

The SVE2_INT_CMLA_OP unspecs take three operands, with the accumulator
first this time (since it's not an fma).  So I think this should simply be:

(define_expand "cml<fcmac1><rot_op><mode>4"
  [(set (match_operand:SVE_FULL_I 0 "register_operand")
	(unspec:SVE_FULL_I
	  [(match_operand:SVE_FULL_I 1 "register_operand")
	   (match_operand:SVE_FULL_I 2 "register_operand")
	   (match_operand:SVE_FULL_I 3 "register_operand")]
	  SVE2_INT_CMLA_OP))]

> +  "TARGET_SVE2 && !BYTES_BIG_ENDIAN"
> +{
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, operands[1],
> +						   operands[2], operands[3]));
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
> +						   operands[2], operands[3]));
> +  DONE;
> +})
> +
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mul operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cmul<rot_op><mode>3"
> +  [(set (match_operand:SVE_FULL_I 0 "register_operand")
> +	(unspec:SVE_FULL_I
> +	  [(match_operand:SVE_FULL_I 1 "register_operand")
> +	   (match_operand:SVE_FULL_I 2 "register_operand")
> +	   (match_dup 3)]
> +	  SVE2_INT_CMUL_OP))]
> +  "TARGET_SVE2 && !BYTES_BIG_ENDIAN"
> +{
> +  operands[3] = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));

Here too I think it would be better to use a temporary variable instead
of a (match_dup …) for the “fake” CMUL operations.

> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, operands[3],
> +						   operands[1], operands[2]));
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
> +						   operands[1], operands[2]));
> +  DONE;
> +})

Thanks,
Richard

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.
  2020-12-14 13:30 ` Richard Sandiford
@ 2020-12-14 15:50   ` Tamar Christina
  2020-12-14 16:11     ` Richard Sandiford
  0 siblings, 1 reply; 9+ messages in thread
From: Tamar Christina @ 2020-12-14 15:50 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Hi Richard,

Do you object to me splitting off complex add and addressing your remaining feedback later, when the rewrite of mul and fma is done?

Thanks,
Tamar

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.
  2020-12-14 15:50   ` Tamar Christina
@ 2020-12-14 16:11     ` Richard Sandiford
  2020-12-15 19:59       ` Tamar Christina
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Sandiford @ 2020-12-14 16:11 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi Richard, 
>
> Do you object to me splitting off complex add and addressing your remaining
> feedback later, when the rewrite of mul and fma is done?

No, sounds good to me.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.
  2020-12-14 16:11     ` Richard Sandiford
@ 2020-12-15 19:59       ` Tamar Christina
  2020-12-16  9:02         ` Richard Sandiford
  0 siblings, 1 reply; 9+ messages in thread
From: Tamar Christina @ 2020-12-15 19:59 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

[-- Attachment #1: Type: text/plain, Size: 1020 bytes --]

Hi Richard,

Here's the split-off complex add.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
Checked with armv8-a+sve2+fp16 and no issues.  Note that due to a mid-end
limitation, SLP for SVE currently fails for some permutes.  The tests have
these marked as XFAIL.

Matching tests for these are in the mid-end patches.

Ok for master?

The rest will be respun when I have a working tree again and can test them
again.

Thanks,
Tamar


gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (cadd<rot><mode>3): New.
	* config/aarch64/iterators.md (SVE2_INT_CADD_OP): New.
	* config/aarch64/aarch64-sve.md (cadd<rot><mode>3): New.
	* config/aarch64/aarch64-sve2.md (cadd<rot><mode>3): New.

The 12/14/2020 16:11, Richard Sandiford wrote:
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi Richard, 
> >
> > Do you object to me splitting off complex add and addressing your remaining
> > feedback later, when the rewrite of mul and fma is done?
> 
> No, sounds good to me.
> 
> Thanks,
> Richard

-- 

[-- Attachment #2: pr13916.patch --]
[-- Type: text/x-diff, Size: 3266 bytes --]

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 68baf416045178b0ebcfeb8de2d201f625f1c317..05d18f8bd3ac09c56c82dc73cff855315eb302b7 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -449,6 +449,14 @@ (define_insn "aarch64_fcadd<rot><mode>"
   [(set_attr "type" "neon_fcadd")]
 )
 
+(define_expand "cadd<rot><mode>3"
+  [(set (match_operand:VHSDF 0 "register_operand")
+	(unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
+		       (match_operand:VHSDF 2 "register_operand")]
+		       FCADD))]
+  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
+)
+
 (define_insn "aarch64_fcmla<rot><mode>"
   [(set (match_operand:VHSDF 0 "register_operand" "=w")
 	(plus:VHSDF (match_operand:VHSDF 1 "register_operand" "0")
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 6359c40bdecda6c126bd70bef66561dd1da44dc9..6a5194f54f9c3d7b985890df523ceb4221b0cc63 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -5480,6 +5480,20 @@ (define_expand "@cond_<optab><mode>"
   "TARGET_SVE"
 )
 
+;; Predicated FCADD with a ptrue; unpredicated optab for the auto-vectorizer.
+(define_expand "@cadd<rot><mode>3"
+  [(set (match_operand:SVE_FULL_F 0 "register_operand")
+	(unspec:SVE_FULL_F
+	  [(match_dup 3)
+	   (const_int SVE_RELAXED_GP)
+	   (match_operand:SVE_FULL_F 1 "register_operand")
+	   (match_operand:SVE_FULL_F 2 "register_operand")]
+	  SVE_COND_FCADD))]
+  "TARGET_SVE"
+{
+  operands[3] = aarch64_ptrue_reg (<VPRED>mode);
+})
+
 ;; Predicated FCADD, merging with the first input.
 (define_insn_and_rewrite "*cond_<optab><mode>_2_relaxed"
   [(set (match_operand:SVE_FULL_F 0 "register_operand" "=w, ?&w")
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 772c35079c9441448534471fba4dba622322b8fc..1897ddf69c34496247e40ba36d97b93a9a9d7670 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -1799,6 +1799,16 @@ (define_insn "@aarch64_sve_<optab><mode>"
   [(set_attr "movprfx" "*,yes")]
 )
 
+;; Unpredicated optab pattern for the auto-vectorizer.
+(define_expand "cadd<rot><mode>3"
+  [(set (match_operand:SVE_FULL_I 0 "register_operand")
+	(unspec:SVE_FULL_I
+	  [(match_operand:SVE_FULL_I 1 "register_operand")
+	   (match_operand:SVE_FULL_I 2 "register_operand")]
+	  SVE2_INT_CADD_OP))]
+  "TARGET_SVE2"
+)
+
 ;; -------------------------------------------------------------------------
 ;; ---- [INT] Complex ternary operations
 ;; -------------------------------------------------------------------------
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index fb1426b7752890848cb49722ef7442d96cb1408b..b8ee4220603ce6da4e8044f61042d98a7e60b17d 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -2598,6 +2598,10 @@ (define_int_iterator SVE2_INT_CMLA [UNSPEC_CMLA
 				    UNSPEC_SQRDCMLAH180
 				    UNSPEC_SQRDCMLAH270])
 
+;; Same as SVE2_INT_CADD but excluding the saturating instructions.
+(define_int_iterator SVE2_INT_CADD_OP [UNSPEC_CADD90
+				       UNSPEC_CADD270])
+
 (define_int_iterator SVE2_INT_CDOT [UNSPEC_CDOT
 				    UNSPEC_CDOT90
 				    UNSPEC_CDOT180


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA.
  2020-12-15 19:59       ` Tamar Christina
@ 2020-12-16  9:02         ` Richard Sandiford
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Sandiford @ 2020-12-16  9:02 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi Richard,
>
> Here's the split off complex add.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> Checked with armv8-a+sve2+fp16 and no issues.  Note that due to a mid-end
> limitation, SLP for SVE currently fails for some permutes.  The tests have these
> marked as XFAIL.
>
> Matching tests for these are in the mid-end patches.
>
> Ok for master?

OK, thanks.

Richard


* [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Multiply, FMS and FMA.
  2020-12-10 16:59 [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA Tamar Christina
  2020-12-14 11:01 ` Kyrylo Tkachov
  2020-12-14 13:30 ` Richard Sandiford
@ 2021-01-15 15:30 ` Tamar Christina
  2021-01-15 17:31 ` Richard Sandiford
  3 siblings, 0 replies; 9+ messages in thread
From: Tamar Christina @ 2021-01-15 15:30 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 16619 bytes --]

Hi All,

This adds implementation for the optabs for complex operations.  With this the
following C code:

  void g (float complex a[restrict N], float complex b[restrict N],
	  float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] =  a[i] * b[i];
  }

generates


NEON:

g:
        movi    v3.4s, 0
        mov     x3, 0
        .p2align 3,,7
.L2:
        mov     v0.16b, v3.16b
        ldr     q2, [x1, x3]
        ldr     q1, [x0, x3]
        fcmla   v0.4s, v1.4s, v2.4s, #0
        fcmla   v0.4s, v1.4s, v2.4s, #90
        str     q0, [x2, x3]
        add     x3, x3, 16
        cmp     x3, 1600
        bne     .L2
        ret

SVE:

g:
        mov     x3, 0
        mov     x4, 400
        ptrue   p1.b, all
        whilelo p0.s, xzr, x4
        mov     z3.s, #0
        .p2align 3,,7
.L2:
        ld1w    z1.s, p0/z, [x0, x3, lsl 2]
        ld1w    z2.s, p0/z, [x1, x3, lsl 2]
        movprfx z0, z3
        fcmla   z0.s, p1/m, z1.s, z2.s, #0
        fcmla   z0.s, p1/m, z1.s, z2.s, #90
        st1w    z0.s, p0, [x2, x3, lsl 2]
        incw    x3
        whilelo p0.s, x3, x4
        b.any   .L2
        ret

SVE2 (with int instead of float):

g:
        mov     x3, 0
        mov     x4, 400
        mov     z3.b, #0
        whilelo p0.s, xzr, x4
        .p2align 3,,7
.L2:
        ld1w    z1.s, p0/z, [x0, x3, lsl 2]
        ld1w    z2.s, p0/z, [x1, x3, lsl 2]
        movprfx z0, z3
        cmla    z0.s, z1.s, z2.s, #0
        cmla    z0.s, z1.s, z2.s, #90
        st1w    z0.s, p0, [x2, x3, lsl 2]
        incw    x3
        whilelo p0.s, x3, x4
        b.any   .L2
        ret
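
For reference, each fcmla/cmla pair above composes one full complex
multiply-accumulate.  A scalar C sketch of what the two rotations
contribute per complex element (illustrative only; the cf type and
helper are not part of the patch):

  typedef struct { float re, im; } cf;

  /* acc += a * b, split the way fcmla #0 followed by fcmla #90 does.  */
  static cf cmla_pair (cf acc, cf a, cf b)
  {
    /* #0 rotation: the terms using the real part of a.  */
    acc.re += a.re * b.re;
    acc.im += a.re * b.im;
    /* #90 rotation: the terms using the imaginary part of a.  */
    acc.re -= a.im * b.im;
    acc.im += a.im * b.re;
    return acc;
  }

The multiply-only g above is this sequence started from a zero
accumulator, which is why each loop materializes a zeroed register
(v3/z3) outside the loop.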


It defines a new iterator VALL_ARITH which contains the types for which we can
do general arithmetic (excluding bfloat16).

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Checked with armv8-a+sve2+fp16 and no issues.  Note that due to a mid-end
limitation SLP for SVE currently fails for some permutes.  The tests have these
marked as XFAIL.  I do intend to fix this soon.

Execution tests verified with QEMU.

Matching tests for these are in the mid-end patches; I will enable the tests
for these patterns in a separate patch.

Ok for master?

Thanks,
Tamar


gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (cml<fcmac1><rot_op><mode>4,
	cmul<rot_op><mode>3): New.
	* config/aarch64/iterators.md (VALL_ARITH, UNSPEC_FCMUL,
	UNSPEC_FCMUL_CONJ, UNSPEC_FCMLA_CONJ, UNSPEC_FCMLA180_CONJ,
	UNSPEC_CMLA_CONJ, UNSPEC_CMLA180_CONJ, UNSPEC_CMUL, UNSPEC_CMUL_CONJ,
	FCMLA_OP, FCMUL_OP, rot_op, rotsplit1, rotsplit2, fcmac1, sve_rot1,
	sve_rot2, SVE2_INT_CMLA_OP, SVE2_INT_CMUL_OP, SVE2_INT_CADD_OP): New.
	(rot): Add UNSPEC_FCMUL, UNSPEC_FCMUL_CONJ.
	* config/aarch64/aarch64-sve.md (cml<fcmac1><rot_op><mode>4,
	cmul<rot_op><mode>3): New.
	* config/aarch64/aarch64-sve2.md (cml<fcmac1><rot_op><mode>4,
	cmul<rot_op><mode>3): New.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 4b869ded918fd91ffd41e6ba068239a752b331e5..8a5f1dad224a99a8ba30669139259922a1250d0e 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -516,6 +516,47 @@ (define_insn "aarch64_fcmlaq_lane<rot><mode>"
   [(set_attr "type" "neon_fcmla")]
 )
 
+;; The complex mla/mls operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cml<fcmac1><rot_op><mode>4"
+  [(set (match_operand:VHSDF 0 "register_operand")
+	(plus:VHSDF (match_operand:VHSDF 1 "register_operand")
+		    (unspec:VHSDF [(match_operand:VHSDF 2 "register_operand")
+				   (match_operand:VHSDF 3 "register_operand")]
+				   FCMLA_OP)))]
+  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
+{
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (tmp, operands[1],
+						 operands[3], operands[2]));
+  emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0], tmp,
+						 operands[3], operands[2]));
+  DONE;
+})
+
+;; The complex mul operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cmul<rot_op><mode>3"
+  [(set (match_operand:VHSDF 0 "register_operand")
+	(unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
+		       (match_operand:VHSDF 2 "register_operand")]
+		       FCMUL_OP))]
+  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
+{
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  rtx res1 = gen_reg_rtx (<MODE>mode);
+  emit_move_insn (tmp, CONST0_RTX (<MODE>mode));
+  emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (res1, tmp,
+						 operands[2], operands[1]));
+  emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0], res1,
+						 operands[2], operands[1]));
+  DONE;
+})
+
+
+
 ;; These instructions map to the __builtins for the Dot Product operations.
 (define_insn "aarch64_<sur>dot<vsi2qi>"
   [(set (match_operand:VS 0 "register_operand" "=w")
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index da15bd8788507feb12d52894c14e099370f34108..9dfe6a3f4512a20ba4f1e66a105ee0ae5d6949ea 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -7243,6 +7243,62 @@ (define_insn "@aarch64_pred_<optab><mode>"
   [(set_attr "movprfx" "*,yes")]
 )
 
+;; unpredicated optab pattern for auto-vectorizer
+;; The complex mla/mls operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cml<fcmac1><rot_op><mode>4"
+  [(set (match_operand:SVE_FULL_F 0 "register_operand")
+	(unspec:SVE_FULL_F
+	  [(match_dup 4)
+	   (match_dup 5)
+	   (match_operand:SVE_FULL_F 1 "register_operand")
+	   (match_operand:SVE_FULL_F 2 "register_operand")
+	   (match_operand:SVE_FULL_F 3 "register_operand")]
+	  FCMLA_OP))]
+  "TARGET_SVE"
+{
+  operands[4] = aarch64_ptrue_reg (<VPRED>mode);
+  operands[5] = gen_int_mode (SVE_RELAXED_GP, SImode);
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn
+    (gen_aarch64_pred_fcmla<sve_rot1><mode> (tmp, operands[4],
+					     operands[3], operands[2],
+					     operands[1], operands[5]));
+  emit_insn
+    (gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], operands[4],
+					     operands[3], operands[2],
+					     tmp, operands[5]));
+  DONE;
+})
+
+;; unpredicated optab pattern for auto-vectorizer
+;; The complex mul operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cmul<rot_op><mode>3"
+  [(set (match_operand:SVE_FULL_F 0 "register_operand")
+	(unspec:SVE_FULL_F
+	   [(match_operand:SVE_FULL_F 1 "register_operand")
+	    (match_operand:SVE_FULL_F 2 "register_operand")]
+	  FCMUL_OP))]
+  "TARGET_SVE"
+{
+  rtx pred_reg = aarch64_ptrue_reg (<VPRED>mode);
+  rtx gp_mode = gen_int_mode (SVE_RELAXED_GP, SImode);
+  rtx accum = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn
+    (gen_aarch64_pred_fcmla<sve_rot1><mode> (tmp, pred_reg,
+					     operands[2], operands[1],
+					     accum, gp_mode));
+  emit_insn
+    (gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], pred_reg,
+					     operands[2], operands[1],
+					     tmp, gp_mode));
+  DONE;
+})
+
 ;; Predicated FCMLA with merging.
 (define_expand "@cond_<optab><mode>"
   [(set (match_operand:SVE_FULL_F 0 "register_operand")
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 5cb9144da98af2d02b83043511a99b5723d7e8c0..b96708d03f4458726b32ec46c0078499e00b8549 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -1848,6 +1848,48 @@ (define_insn "@aarch64_<optab>_lane_<mode>"
   [(set_attr "movprfx" "*,yes")]
 )
 
+;; unpredicated optab pattern for auto-vectorizer
+;; The complex mla/mls operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cml<fcmac1><rot_op><mode>4"
+  [(set (match_operand:SVE_FULL_I 0 "register_operand")
+	(plus:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand")
+	  (unspec:SVE_FULL_I
+	    [(match_operand:SVE_FULL_I 2 "register_operand")
+	     (match_operand:SVE_FULL_I 3 "register_operand")]
+	    SVE2_INT_CMLA_OP)))]
+  "TARGET_SVE2"
+{
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, operands[1],
+						   operands[3], operands[2]));
+  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
+						   operands[3], operands[2]));
+  DONE;
+})
+
+;; unpredicated optab pattern for auto-vectorizer
+;; The complex mul operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "cmul<rot_op><mode>3"
+  [(set (match_operand:SVE_FULL_I 0 "register_operand")
+	(unspec:SVE_FULL_I
+	  [(match_operand:SVE_FULL_I 1 "register_operand")
+	   (match_operand:SVE_FULL_I 2 "register_operand")]
+	  SVE2_INT_CMUL_OP))]
+  "TARGET_SVE2"
+{
+  rtx accum = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, accum,
+						   operands[2], operands[1]));
+  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
+						   operands[2], operands[1]));
+  DONE;
+})
+
 ;; -------------------------------------------------------------------------
 ;; ---- [INT] Complex dot product
 ;; -------------------------------------------------------------------------
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index d42a70653edb266f2b76924b75a814db25f08f23..3f61fc8e380abd922d39973f40a966b7ce64fa40 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -182,6 +182,11 @@ (define_mode_iterator V2F [V2SF V2DF])
 ;; All Advanced SIMD modes on which we support any arithmetic operations.
 (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF V2DF])
 
+;; All Advanced SIMD modes suitable for performing arithmetic.
+(define_mode_iterator VALL_ARITH [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
+				  (V4HF "TARGET_SIMD_F16INST") (V8HF "TARGET_SIMD_F16INST")
+				  V2SF V4SF V2DF])
+
 ;; All Advanced SIMD modes suitable for moving, loading, and storing.
 (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
 				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
@@ -712,6 +717,10 @@ (define_c_enum "unspec"
     UNSPEC_FCMLA90	; Used in aarch64-simd.md.
     UNSPEC_FCMLA180	; Used in aarch64-simd.md.
     UNSPEC_FCMLA270	; Used in aarch64-simd.md.
+    UNSPEC_FCMUL	; Used in aarch64-simd.md.
+    UNSPEC_FCMUL_CONJ	; Used in aarch64-simd.md.
+    UNSPEC_FCMLA_CONJ	; Used in aarch64-simd.md.
+    UNSPEC_FCMLA180_CONJ	; Used in aarch64-simd.md.
     UNSPEC_ASRD		; Used in aarch64-sve.md.
     UNSPEC_ADCLB	; Used in aarch64-sve2.md.
     UNSPEC_ADCLT	; Used in aarch64-sve2.md.
@@ -730,6 +739,10 @@ (define_c_enum "unspec"
     UNSPEC_CMLA180	; Used in aarch64-sve2.md.
     UNSPEC_CMLA270	; Used in aarch64-sve2.md.
     UNSPEC_CMLA90	; Used in aarch64-sve2.md.
+    UNSPEC_CMLA_CONJ	; Used in aarch64-sve2.md.
+    UNSPEC_CMLA180_CONJ	; Used in aarch64-sve2.md.
+    UNSPEC_CMUL		; Used in aarch64-sve2.md.
+    UNSPEC_CMUL_CONJ	; Used in aarch64-sve2.md.
     UNSPEC_COND_FCVTLT	; Used in aarch64-sve2.md.
     UNSPEC_COND_FCVTNT	; Used in aarch64-sve2.md.
     UNSPEC_COND_FCVTX	; Used in aarch64-sve2.md.
@@ -1291,7 +1304,7 @@ (define_mode_attr Vwide [(V2SF "v2df") (V4HF "v4sf")
 
 ;; Widened mode register suffixes for VD_BHSI/VQW/VQ_HSF.
 (define_mode_attr Vwtype [(V8QI "8h") (V4HI "4s")
-			  (V2SI "2d") (V16QI "8h") 
+			  (V2SI "2d") (V16QI "8h")
 			  (V8HI "4s") (V4SI "2d")
 			  (V8HF "4s") (V4SF "2d")])
 
@@ -1313,7 +1326,7 @@ (define_mode_attr Vewtype [(VNx16QI "h")
 
 ;; Widened mode register suffixes for VDW/VQW.
 (define_mode_attr Vmwtype [(V8QI ".8h") (V4HI ".4s")
-			   (V2SI ".2d") (V16QI ".8h") 
+			   (V2SI ".2d") (V16QI ".8h")
 			   (V8HI ".4s") (V4SI ".2d")
 			   (V4HF ".4s") (V2SF ".2d")
 			   (SI   "")    (HI   "")])
@@ -2611,6 +2624,19 @@ (define_int_iterator SVE2_INT_CMLA [UNSPEC_CMLA
 				    UNSPEC_SQRDCMLAH180
 				    UNSPEC_SQRDCMLAH270])
 
+;; Unlike the normal CMLA instructions these represent the actual operation you
+;; to be performed.  They will always need to be expanded into multiple
+;; sequences consisting of CMLA.
+(define_int_iterator SVE2_INT_CMLA_OP [UNSPEC_CMLA
+				       UNSPEC_CMLA_CONJ
+				       UNSPEC_CMLA180])
+
+;; Unlike the normal CMLA instructions these represent the actual operation you
+;; to be performed.  They will always need to be expanded into multiple
+;; sequences consisting of CMLA.
+(define_int_iterator SVE2_INT_CMUL_OP [UNSPEC_CMUL
+				       UNSPEC_CMUL_CONJ])
+
+;; Same as SVE2_INT_CADD but excluding the saturating instructions
 (define_int_iterator SVE2_INT_CADD_OP [UNSPEC_CADD90
 				       UNSPEC_CADD270])
@@ -2725,6 +2751,14 @@ (define_int_iterator FMMLA [UNSPEC_FMMLA])
 (define_int_iterator BF_MLA [UNSPEC_BFMLALB
 			     UNSPEC_BFMLALT])
 
+(define_int_iterator FCMLA_OP [UNSPEC_FCMLA
+			       UNSPEC_FCMLA180
+			       UNSPEC_FCMLA_CONJ
+			       UNSPEC_FCMLA180_CONJ])
+
+(define_int_iterator FCMUL_OP [UNSPEC_FCMUL
+			       UNSPEC_FCMUL_CONJ])
+
 ;; Iterators for atomic operations.
 
 (define_int_iterator ATOMIC_LDOP
@@ -3435,7 +3469,79 @@ (define_int_attr rot [(UNSPEC_CADD90 "90")
 		      (UNSPEC_COND_FCMLA "0")
 		      (UNSPEC_COND_FCMLA90 "90")
 		      (UNSPEC_COND_FCMLA180 "180")
-		      (UNSPEC_COND_FCMLA270 "270")])
+		      (UNSPEC_COND_FCMLA270 "270")
+		      (UNSPEC_FCMUL "0")
+		      (UNSPEC_FCMUL_CONJ "180")])
+
+;; A conjugate is a negation of the imaginary component.
+;; The number in each unspec is the rotation component of the instruction, e.g.
+;; FCMLA180 means use the instruction with #180.
+;; The iterator is used to produce the right name mangling for the function.
+(define_int_attr rot_op [(UNSPEC_FCMLA180 "")
+			 (UNSPEC_FCMLA180_CONJ "_conj")
+			 (UNSPEC_FCMLA "")
+			 (UNSPEC_FCMLA_CONJ "_conj")
+			 (UNSPEC_FCMUL "")
+			 (UNSPEC_FCMUL_CONJ "_conj")
+			 (UNSPEC_CMLA "")
+			 (UNSPEC_CMLA180 "")
+			 (UNSPEC_CMLA_CONJ "_conj")
+			 (UNSPEC_CMUL "")
+			 (UNSPEC_CMUL_CONJ "_conj")])
+
+;; The complex operations, when performed on actual complex numbers, require
+;; two instructions per operation, e.g. complex multiplication requires two
+;; FCMLA instructions with particular rotation values.
+;;
+;; These values can be looked up in rotsplit1 and rotsplit2.  As an example,
+;; FCMUL needs the first instruction to use #0 and the second #90.
+(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
+			    (UNSPEC_FCMLA_CONJ "0")
+			    (UNSPEC_FCMUL "0")
+			    (UNSPEC_FCMUL_CONJ "0")
+			    (UNSPEC_FCMLA180 "270")
+			    (UNSPEC_FCMLA180_CONJ "90")])
+
+(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
+			    (UNSPEC_FCMLA_CONJ "270")
+			    (UNSPEC_FCMUL "90")
+			    (UNSPEC_FCMUL_CONJ "270")
+			    (UNSPEC_FCMLA180 "180")
+			    (UNSPEC_FCMLA180_CONJ "180")])
+
+;; SVE has slightly different naming from NEON, so we have to split these
+;; iterators.
+(define_int_attr sve_rot1 [(UNSPEC_FCMLA "")
+			   (UNSPEC_FCMLA_CONJ "")
+			   (UNSPEC_FCMUL "")
+			   (UNSPEC_FCMUL_CONJ "")
+			   (UNSPEC_FCMLA180 "270")
+			   (UNSPEC_FCMLA180_CONJ "90")
+			   (UNSPEC_CMLA "")
+			   (UNSPEC_CMLA_CONJ "")
+			   (UNSPEC_CMUL "")
+			   (UNSPEC_CMUL_CONJ "")
+			   (UNSPEC_CMLA180 "270")
+			   (UNSPEC_CMLA180_CONJ "90")])
+
+(define_int_attr sve_rot2 [(UNSPEC_FCMLA "90")
+			   (UNSPEC_FCMLA_CONJ "270")
+			   (UNSPEC_FCMUL "90")
+			   (UNSPEC_FCMUL_CONJ "270")
+			   (UNSPEC_FCMLA180 "180")
+			   (UNSPEC_FCMLA180_CONJ "180")
+			   (UNSPEC_CMLA "90")
+			   (UNSPEC_CMLA_CONJ "270")
+			   (UNSPEC_CMUL "90")
+			   (UNSPEC_CMUL_CONJ "270")
+			   (UNSPEC_CMLA180 "180")
+			   (UNSPEC_CMLA180_CONJ "180")])
+
+
+(define_int_attr fcmac1 [(UNSPEC_FCMLA "a") (UNSPEC_FCMLA_CONJ "a")
+			 (UNSPEC_FCMLA180 "s") (UNSPEC_FCMLA180_CONJ "s")
+			 (UNSPEC_CMLA "a") (UNSPEC_CMLA_CONJ "a")
+			 (UNSPEC_CMLA180 "s") (UNSPEC_CMLA180_CONJ "s")])
 
 (define_int_attr sve_fmla_op [(UNSPEC_COND_FMLA "fmla")
 			      (UNSPEC_COND_FMLS "fmls")
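
For reference, the rotsplit1/rotsplit2 pairings can be read off the four
FCMLA rotation contributions.  A scalar C sketch of a single step
(illustrative, based on the architectural rotation semantics; not code
from the patch):

  typedef struct { float re, im; } cf;

  /* One FCMLA step for a given rotation.  */
  static cf fcmla_step (cf acc, cf a, cf b, int rot)
  {
    switch (rot)
      {
      case 0:   acc.re += a.re * b.re; acc.im += a.re * b.im; break;
      case 90:  acc.re -= a.im * b.im; acc.im += a.im * b.re; break;
      case 180: acc.re -= a.re * b.re; acc.im -= a.re * b.im; break;
      case 270: acc.re += a.im * b.im; acc.im -= a.im * b.re; break;
      }
    return acc;
  }

Composing #0 then #90 yields acc + a*b; #270 then #180 yields acc - a*b
(the fcmac1 "s" forms); and #0 then #270 yields acc + conj(a)*b, which
is how the _CONJ variants fall out of the same instruction.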


-- 


* Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Multiply, FMS and FMA.
  2020-12-10 16:59 [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Complex Addition, Multiply and FMA Tamar Christina
                   ` (2 preceding siblings ...)
  2021-01-15 15:30 ` [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Multiply, FMS " Tamar Christina
@ 2021-01-15 17:31 ` Richard Sandiford
  3 siblings, 0 replies; 9+ messages in thread
From: Richard Sandiford @ 2021-01-15 17:31 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov

Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> This adds implementation for the optabs for complex operations.  With this the
> following C code:
>
>   void g (float complex a[restrict N], float complex b[restrict N],
> 	  float complex c[restrict N])
>   {
>     for (int i=0; i < N; i++)
>       c[i] =  a[i] * b[i];
>   }
>
> generates
>
>
> NEON:
>
> g:
>         movi    v3.4s, 0
>         mov     x3, 0
>         .p2align 3,,7
> .L2:
>         mov     v0.16b, v3.16b
>         ldr     q2, [x1, x3]
>         ldr     q1, [x0, x3]
>         fcmla   v0.4s, v1.4s, v2.4s, #0
>         fcmla   v0.4s, v1.4s, v2.4s, #90
>         str     q0, [x2, x3]
>         add     x3, x3, 16
>         cmp     x3, 1600
>         bne     .L2
>         ret
>
> SVE:
>
> g:
>         mov     x3, 0
>         mov     x4, 400
>         ptrue   p1.b, all
>         whilelo p0.s, xzr, x4
>         mov     z3.s, #0
>         .p2align 3,,7
> .L2:
>         ld1w    z1.s, p0/z, [x0, x3, lsl 2]
>         ld1w    z2.s, p0/z, [x1, x3, lsl 2]
>         movprfx z0, z3
>         fcmla   z0.s, p1/m, z1.s, z2.s, #0
>         fcmla   z0.s, p1/m, z1.s, z2.s, #90
>         st1w    z0.s, p0, [x2, x3, lsl 2]
>         incw    x3
>         whilelo p0.s, x3, x4
>         b.any   .L2
>         ret
>
> SVE2 (with int instead of float)
> g:
>         mov     x3, 0
>         mov     x4, 400
>         mov     z3.b, #0
>         whilelo p0.s, xzr, x4
>         .p2align 3,,7
> .L2:
>         ld1w    z1.s, p0/z, [x0, x3, lsl 2]
>         ld1w    z2.s, p0/z, [x1, x3, lsl 2]
>         movprfx z0, z3
>         cmla    z0.s, z1.s, z2.s, #0
>         cmla    z0.s, z1.s, z2.s, #90
>         st1w    z0.s, p0, [x2, x3, lsl 2]
>         incw    x3
>         whilelo p0.s, x3, x4
>         b.any   .L2
>         ret
>
>
> It defines a new iterator VALL_ARITH which contains the types for which we can
> do general arithmetic (excluding bfloat16).

It doesn't look like anything uses this though.  Is it just left over
from the previous version?

>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Checked with armv8-a+sve2+fp16 and no issues.  Note that due to a mid-end
> limitation SLP for SVE currently fails for some permutes.  The tests have these
> marked as XFAIL.  I do intend to fix this soon.
>
> Execution tests verified with QEMU.
>
> Matching tests for these are in the mid-end patches; I will enable the tests
> for these patterns in a separate patch.
>
> Ok for master?
>
> Thanks,
> Tamar
>
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-simd.md (cml<fcmac1><rot_op><mode>4,
> 	cmul<rot_op><mode>3): New.
> 	* config/aarch64/iterators.md (VALL_ARITH, UNSPEC_FCMUL,
> 	UNSPEC_FCMUL_CONJ, UNSPEC_FCMLA_CONJ, UNSPEC_FCMLA180_CONJ,
> 	UNSPEC_CMLA_CONJ, UNSPEC_CMLA180_CONJ, UNSPEC_CMUL, UNSPEC_CMUL_CONJ,
> 	FCMLA_OP, FCMUL_OP, rot_op, rotsplit1, rotsplit2, fcmac1, sve_rot1,
> 	sve_rot2, SVE2_INT_CMLA_OP, SVE2_INT_CMUL_OP, SVE2_INT_CADD_OP): New.
> 	(rot): Add UNSPEC_FCMUL, UNSPEC_FCMUL_CONJ.
> 	* config/aarch64/aarch64-sve.md (cml<fcmac1><rot_op><mode>4,
> 	cmul<rot_op><mode>3): New.
> 	* config/aarch64/aarch64-sve2.md (cml<fcmac1><rot_op><mode>4,
> 	cmul<rot_op><mode>3): New.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 4b869ded918fd91ffd41e6ba068239a752b331e5..8a5f1dad224a99a8ba30669139259922a1250d0e 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -516,6 +516,47 @@ (define_insn "aarch64_fcmlaq_lane<rot><mode>"
>    [(set_attr "type" "neon_fcmla")]
>  )
>  
> +;; The complex mla/mls operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cml<fcmac1><rot_op><mode>4"
> +  [(set (match_operand:VHSDF 0 "register_operand")
> +	(plus:VHSDF (match_operand:VHSDF 1 "register_operand")
> +		    (unspec:VHSDF [(match_operand:VHSDF 2 "register_operand")
> +				   (match_operand:VHSDF 3 "register_operand")]
> +				   FCMLA_OP)))]
> +  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
> +{
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (tmp, operands[1],
> +						 operands[3], operands[2]));
> +  emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0], tmp,
> +						 operands[3], operands[2]));
> +  DONE;
> +})
> +
> +;; The complex mul operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cmul<rot_op><mode>3"
> +  [(set (match_operand:VHSDF 0 "register_operand")
> +	(unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
> +		       (match_operand:VHSDF 2 "register_operand")]
> +		       FCMUL_OP))]
> +  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
> +{
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  rtx res1 = gen_reg_rtx (<MODE>mode);
> +  emit_move_insn (tmp, CONST0_RTX (<MODE>mode));

Doesn't make much difference, but I think:

  rtx tmp = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));
  rtx res1 = gen_reg_rtx (<MODE>mode);

would be more usual.

> +  emit_insn (gen_aarch64_fcmla<rotsplit1><mode> (res1, tmp,
> +						 operands[2], operands[1]));
> +  emit_insn (gen_aarch64_fcmla<rotsplit2><mode> (operands[0], res1,
> +						 operands[2], operands[1]));
> +  DONE;
> +})
> +
> +
> +

Nit: excess new lines.

>  ;; These instructions map to the __builtins for the Dot Product operations.
>  (define_insn "aarch64_<sur>dot<vsi2qi>"
>    [(set (match_operand:VS 0 "register_operand" "=w")
> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> index da15bd8788507feb12d52894c14e099370f34108..9dfe6a3f4512a20ba4f1e66a105ee0ae5d6949ea 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -7243,6 +7243,62 @@ (define_insn "@aarch64_pred_<optab><mode>"
>    [(set_attr "movprfx" "*,yes")]
>  )
>  
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mla/mls operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cml<fcmac1><rot_op><mode>4"
> +  [(set (match_operand:SVE_FULL_F 0 "register_operand")
> +	(unspec:SVE_FULL_F
> +	  [(match_dup 4)
> +	   (match_dup 5)
> +	   (match_operand:SVE_FULL_F 1 "register_operand")
> +	   (match_operand:SVE_FULL_F 2 "register_operand")
> +	   (match_operand:SVE_FULL_F 3 "register_operand")]
> +	  FCMLA_OP))]
> +  "TARGET_SVE"
> +{
> +  operands[4] = aarch64_ptrue_reg (<VPRED>mode);
> +  operands[5] = gen_int_mode (SVE_RELAXED_GP, SImode);
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn
> +    (gen_aarch64_pred_fcmla<sve_rot1><mode> (tmp, operands[4],
> +					     operands[3], operands[2],
> +					     operands[1], operands[5]));
> +  emit_insn
> +    (gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], operands[4],
> +					     operands[3], operands[2],
> +					     tmp, operands[5]));
> +  DONE;
> +})
> +
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mul operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cmul<rot_op><mode>3"
> +  [(set (match_operand:SVE_FULL_F 0 "register_operand")
> +	(unspec:SVE_FULL_F
> +	   [(match_operand:SVE_FULL_F 1 "register_operand")
> +	    (match_operand:SVE_FULL_F 2 "register_operand")]
> +	  FCMUL_OP))]
> +  "TARGET_SVE"
> +{
> +  rtx pred_reg = aarch64_ptrue_reg (<VPRED>mode);
> +  rtx gp_mode = gen_int_mode (SVE_RELAXED_GP, SImode);
> +  rtx accum = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn
> +    (gen_aarch64_pred_fcmla<sve_rot1><mode> (tmp, pred_reg,
> +					     operands[2], operands[1],
> +					     accum, gp_mode));
> +  emit_insn
> +    (gen_aarch64_pred_fcmla<sve_rot2><mode> (operands[0], pred_reg,
> +					     operands[2], operands[1],
> +					     tmp, gp_mode));
> +  DONE;
> +})
> +
>  ;; Predicated FCMLA with merging.
>  (define_expand "@cond_<optab><mode>"
>    [(set (match_operand:SVE_FULL_F 0 "register_operand")
> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 5cb9144da98af2d02b83043511a99b5723d7e8c0..b96708d03f4458726b32ec46c0078499e00b8549 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -1848,6 +1848,48 @@ (define_insn "@aarch64_<optab>_lane_<mode>"
>    [(set_attr "movprfx" "*,yes")]
>  )
>  
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mla/mls operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cml<fcmac1><rot_op><mode>4"
> +  [(set (match_operand:SVE_FULL_I 0 "register_operand")
> +	(plus:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand")
> +	  (unspec:SVE_FULL_I
> +	    [(match_operand:SVE_FULL_I 2 "register_operand")
> +	     (match_operand:SVE_FULL_I 3 "register_operand")]
> +	    SVE2_INT_CMLA_OP)))]
> +  "TARGET_SVE2"
> +{
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, operands[1],
> +						   operands[3], operands[2]));
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
> +						   operands[3], operands[2]));
> +  DONE;
> +})
> +
> +;; unpredicated optab pattern for auto-vectorizer
> +;; The complex mul operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cmul<rot_op><mode>3"
> +  [(set (match_operand:SVE_FULL_I 0 "register_operand")
> +	(unspec:SVE_FULL_I
> +	  [(match_operand:SVE_FULL_I 1 "register_operand")
> +	   (match_operand:SVE_FULL_I 2 "register_operand")]
> +	  SVE2_INT_CMUL_OP))]
> +  "TARGET_SVE2"
> +{
> +  rtx accum = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot1><mode> (tmp, accum,
> +						   operands[2], operands[1]));
> +  emit_insn (gen_aarch64_sve_cmla<sve_rot2><mode> (operands[0], tmp,
> +						   operands[2], operands[1]));
> +  DONE;
> +})
> +
>  ;; -------------------------------------------------------------------------
>  ;; ---- [INT] Complex dot product
>  ;; -------------------------------------------------------------------------
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index d42a70653edb266f2b76924b75a814db25f08f23..3f61fc8e380abd922d39973f40a966b7ce64fa40 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -182,6 +182,11 @@ (define_mode_iterator V2F [V2SF V2DF])
>  ;; All Advanced SIMD modes on which we support any arithmetic operations.
>  (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF V2DF])
>  
> +;; All Advanced SIMD modes suitable for performing arithmetic.
> +(define_mode_iterator VALL_ARITH [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
> +				  (V4HF "TARGET_SIMD_F16INST") (V8HF "TARGET_SIMD_F16INST")

Nit: avoidable long line.

> +				  V2SF V4SF V2DF])
> +

…but as said above, this looks like dead code now, so I think we
should remove it until we need it.

>  ;; All Advanced SIMD modes suitable for moving, loading, and storing.
>  (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
>  				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
> @@ -712,6 +717,10 @@ (define_c_enum "unspec"
>      UNSPEC_FCMLA90	; Used in aarch64-simd.md.
>      UNSPEC_FCMLA180	; Used in aarch64-simd.md.
>      UNSPEC_FCMLA270	; Used in aarch64-simd.md.
> +    UNSPEC_FCMUL	; Used in aarch64-simd.md.
> +    UNSPEC_FCMUL_CONJ	; Used in aarch64-simd.md.
> +    UNSPEC_FCMLA_CONJ	; Used in aarch64-simd.md.
> +    UNSPEC_FCMLA180_CONJ	; Used in aarch64-simd.md.
>      UNSPEC_ASRD		; Used in aarch64-sve.md.
>      UNSPEC_ADCLB	; Used in aarch64-sve2.md.
>      UNSPEC_ADCLT	; Used in aarch64-sve2.md.
> @@ -730,6 +739,10 @@ (define_c_enum "unspec"
>      UNSPEC_CMLA180	; Used in aarch64-sve2.md.
>      UNSPEC_CMLA270	; Used in aarch64-sve2.md.
>      UNSPEC_CMLA90	; Used in aarch64-sve2.md.
> +    UNSPEC_CMLA_CONJ	; Used in aarch64-sve2.md.
> +    UNSPEC_CMLA180_CONJ	; Used in aarch64-sve2.md.
> +    UNSPEC_CMUL		; Used in aarch64-sve2.md.
> +    UNSPEC_CMUL_CONJ	; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTLT	; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTNT	; Used in aarch64-sve2.md.
>      UNSPEC_COND_FCVTX	; Used in aarch64-sve2.md.
> @@ -1291,7 +1304,7 @@ (define_mode_attr Vwide [(V2SF "v2df") (V4HF "v4sf")
>  
>  ;; Widened mode register suffixes for VD_BHSI/VQW/VQ_HSF.
>  (define_mode_attr Vwtype [(V8QI "8h") (V4HI "4s")
> -			  (V2SI "2d") (V16QI "8h") 
> +			  (V2SI "2d") (V16QI "8h")
>  			  (V8HI "4s") (V4SI "2d")
>  			  (V8HF "4s") (V4SF "2d")])
>  
> @@ -1313,7 +1326,7 @@ (define_mode_attr Vewtype [(VNx16QI "h")
>  
>  ;; Widened mode register suffixes for VDW/VQW.
>  (define_mode_attr Vmwtype [(V8QI ".8h") (V4HI ".4s")
> -			   (V2SI ".2d") (V16QI ".8h") 
> +			   (V2SI ".2d") (V16QI ".8h")
>  			   (V8HI ".4s") (V4SI ".2d")
>  			   (V4HF ".4s") (V2SF ".2d")
>  			   (SI   "")    (HI   "")])
> @@ -2611,6 +2624,19 @@ (define_int_iterator SVE2_INT_CMLA [UNSPEC_CMLA
>  				    UNSPEC_SQRDCMLAH180
>  				    UNSPEC_SQRDCMLAH270])
>  
> +;; Unlike the normal CMLA instructions these represent the actual operation you

Typo s/you//

> +;; to be performed.  They will always need to be expanded into multiple
> +;; sequences consisting of CMLA.
> +(define_int_iterator SVE2_INT_CMLA_OP [UNSPEC_CMLA
> +				       UNSPEC_CMLA_CONJ
> +				       UNSPEC_CMLA180])

Is UNSPEC_CMLA180_CONJ missing here?  I couldn't see anything that
generated it.

> +
> +;; Unlike the normal CMLA instructions these represent the actual operation you

s/you// as above.

> +;; to be performed.  They will always need to be expanded into multiple
> +;; sequences consisting of CMLA.
> +(define_int_iterator SVE2_INT_CMUL_OP [UNSPEC_CMUL
> +				       UNSPEC_CMUL_CONJ])
> +
>  ;; Same as SVE2_INT_CADD but excluding the saturating instructions
>  (define_int_iterator SVE2_INT_CADD_OP [UNSPEC_CADD90
>  				       UNSPEC_CADD270])
> @@ -2725,6 +2751,14 @@ (define_int_iterator FMMLA [UNSPEC_FMMLA])
>  (define_int_iterator BF_MLA [UNSPEC_BFMLALB
>  			     UNSPEC_BFMLALT])
>  
> +(define_int_iterator FCMLA_OP [UNSPEC_FCMLA
> +			       UNSPEC_FCMLA180
> +			       UNSPEC_FCMLA_CONJ
> +			       UNSPEC_FCMLA180_CONJ])
> +
> +(define_int_iterator FCMUL_OP [UNSPEC_FCMUL
> +			       UNSPEC_FCMUL_CONJ])
> +
>  ;; Iterators for atomic operations.
>  
>  (define_int_iterator ATOMIC_LDOP
> @@ -3435,7 +3469,79 @@ (define_int_attr rot [(UNSPEC_CADD90 "90")
>  		      (UNSPEC_COND_FCMLA "0")
>  		      (UNSPEC_COND_FCMLA90 "90")
>  		      (UNSPEC_COND_FCMLA180 "180")
> -		      (UNSPEC_COND_FCMLA270 "270")])
> +		      (UNSPEC_COND_FCMLA270 "270")
> +		      (UNSPEC_FCMUL "0")
> +		      (UNSPEC_FCMUL_CONJ "180")])
> +
> +;; A conjugate is a negation of the imaginary component.
> +;; The number in each unspec is the rotation component of the instruction, e.g.
> +;; FCMLA180 means use the instruction with #180.
> +;; The iterator is used to produce the right name mangling for the function.
> +(define_int_attr rot_op [(UNSPEC_FCMLA180 "")

Please don't shoot me, but I think conj_op might be a better name for this.

> +			 (UNSPEC_FCMLA180_CONJ "_conj")
> +			 (UNSPEC_FCMLA "")
> +			 (UNSPEC_FCMLA_CONJ "_conj")
> +			 (UNSPEC_FCMUL "")
> +			 (UNSPEC_FCMUL_CONJ "_conj")
> +			 (UNSPEC_CMLA "")
> +			 (UNSPEC_CMLA180 "")
> +			 (UNSPEC_CMLA_CONJ "_conj")
> +			 (UNSPEC_CMUL "")
> +			 (UNSPEC_CMUL_CONJ "_conj")])
> +
> +;; The complex operations, when performed on actual complex numbers, require
> +;; two instructions per operation, e.g. complex multiplication requires two
> +;; FCMLA instructions with particular rotation values.
> +;;
> +;; These values can be looked up in rotsplit1 and rotsplit2.  As an example,
> +;; FCMUL needs the first instruction to use #0 and the second #90.
> +(define_int_attr rotsplit1 [(UNSPEC_FCMLA "0")
> +			    (UNSPEC_FCMLA_CONJ "0")
> +			    (UNSPEC_FCMUL "0")
> +			    (UNSPEC_FCMUL_CONJ "0")
> +			    (UNSPEC_FCMLA180 "270")
> +			    (UNSPEC_FCMLA180_CONJ "90")])

Unless there's some reason this doesn't work or is slower, I think this
should be:

			    (UNSPEC_FCMLA180 "180")
			    (UNSPEC_FCMLA180_CONJ "180")

and…

> +(define_int_attr rotsplit2 [(UNSPEC_FCMLA "90")
> +			    (UNSPEC_FCMLA_CONJ "270")
> +			    (UNSPEC_FCMUL "90")
> +			    (UNSPEC_FCMUL_CONJ "270")
> +			    (UNSPEC_FCMLA180 "180")
> +			    (UNSPEC_FCMLA180_CONJ "180")])

…this should be:

			    (UNSPEC_FCMLA180 "270")
			    (UNSPEC_FCMLA180_CONJ "90")])

I.e. swap the rotsplit1 and rotsplit2 values around for these two
unspecs only.  That way, rotsplit1 always represents the operation
to perform on the real components of the first multiplication operand
while rotsplit2 always represents the operation to perform on the
imaginary components of the first multiplication operand.  The _CONJ
only affects rotsplit2 and not rotsplit1.  (Specifically, it adds
180 to the value of rotsplit2.)
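
Concretely, writing out both products (a sketch, with a = ar + ai*I and
b = br + bi*I):

  a * b       = (ar*br - ai*bi) + (ar*bi + ai*br) * I
  conj(a) * b = (ar*br + ai*bi) + (ar*bi - ai*br) * I

The ar terms (the rotsplit1 instruction) are identical in both products;
only the ai terms (the rotsplit2 instruction) change sign, which is
exactly a 180 rotation applied to the second instruction.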

> +;; SVE has slightly different naming from NEON, so we have to split these
> +;; iterators.
> +(define_int_attr sve_rot1 [(UNSPEC_FCMLA "")
> +			   (UNSPEC_FCMLA_CONJ "")
> +			   (UNSPEC_FCMUL "")
> +			   (UNSPEC_FCMUL_CONJ "")
> +			   (UNSPEC_FCMLA180 "270")
> +			   (UNSPEC_FCMLA180_CONJ "90")
> +			   (UNSPEC_CMLA "")
> +			   (UNSPEC_CMLA_CONJ "")
> +			   (UNSPEC_CMUL "")
> +			   (UNSPEC_CMUL_CONJ "")
> +			   (UNSPEC_CMLA180 "270")
> +			   (UNSPEC_CMLA180_CONJ "90")])
> +
> +(define_int_attr sve_rot2 [(UNSPEC_FCMLA "90")
> +			   (UNSPEC_FCMLA_CONJ "270")
> +			   (UNSPEC_FCMUL "90")
> +			   (UNSPEC_FCMUL_CONJ "270")
> +			   (UNSPEC_FCMLA180 "180")
> +			   (UNSPEC_FCMLA180_CONJ "180")
> +			   (UNSPEC_CMLA "90")
> +			   (UNSPEC_CMLA_CONJ "270")
> +			   (UNSPEC_CMUL "90")
> +			   (UNSPEC_CMUL_CONJ "270")
> +			   (UNSPEC_CMLA180 "180")
> +			   (UNSPEC_CMLA180_CONJ "180")])

Same idea here: I think the {,F}CMLA180{,_CONJ} cases should be
swapped so that sve_rot1 is always "" or "180" and sve_rot2 is
always "90" or "270".

OK with those changes, thanks.

Richard

