public inbox for gcc-patches@gcc.gnu.org
* [PATCH][AArch64] Vectorize MULH(R)S patterns with SVE2 instructions
@ 2019-08-29 14:17 Yuliang Wang
  2019-08-30 12:54 ` Richard Sandiford
  0 siblings, 1 reply; 4+ messages in thread
From: Yuliang Wang @ 2019-08-29 14:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 3040 bytes --]

This patch allows for more efficient SVE2 vectorization of Multiply High with Round and Scale (MULHRS) patterns.

The example snippet:

    uint16_t a[N], b[N], c[N];

    void foo_round (void)
    {
        for (int i = 0; i < N; i++)
            a[i] = ((((int32_t)b[i] * (int32_t)c[i]) >> 14) + 1) >> 1;
    }

... previously vectorized to:

    foo_round:
        ...
        ptrue   p0.s
        whilelo p1.h, wzr, w2
        ld1h    {z2.h}, p1/z, [x4, x0, lsl #1]
        ld1h    {z0.h}, p1/z, [x3, x0, lsl #1]
        uunpklo z3.s, z2.h                      //
        uunpklo z1.s, z0.h                      //
        uunpkhi z2.s, z2.h                      //
        uunpkhi z0.s, z0.h                      //
        mul     z1.s, p0/m, z1.s, z3.s          //
        mul     z0.s, p0/m, z0.s, z2.s          //
        asr     z1.s, z1.s, #14                 //
        asr     z0.s, z0.s, #14                 //
        add     z1.s, z1.s, #1                  //
        add     z0.s, z0.s, #1                  //
        asr     z1.s, z1.s, #1                  //
        asr     z0.s, z0.s, #1                  //
        uzp1    z0.h, z1.h, z0.h                //
        st1h    {z0.h}, p1, [x1, x0, lsl #1]
        inch    x0
        whilelo p1.h, w0, w2
        b.ne    28
        ret

... and now vectorizes to:

    foo_round:
        ...
        whilelo p0.h, wzr, w2
        nop
        ld1h    {z1.h}, p0/z, [x4, x0, lsl #1]
        ld1h    {z2.h}, p0/z, [x3, x0, lsl #1]
        umullb  z0.s, z1.h, z2.h                //
        umullt  z1.s, z1.h, z2.h                //
        rshrnb  z0.h, z0.s, #15                 //
        rshrnt  z0.h, z1.s, #15                 //
        st1h    {z0.h}, p0, [x1, x0, lsl #1]
        inch    x0
        whilelo p0.h, w0, w2
        b.ne    28
        ret
        nop

Also supported are:

* Non-rounding cases

    An analogous example snippet:

        void foo_trunc (void)
        {
            for (int i = 0; i < N; i++)
                a[i] = ((int32_t)b[i] * (int32_t)c[i]) >> 15;
        }

    ... vectorizes with SHRNT/SHRNB

* 32-bit and 8-bit input/output types

* Signed output types

    SMULLT/SMULLB are generated instead

SQRDMULH was considered as a potential single-instruction optimization, but it saturates the intermediate value instead of truncating it, so the results can differ.
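
For illustration, a scalar sketch of the difference for 16-bit elements
(plain C; mulhrs_ref and sqrdmulh_ref are hypothetical reference functions,
not part of the patch):

    #include <stdint.h>

    /* Truncating multiply-high-with-round-and-scale, as in foo_round.  */
    int16_t mulhrs_ref (int16_t a, int16_t b)
    {
      return (int16_t) (((((int32_t) a * (int32_t) b) >> 14) + 1) >> 1);
    }

    /* Saturating rounding doubling multiply high, roughly what SQRDMULH
       computes.  */
    int16_t sqrdmulh_ref (int16_t a, int16_t b)
    {
      int64_t prod = (2 * (int64_t) a * (int64_t) b + (1 << 15)) >> 16;
      if (prod > INT16_MAX) prod = INT16_MAX;
      if (prod < INT16_MIN) prod = INT16_MIN;
      return (int16_t) prod;
    }

mulhrs_ref (INT16_MIN, INT16_MIN) wraps to -32768, whereas
sqrdmulh_ref (INT16_MIN, INT16_MIN) saturates to 32767.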

Best Regards,
Yuliang Wang


ChangeLog:

2019-08-22  Yuliang Wang  <yuliang.wang@arm.com>

        * config/aarch64/aarch64-sve2.md: support for SVE2 instructions [S/U]MULL[T/B] + [R]SHRN[T/B] and MULHRS pattern variants
        * config/aarch64/iterators.md: iterators and attributes for above
        * internal-fn.def: internal functions for MULH[R]S patterns
        * optabs.def: optabs definitions for above and sign variants
        * tree-vect-patterns.c (vect_recog_multhi_pattern): pattern recognition function for MULHRS
        * gcc.target/aarch64/sve2/mulhrs_1.c: new test for all variants

[-- Attachment #2: rb11655.patch --]
[-- Type: application/octet-stream, Size: 20235 bytes --]

diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 2334e5a7b7dc524bbd1f4d0a48ba5cd991970118..51783604ad8f83eb1d070c133009ed41a2a0252d 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -63,3 +63,89 @@
    movprfx\t%0, %2\;<sur>h<addsub>\t%0.<Vetype>, %1/m, %0.<Vetype>, %3.<Vetype>"
   [(set_attr "movprfx" "*,yes")]
 )
+
+;; Multiply long top / bottom
+(define_insn "*<su>mull<bt><Vwide>"
+  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
+    (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")
+			  (match_operand:SVE_BHSI 2 "register_operand" "w")]
+			 MULLBT))]
+  "TARGET_SVE2"
+  "<su>mull<bt>\t%0.<Vewtype>, %1.<Vetype>, %2.<Vetype>"
+)
+
+(define_expand "<su>mull<bt><Vwide>"
+  [(set (match_operand:<VWIDE> 0 "register_operand")
+    (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand")
+			  (match_operand:SVE_BHSI 2 "register_operand")]
+			 MULLBT))]
+  "TARGET_SVE2"
+  ""
+)
+
+;; (Rounding) Right shift narrow bottom
+(define_insn "*<r>shrnb<mode>"
+  [(set (match_operand:SVE_BHSI 0 "register_operand" "=w")
+	  (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand" "w")
+			  (match_operand 2 "aarch64_simd_shift_imm_offset_<Vel>" "i")]
+			 SHRNB))]
+  "TARGET_SVE2"
+  "<r>shrnb\t%0.<Vetype>, %1.<Vewtype>, #%2"
+)
+
+(define_expand "<r>shrnb<mode>"
+  [(set (match_operand:SVE_BHSI 0 "register_operand")
+	 (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand")
+			  (match_operand 2 "aarch64_simd_shift_imm_offset_<Vel>")]
+			 SHRNB))]
+  "TARGET_SVE2"
+  ""
+)
+
+;; (Rounding) Right shift narrow top
+(define_insn "*<r>shrnt<mode>"
+  [(set (match_operand:SVE_BHSI 0 "register_operand" "=w")
+	  (unspec:SVE_BHSI [(match_operand:SVE_BHSI 1 "register_operand" "%0")
+	      (match_operand:<VWIDE> 2 "register_operand" "w")
+			  (match_operand 3 "aarch64_simd_shift_imm_offset_<Vel>" "i")]
+			 SHRNT))]
+  "TARGET_SVE2"
+  "<r>shrnt\t%0.<Vetype>, %2.<Vewtype>, #%3"
+)
+
+(define_expand "<r>shrnt<mode>"
+  [(set (match_operand:SVE_BHSI 0 "register_operand")
+	  (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand")
+	      (match_operand:<VWIDE> 2 "register_operand")
+			  (match_operand 3 "aarch64_simd_shift_imm_offset_<Vel>")]
+			 SHRNT))]
+  "TARGET_SVE2"
+  ""
+)
+
+;; Unpredicated integer multiply-high-with-(round-and-)scale
+(define_expand "<su>mulh<r>s<mode>3"
+  [(set (match_operand:SVE_BHSI 0 "register_operand")
+	(unspec:SVE_BHSI
+	  [(match_dup 3)
+	   (unspec:SVE_BHSI [(match_operand:SVE_BHSI 1 "register_operand")
+			  (match_operand:SVE_BHSI 2 "register_operand")]
+			 MULHRS)]
+	  UNSPEC_PRED_X))]
+  "TARGET_SVE2"
+  {
+    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
+
+    rtx prod_b = gen_reg_rtx (<VWIDE>mode);
+    rtx prod_t = gen_reg_rtx (<VWIDE>mode);
+    emit_insn (gen_<su>mullb<Vwide> (prod_b, operands[1], operands[2]));
+    emit_insn (gen_<su>mullt<Vwide> (prod_t, operands[1], operands[2]));
+
+    rtx shift = GEN_INT (GET_MODE_UNIT_BITSIZE (<MODE>mode) - 1);
+    emit_insn (gen_<r>shrnb<mode> (operands[0], prod_b, shift));
+    emit_insn (gen_<r>shrnt<mode> (operands[0], operands[0], prod_t, shift));
+
+    DONE;
+  }
+)
+
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index e8ba4f3d987b0a79d290db6f75b9638b867a5f0e..258b0a8cca6c1c4877fad1820da8ad3386bd8f98 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -375,6 +375,10 @@
     UNSPEC_RSUBHN2	; Used in aarch64-simd.md.
     UNSPEC_SQDMULH	; Used in aarch64-simd.md.
     UNSPEC_SQRDMULH	; Used in aarch64-simd.md.
+    UNSPEC_SMULLB ; Used in aarch64-sve2.md.
+    UNSPEC_SMULLT ; Used in aarch64-sve2.md.
+    UNSPEC_UMULLB ; Used in aarch64-sve2.md.
+    UNSPEC_UMULLT ; Used in aarch64-sve2.md.
     UNSPEC_PMUL		; Used in aarch64-simd.md.
     UNSPEC_FMULX	; Used in aarch64-simd.md.
     UNSPEC_USQADD	; Used in aarch64-simd.md.
@@ -397,6 +401,10 @@
     UNSPEC_UQSHRN	; Used in aarch64-simd.md.
     UNSPEC_SQRSHRN	; Used in aarch64-simd.md.
     UNSPEC_UQRSHRN	; Used in aarch64-simd.md.
+    UNSPEC_SHRNB ; Used in aarch64-sve2.md.
+    UNSPEC_SHRNT ; Used in aarch64-sve2.md.
+    UNSPEC_RSHRNB ; Used in aarch64-sve2.md.
+    UNSPEC_RSHRNT ; Used in aarch64-sve2.md.
     UNSPEC_SSHL		; Used in aarch64-simd.md.
     UNSPEC_USHL		; Used in aarch64-simd.md.
     UNSPEC_SRSHL	; Used in aarch64-simd.md.
@@ -520,6 +528,10 @@
     UNSPEC_FCMLA90	; Used in aarch64-simd.md.
     UNSPEC_FCMLA180	; Used in aarch64-simd.md.
     UNSPEC_FCMLA270	; Used in aarch64-simd.md.
+    UNSPEC_SMULHS ; Used in aarch64-sve2.md.
+    UNSPEC_SMULHRS ; Used in aarch64-sve2.md.
+    UNSPEC_UMULHS ; Used in aarch64-sve2.md.
+    UNSPEC_UMULHRS ; Used in aarch64-sve2.md.
 ])
 
 ;; ------------------------------------------------------------------
@@ -1585,6 +1597,13 @@
 
 (define_int_iterator RHADD [UNSPEC_SRHADD UNSPEC_URHADD])
 
+(define_int_iterator MULLBT [UNSPEC_SMULLB UNSPEC_UMULLB
+            UNSPEC_SMULLT UNSPEC_UMULLT])
+
+(define_int_iterator SHRNB [UNSPEC_SHRNB UNSPEC_RSHRNB])
+
+(define_int_iterator SHRNT [UNSPEC_SHRNT UNSPEC_RSHRNT])
+
 (define_int_iterator DOTPROD [UNSPEC_SDOT UNSPEC_UDOT])
 
 (define_int_iterator ADDSUBHN [UNSPEC_ADDHN UNSPEC_RADDHN
@@ -1604,6 +1623,9 @@
 
 (define_int_iterator VQDMULH [UNSPEC_SQDMULH UNSPEC_SQRDMULH])
 
+(define_int_iterator MULHRS [UNSPEC_SMULHS UNSPEC_UMULHS
+            UNSPEC_SMULHRS UNSPEC_UMULHRS])
+
 (define_int_iterator USSUQADD [UNSPEC_SUQADD UNSPEC_USQADD])
 
 (define_int_iterator SUQMOVN [UNSPEC_SQXTN UNSPEC_UQXTN])
@@ -1866,7 +1888,11 @@
 		     (UNSPEC_COND_FCVTZS "s")
 		     (UNSPEC_COND_FCVTZU "u")
 		     (UNSPEC_COND_SCVTF "s")
-		     (UNSPEC_COND_UCVTF "u")])
+		     (UNSPEC_COND_UCVTF "u")
+	       (UNSPEC_SMULLB "s") (UNSPEC_UMULLB "u")
+	       (UNSPEC_SMULLT "s") (UNSPEC_UMULLT "u")
+	       (UNSPEC_SMULHS "s") (UNSPEC_UMULHS "u")
+	       (UNSPEC_SMULHRS "s") (UNSPEC_UMULHRS "u")])
 
 (define_int_attr sur [(UNSPEC_SHADD "s") (UNSPEC_UHADD "u")
 		      (UNSPEC_SRHADD "sr") (UNSPEC_URHADD "ur")
@@ -1904,6 +1930,10 @@
                     (UNSPEC_SQRSHRN "r") (UNSPEC_UQRSHRN "r")
                     (UNSPEC_SQSHL   "")  (UNSPEC_UQSHL  "")
                     (UNSPEC_SQRSHL   "r")(UNSPEC_UQRSHL  "r")
+		    (UNSPEC_SHRNB "") (UNSPEC_SHRNT "")
+		    (UNSPEC_RSHRNB "r") (UNSPEC_RSHRNT "r")
+	       (UNSPEC_SMULHS "") (UNSPEC_UMULHS "")
+	       (UNSPEC_SMULHRS "r") (UNSPEC_UMULHRS "r")
 ])
 
 (define_int_attr lr [(UNSPEC_SSLI  "l") (UNSPEC_USLI  "l")
@@ -1916,6 +1946,9 @@
 		    (UNSPEC_SHADD "") (UNSPEC_UHADD "u")
 		    (UNSPEC_SRHADD "") (UNSPEC_URHADD "u")])
 
+(define_int_attr bt [(UNSPEC_SMULLB "b") (UNSPEC_UMULLB "b")
+		    (UNSPEC_SMULLT "t") (UNSPEC_UMULLT "t")])
+
 (define_int_attr addsub [(UNSPEC_SHADD "add")
 			 (UNSPEC_UHADD "add")
 			 (UNSPEC_SRHADD "add")
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 9461693bcd12dd54eedd0d983758947bfc016b73..67eb30dee81b0a9b22459f3c1cb8e031b7ac32c9 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -149,6 +149,11 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
 DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 			      savg_ceil, uavg_ceil, binary)
 
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
+            smulhs, umulhs, binary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
+            smulhrs, umulhrs, binary)
+
 DEF_INTERNAL_OPTAB_FN (COND_ADD, ECF_CONST, cond_add, cond_binary)
 DEF_INTERNAL_OPTAB_FN (COND_SUB, ECF_CONST, cond_sub, cond_binary)
 DEF_INTERNAL_OPTAB_FN (COND_MUL, ECF_CONST, cond_smul, cond_binary)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 5283e6753f26eb0da08a393949128d09643e50dd..adcd9477b4c4b2e73cad7c72ee51336e98e81954 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -342,6 +342,10 @@ OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
 OPTAB_D (usad_optab, "usad$I$a")
 OPTAB_D (ssad_optab, "ssad$I$a")
+OPTAB_D (smulhs_optab, "smulhs$a3")
+OPTAB_D (smulhrs_optab, "smulhrs$a3")
+OPTAB_D (umulhs_optab, "umulhs$a3")
+OPTAB_D (umulhrs_optab, "umulhrs$a3")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
 OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
 OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")
diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c b/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..98cdc73d5256d712c985c541c1a585d081e5f5ea
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c
@@ -0,0 +1,123 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details --save-temps" } */
+
+#include <stdint.h>
+
+#define MULTHI(TYPE, BIGGER, RND)                     \
+TYPE __attribute__ ((noinline, noclone))              \
+mulhs_##TYPE##_##RND (TYPE *restrict x,               \
+        TYPE *restrict y, TYPE *restrict z, int n)    \
+{                                                     \
+  for (int i = 0; i < n; i++)                         \
+  {                                                   \
+    z[i] = ((((BIGGER)x[i] * (BIGGER)y[i]) >>         \
+            (sizeof(BIGGER)/2-(RND+1)) + RND) >> 1;   \
+  }                                                   \
+}
+
+MULTHI (int8_t, int16_t, 0)
+MULTHI (int16_t, int32_t, 0)
+MULTHI (int32_t, int64_t, 0)
+
+MULTHI (uint8_t, uint16_t, 0)
+MULTHI (uint16_t, uint32_t, 0)
+MULTHI (uint32_t, uint64_t, 0)
+
+MULTHI (int8_t, int16_t, 1)
+MULTHI (int16_t, int32_t, 1)
+MULTHI (int32_t, int64_t, 1)
+
+MULTHI (uint8_t, uint16_t, 1)
+MULTHI (uint16_t, uint32_t, 1)
+MULTHI (uint32_t, uint64_t, 1)
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 12 "vect" } } */
+
+/* { dg-final { scan-assembler-times {
+                  \tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
+                  \tsmullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
+                  \tshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
+                  \tshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
+              } 
+     1 } } */
+/* { dg-final { scan-assembler-times {
+                  \tsmullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
+                  \tsmullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
+                  \tshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
+                  \tshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
+              } 
+     1 } } */
+/* { dg-final { scan-assembler-times {
+                  \tsmullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
+                  \tsmullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
+                  \tshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
+                  \tshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
+              } 
+     1 } } */
+     
+/* { dg-final { scan-assembler-times {
+                  \tumullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
+                  \tumullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
+                  \tshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
+                  \tshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
+              } 
+     1 } } */
+/* { dg-final { scan-assembler-times {
+                  \tumullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
+                  \tumullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
+                  \tshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
+                  \tshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
+              } 
+     1 } } */
+/* { dg-final { scan-assembler-times {
+                  \tumullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
+                  \tumullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
+                  \tshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
+                  \tshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
+              } 
+     1 } } */
+     
+/* { dg-final { scan-assembler-times {
+                  \tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
+                  \tsmullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
+                  \trshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
+                  \trshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
+              } 
+     1 } } */
+/* { dg-final { scan-assembler-times {
+                  \tsmullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
+                  \tsmullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
+                  \trshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
+                  \trshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
+              } 
+     1 } } */
+/* { dg-final { scan-assembler-times {
+                  \tsmullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
+                  \tsmullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
+                  \trshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
+                  \trshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
+              } 
+     1 } } */
+     
+/* { dg-final { scan-assembler-times {
+                  \tumullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
+                  \tumullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
+                  \trshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
+                  \trshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
+              } 
+     1 } } */
+/* { dg-final { scan-assembler-times {
+                  \tumullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
+                  \tumullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
+                  \trshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
+                  \trshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
+              } 
+     1 } } */
+/* { dg-final { scan-assembler-times {
+                  \tumullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
+                  \tumullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
+                  \trshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
+                  \trshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
+              } 
+     1 } } */
+     
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index ccb2e1edecda09db786c0d98ccd25f5be107c7e4..96e86f716da0b819640cca2ff5d01c3ce8b342e3 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1723,6 +1723,169 @@ vect_recog_over_widening_pattern (stmt_vec_info last_stmt_info, tree *type_out)
   return pattern_stmt;
 }
 
+/* Recognize the following patterns:
+
+	    ATYPE a;  // narrower than TYPE
+	    BTYPE b;  // narrower than TYPE
+
+	    // multiply high with scaling
+	    1) TYPE res = ((TYPE) a * (TYPE) b) >> c;
+	    // or also with rounding
+	    2) TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+
+	where only the bottom half of res is used.  */
+
+static gimple *
+vect_recog_multhi_pattern (stmt_vec_info last_stmt_info, tree *type_out)
+{
+	vec_info *vinfo;
+
+  /* Check for a right shift.  */
+  gassign *last_stmt = dyn_cast <gassign *> (last_stmt_info->stmt);
+  if (!last_stmt
+      || gimple_assign_rhs_code (last_stmt) != RSHIFT_EXPR)
+    return NULL;
+  vinfo = last_stmt_info->vinfo;
+
+  /* Check that the shift result is wider than the users of the
+     result need (i.e. that narrowing would be a natural choice).  */
+  tree lhs = gimple_assign_lhs (last_stmt);
+  tree lhs_type = TREE_TYPE (lhs);
+  unsigned int target_precision =
+    vect_element_precision (last_stmt_info->min_output_precision);
+  if (!INTEGRAL_TYPE_P (lhs_type)
+      || target_precision >= TYPE_PRECISION (lhs_type))
+    return NULL;
+
+  /* Look through any change in sign on the outer shift input.  */
+  vect_unpromoted_value unprom_input;
+  tree rhs_rshift_inp = vect_look_through_possible_promotion (vinfo,
+          gimple_assign_rhs1 (last_stmt), &unprom_input);
+  if (!rhs_rshift_inp
+      || TYPE_PRECISION (TREE_TYPE (rhs_rshift_inp))
+         != TYPE_PRECISION (lhs_type))
+    return NULL;
+
+  /* Get the definition of the shift input.  */
+  stmt_vec_info input_stmt_info =
+      vect_get_internal_def (vinfo, rhs_rshift_inp);
+  if (!input_stmt_info)
+    return NULL;
+  gassign *input_stmt = dyn_cast <gassign *> (input_stmt_info->stmt);
+  if (!input_stmt)
+    return NULL;
+
+  stmt_vec_info mulhs_stmt_info;
+  internal_fn ifn;
+  unsigned int expect_offset;
+
+  /* Check for presence of rounding term.  */
+  if (gimple_assign_rhs_code (input_stmt) == PLUS_EXPR)
+  {
+    /* Check that the outer shift was by 1.  */
+    if (!integer_onep (gimple_assign_rhs2 (last_stmt)))
+      return NULL;
+
+    /* Check that one of the operands is 1.  */
+    tree oprnds[2] = { gimple_assign_rhs1 (input_stmt),
+                       gimple_assign_rhs2 (input_stmt) };
+    bool isone[2] = { integer_onep (oprnds[0]),
+                      integer_onep (oprnds[1]) };
+    if (!(isone[0] ^ isone[1]))
+      return NULL;
+    tree mulhs = oprnds[(int)(isone[0])];
+
+    mulhs_stmt_info = vect_get_internal_def (vinfo, mulhs);
+    if (!mulhs_stmt_info)
+      return NULL;
+
+    expect_offset = target_precision + 2;
+    ifn = IFN_MULHRS;
+  }
+  else
+  {
+    mulhs_stmt_info = last_stmt_info;
+
+    expect_offset = target_precision + 1;
+    ifn = IFN_MULHS;
+  }
+
+  /* Get the definition of the multiply-high-scale part.  */
+  gassign *mulhs_stmt = dyn_cast <gassign *> (mulhs_stmt_info->stmt);
+  if (!mulhs_stmt
+      || gimple_assign_rhs_code (mulhs_stmt) != RSHIFT_EXPR)
+    return NULL;
+
+  /* Get the scaling factor.  */
+  tree rhs_scale = gimple_assign_rhs2 (mulhs_stmt);
+  tree rhs_scale_type = TREE_TYPE (rhs_scale);
+  if (TREE_CODE (rhs_scale) != INTEGER_CST
+      || TREE_CODE (rhs_scale_type) != INTEGER_TYPE
+      || !type_has_mode_precision_p (rhs_scale_type))
+    return NULL;
+
+  /* Get the definition of the scaling input term.  */
+  tree rhs_mulh = gimple_assign_rhs1 (mulhs_stmt);
+  tree rhs_mulh_type = TREE_TYPE (rhs_mulh);
+  if (!INTEGRAL_TYPE_P (rhs_mulh_type))
+    return NULL;
+  stmt_vec_info mulh_stmt_info = vect_get_internal_def (vinfo, rhs_mulh);
+  if (!mulh_stmt_info)
+    return NULL;
+
+  /* Check that the shift represents the correct scaling.  */
+  if (wi::ne_p(wi::to_widest(rhs_scale) + expect_offset,
+          TYPE_PRECISION (rhs_mulh_type)))
+    return NULL;
+
+  /* Check whether the scaling input term can be seen as two widened
+     inputs multiplied together.  */
+  vect_unpromoted_value unprom_mult[2];
+  tree new_type;
+  unsigned int nops = vect_widened_op_tree (mulh_stmt_info, MULT_EXPR,
+					    WIDEN_MULT_EXPR, false, 2,
+					    unprom_mult, &new_type);
+  if (nops != 2)
+    return NULL;
+
+  vect_pattern_detected ("vect_recog_multhi_pattern", last_stmt);
+
+  /* Adjust output precision.  */
+  if (TYPE_PRECISION (new_type) < target_precision)
+    new_type = build_nonstandard_integer_type (target_precision,
+					       TYPE_UNSIGNED (new_type));
+
+  /* Check for target support.  */
+  tree new_vectype = get_vectype_for_scalar_type (new_type);
+  if (!new_vectype
+      || !direct_internal_fn_supported_p (ifn, new_vectype,
+					  OPTIMIZE_FOR_SPEED))
+    return NULL;
+
+  /* The IR requires a valid vector type for the cast result, even though
+     it's likely to be discarded.  */
+  *type_out = get_vectype_for_scalar_type (lhs_type);
+  if (!*type_out)
+    return NULL;
+
+  /* Generate the IFN_MULHRS call.  */
+  tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
+  tree new_ops[2];
+  vect_convert_inputs (last_stmt_info, 2, new_ops, new_type,
+          unprom_mult, new_vectype);
+  gcall *mulhrs_stmt = gimple_build_call_internal (ifn, 2,
+          new_ops[0], new_ops[1]);
+  gimple_call_set_lhs (mulhrs_stmt, new_var);
+  gimple_set_location (mulhrs_stmt, gimple_location (last_stmt));
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "created pattern stmt: %G", mulhrs_stmt);
+
+  return vect_convert_output (last_stmt_info, lhs_type,
+          mulhrs_stmt, new_vectype);
+}
+
 /* Recognize the patterns:
 
 	    ATYPE a;  // narrower than TYPE
@@ -4713,6 +4876,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] = {
   /* Must come after over_widening, which narrows the shift as much as
      possible beforehand.  */
   { vect_recog_average_pattern, "average" },
+  { vect_recog_multhi_pattern, "mult_high" },
   { vect_recog_cast_forwprop_pattern, "cast_forwprop" },
   { vect_recog_widen_mult_pattern, "widen_mult" },
   { vect_recog_dot_prod_pattern, "dot_prod" },


* Re: [PATCH][AArch64] Vectorize MULH(R)S patterns with SVE2 instructions
  2019-08-29 14:17 [PATCH][AArch64] Vectorize MULH(R)S patterns with SVE2 instructions Yuliang Wang
@ 2019-08-30 12:54 ` Richard Sandiford
  2019-09-12  7:52   ` Yuliang Wang
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Sandiford @ 2019-08-30 12:54 UTC (permalink / raw)
  To: Yuliang Wang; +Cc: gcc-patches, nd

Thanks for doing this.  The patch looks good, so this review is mostly a
list of very minor formatting comments, sorry.

Yuliang Wang <Yuliang.Wang@arm.com> writes:
> 2019-08-22  Yuliang Wang  <yuliang.wang@arm.com>
>

Please add a line here pointing at the PR:

	PR tree-optimization/89386

The commit hooks pick this up automatically and link the commit to the
bugzilla ticket.  (The PR was filed for SSSE3, but the target-independent
bits are still needed there.)

>         * config/aarch64/aarch64-sve2.md: support for SVE2 instructions [S/U]MULL[T/B] + [R]SHRN[T/B] and MULHRS pattern variants

Unfortunately the changelog format is pretty strict here.  Lines have
to be 80 chars or shorter, indented by tabs, and each pattern, function,
variable or type needs to be listed individually regardless of how
useful that seems.  So I think this should be something like:

	* config/aarch64/aarch64-sve2.md (<su>mull<bt><Vwide>)
	(<r>shrnb<mode>, <r>shrnt<mode>, <su>mulh<r>s<mode>3): New patterns.

(See below for why the "*" patterns aren't listed.)

>         * config/aarch64/iterators.md: iterators and attributes for above

Here too the iterators need to be listed:

	* config/aarch64/iterators.md (UNSPEC_SMULLB, UNSPEC_SMULLT)
	(UNSPEC_UMULLB, UNSPEC_UMULLT, UNSPEC_SHRNB, UNSPEC_SHRNT)
	(UNSPEC_RSHRNB, UNSPEC_RSHRNT, UNSPEC_SMULHS, UNSPEC_SMULHRS)
	(UNSPEC_UMULHS, UNSPEC_UMULHRS): New unspecs.
	(MULLBT, SHRNB, SHRNT, MULHRS): New int iterators.
	(su, r): Handle the unspecs above.
	(bt): New int attribute.

>         * internal-fn.def: internal functions for MULH[R]S patterns

	* internal-fn.def (IFN_MULHS, IFN_MULHRS): New internal functions.

>         * optabs.def: optabs definitions for above and sign variants

	* optabs.def (smulhs_optab, smulhrs_optab, umulhs_optab)
	(umulhrs_optab): New optabs.

>         * tree-vect-patterns.c (vect_recog_multhi_pattern): pattern recognition function for MULHRS

	* tree-vect-patterns.c (vect_recog_multhi_pattern): New function.
	(vect_vect_recog_func_ptrs): Add it.

>         * gcc.target/aarch64/sve2/mulhrs_1.c: new test for all variants

Just:

	* gcc.target/aarch64/sve2/mulhrs_1.c: New test.

(Sorry that this is so finicky.  I'm just the messenger. :-))

> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 2334e5a7b7dc524bbd1f4d0a48ba5cd991970118..51783604ad8f83eb1d070c133009ed41a2a0252d 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -63,3 +63,89 @@
>     movprfx\t%0, %2\;<sur>h<addsub>\t%0.<Vetype>, %1/m, %0.<Vetype>, %3.<Vetype>"
>    [(set_attr "movprfx" "*,yes")]
>  )
> +
> +;; Multiply long top / bottom

Very minor, but: GCC comments traditionally end with "." even if they're
not full sentences.

> +(define_insn "*<su>mull<bt><Vwide>"
> +  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
> +    (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")
> +			  (match_operand:SVE_BHSI 2 "register_operand" "w")]
> +			 MULLBT))]
> +  "TARGET_SVE2"
> +  "<su>mull<bt>\t%0.<Vewtype>, %1.<Vetype>, %2.<Vetype>"
> +)
> +
> +(define_expand "<su>mull<bt><Vwide>"
> +  [(set (match_operand:<VWIDE> 0 "register_operand")
> +    (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand")
> +			  (match_operand:SVE_BHSI 2 "register_operand")]
> +			 MULLBT))]
> +  "TARGET_SVE2"
> +  ""
> +)

It's more usual to put the define_expand first.  But in this case we
don't need separate define_expands, we can just make the define_insn the
named pattern itself:

;; Multiply long top / bottom.
(define_insn "<su>mull<bt><Vwide>"
  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
    (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")
		     (match_operand:SVE_BHSI 2 "register_operand" "w")]
		    MULLBT))]
  "TARGET_SVE2"
  "<su>mull<bt>\t%0.<Vewtype>, %1.<Vetype>, %2.<Vetype>"
)

> +
> +;; (Rounding) Right shift narrow bottom
> +(define_insn "*<r>shrnb<mode>"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand" "=w")
> +	  (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand" "w")
> +			  (match_operand 2 "aarch64_simd_shift_imm_offset_<Vel>" "i")]

This is more a personal thing, but I think it's better to leave out
the "i" (just remove it rather than replace it with "") when the operand
doesn't in fact accept all the things that "i" allows.

> +			 SHRNB))]
> +  "TARGET_SVE2"
> +  "<r>shrnb\t%0.<Vetype>, %1.<Vewtype>, #%2"
> +)
> +
> +(define_expand "<r>shrnb<mode>"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand")
> +	 (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand")
> +			  (match_operand 2 "aarch64_simd_shift_imm_offset_<Vel>")]
> +			 SHRNB))]
> +  "TARGET_SVE2"
> +  ""
> +)
> +
> +;; (Rounding) Right shift narrow top
> +(define_insn "*<r>shrnt<mode>"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand" "=w")
> +	  (unspec:SVE_BHSI [(match_operand:SVE_BHSI 1 "register_operand" "%0")

"%" indicates that operand 1 is commutative with operand 2, which isn't
the case here.
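
I.e. the constraint should just be "0":

  (match_operand:SVE_BHSI 1 "register_operand" "0")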

> +	      (match_operand:<VWIDE> 2 "register_operand" "w")
> +			  (match_operand 3 "aarch64_simd_shift_imm_offset_<Vel>" "i")]

Same comment about "i" here.

> +			 SHRNT))]
> +  "TARGET_SVE2"
> +  "<r>shrnt\t%0.<Vetype>, %2.<Vewtype>, #%3"
> +)
> +
> +(define_expand "<r>shrnt<mode>"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand")
> +	  (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand")
> +	      (match_operand:<VWIDE> 2 "register_operand")
> +			  (match_operand 3 "aarch64_simd_shift_imm_offset_<Vel>")]
> +			 SHRNT))]
> +  "TARGET_SVE2"
> +  ""
> +)

Each of the above two define_expands can also be removed and the "*"
removed from the define_insn.

> +;; Unpredicated integer multiply-high-with-(round-and-)scale
> +(define_expand "<su>mulh<r>s<mode>3"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand")
> +	(unspec:SVE_BHSI
> +	  [(match_dup 3)
> +	   (unspec:SVE_BHSI [(match_operand:SVE_BHSI 1 "register_operand")
> +			  (match_operand:SVE_BHSI 2 "register_operand")]
> +			 MULHRS)]
> +	  UNSPEC_PRED_X))]
> +  "TARGET_SVE2"
> +  {
> +    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));

These days this should be:

    operands[3] = aarch64_ptrue_reg (<VPRED>mode);

There's been a bit of churn around this, sorry.

> @@ -1585,6 +1597,13 @@
>  
>  (define_int_iterator RHADD [UNSPEC_SRHADD UNSPEC_URHADD])
>  
> +(define_int_iterator MULLBT [UNSPEC_SMULLB UNSPEC_UMULLB
> +            UNSPEC_SMULLT UNSPEC_UMULLT])

Minor formatting nit: should be indented one column beyond the "[".

> +
> +(define_int_iterator SHRNB [UNSPEC_SHRNB UNSPEC_RSHRNB])
> +
> +(define_int_iterator SHRNT [UNSPEC_SHRNT UNSPEC_RSHRNT])
> +
>  (define_int_iterator DOTPROD [UNSPEC_SDOT UNSPEC_UDOT])
>  
>  (define_int_iterator ADDSUBHN [UNSPEC_ADDHN UNSPEC_RADDHN
> @@ -1604,6 +1623,9 @@
>  
>  (define_int_iterator VQDMULH [UNSPEC_SQDMULH UNSPEC_SQRDMULH])
>  
> +(define_int_iterator MULHRS [UNSPEC_SMULHS UNSPEC_UMULHS
> +            UNSPEC_SMULHRS UNSPEC_UMULHRS])
> +

Same here.

>  (define_int_iterator USSUQADD [UNSPEC_SUQADD UNSPEC_USQADD])
>  
>  (define_int_iterator SUQMOVN [UNSPEC_SQXTN UNSPEC_UQXTN])
> @@ -1866,7 +1888,11 @@
>  		     (UNSPEC_COND_FCVTZS "s")
>  		     (UNSPEC_COND_FCVTZU "u")
>  		     (UNSPEC_COND_SCVTF "s")
> -		     (UNSPEC_COND_UCVTF "u")])
> +		     (UNSPEC_COND_UCVTF "u")
> +	       (UNSPEC_SMULLB "s") (UNSPEC_UMULLB "u")
> +	       (UNSPEC_SMULLT "s") (UNSPEC_UMULLT "u")
> +	       (UNSPEC_SMULHS "s") (UNSPEC_UMULHS "u")
> +	       (UNSPEC_SMULHRS "s") (UNSPEC_UMULHRS "u")])

These should line up with the existing entries.

>  
>  (define_int_attr sur [(UNSPEC_SHADD "s") (UNSPEC_UHADD "u")
>  		      (UNSPEC_SRHADD "sr") (UNSPEC_URHADD "ur")
> @@ -1904,6 +1930,10 @@
>                      (UNSPEC_SQRSHRN "r") (UNSPEC_UQRSHRN "r")
>                      (UNSPEC_SQSHL   "")  (UNSPEC_UQSHL  "")
>                      (UNSPEC_SQRSHL   "r")(UNSPEC_UQRSHL  "r")
> +		    (UNSPEC_SHRNB "") (UNSPEC_SHRNT "")
> +		    (UNSPEC_RSHRNB "r") (UNSPEC_RSHRNT "r")
> +	       (UNSPEC_SMULHS "") (UNSPEC_UMULHS "")
> +	       (UNSPEC_SMULHRS "r") (UNSPEC_UMULHRS "r")

Same here.

>  ])
>  
>  (define_int_attr lr [(UNSPEC_SSLI  "l") (UNSPEC_USLI  "l")
> @@ -1916,6 +1946,9 @@
>  		    (UNSPEC_SHADD "") (UNSPEC_UHADD "u")
>  		    (UNSPEC_SRHADD "") (UNSPEC_URHADD "u")])
>  
> +(define_int_attr bt [(UNSPEC_SMULLB "b") (UNSPEC_UMULLB "b")
> +		    (UNSPEC_SMULLT "t") (UNSPEC_UMULLT "t")])

This line should be indented one column further.

> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 5283e6753f26eb0da08a393949128d09643e50dd..adcd9477b4c4b2e73cad7c72ee51336e98e81954 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -342,6 +342,10 @@ OPTAB_D (udot_prod_optab, "udot_prod$I$a")
>  OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
>  OPTAB_D (usad_optab, "usad$I$a")
>  OPTAB_D (ssad_optab, "ssad$I$a")
> +OPTAB_D (smulhs_optab, "smulhs$a3")
> +OPTAB_D (smulhrs_optab, "smulhrs$a3")
> +OPTAB_D (umulhs_optab, "umulhs$a3")
> +OPTAB_D (umulhrs_optab, "umulhrs$a3")
>  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
>  OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
>  OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")

The new patterns need to be documented in doc/md.texi.
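
For example, a sketch along the lines of the existing entries (exact
wording up to you):

@cindex @code{smulhs@var{m}3} instruction pattern
@cindex @code{umulhs@var{m}3} instruction pattern
@item @samp{smulhs@var{m}3}, @samp{umulhs@var{m}3}
Signed/unsigned multiply high with scale.  This is equivalent to the C code:
@smallexample
narrow op0, op1, op2;
@dots{}
op0 = (narrow) (((wide) op1 * (wide) op2) >> (N / 2 - 1));
@end smallexample
where the sign of @samp{narrow} determines whether this is a signed
or unsigned operation and @var{N} is the size of @samp{wide} in bits.
The rounding forms @samp{smulhrs@var{m}3} and @samp{umulhrs@var{m}3}
would get a similar entry with:
@smallexample
op0 = (narrow) (((((wide) op1 * (wide) op2) >> (N / 2 - 2)) + 1) >> 1);
@end smallexample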

> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c b/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..98cdc73d5256d712c985c541c1a585d081e5f5ea
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c
> @@ -0,0 +1,123 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details --save-temps" } */
> +
> +#include <stdint.h>
> +
> +#define MULTHI(TYPE, BIGGER, RND)                     \
> +TYPE __attribute__ ((noinline, noclone))              \
> +mulhs_##TYPE##_##RND (TYPE *restrict x,               \
> +        TYPE *restrict y, TYPE *restrict z, int n)    \
> +{                                                     \
> +  for (int i = 0; i < n; i++)                         \
> +  {                                                   \
> +    z[i] = ((((BIGGER)x[i] * (BIGGER)y[i]) >>         \
> +            (sizeof(BIGGER)/2-(RND+1)) + RND) >> 1;   \

Missing ")" here.

> +  }                                                   \
> +}
> +
> +MULTHI (int8_t, int16_t, 0)
> +MULTHI (int16_t, int32_t, 0)
> +MULTHI (int32_t, int64_t, 0)
> +
> +MULTHI (uint8_t, uint16_t, 0)
> +MULTHI (uint16_t, uint32_t, 0)
> +MULTHI (uint32_t, uint64_t, 0)
> +
> +MULTHI (int8_t, int16_t, 1)
> +MULTHI (int16_t, int32_t, 1)
> +MULTHI (int32_t, int64_t, 1)
> +
> +MULTHI (uint8_t, uint16_t, 1)
> +MULTHI (uint16_t, uint32_t, 1)
> +MULTHI (uint32_t, uint64_t, 1)
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 12 "vect" } } */
> +
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tsmullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +                  \tshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +              } 
> +     1 } } */

Unfortunately, this kind of multi-line scan isn't supported in this
form: the dg-final has to be on a single line.  (Also unfortunately,
you don't get an error about this; the line is simply ignored.)

With scheduling enabled, only the order of the shrnb and shrnt is
guaranteed, so I think we should have separate scan-assembler-times
for each line.
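
I.e. something like this, one directive per instruction (counts based on
each opcode appearing in two of the functions above -- please double-check):

/* { dg-final { scan-assembler-times {\tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n} 2 } } */
/* { dg-final { scan-assembler-times {\tsmullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n} 2 } } */
/* { dg-final { scan-assembler-times {\tshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n} 2 } } */
/* { dg-final { scan-assembler-times {\tshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n} 2 } } */

and so on for the other element sizes and the umull*/rshrn* combinations.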

> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tsmullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +                  \tshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tsmullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +                  \tshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +              } 
> +     1 } } */
> +     
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tumullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +                  \tshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tumullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +                  \tshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tumullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +                  \tshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +              } 
> +     1 } } */
> +     
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tsmullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \trshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +                  \trshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tsmullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \trshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +                  \trshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tsmullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \trshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +                  \trshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +              } 
> +     1 } } */
> +     
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tumullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \trshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +                  \trshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tumullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \trshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +                  \trshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tumullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \trshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +                  \trshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +              } 
> +     1 } } */
> +     
> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> index ccb2e1edecda09db786c0d98ccd25f5be107c7e4..96e86f716da0b819640cca2ff5d01c3ce8b342e3 100644
> --- a/gcc/tree-vect-patterns.c
> +++ b/gcc/tree-vect-patterns.c
> @@ -1723,6 +1723,169 @@ vect_recog_over_widening_pattern (stmt_vec_info last_stmt_info, tree *type_out)
>    return pattern_stmt;
>  }
>  
> +/* Recognize the following patterns:
> +
> +	    ATYPE a;  // narrower than TYPE
> +	    BTYPE b;  // narrower than TYPE
> +
> +	    // multiply high with scaling
> +	    1) TYPE res = ((TYPE) a * (TYPE) b) >> c;
> +	    // or also with rounding
> +	    2) TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
> +
> +	where only the bottom half of res is used.  */

Should be indented three spaces only.

> +
> +static gimple *
> +vect_recog_multhi_pattern (stmt_vec_info last_stmt_info, tree *type_out)

Maybe "vect_recog_mulhs_pattern", for consistency with the IFN names?

> +{
> +	vec_info *vinfo;

Might as well delay declaring this until...

> +
> +  /* Check for a right shift.  */
> +  gassign *last_stmt = dyn_cast <gassign *> (last_stmt_info->stmt);
> +  if (!last_stmt
> +      || gimple_assign_rhs_code (last_stmt) != RSHIFT_EXPR)
> +    return NULL;
> +  vinfo = last_stmt_info->vinfo;

...here.  (A lot of existing GCC code predates even C99 and so doesn't
do this, but it's more usual for newer code.)
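
I.e. just:

  vec_info *vinfo = last_stmt_info->vinfo;

at this point.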

> +
> +  /* Check that the shift result is wider than the users of the
> +     result need (i.e. that narrowing would be a natural choice).  */
> +  tree lhs = gimple_assign_lhs (last_stmt);
> +  tree lhs_type = TREE_TYPE (lhs);
> +  unsigned int target_precision =
> +    vect_element_precision (last_stmt_info->min_output_precision);

GCC formatting puts operators at the start rather than the end of a line:

  unsigned int target_precision
    = vect_element_precision (last_stmt_info->min_output_precision);

Same for the rest of the function.

> +  if (!INTEGRAL_TYPE_P (lhs_type)
> +      || target_precision >= TYPE_PRECISION (lhs_type))
> +    return NULL;
> +
> +  /* Look through any change in sign on the outer shift input.  */
> +  vect_unpromoted_value unprom_input;
> +  tree rhs_rshift_inp = vect_look_through_possible_promotion (vinfo,
> +          gimple_assign_rhs1 (last_stmt), &unprom_input);

Formatting:

  tree rhs_rshift_inp = vect_look_through_possible_promotion
    (vinfo, gimple_assign_rhs1 (last_stmt), &unprom_input);

> +  if (!rhs_rshift_inp
> +      || TYPE_PRECISION (TREE_TYPE (rhs_rshift_inp))
> +         != TYPE_PRECISION (lhs_type))
> +    return NULL;
> +
> +  /* Get the definition of the shift input.  */
> +  stmt_vec_info input_stmt_info =
> +      vect_get_internal_def (vinfo, rhs_rshift_inp);
> +  if (!input_stmt_info)
> +    return NULL;
> +  gassign *input_stmt = dyn_cast <gassign *> (input_stmt_info->stmt);
> +  if (!input_stmt)
> +    return NULL;
> +
> +  stmt_vec_info mulhs_stmt_info;
> +  internal_fn ifn;
> +  unsigned int expect_offset;
> +
> +  /* Check for presence of rounding term.  */
> +  if (gimple_assign_rhs_code (input_stmt) == PLUS_EXPR)
> +  {
> +    /* Check that the outer shift was by 1.  */
> +    if (!integer_onep (gimple_assign_rhs2 (last_stmt)))
> +      return NULL;

Brace should be indented two columns beyond the "if", giving:

  if (gimple_assign_rhs_code (input_stmt) == PLUS_EXPR)
    {
      /* Check that the outer shift was by 1.  */
      if (!integer_onep (gimple_assign_rhs2 (last_stmt)))
	return NULL;

etc.

> +
> +    /* Check that one of the operands is 1.  */
> +    tree oprnds[2] = { gimple_assign_rhs1 (input_stmt),
> +                       gimple_assign_rhs2 (input_stmt) };
> +    bool isone[2] = { integer_onep (oprnds[0]),
> +                      integer_onep (oprnds[1]) };
> +    if (!(isone[0] ^ isone[1]))
> +      return NULL;

In a PLUS_EXPR, any constant operand has to come second, so this only
needs to check integer_onep (oprnds[1]).
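
So the whole block could be reduced to something like (untested):

      /* Check that the addend is 1.  */
      if (!integer_onep (gimple_assign_rhs2 (input_stmt)))
	return NULL;
      tree mulhs = gimple_assign_rhs1 (input_stmt);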

> +    tree mulhs = oprnds[(int)(isone[0])];
> +
> +    mulhs_stmt_info = vect_get_internal_def (vinfo, mulhs);
> +    if (!mulhs_stmt_info)
> +      return NULL;
> +
> +    expect_offset = target_precision + 2;
> +    ifn = IFN_MULHRS;
> +  }
> +  else
> +  {
> +    mulhs_stmt_info = last_stmt_info;
> +
> +    expect_offset = target_precision + 1;
> +    ifn = IFN_MULHS;
> +  }
> +
> +  /* Get the definition of the multiply-high-scale part.  */
> +  gassign *mulhs_stmt = dyn_cast <gassign *> (mulhs_stmt_info->stmt);
> +  if (!mulhs_stmt
> +      || gimple_assign_rhs_code (mulhs_stmt) != RSHIFT_EXPR)
> +    return NULL;

I think this should go in the PLUS_EXPR arm, since we've already
checked it for the "else" path.

> +
> +  /* Get the scaling factor.  */
> +  tree rhs_scale = gimple_assign_rhs2 (mulhs_stmt);
> +  tree rhs_scale_type = TREE_TYPE (rhs_scale);
> +  if (TREE_CODE (rhs_scale) != INTEGER_CST
> +      || TREE_CODE (rhs_scale_type) != INTEGER_TYPE
> +      || !type_has_mode_precision_p (rhs_scale_type))
> +    return NULL;

Just:

  /* Get the scaling factor.  */
  tree rhs_scale = gimple_assign_rhs2 (mulhs_stmt);
  if (TREE_CODE (rhs_scale) != INTEGER_CST)
    return NULL;

would be enough.

> +  /* Get the definition of the scaling input term.  */
> +  tree rhs_mulh = gimple_assign_rhs1 (mulhs_stmt);
> +  tree rhs_mulh_type = TREE_TYPE (rhs_mulh);
> +  if (!INTEGRAL_TYPE_P (rhs_mulh_type))
> +    return NULL;

This is guaranteed to be true, no need to check.

> +  stmt_vec_info mulh_stmt_info = vect_get_internal_def (vinfo, rhs_mulh);
> +  if (!mulh_stmt_info)
> +    return NULL;
> +
> +  /* Check that the shift represents the correct scaling.  */
> +  if (wi::ne_p(wi::to_widest(rhs_scale) + expect_offset,
> +          TYPE_PRECISION (rhs_mulh_type)))
> +    return NULL;

At this point it's guaranteed that:

  TYPE_PRECISION (rhs_mulh_type) == TYPE_PRECISION (lhs_type)

I think it would be better to use lhs_type for the condition and check the
value as part of the "TREE_CODE (rhs_scale) != INTEGER_CST" condition
above.  You can use != instead of wi::ne_p.
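
I.e. something like (untested):

  /* Get the scaling factor and check that it represents the expected
     scaling.  */
  tree rhs_scale = gimple_assign_rhs2 (mulhs_stmt);
  if (TREE_CODE (rhs_scale) != INTEGER_CST
      || wi::to_widest (rhs_scale) + expect_offset
	 != TYPE_PRECISION (lhs_type))
    return NULL;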

> +
> +  /* Check whether the scaling input term can be seen as two widened
> +     inputs multiplied together.  */
> +  vect_unpromoted_value unprom_mult[2];
> +  tree new_type;
> +  unsigned int nops = vect_widened_op_tree (mulh_stmt_info, MULT_EXPR,
> +					    WIDEN_MULT_EXPR, false, 2,
> +					    unprom_mult, &new_type);
> +  if (nops != 2)
> +    return NULL;
> +
> +  vect_pattern_detected ("vect_recog_multhi_pattern", last_stmt);
> +
> +  /* Adjust output precision.  */
> +  if (TYPE_PRECISION (new_type) < target_precision)
> +    new_type = build_nonstandard_integer_type (target_precision,
> +					       TYPE_UNSIGNED (new_type));
> +
> +  /* Check for target support.  */
> +  tree new_vectype = get_vectype_for_scalar_type (new_type);
> +  if (!new_vectype
> +      || !direct_internal_fn_supported_p (ifn, new_vectype,
> +					  OPTIMIZE_FOR_SPEED))
> +    return NULL;
> +
> +  /* The IR requires a valid vector type for the cast result, even though
> +     it's likely to be discarded.  */
> +  *type_out = get_vectype_for_scalar_type (lhs_type);
> +  if (!*type_out)
> +    return NULL;
> +
> +  /* Generate the IFN_MULHRS call.  */
> +  tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
> +  tree new_ops[2];
> +  vect_convert_inputs (last_stmt_info, 2, new_ops, new_type,
> +          unprom_mult, new_vectype);

Arguments should be indented one character beyond the "(".

> +  gcall *mulhrs_stmt = gimple_build_call_internal (ifn, 2,
> +          new_ops[0], new_ops[1]);

Same here.

> +  gimple_call_set_lhs (mulhrs_stmt, new_var);
> +  gimple_set_location (mulhrs_stmt, gimple_location (last_stmt));
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +		     "created pattern stmt: %G", mulhrs_stmt);
> +
> +  return vect_convert_output (last_stmt_info, lhs_type,
> +          mulhrs_stmt, new_vectype);

Here too.

It would be good to have some tests in gcc.dg/vect too, along the
lines of gcc.dg/vect/vect-avg-[1-4].c.
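
E.g. something along these lines for the HImode rounding case (untested
sketch; the dump string assumes the pattern function is renamed to
vect_recog_mulhs_pattern as suggested above):

/* { dg-require-effective-target vect_int } */

#include "tree-vect.h"

#define N 50

void __attribute__ ((noipa))
f (short *restrict a, short *restrict b, short *restrict c)
{
  for (int i = 0; i < N; ++i)
    a[i] = ((((int) b[i] * (int) c[i]) >> 14) + 1) >> 1;
}

int
main (void)
{
  check_vect ();
  short a[N], b[N], c[N];
  for (int i = 0; i < N; ++i)
    {
      b[i] = i * 337;
      c[i] = i * -300;
      asm volatile ("" ::: "memory");
    }
  f (a, b, c);
  for (int i = 0; i < N; ++i)
    if (a[i] != ((((int) b[i] * c[i]) >> 14) + 1) >> 1)
      __builtin_abort ();
  return 0;
}

/* { dg-final { scan-tree-dump "vect_recog_mulhs_pattern: detected" "vect" } } */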

For the record, I think in principle we could look through other sign
conversions too.  E.g. we might have (pseudo-gimple):

  int _2 = _1 >> 14;
  unsigned int _3 = (unsigned int) _2;
  unsigned int _4 = _3 + 1;
  int _5 = (int) _4;
  int _6 = _5 >> 1;

This guarantees that overflow in the addition of _4 is well-defined.
Or we might have a sign change on the input to the shift.  But I think
those cases should be treated as potential follow-on improvements if
we see a case that needs them (probably via a new helper to make looking
through sign conversions a bit easier).  It doesn't need to be part of
the initial patch.

Thanks,
Richard


* RE: [PATCH][AArch64] Vectorize MULH(R)S patterns with SVE2 instructions
  2019-08-30 12:54 ` Richard Sandiford
@ 2019-09-12  7:52   ` Yuliang Wang
  2019-09-12 10:02     ` Richard Sandiford
  0 siblings, 1 reply; 4+ messages in thread
From: Yuliang Wang @ 2019-09-12  7:52 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 26776 bytes --]

Hi Richard,

Thanks for your comments and advice; I have applied the relevant changes.

Regards,
Yuliang


UPDATE:

Added new tests. Built and regression tested on aarch64-none-elf and aarch64-linux-gnu.

gcc/ChangeLog:

2019-09-01  Yuliang Wang  <yuliang.wang@arm.com>

	PR tree-optimization/89386

	* config/aarch64/aarch64-sve2.md (<su>mull<bt><Vwide>)
	(<r>shrnb<mode>, <r>shrnt<mode>): New SVE2 patterns.
	(<su>mulh<r>s<mode>3): New pattern for MULHRS.
	* config/aarch64/iterators.md (UNSPEC_SMULLB, UNSPEC_SMULLT)
	(UNSPEC_UMULLB, UNSPEC_UMULLT, UNSPEC_SHRNB, UNSPEC_SHRNT)
	(UNSPEC_RSHRNB, UNSPEC_RSHRNT, UNSPEC_SMULHS, UNSPEC_SMULHRS)
	(UNSPEC_UMULHS, UNSPEC_UMULHRS): New unspecs.
	(MULLBT, SHRNB, SHRNT, MULHRS): New int iterators.
	(su, r): Handle the unspecs above.
	(bt): New int attribute.
	* internal-fn.def (IFN_MULHS, IFN_MULHRS): New internal functions.
	* internal-fn.c (first_commutative_argument): Handle MULHS and MULHRS.
	* optabs.def (smulhs_optab, smulhrs_optab, umulhs_optab)
	(umulhrs_optab): New optabs.
	* doc/md.texi (smulhs@var{m}3, umulhs@var{m}3)
	(smulhrs@var{m}3, umulhrs@var{m}3): Document the above.
	* tree-vect-patterns.c (vect_recog_mulhs_pattern): New pattern function.
	(vect_vect_recog_func_ptrs): Add it.
	* testsuite/gcc.target/aarch64/sve2/mulhrs_1.c: New test.
	* testsuite/gcc.dg/vect/vect-mulhrs-1.c: As above.
	* testsuite/gcc.dg/vect/vect-mulhrs-2.c: As above.
	* testsuite/gcc.dg/vect/vect-mulhrs-3.c: As above.
	* testsuite/gcc.dg/vect/vect-mulhrs-4.c: As above.
	* doc/sourcebuild.texi (vect_mulhrs_hi): Document new target selector.
	* testsuite/lib/target-supports.exp
	(check_effective_target_vect_mulhrs_hi): Return true for AArch64
	with SVE2.


-----Original Message-----
From: Richard Sandiford <richard.sandiford@arm.com> 
Sent: 30 August 2019 12:49
To: Yuliang Wang <Yuliang.Wang@arm.com>
Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
Subject: Re: [PATCH][AArch64] Vectorize MULH(R)S patterns with SVE2 instructions

Thanks for doing this.  The patch looks good, so this review is mostly a list of very minor formatting comments, sorry.

Yuliang Wang <Yuliang.Wang@arm.com> writes:
> 2019-08-22  Yuliang Wang  <yuliang.wang@arm.com>
>

Please add a line here pointing at the PR:

	PR tree-optimization/89386

The commit hooks pick this up automatically and link the commit to the bugzilla ticket.  (The PR was filed for SSSE3, but the target-independent bits are still needed there.)

>         * config/aarch64/aarch64-sve2.md: support for SVE2 
> instructions [S/U]MULL[T/B] + [R]SHRN[T/B] and MULHRS pattern variants

Unfortunately the changelog format is pretty strict here.  Lines have to be 80 chars or shorter, indented by tabs, and each pattern, function, variable or type needs to be listed individually regardless of how useful that seems.  So I think this should be something like:

	* config/aarch64/aarch64-sve2.md (<su>mull<bt><Vwide>)
	(<r>shrnb<mode>, <r>shrnt<mode>, <su>mulh<r>s<mode>3): New patterns.

(See below for why the "*" patterns aren't listed.)

>         * config/aarch64/iterators.md: iterators and attributes for 
> above

Here too the iterators need to be listed:

	* config/aarch64/iterators.md (UNSPEC_SMULLB, UNSPEC_SMULLT)
	(UNSPEC_UMULLB, UNSPEC_UMULLT, UNSPEC_SHRNB, UNSPEC_SHRNT)
	(UNSPEC_RSHRNB, UNSPEC_RSHRNT, UNSPEC_SMULHS, UNSPEC_SMULHRS)
	UNSPEC_UMULHS, UNSPEC_UMULHRS): New unspecs.
	(MULLBT, SHRNB, SHRNT, MULHRS): New int iterators.
	(su, r): Handle the unspecs above.
	(bt): New int attribute.

>         * internal-fn.def: internal functions for MULH[R]S patterns

	* internal-fn.def (IFN_MULHS, IFN_MULHRS): New internal functions.

>         * optabs.def: optabs definitions for above and sign variants

	* optabs.def (smulhs_optab, smulhrs_optab, umulhs_optab)
	(umulhrs_optab): New optabs.

>         * tree-vect-patterns.c (vect_recog_multhi_pattern): pattern 
> recognition function for MULHRS

	* tree-vect-patterns.c (vect_recog_multhi_pattern): New function.
	(vect_vect_recog_func_ptrs): Add it.

>         * gcc.target/aarch64/sve2/mulhrs_1.c: new test for all 
> variants

Just:

	* gcc.target/aarch64/sve2/mulhrs_1.c: New test.

(Sorry that this is so finicky.  I'm just the messenger. :-))

> diff --git a/gcc/config/aarch64/aarch64-sve2.md 
> b/gcc/config/aarch64/aarch64-sve2.md
> index 
> 2334e5a7b7dc524bbd1f4d0a48ba5cd991970118..51783604ad8f83eb1d070c133009
> ed41a2a0252d 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -63,3 +63,89 @@
>     movprfx\t%0, %2\;<sur>h<addsub>\t%0.<Vetype>, %1/m, %0.<Vetype>, %3.<Vetype>"
>    [(set_attr "movprfx" "*,yes")]
>  )
> +
> +;; Multiply long top / bottom

Very minor, but: GCC comments traditionally end with "." even if they're not full sentences.

> +(define_insn "*<su>mull<bt><Vwide>"
> +  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
> +    (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")
> +			  (match_operand:SVE_BHSI 2 "register_operand" "w")]
> +			 MULLBT))]
> +  "TARGET_SVE2"
> +  "<su>mull<bt>\t%0.<Vewtype>, %1.<Vetype>, %2.<Vetype>"
> +)
> +
> +(define_expand "<su>mull<bt><Vwide>"
> +  [(set (match_operand:<VWIDE> 0 "register_operand")
> +    (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand")
> +			  (match_operand:SVE_BHSI 2 "register_operand")]
> +			 MULLBT))]
> +  "TARGET_SVE2"
> +  ""
> +)

It's more usual to put the define_expand first.  But in this case we don't need separate define_expands, we can just make the define_insn the named pattern itself:

;; Multiply long top / bottom.
(define_insn "<su>mull<bt><Vwide>"
  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
    (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")
		     (match_operand:SVE_BHSI 2 "register_operand" "w")]
		    MULLBT))]
  "TARGET_SVE2"
  "<su>mull<bt>\t%0.<Vewtype>, %1.<Vetype>, %2.<Vetype>"
)

> +
> +;; (Rounding) Right shift narrow bottom (define_insn 
> +"*<r>shrnb<mode>"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand" "=w")
> +	  (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand" "w")
> +			  (match_operand 2 "aarch64_simd_shift_imm_offset_<Vel>" "i")]

This is more a personal thing, but I think it's better to leave out the "i" (just remove it rather than replace it with "") when the operand doesn't in fact accept all the things that "i" allows.

> +			 SHRNB))]
> +  "TARGET_SVE2"
> +  "<r>shrnb\t%0.<Vetype>, %1.<Vewtype>, #%2"
> +)
> +
> +(define_expand "<r>shrnb<mode>"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand")
> +	 (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand")
> +			  (match_operand 2 "aarch64_simd_shift_imm_offset_<Vel>")]
> +			 SHRNB))]
> +  "TARGET_SVE2"
> +  ""
> +)
> +
> +;; (Rounding) Right shift narrow top
> +(define_insn "*<r>shrnt<mode>"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand" "=w")
> +	  (unspec:SVE_BHSI [(match_operand:SVE_BHSI 1 "register_operand" 
> +"%0")

"%" indicates that operand 1 is commutative with operand 2, which isn't the case here.

> +	      (match_operand:<VWIDE> 2 "register_operand" "w")
> +			  (match_operand 3 "aarch64_simd_shift_imm_offset_<Vel>" "i")]

Same comment about "i" here.

> +			 SHRNT))]
> +  "TARGET_SVE2"
> +  "<r>shrnt\t%0.<Vetype>, %2.<Vewtype>, #%3"
> +)
> +
> +(define_expand "<r>shrnt<mode>"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand")
> +	  (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand")
> +	      (match_operand:<VWIDE> 2 "register_operand")
> +			  (match_operand 3 "aarch64_simd_shift_imm_offset_<Vel>")]
> +			 SHRNT))]
> +  "TARGET_SVE2"
> +  ""
> +)

Each of the above two define_expands can also be removed and the "*"
removed from the define_insn.

> +;; Unpredicated integer multiply-high-with-(round-and-)scale
> +(define_expand "<su>mulh<r>s<mode>3"
> +  [(set (match_operand:SVE_BHSI 0 "register_operand")
> +	(unspec:SVE_BHSI
> +	  [(match_dup 3)
> +	   (unspec:SVE_BHSI [(match_operand:SVE_BHSI 1 "register_operand")
> +			  (match_operand:SVE_BHSI 2 "register_operand")]
> +			 MULHRS)]
> +	  UNSPEC_PRED_X))]
> +  "TARGET_SVE2"
> +  {
> +    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));

These days this should be:

    operands[3] = aarch64_ptrue_reg (<VPRED>mode);

There's been a bit of churn around this, sorry.

> @@ -1585,6 +1597,13 @@
>  
>  (define_int_iterator RHADD [UNSPEC_SRHADD UNSPEC_URHADD])
>  
> +(define_int_iterator MULLBT [UNSPEC_SMULLB UNSPEC_UMULLB
> +            UNSPEC_SMULLT UNSPEC_UMULLT])

Minor formatting nit: should be indented one column beyond the "[".
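I.e.:

(define_int_iterator MULLBT [UNSPEC_SMULLB UNSPEC_UMULLB
                             UNSPEC_SMULLT UNSPEC_UMULLT])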

> +
> +(define_int_iterator SHRNB [UNSPEC_SHRNB UNSPEC_RSHRNB])
> +
> +(define_int_iterator SHRNT [UNSPEC_SHRNT UNSPEC_RSHRNT])
> +
>  (define_int_iterator DOTPROD [UNSPEC_SDOT UNSPEC_UDOT])
>  
>  (define_int_iterator ADDSUBHN [UNSPEC_ADDHN UNSPEC_RADDHN
> @@ -1604,6 +1623,9 @@
>  
>  (define_int_iterator VQDMULH [UNSPEC_SQDMULH UNSPEC_SQRDMULH])
>  
> +(define_int_iterator MULHRS [UNSPEC_SMULHS UNSPEC_UMULHS
> +            UNSPEC_SMULHRS UNSPEC_UMULHRS])
> +

Same here.

>  (define_int_iterator USSUQADD [UNSPEC_SUQADD UNSPEC_USQADD])
>  
>  (define_int_iterator SUQMOVN [UNSPEC_SQXTN UNSPEC_UQXTN])
> @@ -1866,7 +1888,11 @@
>  		     (UNSPEC_COND_FCVTZS "s")
>  		     (UNSPEC_COND_FCVTZU "u")
>  		     (UNSPEC_COND_SCVTF "s")
> -		     (UNSPEC_COND_UCVTF "u")])
> +		     (UNSPEC_COND_UCVTF "u")
> +	       (UNSPEC_SMULLB "s") (UNSPEC_UMULLB "u")
> +	       (UNSPEC_SMULLT "s") (UNSPEC_UMULLT "u")
> +	       (UNSPEC_SMULHS "s") (UNSPEC_UMULHS "u")
> +	       (UNSPEC_SMULHRS "s") (UNSPEC_UMULHRS "u")])

These should line up with the existing entries.

>  
>  (define_int_attr sur [(UNSPEC_SHADD "s") (UNSPEC_UHADD "u")
>  		      (UNSPEC_SRHADD "sr") (UNSPEC_URHADD "ur")
> @@ -1904,6 +1930,10 @@
>                      (UNSPEC_SQRSHRN "r") (UNSPEC_UQRSHRN "r")
>                      (UNSPEC_SQSHL   "")  (UNSPEC_UQSHL  "")
>                      (UNSPEC_SQRSHL   "r")(UNSPEC_UQRSHL  "r")
> +		    (UNSPEC_SHRNB "") (UNSPEC_SHRNT "")
> +		    (UNSPEC_RSHRNB "r") (UNSPEC_RSHRNT "r")
> +	       (UNSPEC_SMULHS "") (UNSPEC_UMULHS "")
> +	       (UNSPEC_SMULHRS "r") (UNSPEC_UMULHRS "r")

Same here.

>  ])
>  
>  (define_int_attr lr [(UNSPEC_SSLI  "l") (UNSPEC_USLI  "l")
> @@ -1916,6 +1946,9 @@
>  		    (UNSPEC_SHADD "") (UNSPEC_UHADD "u")
>  		    (UNSPEC_SRHADD "") (UNSPEC_URHADD "u")])
>  
> +(define_int_attr bt [(UNSPEC_SMULLB "b") (UNSPEC_UMULLB "b")
> +		    (UNSPEC_SMULLT "t") (UNSPEC_UMULLT "t")])

This line should be indented one column further.

> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 5283e6753f26eb0da08a393949128d09643e50dd..adcd9477b4c4b2e73cad7c72ee51336e98e81954 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -342,6 +342,10 @@ OPTAB_D (udot_prod_optab, "udot_prod$I$a")
>  OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
>  OPTAB_D (usad_optab, "usad$I$a")
>  OPTAB_D (ssad_optab, "ssad$I$a")
> +OPTAB_D (smulhs_optab, "smulhs$a3")
> +OPTAB_D (smulhrs_optab, "smulhrs$a3")
> +OPTAB_D (umulhs_optab, "umulhs$a3")
> +OPTAB_D (umulhrs_optab, "umulhrs$a3")
>  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
>  OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
>  OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")

The new patterns need to be documented in doc/md.texi.
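E.g. something along the lines of (for the non-rounding signed case):

@cindex @code{smulhs@var{m3}} instruction pattern
@item @samp{smulhs@var{m3}}
Signed multiply high with scale, i.e.:
@smallexample
op0 = (narrow) (((wide) op1 * (wide) op2) >> (N / 2 - 1));
@end smallexample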

> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c b/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..98cdc73d5256d712c985c541c1a585d081e5f5ea
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c
> @@ -0,0 +1,123 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details 
> +--save-temps" } */
> +
> +#include <stdint.h>
> +
> +#define MULTHI(TYPE, BIGGER, RND)                     \
> +TYPE __attribute__ ((noinline, noclone))              \
> +mulhs_##TYPE##_##RND (TYPE *restrict x,               \
> +        TYPE *restrict y, TYPE *restrict z, int n)    \
> +{                                                     \
> +  for (int i = 0; i < n; i++)                         \
> +  {                                                   \
> +    z[i] = ((((BIGGER)x[i] * (BIGGER)y[i]) >>         \
> +            (sizeof(BIGGER)/2-(RND+1)) + RND) >> 1;   \

Missing ")" here.

> +  }                                                   \
> +}
> +
> +MULTHI (int8_t, int16_t, 0)
> +MULTHI (int16_t, int32_t, 0)
> +MULTHI (int32_t, int64_t, 0)
> +
> +MULTHI (uint8_t, uint16_t, 0)
> +MULTHI (uint16_t, uint32_t, 0)
> +MULTHI (uint32_t, uint64_t, 0)
> +
> +MULTHI (int8_t, int16_t, 1)
> +MULTHI (int16_t, int32_t, 1)
> +MULTHI (int32_t, int64_t, 1)
> +
> +MULTHI (uint8_t, uint16_t, 1)
> +MULTHI (uint16_t, uint32_t, 1)
> +MULTHI (uint32_t, uint64_t, 1)
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 
> +12 "vect" } } */
> +
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tsmullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +                  \tshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +              } 
> +     1 } } */

Unfortunately, this kind of multi-line scan isn't supported in this
form: the dg-final has to be on a single line.  (Also unfortunately, you don't get an error about this; the line is simply ignored.)

With scheduling enabled, only the order of the shrnb and shrnt is guaranteed, so I think we should have separate scan-assembler-times for each line.
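E.g. one single-line directive per instruction, with the counts adjusted
to match the number of functions that use it:

/* { dg-final { scan-assembler-times {\tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n} 2 } } */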

> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tsmullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +                  \tshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tsmullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +                  \tshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +              } 
> +     1 } } */
> +     
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tumullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +                  \tshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tumullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +                  \tshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tumullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +                  \tshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +              } 
> +     1 } } */
> +     
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tsmullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \trshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +                  \trshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tsmullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \trshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +                  \trshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tsmullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tsmullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \trshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +                  \trshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +              } 
> +     1 } } */
> +     
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \tumullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n
> +                  \trshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +                  \trshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \tumullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n
> +                  \trshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +                  \trshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n
> +              } 
> +     1 } } */
> +/* { dg-final { scan-assembler-times {
> +                  \tumullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \tumullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n
> +                  \trshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +                  \trshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n
> +              } 
> +     1 } } */
> +     
> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> index ccb2e1edecda09db786c0d98ccd25f5be107c7e4..96e86f716da0b819640cca2ff5d01c3ce8b342e3 100644
> --- a/gcc/tree-vect-patterns.c
> +++ b/gcc/tree-vect-patterns.c
> @@ -1723,6 +1723,169 @@ vect_recog_over_widening_pattern (stmt_vec_info last_stmt_info, tree *type_out)
>    return pattern_stmt;
>  }
>  
> +/* Recognize the following patterns:
> +
> +	    ATYPE a;  // narrower than TYPE
> +	    BTYPE b;  // narrower than TYPE
> +
> +	    // multiply high with scaling
> +	    1) TYPE res = ((TYPE) a * (TYPE) b) >> c;
> +	    // or also with rounding
> +	    2) TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
> +
> +	where only the bottom half of res is used.  */

Should be indented three spaces only.

> +
> +static gimple *
> +vect_recog_multhi_pattern (stmt_vec_info last_stmt_info, tree *type_out)

Maybe "vect_recog_mulhs_pattern", for consistency with the IFN names?

> +{
> +	vec_info *vinfo;

Might as well delay declaring this until...

> +
> +  /* Check for a right shift.  */
> +  gassign *last_stmt = dyn_cast <gassign *> (last_stmt_info->stmt);
> +  if (!last_stmt
> +      || gimple_assign_rhs_code (last_stmt) != RSHIFT_EXPR)
> +    return NULL;
> +  vinfo = last_stmt_info->vinfo;

...here.  (A lot of existing GCC code predates even C99 and so doesn't do this, but it's more usual for newer code.)

> +
> +  /* Check that the shift result is wider than the users of the
> +     result need (i.e. that narrowing would be a natural choice).  */
> +  tree lhs = gimple_assign_lhs (last_stmt);
> +  tree lhs_type = TREE_TYPE (lhs);
> +  unsigned int target_precision =
> +    vect_element_precision (last_stmt_info->min_output_precision);

GCC formatting puts operators at the start rather than the end of a line:

  unsigned int target_precision
    = vect_element_precision (last_stmt_info->min_output_precision);

Same for the rest of the function.

> +  if (!INTEGRAL_TYPE_P (lhs_type)
> +      || target_precision >= TYPE_PRECISION (lhs_type))
> +    return NULL;
> +
> +  /* Look through any change in sign on the outer shift input.  */
> +  vect_unpromoted_value unprom_input;
> +  tree rhs_rshift_inp = vect_look_through_possible_promotion (vinfo,
> +          gimple_assign_rhs1 (last_stmt), &unprom_input);

Formatting:

  tree rhs_rshift_inp = vect_look_through_possible_promotion
    (vinfo, gimple_assign_rhs1 (last_stmt), &unprom_input);

> +  if (!rhs_rshift_inp
> +      || TYPE_PRECISION (TREE_TYPE (rhs_rshift_inp))
> +         != TYPE_PRECISION (lhs_type))
> +    return NULL;
> +
> +  /* Get the definition of the shift input.  */
> +  stmt_vec_info input_stmt_info =
> +      vect_get_internal_def (vinfo, rhs_rshift_inp);
> +  if (!input_stmt_info)
> +    return NULL;
> +  gassign *input_stmt = dyn_cast <gassign *> (input_stmt_info->stmt);
> +  if (!input_stmt)
> +    return NULL;
> +
> +  stmt_vec_info mulhs_stmt_info;
> +  internal_fn ifn;
> +  unsigned int expect_offset;
> +
> +  /* Check for presence of rounding term.  */
> +  if (gimple_assign_rhs_code (input_stmt) == PLUS_EXPR)
> +  {
> +    /* Check that the outer shift was by 1.  */
> +    if (!integer_onep (gimple_assign_rhs2 (last_stmt)))
> +      return NULL;

Brace should be indented two columns beyond the "if", giving:

  if (gimple_assign_rhs_code (input_stmt) == PLUS_EXPR)
    {
      /* Check that the outer shift was by 1.  */
      if (!integer_onep (gimple_assign_rhs2 (last_stmt)))
	return NULL;

etc.

> +
> +    /* Check that one of the operands is 1.  */
> +    tree oprnds[2] = { gimple_assign_rhs1 (input_stmt),
> +                       gimple_assign_rhs2 (input_stmt) };
> +    bool isone[2] = { integer_onep (oprnds[0]),
> +                      integer_onep (oprnds[1]) };
> +    if (!(isone[0] ^ isone[1]))
> +      return NULL;

In a PLUS_EXPR, any constant operand has to come second, so this only needs to check integer_onep (oprnds[1]).
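So this can be reduced to something like:

    /* Check that the second operand of the PLUS_EXPR is 1.  */
    if (!integer_onep (gimple_assign_rhs2 (input_stmt)))
      return NULL;
    tree mulhs = gimple_assign_rhs1 (input_stmt);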

> +    tree mulhs = oprnds[(int)(isone[0])];
> +
> +    mulhs_stmt_info = vect_get_internal_def (vinfo, mulhs);
> +    if (!mulhs_stmt_info)
> +      return NULL;
> +
> +    expect_offset = target_precision + 2;
> +    ifn = IFN_MULHRS;
> +  }
> +  else
> +  {
> +    mulhs_stmt_info = last_stmt_info;
> +
> +    expect_offset = target_precision + 1;
> +    ifn = IFN_MULHS;
> +  }
> +
> +  /* Get the definition of the multiply-high-scale part.  */
> +  gassign *mulhs_stmt = dyn_cast <gassign *> (mulhs_stmt_info->stmt);
> +  if (!mulhs_stmt
> +      || gimple_assign_rhs_code (mulhs_stmt) != RSHIFT_EXPR)
> +    return NULL;

I think this should go in the PLUS_EXPR arm, since we've already checked it for the "else" path.

> +
> +  /* Get the scaling factor.  */
> +  tree rhs_scale = gimple_assign_rhs2 (mulhs_stmt);
> +  tree rhs_scale_type = TREE_TYPE (rhs_scale);
> +  if (TREE_CODE (rhs_scale) != INTEGER_CST
> +      || TREE_CODE (rhs_scale_type) != INTEGER_TYPE
> +      || !type_has_mode_precision_p (rhs_scale_type))
> +    return NULL;

Just:

  /* Get the scaling factor.  */
  tree rhs_scale = gimple_assign_rhs2 (mulhs_stmt);
  if (TREE_CODE (rhs_scale) != INTEGER_CST)
    return NULL;

would be enough.

> +  /* Get the definition of the scaling input term.  */
> +  tree rhs_mulh = gimple_assign_rhs1 (mulhs_stmt);
> +  tree rhs_mulh_type = TREE_TYPE (rhs_mulh);
> +  if (!INTEGRAL_TYPE_P (rhs_mulh_type))
> +    return NULL;

This is guaranteed to be true, no need to check.

> +  stmt_vec_info mulh_stmt_info = vect_get_internal_def (vinfo, rhs_mulh);
> +  if (!mulh_stmt_info)
> +    return NULL;
> +
> +  /* Check that the shift represents the correct scaling.  */
> +  if (wi::ne_p(wi::to_widest(rhs_scale) + expect_offset,
> +          TYPE_PRECISION (rhs_mulh_type)))
> +    return NULL;

At this point it's guaranteed that:

  TYPE_PRECISION (rhs_mulh_type) == TYPE_PRECISION (lhs_type)

I think it would be better to use lhs_type for the condition and check the value as part of the "TREE_CODE (rhs_scale) != INTEGER_CST" condition above.  You can use != instead of wi::ne_p.
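I.e. something like:

  if (TREE_CODE (rhs_scale) != INTEGER_CST
      || wi::to_widest (rhs_scale) + expect_offset
	   != TYPE_PRECISION (lhs_type))
    return NULL;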

> +
> +  /* Check whether the scaling input term can be seen as two widened
> +     inputs multiplied together.  */
> +  vect_unpromoted_value unprom_mult[2];
> +  tree new_type;
> +  unsigned int nops = vect_widened_op_tree (mulh_stmt_info, MULT_EXPR,
> +					    WIDEN_MULT_EXPR, false, 2,
> +					    unprom_mult, &new_type);
> +  if (nops != 2)
> +    return NULL;
> +
> +  vect_pattern_detected ("vect_recog_multhi_pattern", last_stmt);
> +
> +  /* Adjust output precision.  */
> +  if (TYPE_PRECISION (new_type) < target_precision)
> +    new_type = build_nonstandard_integer_type (target_precision,
> +					       TYPE_UNSIGNED (new_type));
> +
> +  /* Check for target support.  */
> +  tree new_vectype = get_vectype_for_scalar_type (new_type);
> +  if (!new_vectype
> +      || !direct_internal_fn_supported_p (ifn, new_vectype,
> +					  OPTIMIZE_FOR_SPEED))
> +    return NULL;
> +
> +  /* The IR requires a valid vector type for the cast result, even though
> +     it's likely to be discarded.  */
> +  *type_out = get_vectype_for_scalar_type (lhs_type);
> +  if (!*type_out)
> +    return NULL;
> +
> +  /* Generate the IFN_MULHRS call.  */
> +  tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
> +  tree new_ops[2];
> +  vect_convert_inputs (last_stmt_info, 2, new_ops, new_type,
> +          unprom_mult, new_vectype);

Arguments should be indented one character beyond the "(".
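I.e.:

  vect_convert_inputs (last_stmt_info, 2, new_ops, new_type,
		       unprom_mult, new_vectype);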

> +  gcall *mulhrs_stmt = gimple_build_call_internal (ifn, 2,
> +          new_ops[0], new_ops[1]);

Same here.

> +  gimple_call_set_lhs (mulhrs_stmt, new_var);
> +  gimple_set_location (mulhrs_stmt, gimple_location (last_stmt));
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +		     "created pattern stmt: %G", mulhrs_stmt);
> +
> +  return vect_convert_output (last_stmt_info, lhs_type,
> +          mulhrs_stmt, new_vectype);

Here too.

It would be good to have some tests in gcc.dg/vect too, along the lines of gcc.dg/vect/vect-avg-[1-4].c.

For the record, I think in principle we could look through other sign conversions too.  E.g. we might have (pseudo-gimple):

  int _2 = _1 >> 14;
  unsigned int _3 = (unsigned int) _2;
  unsigned int _4 = _3 + 1;
  int _5 = (int) _4;
  int _6 = _5 >> 1;

This guarantees that overflow in the addition of _4 is well-defined.
Or we might have a sign change on the input to the shift.  But I think those cases should be treated as potential follow-on improvements if we see a case that needs them (probably via a new helper to make looking through sign conversions a bit easier).  It doesn't need to be part of the initial patch.

Thanks,
Richard

[-- Attachment #2: rb11655.patch --]
[-- Type: application/octet-stream, Size: 25196 bytes --]

diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 2334e5a7b7dc524bbd1f4d0a48ba5cd991970118..6a7c484c4c08488bfef60f11c344a32686b9d9ca 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -63,3 +63,61 @@
    movprfx\t%0, %2\;<sur>h<addsub>\t%0.<Vetype>, %1/m, %0.<Vetype>, %3.<Vetype>"
   [(set_attr "movprfx" "*,yes")]
 )
+
+;; Multiply long top / bottom.
+(define_insn "<su>mull<bt><Vwide>"
+  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
+  (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")
+		   (match_operand:SVE_BHSI 2 "register_operand" "w")]
+		  MULLBT))]
+  "TARGET_SVE2"
+  "<su>mull<bt>\t%0.<Vewtype>, %1.<Vetype>, %2.<Vetype>"
+)
+
+;; (Rounding) Right shift narrow bottom.
+(define_insn "<r>shrnb<mode>"
+  [(set (match_operand:SVE_BHSI 0 "register_operand" "=w")
+  (unspec:SVE_BHSI [(match_operand:<VWIDE> 1 "register_operand" "w")
+		    (match_operand 2 "aarch64_simd_shift_imm_offset_<Vel>" "")]
+		   SHRNB))]
+  "TARGET_SVE2"
+  "<r>shrnb\t%0.<Vetype>, %1.<Vewtype>, #%2"
+)
+
+;; (Rounding) Right shift narrow top.
+(define_insn "<r>shrnt<mode>"
+  [(set (match_operand:SVE_BHSI 0 "register_operand" "=w")
+  (unspec:SVE_BHSI [(match_operand:SVE_BHSI 1 "register_operand" "0")
+		    (match_operand:<VWIDE> 2 "register_operand" "w")
+		    (match_operand 3 "aarch64_simd_shift_imm_offset_<Vel>" "i")]
+		   SHRNT))]
+  "TARGET_SVE2"
+  "<r>shrnt\t%0.<Vetype>, %2.<Vewtype>, #%3"
+)
+
+;; Unpredicated integer multiply-high-with-(round-and-)scale.
+(define_expand "<su>mulh<r>s<mode>3"
+  [(set (match_operand:SVE_BHSI 0 "register_operand")
+  (unspec:SVE_BHSI
+	[(match_dup 3)
+	(unspec:SVE_BHSI [(match_operand:SVE_BHSI 1 "register_operand")
+			  (match_operand:SVE_BHSI 2 "register_operand")]
+			 MULHRS)]
+	UNSPEC_PRED_X))]
+  "TARGET_SVE2"
+  {
+    operands[3] = aarch64_ptrue_reg (<VPRED>mode);
+
+    rtx prod_b = gen_reg_rtx (<VWIDE>mode);
+    rtx prod_t = gen_reg_rtx (<VWIDE>mode);
+    emit_insn (gen_<su>mullb<Vwide> (prod_b, operands[1], operands[2]));
+    emit_insn (gen_<su>mullt<Vwide> (prod_t, operands[1], operands[2]));
+
+    rtx shift = GEN_INT (GET_MODE_UNIT_BITSIZE (<MODE>mode) - 1);
+    emit_insn (gen_<r>shrnb<mode> (operands[0], prod_b, shift));
+    emit_insn (gen_<r>shrnt<mode> (operands[0], operands[0], prod_t, shift));
+
+    DONE;
+  }
+)
+
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 49d227f674694725ff18223eb2ff045181b5da61..d23f0fcbc2f47e8b48ca9bc52b3e483e450da432 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -378,6 +378,10 @@
     UNSPEC_RSUBHN2	; Used in aarch64-simd.md.
     UNSPEC_SQDMULH	; Used in aarch64-simd.md.
     UNSPEC_SQRDMULH	; Used in aarch64-simd.md.
+    UNSPEC_SMULLB	; Used in aarch64-sve2.md.
+    UNSPEC_SMULLT	; Used in aarch64-sve2.md.
+    UNSPEC_UMULLB	; Used in aarch64-sve2.md.
+    UNSPEC_UMULLT	; Used in aarch64-sve2.md.
     UNSPEC_PMUL		; Used in aarch64-simd.md.
     UNSPEC_FMULX	; Used in aarch64-simd.md.
     UNSPEC_USQADD	; Used in aarch64-simd.md.
@@ -400,6 +404,10 @@
     UNSPEC_UQSHRN	; Used in aarch64-simd.md.
     UNSPEC_SQRSHRN	; Used in aarch64-simd.md.
     UNSPEC_UQRSHRN	; Used in aarch64-simd.md.
+    UNSPEC_SHRNB	; Used in aarch64-sve2.md.
+    UNSPEC_SHRNT	; Used in aarch64-sve2.md.
+    UNSPEC_RSHRNB	; Used in aarch64-sve2.md.
+    UNSPEC_RSHRNT	; Used in aarch64-sve2.md.
     UNSPEC_SSHL		; Used in aarch64-simd.md.
     UNSPEC_USHL		; Used in aarch64-simd.md.
     UNSPEC_SRSHL	; Used in aarch64-simd.md.
@@ -523,6 +531,10 @@
     UNSPEC_FCMLA90	; Used in aarch64-simd.md.
     UNSPEC_FCMLA180	; Used in aarch64-simd.md.
     UNSPEC_FCMLA270	; Used in aarch64-simd.md.
+    UNSPEC_SMULHS	; Used in aarch64-sve2.md.
+    UNSPEC_SMULHRS	; Used in aarch64-sve2.md.
+    UNSPEC_UMULHS	; Used in aarch64-sve2.md.
+    UNSPEC_UMULHRS	; Used in aarch64-sve2.md.
 ])
 
 ;; ------------------------------------------------------------------
@@ -1588,6 +1600,13 @@
 
 (define_int_iterator RHADD [UNSPEC_SRHADD UNSPEC_URHADD])
 
+(define_int_iterator MULLBT [UNSPEC_SMULLB UNSPEC_UMULLB
+                             UNSPEC_SMULLT UNSPEC_UMULLT])
+
+(define_int_iterator SHRNB [UNSPEC_SHRNB UNSPEC_RSHRNB])
+
+(define_int_iterator SHRNT [UNSPEC_SHRNT UNSPEC_RSHRNT])
+
 (define_int_iterator DOTPROD [UNSPEC_SDOT UNSPEC_UDOT])
 
 (define_int_iterator ADDSUBHN [UNSPEC_ADDHN UNSPEC_RADDHN
@@ -1607,6 +1626,9 @@
 
 (define_int_iterator VQDMULH [UNSPEC_SQDMULH UNSPEC_SQRDMULH])
 
+(define_int_iterator MULHRS [UNSPEC_SMULHS UNSPEC_UMULHS
+                             UNSPEC_SMULHRS UNSPEC_UMULHRS])
+
 (define_int_iterator USSUQADD [UNSPEC_SUQADD UNSPEC_USQADD])
 
 (define_int_iterator SUQMOVN [UNSPEC_SQXTN UNSPEC_UQXTN])
@@ -1872,7 +1894,11 @@
 		     (UNSPEC_COND_FCVTZS "s")
 		     (UNSPEC_COND_FCVTZU "u")
 		     (UNSPEC_COND_SCVTF "s")
-		     (UNSPEC_COND_UCVTF "u")])
+		     (UNSPEC_COND_UCVTF "u")
+		     (UNSPEC_SMULLB "s") (UNSPEC_UMULLB "u")
+		     (UNSPEC_SMULLT "s") (UNSPEC_UMULLT "u")
+		     (UNSPEC_SMULHS "s") (UNSPEC_UMULHS "u")
+		     (UNSPEC_SMULHRS "s") (UNSPEC_UMULHRS "u")])
 
 (define_int_attr sur [(UNSPEC_SHADD "s") (UNSPEC_UHADD "u")
 		      (UNSPEC_SRHADD "sr") (UNSPEC_URHADD "ur")
@@ -1910,6 +1936,10 @@
                     (UNSPEC_SQRSHRN "r") (UNSPEC_UQRSHRN "r")
                     (UNSPEC_SQSHL   "")  (UNSPEC_UQSHL  "")
                     (UNSPEC_SQRSHL   "r")(UNSPEC_UQRSHL  "r")
+		    (UNSPEC_SHRNB "") (UNSPEC_SHRNT "")
+		    (UNSPEC_RSHRNB "r") (UNSPEC_RSHRNT "r")
+		    (UNSPEC_SMULHS "") (UNSPEC_UMULHS "")
+		    (UNSPEC_SMULHRS "r") (UNSPEC_UMULHRS "r")
 ])
 
 (define_int_attr lr [(UNSPEC_SSLI  "l") (UNSPEC_USLI  "l")
@@ -1922,6 +1952,9 @@
 		    (UNSPEC_SHADD "") (UNSPEC_UHADD "u")
 		    (UNSPEC_SRHADD "") (UNSPEC_URHADD "u")])
 
+(define_int_attr bt [(UNSPEC_SMULLB "b") (UNSPEC_UMULLB "b")
+		     (UNSPEC_SMULLT "t") (UNSPEC_UMULLT "t")])
+
 (define_int_attr addsub [(UNSPEC_SHADD "add")
 			 (UNSPEC_UHADD "add")
 			 (UNSPEC_SRHADD "add")
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index fa4ae14534b790f0416992da31ce49ccefec7a73..f35fd2b1b19cef1deb41566d7614d80d449d69fc 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5387,6 +5387,33 @@ operand 1. Add operand 1 to operand 2 and place the widened result in
 operand 0. (This is used express accumulation of elements into an accumulator
 of a wider mode.)
 
+@cindex @code{smulhs@var{m3}} instruction pattern
+@item @samp{smulhs@var{m3}}
+@cindex @code{umulhs@var{m3}} instruction pattern
+@itemx @samp{umulhs@var{m3}}
+Signed/unsigned multiply high with scale. This is equivalent to the C code:
+@smallexample
+narrow op0, op1, op2;
+@dots{}
+op0 = (narrow) (((wide) op1 * (wide) op2) >> (N / 2 - 1));
+@end smallexample
+where the sign of @samp{narrow} determines whether this is a signed
+or unsigned operation, and @var{N} is the size of @samp{wide} in bits.
+
+@cindex @code{smulhrs@var{m3}} instruction pattern
+@item @samp{smulhrs@var{m3}}
+@cindex @code{umulhrs@var{m3}} instruction pattern
+@itemx @samp{umulhrs@var{m3}}
+Signed/unsigned multiply high with round and scale. This is
+equivalent to the C code:
+@smallexample
+narrow op0, op1, op2;
+@dots{}
+op0 = (narrow) (((((wide) op1 * (wide) op2) >> (N / 2 - 2)) + 1) >> 1);
+@end smallexample
+where the sign of @samp{narrow} determines whether this is a signed
+or unsigned operation, and @var{N} is the size of @samp{wide} in bits.
+
 @cindex @code{vec_shl_insert_@var{m}} instruction pattern
 @item @samp{vec_shl_insert_@var{m}}
 Shift the elements in vector input operand 1 left one element (i.e.@:
diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index 7867ac8424419bfeb1cdbd21f6b98f62979fbf8d..b7267c5e02cb871939e970803c3767da44b24ac8 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -1442,6 +1442,10 @@ vector alignment.
 Target supports both signed and unsigned averaging operations on vectors
 of bytes.
 
+@item vect_mulhrs_hi
+Target supports both signed and unsigned multiply-high-with-round-and-scale
+operations on vectors of half-words.
+
 @item vect_condition
 Target supports vector conditional operations.
 
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index ad86b9afd5488f2ff3cec94d966a3c320bc15919..549d6f1153b6e568d1ab8b0e5b8fa8ccc2161b61 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -3210,6 +3210,8 @@ first_commutative_argument (internal_fn fn)
     case IFN_FNMS:
     case IFN_AVG_FLOOR:
     case IFN_AVG_CEIL:
+    case IFN_MULHS:
+    case IFN_MULHRS:
     case IFN_FMIN:
     case IFN_FMAX:
       return 0;
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index b5a6ca33223728f9c39bca0211b7a9728e665d5e..49f57978c88a3a8c1a0206d983e1720ed09b0385 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -149,6 +149,11 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
 DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 			      savg_ceil, uavg_ceil, binary)
 
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
+			      smulhs, umulhs, binary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
+			      smulhrs, umulhrs, binary)
+
 DEF_INTERNAL_OPTAB_FN (COND_ADD, ECF_CONST, cond_add, cond_binary)
 DEF_INTERNAL_OPTAB_FN (COND_SUB, ECF_CONST, cond_sub, cond_binary)
 DEF_INTERNAL_OPTAB_FN (COND_MUL, ECF_CONST, cond_smul, cond_binary)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0860b38badbb2531f7292159bc63f03d9b1aa23c..308696846d4926fdd94133b87f4f59b8d1cc7f20 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -343,6 +343,10 @@ OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
 OPTAB_D (usad_optab, "usad$I$a")
 OPTAB_D (ssad_optab, "ssad$I$a")
+OPTAB_D (smulhs_optab, "smulhs$a3")
+OPTAB_D (smulhrs_optab, "smulhrs$a3")
+OPTAB_D (umulhs_optab, "umulhs$a3")
+OPTAB_D (umulhrs_optab, "umulhrs$a3")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
 OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
 OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")
diff --git a/gcc/testsuite/gcc.dg/vect/vect-mulhrs-1.c b/gcc/testsuite/gcc.dg/vect/vect-mulhrs-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..8e46ff6b01fe765f597add737e0b64ec5b505dd1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-mulhrs-1.c
@@ -0,0 +1,49 @@
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+#ifndef SIGNEDNESS
+#define SIGNEDNESS signed
+#endif
+#ifndef BIAS
+#define BIAS 0
+#endif
+
+#define HRS(x) ((((x) >> (15 - BIAS)) + BIAS) >> BIAS)
+
+void __attribute__ ((noipa))
+f (SIGNEDNESS short *restrict a, SIGNEDNESS short *restrict b,
+   SIGNEDNESS short *restrict c, __INTPTR_TYPE__ n)
+{
+  for (__INTPTR_TYPE__ i = 0; i < n; ++i)
+    a[i] = HRS((SIGNEDNESS int) b[i] * (SIGNEDNESS int) c[i]);
+}
+
+#define N 50
+#define BASE1 ((SIGNEDNESS int) -1 < 0 ? -126 : 4)
+#define BASE2 ((SIGNEDNESS int) -1 < 0 ? -101 : 26)
+#define CONST1 0x01AB
+#define CONST2 0x01CD
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS short a[N], b[N], c[N];
+  for (int i = 0; i < N; ++i)
+    {
+      b[i] = BASE1 + i * CONST1;
+      c[i] = BASE2 + i * CONST2;
+      asm volatile ("" ::: "memory");
+    }
+  f (a, b, c, N);
+  for (int i = 0; i < N; ++i)
+    if (a[i] != HRS(BASE1 * BASE2 + i * i * (CONST1 * CONST2)
+		    + i * (BASE1 * CONST2 + BASE2 * CONST1)))
+      __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_mulhs_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump {\.MULHS} "vect" { target vect_mulhrs_hi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_mulhrs_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-mulhrs-2.c b/gcc/testsuite/gcc.dg/vect/vect-mulhrs-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..a16e71c6a37f2a0f72905741dbc4698a8fcbc601
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-mulhrs-2.c
@@ -0,0 +1,9 @@
+/* { dg-require-effective-target vect_int } */
+
+#define SIGNEDNESS unsigned
+
+#include "vect-mulhrs-1.c"
+
+/* { dg-final { scan-tree-dump "vect_recog_mulhs_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump {\.MULHS} "vect" { target vect_mulhrs_hi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_mulhrs_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-mulhrs-3.c b/gcc/testsuite/gcc.dg/vect/vect-mulhrs-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..e7d44d75d6c474de2605c72deb7db87a107c6531
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-mulhrs-3.c
@@ -0,0 +1,9 @@
+/* { dg-require-effective-target vect_int } */
+
+#define BIAS 1
+
+#include "vect-mulhrs-1.c"
+
+/* { dg-final { scan-tree-dump "vect_recog_mulhs_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump {\.MULHRS} "vect" { target vect_mulhrs_hi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_mulhrs_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-mulhrs-4.c b/gcc/testsuite/gcc.dg/vect/vect-mulhrs-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..e121763352ee9aed75423352db86c67f866e2d1a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-mulhrs-4.c
@@ -0,0 +1,10 @@
+/* { dg-require-effective-target vect_int } */
+
+#define SIGNEDNESS unsigned
+#define BIAS 1
+
+#include "vect-mulhrs-1.c"
+
+/* { dg-final { scan-tree-dump "vect_recog_mulhs_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump {\.MULHRS} "vect" { target vect_mulhrs_hi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_mulhrs_hi } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c b/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..7970d681c9d137ed71b5fe603d23870397b81eb7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve2/mulhrs_1.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details --save-temps" } */
+
+#include <stdint.h>
+
+#define MULTHI(TYPE, BIGGER, RND)                     \
+TYPE __attribute__ ((noinline, noclone))              \
+mulhs_##TYPE##_##RND (TYPE *restrict x,               \
+        TYPE *restrict y, TYPE *restrict z, int n)    \
+{                                                     \
+  for (int i = 0; i < n; i++)                         \
+  {                                                   \
+    z[i] = ((((BIGGER)x[i] * (BIGGER)y[i]) >>         \
+            (sizeof(BIGGER)*8/2-2)) + RND) >> 1;      \
+  }                                                   \
+}
+
+MULTHI (int8_t, int16_t, 0)
+MULTHI (int16_t, int32_t, 0)
+MULTHI (int32_t, int64_t, 0)
+
+MULTHI (uint8_t, uint16_t, 0)
+MULTHI (uint16_t, uint32_t, 0)
+MULTHI (uint32_t, uint64_t, 0)
+
+MULTHI (int8_t, int16_t, 1)
+MULTHI (int16_t, int32_t, 1)
+MULTHI (int32_t, int64_t, 1)
+
+MULTHI (uint8_t, uint16_t, 1)
+MULTHI (uint16_t, uint32_t, 1)
+MULTHI (uint32_t, uint64_t, 1)
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 12 "vect" } } */
+
+/* { dg-final { scan-assembler-times {\tsmullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tsmullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tsmullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tsmullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tsmullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tsmullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */
+
+/* { dg-final { scan-assembler-times {\tshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n} 2 } } */
+
+/* { dg-final { scan-assembler-times {\tumullb\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tumullt\tz[0-9]+\.h, z[0-9]+\.b, z[0-9]+\.b\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tumullb\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tumullt\tz[0-9]+\.s, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tumullb\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tumullt\tz[0-9]+\.d, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */
+
+/* { dg-final { scan-assembler-times {\trshrnb\tz[0-9]+\.b, z[0-9]+\.h, #7\n} 2 } } */
+/* { dg-final { scan-assembler-times {\trshrnt\tz[0-9]+\.b, z[0-9]+\.h, #7\n} 2 } } */
+/* { dg-final { scan-assembler-times {\trshrnb\tz[0-9]+\.h, z[0-9]+\.s, #15\n} 2 } } */
+/* { dg-final { scan-assembler-times {\trshrnt\tz[0-9]+\.h, z[0-9]+\.s, #15\n} 2 } } */
+/* { dg-final { scan-assembler-times {\trshrnb\tz[0-9]+\.s, z[0-9]+\.d, #31\n} 2 } } */
+/* { dg-final { scan-assembler-times {\trshrnt\tz[0-9]+\.s, z[0-9]+\.d, #31\n} 2 } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index e32d42491d96959ad04547263f2cc3631f4994d9..07bc42346b8346c9d3935bb69d2d82a96c60c3ca 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -6167,6 +6167,15 @@ proc check_effective_target_vect_avg_qi {} {
 		   && ![check_effective_target_aarch64_sve1_only] }]
 }
 
+# Return 1 if the target plus current options supports both signed
+# and unsigned multiply-high-with-round-and-scale operations
+# on vectors of half-words.
+
+proc check_effective_target_vect_mulhrs_hi {} {
+    return [expr { [istarget aarch64*-*-*]
+		   && [check_effective_target_aarch64_sve2] }]
+}
+
 # Return 1 if the target plus current options supports a vector
 # demotion (packing) of shorts (to chars) and ints (to shorts) 
 # using modulo arithmetic, 0 otherwise.
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index ccb2e1edecda09db786c0d98ccd25f5be107c7e4..20d1878dc39fa24a45be736a2f891a0c239cf1d3 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1723,6 +1723,175 @@ vect_recog_over_widening_pattern (stmt_vec_info last_stmt_info, tree *type_out)
   return pattern_stmt;
 }
 
+/* Recognize the following patterns:
+
+     ATYPE a;  // narrower than TYPE
+     BTYPE b;  // narrower than TYPE
+
+   1) Multiply high with scaling
+     TYPE res = ((TYPE) a * (TYPE) b) >> c;
+   2) ... or also with rounding
+     TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+
+   where only the bottom half of res is used.  */
+
+static gimple *
+vect_recog_mulhs_pattern (stmt_vec_info last_stmt_info, tree *type_out)
+{
+  /* Check for a right shift.  */
+  gassign *last_stmt = dyn_cast <gassign *> (last_stmt_info->stmt);
+  if (!last_stmt
+      || gimple_assign_rhs_code (last_stmt) != RSHIFT_EXPR)
+    return NULL;
+	vec_info *vinfo = last_stmt_info->vinfo;
+
+  /* Check that the shift result is wider than the users of the
+     result need (i.e. that narrowing would be a natural choice).  */
+  tree lhs_type = TREE_TYPE (gimple_assign_lhs (last_stmt));
+  unsigned int target_precision
+    = vect_element_precision (last_stmt_info->min_output_precision);
+  if (!INTEGRAL_TYPE_P (lhs_type)
+      || target_precision >= TYPE_PRECISION (lhs_type))
+    return NULL;
+
+  /* Look through any change in sign on the outer shift input.  */
+  vect_unpromoted_value unprom_rshift_input;
+  tree rshift_input = vect_look_through_possible_promotion
+    (vinfo, gimple_assign_rhs1 (last_stmt), &unprom_rshift_input);
+  if (!rshift_input
+      || TYPE_PRECISION (TREE_TYPE (rshift_input))
+	   != TYPE_PRECISION (lhs_type))
+    return NULL;
+
+  /* Get the definition of the shift input.  */
+  stmt_vec_info rshift_input_stmt_info
+    = vect_get_internal_def (vinfo, rshift_input);
+  if (!rshift_input_stmt_info)
+    return NULL;
+  gassign *rshift_input_stmt
+    = dyn_cast <gassign *> (rshift_input_stmt_info->stmt);
+  if (!rshift_input_stmt)
+    return NULL;
+
+  stmt_vec_info mulh_stmt_info;
+  tree scale_term;
+  internal_fn ifn;
+  unsigned int expect_offset;
+
+  /* Check for the presence of the rounding term.  */
+  if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
+    {
+      /* Check that the outer shift was by 1.  */
+      if (!integer_onep (gimple_assign_rhs2 (last_stmt)))
+	return NULL;
+
+      /* Check that the second operand of the PLUS_EXPR is 1.  */
+      if (!integer_onep (gimple_assign_rhs2 (rshift_input_stmt)))
+	return NULL;
+
+      /* Look through any change in sign on the addition input.  */
+      vect_unpromoted_value unprom_plus_input;
+      tree plus_input = vect_look_through_possible_promotion
+	(vinfo, gimple_assign_rhs1 (rshift_input_stmt), &unprom_plus_input);
+      if (!plus_input
+	   || TYPE_PRECISION (TREE_TYPE (plus_input))
+		!= TYPE_PRECISION (TREE_TYPE (rshift_input)))
+	return NULL;
+
+      /* Get the definition of the multiply-high-scale part.  */
+      stmt_vec_info plus_input_stmt_info
+	= vect_get_internal_def (vinfo, plus_input);
+      if (!plus_input_stmt_info)
+	return NULL;
+      gassign *plus_input_stmt
+	= dyn_cast <gassign *> (plus_input_stmt_info->stmt);
+      if (!plus_input_stmt
+	  || gimple_assign_rhs_code (plus_input_stmt) != RSHIFT_EXPR)
+	return NULL;
+
+      /* Look through any change in sign on the scaling input.  */
+      vect_unpromoted_value unprom_scale_input;
+      tree scale_input = vect_look_through_possible_promotion
+	(vinfo, gimple_assign_rhs1 (plus_input_stmt), &unprom_scale_input);
+      if (!scale_input
+	  || TYPE_PRECISION (TREE_TYPE (scale_input))
+	       != TYPE_PRECISION (TREE_TYPE (plus_input)))
+	return NULL;
+
+      /* Get the definition of the multiply-high part.  */
+      mulh_stmt_info = vect_get_internal_def (vinfo, scale_input);
+      if (!mulh_stmt_info)
+	return NULL;
+
+      /* Get the scaling term.  */
+      scale_term = gimple_assign_rhs2 (plus_input_stmt);
+
+      expect_offset = target_precision + 2;
+      ifn = IFN_MULHRS;
+    }
+  else
+    {
+      mulh_stmt_info = rshift_input_stmt_info;
+      scale_term = gimple_assign_rhs2 (last_stmt);
+
+      expect_offset = target_precision + 1;
+      ifn = IFN_MULHS;
+    }
+
+  /* Check that the scaling factor is correct.  */
+  if (TREE_CODE (scale_term) != INTEGER_CST
+      || wi::to_widest (scale_term) + expect_offset
+	   != TYPE_PRECISION (lhs_type))
+    return NULL;
+
+  /* Check whether the scaling input term can be seen as two widened
+     inputs multiplied together.  */
+  vect_unpromoted_value unprom_mult[2];
+  tree new_type;
+  unsigned int nops
+    = vect_widened_op_tree (mulh_stmt_info, MULT_EXPR, WIDEN_MULT_EXPR,
+			    false, 2, unprom_mult, &new_type);
+  if (nops != 2)
+    return NULL;
+
+  vect_pattern_detected ("vect_recog_mulhs_pattern", last_stmt);
+
+  /* Adjust output precision.  */
+  if (TYPE_PRECISION (new_type) < target_precision)
+    new_type = build_nonstandard_integer_type
+      (target_precision, TYPE_UNSIGNED (new_type));
+
+  /* Check for target support.  */
+  tree new_vectype = get_vectype_for_scalar_type (new_type);
+  if (!new_vectype
+      || !direct_internal_fn_supported_p
+	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+    return NULL;
+
+  /* The IR requires a valid vector type for the cast result, even though
+     it's likely to be discarded.  */
+  *type_out = get_vectype_for_scalar_type (lhs_type);
+  if (!*type_out)
+    return NULL;
+
+  /* Generate the IFN_MULHRS call.  */
+  tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
+  tree new_ops[2];
+  vect_convert_inputs (last_stmt_info, 2, new_ops, new_type,
+		       unprom_mult, new_vectype);
+  gcall *mulhrs_stmt
+    = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
+  gimple_call_set_lhs (mulhrs_stmt, new_var);
+  gimple_set_location (mulhrs_stmt, gimple_location (last_stmt));
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "created pattern stmt: %G", mulhrs_stmt);
+
+  return vect_convert_output (last_stmt_info, lhs_type,
+			      mulhrs_stmt, new_vectype);
+}
+
 /* Recognize the patterns:
 
 	    ATYPE a;  // narrower than TYPE
@@ -4713,6 +4882,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] = {
   /* Must come after over_widening, which narrows the shift as much as
      possible beforehand.  */
   { vect_recog_average_pattern, "average" },
+  { vect_recog_mulhs_pattern, "mult_high" },
   { vect_recog_cast_forwprop_pattern, "cast_forwprop" },
   { vect_recog_widen_mult_pattern, "widen_mult" },
   { vect_recog_dot_prod_pattern, "dot_prod" },


* Re: [PATCH][AArch64] Vectorize MULH(R)S patterns with SVE2 instructions
  2019-09-12  7:52   ` Yuliang Wang
@ 2019-09-12 10:02     ` Richard Sandiford
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Sandiford @ 2019-09-12 10:02 UTC (permalink / raw)
  To: Yuliang Wang; +Cc: gcc-patches, nd

Yuliang Wang <Yuliang.Wang@arm.com> writes:
> Hi Richard,
>
> Thanks for your comments and advice; I have applied the relevant changes.
>
> Regards,
> Yuliang
>
>
> UPDATE:
>
> Added new tests. Built and regression tested on aarch64-none-elf and aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
> 2019-09-1  Yuliang Wang  <yuliang.wang@arm.com>
>
> 	PR tree-optimization/89386
>
> 	* config/aarch64/aarch64-sve2.md (<su>mull<bt><Vwide>)
> 	(<r>shrnb<mode>, <r>shrnt<mode>): New SVE2 patterns.
> 	(<su>mulh<r>s<mode>3): New pattern for MULHRS.
> 	* config/aarch64/iterators.md (UNSPEC_SMULLB, UNSPEC_SMULLT)
> 	(UNSPEC_UMULLB, UNSPEC_UMULLT, UNSPEC_SHRNB, UNSPEC_SHRNT)
> 	(UNSPEC_RSHRNB, UNSPEC_RSHRNT, UNSPEC_SMULHS, UNSPEC_SMULHRS)
> 	(UNSPEC_UMULHS, UNSPEC_UMULHRS): New unspecs.
> 	(MULLBT, SHRNB, SHRNT, MULHRS): New int iterators.
> 	(su, r): Handle the unspecs above.
> 	(bt): New int attribute.
> 	* internal-fn.def (IFN_MULHS, IFN_MULHRS): New internal functions.
> 	* internal-fn.c (first_commutative_argument): Commutativity info for above.
> 	* optabs.def (smulhs_optab, smulhrs_optab, umulhs_optab, umulhrs_optab):
> 	New optabs.
> 	* doc/md.texi (smulhs@var{m3}, umulhs@var{m3})
> 	(smulhrs@var{m3}, umulhrs@var{m3}): Documentation for the above.
> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): New pattern function.
> 	(vect_vect_recog_func_ptrs): Add it.
> 	* testsuite/gcc.target/aarch64/sve2/mulhrs_1.c: New test.
> 	* testsuite/gcc.dg/vect/vect-mulhrs-1.c: As above.
> 	* testsuite/gcc.dg/vect/vect-mulhrs-2.c: As above.
> 	* testsuite/gcc.dg/vect/vect-mulhrs-3.c: As above.
> 	* testsuite/gcc.dg/vect/vect-mulhrs-4.c: As above.
> 	* doc/sourcebuild.texi (vect_mulhrs_hi): Document new target selector.
> 	* testsuite/lib/target-supports.exp (check_effective_target_vect_mulhrs_hi):
> 	Return true for AArch64 without SVE2.
(with SVE2)

Thanks for doing this.  Applied with some very minor whitespace tweaks
as r275682.

Richard
