public inbox for gcc-patches@gcc.gnu.org
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Tamar Christina <Tamar.Christina@arm.com>,
	Richard Earnshaw <Richard.Earnshaw@arm.com>, nd <nd@arm.com>,
	"gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	Marcus Shawcroft <Marcus.Shawcroft@arm.com>
Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
Date: Tue, 05 Jul 2022 07:08:08 +0100	[thread overview]
Message-ID: <mptsfng3xdj.fsf@arm.com> (raw)
In-Reply-To: <CAFiYyc1MD58V_dpQ0Wr0X2i1kT-oSe9LW6mZCzZixB0U_QwZ_A@mail.gmail.com> (Richard Biener's message of "Thu, 30 Jun 2022 08:45:26 +0200")

Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Jun 29, 2022 at 4:35 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Tue, Jun 28, 2022 at 5:54 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
>> >>
>> >> > -----Original Message-----
>> >> > From: Richard Biener <richard.guenther@gmail.com>
>> >> > Sent: Monday, June 27, 2022 7:10 AM
>> >> > To: Tamar Christina <Tamar.Christina@arm.com>
>> >> > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Earnshaw
>> >> > <Richard.Earnshaw@arm.com>; nd <nd@arm.com>; gcc-
>> >> > patches@gcc.gnu.org; Marcus Shawcroft <Marcus.Shawcroft@arm.com>
>> >> > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
>> >> >
>> >> > On Mon, Jun 27, 2022 at 7:25 AM Tamar Christina via Gcc-patches <gcc-
>> >> > patches@gcc.gnu.org> wrote:
>> >> > >
>> >> > > > -----Original Message-----
>> >> > > > From: Richard Sandiford <richard.sandiford@arm.com>
>> >> > > > Sent: Thursday, June 16, 2022 7:54 PM
>> >> > > > To: Tamar Christina <Tamar.Christina@arm.com>
>> >> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> >> > > > <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> >> > > > <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
>> >> > <Kyrylo.Tkachov@arm.com>
>> >> > > > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for
>> >> > > > usdot
>> >> > > >
>> >> > > > Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> > > > > Tamar Christina <tamar.christina@arm.com> writes:
>> >> > > > >> Hi All,
>> >> > > > >>
>> >> > > > >> The usdot operation is common in video encoder and decoders
>> >> > > > >> including some of the most widely used ones.
>> >> > > > >>
>> >> > > > >> This patch adds a +dotprod version of the optab as a fallback for
>> >> > > > >> when you do have sdot but not usdot available.
>> >> > > > >>
>> >> > > > >> The fallback works by adding a bias to the unsigned argument to
>> >> > > > >> convert it to a signed value and then correcting for the bias later on.
>> >> > > > >>
>> >> > > > >> Essentially it relies on (x - 128)y + 128y == xy where x is
>> >> > > > >> unsigned and y is signed (assuming both are 8-bit values).
>> >> > > > >> Because the range of a signed byte only goes up to 127, we split
>> >> > > > >> the bias correction into:
>> >> > > > >>
>> >> > > > >>    (x - 128)y + 127y + y
>> >> > > > >
>> >> > > > > I bet you knew this question was coming, but: this technique isn't
>> >> > > > > target-specific, so wouldn't it be better to handle it in
>> >> > > > > tree-vect-patterns.cc instead?
>> >> > >
>> >> > > Ok, so after many hours of trying I don't know how to make this work.
>> >> > > DOT_PROD_EXPR is a reduction, but emitting them as additional pattern
>> >> > > statements doesn't work because they'll be marked as internal_def
>> >> > > rather than reduction_def.  I tried marking the new vec_stmt_info that
>> >> > > I create explicitly as reduction_def but this gets overwritten during analysis.
>> >> > >
>> >> > > I then looked into getting it as a vectorizable_operation, but this
>> >> > > has the obvious problem that it no longer treats it as a reduction
>> >> > > and so tries to decompose it into hi/lo.
>> >> > >
>> >> > > I then looked into treating additional patterns from a reduction as
>> >> > > reductions themselves, but this is obviously wrong as non-reduction
>> >> > > statements also get marked as reductions.
>> >> > >
>> >> > > The conclusion is that I don't think the vectorizer allows additional
>> >> > > reductions to be emitted from patterns.
>> >> >
>> >> > Indeed.  DOT_PROD is a weird beast and it doesn't define which lanes are
>> >> > reduced to which so it's only usable when the result is reduced to a single
>> >> > lane.
>> >> >
>> >> > An SLP pattern might work if you use reduc-plus for the reduced lanes and
>> >> > keep the multiply separate?
>> >>
>> >> Unfortunately I can't seem to get it to handle the reduction in SLP.
>> >> It always seems to use the non-SLP-aware loop vectorizer here.  The
>> >> suggested unroll factor is always 1, and even trying to force it gets
>> >> it to bail out later, presumably because it's reducing into a scalar
>> >> that's used outside the loop?
>> >
>> > Yes, it possibly needs 1-lane SLP support.
>>
>> As I mentioned to Tamar off-list, I feel like I've been wasting
>> people's time recently by spewing out ideas that might or might not work
>> (usually "not work"), so I wanted to get some confidence that the next
>> suggestion made sense.  In the end I needed most of an implementation
>> to do that, so it seemed easiest just to finish it off rather than post
>> it in a half-complete state.  Sorry for the duplication. :-(
>>
>> The patch certainly isn't pretty, but I think it's the best we can
>> do under the current infrastructure, and it should at least make
>> the costs reasonably accurate.  (Actually, that said, we probably
>> need to patch the reduction latency calculation in the aarch64
>> vector code -- didn't think of that until now.)
>>
>> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  WDYT?

Turned out I needed another change for this to fire on x86.  Previously
the input type (half_type) had an arbitrary sign for mixed-sign dotprods,
which was OK for the existing code, but meant that we could sometimes
query for unsigned dotprod instead of signed dotprod when considering
the fallback.  Fixed in the version below (which canonicalises on
using the signed type).

> Looks reasonable - does this end up in OKish code generation as well?

Seems OK for aarch64.  The Advanced SIMD version of vect-reduc-dot-11.c is:

.L7:
        ldr     q2, [x1, x3]
        ldr     q1, [x2, x3]
        sdot    v0.4s, v1.16b, v3.16b
        add     x3, x3, 16
        sdot    v0.4s, v1.16b, v3.16b
        add     v2.16b, v2.16b, v4.16b
        sdot    v0.4s, v1.16b, v2.16b
        cmp     x3, 48
        bne     .L7

and the SVE version is:

.L7:
        ld1b    z1.b, p0/z, [x2, x3]
        ld1b    z2.b, p0/z, [x1, x3]
        sel     z1.b, p0, z1.b, z4.b
        add     x3, x3, x5
        add     z2.b, z2.b, #128
        sdot    z0.s, z1.b, z3.b
        whilelo p0.b, w3, w4
        sdot    z0.s, z1.b, z3.b
        sdot    z0.s, z1.b, z2.b
        b.any   .L7

(with the extra SEL handling a final partial vector).

On x86, for -mavx:

int
f (int res, unsigned short *restrict a, short *restrict b)
{
  for (int i = 0; i < 256; ++i)
    res += a[i] * b[i];
  return res;
}

previously generated:

.L2:
        vmovdqu (%rsi,%rax), %xmm1
        vmovdqu (%rdx,%rax), %xmm0
        addq    $16, %rax
        vpmovsxwd       %xmm0, %xmm3
        vpsrldq $8, %xmm0, %xmm0
        vpmovzxwd       %xmm1, %xmm4
        vpsrldq $8, %xmm1, %xmm1
        vpmulld %xmm4, %xmm3, %xmm3
        vpmovsxwd       %xmm0, %xmm0
        vpmovzxwd       %xmm1, %xmm1
        vpmulld %xmm1, %xmm0, %xmm0
        vpaddd  %xmm2, %xmm3, %xmm2
        vpaddd  %xmm2, %xmm0, %xmm2
        cmpq    $512, %rax
        jne     .L2

whereas now it generates:

.L2:
        vpmaddwd        (%rdx,%rax), %xmm3, %xmm2
        vpaddw  (%rsi,%rax), %xmm4, %xmm1
        vpmaddwd        (%rdx,%rax), %xmm1, %xmm1
        addq    $16, %rax
        vpaddd  %xmm2, %xmm0, %xmm0
        vpaddd  %xmm2, %xmm0, %xmm0
        vpaddd  %xmm1, %xmm0, %xmm0
        cmpq    $512, %rax
        jne     .L2

I don't know x86 well enough to be sure that's an improvement though.
The length of the loop-carried dependency chain has increased from 2
to 3 VPADDDs.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.

Richard


gcc/
	* tree-vect-patterns.cc (vect_convert_input): Expect the input
	type to be signed for optab_vector_mixed_sign.  Update the vectype
	at the same time as type.
	(vect_recog_dot_prod_pattern): Update accordingly.  If usdot isn't
	available, try sdot instead.
	* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): New function.
	(vect_model_reduction_cost): Model the cost of implementing usdot
	using sdot.
	(vectorizable_reduction): Likewise.  Skip target support test
	for lane reductions.
	(vect_emulate_mixed_dot_prod): New function.
	(vect_transform_reduction): Use it to emulate usdot via sdot.

gcc/testsuite/
	* gcc.dg/vect/vect-reduc-dot-9.c: Reduce target requirements
	from i8mm to dotprod.
	* gcc.dg/vect/vect-reduc-dot-10.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-11.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-12.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-13.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-14.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-15.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-16.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-17.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-18.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-19.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-20.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-21.c: Likewise.
	* gcc.dg/vect/vect-reduc-dot-22.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c |   6 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c |   4 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c |   4 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c |   4 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c |   4 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c  |   6 +-
 gcc/tree-vect-loop.cc                         | 160 ++++++++++++++++--
 gcc/tree-vect-patterns.cc                     |  38 ++++-
 16 files changed, 213 insertions(+), 61 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
index 7ce86965ea9..34e25ab7fb0 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #define SIGNEDNESS_1 unsigned
 #define SIGNEDNESS_2 unsigned
@@ -10,4 +10,4 @@
 #include "vect-reduc-dot-9.c"
 
 /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
index 0f7cbbb87ef..3af8df54cf9 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #define SIGNEDNESS_1 unsigned
 #define SIGNEDNESS_2 signed
@@ -10,4 +10,4 @@
 #include "vect-reduc-dot-9.c"
 
 /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
index 08412614fc6..77ceef3643b 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #define SIGNEDNESS_1 unsigned
 #define SIGNEDNESS_2 signed
@@ -10,4 +10,4 @@
 #include "vect-reduc-dot-9.c"
 
 /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
index 7ee0f45f642..d3c0c86f529 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #define SIGNEDNESS_1 signed
 #define SIGNEDNESS_2 unsigned
@@ -10,4 +10,4 @@
 #include "vect-reduc-dot-9.c"
 
 /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
index 2de1434528b..86a5c85753c 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #define SIGNEDNESS_1 signed
 #define SIGNEDNESS_2 unsigned
@@ -10,4 +10,4 @@
 #include "vect-reduc-dot-9.c"
 
 /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
index dc48f95a32b..25de0940a65 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #define SIGNEDNESS_1 signed
 #define SIGNEDNESS_2 signed
@@ -10,4 +10,4 @@
 #include "vect-reduc-dot-9.c"
 
 /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
index aec62878936..4a1dec0677e 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #define SIGNEDNESS_1 signed
 #define SIGNEDNESS_2 signed
@@ -10,4 +10,4 @@
 #include "vect-reduc-dot-9.c"
 
 /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
index 38f86fe458a..90d21188b76 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #include "tree-vect.h"
 
@@ -50,4 +50,4 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
index 2e86ebe3c6c..81ecb158d29 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #include "tree-vect.h"
 
@@ -50,4 +50,4 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
index d00f24aae4c..cbcd4f120a5 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #include "tree-vect.h"
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
index 17adbca83a0..e81ed1da5a4 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #include "tree-vect.h"
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
index 6cc6a4f2e92..81ce5cdaffb 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #include "tree-vect.h"
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
index e13d3d5c4da..b8c9d3ca53b 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #include "tree-vect.h"
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
index d1049c96bf1..e0b132f6b35 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
@@ -1,6 +1,6 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
-/* { dg-add-options arm_v8_2a_i8mm }  */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
 
 #include "tree-vect.h"
 
@@ -50,4 +50,4 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 78dfe8519aa..3a70c15b593 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -4566,6 +4566,31 @@ have_whole_vector_shift (machine_mode mode)
   return true;
 }
 
+/* Return true if (a) STMT_INFO is a DOT_PROD_EXPR reduction whose
+   multiplication operands have differing signs and (b) we intend
+   to emulate the operation using a series of signed DOT_PROD_EXPRs.
+   See vect_emulate_mixed_dot_prod for the actual sequence used.  */
+
+static bool
+vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo,
+				 stmt_vec_info stmt_info)
+{
+  gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
+  if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR)
+    return false;
+
+  tree rhs1 = gimple_assign_rhs1 (assign);
+  tree rhs2 = gimple_assign_rhs2 (assign);
+  if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2)))
+    return false;
+
+  stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
+  gcc_assert (reduc_info->is_reduc_info);
+  return !directly_supported_p (DOT_PROD_EXPR,
+				STMT_VINFO_REDUC_VECTYPE_IN (reduc_info),
+				optab_vector_mixed_sign);
+}
+
 /* TODO: Close dependency between vect_model_*_cost and vectorizable_*
    functions. Design better to avoid maintenance issues.  */
 
@@ -4601,6 +4626,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
     gcc_unreachable ();
 
+  bool emulated_mixed_dot_prod
+    = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     /* No extra instructions are needed in the prologue.  The loop body
        operations are costed in vectorizable_condition.  */
@@ -4628,11 +4655,20 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
     }
   else
     {
-      /* Add in cost for initial definition.
-	 For cond reduction we have four vectors: initial index, step,
-	 initial result of the data reduction, initial value of the index
-	 reduction.  */
-      int prologue_stmts = reduction_type == COND_REDUCTION ? 4 : 1;
+      /* Add in the cost of the initial definitions.  */
+      int prologue_stmts;
+      if (reduction_type == COND_REDUCTION)
+	/* For cond reductions we have four vectors: initial index, step,
+	   initial result of the data reduction, initial value of the index
+	   reduction.  */
+	prologue_stmts = 4;
+      else if (emulated_mixed_dot_prod)
+	/* We need the initial reduction value and two invariants:
+	   one that contains the minimum signed value and one that
+	   contains half of its negative.  */
+	prologue_stmts = 3;
+      else
+	prologue_stmts = 1;
       prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
 					 scalar_to_vec, stmt_info, 0,
 					 vect_prologue);
@@ -6797,11 +6833,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR
 			    || op.code == WIDEN_SUM_EXPR
 			    || op.code == SAD_EXPR);
-  enum optab_subtype optab_query_kind = optab_vector;
-  if (op.code == DOT_PROD_EXPR
-      && (TYPE_SIGN (TREE_TYPE (op.ops[0]))
-	  != TYPE_SIGN (TREE_TYPE (op.ops[1]))))
-    optab_query_kind = optab_vector_mixed_sign;
 
   if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type)
       && !SCALAR_FLOAT_TYPE_P (op.type))
@@ -7328,9 +7359,17 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       /* 4. Supportable by target?  */
       bool ok = true;
 
-      /* 4.1. check support for the operation in the loop  */
+      /* 4.1. check support for the operation in the loop
+
+	 This isn't necessary for the lane reduction codes, since they
+	 can only be produced by pattern matching, and it's up to the
+	 pattern matcher to test for support.  The main reason for
+	 specifically skipping this step is to avoid rechecking whether
+	 mixed-sign dot-products can be implemented using signed
+	 dot-products.  */
       machine_mode vec_mode = TYPE_MODE (vectype_in);
-      if (!directly_supported_p (op.code, vectype_in, optab_query_kind))
+      if (!lane_reduc_code_p
+	  && !directly_supported_p (op.code, vectype_in))
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
@@ -7398,7 +7437,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
      vect_transform_reduction.  Otherwise this is costed by the
      separate vectorizable_* routines.  */
   if (single_defuse_cycle || lane_reduc_code_p)
-    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
+    {
+      int factor = 1;
+      if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info))
+	/* Three dot-products and a subtraction.  */
+	factor = 4;
+      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
+			stmt_info, 0, vect_body);
+    }
 
   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
@@ -7457,6 +7503,81 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   return true;
 }
 
+/* STMT_INFO is a dot-product reduction whose multiplication operands
+   have different signs.  Emit a sequence to emulate the operation
+   using a series of signed DOT_PROD_EXPRs and return the last
+   statement generated.  VEC_DEST is the result of the vector operation
+   and VOP lists its inputs.  */
+
+static gassign *
+vect_emulate_mixed_dot_prod (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+			     gimple_stmt_iterator *gsi, tree vec_dest,
+			     tree vop[3])
+{
+  tree wide_vectype = signed_type_for (TREE_TYPE (vec_dest));
+  tree narrow_vectype = signed_type_for (TREE_TYPE (vop[0]));
+  tree narrow_elttype = TREE_TYPE (narrow_vectype);
+  gimple *new_stmt;
+
+  /* Make VOP[0] the unsigned operand and VOP[1] the signed operand.  */
+  if (!TYPE_UNSIGNED (TREE_TYPE (vop[0])))
+    std::swap (vop[0], vop[1]);
+
+  /* Convert all inputs to signed types.  */
+  for (int i = 0; i < 3; ++i)
+    if (TYPE_UNSIGNED (TREE_TYPE (vop[i])))
+      {
+	tree tmp = make_ssa_name (signed_type_for (TREE_TYPE (vop[i])));
+	new_stmt = gimple_build_assign (tmp, NOP_EXPR, vop[i]);
+	vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+	vop[i] = tmp;
+      }
+
+  /* In the comments below we assume 8-bit inputs for simplicity,
+     but the approach works for any full integer type.  */
+
+  /* Create a vector of -128.  */
+  tree min_narrow_elttype = TYPE_MIN_VALUE (narrow_elttype);
+  tree min_narrow = build_vector_from_val (narrow_vectype,
+					   min_narrow_elttype);
+
+  /* Create a vector of 64.  */
+  auto half_wi = wi::lrshift (wi::to_wide (min_narrow_elttype), 1);
+  tree half_narrow = wide_int_to_tree (narrow_elttype, half_wi);
+  half_narrow = build_vector_from_val (narrow_vectype, half_narrow);
+
+  /* Emit: SUB_RES = VOP[0] - 128.  */
+  tree sub_res = make_ssa_name (narrow_vectype);
+  new_stmt = gimple_build_assign (sub_res, PLUS_EXPR, vop[0], min_narrow);
+  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+
+  /* Emit:
+
+       STAGE1 = DOT_PROD_EXPR <VOP[1], 64, VOP[2]>;
+       STAGE2 = DOT_PROD_EXPR <VOP[1], 64, STAGE1>;
+       STAGE3 = DOT_PROD_EXPR <SUB_RES, -128, STAGE2>;
+
+     on the basis that x * y == (x - 128) * y + 64 * y + 64 * y.
+     Doing the two 64 * y steps first allows more time to compute x.  */
+  tree stage1 = make_ssa_name (wide_vectype);
+  new_stmt = gimple_build_assign (stage1, DOT_PROD_EXPR,
+				  vop[1], half_narrow, vop[2]);
+  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+
+  tree stage2 = make_ssa_name (wide_vectype);
+  new_stmt = gimple_build_assign (stage2, DOT_PROD_EXPR,
+				  vop[1], half_narrow, stage1);
+  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+
+  tree stage3 = make_ssa_name (wide_vectype);
+  new_stmt = gimple_build_assign (stage3, DOT_PROD_EXPR,
+				  sub_res, vop[1], stage2);
+  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+
+  /* Convert STAGE3 to the reduction type.  */
+  return gimple_build_assign (vec_dest, CONVERT_EXPR, stage3);
+}
+
 /* Transform the definition stmt STMT_INFO of a reduction PHI backedge
    value.  */
 
@@ -7563,12 +7684,17 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 					: &vec_oprnds2));
     }
 
+  bool emulated_mixed_dot_prod
+    = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
   FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
     {
       gimple *new_stmt;
       tree vop[3] = { def0, vec_oprnds1[i], NULL_TREE };
       if (masked_loop_p && !mask_by_cond_expr)
 	{
+	  /* No conditional ifns have been defined for dot-product yet.  */
+	  gcc_assert (code != DOT_PROD_EXPR);
+
 	  /* Make sure that the reduction accumulator is vop[0].  */
 	  if (reduc_index == 1)
 	    {
@@ -7597,8 +7723,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	      build_vect_cond_expr (code, vop, mask, gsi);
 	    }
 
-	  new_stmt = gimple_build_assign (vec_dest, code,
-					  vop[0], vop[1], vop[2]);
+	  if (emulated_mixed_dot_prod)
+	    new_stmt = vect_emulate_mixed_dot_prod (loop_vinfo, stmt_info, gsi,
+						    vec_dest, vop);
+	  else
+	    new_stmt = gimple_build_assign (vec_dest, code,
+					    vop[0], vop[1], vop[2]);
 	  new_temp = make_ssa_name (vec_dest, new_stmt);
 	  gimple_assign_set_lhs (new_stmt, new_temp);
 	  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 8f624863971..dfbfb71b3c6 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -760,12 +760,16 @@ vect_convert_input (vec_info *vinfo, stmt_vec_info stmt_info, tree type,
 		    vect_unpromoted_value *unprom, tree vectype,
 		    enum optab_subtype subtype = optab_default)
 {
-
   /* Update the type if the signs differ.  */
-  if (subtype == optab_vector_mixed_sign
-      && TYPE_SIGN (type) != TYPE_SIGN (TREE_TYPE (unprom->op)))
-    type = build_nonstandard_integer_type (TYPE_PRECISION (type),
-					   TYPE_SIGN (unprom->type));
+  if (subtype == optab_vector_mixed_sign)
+    {
+      gcc_assert (!TYPE_UNSIGNED (type));
+      if (TYPE_UNSIGNED (TREE_TYPE (unprom->op)))
+	{
+	  type = unsigned_type_for (type);
+	  vectype = unsigned_type_for (vectype);
+	}
+    }
 
   /* Check for a no-op conversion.  */
   if (types_compatible_p (type, TREE_TYPE (unprom->op)))
@@ -1139,16 +1143,34 @@ vect_recog_dot_prod_pattern (vec_info *vinfo,
      is signed; otherwise, the result has the same sign as the operands.  */
   if (TYPE_PRECISION (unprom_mult.type) != TYPE_PRECISION (type)
       && (subtype == optab_vector_mixed_sign
-	? TYPE_UNSIGNED (unprom_mult.type)
-	: TYPE_SIGN (unprom_mult.type) != TYPE_SIGN (half_type)))
+	  ? TYPE_UNSIGNED (unprom_mult.type)
+	  : TYPE_SIGN (unprom_mult.type) != TYPE_SIGN (half_type)))
     return NULL;
 
   vect_pattern_detected ("vect_recog_dot_prod_pattern", last_stmt);
 
+  /* If the inputs have mixed signs, canonicalize on using the signed
+     input type for analysis.  This also helps when emulating mixed-sign
+     operations using signed operations.  */
+  if (subtype == optab_vector_mixed_sign)
+    half_type = signed_type_for (half_type);
+
   tree half_vectype;
   if (!vect_supportable_direct_optab_p (vinfo, type, DOT_PROD_EXPR, half_type,
 					type_out, &half_vectype, subtype))
-    return NULL;
+    {
+      /* We can emulate a mixed-sign dot-product using a sequence of
+	 signed dot-products; see vect_emulate_mixed_dot_prod for details.  */
+      if (subtype != optab_vector_mixed_sign
+	  || !vect_supportable_direct_optab_p (vinfo, signed_type_for (type),
+					       DOT_PROD_EXPR, half_type,
+					       type_out, &half_vectype,
+					       optab_vector))
+	return NULL;
+
+      *type_out = signed_or_unsigned_type_for (TYPE_UNSIGNED (type),
+					       *type_out);
+    }
 
   /* Get the inputs in the appropriate types.  */
   tree mult_oprnd[2];
-- 
2.25.1
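
[Archive note, not part of the patch: for anyone following the three-stage
emulation that vect_emulate_mixed_dot_prod builds above, this is a quick
standalone check of the identity the comment relies on. It simply enumerates
every unsigned 8-bit x and signed 8-bit y.]

```python
# Verify the identity used to emulate a mixed-sign (u8 x s8) dot product
# with signed-only sdot: x * y == (x - 128) * y + 64 * y + 64 * y.
# (x - 128) maps the unsigned range [0, 255] onto the signed range
# [-128, 127], and 64 fits in a signed byte, so every multiplicand fed
# to the three signed DOT_PROD_EXPR stages is a valid s8 value.
for x in range(256):            # all u8 values
    for y in range(-128, 128):  # all s8 values
        assert x * y == (x - 128) * y + 64 * y + 64 * y
print("identity holds for all u8 x s8 pairs")
```

Since (x - 128) * y + 64 * y + 64 * y simplifies to x * y exactly, the
emulation is correct for the full input ranges, and doing the two 64 * y
dot products first keeps the subtraction off the critical path.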


Thread overview: 12+ messages
2022-06-16 10:48 Tamar Christina
2022-06-16 10:49 ` [PATCH 2/2] Add SVE " Tamar Christina
2022-06-16 16:09 ` [PATCH 1/2]AArch64 Add " Richard Sandiford
2022-06-16 18:53   ` Richard Sandiford
2022-06-27  5:24     ` Tamar Christina
2022-06-27  6:09       ` Richard Biener
2022-06-28 15:54         ` Tamar Christina
2022-06-29  9:33           ` Richard Biener
2022-06-29 14:35             ` Richard Sandiford
2022-06-30  6:45               ` Richard Biener
2022-07-05  6:08                 ` Richard Sandiford [this message]
2022-07-05  7:41                   ` Richard Biener
