From: Richard Biener
Date: Tue, 5 Jul 2022 09:41:16 +0200
Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
To: Richard Biener, Tamar Christina, Richard Earnshaw, nd, "gcc-patches@gcc.gnu.org", Marcus Shawcroft, Richard Sandiford

On Tue, Jul 5, 2022 at 8:08 AM Richard Sandiford wrote:
>
> Richard Biener writes:
> > On Wed, Jun 29, 2022 at 4:35 PM Richard Sandiford wrote:
> >>
> >> Richard Biener writes:
> >> > On Tue, Jun 28, 2022 at 5:54 PM Tamar Christina wrote:
> >> >>
> >> >> > -----Original Message-----
> >> >> > From: Richard Biener
> >> >> > Sent: Monday, June 27, 2022 7:10 AM
> >> >> > To: Tamar Christina
> >> >> > Cc: Richard Sandiford; Richard Earnshaw; nd;
> >> >> > gcc-patches@gcc.gnu.org; Marcus Shawcroft
> >> >> > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
> >> >> >
> >> >> > On Mon, Jun 27, 2022 at 7:25 AM Tamar Christina via Gcc-patches
> >> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> >> > >
> >> >> > > > -----Original Message-----
> >> >> > > > From: Richard Sandiford
> >> >> > > > Sent: Thursday, June 16, 2022 7:54 PM
> >> >> > > > To: Tamar Christina
> >> >> > > > Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw;
> >> >> > > > Marcus Shawcroft; Kyrylo Tkachov
> >> >> > > > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for
> >> >> > > > usdot
> >> >> > > >
> >> >> > > > Richard Sandiford via Gcc-patches writes:
> >> >> > > > > Tamar Christina writes:
> >> >> > > > >> Hi All,
> >> >> > > > >>
> >> >> > > > >> The usdot operation is common in video encoder and decoders
> >> >> > > > >> including some of the most widely used ones.
> >> >> > > > >>
> >> >> > > > >> This patch adds a +dotprod version of the optab as a fallback for
> >> >> > > > >> when you do have sdot but not usdot available.
> >> >> > > > >>
> >> >> > > > >> The fallback works by adding a bias to the unsigned argument to
> >> >> > > > >> convert it to a signed value and then correcting for the bias later on.
> >> >> > > > >>
> >> >> > > > >> Essentially it relies on (x - 128)y + 128y == xy where x is
> >> >> > > > >> unsigned and y is signed (assuming both are 8-bit values).
> >> >> > > > >> Because the range of a signed byte is only to 127 we split the bias
> >> >> > > > >> correction into:
> >> >> > > > >>
> >> >> > > > >> (x - 128)y + 127y + y
> >> >> > > > >
> >> >> > > > > I bet you knew this question was coming, but: this technique isn't
> >> >> > > > > target-specific, so wouldn't it be better to handle it in
> >> >> > > > > tree-vect-patterns.cc instead?
> >> >> > > >
> >> >> > > Ok, so after many hours of trying I don't know how to make this work.
> >> >> > > DOT_PROD_EXPR is a reduction, but emitting them as additional pattern
> >> >> > > statements doesn't work because they'll be marked as internal_def
> >> >> > > rather than reduction_def. I tried marking the new vec_stmt_info that
> >> >> > > I create explicitly as reduction_def but this gets overwritten during analysis.
> >> >> > >
> >> >> > > I then looked into getting it as a vectorizable_operation but this has the
> >> >> > > obvious problem that it no longer treats it as a reduction and so tries to
> >> >> > > decompose into hi/lo.
> >> >> > >
> >> >> > > I then looked into treating additional patterns from a reduction as
> >> >> > > reductions themselves but this is obviously wrong as non-reduction
> >> >> > > statements also get marked as reductions.
> >> >> > >
> >> >> > > The conclusion is that I don't think the vectorizer allows additional
> >> >> > > reductions to be emitted from patterns.
> >> >> >
> >> >> > Indeed. DOT_PROD is a weird beast and it doesn't define which lanes are
> >> >> > reduced to which so it's only usable when the result is reduced to a single
> >> >> > lane.
> >> >> >
> >> >> > An SLP pattern might work if you use reduc-plus for the reduced lanes and
> >> >> > keep the multiply separate?
> >> >>
> >> >> Unfortunately I can't seem to get it to handle the reduction in SLP. It seems to always
> >> >> use the non-SLP aware loop vectorizer here. The suggested unroll factor is always 1 and
> >> >> even trying to force it gets it to bail out later, presumably because it's reducing into a
> >> >> scalar that's used outside the loop?
> >> >
> >> > Yes, it possibly needs 1-lane SLP support.
> >>
> >> As I mentioned to Tamar off-list, I feel like I've been wasting
> >> people's time recently by spewing out ideas that might or might not work
> >> (usually "not work"), so I wanted to get some confidence that the next
> >> suggestion made sense. In the end I needed most of an implementation
> >> to do that, so it seemed easiest just to finish it off rather than post
> >> it in a half-complete state. Sorry for the duplication. :-(
> >>
> >> The patch certainly isn't pretty, but I think it's the best we can
> >> do under the current infrastructure, and it should at least make
> >> the costs reasonably accurate. (Actually, that said, we probably
> >> need to patch the reduction latency calculation in the aarch64
> >> vector code -- didn't think of that until now.)
> >>
> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu. WDYT?
>
> Turned out I needed another change for this to fire on x86. Previously
> the input type (half_type) had an arbitrary sign for mixed-sign dotprods,
> which was OK for the existing code, but meant that we could sometimes
> query for unsigned dotprod instead of signed dotprod when considering
> the fallback. Fixed in the version below (which canonicalises on
> using the signed type).
>
> > Looks reasonable - does this end up in OKish code generation as well?
>
> Seems OK for aarch64. The Advanced SIMD version of vect-reduc-dot-11.c is:
>
> .L7:
>         ldr     q2, [x1, x3]
>         ldr     q1, [x2, x3]
>         sdot    v0.4s, v1.16b, v3.16b
>         add     x3, x3, 16
>         sdot    v0.4s, v1.16b, v3.16b
>         add     v2.16b, v2.16b, v4.16b
>         sdot    v0.4s, v1.16b, v2.16b
>         cmp     x3, 48
>         bne     .L7
>
> and the SVE version is:
>
> .L7:
>         ld1b    z1.b, p0/z, [x2, x3]
>         ld1b    z2.b, p0/z, [x1, x3]
>         sel     z1.b, p0, z1.b, z4.b
>         add     x3, x3, x5
>         add     z2.b, z2.b, #128
>         sdot    z0.s, z1.b, z3.b
>         whilelo p0.b, w3, w4
>         sdot    z0.s, z1.b, z3.b
>         sdot    z0.s, z1.b, z2.b
>         b.any   .L7
>
> (with the extra SEL handling a final partial vector).
>
> On x86, for -mavx:
>
> int
> f (int res, unsigned short *restrict a, short *restrict b)
> {
>   for (int i = 0; i < 256; ++i)
>     res += a[i] * b[i];
>   return res;
> }
>
> previously generated:
>
> .L2:
>         vmovdqu   (%rsi,%rax), %xmm1
>         vmovdqu   (%rdx,%rax), %xmm0
>         addq      $16, %rax
>         vpmovsxwd %xmm0, %xmm3
>         vpsrldq   $8, %xmm0, %xmm0
>         vpmovzxwd %xmm1, %xmm4
>         vpsrldq   $8, %xmm1, %xmm1
>         vpmulld   %xmm4, %xmm3, %xmm3
>         vpmovsxwd %xmm0, %xmm0
>         vpmovzxwd %xmm1, %xmm1
>         vpmulld   %xmm1, %xmm0, %xmm0
>         vpaddd    %xmm2, %xmm3, %xmm2
>         vpaddd    %xmm2, %xmm0, %xmm2
>         cmpq      $512, %rax
>         jne       .L2
>
> whereas now it generates:
>
> .L2:
>         vpmaddwd  (%rdx,%rax), %xmm3, %xmm2
>         vpaddw    (%rsi,%rax), %xmm4, %xmm1
>         vpmaddwd  (%rdx,%rax), %xmm1, %xmm1
>         addq      $16, %rax
>         vpaddd    %xmm2, %xmm0, %xmm0
>         vpaddd    %xmm2, %xmm0, %xmm0
>         vpaddd    %xmm1, %xmm0, %xmm0
>         cmpq      $512, %rax
>         jne       .L2
>
> I don't know x86 well enough to be sure that's an improvement though.
> The length of the loop carry dependency has increased from 2 to 3
> VPADDDs.

I think that should be OK.

> Tested on aarch64-linux-gnu and x86_64-linux-gnu.

OK.

Thanks,
Richard.

> Richard
>
>
> gcc/
>         * tree-vect-patterns.cc (vect_convert_input): Expect the input
>         type to be signed for optab_vector_mixed_sign. Update the vectype
>         at the same time as type.
>         (vect_recog_dot_prod_pattern): Update accordingly. If usdot isn't
>         available, try sdot instead.
>         * tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): New function.
>         (vect_model_reduction_cost): Model the cost of implementing usdot
>         using sdot.
>         (vectorizable_reduction): Likewise. Skip target support test
>         for lane reductions.
>         (vect_emulate_mixed_dot_prod): New function.
>         (vect_transform_reduction): Use it to emulate usdot via sdot.
>
> gcc/testsuite/
>         * gcc.dg/vect/vect-reduc-dot-9.c: Reduce target requirements
>         from i8mm to dotprod.
>         * gcc.dg/vect/vect-reduc-dot-10.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-11.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-12.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-13.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-14.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-15.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-16.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-17.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-18.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-19.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-20.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-21.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-22.c: Likewise.
> --- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c | 6 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c | 4 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c | 4 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c | 4 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c | 4 +- > gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c | 6 +- > gcc/tree-vect-loop.cc | 160 ++++++++++++++++-- > gcc/tree-vect-patterns.cc | 38 ++++- > 16 files changed, 213 insertions(+), 61 deletions(-) > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c > index 7ce86965ea9..34e25ab7fb0 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #define SIGNEDNESS_1 unsigned > #define SIGNEDNESS_2 unsigned > @@ -10,4 +10,4 @@ > #include "vect-reduc-dot-9.c" > > /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git 
a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c > index 0f7cbbb87ef..3af8df54cf9 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #define SIGNEDNESS_1 unsigned > #define SIGNEDNESS_2 signed > @@ -10,4 +10,4 @@ > #include "vect-reduc-dot-9.c" > > /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c > index 08412614fc6..77ceef3643b 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #define SIGNEDNESS_1 unsigned > #define SIGNEDNESS_2 signed > @@ -10,4 +10,4 @@ > #include "vect-reduc-dot-9.c" > > /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times 
"vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c > index 7ee0f45f642..d3c0c86f529 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #define SIGNEDNESS_1 signed > #define SIGNEDNESS_2 unsigned > @@ -10,4 +10,4 @@ > #include "vect-reduc-dot-9.c" > > /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c > index 2de1434528b..86a5c85753c 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #define SIGNEDNESS_1 signed > #define SIGNEDNESS_2 unsigned > @@ -10,4 +10,4 @@ > #include "vect-reduc-dot-9.c" > > /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 
"vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c > index dc48f95a32b..25de0940a65 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #define SIGNEDNESS_1 signed > #define SIGNEDNESS_2 signed > @@ -10,4 +10,4 @@ > #include "vect-reduc-dot-9.c" > > /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c > index aec62878936..4a1dec0677e 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #define SIGNEDNESS_1 signed > #define SIGNEDNESS_2 signed > @@ -10,4 +10,4 @@ > #include "vect-reduc-dot-9.c" > > /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } 
} */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c > index 38f86fe458a..90d21188b76 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #include "tree-vect.h" > > @@ -50,4 +50,4 @@ main (void) > } > > /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c > index 2e86ebe3c6c..81ecb158d29 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #include "tree-vect.h" > > @@ -50,4 +50,4 @@ main (void) > } > > /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { 
scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c > index d00f24aae4c..cbcd4f120a5 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #include "tree-vect.h" > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c > index 17adbca83a0..e81ed1da5a4 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #include "tree-vect.h" > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c > index 6cc6a4f2e92..81ce5cdaffb 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { 
dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #include "tree-vect.h" > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c > index e13d3d5c4da..b8c9d3ca53b 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #include "tree-vect.h" > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c > index d1049c96bf1..e0b132f6b35 100644 > --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c > @@ -1,6 +1,6 @@ > /* { dg-require-effective-target vect_int } */ > -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > -/* { dg-add-options arm_v8_2a_i8mm } */ > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > #include "tree-vect.h" > > @@ -50,4 +50,4 @@ main (void) > } > > /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */ > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 78dfe8519aa..3a70c15b593 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -4566,6 
+4566,31 @@ have_whole_vector_shift (machine_mode mode) > return true; > } > > +/* Return true if (a) STMT_INFO is a DOT_PROD_EXPR reduction whose > + multiplication operands have differing signs and (b) we intend > + to emulate the operation using a series of signed DOT_PROD_EXPRs. > + See vect_emulate_mixed_dot_prod for the actual sequence used. */ > + > +static bool > +vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo, > + stmt_vec_info stmt_info) > +{ > + gassign *assign = dyn_cast (stmt_info->stmt); > + if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR) > + return false; > + > + tree rhs1 = gimple_assign_rhs1 (assign); > + tree rhs2 = gimple_assign_rhs2 (assign); > + if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2))) > + return false; > + > + stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info); > + gcc_assert (reduc_info->is_reduc_info); > + return !directly_supported_p (DOT_PROD_EXPR, > + STMT_VINFO_REDUC_VECTYPE_IN (reduc_info), > + optab_vector_mixed_sign); > +} > + > /* TODO: Close dependency between vect_model_*_cost and vectorizable_* > functions. Design better to avoid maintenance issues. */ > > @@ -4601,6 +4626,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, > if (!gimple_extract_op (orig_stmt_info->stmt, &op)) > gcc_unreachable (); > > + bool emulated_mixed_dot_prod > + = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info); > if (reduction_type == EXTRACT_LAST_REDUCTION) > /* No extra instructions are needed in the prologue. The loop body > operations are costed in vectorizable_condition. */ > @@ -4628,11 +4655,20 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, > } > else > { > - /* Add in cost for initial definition. > - For cond reduction we have four vectors: initial index, step, > - initial result of the data reduction, initial value of the index > - reduction. */ > - int prologue_stmts = reduction_type == COND_REDUCTION ? 
4 : 1; > + /* Add in the cost of the initial definitions. */ > + int prologue_stmts; > + if (reduction_type == COND_REDUCTION) > + /* For cond reductions we have four vectors: initial index, step, > + initial result of the data reduction, initial value of the index > + reduction. */ > + prologue_stmts = 4; > + else if (emulated_mixed_dot_prod) > + /* We need the initial reduction value and two invariants: > + one that contains the minimum signed value and one that > + contains half of its negative. */ > + prologue_stmts = 3; > + else > + prologue_stmts = 1; > prologue_cost += record_stmt_cost (cost_vec, prologue_stmts, > scalar_to_vec, stmt_info, 0, > vect_prologue); > @@ -6797,11 +6833,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR > || op.code == WIDEN_SUM_EXPR > || op.code == SAD_EXPR); > - enum optab_subtype optab_query_kind = optab_vector; > - if (op.code == DOT_PROD_EXPR > - && (TYPE_SIGN (TREE_TYPE (op.ops[0])) > - != TYPE_SIGN (TREE_TYPE (op.ops[1])))) > - optab_query_kind = optab_vector_mixed_sign; > > if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type) > && !SCALAR_FLOAT_TYPE_P (op.type)) > @@ -7328,9 +7359,17 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > /* 4. Supportable by target? */ > bool ok = true; > > - /* 4.1. check support for the operation in the loop */ > + /* 4.1. check support for the operation in the loop > + > + This isn't necessary for the lane reduction codes, since they > + can only be produced by pattern matching, and it's up to the > + pattern matcher to test for support. The main reason for > + specifically skipping this step is to avoid rechecking whether > + mixed-sign dot-products can be implemented using signed > + dot-products. 
*/ > machine_mode vec_mode = TYPE_MODE (vectype_in); > - if (!directly_supported_p (op.code, vectype_in, optab_query_kind)) > + if (!lane_reduc_code_p > + && !directly_supported_p (op.code, vectype_in)) > { > if (dump_enabled_p ()) > dump_printf (MSG_NOTE, "op not supported by target.\n"); > @@ -7398,7 +7437,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > vect_transform_reduction. Otherwise this is costed by the > separate vectorizable_* routines. */ > if (single_defuse_cycle || lane_reduc_code_p) > - record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body); > + { > + int factor = 1; > + if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info)) > + /* Three dot-products and a subtraction. */ > + factor = 4; > + record_stmt_cost (cost_vec, ncopies * factor, vector_stmt, > + stmt_info, 0, vect_body); > + } > > if (dump_enabled_p () > && reduction_type == FOLD_LEFT_REDUCTION) > @@ -7457,6 +7503,81 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > return true; > } > > +/* STMT_INFO is a dot-product reduction whose multiplication operands > + have different signs. Emit a sequence to emulate the operation > + using a series of signed DOT_PROD_EXPRs and return the last > + statement generated. VEC_DEST is the result of the vector operation > + and VOP lists its inputs. */ > + > +static gassign * > +vect_emulate_mixed_dot_prod (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, > + gimple_stmt_iterator *gsi, tree vec_dest, > + tree vop[3]) > +{ > + tree wide_vectype = signed_type_for (TREE_TYPE (vec_dest)); > + tree narrow_vectype = signed_type_for (TREE_TYPE (vop[0])); > + tree narrow_elttype = TREE_TYPE (narrow_vectype); > + gimple *new_stmt; > + > + /* Make VOP[0] the unsigned operand VOP[1] the signed operand. */ > + if (!TYPE_UNSIGNED (TREE_TYPE (vop[0]))) > + std::swap (vop[0], vop[1]); > + > + /* Convert all inputs to signed types. 
*/ > + for (int i = 0; i < 3; ++i) > + if (TYPE_UNSIGNED (TREE_TYPE (vop[i]))) > + { > + tree tmp = make_ssa_name (signed_type_for (TREE_TYPE (vop[i]))); > + new_stmt = gimple_build_assign (tmp, NOP_EXPR, vop[i]); > + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > + vop[i] = tmp; > + } > + > + /* In the comments below we assume 8-bit inputs for simplicity, > + but the approach works for any full integer type. */ > + > + /* Create a vector of -128. */ > + tree min_narrow_elttype = TYPE_MIN_VALUE (narrow_elttype); > + tree min_narrow = build_vector_from_val (narrow_vectype, > + min_narrow_elttype); > + > + /* Create a vector of 64. */ > + auto half_wi = wi::lrshift (wi::to_wide (min_narrow_elttype), 1); > + tree half_narrow = wide_int_to_tree (narrow_elttype, half_wi); > + half_narrow = build_vector_from_val (narrow_vectype, half_narrow); > + > + /* Emit: SUB_RES = VOP[0] - 128. */ > + tree sub_res = make_ssa_name (narrow_vectype); > + new_stmt = gimple_build_assign (sub_res, PLUS_EXPR, vop[0], min_narrow); > + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > + > + /* Emit: > + > + STAGE1 = DOT_PROD_EXPR ; > + STAGE2 = DOT_PROD_EXPR ; > + STAGE3 = DOT_PROD_EXPR ; > + > + on the basis that x * y == (x - 128) * y + 64 * y + 64 * y > + Doing the two 64 * y steps first allows more time to compute x. 
*/ > + tree stage1 = make_ssa_name (wide_vectype); > + new_stmt = gimple_build_assign (stage1, DOT_PROD_EXPR, > + vop[1], half_narrow, vop[2]); > + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > + > + tree stage2 = make_ssa_name (wide_vectype); > + new_stmt = gimple_build_assign (stage2, DOT_PROD_EXPR, > + vop[1], half_narrow, stage1); > + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > + > + tree stage3 = make_ssa_name (wide_vectype); > + new_stmt = gimple_build_assign (stage3, DOT_PROD_EXPR, > + sub_res, vop[1], stage2); > + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > + > + /* Convert STAGE3 to the reduction type. */ > + return gimple_build_assign (vec_dest, CONVERT_EXPR, stage3); > +} > + > /* Transform the definition stmt STMT_INFO of a reduction PHI backedge > value. */ > > @@ -7563,12 +7684,17 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > : &vec_oprnds2)); > } > > + bool emulated_mixed_dot_prod > + = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info); > FOR_EACH_VEC_ELT (vec_oprnds0, i, def0) > { > gimple *new_stmt; > tree vop[3] = { def0, vec_oprnds1[i], NULL_TREE }; > if (masked_loop_p && !mask_by_cond_expr) > { > + /* No conditional ifns have been defined for dot-product yet. */ > + gcc_assert (code != DOT_PROD_EXPR); > + > /* Make sure that the reduction accumulator is vop[0]. 
*/ > if (reduc_index == 1) > { > @@ -7597,8 +7723,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > build_vect_cond_expr (code, vop, mask, gsi); > } > > - new_stmt = gimple_build_assign (vec_dest, code, > - vop[0], vop[1], vop[2]); > + if (emulated_mixed_dot_prod) > + new_stmt = vect_emulate_mixed_dot_prod (loop_vinfo, stmt_info, gsi, > + vec_dest, vop); > + else > + new_stmt = gimple_build_assign (vec_dest, code, > + vop[0], vop[1], vop[2]); > new_temp = make_ssa_name (vec_dest, new_stmt); > gimple_assign_set_lhs (new_stmt, new_temp); > vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc > index 8f624863971..dfbfb71b3c6 100644 > --- a/gcc/tree-vect-patterns.cc > +++ b/gcc/tree-vect-patterns.cc > @@ -760,12 +760,16 @@ vect_convert_input (vec_info *vinfo, stmt_vec_info stmt_info, tree type, > vect_unpromoted_value *unprom, tree vectype, > enum optab_subtype subtype = optab_default) > { > - > /* Update the type if the signs differ. */ > - if (subtype == optab_vector_mixed_sign > - && TYPE_SIGN (type) != TYPE_SIGN (TREE_TYPE (unprom->op))) > - type = build_nonstandard_integer_type (TYPE_PRECISION (type), > - TYPE_SIGN (unprom->type)); > + if (subtype == optab_vector_mixed_sign) > + { > + gcc_assert (!TYPE_UNSIGNED (type)); > + if (TYPE_UNSIGNED (TREE_TYPE (unprom->op))) > + { > + type = unsigned_type_for (type); > + vectype = unsigned_type_for (vectype); > + } > + } > > /* Check for a no-op conversion. */ > if (types_compatible_p (type, TREE_TYPE (unprom->op))) > @@ -1139,16 +1143,34 @@ vect_recog_dot_prod_pattern (vec_info *vinfo, > is signed; otherwise, the result has the same sign as the operands. */ > if (TYPE_PRECISION (unprom_mult.type) != TYPE_PRECISION (type) > && (subtype == optab_vector_mixed_sign > - ? TYPE_UNSIGNED (unprom_mult.type) > - : TYPE_SIGN (unprom_mult.type) != TYPE_SIGN (half_type))) > + ? 
TYPE_UNSIGNED (unprom_mult.type) > + : TYPE_SIGN (unprom_mult.type) != TYPE_SIGN (half_type))) > return NULL; > > vect_pattern_detected ("vect_recog_dot_prod_pattern", last_stmt); > > + /* If the inputs have mixed signs, canonicalize on using the signed > + input type for analysis. This also helps when emulating mixed-sign > + operations using signed operations. */ > + if (subtype == optab_vector_mixed_sign) > + half_type = signed_type_for (half_type); > + > tree half_vectype; > if (!vect_supportable_direct_optab_p (vinfo, type, DOT_PROD_EXPR, half_type, > type_out, &half_vectype, subtype)) > - return NULL; > + { > + /* We can emulate a mixed-sign dot-product using a sequence of > + signed dot-products; see vect_emulate_mixed_dot_prod for details. */ > + if (subtype != optab_vector_mixed_sign > + || !vect_supportable_direct_optab_p (vinfo, signed_type_for (type), > + DOT_PROD_EXPR, half_type, > + type_out, &half_vectype, > + optab_vector)) > + return NULL; > + > + *type_out = signed_or_unsigned_type_for (TYPE_UNSIGNED (type), > + *type_out); > + } > > /* Get the inputs in the appropriate types. */ > tree mult_oprnd[2]; > -- > 2.25.1 >
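For reference, the arithmetic behind vect_emulate_mixed_dot_prod can be checked in scalar form. The following is only a sketch under stated assumptions: it uses 8-bit inputs, a 4-lane group and a 32-bit accumulator, and the helper names (sdot4, usdot4_ref, usdot4_emulated, emulation_matches_reference) are made up for illustration and are not part of the patch. sdot4 stands in for one signed DOT_PROD_EXPR.

```cpp
#include <cstdint>

// Hypothetical stand-in for one signed DOT_PROD_EXPR on a 4-lane group:
// returns acc + sum (a[i] * b[i]).
static int32_t sdot4 (const int8_t a[4], const int8_t b[4], int32_t acc)
{
  for (int i = 0; i < 4; ++i)
    acc += int32_t (a[i]) * int32_t (b[i]);
  return acc;
}

// Reference mixed-sign dot product: unsigned x times signed y.
static int32_t usdot4_ref (const uint8_t x[4], const int8_t y[4], int32_t acc)
{
  for (int i = 0; i < 4; ++i)
    acc += int32_t (x[i]) * int32_t (y[i]);
  return acc;
}

// Emulation using only signed dot products, following the three stages
// described in the patch comment:
//   x * y == (x - 128) * y + 64 * y + 64 * y
static int32_t usdot4_emulated (const uint8_t x[4], const int8_t y[4],
                                int32_t acc)
{
  int8_t sub_res[4], half[4];
  for (int i = 0; i < 4; ++i)
    {
      // x is in [0, 255], so x - 128 always fits in a signed 8-bit value.
      sub_res[i] = int8_t (int (x[i]) - 128);
      half[i] = 64;
    }
  int32_t stage1 = sdot4 (y, half, acc);     // acc + 64 * y
  int32_t stage2 = sdot4 (y, half, stage1);  // ... + 64 * y
  return sdot4 (sub_res, y, stage2);         // ... + (x - 128) * y
}

// Compare the emulation against the reference for every 8-bit (x, y) seed;
// the other lanes are derived from the seed so that they differ.
static bool emulation_matches_reference ()
{
  for (int xs = 0; xs < 256; ++xs)
    for (int ys = -128; ys < 128; ++ys)
      {
        uint8_t x[4];
        int8_t y[4];
        for (int i = 0; i < 4; ++i)
          {
            x[i] = uint8_t ((xs + 37 * i) & 0xff);
            y[i] = int8_t (((ys + 128 + 91 * i) & 0xff) - 128);
          }
        if (usdot4_emulated (x, y, 1000) != usdot4_ref (x, y, 1000))
          return false;
      }
  return true;
}
```

Calling emulation_matches_reference () covers the extreme inputs (x = 255 with y = -128, and so on). The split into two 64 * y steps is needed because 128 itself is not representable as a signed 8-bit element, whereas 64 is.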