From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Date: Thu, 30 Jun 2022 08:45:26 +0200
Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
To: Richard Biener, Tamar Christina, Richard Earnshaw, nd, "gcc-patches@gcc.gnu.org", Marcus Shawcroft, Richard Sandiford
Content-Type: text/plain; charset="UTF-8"
X-BeenThere: gcc-patches@gcc.gnu.org
List-Id: Gcc-patches mailing list

On Wed, Jun 29, 2022 at 4:35 PM Richard Sandiford wrote:
>
> Richard Biener writes:
> > On Tue, Jun 28, 2022 at 5:54 PM Tamar Christina wrote:
> >>
> >> > -----Original Message-----
> >> > From: Richard Biener
> >> > Sent: Monday, June 27, 2022 7:10 AM
> >> > To: Tamar Christina
> >> > Cc: Richard Sandiford; Richard Earnshaw; nd;
> >> > gcc-patches@gcc.gnu.org; Marcus Shawcroft
> >> > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
> >> >
> >> > On Mon, Jun 27, 2022 at 7:25 AM Tamar Christina via Gcc-patches
> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> > >
> >> > > > -----Original Message-----
> >> > > > From: Richard Sandiford
> >> > > > Sent: Thursday, June 16, 2022 7:54 PM
> >> > > > To: Tamar Christina
> >> > > > Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw;
> >> > > > Marcus Shawcroft; Kyrylo Tkachov
> >> > > > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for
> >> > > > usdot
> >> > > >
> >> > > > Richard Sandiford via Gcc-patches writes:
> >> > > > > Tamar Christina writes:
> >> > > > >> Hi All,
> >> > > > >>
> >> > > > >> The usdot operation is common in video encoders and decoders,
> >> > > > >> including some of the most widely used ones.
> >> > > > >>
> >> > > > >> This patch adds a +dotprod version of the optab as a fallback for
> >> > > > >> when you do have sdot but not usdot available.
> >> > > > >>
> >> > > > >> The fallback works by adding a bias to the unsigned argument to
> >> > > > >> convert it to a signed value and then correcting for the bias later on.
> >> > > > >>
> >> > > > >> Essentially it relies on (x - 128)y + 128y == xy where x is
> >> > > > >> unsigned and y is signed (assuming both are 8-bit values).
> >> > > > >> Because the range of a signed byte only goes up to 127, we split
> >> > > > >> the bias correction into:
> >> > > > >>
> >> > > > >> (x - 128)y + 127y + y
> >> > > > >
> >> > > > > I bet you knew this question was coming, but: this technique isn't
> >> > > > > target-specific, so wouldn't it be better to handle it in
> >> > > > > tree-vect-patterns.cc instead?
> >> > >
> >> > > Ok, so after many hours of trying I don't know how to make this work.
> >> > > DOT_PROD_EXPR is a reduction, but emitting them as an additional pattern
> >> > > statement doesn't work because they'll be marked as internal_def
> >> > > rather than reduction_def.  I tried marking the new vec_stmt_info that
> >> > > I create explicitly as reduction_def, but this gets overwritten during
> >> > > analysis.
> >> > >
> >> > > I then looked into handling it as a vectorizable_operation, but this has
> >> > > the obvious problem that it is no longer treated as a reduction and so
> >> > > gets decomposed into hi/lo.
> >> > >
> >> > > I then looked into treating additional patterns from a reduction as
> >> > > reductions themselves, but this is obviously wrong as non-reduction
> >> > > statements also get marked as reductions.
> >> > >
> >> > > The conclusion is that I don't think the vectorizer allows additional
> >> > > reductions to be emitted from patterns.
> >> >
> >> > Indeed.  DOT_PROD is a weird beast and it doesn't define which lanes are
> >> > reduced to which, so it's only usable when the result is reduced to a
> >> > single lane.
> >> >
> >> > An SLP pattern might work if you use reduc-plus for the reduced lanes and
> >> > keep the multiply separate?
> >>
> >> Unfortunately I can't seem to get it to handle the reduction in SLP.  It
> >> seems to always use the non-SLP aware loop vectorizer here.
> >> The suggested unroll factor is always 1, and
> >> even trying to force it gets it to bail out later, presumably because it's
> >> reducing into a scalar that's used outside the loop?
> >
> > Yes, it possibly needs 1-lane SLP support.
>
> As I mentioned to Tamar off-list, I feel like I've been wasting
> people's time recently by spewing out ideas that might or might not work
> (usually "not work"), so I wanted to get some confidence that the next
> suggestion made sense.  In the end I needed most of an implementation
> to do that, so it seemed easiest just to finish it off rather than post
> it in a half-complete state.  Sorry for the duplication. :-(
>
> The patch certainly isn't pretty, but I think it's the best we can
> do under the current infrastructure, and it should at least make
> the costs reasonably accurate.  (Actually, that said, we probably
> need to patch the reduction latency calculation in the aarch64
> vector code -- didn't think of that until now.)
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  WDYT?

Looks reasonable - does this end up in OKish code generation as well?

Thanks,
Richard.

> Thanks,
> Richard
>
> ----------------
>
> Following a suggestion from Tamar, this patch adds a fallback
> implementation of usdot using sdot.  Specifically, for 8-bit
> input types:
>
>   acc_2 = DOT_PROD_EXPR <a_unsigned, b_signed, acc_1>;
>
> becomes:
>
>   tmp_1 = DOT_PROD_EXPR <64, b_signed, acc_1>;
>   tmp_2 = DOT_PROD_EXPR <64, b_signed, tmp_1>;
>   acc_2 = DOT_PROD_EXPR <a_unsigned - 128, b_signed, tmp_2>;
>
> on the basis that x*y == (x-128)*y + 64*y + 64*y.  Doing the two 64*y
> operations first should give more time for x to be calculated,
> on the off chance that that's useful.
>
> gcc/
> 	* tree-vect-patterns.cc (vect_recog_dot_prod_pattern): If usdot
> 	isn't available, try sdot instead.
> 	* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): New function.
> 	(vect_model_reduction_cost): Model the cost of implementing usdot
> 	using sdot.
> 	(vectorizable_reduction): Likewise.  Skip target support test
> 	for lane reductions.
> 	(vect_emulate_mixed_dot_prod): New function.
> 	(vect_transform_reduction): Use it to emulate usdot via sdot.
>
> gcc/testsuite/
> 	* gcc.dg/vect/vect-reduc-dot-9.c: Reduce target requirements
> 	from i8mm to dotprod.
> 	* gcc.dg/vect/vect-reduc-dot-10.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-11.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-12.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-13.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-14.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-15.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-16.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-17.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-18.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-19.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-20.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-21.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-22.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c  |   6 +-
>  gcc/tree-vect-loop.cc                         | 160 ++++++++++++++++--
>  gcc/tree-vect-patterns.cc                     |  14 +-
>  16 files changed, 196 insertions(+), 54 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> index 7ce86965ea9..34e25ab7fb0 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> index 0f7cbbb87ef..3af8df54cf9 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> index 08412614fc6..77ceef3643b 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> index 7ee0f45f642..d3c0c86f529 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> index 2de1434528b..86a5c85753c 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> index dc48f95a32b..25de0940a65 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> index aec62878936..4a1dec0677e 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> index 38f86fe458a..90d21188b76 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> index 2e86ebe3c6c..81ecb158d29 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> index d00f24aae4c..cbcd4f120a5 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> index 17adbca83a0..e81ed1da5a4 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> index 6cc6a4f2e92..81ce5cdaffb 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> index e13d3d5c4da..b8c9d3ca53b 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> index d1049c96bf1..e0b132f6b35 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 78dfe8519aa..3a70c15b593 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -4566,6 +4566,31 @@ have_whole_vector_shift (machine_mode mode)
>    return true;
>  }
>
> +/* Return true if (a) STMT_INFO is a DOT_PROD_EXPR reduction whose
> +   multiplication operands have differing signs and (b) we intend
> +   to emulate the operation using a series of signed DOT_PROD_EXPRs.
> +   See vect_emulate_mixed_dot_prod for the actual sequence used.
> +   */
> +
> +static bool
> +vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo,
> +				 stmt_vec_info stmt_info)
> +{
> +  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
> +  if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR)
> +    return false;
> +
> +  tree rhs1 = gimple_assign_rhs1 (assign);
> +  tree rhs2 = gimple_assign_rhs2 (assign);
> +  if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2)))
> +    return false;
> +
> +  stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
> +  gcc_assert (reduc_info->is_reduc_info);
> +  return !directly_supported_p (DOT_PROD_EXPR,
> +				STMT_VINFO_REDUC_VECTYPE_IN (reduc_info),
> +				optab_vector_mixed_sign);
> +}
> +
>  /* TODO: Close dependency between vect_model_*_cost and vectorizable_*
>     functions.  Design better to avoid maintenance issues.  */
>
> @@ -4601,6 +4626,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>    if (!gimple_extract_op (orig_stmt_info->stmt, &op))
>      gcc_unreachable ();
>
> +  bool emulated_mixed_dot_prod
> +    = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      /* No extra instructions are needed in the prologue.  The loop body
>         operations are costed in vectorizable_condition.  */
> @@ -4628,11 +4655,20 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>      }
>    else
>      {
> -      /* Add in cost for initial definition.
> -	 For cond reduction we have four vectors: initial index, step,
> -	 initial result of the data reduction, initial value of the index
> -	 reduction.  */
> -      int prologue_stmts = reduction_type == COND_REDUCTION ? 4 : 1;
> +      /* Add in the cost of the initial definitions.  */
> +      int prologue_stmts;
> +      if (reduction_type == COND_REDUCTION)
> +	/* For cond reductions we have four vectors: initial index, step,
> +	   initial result of the data reduction, initial value of the index
> +	   reduction.  */
> +	prologue_stmts = 4;
> +      else if (emulated_mixed_dot_prod)
> +	/* We need the initial reduction value and two invariants:
> +	   one that contains the minimum signed value and one that
> +	   contains half of its negative.  */
> +	prologue_stmts = 3;
> +      else
> +	prologue_stmts = 1;
>        prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
>  					 scalar_to_vec, stmt_info, 0,
>  					 vect_prologue);
> @@ -6797,11 +6833,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR
>  			    || op.code == WIDEN_SUM_EXPR
>  			    || op.code == SAD_EXPR);
> -  enum optab_subtype optab_query_kind = optab_vector;
> -  if (op.code == DOT_PROD_EXPR
> -      && (TYPE_SIGN (TREE_TYPE (op.ops[0]))
> -	  != TYPE_SIGN (TREE_TYPE (op.ops[1]))))
> -    optab_query_kind = optab_vector_mixed_sign;
>
>    if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type)
>        && !SCALAR_FLOAT_TYPE_P (op.type))
> @@ -7328,9 +7359,17 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    /* 4. Supportable by target?  */
>    bool ok = true;
>
> -  /* 4.1. check support for the operation in the loop  */
> +  /* 4.1. check support for the operation in the loop
> +
> +     This isn't necessary for the lane reduction codes, since they
> +     can only be produced by pattern matching, and it's up to the
> +     pattern matcher to test for support.  The main reason for
> +     specifically skipping this step is to avoid rechecking whether
> +     mixed-sign dot-products can be implemented using signed
> +     dot-products.  */
>    machine_mode vec_mode = TYPE_MODE (vectype_in);
> -  if (!directly_supported_p (op.code, vectype_in, optab_query_kind))
> +  if (!lane_reduc_code_p
> +      && !directly_supported_p (op.code, vectype_in))
>      {
>        if (dump_enabled_p ())
>  	dump_printf (MSG_NOTE, "op not supported by target.\n");
> @@ -7398,7 +7437,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>       vect_transform_reduction.  Otherwise this is costed by the
>       separate vectorizable_* routines.  */
>    if (single_defuse_cycle || lane_reduc_code_p)
> -    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> +    {
> +      int factor = 1;
> +      if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info))
> +	/* Three dot-products and a subtraction.  */
> +	factor = 4;
> +      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> +			stmt_info, 0, vect_body);
> +    }
>
>    if (dump_enabled_p ()
>        && reduction_type == FOLD_LEFT_REDUCTION)
> @@ -7457,6 +7503,81 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    return true;
>  }
>
> +/* STMT_INFO is a dot-product reduction whose multiplication operands
> +   have different signs.  Emit a sequence to emulate the operation
> +   using a series of signed DOT_PROD_EXPRs and return the last
> +   statement generated.  VEC_DEST is the result of the vector operation
> +   and VOP lists its inputs.  */
> +
> +static gassign *
> +vect_emulate_mixed_dot_prod (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +			     gimple_stmt_iterator *gsi, tree vec_dest,
> +			     tree vop[3])
> +{
> +  tree wide_vectype = signed_type_for (TREE_TYPE (vec_dest));
> +  tree narrow_vectype = signed_type_for (TREE_TYPE (vop[0]));
> +  tree narrow_elttype = TREE_TYPE (narrow_vectype);
> +  gimple *new_stmt;
> +
> +  /* Make VOP[0] the unsigned operand and VOP[1] the signed operand.  */
> +  if (!TYPE_UNSIGNED (TREE_TYPE (vop[0])))
> +    std::swap (vop[0], vop[1]);
> +
> +  /* Convert all inputs to signed types.  */
> +  for (int i = 0; i < 3; ++i)
> +    if (TYPE_UNSIGNED (TREE_TYPE (vop[i])))
> +      {
> +	tree tmp = make_ssa_name (signed_type_for (TREE_TYPE (vop[i])));
> +	new_stmt = gimple_build_assign (tmp, NOP_EXPR, vop[i]);
> +	vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +	vop[i] = tmp;
> +      }
> +
> +  /* In the comments below we assume 8-bit inputs for simplicity,
> +     but the approach works for any full integer type.  */
> +
> +  /* Create a vector of -128.  */
> +  tree min_narrow_elttype = TYPE_MIN_VALUE (narrow_elttype);
> +  tree min_narrow = build_vector_from_val (narrow_vectype,
> +					   min_narrow_elttype);
> +
> +  /* Create a vector of 64.  */
> +  auto half_wi = wi::lrshift (wi::to_wide (min_narrow_elttype), 1);
> +  tree half_narrow = wide_int_to_tree (narrow_elttype, half_wi);
> +  half_narrow = build_vector_from_val (narrow_vectype, half_narrow);
> +
> +  /* Emit: SUB_RES = VOP[0] - 128.  */
> +  tree sub_res = make_ssa_name (narrow_vectype);
> +  new_stmt = gimple_build_assign (sub_res, PLUS_EXPR, vop[0], min_narrow);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  /* Emit:
> +
> +       STAGE1 = DOT_PROD_EXPR <VOP[1], 64, VOP[2]>;
> +       STAGE2 = DOT_PROD_EXPR <VOP[1], 64, STAGE1>;
> +       STAGE3 = DOT_PROD_EXPR <SUB_RES, VOP[1], STAGE2>;
> +
> +     on the basis that x * y == (x - 128) * y + 64 * y + 64 * y.
> +     Doing the two 64 * y steps first allows more time to compute x.  */
> +  tree stage1 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage1, DOT_PROD_EXPR,
> +				  vop[1], half_narrow, vop[2]);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  tree stage2 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage2, DOT_PROD_EXPR,
> +				  vop[1], half_narrow, stage1);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  tree stage3 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage3, DOT_PROD_EXPR,
> +				  sub_res, vop[1], stage2);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  /* Convert STAGE3 to the reduction type.  */
> +  return gimple_build_assign (vec_dest, CONVERT_EXPR, stage3);
> +}
> +
>  /* Transform the definition stmt STMT_INFO of a reduction PHI backedge
>     value.  */
>
> @@ -7563,12 +7684,17 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>  			   : &vec_oprnds2));
>      }
>
> +  bool emulated_mixed_dot_prod
> +    = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
>    FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
>      {
>        gimple *new_stmt;
>        tree vop[3] = { def0, vec_oprnds1[i], NULL_TREE };
>        if (masked_loop_p && !mask_by_cond_expr)
>  	{
> +	  /* No conditional ifns have been defined for dot-product yet.  */
> +	  gcc_assert (code != DOT_PROD_EXPR);
> +
>  	  /* Make sure that the reduction accumulator is vop[0].  */
>  	  if (reduc_index == 1)
>  	    {
> @@ -7597,8 +7723,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>  	      build_vect_cond_expr (code, vop, mask, gsi);
>  	    }
>
> -	  new_stmt = gimple_build_assign (vec_dest, code,
> -					  vop[0], vop[1], vop[2]);
> +	  if (emulated_mixed_dot_prod)
> +	    new_stmt = vect_emulate_mixed_dot_prod (loop_vinfo, stmt_info, gsi,
> +						    vec_dest, vop);
> +	  else
> +	    new_stmt = gimple_build_assign (vec_dest, code,
> +					    vop[0], vop[1], vop[2]);
>  	  new_temp = make_ssa_name (vec_dest, new_stmt);
>  	  gimple_assign_set_lhs (new_stmt, new_temp);
>  	  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 8f624863971..b336f12e6be 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -1148,7 +1148,19 @@ vect_recog_dot_prod_pattern (vec_info *vinfo,
>    tree half_vectype;
>    if (!vect_supportable_direct_optab_p (vinfo, type, DOT_PROD_EXPR, half_type,
>  					type_out, &half_vectype, subtype))
> -    return NULL;
> +    {
> +      /* We can emulate a mixed-sign dot-product using a sequence of
> +	 signed dot-products; see vect_emulate_mixed_dot_prod for details.  */
> +      if (subtype != optab_vector_mixed_sign
> +	  || !vect_supportable_direct_optab_p (vinfo, signed_type_for (type),
> +					       DOT_PROD_EXPR, half_type,
> +					       type_out, &half_vectype,
> +					       optab_vector))
> +	return NULL;
> +
> +      *type_out = signed_or_unsigned_type_for (TYPE_UNSIGNED (type),
> +					       *type_out);
> +    }
>
>    /* Get the inputs in the appropriate types.  */
>    tree mult_oprnd[2];
> --
> 2.25.1
>
>