From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Date: Thu, 30 Jun 2022 08:45:26 +0200
Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
To: Richard Biener, Tamar Christina, Richard Earnshaw, nd, "gcc-patches@gcc.gnu.org", Marcus Shawcroft, Richard Sandiford
Content-Type: text/plain; charset="UTF-8"
X-BeenThere: gcc-patches@gcc.gnu.org
List-Id: Gcc-patches mailing list

On Wed, Jun 29, 2022 at 4:35 PM Richard Sandiford wrote:
>
> Richard Biener writes:
> > On Tue, Jun 28, 2022 at 5:54 PM Tamar Christina wrote:
> >>
> >> > -----Original Message-----
> >> > From: Richard Biener
> >> > Sent: Monday, June 27, 2022 7:10 AM
> >> > To: Tamar Christina
> >> > Cc: Richard Sandiford; Richard Earnshaw; nd;
> >> > gcc-patches@gcc.gnu.org; Marcus Shawcroft
> >> > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
> >> >
> >> > On Mon, Jun 27, 2022 at 7:25 AM Tamar Christina via Gcc-patches
> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> > >
> >> > > > -----Original Message-----
> >> > > > From: Richard Sandiford
> >> > > > Sent: Thursday, June 16, 2022 7:54 PM
> >> > > > To: Tamar Christina
> >> > > > Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw;
> >> > > > Marcus Shawcroft; Kyrylo Tkachov
> >> > > > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for
> >> > > > usdot
> >> > > >
> >> > > > Richard Sandiford via Gcc-patches writes:
> >> > > > > Tamar Christina writes:
> >> > > > >> Hi All,
> >> > > > >>
> >> > > > >> The usdot operation is common in video encoders and decoders,
> >> > > > >> including some of the most widely used ones.
> >> > > > >>
> >> > > > >> This patch adds a +dotprod version of the optab as a fallback for
> >> > > > >> when you do have sdot but not usdot available.
> >> > > > >>
> >> > > > >> The fallback works by adding a bias to the unsigned argument to
> >> > > > >> convert it to a signed value and then correcting for the bias later on.
> >> > > > >>
> >> > > > >> Essentially it relies on (x - 128)y + 128y == xy where x is
> >> > > > >> unsigned and y is signed (assuming both are 8-bit values).
> >> > > > >> Because the range of a signed byte only goes up to 127, we split
> >> > > > >> the bias correction into:
> >> > > > >>
> >> > > > >> (x - 128)y + 127y + y
> >> > > > >
> >> > > > > I bet you knew this question was coming, but: this technique isn't
> >> > > > > target-specific, so wouldn't it be better to handle it in
> >> > > > > tree-vect-patterns.cc instead?
> >> > >
> >> > > Ok, so after many hours of trying I don't know how to make this work.
> >> > > DOT_PROD_EXPR is a reduction, but emitting them as an additional pattern
> >> > > statement doesn't work because they'll be marked as internal_def
> >> > > rather than reduction_def.  I tried marking the new vec_stmt_info that
> >> > > I create explicitly as reduction_def, but this gets overwritten during
> >> > > analysis.
> >> > >
> >> > > I then looked into handling it as a vectorizable_operation, but this has
> >> > > the obvious problem that it is no longer treated as a reduction and so
> >> > > gets decomposed into hi/lo.
> >> > >
> >> > > I then looked into treating additional patterns from a reduction as
> >> > > reductions themselves, but this is obviously wrong as non-reduction
> >> > > statements also get marked as reductions.
> >> > >
> >> > > The conclusion is that I don't think the vectorizer allows additional
> >> > > reductions to be emitted from patterns.
> >> >
> >> > Indeed.  DOT_PROD is a weird beast and it doesn't define which lanes are
> >> > reduced to which, so it's only usable when the result is reduced to a
> >> > single lane.
> >> >
> >> > An SLP pattern might work if you use reduc-plus for the reduced lanes and
> >> > keep the multiply separate?
> >>
> >> Unfortunately I can't seem to get it to handle the reduction in SLP.  It
> >> seems to always use the non-SLP aware loop vectorizer here.
> >> The suggested unroll factor is always 1, and
> >> even trying to force it gets it to bail out later, presumably because it's
> >> reducing into a scalar that's used outside the loop?
> >
> > Yes, it possibly needs 1-lane SLP support.
>
> As I mentioned to Tamar off-list, I feel like I've been wasting
> people's time recently by spewing out ideas that might or might not work
> (usually "not work"), so I wanted to get some confidence that the next
> suggestion made sense.  In the end I needed most of an implementation
> to do that, so it seemed easiest just to finish it off rather than post
> it in a half-complete state.  Sorry for the duplication. :-(
>
> The patch certainly isn't pretty, but I think it's the best we can
> do under the current infrastructure, and it should at least make
> the costs reasonably accurate.  (Actually, that said, we probably
> need to patch the reduction latency calculation in the aarch64
> vector code -- didn't think of that until now.)
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  WDYT?

Looks reasonable - does this end up in OKish code generation as well?

Thanks,
Richard.

> Thanks,
> Richard
>
> ----------------
>
> Following a suggestion from Tamar, this patch adds a fallback
> implementation of usdot using sdot.  Specifically, for 8-bit
> input types:
>
>   acc_2 = DOT_PROD_EXPR <a_unsigned, b_signed, acc_1>;
>
> becomes:
>
>   tmp_1 = DOT_PROD_EXPR <64, b_signed, acc_1>;
>   tmp_2 = DOT_PROD_EXPR <64, b_signed, tmp_1>;
>   acc_2 = DOT_PROD_EXPR <a_unsigned - 128, b_signed, tmp_2>;
>
> on the basis that x*y == (x-128)*y + 64*y + 64*y.  Doing the two 64*y
> operations first should give more time for x to be calculated,
> on the off chance that that's useful.
>
> gcc/
> 	* tree-vect-patterns.cc (vect_recog_dot_prod_pattern): If usdot
> 	isn't available, try sdot instead.
> 	* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): New function.
> 	(vect_model_reduction_cost): Model the cost of implementing usdot
> 	using sdot.
> 	(vectorizable_reduction): Likewise.  Skip target support test
> 	for lane reductions.
> 	(vect_emulate_mixed_dot_prod): New function.
> 	(vect_transform_reduction): Use it to emulate usdot via sdot.
>
> gcc/testsuite/
> 	* gcc.dg/vect/vect-reduc-dot-9.c: Reduce target requirements
> 	from i8mm to dotprod.
> 	* gcc.dg/vect/vect-reduc-dot-10.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-11.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-12.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-13.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-14.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-15.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-16.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-17.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-18.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-19.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-20.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-21.c: Likewise.
> 	* gcc.dg/vect/vect-reduc-dot-22.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c  |   6 +-
>  gcc/tree-vect-loop.cc                         | 160 ++++++++++++++++--
>  gcc/tree-vect-patterns.cc                     |  14 +-
>  16 files changed, 196 insertions(+), 54 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> index 7ce86965ea9..34e25ab7fb0 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> index 0f7cbbb87ef..3af8df54cf9 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> index 08412614fc6..77ceef3643b 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> index 7ee0f45f642..d3c0c86f529 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> index 2de1434528b..86a5c85753c 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> index dc48f95a32b..25de0940a65 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> index aec62878936..4a1dec0677e 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> index 38f86fe458a..90d21188b76 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> index 2e86ebe3c6c..81ecb158d29 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> index d00f24aae4c..cbcd4f120a5 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> index 17adbca83a0..e81ed1da5a4 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> index 6cc6a4f2e92..81ce5cdaffb 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> index e13d3d5c4da..b8c9d3ca53b 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> index d1049c96bf1..e0b132f6b35 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon } */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 78dfe8519aa..3a70c15b593 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -4566,6 +4566,31 @@ have_whole_vector_shift (machine_mode mode)
>    return true;
>  }
>
> +/* Return true if (a) STMT_INFO is a DOT_PROD_EXPR reduction whose
> +   multiplication operands have differing signs and (b) we intend
> +   to emulate the operation using a series of signed DOT_PROD_EXPRs.
> +   See vect_emulate_mixed_dot_prod for the actual sequence used.
> +   */
> +
> +static bool
> +vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo,
> +				 stmt_vec_info stmt_info)
> +{
> +  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
> +  if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR)
> +    return false;
> +
> +  tree rhs1 = gimple_assign_rhs1 (assign);
> +  tree rhs2 = gimple_assign_rhs2 (assign);
> +  if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2)))
> +    return false;
> +
> +  stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
> +  gcc_assert (reduc_info->is_reduc_info);
> +  return !directly_supported_p (DOT_PROD_EXPR,
> +				STMT_VINFO_REDUC_VECTYPE_IN (reduc_info),
> +				optab_vector_mixed_sign);
> +}
> +
>  /* TODO: Close dependency between vect_model_*_cost and vectorizable_*
>     functions.  Design better to avoid maintenance issues.  */
>
> @@ -4601,6 +4626,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>    if (!gimple_extract_op (orig_stmt_info->stmt, &op))
>      gcc_unreachable ();
>
> +  bool emulated_mixed_dot_prod
> +    = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      /* No extra instructions are needed in the prologue.  The loop body
>         operations are costed in vectorizable_condition.  */
> @@ -4628,11 +4655,20 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>      }
>    else
>      {
> -      /* Add in cost for initial definition.
> -	 For cond reduction we have four vectors: initial index, step,
> -	 initial result of the data reduction, initial value of the index
> -	 reduction.  */
> -      int prologue_stmts = reduction_type == COND_REDUCTION ? 4 : 1;
> +      /* Add in the cost of the initial definitions.  */
> +      int prologue_stmts;
> +      if (reduction_type == COND_REDUCTION)
> +	/* For cond reductions we have four vectors: initial index, step,
> +	   initial result of the data reduction, initial value of the index
> +	   reduction.  */
> +	prologue_stmts = 4;
> +      else if (emulated_mixed_dot_prod)
> +	/* We need the initial reduction value and two invariants:
> +	   one that contains the minimum signed value and one that
> +	   contains half of its negative.  */
> +	prologue_stmts = 3;
> +      else
> +	prologue_stmts = 1;
>        prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
>  					 scalar_to_vec, stmt_info, 0,
>  					 vect_prologue);
> @@ -6797,11 +6833,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR
>  			    || op.code == WIDEN_SUM_EXPR
>  			    || op.code == SAD_EXPR);
> -  enum optab_subtype optab_query_kind = optab_vector;
> -  if (op.code == DOT_PROD_EXPR
> -      && (TYPE_SIGN (TREE_TYPE (op.ops[0]))
> -	  != TYPE_SIGN (TREE_TYPE (op.ops[1]))))
> -    optab_query_kind = optab_vector_mixed_sign;
>
>    if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type)
>        && !SCALAR_FLOAT_TYPE_P (op.type))
> @@ -7328,9 +7359,17 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    /* 4. Supportable by target?  */
>    bool ok = true;
>
> -  /* 4.1. check support for the operation in the loop  */
> +  /* 4.1. check support for the operation in the loop
> +
> +     This isn't necessary for the lane reduction codes, since they
> +     can only be produced by pattern matching, and it's up to the
> +     pattern matcher to test for support.  The main reason for
> +     specifically skipping this step is to avoid rechecking whether
> +     mixed-sign dot-products can be implemented using signed
> +     dot-products.  */
>    machine_mode vec_mode = TYPE_MODE (vectype_in);
> -  if (!directly_supported_p (op.code, vectype_in, optab_query_kind))
> +  if (!lane_reduc_code_p
> +      && !directly_supported_p (op.code, vectype_in))
>      {
>        if (dump_enabled_p ())
>  	dump_printf (MSG_NOTE, "op not supported by target.\n");
> @@ -7398,7 +7437,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>       vect_transform_reduction.  Otherwise this is costed by the
>       separate vectorizable_* routines.  */
>    if (single_defuse_cycle || lane_reduc_code_p)
> -    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> +    {
> +      int factor = 1;
> +      if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info))
> +	/* Three dot-products and a subtraction.  */
> +	factor = 4;
> +      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> +			stmt_info, 0, vect_body);
> +    }
>
>    if (dump_enabled_p ()
>        && reduction_type == FOLD_LEFT_REDUCTION)
> @@ -7457,6 +7503,81 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    return true;
>  }
>
> +/* STMT_INFO is a dot-product reduction whose multiplication operands
> +   have different signs.  Emit a sequence to emulate the operation
> +   using a series of signed DOT_PROD_EXPRs and return the last
> +   statement generated.  VEC_DEST is the result of the vector operation
> +   and VOP lists its inputs.  */
> +
> +static gassign *
> +vect_emulate_mixed_dot_prod (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +			     gimple_stmt_iterator *gsi, tree vec_dest,
> +			     tree vop[3])
> +{
> +  tree wide_vectype = signed_type_for (TREE_TYPE (vec_dest));
> +  tree narrow_vectype = signed_type_for (TREE_TYPE (vop[0]));
> +  tree narrow_elttype = TREE_TYPE (narrow_vectype);
> +  gimple *new_stmt;
> +
> +  /* Make VOP[0] the unsigned operand and VOP[1] the signed operand.  */
> +  if (!TYPE_UNSIGNED (TREE_TYPE (vop[0])))
> +    std::swap (vop[0], vop[1]);
> +
> +  /* Convert all inputs to signed types.  */
> +  for (int i = 0; i < 3; ++i)
> +    if (TYPE_UNSIGNED (TREE_TYPE (vop[i])))
> +      {
> +	tree tmp = make_ssa_name (signed_type_for (TREE_TYPE (vop[i])));
> +	new_stmt = gimple_build_assign (tmp, NOP_EXPR, vop[i]);
> +	vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +	vop[i] = tmp;
> +      }
> +
> +  /* In the comments below we assume 8-bit inputs for simplicity,
> +     but the approach works for any full integer type.  */
> +
> +  /* Create a vector of -128.  */
> +  tree min_narrow_elttype = TYPE_MIN_VALUE (narrow_elttype);
> +  tree min_narrow = build_vector_from_val (narrow_vectype,
> +					   min_narrow_elttype);
> +
> +  /* Create a vector of 64.  */
> +  auto half_wi = wi::lrshift (wi::to_wide (min_narrow_elttype), 1);
> +  tree half_narrow = wide_int_to_tree (narrow_elttype, half_wi);
> +  half_narrow = build_vector_from_val (narrow_vectype, half_narrow);
> +
> +  /* Emit: SUB_RES = VOP[0] - 128.  */
> +  tree sub_res = make_ssa_name (narrow_vectype);
> +  new_stmt = gimple_build_assign (sub_res, PLUS_EXPR, vop[0], min_narrow);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  /* Emit:
> +
> +       STAGE1 = DOT_PROD_EXPR <VOP[1], 64, VOP[2]>;
> +       STAGE2 = DOT_PROD_EXPR <VOP[1], 64, STAGE1>;
> +       STAGE3 = DOT_PROD_EXPR <SUB_RES, VOP[1], STAGE2>;
> +
> +     on the basis that x * y == (x - 128) * y + 64 * y + 64 * y.
> +     Doing the two 64 * y steps first allows more time to compute x.  */
> +  tree stage1 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage1, DOT_PROD_EXPR,
> +				  vop[1], half_narrow, vop[2]);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  tree stage2 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage2, DOT_PROD_EXPR,
> +				  vop[1], half_narrow, stage1);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  tree stage3 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage3, DOT_PROD_EXPR,
> +				  sub_res, vop[1], stage2);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  /* Convert STAGE3 to the reduction type.  */
> +  return gimple_build_assign (vec_dest, CONVERT_EXPR, stage3);
> +}
> +
>  /* Transform the definition stmt STMT_INFO of a reduction PHI backedge
>     value.  */
>
> @@ -7563,12 +7684,17 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>  			   : &vec_oprnds2));
>      }
>
> +  bool emulated_mixed_dot_prod
> +    = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
>    FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
>      {
>        gimple *new_stmt;
>        tree vop[3] = { def0, vec_oprnds1[i], NULL_TREE };
>        if (masked_loop_p && !mask_by_cond_expr)
>  	{
> +	  /* No conditional ifns have been defined for dot-product yet.  */
> +	  gcc_assert (code != DOT_PROD_EXPR);
> +
>  	  /* Make sure that the reduction accumulator is vop[0].  */
>  	  if (reduc_index == 1)
>  	    {
> @@ -7597,8 +7723,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>  	      build_vect_cond_expr (code, vop, mask, gsi);
>  	    }
>
> -	  new_stmt = gimple_build_assign (vec_dest, code,
> -					  vop[0], vop[1], vop[2]);
> +	  if (emulated_mixed_dot_prod)
> +	    new_stmt = vect_emulate_mixed_dot_prod (loop_vinfo, stmt_info, gsi,
> +						    vec_dest, vop);
> +	  else
> +	    new_stmt = gimple_build_assign (vec_dest, code,
> +					    vop[0], vop[1], vop[2]);
>  	  new_temp = make_ssa_name (vec_dest, new_stmt);
>  	  gimple_assign_set_lhs (new_stmt, new_temp);
>  	  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 8f624863971..b336f12e6be 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -1148,7 +1148,19 @@ vect_recog_dot_prod_pattern (vec_info *vinfo,
>    tree half_vectype;
>    if (!vect_supportable_direct_optab_p (vinfo, type, DOT_PROD_EXPR, half_type,
>  					type_out, &half_vectype, subtype))
> -    return NULL;
> +    {
> +      /* We can emulate a mixed-sign dot-product using a sequence of
> +	 signed dot-products; see vect_emulate_mixed_dot_prod for details.  */
> +      if (subtype != optab_vector_mixed_sign
> +	  || !vect_supportable_direct_optab_p (vinfo, signed_type_for (type),
> +					       DOT_PROD_EXPR, half_type,
> +					       type_out, &half_vectype,
> +					       optab_vector))
> +	return NULL;
> +
> +      *type_out = signed_or_unsigned_type_for (TYPE_UNSIGNED (type),
> +					       *type_out);
> +    }
>
>    /* Get the inputs in the appropriate types.  */
>    tree mult_oprnd[2];
> --
> 2.25.1
>
>