From: Richard Biener
Date: Thu, 2 Sep 2021 14:18:29 +0200
Subject: Re: [PATCH]AArch64 RFC: Don't cost all scalar operations during vectorization if scalar will fuse
To: Richard Sandiford, Tamar Christina, GCC Patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

On Wed, Sep 1, 2021 at 3:47 PM Richard Biener wrote:
>
> On Tue, Aug 31, 2021 at 4:50 PM Richard Sandiford via Gcc-patches wrote:
> >
> > Tamar Christina writes:
> > > Hi All,
> > >
> > > As the vectorizer has improved over time in capabilities it has started
> > > over-vectorizing.  This has caused regressions in the order of 1-7x on
> > > libraries that Arm produces.
> > >
> > > The vector costs actually do make a lot of sense and I don't think that
> > > they are wrong.  I think that the costs for the scalar code are wrong.
> > >
> > > In particular the costing doesn't take into account that scalar operations
> > > can/will fuse, as this happens in RTL.  Because of this the costs for the
> > > scalars end up being always higher.
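
[Editorial aside, not part of the original mail: a minimal illustration of
the fusion Tamar describes.  On AArch64 at -O2, GCC combines the multiply
and the add below into a single madd instruction during RTL combine
(roughly "madd x0, x0, x1, x2"), so costing the multiply and the add as two
independent scalar statements overstates the real scalar cost.  The
function name is made up for this example.]

    long
    fused_madd (long a, long b, long c)
    {
      /* The multiply feeding the add is folded into one madd at -O2,
         so neither operation pays its full standalone cost.  */
      return a * b + c;
    }
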
> > >
> > > As an example the loop in PR 97984:
> > >
> > > void x (long * __restrict a, long * __restrict b)
> > > {
> > >   a[0] *= b[0];
> > >   a[1] *= b[1];
> > >   a[0] += b[0];
> > >   a[1] += b[1];
> > > }
> > >
> > > generates:
> > >
> > > x:
> > >         ldp     x2, x3, [x0]
> > >         ldr     x4, [x1]
> > >         ldr     q1, [x1]
> > >         mul     x2, x2, x4
> > >         ldr     x4, [x1, 8]
> > >         fmov    d0, x2
> > >         ins     v0.d[1], x3
> > >         mul     x1, x3, x4
> > >         ins     v0.d[1], x1
> > >         add     v0.2d, v0.2d, v1.2d
> > >         str     q0, [x0]
> > >         ret
> > >
> > > On an actual loop the prologue costs would make the loop too expensive so
> > > we produce the scalar output, but with SLP there are no loop overhead costs
> > > so we end up trying to vectorize this.  Because SLP discovery is started
> > > from the stores we will end up vectorizing and costing the add but not the
> > > MUL.
> > >
> > > To counter this the patch adjusts the costing when it finds an operation
> > > that can be fused and discounts the cost of the "other" operation being
> > > fused in.
> > >
> > > The attached testcase shows that even when we discount it we still get
> > > vectorized code when it is profitable to do so, e.g. for SVE.
> > >
> > > This happens as well with other operations, such as scalar operations where
> > > shifts can be fused in, or for e.g. bfxil.  As such I'm sending this for
> > > feedback.
> > >
> > > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > >
> > > Ok for master? If the approach is acceptable I can add support for more.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >     PR target/97984
> > >     * config/aarch64/aarch64.c (aarch64_add_stmt_cost): Check for fusing
> > >     madd.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >     PR target/97984
> > >     * gcc.target/aarch64/pr97984-1.c: New test.
> > >     * gcc.target/aarch64/pr97984-2.c: New test.
> > >     * gcc.target/aarch64/pr97984-3.c: New test.
> > >     * gcc.target/aarch64/pr97984-4.c: New test.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index 4cd4b037f2606e515ad8f4669d2cd13a509dd0a4..329b556311310d86aaf546d7b395a3750a9d57d4 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -15536,6 +15536,39 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
> > >      stmt_cost = aarch64_sve_adjust_stmt_cost (vinfo, kind, stmt_info,
> > >                                                vectype, stmt_cost);
> > >
> > > +  /* Scale costs if operation is fusing.  */
> > > +  if (stmt_info && kind == scalar_stmt)
> > > +    {
> > > +      if (gassign *stmt = dyn_cast <gassign *> (STMT_VINFO_STMT (stmt_info)))
> > > +        {
> > > +          switch (gimple_assign_rhs_code (stmt))
> > > +            {
> > > +            case PLUS_EXPR:
> > > +            case MINUS_EXPR:
> > > +              {
> > > +                /* Check if operation can fuse into MSUB or MADD.  */
> > > +                tree rhs1 = gimple_assign_rhs1 (stmt);
> > > +                if (gassign *stmt1 = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs1)))
> > > +                  if (gimple_assign_rhs_code (stmt1) == MULT_EXPR)
> > > +                    {
> > > +                      stmt_cost = 0;
> > > +                      break;
> > > +                    }
> > > +                tree rhs2 = gimple_assign_rhs2 (stmt);
> > > +                if (gassign *stmt2 = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs2)))
> > > +                  if (gimple_assign_rhs_code (stmt2) == MULT_EXPR)
> > > +                    {
> > > +                      stmt_cost = 0;
> > > +                      break;
> > > +                    }
> > > +              }
> > > +              break;
> > > +            default:
> > > +              break;
> > > +            }
> > > +        }
> > > +    }
> > > +
> >
> > The difficulty with this is that we can also use MLA-type operations
> > for SVE, and for Advanced SIMD if the mode is not DI.
> > It's not just a scalar thing.
> >
> > We already take the combination into account (via aarch64_multiply_add_p)
> > when estimating issue rates.  But we don't take it into account for
> > latencies because of the reason above: if the multiplications are
> > vectorisable, then the combination applies to both the scalar and
> > the vector code, so the adjustments cancel out.  (Admittedly that
> > decision predates the special Advanced SIMD handling in
> > aarch64_multiply_add_p, so we might want to revisit it.)
> >
> > I think the key feature for this testcase is that the multiplication is
> > not part of the vector code.  I think that's something we need to check
> > if we're going to cost the scalar code more cheaply.
> >
> > But for this particular testcase, I think the main problem is that
> > we count the cost of the scalar loads even though those are needed
> > by other users.  E.g.:
> >
> > int foo(long *restrict res, long *restrict foo, long a, long b)
> > {
> >   res[0] = ((foo[0] * a) >> 1) + foo[0];
> >   res[1] = ((foo[1] * b) >> 1) + foo[1];
> > }
> >
> > gets vectorised as:
> >
> >         mov     x4, x0
> >         mov     w0, w5
> >         ldr     x5, [x1, 8]
> >         ldr     x6, [x1]
> >         mul     x3, x3, x5
> >         ldr     q1, [x1]
> >         mul     x2, x2, x6
> >         fmov    d0, x2
> >         ins     v0.d[1], x3
> >         ins     v0.d[1], x3
> >         ssra    v1.2d, v0.2d, 1
> >         str     q1, [x4]
> >         ret
> >
> > which is surely worse than the scalar:
> >
> >         mov     x4, x0
> >         mov     w0, w5
> >         ldp     x5, x1, [x1]
> >         mul     x2, x5, x2
> >         mul     x3, x1, x3
> >         add     x2, x5, x2, asr 1
> >         add     x3, x1, x3, asr 1
> >         stp     x2, x3, [x4]
> >         ret
> >
> > even though both versions can hide the shift.  There's also the point
> > that two-element vectors can be stored using a single scalar STP (and
> > loaded using a single LDP), so it's not always accurate to cost the
> > scalar stores as being twice as expensive as the vector stores.
> >
> > The fact that there are multiple problems doesn't mean that we shouldn't
> > start with the costing of instruction combinations.  It's just that
> > tackling the other aspects might be more general.
> >
> > If we do start by tackling the costing of instruction combinations,
> > I think we need to introduce the “multiplication is not vectorised”
> > condition described above.
>
> BB vectorization costing should already not cost the scalar load
> since it's not going away but it should cost a vector load.  Does
> this for some reason not work for you?
> Indeed looking with a cross I see
>
> t.c:3:10: note: Costing subgraph:
> t.c:3:10: note: node 0x35f03b0 (max_nunits=2, refcnt=1)
> t.c:3:10: note: op template: *res_12(D) = _4;
> t.c:3:10: note:     stmt 0 *res_12(D) = _4;
> t.c:3:10: note:     stmt 1 MEM[(long int *)res_12(D) + 8B] = _8;
> t.c:3:10: note:     children 0x35f0440
> t.c:3:10: note: node 0x35f0440 (max_nunits=2, refcnt=1)
> t.c:3:10: note: op template: _4 = _1 + _3;
> t.c:3:10: note:     stmt 0 _4 = _1 + _3;
> t.c:3:10: note:     stmt 1 _8 = _5 + _7;
> t.c:3:10: note:     children 0x35f04d0 0x35f0560
> t.c:3:10: note: node 0x35f04d0 (max_nunits=2, refcnt=2)
> t.c:3:10: note: op template: _1 = *foo_10(D);
> t.c:3:10: note:     stmt 0 _1 = *foo_10(D);
> t.c:3:10: note:     stmt 1 _5 = MEM[(long int *)foo_10(D) + 8B];
> t.c:3:10: note: node 0x35f0560 (max_nunits=2, refcnt=1)
> t.c:3:10: note: op template: _3 = _2 >> 1;
> t.c:3:10: note:     stmt 0 _3 = _2 >> 1;
> t.c:3:10: note:     stmt 1 _7 = _6 >> 1;
> t.c:3:10: note:     children 0x35f05f0 0x35f0710
> t.c:3:10: note: node (external) 0x35f05f0 (max_nunits=2, refcnt=1)
> t.c:3:10: note:     stmt 0 _2 = _1 * a_11(D);
> t.c:3:10: note:     stmt 1 _6 = _5 * b_14(D);
> t.c:3:10: note:     children 0x35f04d0 0x35f0680
> t.c:3:10: note: node (external) 0x35f0680 (max_nunits=1, refcnt=1)
> t.c:3:10: note:     { a_11(D), b_14(D) }
> t.c:3:10: note: node (constant) 0x35f0710 (max_nunits=1, refcnt=1)
> t.c:3:10: note:     { 1, 1 }
>
> so the promoted external node 0x35f05f0 should keep the load live.
> vect_bb_slp_scalar_cost relies on PURE_SLP_STMT but
> that's unreliable here since the per-stmt setting cannot capture the
> different uses.  The code shares intent (and some bugs) with
> vect_bb_slp_mark_live_stmts and the problem in general is a bit
> difficult given the lack of back-mapping from stmt to SLP nodes
> referencing it.

So I think the situation might only appear with externals that have
"internal" operands, and the partitioning should (hopefully) guarantee
that we're seeing all SLP instances that eventually use a scalar stmt,
so that we can produce a set of stmts used as operands of 'extern'
nodes and avoid costing those on the scalar side - at least if they are
not code-generated as vector extracts, but that's something we don't
compute reliably yet, unfortunately.

Ideally, for such "in SLP instance" externals - that is, those that also
participate in vectorized stmts - we'd have the vector extract and the
external compose represented more explicitly, which would eventually
allow them to be optimized by optimize_slp as well.

> Mind to open a bug report?

PR102176

Richard.
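
[Editorial aside, not from the thread: a rough sketch of the
"multiplication is not vectorised" condition Richard Sandiford asks for,
assuming the multiplication's vectorization status can be queried through
the PURE_SLP_STMT flag discussed above.  The helper name and placement
(next to aarch64_add_stmt_cost in aarch64.c) are hypothetical, only the
first operand is checked for brevity, and the per-stmt flag has the
reliability limitation Richard Biener describes, so this is illustrative
only, not a proposed patch.]

    /* Hypothetical helper: treat a scalar PLUS/MINUS as free only when the
       multiplication feeding it stays scalar, i.e. is not itself part of
       the SLP graph being costed.  If the multiply is vectorized too, the
       MLA-style fusion applies to the vector code as well and the discount
       would cancel out.  */
    static bool
    aarch64_scalar_fusion_candidate_p (vec_info *vinfo, gassign *stmt)
    {
      tree rhs1 = gimple_assign_rhs1 (stmt);
      if (TREE_CODE (rhs1) != SSA_NAME)
        return false;
      gassign *def = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs1));
      if (!def || gimple_assign_rhs_code (def) != MULT_EXPR)
        return false;
      /* Only discount when the feeding multiplication is not pure SLP,
         i.e. it remains in the scalar code.  */
      stmt_vec_info def_info = vinfo->lookup_stmt (def);
      return def_info && !PURE_SLP_STMT (def_info);
    }
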
> Richard.
>
> >
> > Thanks,
> > Richard
> >
> > >    if (stmt_info && aarch64_use_new_vector_costs_p ())
> > >      {
> > >        /* Account for any extra "embedded" costs that apply additively
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-1.c b/gcc/testsuite/gcc.target/aarch64/pr97984-1.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..9d403eb76ec3a72747f47e718a88ed6b062643f9
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-1.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-slp-all" } */
> > > +
> > > +void x (long * __restrict a, long * __restrict b)
> > > +{
> > > +  a[0] *= b[0];
> > > +  a[1] *= b[1];
> > > +  a[0] += b[0];
> > > +  a[1] += b[1];
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-times "not vectorized: vectorization is not profitable" 1 "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-2.c b/gcc/testsuite/gcc.target/aarch64/pr97984-2.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..a4086380fd613035f7ce3e8e8c89e853efa1304e
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-2.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -std=c99 -fdump-tree-vect-all" } */
> > > +
> > > +void x (long * __restrict a, long * __restrict b, int n)
> > > +{
> > > +  for (int i = 0; i < n; i++) {
> > > +    a[i] *= b[i];
> > > +    a[i] += b[i];
> > > +  }
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-times "vectorized 0 loops in function" 1 "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-3.c b/gcc/testsuite/gcc.target/aarch64/pr97984-3.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..c6c8d351834705998b3c87a40edf1a00a4bb80f9
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-3.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-slp-all" } */
> > > +
> > > +void x (long * __restrict a, long * __restrict b, long * __restrict c)
> > > +{
> > > +  a[0] *= b[0];
> > > +  a[1] *= b[1];
> > > +  a[0] += c[0];
> > > +  a[1] += c[1];
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-times "not vectorized: vectorization is not profitable" 1 "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-4.c b/gcc/testsuite/gcc.target/aarch64/pr97984-4.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..d03b0bb84dd822ab3b61e229c01be4cd1fc7f1d4
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-4.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -std=c99 -fdump-tree-vect-all -march=armv8.3-a+sve" } */
> > > +
> > > +void x (long * __restrict a, long * __restrict b, int n)
> > > +{
> > > +  for (int i = 0; i < n; i++) {
> > > +    a[i] *= b[i];
> > > +    a[i] += b[i];
> > > +  }
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 1 "vect" } } */