From: Richard Biener
Date: Thu, 2 Sep 2021 14:18:29 +0200
Subject: Re: [PATCH]AArch64 RFC: Don't cost all scalar operations during vectorization if scalar will fuse
To: Richard Sandiford, Tamar Christina, GCC Patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

On Wed, Sep 1, 2021 at 3:47 PM Richard Biener wrote:
>
> On Tue, Aug 31, 2021 at 4:50 PM Richard Sandiford via Gcc-patches wrote:
> >
> > Tamar Christina writes:
> > > Hi All,
> > >
> > > As the vectorizer has improved over time in capabilities it has started
> > > over-vectorizing.  This has caused regressions in the order of 1-7x on
> > > libraries that Arm produces.
> > >
> > > The vector costs actually do make a lot of sense and I don't think that
> > > they are wrong.  I think that the costs for the scalar code are wrong.
> > >
> > > In particular the costing doesn't take into account that scalar operations
> > > can/will fuse, as this happens in RTL.  Because of this the costs for the
> > > scalars end up being always higher.
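
[Editorial aside, not part of the original mail: a minimal illustration of
the fusion Tamar describes.  On AArch64 at -O2, GCC combines the multiply
and the add below into a single madd instruction during RTL combine
(roughly "madd x0, x0, x1, x2"), so costing the multiply and the add as two
independent scalar statements overstates the real scalar cost.  The
function name is made up for this example.]

    long
    fused_madd (long a, long b, long c)
    {
      /* The multiply feeding the add is folded into one madd at -O2,
         so neither operation pays its full standalone cost.  */
      return a * b + c;
    }
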
> > >
> > > As an example the loop in PR 97984:
> > >
> > > void x (long * __restrict a, long * __restrict b)
> > > {
> > >   a[0] *= b[0];
> > >   a[1] *= b[1];
> > >   a[0] += b[0];
> > >   a[1] += b[1];
> > > }
> > >
> > > generates:
> > >
> > > x:
> > >         ldp     x2, x3, [x0]
> > >         ldr     x4, [x1]
> > >         ldr     q1, [x1]
> > >         mul     x2, x2, x4
> > >         ldr     x4, [x1, 8]
> > >         fmov    d0, x2
> > >         ins     v0.d[1], x3
> > >         mul     x1, x3, x4
> > >         ins     v0.d[1], x1
> > >         add     v0.2d, v0.2d, v1.2d
> > >         str     q0, [x0]
> > >         ret
> > >
> > > On an actual loop the prologue costs would make the loop too expensive so
> > > we produce the scalar output, but with SLP there are no loop overhead costs
> > > so we end up trying to vectorize this.  Because SLP discovery is started
> > > from the stores we will end up vectorizing and costing the add but not the
> > > MUL.
> > >
> > > To counter this the patch adjusts the costing when it finds an operation
> > > that can be fused and discounts the cost of the "other" operation being
> > > fused in.
> > >
> > > The attached testcase shows that even when we discount it we still get
> > > vectorized code when it is profitable to do so, e.g. for SVE.
> > >
> > > This happens as well with other operations, such as scalar operations where
> > > shifts can be fused in, or for e.g. bfxil.  As such I'm sending this for
> > > feedback.
> > >
> > > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > >
> > > Ok for master? If the approach is acceptable I can add support for more.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >     PR target/97984
> > >     * config/aarch64/aarch64.c (aarch64_add_stmt_cost): Check for fusing
> > >     madd.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >     PR target/97984
> > >     * gcc.target/aarch64/pr97984-1.c: New test.
> > >     * gcc.target/aarch64/pr97984-2.c: New test.
> > >     * gcc.target/aarch64/pr97984-3.c: New test.
> > >     * gcc.target/aarch64/pr97984-4.c: New test.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index 4cd4b037f2606e515ad8f4669d2cd13a509dd0a4..329b556311310d86aaf546d7b395a3750a9d57d4 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -15536,6 +15536,39 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
> > >      stmt_cost = aarch64_sve_adjust_stmt_cost (vinfo, kind, stmt_info,
> > >                                                vectype, stmt_cost);
> > >
> > > +  /* Scale costs if operation is fusing.  */
> > > +  if (stmt_info && kind == scalar_stmt)
> > > +    {
> > > +      if (gassign *stmt = dyn_cast <gassign *> (STMT_VINFO_STMT (stmt_info)))
> > > +        {
> > > +          switch (gimple_assign_rhs_code (stmt))
> > > +            {
> > > +            case PLUS_EXPR:
> > > +            case MINUS_EXPR:
> > > +              {
> > > +                /* Check if operation can fuse into MSUB or MADD.  */
> > > +                tree rhs1 = gimple_assign_rhs1 (stmt);
> > > +                if (gassign *stmt1 = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs1)))
> > > +                  if (gimple_assign_rhs_code (stmt1) == MULT_EXPR)
> > > +                    {
> > > +                      stmt_cost = 0;
> > > +                      break;
> > > +                    }
> > > +                tree rhs2 = gimple_assign_rhs2 (stmt);
> > > +                if (gassign *stmt2 = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs2)))
> > > +                  if (gimple_assign_rhs_code (stmt2) == MULT_EXPR)
> > > +                    {
> > > +                      stmt_cost = 0;
> > > +                      break;
> > > +                    }
> > > +              }
> > > +              break;
> > > +            default:
> > > +              break;
> > > +            }
> > > +        }
> > > +    }
> > > +
> >
> > The difficulty with this is that we can also use MLA-type operations
> > for SVE, and for Advanced SIMD if the mode is not DI.
> > It's not just a scalar thing.
> >
> > We already take the combination into account (via aarch64_multiply_add_p)
> > when estimating issue rates.  But we don't take it into account for
> > latencies because of the reason above: if the multiplications are
> > vectorisable, then the combination applies to both the scalar and
> > the vector code, so the adjustments cancel out.  (Admittedly that
> > decision predates the special Advanced SIMD handling in
> > aarch64_multiply_add_p, so we might want to revisit it.)
> >
> > I think the key feature for this testcase is that the multiplication is
> > not part of the vector code.  I think that's something we need to check
> > if we're going to cost the scalar code more cheaply.
> >
> > But for this particular testcase, I think the main problem is that
> > we count the cost of the scalar loads even though those are needed
> > by other users.  E.g.:
> >
> > int foo(long *restrict res, long *restrict foo, long a, long b)
> > {
> >   res[0] = ((foo[0] * a) >> 1) + foo[0];
> >   res[1] = ((foo[1] * b) >> 1) + foo[1];
> > }
> >
> > gets vectorised as:
> >
> >         mov     x4, x0
> >         mov     w0, w5
> >         ldr     x5, [x1, 8]
> >         ldr     x6, [x1]
> >         mul     x3, x3, x5
> >         ldr     q1, [x1]
> >         mul     x2, x2, x6
> >         fmov    d0, x2
> >         ins     v0.d[1], x3
> >         ins     v0.d[1], x3
> >         ssra    v1.2d, v0.2d, 1
> >         str     q1, [x4]
> >         ret
> >
> > which is surely worse than the scalar:
> >
> >         mov     x4, x0
> >         mov     w0, w5
> >         ldp     x5, x1, [x1]
> >         mul     x2, x5, x2
> >         mul     x3, x1, x3
> >         add     x2, x5, x2, asr 1
> >         add     x3, x1, x3, asr 1
> >         stp     x2, x3, [x4]
> >         ret
> >
> > even though both versions can hide the shift.  There's also the point
> > that two-element vectors can be stored using a single scalar STP (and
> > loaded using a single LDP), so it's not always accurate to cost the
> > scalar stores as being twice as expensive as the vector stores.
> >
> > The fact that there are multiple problems doesn't mean that we shouldn't
> > start with the costing of instruction combinations.  It's just that
> > tackling the other aspects might be more general.
> >
> > If we do start by tackling the costing of instruction combinations,
> > I think we need to introduce the “multiplication is not vectorised”
> > condition described above.
>
> BB vectorization costing should already not cost the scalar load
> since it's not going away but it should cost a vector load.  Does
> this for some reason not work for you?
> Indeed looking with a cross I see
>
> t.c:3:10: note: Costing subgraph:
> t.c:3:10: note: node 0x35f03b0 (max_nunits=2, refcnt=1)
> t.c:3:10: note: op template: *res_12(D) = _4;
> t.c:3:10: note:     stmt 0 *res_12(D) = _4;
> t.c:3:10: note:     stmt 1 MEM[(long int *)res_12(D) + 8B] = _8;
> t.c:3:10: note:     children 0x35f0440
> t.c:3:10: note: node 0x35f0440 (max_nunits=2, refcnt=1)
> t.c:3:10: note: op template: _4 = _1 + _3;
> t.c:3:10: note:     stmt 0 _4 = _1 + _3;
> t.c:3:10: note:     stmt 1 _8 = _5 + _7;
> t.c:3:10: note:     children 0x35f04d0 0x35f0560
> t.c:3:10: note: node 0x35f04d0 (max_nunits=2, refcnt=2)
> t.c:3:10: note: op template: _1 = *foo_10(D);
> t.c:3:10: note:     stmt 0 _1 = *foo_10(D);
> t.c:3:10: note:     stmt 1 _5 = MEM[(long int *)foo_10(D) + 8B];
> t.c:3:10: note: node 0x35f0560 (max_nunits=2, refcnt=1)
> t.c:3:10: note: op template: _3 = _2 >> 1;
> t.c:3:10: note:     stmt 0 _3 = _2 >> 1;
> t.c:3:10: note:     stmt 1 _7 = _6 >> 1;
> t.c:3:10: note:     children 0x35f05f0 0x35f0710
> t.c:3:10: note: node (external) 0x35f05f0 (max_nunits=2, refcnt=1)
> t.c:3:10: note:     stmt 0 _2 = _1 * a_11(D);
> t.c:3:10: note:     stmt 1 _6 = _5 * b_14(D);
> t.c:3:10: note:     children 0x35f04d0 0x35f0680
> t.c:3:10: note: node (external) 0x35f0680 (max_nunits=1, refcnt=1)
> t.c:3:10: note:     { a_11(D), b_14(D) }
> t.c:3:10: note: node (constant) 0x35f0710 (max_nunits=1, refcnt=1)
> t.c:3:10: note:     { 1, 1 }
>
> so the promoted external node 0x35f05f0 should keep the load live.
> vect_bb_slp_scalar_cost relies on PURE_SLP_STMT but
> that's unreliable here since the per-stmt setting cannot capture the
> different uses.  The code shares intent (and some bugs) with
> vect_bb_slp_mark_live_stmts and the problem in general is a bit
> difficult given the lack of back-mapping from stmt to SLP nodes
> referencing it.

So I think the situation might only appear with externals that have
"internal" operands, and the partitioning should (hopefully) guarantee
that we're seeing all SLP instances that eventually use a scalar stmt,
so that we can produce a set of stmts used as operands of 'extern'
nodes and avoid costing those on the scalar side - at least if they are
not code-generated as vector extracts, but that's something we don't
compute reliably yet, unfortunately.

Ideally, for such "in SLP instance" externals - that is, those that also
participate in vectorized stmts - we'd have the vector extract and the
external compose represented more explicitly, which would eventually
allow them to be optimized by optimize_slp as well.

> Mind to open a bug report?

PR102176

Richard.
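
[Editorial aside, not from the thread: a rough sketch of the
"multiplication is not vectorised" condition Richard Sandiford asks for,
assuming the multiplication's vectorization status can be queried through
the PURE_SLP_STMT flag discussed above.  The helper name and placement
(next to aarch64_add_stmt_cost in aarch64.c) are hypothetical, only the
first operand is checked for brevity, and the per-stmt flag has the
reliability limitation Richard Biener describes, so this is illustrative
only, not a proposed patch.]

    /* Hypothetical helper: treat a scalar PLUS/MINUS as free only when the
       multiplication feeding it stays scalar, i.e. is not itself part of
       the SLP graph being costed.  If the multiply is vectorized too, the
       MLA-style fusion applies to the vector code as well and the discount
       would cancel out.  */
    static bool
    aarch64_scalar_fusion_candidate_p (vec_info *vinfo, gassign *stmt)
    {
      tree rhs1 = gimple_assign_rhs1 (stmt);
      if (TREE_CODE (rhs1) != SSA_NAME)
        return false;
      gassign *def = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs1));
      if (!def || gimple_assign_rhs_code (def) != MULT_EXPR)
        return false;
      /* Only discount when the feeding multiplication is not pure SLP,
         i.e. it remains in the scalar code.  */
      stmt_vec_info def_info = vinfo->lookup_stmt (def);
      return def_info && !PURE_SLP_STMT (def_info);
    }
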
> Richard.
>
> >
> > Thanks,
> > Richard
> >
> > >    if (stmt_info && aarch64_use_new_vector_costs_p ())
> > >      {
> > >        /* Account for any extra "embedded" costs that apply additively
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-1.c b/gcc/testsuite/gcc.target/aarch64/pr97984-1.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..9d403eb76ec3a72747f47e718a88ed6b062643f9
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-1.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-slp-all" } */
> > > +
> > > +void x (long * __restrict a, long * __restrict b)
> > > +{
> > > +  a[0] *= b[0];
> > > +  a[1] *= b[1];
> > > +  a[0] += b[0];
> > > +  a[1] += b[1];
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-times "not vectorized: vectorization is not profitable" 1 "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-2.c b/gcc/testsuite/gcc.target/aarch64/pr97984-2.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..a4086380fd613035f7ce3e8e8c89e853efa1304e
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-2.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -std=c99 -fdump-tree-vect-all" } */
> > > +
> > > +void x (long * __restrict a, long * __restrict b, int n)
> > > +{
> > > +  for (int i = 0; i < n; i++) {
> > > +    a[i] *= b[i];
> > > +    a[i] += b[i];
> > > +  }
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-times "vectorized 0 loops in function" 1 "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-3.c b/gcc/testsuite/gcc.target/aarch64/pr97984-3.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..c6c8d351834705998b3c87a40edf1a00a4bb80f9
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-3.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-slp-all" } */
> > > +
> > > +void x (long * __restrict a, long * __restrict b, long * __restrict c)
> > > +{
> > > +  a[0] *= b[0];
> > > +  a[1] *= b[1];
> > > +  a[0] += c[0];
> > > +  a[1] += c[1];
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-times "not vectorized: vectorization is not profitable" 1 "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-4.c b/gcc/testsuite/gcc.target/aarch64/pr97984-4.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..d03b0bb84dd822ab3b61e229c01be4cd1fc7f1d4
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-4.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -std=c99 -fdump-tree-vect-all -march=armv8.3-a+sve" } */
> > > +
> > > +void x (long * __restrict a, long * __restrict b, int n)
> > > +{
> > > +  for (int i = 0; i < n; i++) {
> > > +    a[i] *= b[i];
> > > +    a[i] += b[i];
> > > +  }
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 1 "vect" } } */