From: Richard Sandiford <richard.sandiford@arm.com>
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd@arm.com, Richard.Earnshaw@arm.com, Marcus.Shawcroft@arm.com, Kyrylo.Tkachov@arm.com
Subject: Re: [PATCH]AArch64 RFC: Don't cost all scalar operations during vectorization if scalar will fuse
Date: Tue, 31 Aug 2021 15:49:30 +0100

Tamar Christina writes:
> Hi All,
>
> As the vectorizer's capabilities have improved over time, it has started
> over-vectorizing.  This has caused regressions on the order of 1-7x in
> libraries that Arm produces.
>
> The vector costs actually do make a lot of sense and I don't think that
> they are wrong.  I think that the costs for the scalar code are wrong.
>
> In particular, the costing doesn't take into account that scalar
> operations can/will fuse, since that happens in RTL.  Because of this the
> costs for the scalar code always end up higher.
>
> As an example, the loop in PR 97984:
>
> void x (long * __restrict a, long * __restrict b)
> {
>   a[0] *= b[0];
>   a[1] *= b[1];
>   a[0] += b[0];
>   a[1] += b[1];
> }
>
> generates:
>
> x:
>         ldp     x2, x3, [x0]
>         ldr     x4, [x1]
>         ldr     q1, [x1]
>         mul     x2, x2, x4
>         ldr     x4, [x1, 8]
>         fmov    d0, x2
>         ins     v0.d[1], x3
>         mul     x1, x3, x4
>         ins     v0.d[1], x1
>         add     v0.2d, v0.2d, v1.2d
>         str     q0, [x0]
>         ret
>
> On an actual loop the prologue costs would make the loop too expensive,
> so we produce the scalar output; but with SLP there are no loop overhead
> costs, so we end up trying to vectorize this.  Because SLP discovery is
> started from the stores, we will end up vectorizing and costing the add
> but not the MUL.
>
> To counter this, the patch adjusts the costing when it finds an operation
> that can be fused, and discounts the cost of the "other" operation being
> fused in.
>
> The attached testcase shows that even when we discount it we still get
> vectorized code when it is profitable to do so, e.g. SVE.
>
> This happens as well with other operations, such as scalar operations
> where shifts can be fused in, or for e.g. BFXIL.  As such, sending this
> for feedback.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master? If the approach is acceptable I can add support for more.
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR target/97984
> 	* config/aarch64/aarch64.c (aarch64_add_stmt_cost): Check for fusing
> 	madd.
>
> gcc/testsuite/ChangeLog:
>
> 	PR target/97984
> 	* gcc.target/aarch64/pr97984-1.c: New test.
> 	* gcc.target/aarch64/pr97984-2.c: New test.
> 	* gcc.target/aarch64/pr97984-3.c: New test.
> 	* gcc.target/aarch64/pr97984-4.c: New test.
>
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 4cd4b037f2606e515ad8f4669d2cd13a509dd0a4..329b556311310d86aaf546d7b395a3750a9d57d4 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -15536,6 +15536,39 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>      stmt_cost = aarch64_sve_adjust_stmt_cost (vinfo, kind, stmt_info,
>  					      vectype, stmt_cost);
>  
> +  /* Scale costs if operation is fusing.  */
> +  if (stmt_info && kind == scalar_stmt)
> +    {
> +      if (gassign *stmt = dyn_cast <gassign *> (STMT_VINFO_STMT (stmt_info)))
> +	{
> +	  switch (gimple_assign_rhs_code (stmt))
> +	    {
> +	    case PLUS_EXPR:
> +	    case MINUS_EXPR:
> +	      {
> +		/* Check if operation can fuse into MSUB or MADD.  */
> +		tree rhs1 = gimple_assign_rhs1 (stmt);
> +		if (gassign *stmt1 = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs1)))
> +		  if (gimple_assign_rhs_code (stmt1) == MULT_EXPR)
> +		    {
> +		      stmt_cost = 0;
> +		      break;
> +		    }
> +		tree rhs2 = gimple_assign_rhs2 (stmt);
> +		if (gassign *stmt2 = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs2)))
> +		  if (gimple_assign_rhs_code (stmt2) == MULT_EXPR)
> +		    {
> +		      stmt_cost = 0;
> +		      break;
> +		    }
> +	      }
> +	      break;
> +	    default:
> +	      break;
> +	    }
> +	}
> +    }
> +

The difficulty with this is that we can also use MLA-type operations
for SVE, and for Advanced SIMD if the mode is not DI.  It's not just
a scalar thing.

We already take the combination into account (via aarch64_multiply_add_p)
when estimating issue rates.  But we don't take it into account for
latencies because of the reason above: if the multiplications are
vectorisable, then the combination applies to both the scalar and the
vector code, so the adjustments cancel out.  (Admittedly that decision
predates the special Advanced SIMD handling in aarch64_multiply_add_p,
so we might want to revisit it.)

I think the key feature for this testcase is that the multiplication is
not part of the vector code.  I think that's something we need to check
if we're going to cost the scalar code more cheaply.

But for this particular testcase, I think the main problem is that we
count the cost of the scalar loads even though those are needed by
other users.  E.g.:

int foo(long *restrict res, long *restrict foo, long a, long b)
{
    res[0] = ((foo[0] * a) >> 1) + foo[0];
    res[1] = ((foo[1] * b) >> 1) + foo[1];
}

gets vectorised as:

        mov     x4, x0
        mov     w0, w5
        ldr     x5, [x1, 8]
        ldr     x6, [x1]
        mul     x3, x3, x5
        ldr     q1, [x1]
        mul     x2, x2, x6
        fmov    d0, x2
        ins     v0.d[1], x3
        ins     v0.d[1], x3
        ssra    v1.2d, v0.2d, 1
        str     q1, [x4]
        ret

which is surely worse than the scalar:

        mov     x4, x0
        mov     w0, w5
        ldp     x5, x1, [x1]
        mul     x2, x5, x2
        mul     x3, x1, x3
        add     x2, x5, x2, asr 1
        add     x3, x1, x3, asr 1
        stp     x2, x3, [x4]
        ret

even though both versions can hide the shift.
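Going back to the "multiplication is not vectorised" condition: as a
very rough, untested sketch (lookup_stmt and STMT_VINFO_RELEVANT_P are
just one guess at the right way to ask the question), the discount in
the patch's PLUS_EXPR/MINUS_EXPR case could be restricted along these
lines:

	      /* Sketch only: discount the PLUS/MINUS only if the feeding
		 multiplication is *not* itself part of the vector code,
		 i.e. if the MADD/MSUB fusion is exclusive to the scalar
		 version.  */
	      tree rhs1 = gimple_assign_rhs1 (stmt);
	      if (TREE_CODE (rhs1) == SSA_NAME)
		if (gassign *stmt1
		      = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (rhs1)))
		  if (gimple_assign_rhs_code (stmt1) == MULT_EXPR)
		    {
		      stmt_vec_info mul_info = vinfo->lookup_stmt (stmt1);
		      /* Assumption: no stmt_vec_info, or one that is not
			 relevant, means the multiplication stays scalar.
			 If it is vectorised too, the fusion benefit
			 applies to both versions and cancels out.  */
		      if (!mul_info || !STMT_VINFO_RELEVANT_P (mul_info))
			stmt_cost = 0;
		    }

(and similarly for rhs2).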
There's also the point that two-element vectors can be stored using
a single scalar STP (and loaded using a single LDP), so it's not always
accurate to cost the scalar stores as being twice as expensive as the
vector stores.

The fact that there are multiple problems doesn't mean that we
shouldn't start with the costing of instruction combinations.
It's just that tackling the other aspects might be more general.

If we do start by tackling the costing of instruction combinations,
I think we need to introduce the "multiplication is not vectorised"
condition described above.

Thanks,
Richard

>    if (stmt_info && aarch64_use_new_vector_costs_p ())
>      {
>        /* Account for any extra "embedded" costs that apply additively
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-1.c b/gcc/testsuite/gcc.target/aarch64/pr97984-1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..9d403eb76ec3a72747f47e718a88ed6b062643f9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-slp-all" } */
> +
> +void x (long * __restrict a, long * __restrict b)
> +{
> +  a[0] *= b[0];
> +  a[1] *= b[1];
> +  a[0] += b[0];
> +  a[1] += b[1];
> +}
> +
> +/* { dg-final { scan-tree-dump-times "not vectorized: vectorization is not profitable" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-2.c b/gcc/testsuite/gcc.target/aarch64/pr97984-2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..a4086380fd613035f7ce3e8e8c89e853efa1304e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -std=c99 -fdump-tree-vect-all" } */
> +
> +void x (long * __restrict a, long * __restrict b, int n)
> +{
> +  for (int i = 0; i < n; i++) {
> +    a[i] *= b[i];
> +    a[i] += b[i];
> +  }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops in function" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-3.c b/gcc/testsuite/gcc.target/aarch64/pr97984-3.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c6c8d351834705998b3c87a40edf1a00a4bb80f9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-3.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-slp-all" } */
> +
> +void x (long * __restrict a, long * __restrict b, long * __restrict c)
> +{
> +  a[0] *= b[0];
> +  a[1] *= b[1];
> +  a[0] += c[0];
> +  a[1] += c[1];
> +}
> +
> +/* { dg-final { scan-tree-dump-times "not vectorized: vectorization is not profitable" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr97984-4.c b/gcc/testsuite/gcc.target/aarch64/pr97984-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..d03b0bb84dd822ab3b61e229c01be4cd1fc7f1d4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr97984-4.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -std=c99 -fdump-tree-vect-all -march=armv8.3-a+sve" } */
> +
> +void x (long * __restrict a, long * __restrict b, int n)
> +{
> +  for (int i = 0; i < n; i++) {
> +    a[i] *= b[i];
> +    a[i] += b[i];
> +  }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 1 "vect" } } */
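P.S. To put the STP point above in source form (a minimal illustration,
not part of the patch): on aarch64, GCC normally merges the two adjacent
64-bit stores below into a single STP, so costing the scalar version as
two separate stores overstates its cost relative to one vector store:

void f (long *__restrict p, long a, long b)
{
  p[0] = a;  /* These two adjacent stores are normally */
  p[1] = b;  /* combined into a single "stp a, b, [p]".  */
}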