From mboxrd@z Thu Jan 1 00:00:00 1970
From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110625] [14 Regression][AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
Date: Fri, 29 Dec 2023 16:16:00 +0000
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

Tamar Christina changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
         Resolution|---         |FIXED
             Status|ASSIGNED    |RESOLVED

--- Comment #25 from Tamar Christina ---
(In reply to Hao Liu from comment #0)
> This problem causes a performance regression in SPEC2017 538.imagick.
> For the following simple case (modified from pr96208):
>
> typedef struct {
>   unsigned short m1, m2, m3, m4;
> } the_struct_t;
> typedef struct {
>   double m1, m2, m3, m4, m5;
> } the_struct2_t;
>
> double bar1 (the_struct2_t*);
>
> double foo (double* k, unsigned int n, the_struct_t* the_struct) {
>   unsigned int u;
>   the_struct2_t result;
>   for (u=0; u < n; u++, k--) {
>     result.m1 += (*k)*the_struct[u].m1;
>     result.m2 += (*k)*the_struct[u].m2;
>     result.m3 += (*k)*the_struct[u].m3;
>     result.m4 += (*k)*the_struct[u].m4;
>   }
>   return bar1 (&result);
> }

In the context of this report the regression should be fixed; however, we
still don't vectorize this loop. We ran this and other cases comparing the
scalar and vector versions of this loop, and specifically Neoverse N2 does
much better with the scalar version here. So the cost model is doing the
right thing for the current codegen of the function.

Note that the vector version:

        ldr     q31, [x3], 16
        ldr     q29, [x4], -16
        rev64   v31.8h, v31.8h
        uxtl    v30.4s, v31.4h
        uxtl2   v31.4s, v31.8h
        sxtl    v27.2d, v30.2s
        sxtl    v28.2d, v31.2s
        sxtl2   v30.2d, v30.4s
        sxtl2   v31.2d, v31.4s
        scvtf   v27.2d, v27.2d
        scvtf   v28.2d, v28.2d
        scvtf   v30.2d, v30.2d
        scvtf   v31.2d, v31.2d
        fmla    v26.2d, v27.2d, v29.d[1]
        fmla    v24.2d, v30.2d, v29.d[1]
        fmla    v23.2d, v28.2d, v29.d[0]
        fmla    v25.2d, v31.2d, v29.d[0]

is still pretty inefficient due to all the extends. If we generate better
code here, this may tip the scale back to vector. But for now, the patch
should fix the regression.