From: "rguenther at suse dot de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Fri, 26 Jan 2024 10:21:55 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #10 from rguenther at suse dot de ---
On Fri, 26 Jan 2024, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
>
> --- Comment #9 from Robin Dapp ---
> (In reply to rguenther@suse.de from comment #6)
>
> > t.c:47:21: missed: the size of the group of accesses is not a power of 2
> > or not equal to 3
> > t.c:47:21: missed: not falling back to elementwise accesses
> > t.c:58:15: missed: not vectorized: relevant stmt not supported: _4 = *_3;
> > t.c:47:21: missed: bad operation or unsupported loop bound.
> >
> > where we don't consider using gather because we have a known constant
> > stride (20).  Since the stores are really scatters we don't attempt
> > to SLP either.
> >
> > Disabling the above heuristic we get this vectorized as well, avoiding
> > gather/scatter by manually implementing them and using a quite high
> > VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> > faster code in the end).
>
> I suppose you're referring to this?
>
>   /* FIXME: At the moment the cost model seems to underestimate the
>      cost of using elementwise accesses.  This check preserves the
>      traditional behavior until that can be fixed.  */
>   stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>   if (!first_stmt_info)
>     first_stmt_info = stmt_info;
>   if (*memory_access_type == VMAT_ELEMENTWISE
>       && !STMT_VINFO_STRIDED_P (first_stmt_info)
>       && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
>            && !DR_GROUP_NEXT_ELEMENT (stmt_info)
>            && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "not falling back to elementwise accesses\n");
>       return false;
>     }
>
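> To make the access shape concrete, a hand-reduced sketch of the kind of
> loop being discussed looks roughly like the following.  This is not the
> actual lbm source and hasn't been checked against the dump above; the
> 20-entry cell layout and the names only mirror what comment #6 describes:
>
>   #define N_CELLS   1000
>   #define N_ENTRIES 20   /* 20 doubles per cell, hence constant stride 20 */
>
>   void
>   collide (double *restrict dst, const double *restrict src)
>   {
>     for (int i = 0; i < N_CELLS; i++)
>       {
>         /* Loads and stores advance by 20 doubles per iteration.  In the
>            real kernel the stores additionally go to neighbouring cells,
>            which is why comment #6 calls them scatters.  */
>         const double *c = src + i * N_ENTRIES;
>         double rho = c[0] + c[1] + c[2];
>         dst[i * N_ENTRIES + 0] = 0.95 * c[0] + 0.05 * rho;
>         dst[i * N_ENTRIES + 1] = 0.95 * c[1] + 0.05 * rho;
>       }
>   }
>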
> I did some more tests on my laptop.  As said above, the whole loop in
> lbm is larger and contains two ifs.  The first one prevents clang and
> GCC from vectorizing the loop; the second one
>
>   if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
>     ux = 0.005;
>     uy = 0.002;
>     uz = 0.000;
>   }
>
> seems to be if-converted by clang, or at least doesn't inhibit
> vectorization.
>
> Now if I comment out the first, larger if, clang does vectorize the
> loop.  With the "return false" commented out in the above GCC snippet
> GCC also vectorizes, but only when both ifs are commented out.
>
> Results (with both ifs commented out), -march=native (resulting in
> avx2), best of 3 as lbm is notoriously fickle:
>
>   gcc trunk vanilla:           156.04s
>   gcc trunk with elementwise:  132.10s
>   clang 17:                    143.06s
>
> Of course even the comment already said that costing is difficult and
> the change will surely cause regressions elsewhere.  However, the 15%
> improvement with vectorization (or the 9% improvement of clang) IMHO
> shows that it's surely useful to look into this further.  On top of
> that, the riscv clang seems to not care about the first if either and
> still vectorizes.  I haven't looked closer at what happens there,
> though.

Yes.  I think this shows we should remove the above hack and instead
try to fix the costing next stage1.
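For reference, "falling back to elementwise accesses" here means the
strided loads get open-coded as VF scalar loads assembled into a vector
instead of a hardware gather.  A rough, illustrative sketch for VF 4,
written with GCC vector extensions purely for readability (the helper
name is made up, this is not actual vectorizer output):

  /* Illustrative only: a constant-stride double load done "elementwise",
     i.e. four scalar loads built into a vector rather than a gather.  */
  typedef double v4df __attribute__ ((vector_size (32)));

  static inline v4df
  load_strided_v4df (const double *p, long stride)
  {
    return (v4df) { p[0 * stride], p[1 * stride],
                    p[2 * stride], p[3 * stride] };
  }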