From: "rdapp at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Fri, 26 Jan 2024 09:50:52 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #9 from Robin Dapp ---
(In reply to rguenther@suse.de from comment #6)
> t.c:47:21: missed:   the size of the group of accesses is not a power of 2
> or not equal to 3
> t.c:47:21: missed:   not falling back to elementwise accesses
> t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 =
> *_3;
> t.c:47:21: missed:  bad operation or unsupported loop bound.
>
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
>
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by manually implementing them and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).

I suppose you're referring to this?

  /* FIXME: At the moment the cost model seems to underestimate the
     cost of using elementwise accesses.  This check preserves the
     traditional behavior until that can be fixed.  */
  stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
  if (!first_stmt_info)
    first_stmt_info = stmt_info;
  if (*memory_access_type == VMAT_ELEMENTWISE
      && !STMT_VINFO_STRIDED_P (first_stmt_info)
      && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
           && !DR_GROUP_NEXT_ELEMENT (stmt_info)
           && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not falling back to elementwise accesses\n");
      return false;
    }

I did some more tests on my laptop.  As said above the whole loop in lbm is
larger and contains two ifs.  The first one prevents clang and GCC from
vectorizing the loop, the second one

    if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
        ux = 0.005;
        uy = 0.002;
        uz = 0.000;
    }

seems to be if-converted? by clang or at least doesn't inhibit vectorization.

Now if I comment out the first, larger if clang does vectorize the loop.  With
the return false commented out in the above GCC snippet GCC also vectorizes,
but only when both ifs are commented out.

Results (with both ifs commented out), -march=native (resulting in avx2), best
of 3 as lbm is notoriously fickle:

gcc trunk vanilla: 156.04s
gcc trunk with elementwise: 132.10s
clang 17: 143.06s

Of course even the comment already said that costing is difficult and the
change will surely cause regressions elsewhere.
However, the 15% improvement with vectorization (or the 9% improvement of
clang) IMHO shows that it's surely useful to look into this further.

On top, the riscv clang seems to not care about the first if either and still
vectorizes.  I haven't looked closer at what happens there, though.
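
For illustration only, here is a minimal C sketch of what "if-converted" means for the second if: the branch turns into a select between the computed value and the constant, so the loop body becomes straight-line code. This is not the lbm source. The function names, the separate flags array, and the placeholder computation `cell[0] - cell[1]` are all assumptions made for the sketch; only the stride of 20 and the constant 0.005 come from the discussion above.

```c
#include <stddef.h>

enum { N_CELL_ENTRIES = 20 };  /* constant stride mentioned in comment #6 */
enum { ACCEL = 1 };            /* hypothetical flag bit */

/* Branchy form: the conditional overwrite mirrors the second if above.
   grid is walked with a constant stride of N_CELL_ENTRIES doubles. */
void ux_branchy(const double *grid, const unsigned *flags,
                double *out, size_t ncells)
{
    for (size_t i = 0; i < ncells; i++) {
        const double *cell = grid + i * N_CELL_ENTRIES;
        double ux = cell[0] - cell[1];  /* placeholder for the real computation */
        if (flags[i] & ACCEL)
            ux = 0.005;
        out[i] = ux;
    }
}

/* If-converted form: the value is always computed and a select picks
   between it and the constant, so no control flow remains in the body. */
void ux_select(const double *grid, const unsigned *flags,
               double *out, size_t ncells)
{
    for (size_t i = 0; i < ncells; i++) {
        const double *cell = grid + i * N_CELL_ENTRIES;
        double computed = cell[0] - cell[1];  /* same placeholder as above */
        out[i] = (flags[i] & ACCEL) ? 0.005 : computed;
    }
}
```

Both functions should produce identical results. Whether a given compiler treats them the same way can be checked by comparing its vectorization reports for the two.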