From: "rguenther at suse dot de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Fri, 26 Jan 2024 10:21:55 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #10 from rguenther at suse dot de ---
On Fri, 26 Jan 2024, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
>
> --- Comment #9 from Robin Dapp ---
> (In reply to rguenther@suse.de from comment #6)
>
> > t.c:47:21: missed: the size of the group of accesses is not a power of 2
> > or not equal to 3
> > t.c:47:21: missed: not falling back to elementwise accesses
> > t.c:58:15: missed: not vectorized: relevant stmt not supported: _4 = *_3;
> > t.c:47:21: missed: bad operation or unsupported loop bound.
> >
> > where we don't consider using gather because we have a known constant
> > stride (20).  Since the stores are really scatters we don't attempt
> > to SLP either.
> >
> > Disabling the above heuristic we get this vectorized as well, avoiding
> > gather/scatter by manually implementing them and using a quite high
> > VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> > faster code in the end).
>
> I suppose you're referring to this?
>
>   /* FIXME: At the moment the cost model seems to underestimate the
>      cost of using elementwise accesses.  This check preserves the
>      traditional behavior until that can be fixed.  */
>   stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>   if (!first_stmt_info)
>     first_stmt_info = stmt_info;
>   if (*memory_access_type == VMAT_ELEMENTWISE
>       && !STMT_VINFO_STRIDED_P (first_stmt_info)
>       && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
>            && !DR_GROUP_NEXT_ELEMENT (stmt_info)
>            && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "not falling back to elementwise accesses\n");
>       return false;
>     }
>
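> To make the access shape concrete, a hand-reduced sketch of the kind of
> loop being discussed looks roughly like the following.  This is not the
> actual lbm source and hasn't been checked against the dump above; the
> 20-entry cell layout and the names only mirror what comment #6 describes:
>
>   #define N_CELLS   1000
>   #define N_ENTRIES 20   /* 20 doubles per cell, hence constant stride 20 */
>
>   void
>   collide (double *restrict dst, const double *restrict src)
>   {
>     for (int i = 0; i < N_CELLS; i++)
>       {
>         /* Loads and stores advance by 20 doubles per iteration.  In the
>            real kernel the stores additionally go to neighbouring cells,
>            which is why comment #6 calls them scatters.  */
>         const double *c = src + i * N_ENTRIES;
>         double rho = c[0] + c[1] + c[2];
>         dst[i * N_ENTRIES + 0] = 0.95 * c[0] + 0.05 * rho;
>         dst[i * N_ENTRIES + 1] = 0.95 * c[1] + 0.05 * rho;
>       }
>   }
>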
> I did some more tests on my laptop.  As said above, the whole loop in
> lbm is larger and contains two ifs.  The first one prevents clang and
> GCC from vectorizing the loop; the second one
>
>   if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
>     ux = 0.005;
>     uy = 0.002;
>     uz = 0.000;
>   }
>
> seems to be if-converted by clang, or at least doesn't inhibit
> vectorization.
>
> Now if I comment out the first, larger if, clang does vectorize the
> loop.  With the "return false" commented out in the above GCC snippet
> GCC also vectorizes, but only when both ifs are commented out.
>
> Results (with both ifs commented out), -march=native (resulting in
> avx2), best of 3 as lbm is notoriously fickle:
>
>   gcc trunk vanilla:           156.04s
>   gcc trunk with elementwise:  132.10s
>   clang 17:                    143.06s
>
> Of course even the comment already said that costing is difficult and
> the change will surely cause regressions elsewhere.  However, the 15%
> improvement with vectorization (or the 9% improvement of clang) IMHO
> shows that it's surely useful to look into this further.  On top of
> that, the riscv clang seems to not care about the first if either and
> still vectorizes.  I haven't looked closer at what happens there,
> though.

Yes.  I think this shows we should remove the above hack and instead
try to fix the costing next stage1.
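For reference, "falling back to elementwise accesses" here means the
strided loads get open-coded as VF scalar loads assembled into a vector
instead of a hardware gather.  A rough, illustrative sketch for VF 4,
written with GCC vector extensions purely for readability (the helper
name is made up, this is not actual vectorizer output):

  /* Illustrative only: a constant-stride double load done "elementwise",
     i.e. four scalar loads built into a vector rather than a gather.  */
  typedef double v4df __attribute__ ((vector_size (32)));

  static inline v4df
  load_strided_v4df (const double *p, long stride)
  {
    return (v4df) { p[0 * stride], p[1 * stride],
                    p[2 * stride], p[3 * stride] };
  }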