From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 8AEB23858284; Fri, 26 Jan 2024 09:50:56 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8AEB23858284
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1706262656;
	bh=eFQZiE6n7WvolaW/6EeO9YRFE40fuobJOJXctZE4YIc=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=a7xjol/Zl75KX81Ou4uYvYKfm20QH+7/Op5aANKpfQiG6G0pjp9aJ5DokET0nJZFZ
	 s2WGnOxyxdjgt+oqsIPc7S5sHd9xb16LS1Px0UkGnqyzk4PLQBrqqxeATE04AaeKwn
	 kQEFvCTUpJoi3ug1PqquucGnXHJBel45Psr+o1bY=
From: "rdapp at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Fri, 26 Jan 2024 09:50:52 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: rdapp at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-113583-4-4ocaad3Zmt@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-113583-4@http.gcc.gnu.org/bugzilla/>
References: <bug-113583-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
--- Comment #9 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #6)

> t.c:47:21: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
> t.c:47:21: missed:   not falling back to elementwise accesses
> t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = *_3;
> t.c:47:21: missed:  bad operation or unsupported loop bound.
>
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
>
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by manually implementing them and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).

I suppose you're referring to this?

  /* FIXME: At the moment the cost model seems to underestimate the
     cost of using elementwise accesses.  This check preserves the
     traditional behavior until that can be fixed.  */
  stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
  if (!first_stmt_info)
    first_stmt_info = stmt_info;
  if (*memory_access_type == VMAT_ELEMENTWISE
      && !STMT_VINFO_STRIDED_P (first_stmt_info)
      && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
           && !DR_GROUP_NEXT_ELEMENT (stmt_info)
           && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not falling back to elementwise accesses\n");
      return false;
    }
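
For readers less familiar with the term: "elementwise accesses" here
means building a vector from individual scalar loads (and taking it apart
with scalar stores) instead of using gather/scatter instructions.  A
minimal sketch of the idea follows, written by hand rather than taken
from GCC's output.  The stride of 20 comes from comment #6 and VF 4 is the
-mprefer-vector-width=256 case for doubles; the kernel (scaling one field
per cell) and the function names are hypothetical and are not lbm code.

```c
#include <stddef.h>

/* STRIDE 20 is the constant stride mentioned in comment #6; VF 4 matches
   256-bit vectors of doubles.  Everything else here is an assumption
   made for illustration.  */
enum { STRIDE = 20, VF = 4 };

/* Scalar reference: touch one double per cell, STRIDE doubles apart.  */
void
scale_scalar (double *grid, size_t n_cells, double k)
{
  for (size_t i = 0; i < n_cells; i++)
    grid[i * STRIDE] *= k;
}

/* Hand-written model of an elementwise-vectorized loop: the strided
   memory accesses stay scalar, only the arithmetic is done on a whole
   "vector" of VF lanes.  */
void
scale_elementwise (double *grid, size_t n_cells, double k)
{
  size_t i = 0;
  for (; i + VF <= n_cells; i += VF)
    {
      double v[VF];
      for (size_t l = 0; l < VF; l++)   /* VF scalar loads build the vector.  */
        v[l] = grid[(i + l) * STRIDE];
      for (size_t l = 0; l < VF; l++)   /* Maps to a single vector multiply.  */
        v[l] *= k;
      for (size_t l = 0; l < VF; l++)   /* VF scalar stores take it apart.  */
        grid[(i + l) * STRIDE] = v[l];
    }
  for (; i < n_cells; i++)              /* Scalar epilogue for the remainder.  */
    grid[i * STRIDE] *= k;
}
```

The extra per-lane loads and stores are the overhead the FIXME comment
worries about: whether this pays off depends on how much arithmetic sits
between them.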


I did some more tests on my laptop.  As mentioned above, the whole loop in
lbm is larger and contains two ifs.  The first one prevents both clang and
GCC from vectorizing the loop; the second one

                if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
                        ux = 0.005;
                        uy = 0.002;
                        uz = 0.000;
                }

seems to be if-converted by clang, or at least it doesn't inhibit
vectorization.
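
To illustrate what if-conversion would do with a branch of this shape,
here is a minimal sketch.  The flag test is reduced to a plain bit test
(ACCEL_BIT and the velocity struct are hypothetical stand-ins, not the
real lbm macros), and the second function is a hand-written model of the
branch-free form, not actual compiler output.

```c
/* Hypothetical stand-in for the ACCEL cell flag.  */
enum { ACCEL_BIT = 1 << 1 };

typedef struct { double ux, uy, uz; } velocity;

/* Original shape: a conditional block of plain assignments.  */
velocity
with_branch (unsigned flags, velocity u)
{
  if (flags & ACCEL_BIT)
    {
      u.ux = 0.005;
      u.uy = 0.002;
      u.uz = 0.000;
    }
  return u;
}

/* If-converted shape: every assignment becomes a select on the
   condition, so the body is straight-line code that can be turned into
   vector blends/masked moves.  */
velocity
if_converted (unsigned flags, velocity u)
{
  int accel = (flags & ACCEL_BIT) != 0;
  u.ux = accel ? 0.005 : u.ux;
  u.uy = accel ? 0.002 : u.uy;
  u.uz = accel ? 0.000 : u.uz;
  return u;
}
```

Because the block only assigns constants and has no side effects, the
transformation is straightforward, which fits with this if not getting in
the way of vectorization.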

Now if I comment out the first, larger if, clang does vectorize the loop.
With the return false commented out in the above GCC snippet, GCC also
vectorizes, but only when both ifs are commented out.

Results (with both ifs commented out), -march=native (resulting in avx2),
best of 3 as lbm is notoriously fickle:

gcc trunk vanilla: 156.04s
gcc trunk with elementwise: 132.10s
clang 17: 143.06s
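
For reference, the relative differences between these timings can be
computed as below.  This is only a small helper sketch (the function names
are made up); it shows the two common ways of quoting an improvement,
which give slightly different percentages for the same pair of runtimes.

```c
/* Percentage by which runtime T is shorter than BASELINE.  */
double
time_reduction_pct (double baseline, double t)
{
  return (baseline - t) / baseline * 100.0;
}

/* Percentage by which throughput improved, i.e. BASELINE / T - 1.  */
double
speedup_pct (double baseline, double t)
{
  return (baseline / t - 1.0) * 100.0;
}
```

Against the 156.04s vanilla run, the elementwise build comes out at about
15.3% less time (about 18.1% speedup), and clang 17 at about 8.3% less
time (about 9.1% speedup).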

Of course, even the comment already says that costing is difficult, and the
change will surely cause regressions elsewhere.  However, the 15% improvement
with vectorization (or the 9% improvement of clang) IMHO shows that it's
surely useful to look into this further.  On top of that, the riscv clang
seems to not care about the first if either and still vectorizes.  I haven't
looked closer at what happens there, though.