From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id EF5AF3857726; Wed, 27 Sep 2023 07:34:41 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org EF5AF3857726
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1695800081;
	bh=3kzbbrezZuDi4YIHXJX0KvirgsfQ9tmXKJZJzQ0EBZk=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=rcHnPeXeoQyVsXWRDNOOHc3PV7WuyjD/+YCbr2emWLk1Gob3meVSXl+dxk4Hgrwtq
	 uJkFaCC0f1GiDKlWhc2HsKlE7clF6bzg5LXPI3bqg2DaYx45VsHF0H3ggc2SEMTVld
	 1lAr8R+nDuuO2xZrA14d48GN5bGI8L8uFrP4kx24=
From: "juzhe.zhong at rivai dot ai" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/109088] GCC does not always vectorize
 conditional reduction
Date: Wed, 27 Sep 2023 07:34:41 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: juzhe.zhong at rivai dot ai
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-109088-4-PcK74xZbZ2@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-109088-4@http.gcc.gnu.org/bugzilla/>
References: <bug-109088-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088

--- Comment #10 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #9)
> (In reply to JuzheZhong from comment #8)
> > It's because the order of the operations we are doing:
> >
> > For code as follows:
> >
> > result += mask ? a[i] + x : 0;
> >
> > GCC:
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
> >
> > Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
> > we end up with 2 uses of the PHI result. Then, we failed to vectorize.
> >
> > Whereas LLVM:
> >
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > IR 1. tmp = a[i] + x;
> > IR 2. tmp2 = mask ? tmp : 0;
> > IR 3. result_ssa_2 = tmp2 + result_ssa_1.
>
> For floating point these are not equivalent (adding zero isn't a no-op).


Yes, I agree these are not equivalent for floating-point.
But they can be treated as equivalent if we specify -ffast-math.

I have double-checked LLVM: by default it also fails to vectorize the
conditional floating-point reduction.

However, if we pass -ffast-math to LLVM, it generates the same
if-conversion IR sequence as for the integer case, and vectorization
then succeeds.
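
To make the floating-point difference concrete, here is a minimal C sketch
of the two if-converted forms discussed above. The function names and the
`init` parameter are my own assumptions for illustration; they do not come
from GCC or LLVM. It assumes default IEEE round-to-nearest and no
-ffast-math.

```c
#include <math.h>
#include <stddef.h>

/* Select after the add: the old sum is kept untouched when the mask is
   false (the shape GCC's if-conversion produces in the example above).  */
float reduce_select_after_add (const float *a, const int *mask, size_t n,
                               float x, float init)
{
  float result = init;
  for (size_t i = 0; i < n; i++)
    {
      float tmp = a[i] + x;
      float tmp2 = tmp + result;
      result = mask[i] ? tmp2 : result;
    }
  return result;
}

/* Select before the add: 0 is added when the mask is false (the shape
   LLVM produces in the example above).  */
float reduce_select_before_add (const float *a, const int *mask, size_t n,
                                float x, float init)
{
  float result = init;
  for (size_t i = 0; i < n; i++)
    {
      float tmp = a[i] + x;
      float tmp2 = mask[i] ? tmp : 0.0f;
      result = tmp2 + result;
    }
  return result;
}

/* Helper: returns 1 only for negative zero.  */
int is_negative_zero (float v)
{
  return v == 0.0f && signbit (v) != 0;
}
```

With an initial value of -0.0f and a mask that is never true, the first
form keeps -0.0f while the second computes 0.0f + -0.0f, which is +0.0f.
That sign-of-zero difference is one case where adding zero is not a no-op.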


>
> > LLVM only has 1 use.
> >
> > Is it reasonable to swap the order in match.pd ?
>
> if-conversion could be taught to swap this (it's if-conversion creating
> the IL for conditional reductions) when valid.  IIRC Robin Dapp also has
> a patch to make if-conversion emit .COND_ADD instead which should make
> it even better to vectorize.

I know that patch. Robin is trying to fix the issue (in-order reduction)
that I posted.

I have confirmed that the patch does not help here, since it does not
modify the code for this case; we still end up with multiple uses in the
conditional reduction.

The reduction fails because of this check:

  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used ouside of the inner loop we cannot handle uses of the reduction
     value.  */
  if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "reduction used in loop.\n");
      return NULL;
    }

When nphi_def_loop_uses > 1, vectorization fails.
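
As a rough illustration of why the two statement orders behave differently
under this check, the following C sketch counts operand uses of the PHI
result in a toy IR. Everything here (struct toy_stmt, count_uses, the value
ids) is hypothetical and is not GCC's gimple or its immediate-use API; it
only mirrors the two three-statement sequences quoted earlier.

```c
#include <stddef.h>

/* Hypothetical value ids for the toy IR.  */
enum { NO_OPERAND = -1, RESULT_1, A_I, X, MASK, ZERO, TMP, TMP2, RESULT_2 };

/* One toy statement: a defined value and up to three operands.  */
struct toy_stmt
{
  int def;
  int ops[3];
};

/* GCC order: add first, then select between new and old sum.  */
const struct toy_stmt gcc_form[3] = {
  { TMP,      { A_I,  X,        NO_OPERAND } },  /* tmp = a[i] + x */
  { TMP2,     { TMP,  RESULT_1, NO_OPERAND } },  /* tmp2 = tmp + result_1 */
  { RESULT_2, { MASK, TMP2,     RESULT_1   } },  /* mask ? tmp2 : result_1 */
};

/* LLVM order: select the addend first, then add unconditionally.  */
const struct toy_stmt llvm_form[3] = {
  { TMP,      { A_I,  X,        NO_OPERAND } },  /* tmp = a[i] + x */
  { TMP2,     { MASK, TMP,      ZERO       } },  /* tmp2 = mask ? tmp : 0 */
  { RESULT_2, { TMP2, RESULT_1, NO_OPERAND } },  /* tmp2 + result_1 */
};

/* Count how many operand slots refer to VALUE.  */
size_t count_uses (const struct toy_stmt *stmts, size_t n, int value)
{
  size_t uses = 0;
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < 3; j++)
      if (stmts[i].ops[j] == value)
        uses++;
  return uses;
}
```

In this toy model, count_uses (gcc_form, 3, RESULT_1) is 2 and
count_uses (llvm_form, 3, RESULT_1) is 1, matching the "2 uses" versus
"1 use" observation above.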

I have checked the LLVM code, and I think we can extend this function:

strip_nop_cond_scalar_reduction

We should be able to strip all the statements until we reach the use of
the PHI result.

For example, LLVM is able to handle this case:

for ()
  if (cond)
    result += a[i] + b[i] + c[i] + ....

No matter how many variables are added in the conditional reduction,
LLVM handles it, since it keeps iterating over the statements until it
reaches the result:

result_ssa_1 = PHI <>
tmp1 = result_ssa_1 + a[i];
tmp2 = tmp1 + b[i];
tmp3 = tmp2 + c[i];
....

We keep iterating until we find result_ssa_1, which holds the reduction
variable.
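
The walk described above could look roughly like the following C sketch.
It uses a made-up toy IR (struct toy_add, find_def, walk_to_phi are all
hypothetical names) and is neither LLVM's code nor a proposed change to
strip_nop_cond_scalar_reduction; it only shows the idea of following a
chain of additions until one operand is the PHI result.

```c
#include <stddef.h>

/* Hypothetical value ids for the toy IR.  */
enum { PHI_RESULT, A, B, C, TMP1, TMP2, TMP3 };

/* One toy addition: lhs = op0 + op1.  */
struct toy_add
{
  int lhs;
  int op0;
  int op1;
};

/* Return the index of the statement defining VALUE, or -1 if none.  */
static int find_def (const struct toy_add *stmts, size_t n, int value)
{
  for (size_t i = 0; i < n; i++)
    if (stmts[i].lhs == value)
      return (int) i;
  return -1;
}

/* Walk backwards from LAST through the addition chain.  Return the number
   of additions visited when one of them uses PHI directly, or -1 if the
   chain never reaches PHI.  The step bound guards against cyclic input.  */
int walk_to_phi (const struct toy_add *stmts, size_t n, int last, int phi)
{
  int cur = last;
  for (size_t steps = 1; steps <= n; steps++)
    {
      int idx = find_def (stmts, n, cur);
      if (idx < 0)
        return -1;
      if (stmts[idx].op0 == phi || stmts[idx].op1 == phi)
        return (int) steps;
      /* Continue through whichever operand is itself an addition.  */
      if (find_def (stmts, n, stmts[idx].op0) >= 0)
        cur = stmts[idx].op0;
      else if (find_def (stmts, n, stmts[idx].op1) >= 0)
        cur = stmts[idx].op1;
      else
        return -1;
    }
  return -1;
}
```

For the chain tmp1 = result_ssa_1 + a[i]; tmp2 = tmp1 + b[i];
tmp3 = tmp2 + c[i], starting the walk at tmp3 visits three additions
before it finds the PHI result. A real implementation would also have to
check that the intermediate values have no other uses, which this sketch
leaves out.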

Is LLVM's approach reasonable for GCC?

If yes, I can port the LLVM code to GCC.

Thanks.