From: "juzhe.zhong at rivai dot ai"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
Date: Wed, 27 Sep 2023 07:34:41 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088

--- Comment #10 from JuzheZhong ---
(In reply to Richard Biener from comment #9)
> (In reply to JuzheZhong from comment #8)
> > It's because the order of the operations we are doing:
> >
> > For code as follows:
> >
> > result += mask ? a[i] + x : 0;
> >
> > GCC:
> > result_ssa_1 = PHI
> > ...
> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ?
> >         tmp2 : result_ssa_1;
> >
> > Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
> > we end up with 2 uses of the PHI result. Then, we failed to vectorize.
> >
> > Whereas LLVM:
> >
> > result_ssa_1 = PHI
> > ...
> > IR 1. tmp = a[i] + x;
> > IR 2. tmp2 = mask ? tmp : 0;
> > IR 3. result_ssa_2 = tmp2 + result_ssa_1.
>
> For floating point these are not equivalent (adding zero isn't a no-op).

Yes, I agree these are not equivalent for floating point, but they can be
treated as equivalent if we specify -ffast-math.

I have double-checked LLVM: by default it also fails to vectorize a
conditional floating-point reduction. However, with -ffast-math LLVM
generates the same if-conversion IR sequence as for integers, and
vectorization then succeeds.

>
> > LLVM only has 1 use.
> >
> > Is it reasonable to swap the order in match.pd ?
>
> if-conversion could be taught to swap this (it's if-conversion creating
> the IL for conditional reductions) when valid. IIRC Robin Dapp also has
> a patch to make if-conversion emit .COND_ADD instead which should make
> it even better to vectorize.

I know that patch; Robin is trying to fix the issue (in-order reduction)
that I posted.

I have confirmed that the patch doesn't help here, since it doesn't modify
the code for this case: we still end up with multiple uses in the
conditional reduction.

The reduction fails because of this check:

  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used ouside of the inner loop we cannot handle uses of the reduction
     value.  */
  if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "reduction used in loop.\n");
      return NULL;
    }

When nphi_def_loop_uses > 1, we fail to vectorize.

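To make the two orderings above concrete, here is a minimal scalar sketch in
C. The function names are hypothetical and only mirror the STMT/IR sequences
quoted earlier; this is not compiler output. It assumes IEEE 754 doubles and
a build without -ffast-math (or -fno-signed-zeros), otherwise the
floating-point difference may be optimized away.

```c
#include <assert.h>
#include <math.h>

/* GCC-style if-converted form: acc (standing in for the PHI result) is
   read by both the add and the select, so it has two uses.  */
int gcc_form_int(int acc, int mask, int a, int x)
{
    int tmp = a + x;
    int tmp2 = tmp + acc;
    return mask ? tmp2 : acc;
}

/* LLVM-style form: select first, then add; acc has a single use.  */
int llvm_form_int(int acc, int mask, int a, int x)
{
    int tmp = a + x;
    int tmp2 = mask ? tmp : 0;
    return tmp2 + acc;
}

/* The same two orderings for double.  */
double gcc_form_fp(double acc, int mask, double a, double x)
{
    double tmp = a + x;
    double tmp2 = tmp + acc;
    return mask ? tmp2 : acc;
}

double llvm_form_fp(double acc, int mask, double a, double x)
{
    double tmp = a + x;
    double tmp2 = mask ? tmp : 0.0;
    return tmp2 + acc;
}

/* Self-check: the integer forms agree, but the double forms differ when
   the accumulator is -0.0 and the mask is false, because 0.0 + (-0.0)
   is +0.0 under the default rounding mode.  */
void check_forms(void)
{
    assert(gcc_form_int(5, 1, 2, 3) == llvm_form_int(5, 1, 2, 3));
    assert(gcc_form_int(5, 0, 2, 3) == llvm_form_int(5, 0, 2, 3));
    assert(signbit(gcc_form_fp(-0.0, 0, 1.0, 2.0)));
    assert(!signbit(llvm_form_fp(-0.0, 0, 1.0, 2.0)));
}
```

The signed-zero case is one example of why adding zero is not a no-op for
floating point without -ffast-math.
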
I have checked the LLVM code, and I think we can extend this function:

strip_nop_cond_scalar_reduction

We should be able to strip all the statements until we reach the use of the
PHI result. LLVM is able to handle this case:

for ()
  if (cond)
    result += a[i] + b[i] + c[i] + ....

no matter how many variables are added in the conditional reduction. They
handle it well because they keep iterating over the statements until they
reach the result:

result_ssa_1 = PHI <>
tmp1 = result_ssa_1 + a[i];
tmp2 = tmp1 + b[i];
tmp3 = tmp2 + c[i];
....

We keep iterating until we find result_ssa_1, which holds the reduction
variable.

Is LLVM's approach reasonable for GCC? If yes, I can translate the LLVM
code into GCC.

Thanks.
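
The "keep iterating until we reach the PHI result" idea can be sketched on a
toy IR as follows. Everything here is hypothetical and for illustration only:
struct add_stmt, find_def and walk_add_chain are not GCC or LLVM APIs, and
the sketch assumes that leaf operands such as a[i] have no defining add in
the table, so at most one operand of each add continues the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy IR: each statement is "lhs = op0 + op1", with values
   identified by small integer ids.  */
struct add_stmt {
    int lhs;
    int op0;
    int op1;
};

/* Return the index of the statement defining id, or -1 if there is none.  */
int find_def(const struct add_stmt *stmts, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (stmts[i].lhs == id)
            return (int)i;
    return -1;
}

/* Starting from the value that feeds the loop latch, follow the chain of
   adds backwards until phi_result is reached.  Returns the number of adds
   walked, or -1 if the chain never reaches phi_result.  */
int walk_add_chain(const struct add_stmt *stmts, size_t n,
                   int latch_def, int phi_result)
{
    assert(stmts != NULL);

    int cur = latch_def;
    size_t steps = 0;

    while (cur != phi_result) {
        int idx = find_def(stmts, n, cur);
        if (idx < 0 || steps >= n)
            return -1;              /* no defining add, or a cycle */
        steps++;

        const struct add_stmt *s = &stmts[idx];
        if (s->op0 == phi_result || s->op1 == phi_result)
            cur = phi_result;       /* reached the reduction variable */
        else if (find_def(stmts, n, s->op0) >= 0)
            cur = s->op0;           /* chain continues through op0 */
        else if (find_def(stmts, n, s->op1) >= 0)
            cur = s->op1;           /* chain continues through op1 */
        else
            return -1;              /* both operands are leaves */
    }
    return (int)steps;
}
```

With id 0 standing for result_ssa_1, ids 1 to 3 for tmp1 to tmp3, and ids
10 to 12 for the loads, the table { {1, 0, 10}, {2, 1, 11}, {3, 2, 12} }
walked from id 3 reaches the PHI result after three adds. A real
implementation would also need to check the operation codes and that the
intermediate values have no other uses.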