From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 8F40C3858C5F; Wed, 28 Feb 2024 08:23:48 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8F40C3858C5F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1709108628;
	bh=cnr1myLWkcYqCVUg7Im0z+6RzSVyVeoQ6RomLmnBbeM=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=xwSg5TlWk6tvfXUHauSTk7NXyppsxONtn+7E+aJ5XZ/Cz768sF3Ntu4KzbVK82rRR
	 9hOY5JQEtmVNcD+zSdl9dUz7TuNU7Sd3MfsTMbDNyFXkmBEKC+1zaB8G85TBTR4Qx7
	 msmf0AKpsnDqxJlEbnfyGt7oa5nWqBvdxBB0j6h4=
From: "rguenther at suse dot de" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/112325] Missed vectorization of reduction
 after unrolling
Date: Wed, 28 Feb 2024 08:23:47 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenther at suse dot de
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-112325-4-J9MYNGNEkj@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-112325-4@http.gcc.gnu.org/bugzilla/>
References: <bug-112325-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 28 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
>
> --- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #13)
> > On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
> >=20
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D112325
> > >=20
> > > --- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > >=20
> > > >    Loop body is likely going to simplify further, this is difficult
> > > >    to guess, we just decrease the result by 1/3.  */
> > > >=20
> > >=20
> > > This is introduced by r0-68074-g91a01f21abfe19
> > >=20
> > > /* Estimate number of insns of completely unrolled loop.  We assume
> > > +   that the size of the unrolled loop is decreased in the
> > > +   following way (the numbers of insns are based on what
> > > +   estimate_num_insns returns for appropriate statements):
> > > +
> > > +   1) exit condition gets removed (2 insns)
> > > +   2) increment of the control variable gets removed (2 insns)
> > > +   3) All remaining statements are likely to get simplified
> > > +      due to constant propagation.  Hard to estimate; just
> > > +      as a heuristics we decrease the rest by 1/3.
> > > +
> > > +   NINSNS is the number of insns in the loop before unrolling.
> > > +   NUNROLL is the number of times the loop is unrolled.  */
> > > +
> > > +static unsigned HOST_WIDE_INT
> > > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +                        unsigned HOST_WIDE_INT nunroll)
> > > +{
> > > > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > > +  if (unr_insns <= 0)
> > > > +    unr_insns = 1;
> > > > +  unr_insns *= (nunroll + 1);
> > > +
> > > +  return unr_insns;
> > > +}
> > >=20
> > > And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> > > likely_eliminated stmt and minus that from total insns, But 2 / 3 is =
still
> > > keeped.
> > >=20
> > > +/* Estimate number of insns of completely unrolled loop.
> > > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > > +   the fact that in last copy everything after exit conditional
> > > +   is dead and that some instructions will be eliminated after
> > > +   peeling.
> > >=20
> > > -   NINSNS is the number of insns in the loop before unrolling.
> > > -   NUNROLL is the number of times the loop is unrolled.  */
> > > +   Loop body is likely going to simplify futher, this is difficult
> > > +   to guess, we just decrease the result by 1/3.  */
> > >=20
> > >  static unsigned HOST_WIDE_INT
> > > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +estimated_unrolled_size (struct loop_size *size,
> > >                          unsigned HOST_WIDE_INT nunroll)
> > >  {
> > > > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > > > +                            * (HOST_WIDE_INT) (size->overall
> > > > +                                               - size->eliminated_by_peeling));
> > > +  if (!nunroll)
> > > > +    unr_insns = 0;
> > > > +  unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
> > > +
> > > > +  unr_insns = unr_insns * 2 / 3;
> > > >    if (unr_insns <= 0)
> > > >      unr_insns = 1;
> > > > -  unr_insns *= (nunroll + 1);
> > > >
> > > > It looks to me 1 / 3 overestimates the instructions that can be optimised away,
> > > especially if we've subtracted eliminated_by_peeling
> >=20
> > Yes, that 1/3 reduction is a bit odd - you could have the same effect
> > by increasing the instruction limit by 1/3, but that means it doesn't
> > really matter, does it?  It would be interesting to see if increasing
> > the limit by 1/3 and removing the above is neutral on SPEC?
>
> Remove 1/3 reduction get ~2% improvement for 525.x264_r on SPR with
> -march=native -O3, no big impact on other integer benchmark.

454.calculix has always been the benchmark to cross-check, as it
benefits from a lot of unrolling.

I'm all for removing the 1/3 for innermost loop handling (in cunroll
the unrolled loop is then innermost).  I'm more concerned about
unrolling more than one level, which is exactly what's required for
454.calculix.
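
To make the trade-off between the 2/3 factor and the instruction limit
concrete, here is a minimal standalone sketch.  It is not the GCC
source: struct loop_size is cut down to the four fields the quoted
patch uses, plain long replaces HOST_WIDE_INT, and the scale flag,
equivalent_limit and limits_agree are hypothetical helpers added only
for illustration.  Under those assumptions it shows that comparing the
scaled estimate against a limit gives the same answer as comparing the
unscaled estimate against a limit of roughly 3/2 the size.

```c
/* Hypothetical stand-in for GCC's struct loop_size; only the fields
   used by the quoted patch are kept.  */
struct loop_size
{
  int overall;
  int eliminated_by_peeling;
  int last_iteration;
  int last_iteration_eliminated_by_peeling;
};

/* Standalone copy of the estimate from the quoted patch.  SCALE is an
   illustrative addition: nonzero applies the final 2/3 factor, zero
   skips it.  */
static long
estimated_unrolled_size (const struct loop_size *size,
                         unsigned long nunroll, int scale)
{
  long unr_insns = ((long) nunroll
                    * (size->overall - size->eliminated_by_peeling));
  if (!nunroll)
    unr_insns = 0;
  unr_insns += (size->last_iteration
                - size->last_iteration_eliminated_by_peeling);

  if (scale)
    unr_insns = unr_insns * 2 / 3;
  if (unr_insns <= 0)
    unr_insns = 1;
  return unr_insns;
}

/* Largest unscaled estimate U for which U * 2 / 3 <= LIMIT still holds
   with truncating integer division (LIMIT >= 0 assumed).  */
static long
equivalent_limit (long limit)
{
  return (3 * limit + 2) / 2;
}

/* Return 1 if "scaled estimate <= LIMIT" and "unscaled estimate <=
   equivalent_limit (LIMIT)" agree for every estimate in
   [0, MAX_ESTIMATE], else 0.  The clamp to 1 is ignored here; it
   cannot change the comparison once LIMIT >= 1.  */
static int
limits_agree (long limit, long max_estimate)
{
  for (long u = 0; u <= max_estimate; u++)
    {
      int scaled_ok = (u * 2 / 3) <= limit;
      int raised_ok = u <= equivalent_limit (limit);
      if (scaled_ok != raised_ok)
        return 0;
    }
  return 1;
}
```

With made-up sizes { 12, 4, 10, 4 } and nunroll = 8, the estimate is 70
insns unscaled and 46 scaled.  For an illustrative limit of 200 (not
necessarily GCC's default), the scaled check accepts exactly the same
loops as an unscaled check against 301, so dropping the factor while
keeping acceptance unchanged would mean raising the limit by about one
half rather than one third.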