From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 4EEBB3858C36; Wed, 28 Feb 2024 07:26:31 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 4EEBB3858C36
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1709105191;
	bh=fQe99eRzwvsZgUVDYOb8gzNhNh5Qigs8j5A4tsc8t08=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=DPwf7Zkz2x432fib8iYuRFtIoG9kkuQEfcMlJkDdBTq8jgedKJ7ACCjmbG1gvH2jf
	 FthXRjr5Wr/RTE9veeDKx444Vn9sv8RCDIJV0haJOi7Sa4WzCIMRvcqE5JYy4WBW2f
	 4Tu4CjYK+VrhbaKJG6oU0MZ5tth7bJVbjDsaFpbQ=
From: "liuhongt at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/112325] Missed vectorization of reduction
 after unrolling
Date: Wed, 28 Feb 2024 07:26:28 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: liuhongt at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-112325-4-DS3TBtcB5a@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-112325-4@http.gcc.gnu.org/bugzilla/>
References: <bug-112325-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #13)
> On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> >
> > --- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> >
> > >    Loop body is likely going to simplify further, this is difficult
> > >    to guess, we just decrease the result by 1/3.  */
> > >
> >
> > This is introduced by r0-68074-g91a01f21abfe19
> >
> > /* Estimate number of insns of completely unrolled loop.  We assume
> > +   that the size of the unrolled loop is decreased in the
> > +   following way (the numbers of insns are based on what
> > +   estimate_num_insns returns for appropriate statements):
> > +
> > +   1) exit condition gets removed (2 insns)
> > +   2) increment of the control variable gets removed (2 insns)
> > +   3) All remaining statements are likely to get simplified
> > +      due to constant propagation.  Hard to estimate; just
> > +      as a heuristics we decrease the rest by 1/3.
> > +
> > +   NINSNS is the number of insns in the loop before unrolling.
> > +   NUNROLL is the number of times the loop is unrolled.  */
> > +
> > +static unsigned HOST_WIDE_INT
> > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > +                        unsigned HOST_WIDE_INT nunroll)
> > +{
> > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > +  if (unr_insns <= 0)
> > +    unr_insns = 1;
> > +  unr_insns *= (nunroll + 1);
> > +
> > +  return unr_insns;
> > +}
> >
> > And r0-93444-g08f1af2ed022e0 tries to do this more accurately by marking
> > likely_eliminated stmts and subtracting them from the total insns, but the
> > 2 / 3 factor is still kept.
> >
> > +/* Estimate number of insns of completely unrolled loop.
> > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > +   the fact that in last copy everything after exit conditional
> > +   is dead and that some instructions will be eliminated after
> > +   peeling.
> >
> > -   NINSNS is the number of insns in the loop before unrolling.
> > -   NUNROLL is the number of times the loop is unrolled.  */
> > +   Loop body is likely going to simplify futher, this is difficult
> > +   to guess, we just decrease the result by 1/3.  */
> >
> >  static unsigned HOST_WIDE_INT
> > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > +estimated_unrolled_size (struct loop_size *size,
> >                          unsigned HOST_WIDE_INT nunroll)
> >  {
> > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > +                            * (HOST_WIDE_INT) (size->overall
> > +                                               - size->eliminated_by_peeling));
> > +  if (!nunroll)
> > +    unr_insns = 0;
> > +  unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
> > +
> > +  unr_insns = unr_insns * 2 / 3;
> >    if (unr_insns <= 0)
> >      unr_insns = 1;
> > -  unr_insns *= (nunroll + 1);
> >
> > It looks to me like 1 / 3 overestimates the instructions that can be
> > optimised away, especially if we've already subtracted
> > eliminated_by_peeling.
>
> Yes, that 1/3 reduction is a bit odd - you could have the same effect
> by increasing the instruction limit by 1/3, but that means it doesn't
> really matter, does it?  It would be interesting to see if increasing
> the limit by 1/3 and removing the above is neutral on SPEC?

Removing the 1/3 reduction gives a ~2% improvement for 525.x264_r on SPR with
-march=native -O3, and has no big impact on the other integer benchmarks.
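
To make the effect of that factor concrete, here is a minimal standalone
sketch of the arithmetic in the quoted estimated_unrolled_size, with and
without the final "* 2 / 3" step. The struct loop_size_sketch type, the
estimate_sketch name and its scale flag are made up for illustration and are
not GCC code; only the four field names are taken from the quoted diff.

```c
#include <assert.h>

/* Stand-in for GCC's struct loop_size (hypothetical): only the four
   fields used by the quoted diff are modelled.  */
struct loop_size_sketch
{
  int overall;
  int eliminated_by_peeling;
  int last_iteration;
  int last_iteration_eliminated_by_peeling;
};

/* Mirrors the arithmetic of the quoted estimated_unrolled_size.
   SCALE is a hypothetical switch: nonzero keeps the "* 2 / 3" step,
   zero drops it.  */
long
estimate_sketch (const struct loop_size_sketch *size, long nunroll, int scale)
{
  assert (size != 0 && nunroll >= 0);

  /* NUNROLL full copies of the body, minus what peeling removes.  */
  long unr_insns
    = nunroll * (long) (size->overall - size->eliminated_by_peeling);
  if (!nunroll)
    unr_insns = 0;

  /* Plus the partial last iteration.  */
  unr_insns += (size->last_iteration
                - size->last_iteration_eliminated_by_peeling);

  /* The guessed 1/3 reduction discussed in this bug.  */
  if (scale)
    unr_insns = unr_insns * 2 / 3;

  if (unr_insns <= 0)
    unr_insns = 1;
  return unr_insns;
}
```

With made-up sizes (overall = 12, eliminated_by_peeling = 4, last_iteration =
6, last_iteration_eliminated_by_peeling = 3) and nunroll = 8, this gives 44
with the scaling and 67 without, so a loop that fits under a given size limit
with the scaling may no longer fit once the scaling is removed.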

The regression comes from the function below: cunrolli unrolls the inner loop,
cunroll then unrolls the outer loop, and the result causes lots of spills.

typedef unsigned long long uint64_t;
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
uint64_t x264_pixel_var_8x8(uint8_t *pix, int i_stride )
{
    uint32_t sum = 0, sqr = 0;
    for( int y = 0; y < 8; y++ )
    {
        for( int x = 0; x < 8; x++ )
        {
            sum += pix[x];
            sqr += pix[x] * pix[x];
        }
        pix += i_stride;
    }
    return sum + ((uint64_t)sqr << 32);
}
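
For readers who want to see roughly what is left after the inner loop has been
unrolled, here is a hand-written source-level approximation next to the rolled
form. It is only a sketch: the names pixel_var_8x8_rolled and
pixel_var_8x8_inner_unrolled are made up, <stdint.h> replaces the typedefs
above, and the real GIMPLE after cunrolli will differ in detail.

```c
#include <stdint.h>

/* Rolled form, same computation as the function in this comment.  */
uint64_t
pixel_var_8x8_rolled (const uint8_t *pix, int i_stride)
{
  uint32_t sum = 0, sqr = 0;
  for (int y = 0; y < 8; y++)
    {
      for (int x = 0; x < 8; x++)
        {
          sum += pix[x];
          sqr += pix[x] * pix[x];
        }
      pix += i_stride;
    }
  return sum + ((uint64_t) sqr << 32);
}

/* Hypothetical hand-unrolled inner loop: eight scalar load/add/multiply
   groups per row feed the two reductions.  Unrolling the outer loop as
   well would repeat this body eight times.  */
uint64_t
pixel_var_8x8_inner_unrolled (const uint8_t *pix, int i_stride)
{
  uint32_t sum = 0, sqr = 0;
  for (int y = 0; y < 8; y++)
    {
      sum += pix[0]; sqr += pix[0] * pix[0];
      sum += pix[1]; sqr += pix[1] * pix[1];
      sum += pix[2]; sqr += pix[2] * pix[2];
      sum += pix[3]; sqr += pix[3] * pix[3];
      sum += pix[4]; sqr += pix[4] * pix[4];
      sum += pix[5]; sqr += pix[5] * pix[5];
      sum += pix[6]; sqr += pix[6] * pix[6];
      sum += pix[7]; sqr += pix[7] * pix[7];
      pix += i_stride;
    }
  return sum + ((uint64_t) sqr << 32);
}
```

Both forms compute the same value for any 8x8 block, so the second one can be
used as a quick reference when comparing generated code for the two shapes.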