From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id EC6CA3858C60; Tue, 27 Feb 2024 07:58:30 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org EC6CA3858C60
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1709020710;
	bh=7tQnhPDTy1LeJ+h7LU8zbPq7zrftH5Tr9XRF2LH9jao=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=g5xCQmga88NNXS3vMWMLFsz2dtrrw11HtxtGWI0808/0k/ixO48pYu2AaQpmi/YVc
	 +YMZIOKsNK9JBvUjdyPJg9qyJCA44MusRCHU6agikz9IPTz/1EPrW4O4iaLQ2ThubB
	 FErgigm7ONb2ulq8iuC30xTBEfyXyB40tHzya5j4=
From: "rguenther at suse dot de" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/112325] Missed vectorization of reduction
 after unrolling
Date: Tue, 27 Feb 2024 07:58:30 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenther at suse dot de
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-112325-4-9WgvIAzYJ2@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-112325-4@http.gcc.gnu.org/bugzilla/>
References: <bug-112325-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
>
> --- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
>
> >    Loop body is likely going to simplify further, this is difficult
> >    to guess, we just decrease the result by 1/3.  */
> >
>
> This is introduced by r0-68074-g91a01f21abfe19
>
> /* Estimate number of insns of completely unrolled loop.  We assume
> +   that the size of the unrolled loop is decreased in the
> +   following way (the numbers of insns are based on what
> +   estimate_num_insns returns for appropriate statements):
> +
> +   1) exit condition gets removed (2 insns)
> +   2) increment of the control variable gets removed (2 insns)
> +   3) All remaining statements are likely to get simplified
> +      due to constant propagation.  Hard to estimate; just
> +      as a heuristics we decrease the rest by 1/3.
> +
> +   NINSNS is the number of insns in the loop before unrolling.
> +   NUNROLL is the number of times the loop is unrolled.  */
> +
> +static unsigned HOST_WIDE_INT
> +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +                        unsigned HOST_WIDE_INT nunroll)
> +{
> +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  if (unr_insns <= 0)
> +    unr_insns = 1;
> +  unr_insns *= (nunroll + 1);
> +
> +  return unr_insns;
> +}
>
> And r0-93444-g08f1af2ed022e0 tries to do it more accurately by marking
> likely_eliminated stmts and subtracting them from the total insns, but the
> 2 / 3 factor is still kept.
>
> +/* Estimate number of insns of completely unrolled loop.
> +   It is (NUNROLL + 1) * size of loop body with taking into account
> +   the fact that in last copy everything after exit conditional
> +   is dead and that some instructions will be eliminated after
> +   peeling.
>
> -   NINSNS is the number of insns in the loop before unrolling.
> -   NUNROLL is the number of times the loop is unrolled.  */
> +   Loop body is likely going to simplify futher, this is difficult
> +   to guess, we just decrease the result by 1/3.  */
>
>  static unsigned HOST_WIDE_INT
> -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +estimated_unrolled_size (struct loop_size *size,
>                          unsigned HOST_WIDE_INT nunroll)
>  {
> -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  HOST_WIDE_INT unr_insns = ((nunroll)
> +                            * (HOST_WIDE_INT) (size->overall
> +                                               - size->eliminated_by_peeling));
> +  if (!nunroll)
> +    unr_insns = 0;
> +  unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
> +
> +  unr_insns = unr_insns * 2 / 3;
>    if (unr_insns <= 0)
>      unr_insns = 1;
> -  unr_insns *= (nunroll + 1);
>
> It looks to me 1 / 3 overestimates the instructions that can be optimised away,
> especially if we've subtracted eliminated_by_peeling

Yes, that 1/3 reduction is a bit odd - you could have the same effect
by increasing the instruction limit by 1/3, but that means it doesn't
really matter, does it?  It would be interesting to see if increasing
the limit by 1/3 and removing the above is neutral on SPEC?
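
To make the equivalence concrete, here is a minimal sketch in plain C,
not GCC code. The function names and the use of long are assumptions
made for illustration, and the clamp to 1 from estimated_unrolled_size
is left out to keep the arithmetic visible. It compares "discount the
estimate by 1/3" against "leave the estimate alone and raise the limit".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: accept unrolling when the estimate, cut by 1/3
   with the same integer arithmetic as the quoted code (x * 2 / 3),
   fits under LIMIT.  */
bool
fits_with_discount (long raw_insns, long limit)
{
  assert (raw_insns >= 0 && limit >= 0);
  return raw_insns * 2 / 3 <= limit;
}

/* Hypothetical helper: the same decision without any discount,
   obtained by scaling the limit instead.  For non-negative integers,
   floor (2r / 3) <= L holds exactly when 2r <= 3L + 2.  */
bool
fits_with_scaled_limit (long raw_insns, long limit)
{
  assert (raw_insns >= 0 && limit >= 0);
  return raw_insns <= (3 * limit + 2) / 2;
}
```

Under these assumptions the two predicates agree for every input, so
the discount behaves like a larger limit. Note that, going purely by
this arithmetic, the limit that matches a 2/3 factor exactly is about
3/2 of the original (301 for a limit of 200), so raising the limit by
only 1/3 would be somewhat stricter than the current code.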

Note that this kind of "simplification guessing" matters most for the
second stage, unrolling an outer loop whose inner loop has already been
unrolled, because there are second-level recurrences to optimize that
the "eliminated by peeling" heuristics do not catch (but value-numbering
would).  So another option would be to skip the 1/3 reduction for
innermost loops and apply it only to the loops enclosing them.
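
A rough sketch of that last idea, in standalone C rather than as a
patch: struct loop_size_sketch is a cut-down stand-in that carries only
the four fields the quoted estimated_unrolled_size reads, and the
innermost flag is a hypothetical parameter (how the caller would decide
it is not shown here).

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the size summary used by the quoted code; only the
   fields read by the estimate are kept.  */
struct loop_size_sketch
{
  int overall;
  int eliminated_by_peeling;
  int last_iteration;
  int last_iteration_eliminated_by_peeling;
};

/* Same estimate as the quoted function, except that the 2/3 factor is
   applied only when the loop is not innermost.  INNERMOST is an
   assumed, caller-supplied flag.  */
long
estimated_unrolled_size_sketch (const struct loop_size_sketch *size,
                                unsigned long nunroll, bool innermost)
{
  assert (size != 0);
  long unr_insns = (long) nunroll
                   * (size->overall - size->eliminated_by_peeling);
  unr_insns += size->last_iteration
               - size->last_iteration_eliminated_by_peeling;

  /* Guess at further simplification only for outer loops.  */
  if (!innermost)
    unr_insns = unr_insns * 2 / 3;
  if (unr_insns <= 0)
    unr_insns = 1;
  return unr_insns;
}
```

With overall = 12, eliminated_by_peeling = 4, last_iteration = 6,
last_iteration_eliminated_by_peeling = 3 and nunroll = 8, the sketch
gives 67 for an innermost loop and 44 for an outer one.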