From: "rguenther at suse dot de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
Date: Wed, 28 Feb 2024 08:23:47 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #15 from rguenther at suse dot de ---
On Wed, 28 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
>
> --- Comment #14 from Hongtao Liu ---
> (In reply to rguenther@suse.de from comment #13)
> > On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> > >
> > > --- Comment #11 from Hongtao Liu ---
> > >
> > > > Loop body is likely going to simplify further, this is difficult
> > > > to guess, we just decrease the result by 1/3. */
> > > >
> > >
> > > This is introduced by r0-68074-g91a01f21abfe19
> > >
> > > /* Estimate number of insns of completely unrolled loop. We assume
> > > +   that the size of the unrolled loop is decreased in the
> > > +   following way (the numbers of insns are based on what
> > > +   estimate_num_insns returns for appropriate statements):
> > > +
> > > +   1) exit condition gets removed (2 insns)
> > > +   2) increment of the control variable gets removed (2 insns)
> > > +   3) All remaining statements are likely to get simplified
> > > +      due to constant propagation. Hard to estimate; just
> > > +      as a heuristics we decrease the rest by 1/3.
> > > +
> > > +   NINSNS is the number of insns in the loop before unrolling.
> > > +   NUNROLL is the number of times the loop is unrolled. */
> > > +
> > > +static unsigned HOST_WIDE_INT
> > > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +                         unsigned HOST_WIDE_INT nunroll)
> > > +{
> > > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > +  if (unr_insns <= 0)
> > > +    unr_insns = 1;
> > > +  unr_insns *= (nunroll + 1);
> > > +
> > > +  return unr_insns;
> > > +}
> > >
> > > And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> > > likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> > > keeped.
> > >
> > > +/* Estimate number of insns of completely unrolled loop.
> > > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > > +   the fact that in last copy everything after exit conditional
> > > +   is dead and that some instructions will be eliminated after
> > > +   peeling.
> > >
> > > -   NINSNS is the number of insns in the loop before unrolling.
> > > -   NUNROLL is the number of times the loop is unrolled. */
> > > +   Loop body is likely going to simplify futher, this is difficult
> > > +   to guess, we just decrease the result by 1/3. */
> > >
> > > static unsigned HOST_WIDE_INT
> > > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +estimated_unrolled_size (struct loop_size *size,
> > > unsigned HOST_WIDE_INT nunroll)
> > > {
> > > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > > +                             * (HOST_WIDE_INT) (size->overall
> > > +                                                - size->eliminated_by_peeling));
> > > +  if (!nunroll)
> > > +    unr_insns = 0;
> > > +  unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
> > > +
> > > +  unr_insns = unr_insns * 2 / 3;
> > > if (unr_insns <= 0)
> > > unr_insns = 1;
> > > -  unr_insns *= (nunroll + 1);
> > >
> > > It looks to me 1 / 3 overestimates the instructions that can be optimised away,
> > > especially if we've subtracted eliminated_by_peeling
> >
> > Yes, that 1/3 reduction is a bit odd - you could have the same effect
> > by increasing the instruction limit by 1/3, but that means it doesn't
> > really matter, does it? It would be interesting to see if increasing
> > the limit by 1/3 and removing the above is neutral on SPEC?
>
> Remove 1/3 reduction get ~2% improvement for 525.x264_r on SPR with
> -march=native -O3, no big impact on other integer benchmark.

454.calculix was always the benchmark to cross check as that
benefits from much unrolling.

I'm all for removing the 1/3 for innermost loop handling (in cunroll
the unrolled loop is then innermost). I'm more concerned about
unrolling more than one level which is exactly what's required for
454.calculix.
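
For illustration, the two quoted versions of the estimate can be put side
by side. What follows is a minimal standalone C sketch under stated
assumptions, not GCC code: struct loop_size_sketch, estimate_v1,
estimate_v2 and the apply_two_thirds flag are hypothetical names, plain
long stands in for HOST_WIDE_INT, and only the four struct fields that
appear in the quoted diff are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for GCC's struct loop_size.  Only the
   fields used by the quoted patch are present.  */
struct loop_size_sketch
{
  long overall;
  long eliminated_by_peeling;
  long last_iteration;
  long last_iteration_eliminated_by_peeling;
};

/* Arithmetic of the first quoted patch: drop 4 insns (exit test and
   induction variable increment), keep 2/3 of the rest, then multiply
   by the number of copies (nunroll + 1).  */
static long
estimate_v1 (long ninsns, long nunroll)
{
  long unr_insns = 2 * (ninsns - 4) / 3;
  if (unr_insns <= 0)
    unr_insns = 1;
  unr_insns *= (nunroll + 1);
  return unr_insns;
}

/* Arithmetic of the second quoted patch.  APPLY_TWO_THIRDS is an
   assumption added for this sketch: nonzero keeps the 2/3 scaling,
   zero models the proposal to remove it.  */
static long
estimate_v2 (const struct loop_size_sketch *size, long nunroll,
             int apply_two_thirds)
{
  assert (size != NULL);

  /* NUNROLL full copies, each without the insns peeling eliminates.  */
  long unr_insns = nunroll * (size->overall - size->eliminated_by_peeling);
  if (!nunroll)
    unr_insns = 0;

  /* The last copy only runs up to the exit condition.  */
  unr_insns += (size->last_iteration
                - size->last_iteration_eliminated_by_peeling);

  if (apply_two_thirds)
    unr_insns = unr_insns * 2 / 3;
  if (unr_insns <= 0)
    unr_insns = 1;
  return unr_insns;
}
```

With made-up inputs overall = 12, eliminated_by_peeling = 4,
last_iteration = 6, last_iteration_eliminated_by_peeling = 3 and
nunroll = 8, estimate_v2 gives 44 with the scaling and 67 without it,
while estimate_v1 (12, 8) gives 45. Up to integer rounding, scaling the
estimate by 2/3 before comparing it against a limit is the same as
comparing the unscaled estimate against 3/2 of that limit.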