From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 6DBAB3858D37; Sat, 21 Jan 2023 13:32:53 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6DBAB3858D37
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1674307973;
	bh=DgnM4WL5bGOkqkSKGRObyZOnp4BEdDAIoy+xBKCsWKY=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=L4N+fP5Ls0UOTA5MAlQKRJWhmDzBQA5b9r0UjXZo3nAe2ZOld1oCjhj9uo73iKUt0
	 V/Pqa69DJPNPzDQXqp7zk2lq5UVYM1031hxFjhat0H0m3qJtB12+4/z03RLZ6GQmKT
	 msO/z1IiNW5cH9fozMmkzMmFwuGN7vLfeG8wppAU=
From: "Mark_B53 at yahoo dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/108487] [10/11/12/13 Regression] ~20-30x
 slowdown in populating std::vector from std::ranges::iota_view
Date: Sat, 21 Jan 2023 13:32:53 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 12.2.0
X-Bugzilla-Keywords: needs-bisection
X-Bugzilla-Severity: normal
X-Bugzilla-Who: Mark_B53 at yahoo dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-108487-4-XqowqWrAYq@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-108487-4@http.gcc.gnu.org/bugzilla/>
References: <bug-108487-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108487

--- Comment #2 from MARK BOURGEAULT <Mark_B53 at yahoo dot com> ---
>> For fn1, assembly of the inner loop should be identical, so I think the
>> 20% you were seeing may result from different loop alignment with respect
>> to 32b fetch boundary
Yes, it does appear that this is the explanation for the difference.  Here
are the full results:

original code
 * gcc 10.3 -std=c++20 -O3 => fn1 = ~2000ms, fn2 = ~1000ms
 * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~1000ms
 * gcc 12.2 -std=c++20 -O3 => fn1 = ~2500ms, fn2 = ~32000ms
 * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~32000ms

fn1 only
 * gcc 10.3 -std=c++20 -O3 => fn1 = ~2500ms
 * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms
 * gcc 12.2 -std=c++20 -O3 => fn1 = ~2000ms
 * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms

>> Also please note that cloud instances backing godbolt.org have different
>> CPUs, so timing results from different runs are not directly comparable.
Yes, I know.  I really only used godbolt to reach the conclusion that the
issue still exists on trunk.