From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 744D03858C62; Wed, 12 Jul 2023 00:58:58 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 744D03858C62
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1689123538;
	bh=0YvBm1m5iyrl74Nhk3jvq8NKKVCL5AZyG2k45KHhI+I=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=kWKFCr7GMTOOrJ/sm+T28bMWZE0DHA61Ewxe3dGTbEXmqoDxeSkddx8PVREb9UjxE
	 yX2stuVy9SVcDh/ZBSUYSYK1B3SqmGaCXcDypaBHlzy6QBBlA/kHggSwUc7JofMJS6
	 tyBFIYgJgDYOEKiph0OX0roVXK2bHUXTEMQM+m0s=
From: "hliu at amperecomputing dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as
 the reduction_latency calculated by new costs is too large
Date: Wed, 12 Jul 2023 00:58:58 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: hliu at amperecomputing dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-110625-4-TAB0sPzsSZ@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-110625-4@http.gcc.gnu.org/bugzilla/>
References: <bug-110625-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625
--- Comment #2 from Hao Liu <hliu at amperecomputing dot com> ---
To my understanding, "reduction latency" is the minimum number of cycles needed
to do the reduction calculation for one iteration of the loop.  It is calculated
from the extra instruction issue info of the new cost models in the AArch64
backend.

Usually, the reduction latency of the vectorized loop should be smaller than
that of the scalar loop.  If the latency of the vectorized loop is larger than
that of the scalar loop, the cost model assumes vectorization may not be
beneficial, so it scales up the vect-body costs by
vect_reduct_latency/scalar_reduct_latency, as happens in the above case.

For the above case, the cost model thinks the scalar loop needs 4 cycles
(2*VF=4) to calculate "results.m += rhs", while the vectorized loop needs 8
cycles (2*count=8).  As a result, the vect-body costs are doubled from the
original value of 51 to 102.  This does not seem right for the vectorized loop,
which should only need 2 cycles to calculate the SIMD version of
"results.m += rhs".
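
The arithmetic above can be sketched as follows.  This is only an illustration
of the scaling as I described it, not the actual code in the AArch64 backend:
the function name scaled_body_cost and the variable names are hypothetical, the
values VF=2 and count=4 are inferred from "2*VF=4" and "2*count=8", and the
integer division is an assumption about rounding.

```python
def scaled_body_cost(body_cost, scalar_latency, vect_latency):
    """Scale the vect-body cost by vect_latency/scalar_latency when the
    vector reduction latency exceeds the scalar one (hypothetical helper)."""
    if vect_latency > scalar_latency:
        # Assumption: plain integer scaling; the real rounding is not modeled.
        return body_cost * vect_latency // scalar_latency
    return body_cost


op_latency = 2   # cycles per reduction operation, as in the case above
vf = 2           # inferred from 2*VF = 4
count = 4        # inferred from 2*count = 8

scalar_latency = op_latency * vf     # 4 cycles for the scalar loop
vect_latency = op_latency * count    # 8 cycles as currently computed

# Cost as currently computed, versus the cost if the vector loop were
# credited with only 2 cycles for the SIMD "results.m += rhs".
reported = scaled_body_cost(51, scalar_latency, vect_latency)
expected = scaled_body_cost(51, scalar_latency, op_latency)
print(reported, expected)
```

With a vector reduction latency of 2 cycles, the latency would no longer exceed
the scalar one, so the vect-body costs would stay at 51 instead of 102.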