From: "hliu at amperecomputing dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
Date: Wed, 12 Jul 2023 00:58:58 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #2 from Hao Liu ---
As I understand it, the "reduction latency" is the minimum number of cycles needed to perform the reduction calculation for one iteration of the loop. It is calculated from the extra instruction issue info of the new cost models in the AArch64 backend. Usually, the reduction latency of the vectorized loop should be smaller than that of the scalar loop.
If the latency of the vectorized loop is larger than that of the scalar loop, the cost model assumes that vectorization may not be beneficial, so it scales the vect-body cost by vect_reduct_latency/scalar_reduct_latency, as in the case above. There, it assumes the scalar loop needs 4 cycles (2*VF=4) to calculate "results.m += rhs", while the vectorized loop needs 8 cycles (2*count=8). As a result, the vect-body cost is doubled from its original value of 51 to 102. That does not seem right for the vectorized loop, which should need only 2 cycles to calculate the SIMD version of "results.m += rhs".
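
The arithmetic described above can be reproduced with a small sketch. This is not the GCC implementation: the function name scaled_body_cost, its parameters, and the round-up behaviour are assumptions made for illustration only. Only the numbers (latencies 4 and 8, cost 51 becoming 102, and the expected latency of 2) come from the comment.

```python
def scaled_body_cost(body_cost, scalar_latency, vect_latency):
    """Hypothetical model of the scaling described in the comment.

    If the vector reduction latency exceeds the scalar one, scale the
    vector body cost by vect_latency / scalar_latency. Rounding up is
    an assumption of this sketch, not something the comment states.
    """
    if vect_latency <= scalar_latency:
        return body_cost
    # Integer ceiling division of (body_cost * vect_latency) by scalar_latency.
    return -(-body_cost * vect_latency // scalar_latency)


# Values reported in the comment: scalar = 2*VF = 4, vector = 2*count = 8.
print(scaled_body_cost(51, 4, 8))  # 102

# With the latency the comment argues for (2 cycles), no scaling applies.
print(scaled_body_cost(51, 4, 2))  # 51
```

Under these assumptions, the cost would stay at 51 if the vectorized reduction latency were computed as 2 cycles rather than 8.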