From: "rsandifo at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
Date: Tue, 18 Jul 2023 10:41:11 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

rsandifo at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #4 from rsandifo at gcc dot gnu.org ---
Sorry, didn't see this PR until now.  On:

> general operations = 15 <-- Too large

Are you sure this is too large?
The vector code seems to be:

        ldr     q31, [x3], 16
        ldr     q29, [x4], -16
        rev64   v31.8h, v31.8h
        uxtl    v30.4s, v31.4h
        uxtl2   v31.4s, v31.8h
        sxtl    v27.2d, v30.2s
        sxtl    v28.2d, v31.2s
        sxtl2   v30.2d, v30.4s
        sxtl2   v31.2d, v31.4s
        scvtf   v27.2d, v27.2d
        scvtf   v28.2d, v28.2d
        scvtf   v30.2d, v30.2d
        scvtf   v31.2d, v31.2d
        fmla    v26.2d, v27.2d, v29.d[1]
        fmla    v24.2d, v30.2d, v29.d[1]
        fmla    v23.2d, v28.2d, v29.d[0]
        fmla    v25.2d, v31.2d, v29.d[0]

Discounting the loads, we do have 15 general operations.

On the reduction latency, the:

>   /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
>      that's not yet the case.  */

is referring to the single_defuse_cycle code in vectorizable_reduction.
That's always seemed like a misfeature to me, since it serialises a
multi-vector reduction through a single accumulator.  I guess it's finally
time to opt out of that for aarch64.

If we did opt out, then removing the “* count” should be correct for all
cases.
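
Both points above can be checked with a small Python sketch.  It is a toy model, not GCC's cost code: count_general_ops, reduction_latency and LOAD_MNEMONICS are hypothetical names, treating only "ldr" as a load is an assumption, and the latency value of 4 is a placeholder rather than a real tuning number.  The first function tallies the non-load instructions in the listing; the second shows why the "* count" factor only makes sense while the reduction is serialised through one accumulator.

```python
# Vector loop body quoted in the comment above, copied verbatim.
LOOP_BODY = """
ldr   q31, [x3], 16
ldr   q29, [x4], -16
rev64 v31.8h, v31.8h
uxtl  v30.4s, v31.4h
uxtl2 v31.4s, v31.8h
sxtl  v27.2d, v30.2s
sxtl  v28.2d, v31.2s
sxtl2 v30.2d, v30.4s
sxtl2 v31.2d, v31.4s
scvtf v27.2d, v27.2d
scvtf v28.2d, v28.2d
scvtf v30.2d, v30.2d
scvtf v31.2d, v31.2d
fmla  v26.2d, v27.2d, v29.d[1]
fmla  v24.2d, v30.2d, v29.d[1]
fmla  v23.2d, v28.2d, v29.d[0]
fmla  v25.2d, v31.2d, v29.d[0]
"""

# Assumption: only "ldr" counts as a load; everything else is "general".
LOAD_MNEMONICS = {"ldr"}


def count_general_ops(asm: str) -> int:
    """Count the non-load instructions in an assembly listing."""
    mnemonics = [line.split()[0] for line in asm.splitlines() if line.strip()]
    return sum(1 for m in mnemonics if m not in LOAD_MNEMONICS)


def reduction_latency(op_latency: int, count: int,
                      single_defuse_cycle: bool) -> int:
    """Hypothetical model of the per-iteration reduction latency.

    With a single def-use cycle, COUNT vector operations chain through one
    accumulator, so their latencies add up.  With COUNT independent
    accumulators the chains overlap, and only one operation's latency sits
    on the loop-carried critical path.
    """
    return op_latency * count if single_defuse_cycle else op_latency


if __name__ == "__main__":
    print("general ops:", count_general_ops(LOOP_BODY))
    # op_latency=4 is a placeholder, not a real tuning value.
    print("serialised:", reduction_latency(4, 4, single_defuse_cycle=True))
    print("parallel:  ", reduction_latency(4, 4, single_defuse_cycle=False))
```

Under these assumptions the count for the listing is 15 (17 instructions minus the two ldr), and dropping the "* count" factor corresponds to the single_defuse_cycle=False branch of the model.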