From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 6F5E03857806; Tue, 18 Jul 2023 10:41:12 +0000 (GMT)
From: "rsandifo at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as
 the reduction_latency calculated by new costs is too large
Date: Tue, 18 Jul 2023 10:41:11 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rsandifo at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cc
Message-ID: <bug-110625-4-4O4QdulQRr@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-110625-4@http.gcc.gnu.org/bugzilla/>
References: <bug-110625-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org
--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Sorry, didn't see this PR until now.

On:

>       general operations = 15   <-- Too large

Are you sure this is too large?  The vector code seems to be:

        ldr     q31, [x3], 16
        ldr     q29, [x4], -16
        rev64   v31.8h, v31.8h
        uxtl    v30.4s, v31.4h
        uxtl2   v31.4s, v31.8h
        sxtl    v27.2d, v30.2s
        sxtl    v28.2d, v31.2s
        sxtl2   v30.2d, v30.4s
        sxtl2   v31.2d, v31.4s
        scvtf   v27.2d, v27.2d
        scvtf   v28.2d, v28.2d
        scvtf   v30.2d, v30.2d
        scvtf   v31.2d, v31.2d
        fmla    v26.2d, v27.2d, v29.d[1]
        fmla    v24.2d, v30.2d, v29.d[1]
        fmla    v23.2d, v28.2d, v29.d[0]
        fmla    v25.2d, v31.2d, v29.d[0]

Discounting the loads, we do have 15 general operations.
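
As a quick cross-check of that count, here is a minimal C sketch that
tallies the mnemonics from the listing above.  The array and the function
name are made up for illustration, and treating every non-load as one
general operation is an assumption of the sketch, not the cost model's
actual classification logic.

```c
#include <stddef.h>
#include <string.h>

/* Mnemonics of the vector loop body quoted above, in order.  */
static const char *const loop_body[] = {
  "ldr", "ldr",
  "rev64",
  "uxtl", "uxtl2",
  "sxtl", "sxtl", "sxtl2", "sxtl2",
  "scvtf", "scvtf", "scvtf", "scvtf",
  "fmla", "fmla", "fmla", "fmla",
};

static const size_t loop_body_len = sizeof (loop_body) / sizeof (loop_body[0]);

/* Hypothetical helper: count the entries that are not loads.  */
static unsigned
count_general_ops (const char *const *insns, size_t n)
{
  unsigned count = 0;
  for (size_t i = 0; i < n; i++)
    if (strcmp (insns[i], "ldr") != 0)
      count++;
  return count;
}
```

Calling count_general_ops (loop_body, loop_body_len) gives one rev64, two
uxtl/uxtl2, four sxtl/sxtl2, four scvtf and four fmla.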

On the reduction latency, the:

>      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
>	 that's not yet the case.  */

is referring to the single_defuse_cycle code in vectorizable_reduction.  That's
always seemed like a misfeature to me, since it serialises a multi-vector
reduction through a single accumulator.  I guess it's finally time to opt out
of that for aarch64.
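
To illustrate the serialisation in scalar terms, here is a rough C sketch
(not vectorizer output, and the function names are hypothetical).  The
first function funnels every addition through one accumulator, so each add
waits for the previous one.  The second keeps four independent accumulators
and combines them once after the loop.

```c
#include <stddef.h>

/* One accumulator: the loop-carried dependency chain is N additions
   long, because each addition needs the previous result.  */
static double
sum_single_accumulator (const double *x, size_t n)
{
  double acc = 0.0;
  for (size_t i = 0; i < n; i++)
    acc += x[i];
  return acc;
}

/* Four accumulators: the four chains are independent, so the
   loop-carried chain is roughly N / 4 additions long.  This
   reassociates the sum, so for general floating-point input the result
   can differ slightly from the single-accumulator version.  */
static double
sum_four_accumulators (const double *x, size_t n)
{
  double acc[4] = { 0.0, 0.0, 0.0, 0.0 };
  size_t i = 0;
  for (; i + 4 <= n; i += 4)
    {
      acc[0] += x[i];
      acc[1] += x[i + 1];
      acc[2] += x[i + 2];
      acc[3] += x[i + 3];
    }
  /* Leftover elements when N is not a multiple of four.  */
  for (; i < n; i++)
    acc[0] += x[i];
  return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```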

If we did opt out, then removing the “* count” should be correct for all cases.
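
In other words, something like the following sketch.  This is a
hypothetical helper for illustration, not the code in the aarch64 backend:
"base" stands for the latency of one reduction statement, "count" for the
number of vector statements, and the flag for whether the vectorizer chains
them through a single def-use cycle.

```c
/* Hypothetical helper, not GCC's code: latency of the loop-carried
   reduction chain for COUNT vector statements that each take BASE
   cycles.  */
static unsigned
reduction_chain_latency (unsigned base, unsigned count,
			 int single_def_use_cycle)
{
  /* With a single def-use cycle, each statement feeds the next through
     one accumulator, so the latencies add up.  With independent
     accumulators the statements run in parallel and only one
     statement's latency is on the loop-carried path.  */
  return single_def_use_cycle ? base * count : base;
}
```

For example, with base = 4 and count = 4 this returns 16 in the serialised
case and 4 once the reductions are independent.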