From mboxrd@z Thu Jan 1 00:00:00 1970
From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110625] [14 Regression][AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
Date: Fri, 29 Dec 2023 16:16:00 +0000
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

Tamar Christina changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
         Resolution|---         |FIXED
             Status|ASSIGNED    |RESOLVED

--- Comment #25 from Tamar Christina ---
(In reply to Hao Liu from comment #0)
> This problem causes a performance regression in SPEC2017 538.imagick.
> For the following simple case (modified from pr96208):
>
> typedef struct {
>   unsigned short m1, m2, m3, m4;
> } the_struct_t;
> typedef struct {
>   double m1, m2, m3, m4, m5;
> } the_struct2_t;
>
> double bar1 (the_struct2_t*);
>
> double foo (double* k, unsigned int n, the_struct_t* the_struct) {
>   unsigned int u;
>   the_struct2_t result;
>   for (u=0; u < n; u++, k--) {
>     result.m1 += (*k)*the_struct[u].m1;
>     result.m2 += (*k)*the_struct[u].m2;
>     result.m3 += (*k)*the_struct[u].m3;
>     result.m4 += (*k)*the_struct[u].m4;
>   }
>   return bar1 (&result);
> }

In the context of this report the regression should be fixed; however, we
still don't vectorize this loop. We ran this and other cases comparing the
scalar and vector versions of this loop, and specifically Neoverse N2 does
much better with the scalar version here. So the cost model is doing the
right thing for the current codegen of the function.

Note that the vector version:

        ldr     q31, [x3], 16
        ldr     q29, [x4], -16
        rev64   v31.8h, v31.8h
        uxtl    v30.4s, v31.4h
        uxtl2   v31.4s, v31.8h
        sxtl    v27.2d, v30.2s
        sxtl    v28.2d, v31.2s
        sxtl2   v30.2d, v30.4s
        sxtl2   v31.2d, v31.4s
        scvtf   v27.2d, v27.2d
        scvtf   v28.2d, v28.2d
        scvtf   v30.2d, v30.2d
        scvtf   v31.2d, v31.2d
        fmla    v26.2d, v27.2d, v29.d[1]
        fmla    v24.2d, v30.2d, v29.d[1]
        fmla    v23.2d, v28.2d, v29.d[0]
        fmla    v25.2d, v31.2d, v29.d[0]

is still pretty inefficient due to all the extends. If we generate better
code here, this may tip the scale back to vector. But for now, the patch
should fix the regression.