From: "hliu at amperecomputing dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110625] New: [AArch64] Vect: SLP fails to vectorize a
 loop as the reduction_latency calculated by new costs is too large
Date: Tue, 11 Jul 2023 09:15:52 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

            Bug ID: 110625
           Summary: [AArch64] Vect: SLP fails to vectorize a loop as the
                    reduction_latency calculated by new costs is too large
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This problem causes a
performance regression in SPEC2017 538.imagick. For the following simple
case (modified from pr96208):

  typedef struct {
      unsigned short m1, m2, m3, m4;
  } the_struct_t;
  typedef struct {
      double m1, m2, m3, m4, m5;
  } the_struct2_t;

  double bar1 (the_struct2_t*);

  double foo (double* k, unsigned int n, the_struct_t* the_struct)
  {
      unsigned int u;
      the_struct2_t result;
      for (u = 0; u < n; u++, k--) {
          result.m1 += (*k) * the_struct[u].m1;
          result.m2 += (*k) * the_struct[u].m2;
          result.m3 += (*k) * the_struct[u].m3;
          result.m4 += (*k) * the_struct[u].m4;
      }
      return bar1 (&result);
  }

Compile it with "-Ofast -S -mcpu=neoverse-n2 -fdump-tree-vect-details
-fno-tree-slp-vectorize". SLP fails to vectorize the loop because the vector
body cost is increased due to the too-large "reduction latency". See the
dump of the vect pass:

  Original vector body cost = 51
  Scalar issue estimate:
    ...
    reduction latency = 2
    estimated min cycles per iteration = 2.000000
    estimated cycles per vector iteration (for VF 2) = 4.000000
  Vector issue estimate:
    ...
    reduction latency = 8  <-- Too large
    estimated min cycles per iteration = 8.000000
  Increasing body cost to 102 because scalar code would issue more quickly
  Cost model analysis:
    Vector inside of loop cost: 102
    ...
    Scalar iteration cost: 44
    ...
  missed:  cost model: the vector iteration cost = 102 divided by the scalar
  iteration cost = 44 is greater or equal to the vectorization factor = 2.
  missed:  not vectorized: vectorization not profitable.

SLP succeeds with "-mcpu=neoverse-n1", as N1 does not use the new vector
costs, so the vector body cost is not increased.

The "reduction latency" is calculated in aarch64.cc count_ops():

  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
     that's not yet the case.  */
  ops->reduction_latency = MAX (ops->reduction_latency, base * count);

For this case, "base" is 2 and "count" is 4. To my understanding, the
"count" for SLP means the number of scalar stmts (i.e.
result.m1 +=, ...) in a permutation group that are merged into a single
vector stmt. It seems unreasonable to multiply the cost by "count" (perhaps
this code does not take the SLP situation into account).

So, I'm thinking of calculating it differently for the SLP situation, e.g.:

  unsigned int latency = PURE_SLP_STMT (stmt_info) ? base : base * count;
  ops->reduction_latency = MAX (ops->reduction_latency, latency);

Is this reasonable?