From: "hliu at amperecomputing dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110625] New: [AArch64] Vect: SLP fails to vectorize a
 loop as the reduction_latency calculated by new costs is too large
Date: Tue, 11 Jul 2023 09:15:52 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

            Bug ID: 110625
           Summary: [AArch64] Vect: SLP fails to vectorize a loop as the
                    reduction_latency calculated by new costs is too large
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This problem causes a
performance regression in SPEC2017 538.imagick. For the following simple
case (modified from pr96208):

  typedef struct {
      unsigned short m1, m2, m3, m4;
  } the_struct_t;
  typedef struct {
      double m1, m2, m3, m4, m5;
  } the_struct2_t;

  double bar1 (the_struct2_t*);

  double foo (double* k, unsigned int n, the_struct_t* the_struct)
  {
      unsigned int u;
      the_struct2_t result;
      for (u = 0; u < n; u++, k--) {
          result.m1 += (*k) * the_struct[u].m1;
          result.m2 += (*k) * the_struct[u].m2;
          result.m3 += (*k) * the_struct[u].m3;
          result.m4 += (*k) * the_struct[u].m4;
      }
      return bar1 (&result);
  }

Compile it with "-Ofast -S -mcpu=neoverse-n2 -fdump-tree-vect-details
-fno-tree-slp-vectorize". SLP fails to vectorize the loop because the vector
body cost is increased due to the too-large "reduction latency". See the
dump of the vect pass:

  Original vector body cost = 51
  Scalar issue estimate:
    ...
    reduction latency = 2
    estimated min cycles per iteration = 2.000000
    estimated cycles per vector iteration (for VF 2) = 4.000000
  Vector issue estimate:
    ...
    reduction latency = 8  <-- Too large
    estimated min cycles per iteration = 8.000000
  Increasing body cost to 102 because scalar code would issue more quickly
  Cost model analysis:
    Vector inside of loop cost: 102
    ...
    Scalar iteration cost: 44
    ...
  missed:  cost model: the vector iteration cost = 102 divided by the scalar
  iteration cost = 44 is greater or equal to the vectorization factor = 2.
  missed:  not vectorized: vectorization not profitable.

SLP succeeds with "-mcpu=neoverse-n1", as N1 does not use the new vector
costs, so the vector body cost is not increased.

The "reduction latency" is calculated in aarch64.cc count_ops():

  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
     that's not yet the case.  */
  ops->reduction_latency = MAX (ops->reduction_latency, base * count);

For this case, "base" is 2 and "count" is 4. To my understanding, the
"count" for SLP means the number of scalar stmts (i.e.
result.m1 +=, ...) in a permutation group that are merged into a single
vector stmt. It seems unreasonable to multiply the cost by "count" (perhaps
this code does not take the SLP situation into account).

So, I'm thinking of calculating it differently for the SLP situation, e.g.:

  unsigned int latency = PURE_SLP_STMT (stmt_info) ? base : base * count;
  ops->reduction_latency = MAX (ops->reduction_latency, latency);

Is this reasonable?