From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
Date: Fri, 09 Apr 2021 07:05:46 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Keywords|                              |missed-optimization
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1
   Last reconfirmed|                              |2021-04-09
             Status|UNCONFIRMED                   |ASSIGNED

--- Comment #2 from Richard Biener ---
Confirmed. While we manage to analyze the block for the "perfect" solution,
we fail because dependence testing does not handle one piece, and this throws
away half of the vectorization.
We do see that we retain the scalar loads and computations, but doing two
vector loads, a vector add and a vector store (cost 40) still seems cheaper
than doing four scalar stores (cost 48):

0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
0x1fddcb0 _15 1 times scalar_store costs 12 in body
0x1fddcb0 _18 1 times scalar_store costs 12 in body
0x1fddcb0 _21 1 times scalar_store costs 12 in body
0x1fddcb0 _24 1 times scalar_store costs 12 in body
t.C:28:1: note: Cost model analysis:
  Vector inside of basic block cost: 40
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.C:28:1: note: Basic block will be vectorized using SLP

Now, fortunately, GCC 11 improves on this [a bit] and we'll produce

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        paddd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        movd    %xmm0, %eax
        subl    (%rdx), %eax
        movl    %eax, (%rdi)
        pextrd  $1, %xmm0, %eax
        subl    4(%rdx), %eax
        movl    %eax, 4(%rdi)
        pextrd  $2, %xmm0, %eax
        subl    8(%rdx), %eax
        movl    %eax, 8(%rdi)
        pextrd  $3, %xmm0, %eax
        subl    12(%rdx), %eax
        movl    %eax, 12(%rdi)
        ret

which does not re-do the scalar loads/adds but instead uses the vector result.
Still, the same dependence issue is present:

t.C:16:11: missed: can't determine dependence between y1_3(D)->b and x_2(D)->a
t.C:16:11: note: removing SLP instance operations starting from: x_2(D)->a = _6;

The scalar code before vectorization looks like:

  [local count: 1073741824]:
  _13 = x_2(D)->a;
  _14 = y1_3(D)->a;
  _15 = _13 + _14;
  x_2(D)->a = _15;
  _16 = x_2(D)->b;
  _17 = y1_3(D)->b;    <---
  _18 = _16 + _17;
  x_2(D)->b = _18;
  _19 = x_2(D)->c;
  _20 = y1_3(D)->c;
  _21 = _19 + _20;
  x_2(D)->c = _21;
  _22 = x_2(D)->d;
  _23 = y1_3(D)->d;
  _24 = _22 + _23;
  x_2(D)->d = _24;
  _5 = y2_4(D)->a;
  _6 = _15 - _5;
  x_2(D)->a = _6;    <---
  _7 = y2_4(D)->b;
  _8 = _18 - _7;
  x_2(D)->b = _8;
  _9 = y2_4(D)->c;
  _10 = _21 - _9;
  x_2(D)->c = _10;
  _11 = y2_4(D)->d;
  _12 = _24 - _11;
  x_2(D)->d = _12;
  return;

Using

void test(A& __restrict x, A const& y1, A const& y2)
{
  x += y1;
  x -= y2;
}

produces optimal assembly even with GCC 10:

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdx), %xmm1
        movdqu  (%rdi), %xmm2
        psubd   %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        ret

Note that I think we should be able to handle the dependences even without
the __restrict annotation.