From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (qmail 130144 invoked by alias); 22 Oct 2015 10:28:08 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 130074 invoked by uid 48); 22 Oct 2015 10:28:04 -0000
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/65962] Missed vectorization of strided stores
Date: Thu, 22 Oct 2015 10:28:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 5.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_status assigned_to
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-10/txt/msg01835.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
While strided stores are now implemented, the case is still not handled
because single-element interleaving takes precedence (and single-element
interleaving isn't supported for stores as that always produces gaps).
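
For readers without the bug's testcase at hand, a loop of roughly the following shape would run into this. This is only a sketch and an assumption on my part, not the testcase attached to the PR: the function name is hypothetical, and the stride of two ints and the constant 7 are inferred from the 8-byte store offsets and the `_10 = _9 + 7` statement quoted later in this message. Only one element of each two-element group is accessed, so the access looks like single-element interleaving. For the load, whole vectors can be read and the unused lanes dropped; a whole-vector store would instead overwrite the elements in the gaps.

```c
#include <stddef.h>

/* Hypothetical reconstruction, NOT the PR's actual testcase.
   Every second int is loaded, incremented by 7 and stored back,
   so both the load and the store have a stride of two elements.  */
void add_seven_strided(int *data, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        data[2 * i] += 7;   /* odd-indexed elements are the gaps and must stay untouched */
}
```

The odd-indexed elements are what a vector store covering the whole group would have to leave alone, which is why the store side ends up as element-wise extracts plus scalar stores.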
I have a patch that produces

.L2:
        movdqu  16(%rax), %xmm1
        addq    $32, %rax
        movdqu  -32(%rax), %xmm0
        shufps  $136, %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        pshufd  $85, %xmm0, %xmm1
        movd    %xmm0, -32(%rax)
        movd    %xmm1, -24(%rax)
        movdqa  %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movd    %xmm1, -16(%rax)
        movd    %xmm0, -8(%rax)
        cmpq    %rdx, %rax
        jne     .L2

when you disable the cost model.  Otherwise it's deemed not profitable.
Using scatters for AVX could in theory make it profitable (not sure).

t.c:5:3: note: Cost model analysis:
  Vector inside of loop cost: 13
  Vector prologue cost: 1
  Vector epilogue cost: 12
  Scalar iteration cost: 3
  Scalar outside cost: 0
  Vector outside cost: 13
  prologue iterations: 0
  epilogue iterations: 4
t.c:5:3: note: cost model: the vector iteration cost = 13 divided by the
scalar iteration cost = 3 is greater or equal to the vectorization factor = 4.
t.c:5:3: note: not vectorized: vectorization not profitable.
t.c:5:3: note: not vectorized: vector version will never be profitable.

t.c:5:3: note: ==> examining statement: *_8 = _10;
t.c:5:3: note: vect_is_simple_use: operand _10
t.c:5:3: note: def_stmt: _10 = _9 + 7;
t.c:5:3: note: type of def: internal
t.c:5:3: note: vect_model_store_cost: inside_cost = 8, prologue_cost = 0 .

so the strided store has cost 8, that's 4 extracts plus 4 scalar stores.
With AVX we generate

        vmovd   %xmm0, -32(%rax)
        vpextrd $1, %xmm0, -24(%rax)
        vpextrd $2, %xmm0, -16(%rax)
        vpextrd $3, %xmm0, -8(%rax)

so it can combine extract and store, with SSE2 we get

        pshufd  $85, %xmm0, %xmm1
        movd    %xmm0, -32(%rax)
        movd    %xmm1, -24(%rax)
        movdqa  %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movd    %xmm1, -16(%rax)
        movd    %xmm0, -8(%rax)

which is even worse than expected ;)

As usual the cost model isn't target aware enough here (and it errs on
the conservative side here)
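
The arithmetic behind the "never be profitable" note can be spelled out in a few lines of C. This is a simplified model of the comparison quoted in the dump, not GCC's actual code: both function names are hypothetical, and the unit cost of 1 per extract and per scalar store is an assumption chosen so that 4 lanes give the reported store cost of 8.

```c
#include <stdbool.h>

/* Simplified model of the check quoted in the dump (hypothetical names,
   not GCC's implementation).  The dump's condition is
     vec_inside_cost / scalar_iter_cost >= vf
   which is rearranged here to avoid integer division.  */
bool vector_never_profitable(int vec_inside_cost, int scalar_iter_cost, int vf)
{
    return vec_inside_cost >= scalar_iter_cost * vf;
}

/* Element-wise strided store as costed above: one extract plus one
   scalar store per vector lane.  The per-operation costs are assumptions.  */
int strided_store_cost(int lanes, int extract_cost, int store_cost)
{
    return lanes * (extract_cost + store_cost);
}
```

With the numbers from the dump, `vector_never_profitable(13, 3, 4)` holds because 13 >= 3 * 4, and `strided_store_cost(4, 1, 1)` gives 8. As a purely hypothetical what-if: were a target to cost a fused extract-and-store (as `vpextrd` to memory does) as a single unit, the store would cost 4, the inside cost would drop to 13 - 8 + 4 = 9, and this particular check would no longer reject the loop.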