From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 40346 invoked by alias); 3 Mar 2015 09:21:11 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 40285 invoked by uid 55); 3 Mar 2015 09:21:07 -0000
From: "rguenther at suse dot de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug testsuite/63175] [4.9/5 regression] FAIL: gcc.dg/vect/costmodel/ppc/costmodel-bb-slp-9a.c scan-tree-dump-times slp2" basic block vectorized using SLP" 1
Date: Tue, 03 Mar 2015 09:21:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: testsuite
X-Bugzilla-Version: 4.9.1
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenther at suse dot de
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 4.9.3
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-03/txt/msg00259.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63175

--- Comment #25 from rguenther at suse dot de ---
On Tue, 3 Mar 2015, msebor at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63175
>
> --- Comment #24 from Martin Sebor ---
> (In reply to Richard Biener from comment #16)
> > Why is the loop bound to i != 16 / sizeof *s?
>
> The upper bound is intended to make the copied sequence fit into one vector
> register, irrespective of the size of the array element.
>
> The vector load and store instructions tolerate unaligned accesses and there
> are permute instructions that combine the contents of two vector registers into
> a single one to compensate for unaligned reads or writes.
> I'm not sure it makes sense to expect unaligned copies involving a single
> vector register's worth of data to be vectorized (as done in my proposed
> tests for char and short), but I would expect larger unaligned copies (i.e.,
> multiples of 16 bytes) to benefit from it. In my experiments I've seen no
> evidence of GCC attempting to vectorize such copies but I need to do some
> more research to understand why.
>
> (In reply to comment #23)
>
> The test uses -maltivec and that's what I've been using as well. But I
> see in the Power ISA book that lxvw4x and stxvw4x are classified as VSX
> instructions, so perhaps they shouldn't be emitted without -mvsx.
> Although 5.0 doesn't emit them even with -vsx.

5.0 doesn't consider stxvw4x without -mvsx. With -mvsx it does, but then
the vectorizer cost model says the vectorization is not profitable:

t.c:10:10: note: Cost model analysis:
  Vector inside of basic block cost: 29
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 8
t.c:10:10: note: not vectorized: vectorization is not profitable.

I'll see if that cost calculation is sensible. We have 2 aligned vector
loads (cost 2), one permute (cost 3), one vector stmt (cost 1), and one
unaligned store (unknown misalignment), which hits

rs6000_builtin_vectorization_cost (type_of_cost=unaligned_store, vectype=,
    misalign=-1)
    at /space/rguenther/src/svn/trunk2/gcc/config/rs6000/rs6000.c:4376
4376      switch (type_of_cost)
...
4455              case -1:
4456                /* Unknown misalignment. */
4457              case 4:
4458              case 12:
4459                /* Word aligned. */
4460                return 23;

cost of 23!(??). For a misalign of 4? Well - there you have it. For the
testcase

#define T int

extern const T a [];
T b[8];

void g (void)
{
    const T *p = a + 1;
    T *q = b + 1;

    *q++ = *p++;
    *q++ = *p++;
    *q++ = *p++;
    *q++ = *p++;
}

Possibly 4.8 had the cost model turned off for the testsuite or it
had bugs and misrepresented the case.
But clearly a cost of 23 looks excessive to me here (the scalar store of
one of the 4 elements has cost 1! so the unaligned vector store is nearly
6 times more expensive than doing the 4 unaligned stores. Nobody would
design an instruction with such a severe penalty).

With -fvect-cost-model=unlimited GCC 5 produces

.L.g:
        addis 9,2,.LC0@toc@ha           # gpr load fusion, type long
        ld 9,.LC0@toc@l(9)
        addis 8,2,.LANCHOR0@toc@ha
        addi 8,8,.LANCHOR0@toc@l
        addi 10,9,12
        neg 7,9
        rldicr 10,10,0,59
        rldicr 9,9,0,59
        lvsr 13,0,7
        lxvw4x 33,0,9
        lxvw4x 32,0,10
        li 9,4
        vperm 0,1,0,13
        stxvw4x 32,8,9
        blr

Ah, GCC 4.8 had the cost model disabled by default (at least for
basic-block vectorization), so you need to enable it via -fvect-cost-model
where it rejects vectorizing the above with the same reasoning.

So there is no regression and if vectorization is profitable then the
backend needs to adjust its cost model.
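
For reference, the arithmetic behind the dump can be re-done by hand. The
following is only a sketch under stated assumptions, not GCC code: the
per-statement costs are the ones quoted in this message, the scalar cost
of 1 per load or store is an assumption chosen to match the reported
scalar total of 8, and the helper names and the strict "cheaper than"
comparison are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Per-statement costs as quoted in the message.  Only the value 23 is
   taken from the rs6000.c excerpt; COST_SCALAR_STMT is an assumption
   that reproduces the reported scalar total of 8.  */
enum {
  COST_ALIGNED_VECTOR_LOAD = 1,  /* two loads were reported as "cost 2" */
  COST_PERMUTE = 3,
  COST_VECTOR_STMT = 1,
  COST_UNALIGNED_STORE = 23,
  COST_SCALAR_STMT = 1
};

/* Hypothetical helper: inside-of-block cost of the vectorized copy.  */
int vector_inside_cost (void)
{
  return 2 * COST_ALIGNED_VECTOR_LOAD + COST_PERMUTE
         + COST_VECTOR_STMT + COST_UNALIGNED_STORE;
}

/* Hypothetical helper: four scalar loads plus four scalar stores.  */
int scalar_cost (void)
{
  return (4 + 4) * COST_SCALAR_STMT;
}

/* Assumed decision rule: vectorize only if the vector version,
   including prologue/epilogue cost, is strictly cheaper.  */
bool profitable_p (int vec_inside, int vec_outside, int scalar)
{
  return vec_inside + vec_outside < scalar;
}

/* Print the totals so they can be compared with the dump.  */
void print_cost_summary (void)
{
  int vec = vector_inside_cost ();
  int sc = scalar_cost ();
  printf ("vector inside cost: %d\n", vec);
  printf ("scalar cost:        %d\n", sc);
  printf ("profitable:         %s\n",
          profitable_p (vec, 0, sc) ? "yes" : "no");
}
```

Calling print_cost_summary () from a main function gives the totals 29
and 8 from the dump. In this sketch, lowering COST_UNALIGNED_STORE to 1
would bring the vector total to 7, below the scalar total, which shows
how much the decision hinges on that single entry.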