From: "meissner at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/50031] New: Sphinx3 has a 10% regression going from GCC 4.5 to GCC 4.6 on powerpc
Date: Tue, 09 Aug 2011 17:34:00 -0000

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031

           Summary: Sphinx3 has a 10% regression going from GCC 4.5 to
                    GCC 4.6 on powerpc
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: meissner@gcc.gnu.org
              Host: powerpc64-linux
            Target: powerpc64-linux
             Build: powerpc64-linux

Created attachment 24964
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24964
Cut-down example of the function with the regression.

The sphinx3 benchmark from SPEC 2006 exhibits a 10% regression when comparing a
GCC 4.5 build targeting power7 with vectorization to the same build done with
GCC 4.6. The GCC 4.7 trunk shows the same slowdown. In various profiling runs,
I have traced the slowdown to the vector_gautbl_eval_logs3 function in
vector.c. The main part of the function is the inner loop:

{
    int32 i, r;
    float64 f;
    int32 end, veclen;
    float32 *m1, *m2, *v1, *v2;
    float64 dval1, dval2, diff1, diff2;

    f = log_to_logs3_factor();
    end = offset + count;
    veclen = gautbl->veclen;

    for (r = offset; r < end-1; r += 2) {
        m1 = gautbl->mean[r];
        m2 = gautbl->mean[r+1];
        v1 = gautbl->var[r];
        v2 = gautbl->var[r+1];
        dval1 = gautbl->lrd[r];
        dval2 = gautbl->lrd[r+1];

        /* start of the critical loop */
        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
            diff2 = x[i] - m2[i];
            dval2 -= diff2 * diff2 * v2[i];
        }
        /* end of the critical loop */

        if (dval1 < gautbl->distfloor)
            dval1 = gautbl->distfloor;
        if (dval2 < gautbl->distfloor)
            dval2 = gautbl->distfloor;

        score[r] = (int32)(f * dval1);
        score[r+1] = (int32)(f * dval2);
    }

    if (r < end) {
        m1 = gautbl->mean[r];
        v1 = gautbl->var[r];
        dval1 = gautbl->lrd[r];

        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
        }

        if (dval1 < gautbl->distfloor)
            dval1 = gautbl->distfloor;

        score[r] = (int32)(f * dval1);
    }
}

Note that the critical loop reads single-precision floating point values, does
a single-precision subtraction, and then converts the result to double
precision for the multiply-accumulate. The code tries to 'help' the compiler
by manually unrolling the outer loop by a factor of two.
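For experimenting outside of sphinx3, here is a minimal, self-contained C
sketch of just the critical loop. The typedefs are assumptions standing in for
sphinx3's own headers, and diff_sq_norm is a made-up name, not a function from
vector.c:

#include <stdint.h>

typedef int32_t int32;    /* assumed equivalents of sphinx3's typedefs */
typedef float   float32;
typedef double  float64;

float64
diff_sq_norm (const float32 *x, const float32 *m1, const float32 *v1,
              float64 dval1, int32 veclen)
{
    int32 i;
    float64 diff1;

    for (i = 0; i < veclen; i++) {
        /* single-precision read and subtract; the result is then
           widened to double precision for the multiply-accumulate */
        diff1 = x[i] - m1[i];
        dval1 -= diff1 * diff1 * v1[i];
    }
    return dval1;
}

Compiling this with -O3 -ffast-math -mcpu=power7 under GCC 4.5 and GCC 4.6
should show the difference in code generation discussed below.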
The code produced by GCC 4.5 with -O3 -ffast-math -mcpu=power7 is a fairly
straightforward scalar evaluation of the inner loop:

.L4:
        lfsx 0,28,9
        lfsx 10,10,9
        lfsx 13,11,9
        lfsx 9,8,9
        fsubs 13,0,13
        fsubs 0,0,10
        lfsx 10,6,9
        addi 9,9,4
        cmpd 7,9,0
        xsmuldp 13,13,13
        xsmuldp 0,0,0
        xsnmsubadp 11,13,9
        xsnmsubadp 12,0,10
        bne 7,.L4

With GCC 4.6, the compiler decides it can vectorize the loop:

.L5:
        add 25,24,27
        add 26,23,27
        rldicr 20,27,0,59
        rldicr 25,25,0,59
        rldicr 26,26,0,59
        lxvw4x 44,0,25
        lxvw4x 45,0,20
        add 25,22,27
        lxvw4x 42,0,26
        add 26,21,27
        rldicr 25,25,0,59
        rldicr 26,26,0,59
        addi 27,27,16
        vperm 11,11,13,2
        vperm 9,9,12,19
        vperm 7,7,10,17
        xvsubsp 9,43,41
        lxvw4x 41,0,26
        xvsubsp 10,43,39
        lxvw4x 43,0,25
        vperm 0,0,9,16
        vperm 1,1,11,18
        xxmrglw 36,32,32
        xxmrglw 37,9,9
        xxmrglw 38,10,10
        xxmrghw 9,9,9
        xvcvspdp 37,37
        xxmrglw 35,33,33
        xvcvspdp 38,38
        xxmrghw 10,10,10
        xvcvspdp 39,9
        xxmrghw 32,32,32
        xvcvspdp 35,35
        xxmrghw 33,33,33
        xvcvspdp 40,10
        xvcvspdp 36,36
        xvmuldp 37,37,37
        xvmuldp 38,38,38
        xvmuldp 9,39,39
        xvcvspdp 33,33
        xvcvspdp 0,32
        xvmuldp 10,40,40
        xvmuldp 6,37,35
        xvmuldp 7,38,36
        xxlor 32,41,41
        xxlor 39,42,42
        xxlor 41,44,44
        xvmaddadp 6,9,33
        xvmaddadp 7,10,0
        xxlor 33,43,43
        xxlor 43,45,45
        xvsubdp 4,4,6
        xvsubdp 5,5,7
        bdnz .L5

This loop demonstrates several problems; some are specific to the powerpc
backend, and some are in the tree optimizers (and would need hooks from the
backend):

1) When tree-vect-data-refs.c doesn't know the alignment of memory in a loop
that is vectorized, and the machine has a vec_realign_load_<mode> pattern, the
loop that is generated always uses the unaligned load, even though it might be
slow. On the power7, the realign code uses a vector load and the lvsr
instruction to create a permute mask, and then in the inner loop, after each
load, uses the permute mask to complete the unaligned load. Thus, in the loop,
before doing the conversions, we do 4 vector loads and 4 permutes. Each vector
conversion from 32-bit to 64-bit involves two more permutes to split the V4SF
values into the appropriate registers before doing the float->double
conversion. Thus in the loop we have 4 permutes for the 4 loads and 8 permutes
for the conversions. The power7 has only one permute functional unit, and
multiple permutes can slow things down. The code has one segment with 3
back-to-back permutes and another with 6 back-to-back permutes. It would help
if the vectorizer could clone the loop, testing at run time whether the
pointers are aligned, and doing aligned loads on one side and unaligned loads
on the other (see the sketch after this list). I experimented with an option
to disable the vec_realign_load_<mode> pattern, and it helped this particular
benchmark but hurt other benchmarks, because the generated code would run the
vectorized loop only if the pointers were aligned, falling back to the scalar
loop if they were unaligned. I would think falling back to using
vec_realign_load_<mode> would be a win.

2) In looking at the documentation, I discovered that vec_realign_load_<mode>
is not documented in md.texi.

3) The powerpc backend doesn't realize it could use the Altivec load
instruction here, since the Altivec load implicitly ignores the bottom bits of
the address; note the rldicr ...,0,59 instructions in the loop above that
clear those bits by hand.

4) The code in tree-vect-stmts.c, tree-vect-slp.c, and tree-vect-loop.c that
calls the vectorization cost target hook never passes the actual type in the
vectype argument, nor sets the misalign argument to non-zero. I would imagine
that vector systems might have different costs depending on the type. Maybe
the two arguments should be eliminated if we aren't going to pass useful
information.
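To make the loop versioning suggested in item 1 concrete, here is a
hypothetical C sketch (hand-written, not compiler output; the function name is
made up). Only the shape of the transformation matters: the vectorizer would
clone the loop internally, using aligned vector loads in one arm and the
realign sequence in the other, instead of committing unconditionally to
realign loads:

#include <stdint.h>

static double
diff_sq_accum (const float *x, const float *m, const float *v,
               double dval, int veclen)
{
    int i;
    double diff;

    if ((((uintptr_t) x | (uintptr_t) m | (uintptr_t) v) & 15) == 0) {
        /* All pointers are 16-byte aligned: in this version the
           vectorizer could emit plain aligned vector loads with no
           per-iteration permutes.  (At the source level, GCC 4.7's
           __builtin_assume_aligned would be the way to convey the
           guarantee that the run-time test just established.)  */
        for (i = 0; i < veclen; i++) {
            diff = x[i] - m[i];
            dval -= diff * diff * v[i];
        }
    } else {
        /* Unaligned: only this version needs the vec_realign_load_<mode>
           sequence (lvsr permute mask plus a vperm after each load).  */
        for (i = 0; i < veclen; i++) {
            diff = x[i] - m[i];
            dval -= diff * diff * v[i];
        }
    }
    return dval;
}

This way the unaligned case still runs vectorized realign code rather than
falling all the way back to the scalar loop.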
In addition to the vectype/misalign issue, there doesn't seem to be any cost
for doing the realignment itself: there is a cost for unaligned loads (via
movmisalign), but none for the vec_realign_load_<mode> sequence.

I have patches that fix the immediate problem by disabling the float/int to
double vector conversion under switch control, which I will submit shortly. In
general they restore the performance of sphinx3 on both GCC 4.6 and 4.7 and
improve the performance of a few other benchmarks, though on 4.6 there is one
regression (tonto is a few percent slower on 32-bit). However, I suspect the
problem really needs to be attacked at a higher level.
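To illustrate point 4, here is a hedged sketch of how a target's cost hook
could use the vectype and misalign arguments if the callers actually passed
them. This is hypothetical and simplified, not the actual rs6000
implementation; it assumes the signature of GCC 4.6's
targetm.vectorize.builtin_vectorization_cost hook, and the cost numbers are
made up:

/* Would live in a backend file such as config/rs6000/rs6000.c, with
   GCC's usual includes providing 'tree' and enum vect_cost_for_stmt.  */

static int
example_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                    tree vectype, int misalign)
{
  switch (type_of_cost)
    {
    case unaligned_load:
      /* If misalign were ever set, the backend could charge extra for a
         load that must go through the lvsr/vperm realign sequence,
         which adds a permute per load on power7.  */
      if (misalign != 0)
        return 3;        /* made-up: load plus realignment overhead */
      return 2;

    case vector_load:
      /* vectype could distinguish, e.g., V4SF from V2DF if the costs
         differ; today it is never passed, so it goes unused.  */
      return 1;

    default:
      return 1;
    }
}

A separate cost entry (or a non-zero misalign) for the realign sequence would
also address the missing vec_realign cost noted above.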