From: "meissner at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/50101] New: GCC 4.5 and 4.6 generate suboptimal code on ppc for countdown loops when the CTR register cannot be used
Date: Tue, 16 Aug 2011 20:57:00 -0000

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50101

             Bug #: 50101
           Summary: GCC 4.5 and 4.6 generate suboptimal code on ppc for
                    countdown loops when the CTR register cannot be used
    Classification: Unclassified
           Product: gcc
           Version: 4.6.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: meissner@gcc.gnu.org
              Host: powerpc64-linux
            Target: powerpc64-linux
             Build: powerpc64-linux

When GCC switched over to the IRA register allocator in GCC 4.5, it made
some loops run slower on the PowerPC.
In particular, the PowerPC has a count down register (CTR) that the
compiler can use with the -fbranch-count-reg optimization. However, if
the CTR register is not available in the loop, the compiler does not use
a GPR for the loop counter, but instead loads the counter from the
stack, decrements it, and stores it back to the stack on every
iteration.

For example, in the code:

int code[65536];

mike()
{
  int j;
  long addr;

  for (j = 0; j < 65536; j+=4)
    {
      asm("mtctr %1" : "=c" (addr) : "r" (&code[j]));
      asm("bctrl" : : "c" (addr) : "lr" );
    }
}

GCC 4.3 (the SLES 11 SP1 host compiler) generates the following:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        li 11,16384
        std 0,16(1)
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        addic. 11,11,-1
        addi 9,9,16
        bne 0,.L2
        ld 0,16(1)
        mtlr 0
        blr

If I go to a 4.4 based compiler such as the RHEL6 host compiler I get:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        std 0,16(1)
        li 0,16384
        std 0,-16(1)
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        ld 0,-16(1)
        addi 9,9,16
        addic. 11,0,-1
        std 11,-16(1)
        bne 0,.L2
        ld 0,16(1)
        mtlr 0
        blr

Notice that it loads and stores the loop counter at -16(1) inside the
loop.

If I use -fno-branch-count-reg, it generates code that uses the GPRs:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        std 0,16(1)
        addis 0,9,0x4
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        addi 9,9,16
        cmpd 7,9,0
        bne 7,.L2
        ld 0,16(1)
        mtlr 0
        blr

This is fixed in the GCC 4.7 development sources. The development
source revision that fixed this was subversion id 171649, created on
March 28th, 2011 by Vladimir Makarov, in his large rewrite of the IRA
register allocator.

As an experiment, I built the SPEC 2006 benchmark suite with
-fno-branch-count-reg.
As expected, a number of benchmarks regress when the count register
optimization is disabled, but a few benchmarks get a large speedup from
disabling it, which probably indicates they are being mis-optimized.
The benchmarks with the speedup include: 464.h264ref (19.65%
improvement), 434.zeusmp (17.92% improvement) and 459.GemsFDTD (13.02%
improvement).
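
For readers who do not read PowerPC assembly, the two register-only loop
shapes in the listings above can be pictured roughly in portable C. This
is only an illustrative sketch, not compiler output: visit() and the two
function names are hypothetical stand-ins (visit() replaces the inline
asm that goes through CTR), and the byte offsets in the comments assume
a 4-byte int.

```c
#define N    65536
#define STEP 4

int code[N];

/* Hypothetical stand-in for the mtctr/bctrl inline asm in the test case. */
long visits;
void visit(int *p) { (void)p; visits++; }

/* Shape 1: a separate down-counter kept in a GPR, as in the 4.3 listing
   (li 11,16384 ... addic. 11,11,-1 ; bne 0,.L2). */
long loop_with_counter(void)
{
  int *p = code;
  long n = N / STEP;        /* 16384 iterations */

  visits = 0;
  do
    {
      visit(p);
      p += STEP;            /* addi 9,9,16: 4 ints, assuming 4-byte int */
    }
  while (--n != 0);
  return visits;
}

/* Shape 2: no counter at all; the pointer is compared against an end
   address, as in the -fno-branch-count-reg listing
   (addis 0,9,0x4 ... cmpd 7,9,0 ; bne 7,.L2). */
long loop_with_end_pointer(void)
{
  int *p = code;
  int *end = code + N;      /* addis 0,9,0x4: start + 0x40000 bytes */

  visits = 0;
  do
    {
      visit(p);
      p += STEP;
    }
  while (p != end);
  return visits;
}
```

Both functions walk the same 16384 elements. The slow 4.4 listing
corresponds to the first shape with n living in a stack slot instead of
a register, which is what adds a load and a store to every iteration.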