From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-365673-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 27497 invoked by alias); 16 Aug 2011 20:52:31 -0000
Received: (qmail 27482 invoked by uid 22791); 16 Aug 2011 20:52:29 -0000
X-SWARE-Spam-Status: No, hits=-2.9 required=5.0	tests=ALL_TRUSTED,AWL,BAYES_00
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO gcc.gnu.org) (127.0.0.1)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Tue, 16 Aug 2011 20:52:16 +0000
From: "meissner at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/50101] New: GCC 4.5 and 4.6 generate suboptimal code on ppc for countdown loops when the CTR register cannot be used
Date: Tue, 16 Aug 2011 20:57:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: meissner at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
Message-ID: <bug-50101-4@http.gcc.gnu.org/bugzilla/>
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2011-08/txt/msg01447.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50101

             Bug #: 50101
           Summary: GCC 4.5 and 4.6 generate suboptimal code on ppc for
                    countdown loops when the CTR register cannot be used
    Classification: Unclassified
           Product: gcc
           Version: 4.6.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: meissner@gcc.gnu.org
              Host: powerpc64-linux
            Target: powerpc64-linux
             Build: powerpc64-linux


When GCC switched over to the IRA register allocator in GCC 4.5, it made some
loops run slower on PowerPC.  In particular, PowerPC has a count register
(CTR) that the compiler can use for countdown loops with the
-fbranch-count-reg optimization.  However, if the CTR register is not
available in the loop, the compiler does not keep the loop counter in a GPR;
instead it loads the counter from the stack, decrements it, and stores it
back on every iteration.

For example, in the code:

int code[65536];

void mike(void)
{
  int j;
  long addr;

  for (j = 0; j < 65536; j+=4) {
    asm("mtctr %1" : "=c" (addr) : "r" (&code[j]));
    asm("bctrl" : : "c" (addr) : "lr" );
  }
}

GCC 4.3 (the SLES 11 SP1 host compiler) generates the following:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        li 11,16384
        std 0,16(1)
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        addic. 11,11,-1
        addi 9,9,16
        bne 0,.L2
        ld 0,16(1)
        mtlr 0
        blr

With a 4.4-based compiler, such as the RHEL 6 host compiler, I get:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        std 0,16(1)
        li 0,16384
        std 0,-16(1)
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        ld 0,-16(1)
        addi 9,9,16
        addic. 11,0,-1
        std 11,-16(1)
        bne 0,.L2
        ld 0,16(1)
        mtlr 0
        blr
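
To make the difference between the 4.3 and 4.4 listings easier to see, here is
a rough C-level model of the two loops.  It is only a sketch under stated
assumptions: the inline asm is replaced by a hypothetical visit() stub so the
code runs on any host, a volatile variable stands in for the -16(1) stack
slot, and the function names are invented for illustration.

```c
int code[65536];

/* Hypothetical stand-in for the mtctr/bctrl pair; it only counts calls. */
static long visits;
static void visit(int *p) { (void)p; visits++; }

/* Models the 4.3 loop: the trip count stays in a register (r11). */
long countdown_in_register(void)
{
  int *p = code;
  long count = 16384;            /* li 11,16384 */

  visits = 0;
  do {
    visit(p);
    p += 4;                      /* addi 9,9,16 */
  } while (--count != 0);        /* addic. 11,11,-1 ; bne 0,.L2 */
  return visits;
}

/* Models the 4.4 loop: the trip count is reloaded from memory and stored
   back on every iteration.  'volatile' is used here only to force that
   memory traffic in the model. */
long countdown_in_memory(void)
{
  int *p = code;
  volatile long slot = 16384;    /* li 0,16384 ; std 0,-16(1) */
  long count;

  visits = 0;
  do {
    visit(p);
    count = slot;                /* ld 0,-16(1) */
    p += 4;                      /* addi 9,9,16 */
    count -= 1;                  /* addic. 11,0,-1 */
    slot = count;                /* std 11,-16(1) */
  } while (count != 0);          /* bne 0,.L2 */
  return visits;
}
```

Both functions run the loop body the same number of times; the second one
simply adds a load and a store to each trip around the loop.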

Notice that the 4.4 code loads and stores the loop counter on every
iteration.  If I use -fno-branch-count-reg, it generates code that keeps the
loop entirely in GPRs:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        std 0,16(1)
        addis 0,9,0x4
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        addi 9,9,16
        cmpd 7,9,0
        bne 7,.L2
        ld 0,16(1)
        mtlr 0
        blr
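
In C terms, the -fno-branch-count-reg listing drops the separate trip counter
and compares the advancing pointer against a precomputed end address.  The
sketch below models that shape; visit() is a hypothetical stub standing in for
the inline asm, the function name is invented, and the end-address comment
assumes a 4-byte int.

```c
int code[65536];

/* Hypothetical stand-in for the mtctr/bctrl pair; it only counts calls. */
static long visits;
static void visit(int *p) { (void)p; visits++; }

/* Models the -fno-branch-count-reg loop: no trip counter, just a pointer
   compared against the end of the array. */
long compare_against_end(void)
{
  int *p = code;
  int *end = code + 65536;       /* addis 0,9,0x4: code + 0x40000 bytes */

  visits = 0;
  do {
    visit(p);
    p += 4;                      /* addi 9,9,16 */
  } while (p != end);            /* cmpd 7,9,0 ; bne 7,.L2 */
  return visits;
}
```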

This is fixed in the GCC 4.7 development sources.  The fix went in with
Subversion revision 171649, committed on March 28th, 2011 by Vladimir Makarov
<vmakarov@redhat.com>, as part of his large rewrite of the IRA register
allocator.

As an experiment, I built the SPEC 2006 benchmark suite with
-fno-branch-count-reg.  As expected, a number of benchmarks regress when the
count register optimization is disabled, but a few benchmarks get a large
speedup, which probably indicates they are being mis-optimized.  The
benchmarks with the speedup include: 464.h264ref (19.65% improvement),
434.zeusmp (17.92% improvement) and 459.GemsFDTD (13.02% improvement).