From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-231900-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 16266 invoked by alias); 9 Oct 2007 16:53:46 -0000
Received: (qmail 15866 invoked by uid 48); 9 Oct 2007 16:53:35 -0000
Date: Tue, 09 Oct 2007 16:53:00 -0000
Subject: [Bug rtl-optimization/33717]  New: slow code generated for 64-bit arithmetic
X-Bugzilla-Reason: CC
Message-ID: <bug-33717-3511@http.gcc.gnu.org/bugzilla/>
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "felix-gcc at fefe dot de" <gcc-bugzilla@gcc.gnu.org>
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2007-10/txt/msg00815.txt.bz2

gcc generates very poor code for some bignum code I wrote.

I put the sample code at http://dl.fefe.de/bignum-add.c for you to look at.

The crucial loop is this (x, y and z are arrays of unsigned int; l is the
long long accumulator):

  for (i=0; i<100; ++i) {
    l += (unsigned long long)x[i] + y[i];
    z[i]=l;
    l>>=32;
  }
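
For readers without the sample file at hand, here is a minimal
self-contained sketch of the same loop wrapped in a function.  The function
name bignum_add, the parameter n, the zero initial value of l and the
returned final carry are assumptions made for illustration and are not taken
from bignum-add.c.  The sketch also assumes that unsigned int is 32 bits
wide, as it is on the targets discussed here.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical wrapper around the loop from the report.
 * Adds the n-limb numbers x and y into z, least significant limb first,
 * and returns the carry out of the top limb (0 or 1). */
static unsigned int bignum_add(unsigned int *z, const unsigned int *x,
                               const unsigned int *y, int n)
{
    unsigned long long l = 0;  /* 64-bit accumulator, assumed to start at 0 */
    int i;

    /* The shift by 32 below is only a carry extraction for 32-bit limbs. */
    assert(sizeof(unsigned int) * CHAR_BIT == 32);

    for (i = 0; i < n; ++i) {
        l += (unsigned long long)x[i] + y[i];  /* limb sum plus incoming carry */
        z[i] = (unsigned int)l;                /* low 32 bits are the result limb */
        l >>= 32;                              /* what remains is the next carry */
    }
    return (unsigned int)l;
}
```

After each iteration l is either 0 or 1, so only the low word of the
accumulator carries information from one iteration to the next.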

gcc code (-O3 -march=athlon64):

        movl    -820(%ebp,%esi,4), %eax
        movl    -420(%ebp,%esi,4), %ecx
        xorl    %edx, %edx
        xorl    %ebx, %ebx
        addl    %ecx, %eax
        adcl    %ebx, %edx
        addl    -1224(%ebp), %eax
        adcl    -1220(%ebp), %edx
        movl    %eax, -4(%edi,%esi,4)
        incl    %esi
        movl    %edx, %eax
        xorl    %edx, %edx
        cmpl    $101, %esi
        movl    %eax, -1224(%ebp)
        movl    %edx, -1220(%ebp)
        jne     .L4

As you can see, gcc keeps the long long accumulator in memory (the stack
slots at -1224(%ebp) and -1220(%ebp)), loading and storing it on every
iteration.  icc keeps it in registers instead:

        movl      4(%esp,%edx,4), %eax                          #25.30
        xorl      %ebx, %ebx                                    #25.5
        addl      404(%esp,%edx,4), %eax                        #25.5
        adcl      $0, %ebx                                      #25.5
        addl      %esi, %eax                                    #25.37
        movl      %ebx, %esi                                    #25.37
        adcl      $0, %esi                                      #25.37
        movl      %eax, 804(%esp,%edx,4)                        #26.5
        addl      $1, %edx                                      #24.22
        cmpl      $100, %edx                                    #24.15
        jb        ..B1.4        # Prob 99%                      #24.15

The difference is staggering: 2000 cycles for gcc, 1000 for icc.

This only happens on 32-bit x86, btw.  On amd64 there are enough registers, so
gcc and icc are closer (840 vs 924; icc still generates better code here).

Still: both compilers could generate even better code.  I put some inline asm
in the file for comparison, which could be improved further by loop unrolling.
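
For comparison in plain C (this is not the inline asm from the file, which is
not reproduced here), the carry can also be tracked in a single
unsigned int, with no 64-bit accumulator at all.  The sketch below detects
wraparound by comparison.  The function name bignum_add_carry and the
parameter n are assumptions for illustration, and whether a given compiler
turns this into an add/adc chain is not something I have checked.

```c
/* Hypothetical alternative: carry propagation without 64-bit arithmetic.
 * Adds the n-limb numbers x and y into z, least significant limb first,
 * and returns the carry out of the top limb (0 or 1). */
static unsigned int bignum_add_carry(unsigned int *z, const unsigned int *x,
                                     const unsigned int *y, int n)
{
    unsigned int carry = 0;
    int i;

    for (i = 0; i < n; ++i) {
        unsigned int s  = x[i] + y[i];  /* unsigned add wraps on overflow */
        unsigned int c1 = s < x[i];     /* 1 if the limb add wrapped */
        unsigned int t  = s + carry;    /* add the incoming carry */
        unsigned int c2 = t < s;        /* 1 if adding the carry wrapped */

        z[i] = t;
        carry = c1 | c2;                /* both cannot be set at once */
    }
    return carry;
}
```

Because unsigned overflow wraps by definition in C, the two comparisons
recover exactly the carry bit that adc would consume.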


-- 
           Summary: slow code generated for 64-bit arithmetic
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: felix-gcc at fefe dot de
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33717