public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/33717]  New: slow code generated for 64-bit arithmetic
@ 2007-10-09 16:53 felix-gcc at fefe dot de
  2008-12-31 18:35 ` [Bug rtl-optimization/33717] " pinskia at gcc dot gnu dot org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: felix-gcc at fefe dot de @ 2007-10-09 16:53 UTC (permalink / raw)
  To: gcc-bugs

gcc generates very poor code on some bignum code I wrote.

I put the sample code at http://dl.fefe.de/bignum-add.c for you to look at.

The crucial loop is this (x, y and z are arrays of unsigned int).

  for (i=0; i<100; ++i) {
    l += (unsigned long long)x[i] + y[i];
    z[i]=l;
    l>>=32;
  }
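
For readers without access to the linked file, here is a minimal self-contained sketch of the same loop. The function name bignum_add, the length parameter n, the fixed-width types and the returned carry are assumptions made for illustration; the report only shows the loop body with a fixed count of 100.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around the loop from the report.
 * Adds the n-word numbers x and y (least significant word first)
 * into z and returns the carry out of the top word. */
uint32_t bignum_add(uint32_t *z, const uint32_t *x, const uint32_t *y, size_t n)
{
    uint64_t l = 0;                  /* 64-bit accumulator, as in the report */
    for (size_t i = 0; i < n; ++i) {
        l += (uint64_t)x[i] + y[i];  /* 32-bit word sum plus incoming carry */
        z[i] = (uint32_t)l;          /* low 32 bits are the result word */
        l >>= 32;                    /* high bits become the next carry (0 or 1) */
    }
    return (uint32_t)l;
}
```

Calling it with n = 100 corresponds to the loop shown above. The accumulator never holds more than 33 significant bits, which is why a single carry flag is enough in the assembly listings.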

gcc code (-O3 -march=athlon64):

        movl    -820(%ebp,%esi,4), %eax
        movl    -420(%ebp,%esi,4), %ecx
        xorl    %edx, %edx
        xorl    %ebx, %ebx
        addl    %ecx, %eax
        adcl    %ebx, %edx
        addl    -1224(%ebp), %eax
        adcl    -1220(%ebp), %edx
        movl    %eax, -4(%edi,%esi,4)
        incl    %esi
        movl    %edx, %eax
        xorl    %edx, %edx
        cmpl    $101, %esi
        movl    %eax, -1224(%ebp)
        movl    %edx, -1220(%ebp)
        jne     .L4

As you can see, gcc keeps the long long accumulator in memory.  icc keeps it
in registers instead:

        movl      4(%esp,%edx,4), %eax                          #25.30
        xorl      %ebx, %ebx                                    #25.5
        addl      404(%esp,%edx,4), %eax                        #25.5
        adcl      $0, %ebx                                      #25.5
        addl      %esi, %eax                                    #25.37
        movl      %ebx, %esi                                    #25.37
        adcl      $0, %esi                                      #25.37
        movl      %eax, 804(%esp,%edx,4)                        #26.5
        addl      $1, %edx                                      #24.22
        cmpl      $100, %edx                                    #24.15
        jb        ..B1.4        # Prob 99%                      #24.15

The difference is staggering: 2000 cycles for gcc, 1000 for icc.

This only happens on x86, btw.  On amd64 there are enough registers, so gcc and
icc are closer (840 vs 924, icc still generates better code here).

Still: both compilers could generate even better code.  I put some inline asm
in the file for comparison, which could be improved further by loop unrolling.


-- 
           Summary: slow code generated for 64-bit arithmetic
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: felix-gcc at fefe dot de
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33717



* [Bug rtl-optimization/33717] slow code generated for 64-bit arithmetic
  2007-10-09 16:53 [Bug rtl-optimization/33717] New: slow code generated for 64-bit arithmetic felix-gcc at fefe dot de
@ 2008-12-31 18:35 ` pinskia at gcc dot gnu dot org
  2008-12-31 18:38 ` [Bug target/33717] " pinskia at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2008-12-31 18:35 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from pinskia at gcc dot gnu dot org  2008-12-31 18:33 -------
The inner loop on the trunk looks like:
.L15:
        movl    848(%esp,%eax,4), %edx
.L4:
        movl    448(%esp,%eax,4), %ecx
        xorl    %ebx, %ebx
        addl    %ecx, %esi
        adcl    %ebx, %edi
        xorl    %ecx, %ecx
        addl    %edx, %esi
        adcl    %ecx, %edi
        movl    %esi, 48(%esp,%eax,4)
        incl    %eax
        movl    %edi, %esi
        xorl    %edi, %edi
        cmpl    $100, %eax
        jne     .L15

This is a lot better.  The improvement comes from the new register allocator
on the trunk; switching back to the old allocator on the trunk gives what 4.3 gave.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization, ra



* [Bug target/33717] slow code generated for 64-bit arithmetic
  2007-10-09 16:53 [Bug rtl-optimization/33717] New: slow code generated for 64-bit arithmetic felix-gcc at fefe dot de
  2008-12-31 18:35 ` [Bug rtl-optimization/33717] " pinskia at gcc dot gnu dot org
@ 2008-12-31 18:38 ` pinskia at gcc dot gnu dot org
  2008-12-31 18:41 ` pinskia at gcc dot gnu dot org
  2009-01-01 17:37 ` ubizjak at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2008-12-31 18:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from pinskia at gcc dot gnu dot org  2008-12-31 18:37 -------
4.4 with the new register allocator (which is turned on by default):
C: 522 cycles
asm: 342 cycles

4.4 with the old one:
C: 749 cycles
asm: 344 cycles


So 4.4 is much better.  It still has extra instructions, but that is now a
target issue.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
          Component|rtl-optimization            |target
     Ever Confirmed|0                           |1
           Keywords|ra                          |
   Last reconfirmed|0000-00-00 00:00:00         |2008-12-31 18:37:00
               date|                            |



* [Bug target/33717] slow code generated for 64-bit arithmetic
  2007-10-09 16:53 [Bug rtl-optimization/33717] New: slow code generated for 64-bit arithmetic felix-gcc at fefe dot de
  2008-12-31 18:35 ` [Bug rtl-optimization/33717] " pinskia at gcc dot gnu dot org
  2008-12-31 18:38 ` [Bug target/33717] " pinskia at gcc dot gnu dot org
@ 2008-12-31 18:41 ` pinskia at gcc dot gnu dot org
  2009-01-01 17:37 ` ubizjak at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2008-12-31 18:41 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from pinskia at gcc dot gnu dot org  2008-12-31 18:39 -------
GCC does not produce "adcl      $0", which is where the extra xors come from.


Most likely addsi3_carry should accept 0 as one of the operands.
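
To make the point concrete, here is a small C model of the two instruction sequences. It is only a sketch: the helper names adc32, high_word_via_zero_register and high_word_via_immediate are made up for illustration, and the carry flag is modelled as an explicit parameter rather than real CPU state.

```c
#include <stdint.h>

/* Model of x86 "adcl src, dst": dst = dst + src + CF.
 * Returns the new carry flag (0 or 1). */
static uint32_t adc32(uint32_t *dst, uint32_t src, uint32_t carry_in)
{
    uint64_t wide = (uint64_t)*dst + src + carry_in;
    *dst = (uint32_t)wide;
    return (uint32_t)(wide >> 32);
}

/* Shape of the GCC sequence: a scratch register is zeroed with xorl
 * and then used as the adcl source.  On real hardware the xorl has to
 * be issued before the addl that produces the carry, because xorl
 * clears CF; the model sidesteps this by passing carry_in explicitly. */
static uint32_t high_word_via_zero_register(uint32_t hi, uint32_t carry_in)
{
    uint32_t scratch = 0;              /* xorl %ecx, %ecx */
    (void)adc32(&hi, scratch, carry_in); /* adcl %ecx, %edi */
    return hi;
}

/* Shape of the icc sequence: adcl with an immediate 0, so no scratch
 * register and no xorl are needed. */
static uint32_t high_word_via_immediate(uint32_t hi, uint32_t carry_in)
{
    (void)adc32(&hi, 0u, carry_in);    /* adcl $0, %esi */
    return hi;
}
```

Both functions compute the same value, hi + carry_in modulo 2^32. The difference between them is one register and one instruction per adcl, which matches the extra xorl instructions visible in the trunk listing compared with the icc listing.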


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu dot
                   |                            |org
  GCC build triplet|i386-pc-linux-gnu           |
   GCC host triplet|i386-pc-linux-gnu           |
 GCC target triplet|i386-pc-linux-gnu           |i?86-*-*



* [Bug target/33717] slow code generated for 64-bit arithmetic
  2007-10-09 16:53 [Bug rtl-optimization/33717] New: slow code generated for 64-bit arithmetic felix-gcc at fefe dot de
                   ` (2 preceding siblings ...)
  2008-12-31 18:41 ` pinskia at gcc dot gnu dot org
@ 2009-01-01 17:37 ` ubizjak at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: ubizjak at gmail dot com @ 2009-01-01 17:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from ubizjak at gmail dot com  2009-01-01 17:35 -------
(In reply to comment #3)

> Most likely addsi3_carry should accept 0 as one of the operands.

It does:

(define_insn "addsi3_carry"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=rm,r")
          (plus:SI (plus:SI (match_operand:SI 3 "ix86_carry_flag_operator" "")
                            (match_operand:SI 1 "nonimmediate_operand" "%0,0"))
                   (match_operand:SI 2 "general_operand" "ri,rm")))
   (clobber (reg:CC FLAGS_REG))]


It looks to me like cprop_hardreg is the pass that should handle this case.  At
the least, this sequence should be handled (to propagate the value of cx):

(insn 74 50 52 3 pr33717.c:12 (parallel [
            (set (reg:SI 2 cx [+4 ])
                (const_int 0 [0x0]))
            (clobber (reg:CC 17 flags))
        ]) 45 {*movsi_xor} (nil))


(insn 53 52 54 3 pr33717.c:12 (parallel [
            (set (reg:SI 4 si [+4 ])
                (plus:SI (plus:SI (ltu:SI (reg:CC 17 flags)
                            (const_int 0 [0x0]))
                        (reg:SI 4 si [+4 ]))
                    (reg:SI 2 cx [+4 ])))
            (clobber (reg:CC 17 flags))
        ]) 266 {addsi3_carry} (expr_list:REG_DEAD (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:SI 2 cx [+4 ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))


