public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/22152] New: Poor loop optimization when using sse2 builtins - regression from 3.3
@ 2005-06-22 20:06 fjahanian at apple dot com
  2005-06-22 20:19 ` [Bug rtl-optimization/22152] Poor loop optimization when using mmx builtins pinskia at gcc dot gnu dot org
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: fjahanian at apple dot com @ 2005-06-22 20:06 UTC (permalink / raw)
  To: gcc-bugs

In the following trivial test case, gcc-4.1 produces very inefficient code for the loop. gcc-3.3 produces
much better code.

typedef int __m64 __attribute__ ((__vector_size__ (8)));

__m64 unsigned_add3( const __m64 *a, const __m64 *b, unsigned long count )
{
                __m64 sum;
                unsigned int i;

                for( i = 1; i < count; i++ )
                {
                        sum = (__m64) __builtin_ia32_paddq ((long long)a[i], (long long)b[i]);
                }
                return sum;
}

1) Loop when compiled with gcc-4.1 -O2 -msse2 (note in particular the extra movq to memory):

L4:
        movl    12(%ebp), %esi
        movq    (%eax,%edx,8), %mm0
        paddq   (%esi,%edx,8), %mm0
        incl    %edx
        cmpl    %edx, %ecx
        movq    %mm0, -16(%ebp)
        movl    -16(%ebp), %esi
        movl    -12(%ebp), %edi
        jne     L4

2) Loop using gcc-3.3 compiled with -O2 -msse2:

L6:
        movq    (%esi,%edx,8), %mm0
        paddq   (%eax,%edx,8), %mm0
        addl    $1, %edx
        cmpl    %ecx, %edx
        jb      L6

AFAICT, the culprit is reload, which generates an extra store and load of %mm0:

(insn 62 30 63 2 (set (mem:V2SI (plus:SI (reg/f:SI 6 bp)
                (const_int -16 [0xfffffffffffffff0])) [0 S8 A8])
        (reg:V2SI 29 mm0)) 736 {*movv2si_internal} (nil)
    (nil))

(insn 63 62 32 2 (set (reg/v:V2SI 4 si [orig:61 sum ] [61])
        (mem:V2SI (plus:SI (reg/f:SI 6 bp)
                (const_int -16 [0xfffffffffffffff0])) [0 S8 A8])) 736 {*movv2si_internal} (nil)
    (nil))
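
For illustration only, here is a minimal C sketch of what insns 62 and 63 amount to: `sum` ends up in the general-register pair %esi:%edi, so the value in %mm0 takes a round trip through the stack slot at -16(%ebp). The names gpr_pair, spill_and_reload and pair_to_u64 are hypothetical and are not taken from GCC or from the test case. This models the data movement only, not the compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the register pair that holds `sum`:
   word[0] stands in for %esi, word[1] for %edi. */
typedef struct {
    uint32_t word[2];
} gpr_pair;

/* Model of the round trip: %mm0 -> stack slot -> %esi:%edi. */
static gpr_pair spill_and_reload(uint64_t mm0)
{
    unsigned char slot[8];                  /* models the slot at -16(%ebp) */
    gpr_pair sum;

    memcpy(slot, &mm0, sizeof slot);        /* insn 62: movq %mm0, -16(%ebp) */
    memcpy(&sum.word[0], slot, 4);          /* insn 63: movl -16(%ebp), %esi */
    memcpy(&sum.word[1], slot + 4, 4);      /*          movl -12(%ebp), %edi */
    return sum;
}

/* Reassemble the pair into one 64-bit value to show nothing is lost. */
static uint64_t pair_to_u64(gpr_pair p)
{
    uint64_t v;

    assert(sizeof p.word == sizeof v);
    memcpy(&v, p.word, sizeof v);
    return v;
}
```

The round trip preserves the value, which is why it is pure overhead in the loop: the gcc-3.3 code shown earlier has no such store and reload.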

Here is the larger test case from which the above test was extracted:

#include <xmmintrin.h>

__m64 unsigned_add3( const __m64 *a, const __m64 *b, __m64 *result, unsigned long count )
{
        __m64 carry, temp, sum, one, onesCarry, _a, _b;
        unsigned int i;

        if( count > 0 )
        {
                _a = a[0];
                _b = b[0];

                one = _mm_cmpeq_pi8( _a, _a );  //-1
                one = _mm_sub_si64( _mm_xor_si64( one, one ), one );    //1
                sum = _mm_add_si64( _a, _b );

                onesCarry = _mm_and_si64( _a, _b );             //the 1's bit is set only if the 1's bit add generates a carry
                onesCarry = _mm_and_si64( onesCarry, one );             //onesCarry &= 1

                //Trim off the one's bit on both vA and vB to make room for a carry bit at the top after the add
                _a = _mm_srli_si64( _a, 1 );                                            //vA >>= 1
                _b = _mm_srli_si64( _b, 1 );                                            //vB >>= 1

                //Add vA to vB and add the carry bit
                carry = _mm_add_si64( _a, _b );
                carry = _mm_add_si64( carry, onesCarry );

                //right shift by 63 bits to get the carry bit for the high 64 bit quantity
                carry = _mm_srli_si64( carry, 63 );

                for( i = 1; i < count; i++ )
                {
                        result[i-1] = sum;
                        _a = a[i];
                        _b = b[i];
                        onesCarry = _mm_and_si64( _a, _b );
                        onesCarry = _mm_and_si64( onesCarry, one );
                        sum = _mm_add_si64( _a, _b );
                        _a = _mm_add_si64( _a, onesCarry );
                        onesCarry = _mm_and_si64( carry, _a );  //find low bit carry
                        sum = _mm_add_si64( sum, carry );               //add in carry bit to low word sum 
                        carry = _mm_add_si64( _a, onesCarry );  //add in low bit carry to high result
                }

                result[i-1] = sum;
        }

        return carry;
}

Again, gcc-3.3 produces much better code for this loop.
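
For readers unfamiliar with the carry trick described in the comments before the loop, a scalar sketch in portable C may help. It covers only the pre-loop computation of `carry` (shift both operands right by one, add them together with the low-bit carry, then take bit 63), not the loop body. The names carry_out, carry_out_ref and check_pair are hypothetical, and uint64_t stands in for __m64.

```c
#include <assert.h>
#include <stdint.h>

/* Carry out of the 64-bit sum a + b, computed without a wider type. */
static uint64_t carry_out(uint64_t a, uint64_t b)
{
    uint64_t ones_carry = a & b & 1;    /* 1 only if the low-bit add carries */
    uint64_t half = (a >> 1) + (b >> 1) + ones_carry;  /* floor((a + b) / 2) */

    return half >> 63;                  /* bit 64 of the full sum: 0 or 1 */
}

/* Reference: with unsigned wraparound, the sum is smaller than an
   operand exactly when a carry occurred. */
static uint64_t carry_out_ref(uint64_t a, uint64_t b)
{
    return (uint64_t)(a + b < a);
}

/* Check the shift-based carry against the reference for one pair. */
static void check_pair(uint64_t a, uint64_t b)
{
    assert(carry_out(a, b) == carry_out_ref(a, b));
}
```

The intermediate sum cannot overflow, since floor((a + b) / 2) is at most 2^64 - 1. That is why the test case shifts both operands right before adding.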

-- 
           Summary: Poor loop optimization when using sse2 builtins -
                    regression from 3.3
           Product: gcc
           Version: 4.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: fjahanian at apple dot com
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: apple-x86-darwin
  GCC host triplet: apple-x86-darwin
GCC target triplet: apple-x86-darwin


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152


[parent not found: <bug-22152-7508@http.gcc.gnu.org/bugzilla/>]

end of thread, other threads:[~2010-09-06 18:14 UTC | newest]

Thread overview: 15+ messages
2005-06-22 20:06 [Bug rtl-optimization/22152] New: Poor loop optimization when using sse2 builtins - regression from 3.3 fjahanian at apple dot com
2005-06-22 20:19 ` [Bug rtl-optimization/22152] Poor loop optimization when using mmx builtins pinskia at gcc dot gnu dot org
2005-06-22 20:26 ` [Bug target/22152] " pinskia at gcc dot gnu dot org
2005-09-13  0:52 ` fjahanian at apple dot com
     [not found] <bug-22152-7508@http.gcc.gnu.org/bugzilla/>
2007-03-01 13:47 ` ubizjak at gmail dot com
2008-03-08  7:00 ` uros at gcc dot gnu dot org
2008-03-08  7:09 ` ubizjak at gmail dot com
2008-03-08  7:22 ` ubizjak at gmail dot com
2008-03-08  7:23 ` ubizjak at gmail dot com
2008-03-08 12:44 ` uros at gcc dot gnu dot org
2008-03-08 12:52 ` ubizjak at gmail dot com
2010-09-06 12:06 ` ubizjak at gmail dot com
2010-09-06 17:51 ` uros at gcc dot gnu dot org
2010-09-06 17:55 ` uros at gcc dot gnu dot org
2010-09-06 18:14 ` ubizjak at gmail dot com
