public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/38825]  New: missed optimization: register renaming in unrolled loop
@ 2009-01-13 11:40 tim at klingt dot org
  2009-01-13 15:09 ` [Bug target/38825] " rguenth at gcc dot gnu dot org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: tim at klingt dot org @ 2009-01-13 11:40 UTC (permalink / raw)
  To: gcc-bugs

the following two functions are equivalent, adding a scalar to a vector, using
a manual loop unrolling of 8 (2 sse vectors).

the first function serializes the operation, while the second function
interleaves the instructions for two operations:

void bench_3(float * out, float * in, float f, unsigned int n)
{
    n /= 8;
    __m128 scalar = _mm_set_ps1(f);
    do
    {
        __m128 arg = _mm_load_ps(in);
        __m128 result = _mm_add_ps(arg, scalar);
        _mm_store_ps(out, result);

        arg = _mm_load_ps(in+4);
        result = _mm_add_ps(arg, scalar);
        _mm_store_ps(out+4, result);
        in += 8;
        out += 8;
    }
    while (--n);
}

with the generated code:
.L13:
        movaps  (%rsi,%rax), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdi,%rax)
        movaps  16(%rsi,%rax), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, 16(%rdi,%rax)
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L13


void bench_4(float * out, float * in, float f, unsigned int n)
{
    n /= 8;
    __m128 scalar = _mm_set_ps1(f);
    do
    {
        __m128 arg  = _mm_load_ps(in);
        __m128 arg2 = _mm_load_ps(in+4);
        __m128 result  = _mm_add_ps(arg, scalar);
        __m128 result2 = _mm_add_ps(arg2, scalar);
        _mm_store_ps(out, result);
        _mm_store_ps(out+4, result2);
        in += 8;
        out += 8;
    }
    while (--n);
}

generated code:
.L9:
        movaps  (%rsi,%rax), %xmm0
        movaps  16(%rsi,%rax), %xmm1
        addps   %xmm2, %xmm0
        addps   %xmm2, %xmm1
        movaps  %xmm0, (%rdi,%rax)
        movaps  %xmm1, 16(%rdi,%rax)
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L9

the interleaved code outperforms the sequential code by about 12% on
x86_64/core2, possibly, because the instruction pairs (load/add/store) don't
have any data dependencies.
it would be nice, if gcc could do a register renaming and instruction
reordering on the first function to generate the same instructions than the
second function.


-- 
           Summary: missed optimization: register renaming in unrolled loop
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tim at klingt dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38825


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2009-01-13 16:37 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-13 11:40 [Bug target/38825] New: missed optimization: register renaming in unrolled loop tim at klingt dot org
2009-01-13 15:09 ` [Bug target/38825] " rguenth at gcc dot gnu dot org
2009-01-13 15:16 ` rguenth at gcc dot gnu dot org
2009-01-13 15:26 ` tim at klingt dot org
2009-01-13 15:45 ` rguenth at gcc dot gnu dot org
2009-01-13 16:08 ` tim at klingt dot org
2009-01-13 16:37 ` rguenth at gcc dot gnu dot org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).