public inbox for gcc-bugs@sourceware.org
From: "tim at klingt dot org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug inline-asm/38671]  New: [4.4 Regression] speed regression with sse intrinsics
Date: Tue, 30 Dec 2008 12:58:00 -0000
Message-ID: <bug-38671-12873@http.gcc.gnu.org/bugzilla/>

I am seeing a speed regression with gcc-4.4 in code that uses SSE intrinsics
on a Core 2 (x86_64). The code is:

#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_load_ps, ... */

namespace detail
{
/** compute x1 * (1 + x2 * amount)  */
__m128 inline amp_mod4_loop(__m128 x1, __m128 x2, __m128 amount, __m128 one)
{
    return _mm_mul_ps(x1,
                      _mm_add_ps(one,
                                 _mm_mul_ps(x2, amount)));
}
} /* namespace detail */

/* explicit specialization; the primary template is not shown in this report */
template <>
inline void amp_mod4(float * out, const float * in1, const float * in2,
                     const float amount, unsigned int n)
{
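    n = n >> 2;  /* number of 4-float vectors; the do/while below assumes n >= 4 */
    /* gen_one() is not shown in this report; per the disassembly below it
       yields 1.0f in all four lanes */
    const __m128 one = detail::gen_one();
    const __m128 amnt = _mm_set_ps1(amount);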

    do
    {
        const __m128 x1 = _mm_load_ps(in1);
        in1 += 4;
        const __m128 x2 = _mm_load_ps(in2);
        in2 += 4;

        const __m128 result = detail::amp_mod4_loop(x1, x2, amnt, one);

        _mm_store_ps(out, result);
        out += 4;
    }
    while (--n);
}
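
The report does not include detail::gen_one(). The following is a minimal sketch of what it might look like, reconstructed only from the pxor / pcmpeqd / psrld $0x19 / pslld $0x17 sequence in the disassembly further down. The body is an assumption, not the reporter's code, and it needs SSE2 (emmintrin.h) for the integer shifts.

```cpp
#include <emmintrin.h>  // SSE2: _mm_cmpeq_epi32, _mm_srli_epi32, _mm_slli_epi32

namespace detail
{
/* Hypothetical reconstruction of gen_one(): builds {1.0f, 1.0f, 1.0f, 1.0f}
 * in a register without loading a constant from memory. */
inline __m128 gen_one()
{
    __m128i x = _mm_setzero_si128();   // pxor:    all bits clear
    x = _mm_cmpeq_epi32(x, x);         // pcmpeqd: all bits set
    x = _mm_srli_epi32(x, 25);         // psrld:   0x0000007f in each lane
    x = _mm_slli_epi32(x, 23);         // pslld:   0x3f800000, the bit pattern of 1.0f
    return _mm_castsi128_ps(x);        // reinterpret as four floats, no conversion
}
} /* namespace detail */
```

The shift pair leaves 0x7f in the exponent field and zeros elsewhere, which is the IEEE 754 single-precision encoding of 1.0f.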

The results for the different compilers (measured with hardware performance counters) are:
gcc-4.4:
cycles: 1416276094
branch misses: 425897

gcc-4.4 -march=core2:
cycles: 1520034636
branch misses: 3263912

gcc-4.3:
cycles: 1548838336
branch misses: 5990424

gcc-4.3 -march=core2:
cycles: 1386605444
branch misses: 5609

gcc-4.2:
cycles: 1321697674
branch misses: 3682
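
To put the cycle counts on a common scale, the sketch below expresses each one as a percentage over the gcc-4.2 result. The cycle values are copied from the list above; the constant and function names are made up for illustration.

```cpp
#include <cstdint>

// Cycle counts copied from the report.
constexpr std::uint64_t cycles_gcc44       = 1416276094ULL;
constexpr std::uint64_t cycles_gcc44_core2 = 1520034636ULL;
constexpr std::uint64_t cycles_gcc43       = 1548838336ULL;
constexpr std::uint64_t cycles_gcc43_core2 = 1386605444ULL;
constexpr std::uint64_t cycles_gcc42       = 1321697674ULL;

// Percentage by which `cycles` exceeds `baseline`.
constexpr double slowdown_percent(std::uint64_t cycles, std::uint64_t baseline)
{
    return 100.0 * (static_cast<double>(cycles) - static_cast<double>(baseline))
                 / static_cast<double>(baseline);
}
```

Relative to gcc-4.2, this gives roughly 7% for gcc-4.4, 15% for gcc-4.4 -march=core2, 17% for gcc-4.3 and 5% for gcc-4.3 -march=core2.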

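It seems that gcc-4.3 with -march=core2 and gcc-4.2 generate code that is
friendlier to the branch predictor: both show only a few thousand branch
misses, against hundreds of thousands to millions for the other builds.
Tuning for core2 on gcc-4.4 actually seems to generate worse code than the
gcc-4.4 default.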

The fastest code (gcc-4.2) is:
0000000000400de0 <bench_1_simd(unsigned int)>:
  400de0:       66 0f ef c0             pxor   %xmm0,%xmm0
  400de4:       c1 ef 02                shr    $0x2,%edi
  400de7:       0f 28 15 32 0f 00 00    movaps 0xf32(%rip),%xmm2        # 401d20 <_IO_stdin_used+0xb0>
  400dee:       31 c0                   xor    %eax,%eax
  400df0:       66 0f 76 c0             pcmpeqd %xmm0,%xmm0
  400df4:       66 0f 72 d0 19          psrld  $0x19,%xmm0
  400df9:       66 0f 72 f0 17          pslld  $0x17,%xmm0
  400dfe:       0f 28 c8                movaps %xmm0,%xmm1
  400e01:       0f 28 80 e0 26 60 00    movaps 0x6026e0(%rax),%xmm0
  400e08:       0f 59 c2                mulps  %xmm2,%xmm0
  400e0b:       0f 58 c1                addps  %xmm1,%xmm0
  400e0e:       0f 59 80 e0 25 60 00    mulps  0x6025e0(%rax),%xmm0
  400e15:       0f 29 80 e0 24 60 00    movaps %xmm0,0x6024e0(%rax)
  400e1c:       48 83 c0 10             add    $0x10,%rax
  400e20:       83 ef 01                sub    $0x1,%edi
  400e23:       75 dc                   jne    400e01 <bench_1_simd(unsigned int)+0x21>
  400e25:       f3 c3                   repz retq 
  400e27:       90                      nop    
  400e28:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)

The gcc-4.4 -march=core2 code takes about 15% more cycles than the gcc-4.2 code:
0000000000400e70 <bench_1_simd(unsigned int)>:
  400e70:       66 0f ef d2             pxor   %xmm2,%xmm2
  400e74:       89 fa                   mov    %edi,%edx
  400e76:       66 0f 76 d2             pcmpeqd %xmm2,%xmm2
  400e7a:       c1 ea 02                shr    $0x2,%edx
  400e7d:       66 0f 72 d2 19          psrld  $0x19,%xmm2
  400e82:       ff ca                   dec    %edx
  400e84:       66 0f 72 f2 17          pslld  $0x17,%xmm2
  400e89:       48 ff c2                inc    %rdx
  400e8c:       0f 28 0d 7d 17 00 00    movaps 0x177d(%rip),%xmm1        # 402610 <_IO_stdin_used+0xb0>
  400e93:       48 c1 e2 04             shl    $0x4,%rdx
  400e97:       31 c0                   xor    %eax,%eax
  400e99:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400ea0:       0f 28 c1                movaps %xmm1,%xmm0
  400ea3:       0f 59 80 e0 36 60 00    mulps  0x6036e0(%rax),%xmm0
  400eaa:       0f 58 c2                addps  %xmm2,%xmm0
  400ead:       0f 59 80 e0 35 60 00    mulps  0x6035e0(%rax),%xmm0
  400eb4:       0f 29 80 e0 34 60 00    movaps %xmm0,0x6034e0(%rax)
  400ebb:       48 83 c0 10             add    $0x10,%rax
  400ebf:       48 39 d0                cmp    %rdx,%rax
  400ec2:       75 dc                   jne    400ea0 <bench_1_simd(unsigned int)+0x30>
  400ec4:       f3 c3                   repz retq 
  400ec6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400ecd:       00 00 00
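
Read side by side, the two listings differ mainly in how the loop is terminated: gcc-4.2 keeps the 32-bit down-counter from the source (sub $0x1,%edi; jne), while gcc-4.4 computes an end offset up front (shr, dec %edx, inc %rdx, shl $0x4) and compares the running offset against it (cmp %rdx,%rax; jne). The sketch below writes both shapes out in C++ so they can be compared. It is one reading of the disassembly, not the compiler's internal representation, and step4, trip_count and the two function names are hypothetical. A scalar loop stands in for the SSE step so the example builds anywhere.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar stand-in for one 4-float SSE step: out = in1 * (1 + in2 * amount).
inline void step4(float* out, const float* in1, const float* in2, float amount)
{
    for (int k = 0; k < 4; ++k)
        out[k] = in1[k] * (1.0f + in2[k] * amount);
}

// gcc-4.2 shape: a running offset plus a separate 32-bit down-counter.
inline void amp_mod4_countdown(float* out, const float* in1, const float* in2,
                               float amount, unsigned int n)
{
    n >>= 2;                                   // shr $0x2,%edi
    std::size_t off = 0;                       // xor %eax,%eax
    do {
        step4(out + off, in1 + off, in2 + off, amount);
        off += 4;                              // add $0x10,%rax (16 bytes = 4 floats)
    } while (--n);                             // sub $0x1,%edi ; jne
}

// Trip count as the gcc-4.4 prologue appears to compute it: the decrement is
// done in 32 bits (zero-extended into the 64-bit register), the increment in
// 64 bits, so n < 4 yields 2^32 trips, as the wrapping do/while in the source would.
inline std::uint64_t trip_count(unsigned int n)
{
    const std::uint32_t last = (n >> 2) - 1u;        // shr $0x2,%edx ; dec %edx
    return static_cast<std::uint64_t>(last) + 1u;    // inc %rdx
}

// gcc-4.4 shape: a single running offset compared against a precomputed end.
inline void amp_mod4_bound(float* out, const float* in1, const float* in2,
                           float amount, unsigned int n)
{
    // End offset counted in floats here; the listing uses shl $0x4 to count bytes.
    const std::uint64_t end = trip_count(n) << 2;
    std::uint64_t off = 0;                     // xor %eax,%eax
    do {
        step4(out + off, in1 + off, in2 + off, amount);
        off += 4;                              // add $0x10,%rax
    } while (off != end);                      // cmp %rdx,%rax ; jne
}
```

Both functions do the same work per iteration and produce the same output for n >= 4; the second trades the per-iteration counter update for extra setup before the loop.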


-- 
           Summary: [4.4 Regression] speed regression with sse intrinsics
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: inline-asm
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tim at klingt dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38671



Thread overview: 16+ messages
2008-12-30 12:58 tim at klingt dot org [this message]
2008-12-30 12:59 ` [Bug inline-asm/38671] " tim at klingt dot org
2008-12-30 16:23 ` [Bug target/38671] " pinskia at gcc dot gnu dot org
2008-12-31  7:50 ` pinskia at gcc dot gnu dot org
2008-12-31  7:57 ` pinskia at gcc dot gnu dot org
2008-12-31  8:11 ` [Bug middle-end/38671] [4.4 Regression] extra code for setting up loops pinskia at gcc dot gnu dot org
2008-12-31  8:14 ` [Bug middle-end/38671] [4.4 Regression] extra code for setting up loops (IV-opts and 32bits vs 64bits) pinskia at gcc dot gnu dot org
2008-12-31  9:21 ` tim at klingt dot org
2009-01-05 11:28 ` rguenth at gcc dot gnu dot org
2009-04-21 16:00 ` [Bug middle-end/38671] [4.4/4.5 " jakub at gcc dot gnu dot org
2009-07-22 10:35 ` jakub at gcc dot gnu dot org
2009-10-15 12:54 ` jakub at gcc dot gnu dot org
2010-01-21 13:16 ` jakub at gcc dot gnu dot org
2010-03-01 23:32 ` [Bug middle-end/38671] [4.3/4.4/4.5 " pinskia at gcc dot gnu dot org
2010-03-01 23:35 ` [Bug middle-end/38671] [4.3/4.4/4.5 Regression] selecting one IV instead of three pinskia at gcc dot gnu dot org
2010-04-30  9:25 ` [Bug middle-end/38671] [4.3/4.4/4.5/4.6 " jakub at gcc dot gnu dot org
