public inbox for gcc-bugs@sourceware.org
* [Bug target/38824]  New: [4.4 regression] performance regression of sse code from 4.2/4.3
@ 2009-01-13 11:25 tim at klingt dot org
  2009-01-13 15:07 ` [Bug target/38824] " rguenth at gcc dot gnu dot org
                   ` (31 more replies)
  0 siblings, 32 replies; 33+ messages in thread
From: tim at klingt dot org @ 2009-01-13 11:25 UTC (permalink / raw)
  To: gcc-bugs

the following code shows a performance regression from gcc-4.2 to gcc-4.3 and
4.4 (20090111) on an intel core2 using the x86_64 architecture:

#include <xmmintrin.h>

void bench_1(float * out, float * in, float f, unsigned int n)
{
    n /= 4;
    __m128 scalar = _mm_set_ps1(f);
    do
    {
        __m128 arg = _mm_load_ps(in);
        __m128 result = _mm_add_ps(arg, scalar);
        _mm_store_ps(out, result);
        in += 4;
        out += 4;
    }
    while (--n);
}

results, running the function 100000000 times, measured with performance
counters (requires a patched kernel), compiled with -O3 -mfpmath=sse -msse
gcc-4.2: 1946256122 cycles, 8394301290 instructions, 5005 branch misses
gcc-4.3: 2191990305 cycles, 7658465214 instructions, 3442 branch misses
gcc-4.4: 2532778908 cycles, 7462359830 instructions, 8593402 branch misses

although the instruction count decreases, the number of cycles spent in the
function increases. gcc-4.4 also shows a huge number of branch misses.
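
For reference, the counter values above can be reduced to instructions per
cycle and a relative slowdown. The snippet below is only a sketch of that
arithmetic: the names Counters, ipc and slowdown are made up for illustration,
and the numbers are copied from the results above.

```cpp
#include <cstdint>

// Hardware counter totals for one compiler, as reported above.
struct Counters {
    std::uint64_t cycles;
    std::uint64_t instructions;
};

// Values copied from the report (100000000 calls of bench_1 each).
constexpr Counters gcc42{1946256122ULL, 8394301290ULL};
constexpr Counters gcc43{2191990305ULL, 7658465214ULL};
constexpr Counters gcc44{2532778908ULL, 7462359830ULL};

// Instructions retired per cycle.
constexpr double ipc(Counters c) {
    return static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}

// Cycle count of `slow` relative to `fast` (1.0 means no change).
constexpr double slowdown(Counters slow, Counters fast) {
    return static_cast<double>(slow.cycles) / static_cast<double>(fast.cycles);
}
```

With these numbers, ipc() drops from about 4.31 (gcc-4.2) to about 3.49
(gcc-4.3) and about 2.95 (gcc-4.4), and slowdown(gcc44, gcc42) is about 1.30:
gcc-4.4 needs roughly 30% more cycles while executing roughly 11% fewer
instructions.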

the generated code is

gcc-4.2:
.globl _Z7bench_1PfS_fj
        .type   _Z7bench_1PfS_fj, @function
_Z7bench_1PfS_fj:
.LFB2695:
        movaps  %xmm0, %xmm2
        shrl    $2, %edx
        shufps  $0, %xmm2, %xmm2
        movaps  %xmm2, %xmm1
        .p2align 4,,7
.L15:
        movaps  (%rsi), %xmm0
        addq    $16, %rsi
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdi)
        addq    $16, %rdi
        subl    $1, %edx
        jne     .L15
        rep ; ret
.LFE2695:
        .size   _Z7bench_1PfS_fj, .-_Z7bench_1PfS_fj
        .align 2
        .p2align 4,,15

gcc-4.3
.globl _Z7bench_1PfS_fj
        .type   _Z7bench_1PfS_fj, @function
_Z7bench_1PfS_fj:
.LFB2563:
        movaps  %xmm0, %xmm2
        shrl    $2, %edx
        subl    $1, %edx
        xorl    %eax, %eax
        shufps  $0, %xmm2, %xmm2
        mov     %edx, %edx
        addq    $1, %rdx
        salq    $4, %rdx
        movaps  %xmm2, %xmm1
        .p2align 4,,10
        .p2align 3
.L17:
        movaps  (%rsi,%rax), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L17
        rep
        ret
.LFE2563:
        .size   _Z7bench_1PfS_fj, .-_Z7bench_1PfS_fj
        .p2align 4,,15

gcc-4.4
.globl _Z7bench_1PfS_fj
        .type   _Z7bench_1PfS_fj, @function
_Z7bench_1PfS_fj:
.LFB2489:
        .cfi_startproc
        .cfi_personality 0x3,__gxx_personality_v0
        shrl    $2, %edx
        shufps  $0, %xmm0, %xmm0
        subl    $1, %edx
        xorl    %eax, %eax
        addq    $1, %rdx
        salq    $4, %rdx
        .p2align 4,,10
        .p2align 3
.L17:
        movaps  %xmm0, %xmm1
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L17
        rep
        ret
        .cfi_endproc
.LFE2489:
        .size   _Z7bench_1PfS_fj, .-_Z7bench_1PfS_fj
        .p2align 4,,15
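
At the source level, the difference between the gcc-4.2 loop and the
gcc-4.3/4.4 loops corresponds roughly to the two functions below. This is only
an illustrative sketch: the names bench_ptr and bench_idx are made up, the
index counts floats rather than bytes, and the compiler is free to turn either
form into the other, so writing one of them does not guarantee the matching
assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// Shape of the gcc-4.2 loop: both pointers are bumped and a counter runs down.
// As in the original, both buffers must be 16-byte aligned; this sketch also
// requires n to be a non-zero multiple of 4.
void bench_ptr(float* out, const float* in, float f, unsigned int n)
{
    assert(n >= 4 && n % 4 == 0);
    n /= 4;
    const __m128 scalar = _mm_set_ps1(f);
    do {
        _mm_store_ps(out, _mm_add_ps(_mm_load_ps(in), scalar));
        in += 4;
        out += 4;
    } while (--n);
}

// Shape of the gcc-4.3/4.4 loop: one index is shared by both base pointers
// and compared against a precomputed end value.
void bench_idx(float* out, const float* in, float f, unsigned int n)
{
    assert(n >= 4 && n % 4 == 0);
    const std::size_t end = n;  // counted in floats, not bytes
    const __m128 scalar = _mm_set_ps1(f);
    std::size_t i = 0;
    do {
        _mm_store_ps(out + i, _mm_add_ps(_mm_load_ps(in + i), scalar));
        i += 4;
    } while (i != end);
}
```

Both functions compute the same result; comparing their generated code with
-S is one way to check which loop shape a given compiler version picks.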


-- 
           Summary: [4.4 regression] performance regression of sse code from
                    4.2/4.3
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tim at klingt dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
@ 2009-01-13 15:07 ` rguenth at gcc dot gnu dot org
  2009-01-13 16:22 ` tim at klingt dot org
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-01-13 15:07 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from rguenth at gcc dot gnu dot org  2009-01-13 15:07 -------
I don't see how these changes could cause more branch misses.  If you do the
same .p2align for the 4.4 code does the regression vanish?  I would suspect
that the loop-stream detector catches one but not the other form for some
reason.  Maybe the Intel folks can properly analyze this - HJ?


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
  2009-01-13 15:07 ` [Bug target/38824] " rguenth at gcc dot gnu dot org
@ 2009-01-13 16:22 ` tim at klingt dot org
  2009-01-14 20:20 ` hubicka at gcc dot gnu dot org
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: tim at klingt dot org @ 2009-01-13 16:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from tim at klingt dot org  2009-01-13 16:22 -------
(In reply to comment #1)
> I don't see how these changes could cause more branch misses.  If you do the
> same .p2align for the 4.4 code does the regression vanish?  I would suspect
> that the loop-stream detector catches one but not the other form for some
> reason.  Maybe the Intel folks can properly analyze this - HJ?

after doing some more tests, i wouldn't think too much about the branch misses.
they seem to be quite dependent on the binary, even on linked libraries. i am
more concerned about the inner loop ...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
  2009-01-13 15:07 ` [Bug target/38824] " rguenth at gcc dot gnu dot org
  2009-01-13 16:22 ` tim at klingt dot org
@ 2009-01-14 20:20 ` hubicka at gcc dot gnu dot org
  2009-01-14 20:26 ` [Bug target/38824] [4.4 Regression] " rguenth at gcc dot gnu dot org
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2009-01-14 20:20 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from hubicka at gcc dot gnu dot org  2009-01-14 20:20 -------
It might be an IRA change.  Chips generally prefer separate load and execute
instructions, as in the old loop, over load+execute since they are easier to
retire.

Splitting the instruction post reload probably won't do much good, since there
is an extra move already. If just splitting the instruction would help, we can
macroize:
(define_peephole2
  [(match_scratch:SI 2 "r")
   (parallel [(set (match_operand:SI 0 "register_operand" "")
                   (match_operator:SI 3 "arith_or_logical_operator"
                     [(match_dup 0)
                      (match_operand:SI 1 "memory_operand" "")]))
              (clobber (reg:CC FLAGS_REG))])]
  "optimize_insn_for_speed_p () && ! TARGET_READ_MODIFY"
  [(set (match_dup 2) (match_dup 1))
   (parallel [(set (match_dup 0)
                   (match_op_dup 3 [(match_dup 0) (match_dup 2)]))
              (clobber (reg:CC FLAGS_REG))])]
  "") 

peephole for vector modes too.
Vladimir, perhaps IRA can be tweaked here somehow?
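
To make the effect of such a splitting peephole concrete, here is a toy model
of the rewrite on a made-up two-address instruction list. It is not GCC code:
Insn, is_mem and split_load_execute are hypothetical names, and the scratch
register is simply passed in, where the real peephole would obtain one through
match_scratch.

```cpp
#include <string>
#include <vector>

// One instruction of a toy two-address machine: dst = dst op src.
// An empty op means a plain move: dst = src.
struct Insn {
    std::string op;   // "" for a move, otherwise e.g. "add"
    std::string dst;  // always a register
    std::string src;  // a register name, or "[addr]" for a memory operand
};

bool is_mem(const std::string& operand) {
    return !operand.empty() && operand.front() == '[';
}

// Split every load+execute instruction into a load into `scratch` followed by
// a register-register operation. The caller must pass a register that is free
// at every such point.
std::vector<Insn> split_load_execute(const std::vector<Insn>& code,
                                     const std::string& scratch) {
    std::vector<Insn> out;
    for (const Insn& insn : code) {
        if (!insn.op.empty() && is_mem(insn.src)) {
            out.push_back({"", scratch, insn.src});       // scratch = mem
            out.push_back({insn.op, insn.dst, scratch});  // dst = dst op scratch
        } else {
            out.push_back(insn);
        }
    }
    return out;
}
```

Applied to the gcc-4.4 loop body (a register move followed by an add with a
memory operand), this yields three instructions instead of two, which is the
"extra move" concern raised above: the split alone does not recover the
gcc-4.2 form.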


-- 

hubicka at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at redhat dot com,
                   |                            |hubicka at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (2 preceding siblings ...)
  2009-01-14 20:20 ` hubicka at gcc dot gnu dot org
@ 2009-01-14 20:26 ` rguenth at gcc dot gnu dot org
  2009-01-14 20:32 ` [Bug target/38824] [4.4 regression] " hubicka at gcc dot gnu dot org
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-01-14 20:26 UTC (permalink / raw)
  To: gcc-bugs



-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
            Summary|[4.4 regression] performance|[4.4 Regression] performance
                   |regression of sse code from |regression of sse code from
                   |4.2/4.3                     |4.2/4.3
   Target Milestone|---                         |4.4.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (3 preceding siblings ...)
  2009-01-14 20:26 ` [Bug target/38824] [4.4 Regression] " rguenth at gcc dot gnu dot org
@ 2009-01-14 20:32 ` hubicka at gcc dot gnu dot org
  2009-01-15  0:31 ` hubicka at gcc dot gnu dot org
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2009-01-14 20:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from hubicka at gcc dot gnu dot org  2009-01-14 20:31 -------
Actually, perhaps in a simple case like this even peep2 will work, since
copyprop will fix it up later.  I am trying to add the peephole.


-- 

hubicka at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|unassigned at gcc dot gnu   |hubicka at gcc dot gnu dot
                   |dot org                     |org
             Status|UNCONFIRMED                 |ASSIGNED
     Ever Confirmed|0                           |1
           Keywords|missed-optimization         |
   Last reconfirmed|0000-00-00 00:00:00         |2009-01-14 20:31:52
               date|                            |
            Summary|[4.4 Regression] performance|[4.4 regression] performance
                   |regression of sse code from |regression of sse code from
                   |4.2/4.3                     |4.2/4.3
   Target Milestone|4.4.0                       |---


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (4 preceding siblings ...)
  2009-01-14 20:32 ` [Bug target/38824] [4.4 regression] " hubicka at gcc dot gnu dot org
@ 2009-01-15  0:31 ` hubicka at gcc dot gnu dot org
  2009-01-15  1:26 ` hjl dot tools at gmail dot com
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2009-01-15  0:31 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from hubicka at gcc dot gnu dot org  2009-01-15 00:30 -------
Created an attachment (id=17106)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17106&action=view)
Proposed patch

The patch makes GCC generate a movaps load followed by addps.  On Core 2 it
speeds up the testcase from 7s to 6.2s, so I guess it works as expected.

The same however does not reproduce on an AMD box, and I am not sure if it is
just a coincidence here or if Core really prefers split read-execute SSE
operations (that is not recommended by the manual).

H.J., perhaps you have some advice here?  Or at least can we do some
benchmarking?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (5 preceding siblings ...)
  2009-01-15  0:31 ` hubicka at gcc dot gnu dot org
@ 2009-01-15  1:26 ` hjl dot tools at gmail dot com
  2009-01-15  1:49 ` hubicka at ucw dot cz
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-01-15  1:26 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from hjl dot tools at gmail dot com  2009-01-15 01:25 -------
(In reply to comment #5)
>
> H.J., perhaps you have some advice here?  Or at least can we do some
> benchmarking?
> 

Joey and Xuepeng are looking into it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (6 preceding siblings ...)
  2009-01-15  1:26 ` hjl dot tools at gmail dot com
@ 2009-01-15  1:49 ` hubicka at ucw dot cz
  2009-01-23 16:19 ` [Bug target/38824] [4.4 Regression] " rguenth at gcc dot gnu dot org
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: hubicka at ucw dot cz @ 2009-01-15  1:49 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from hubicka at ucw dot cz  2009-01-15 01:49 -------
Subject: Re:  [4.4 regression] performance regression of sse code from 4.2/4.3

I guess the main difference here is that the load + addps pair generates 2
uops, while mov + loading addps generates 3, since the move has to go
through the queue.  I will try to change the testcase to fit in cache to see
whether the AMD machine reproduces it too.

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (7 preceding siblings ...)
  2009-01-15  1:49 ` hubicka at ucw dot cz
@ 2009-01-23 16:19 ` rguenth at gcc dot gnu dot org
  2009-01-24  5:12 ` xuepeng dot guo at intel dot com
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-01-23 16:19 UTC (permalink / raw)
  To: gcc-bugs



-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
            Summary|[4.4 regression] performance|[4.4 Regression] performance
                   |regression of sse code from |regression of sse code from
                   |4.2/4.3                     |4.2/4.3
   Target Milestone|---                         |4.4.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (8 preceding siblings ...)
  2009-01-23 16:19 ` [Bug target/38824] [4.4 Regression] " rguenth at gcc dot gnu dot org
@ 2009-01-24  5:12 ` xuepeng dot guo at intel dot com
  2009-01-24  9:56 ` tim at klingt dot org
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: xuepeng dot guo at intel dot com @ 2009-01-24  5:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from xuepeng dot guo at intel dot com  2009-01-24 05:12 -------
Created an attachment (id=17173)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view)
An extracted test case for this bug.

Hi tim, I extracted this test case from your website, but I can't exactly
reproduce this bug on my machine with a Core 2 Quad microprocessor. Could you
first check whether my test case is valid? Here is what I got on my machine
for your reference:

[xguo2@shgcc-10 38824]$ /home/xguo2/app/trunk/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../src/configure --enable-checking=assert --disable-bootstrap
--enable-languages=c,c++,fortran
Thread model: posix
gcc version 4.4.0 20090121 (experimental) [trunk revision 143537] (GCC)
[xguo2@shgcc-10 38824]$ /home/xguo2/app/trunk/bin/g++ -O3 -msse -mfpmath=sse
simd_unroll_benchmarks.cpp -o 44.out
[xguo2@shgcc-10 38824]$ time ./44.out

real    0m1.877s
user    0m1.876s
sys     0m0.001s
[xguo2@shgcc-10 38824]$ time ./44.out

real    0m1.877s
user    0m1.877s
sys     0m0.000s
[xguo2@shgcc-10 38824]$ time ./44.out

real    0m1.881s
user    0m1.882s
sys     0m0.000s
[xguo2@shgcc-10 38824]$ /home/xguo2/app/usr/gcc-4.2/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /net/gnu-13/export/gnu/src/gcc-4.2/gcc/configure
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld --enable-shared
--enable-threads=posix --enable-haifa --enable-checking=assert
--prefix=/usr/gcc-4.2 --with-local-prefix=/usr/local
Thread model: posix
gcc version 4.2.0
[xguo2@shgcc-10 38824]$ /home/xguo2/app/usr/gcc-4.2/bin/g++ -O3 -msse
-mfpmath=sse simd_unroll_benchmarks.cpp -o 42.out
[xguo2@shgcc-10 38824]$ time ./42.out

real    0m1.991s
user    0m1.991s
sys     0m0.000s
[xguo2@shgcc-10 38824]$ time ./42.out

real    0m1.991s
user    0m1.989s
sys     0m0.001s
[xguo2@shgcc-10 38824]$ time ./42.out

real    0m1.991s
user    0m1.990s
sys     0m0.000s
[xguo2@shgcc-10 38824]$ g++ -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-libgcj-multifile
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk
--disable-dssi --enable-plugin
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic
--host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)
[xguo2@shgcc-10 38824]$ g++ -O3 -msse -mfpmath=sse simd_unroll_benchmarks.cpp
-o 41.out
[xguo2@shgcc-10 38824]$ time ./41.out

real    0m1.465s
user    0m1.464s
sys     0m0.002s
[xguo2@shgcc-10 38824]$ time ./41.out

real    0m1.465s
user    0m1.465s
sys     0m0.000s
[xguo2@shgcc-10 38824]$ time ./41.out

real    0m1.465s
user    0m1.464s
sys     0m0.002s


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (9 preceding siblings ...)
  2009-01-24  5:12 ` xuepeng dot guo at intel dot com
@ 2009-01-24  9:56 ` tim at klingt dot org
  2009-01-24 13:14 ` tim at klingt dot org
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: tim at klingt dot org @ 2009-01-24  9:56 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from tim at klingt dot org  2009-01-24 09:56 -------
> Hi tim, I extracted this test case from your website, but I can't exactly
> reproduce this bug on my machine with a Core 2 Quad microprocessor. Could you
> first check whether my test case is valid? Here is what I got on my machine
> for your reference:

the benchmark test case looks fine.

the times on my machine:
gcc-4.2:
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m1.852s
user    0m1.829s
sys     0m0.010s
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m1.826s
user    0m1.817s
sys     0m0.002s
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m1.833s
user    0m1.826s
sys     0m0.001s

gcc-4.3:
time ./a.out 

real    0m2.062s
user    0m2.047s
sys     0m0.002s
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m2.061s
user    0m2.043s
sys     0m0.006s
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m2.101s
user    0m2.053s
sys     0m0.036s

gcc-4.4 (20090111):
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m2.536s
user    0m2.481s
sys     0m0.017s
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m2.497s
user    0m2.467s
sys     0m0.003s
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m2.539s
user    0m2.484s
sys     0m0.036s

best, tim


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (10 preceding siblings ...)
  2009-01-24  9:56 ` tim at klingt dot org
@ 2009-01-24 13:14 ` tim at klingt dot org
  2009-01-25 17:56 ` rguenth at gcc dot gnu dot org
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: tim at klingt dot org @ 2009-01-24 13:14 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from tim at klingt dot org  2009-01-24 13:14 -------
btw, i tried the proposed patch ssef, with no big performance difference:

tim@thinkpad:~/sandbox$ time ./a.out 
real    0m2.494s
user    0m2.473s
sys     0m0.002s
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m2.479s
user    0m2.475s
sys     0m0.002s
tim@thinkpad:~/sandbox$ time ./a.out 

real    0m2.501s
user    0m2.476s
sys     0m0.003s


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (11 preceding siblings ...)
  2009-01-24 13:14 ` tim at klingt dot org
@ 2009-01-25 17:56 ` rguenth at gcc dot gnu dot org
  2009-02-06  9:16 ` bonzini at gnu dot org
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-01-25 17:56 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from rguenth at gcc dot gnu dot org  2009-01-25 17:56 -------
We seem to have a lot of similar "sse performance regression" P2 bugs, can
someone make sure that there are no duplicates here?


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (12 preceding siblings ...)
  2009-01-25 17:56 ` rguenth at gcc dot gnu dot org
@ 2009-02-06  9:16 ` bonzini at gnu dot org
  2009-02-06 22:35 ` dwarak dot rajagopal at amd dot com
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2009-02-06  9:16 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #12 from bonzini at gnu dot org  2009-02-06 09:16 -------
There's another peephole2, namely from

[(set (match_operand 0 "register_operand")
      (match_operand 1 "register_operand"))
 (set (match_operand 0 "register_operand")
      (match_operator 3 "arith_or_logical_operator"
          [(match_dup 0)
           (match_operand 2 "memory_operand" "")]))]

to

[(set (match_dup 0) (match_dup 2))
 (set (match_dup 0) (match_op_dup 3 [(match_dup 0) (match_dup 1)]))]

for operands[0] != operands[1] and commutative operator 3 (i.e.
plus,mult,and,ior,xor,smin,smax,umin,umax).  Testing a patch.
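
The rewrite described above can be illustrated with a toy model on a made-up
two-address instruction list. This is not GCC code: Insn, is_mem,
is_commutative and swap_move_and_load are hypothetical names, and the check
that the memory address does not mention the destination register is a crude
textual one added for this sketch.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One instruction of a toy two-address machine: dst = dst op src.
// An empty op means a plain move: dst = src.
struct Insn {
    std::string op;   // "" for a move, otherwise e.g. "plus"
    std::string dst;  // always a register
    std::string src;  // a register name, or "[addr]" for a memory operand
};

bool is_mem(const std::string& operand) {
    return !operand.empty() && operand.front() == '[';
}

bool is_commutative(const std::string& op) {
    static const std::set<std::string> ops = {
        "plus", "mult", "and", "ior", "xor", "smin", "smax", "umin", "umax"};
    return ops.count(op) != 0;
}

// Rewrite   r0 = r1 ; r0 = r0 op mem   into   r0 = mem ; r0 = r0 op r1
// for r0 != r1 and a commutative op.
std::vector<Insn> swap_move_and_load(const std::vector<Insn>& code) {
    std::vector<Insn> out;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const bool match =
            i + 1 < code.size() &&
            code[i].op.empty() && !is_mem(code[i].src) &&  // r0 = r1
            code[i].dst != code[i].src &&                  // r0 != r1
            code[i + 1].dst == code[i].dst &&              // same r0
            is_commutative(code[i + 1].op) &&
            is_mem(code[i + 1].src) &&
            // the address must not depend on r0 (textual check only)
            code[i + 1].src.find(code[i].dst) == std::string::npos;
        if (match) {
            out.push_back({"", code[i].dst, code[i + 1].src});          // r0 = mem
            out.push_back({code[i + 1].op, code[i].dst, code[i].src});  // r0 = r0 op r1
            ++i;  // both matched instructions are consumed
        } else {
            out.push_back(code[i]);
        }
    }
    return out;
}
```

On the gcc-4.4 loop body (a move of xmm0 to xmm1 followed by an add of a
memory operand to xmm1), the model produces a load into xmm1 followed by a
register-register add of xmm0, which has the same shape as the gcc-4.2 loop
and does not need a scratch register.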


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (13 preceding siblings ...)
  2009-02-06  9:16 ` bonzini at gnu dot org
@ 2009-02-06 22:35 ` dwarak dot rajagopal at amd dot com
  2009-02-07 16:18 ` rob1weld at aol dot com
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: dwarak dot rajagopal at amd dot com @ 2009-02-06 22:35 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #13 from dwarak dot rajagopal at amd dot com  2009-02-06 22:35 -------

> The patch makes GCC generate a movaps load followed by addps.  On Core 2 it
> speeds up the testcase from 7s to 6.2s, so I guess it works as expected.
> 
> The same however does not reproduce on an AMD box, and I am not sure if it is
> just a coincidence here or if Core really prefers split read-execute SSE
> operations (that is not recommended by the manual).

fyi, AMD (amdfam10) prefers load-execute rather than having separate load and
execute instructions. 


-- 

dwarak dot rajagopal at amd dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dwarak dot rajagopal at amd
                   |                            |dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (14 preceding siblings ...)
  2009-02-06 22:35 ` dwarak dot rajagopal at amd dot com
@ 2009-02-07 16:18 ` rob1weld at aol dot com
  2009-02-08 12:36 ` hubicka at gcc dot gnu dot org
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: rob1weld at aol dot com @ 2009-02-07 16:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #14 from rob1weld at aol dot com  2009-02-07 16:18 -------
(In reply to comment #8)
> Created an attachment (id=17173)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view)
> An extracted test case for this bug.
> 
> Hi tim, I extracted this test case from your website. But I can't exactly
> ...
FWIW.

Platform i386-pc-solaris2.11 on an AMD Athlon X2 4200+:

# /usr/bin/g++ -v
Reading specs from /usr/sfw/lib/gcc/i386-pc-solaris2.11/3.4.3/specs
Configured with: /builds2/sfwnv-gate/usr/src/cmd/gcc/gcc-3.4.3/configure
--prefix=/usr/sfw --with-as=/usr/sfw/bin/gas --with-gnu-as
--with-ld=/usr/ccs/bin/ld --without-gnu-ld --enable-languages=c,c++,f77,objc
--enable-shared
Thread model: posix
gcc version 3.4.3 (csl-sol210-3_4-20050802)

# /opt/csw/gcc3/bin/g++ -v
Reading specs from /opt/csw/gcc3/lib/gcc/i386-pc-solaris2.8/3.4.5/specs
Configured with: ../sources/gcc-3.4.5/configure --prefix=/opt/csw/gcc3
--with-local-prefix=/opt/csw --with-gnu-as --with-as=/opt/csw/bin/gas
--without-gnu-ld --with-ld=/usr/ccs/bin/ld --enable-threads=posix
--enable-shared --enable-multilib --enable-nls --with-included-gettext
--with-libiconv-prefix=/opt/csw --with-x --enable-java-awt=xlib
--enable-languages=all
Thread model: posix
gcc version 3.4.5

# /opt/csw/gcc4/bin/g++ -v
Reading specs from /opt/csw/gcc4/lib/gcc/i386-pc-solaris2.8/4.0.2/specs
Target: i386-pc-solaris2.8
Configured with: ../sources/gcc-4.0.2/configure --prefix=/opt/csw/gcc4
--with-local-prefix=/opt/csw --with-gnu-as --with-as=/opt/csw/bin/gas
--without-gnu-ld --with-ld=/usr/ccs/bin/ld --enable-threads=posix
--enable-shared --enable-multilib --enable-nls --with-included-gettext
--with-libiconv-prefix=/opt/csw --with-x --enable-java-awt=xlib
--with-system-zlib --enable-languages=c,c++,f95,java,objc,ada
Thread model: posix
gcc version 4.0.2

# g++ -v
Using built-in specs.
Target: i386-pc-solaris2.11
Configured with: ../gcc_trunk/configure
--enable-languages=ada,c,c++,fortran,java,objc,obj-c++ --enable-shared
--disable-static --enable-multilib --enable-decimal-float
--with-long-double-128 --with-included-gettext --enable-stage1-checking
--enable-checking=release --with-tune=k8 --with-cpu=k8 --with-arch=k8
--with-gnu-as --with-as=/usr/local/bin/as --without-gnu-ld
--with-ld=/usr/ccs/bin/ld
Thread model: posix
gcc version 4.4.0 20090206 (experimental) [trunk revision 143992] (GCC) 


---------

# time ./3.4.3.out 
real    0m5.554s
user    0m4.144s
sys     0m0.146s

# time ./3.4.5.out 
real    0m5.669s
user    0m4.089s
sys     0m0.141s

# time ./4.0.2.out 
real    0m5.266s
user    0m4.023s
sys     0m0.132s

# time ./4.4.0.out 
real    0m5.060s
user    0m3.799s
sys     0m0.124s

---------

It seems gcc 3.4.3 (csl-sol210-3_4-20050802) is faster than gcc 3.4.5 and 
the current Trunk is ~10% faster (with all the years of progress)....

Rob


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (15 preceding siblings ...)
  2009-02-07 16:18 ` rob1weld at aol dot com
@ 2009-02-08 12:36 ` hubicka at gcc dot gnu dot org
  2009-02-08 12:40 ` hubicka at gcc dot gnu dot org
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2009-02-08 12:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #15 from hubicka at gcc dot gnu dot org  2009-02-08 12:36 -------
I tested the patch on SPECfp and core and there is not much difference.  I
guess without somehow tweaking regalloc there is not much to do about this
problem. Xuepeng, if the testcase is core2-variant sensitive, perhaps it is not
related to uops count at all? It seems to me that the bottleneck should lie
elsewhere anyway, as the testcase should be memory bound after all...

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (16 preceding siblings ...)
  2009-02-08 12:36 ` hubicka at gcc dot gnu dot org
@ 2009-02-08 12:40 ` hubicka at gcc dot gnu dot org
  2009-02-09  9:16 ` xuepeng dot guo at intel dot com
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2009-02-08 12:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #16 from hubicka at gcc dot gnu dot org  2009-02-08 12:40 -------
Since the splitting peep2 doesn't seem to be a win in general (it wins only when
copy propagation takes place afterwards) and we don't seem to understand what
really makes the testcase faster, I am unassigning myself until we get a better
idea of what is going on here.

Honza


-- 

hubicka at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|hubicka at gcc dot gnu dot  |unassigned at gcc dot gnu
                   |org                         |dot org
             Status|ASSIGNED                    |NEW


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (17 preceding siblings ...)
  2009-02-08 12:40 ` hubicka at gcc dot gnu dot org
@ 2009-02-09  9:16 ` xuepeng dot guo at intel dot com
  2009-02-09 13:36 ` bonzini at gnu dot org
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: xuepeng dot guo at intel dot com @ 2009-02-09  9:16 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #17 from xuepeng dot guo at intel dot com  2009-02-09 09:16 -------
Below is the loop from the testcase in its original form (compiled by GCC 4.4):

_Z7bench_1PfS_fj:
.LFB2309:
        shrl    $2, %edx
        shufps  $0, %xmm0, %xmm0
        subl    $1, %edx
        xorl    %eax, %eax
        addq    $1, %rdx
        salq    $4, %rdx
        .p2align 4,,10
        .p2align 3
.L11:
        movaps  %xmm0, %xmm1       
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11
        rep
        ret

The time is:

[xguo2@shgcc-10 38824]$ g++ 44.s -o orig.out
[xguo2@shgcc-10 38824]$ time ./orig.out

real    0m1.878s
user    0m1.877s
sys     0m0.000s
[xguo2@shgcc-10 38824]$ time ./orig.out

real    0m1.879s
user    0m1.879s
sys     0m0.001s
[xguo2@shgcc-10 38824]$ time ./orig.out

real    0m1.873s
user    0m1.872s
sys     0m0.001s

After adding two nops:

.L11:
        movaps  %xmm0, %xmm1
        nop
        nop
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11
        rep
        ret

The time is:
[xguo2@shgcc-10 38824]$ g++ 44.s -o 2nop.out
[xguo2@shgcc-10 38824]$ time ./2nop.out

real    0m1.762s
user    0m1.762s
sys     0m0.000s
[xguo2@shgcc-10 38824]$ time ./2nop.out

real    0m1.762s
user    0m1.762s
sys     0m0.000s
[xguo2@shgcc-10 38824]$ time ./2nop.out

real    0m1.762s
user    0m1.761s
sys     0m0.000s

I suspect that the code layout may be hurting performance.
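
For anyone wanting to repeat measurements like these, a minimal timing harness
might look as follows. It is only a sketch under assumptions: the buffer length
N and the names time_bench, in_buf and out_buf are invented here, the kernel is
the one from the original report (with a scalar fallback for non-SSE targets),
and clock() is far coarser than the performance counters the reporter used.
Differences as small as the two-nop effect are sensitive to alignment, so
adding a harness can itself shift the code layout.

```c
#include <assert.h>
#include <time.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

enum { N = 1024 };  /* hypothetical buffer length; must be a multiple of 4 */

/* 16-byte alignment is required by the aligned loads and stores below. */
_Alignas(16) float in_buf[N];
_Alignas(16) float out_buf[N];

/* Kernel from the original report: out[i] = in[i] + f, four floats at a time. */
void bench_1(float *out, const float *in, float f, unsigned int n)
{
    assert(n != 0 && n % 4 == 0);
#if defined(__SSE__)
    __m128 scalar = _mm_set_ps1(f);
    n /= 4;
    do {
        __m128 arg = _mm_load_ps(in);
        _mm_store_ps(out, _mm_add_ps(arg, scalar));
        in += 4;
        out += 4;
    } while (--n);
#else
    /* Portability fallback only; not what the thread is measuring. */
    unsigned int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] + f;
#endif
}

/* Run the kernel `reps` times and return the elapsed CPU time in seconds.
   The call goes through a volatile function pointer so the compiler cannot
   inline the kernel into the timing loop or drop repeated calls. */
double time_bench(unsigned long reps)
{
    void (*volatile kernel)(float *, const float *, float, unsigned int) = bench_1;
    clock_t start = clock();
    unsigned long r;
    for (r = 0; r < reps; r++)
        kernel(out_buf, in_buf, 1.0f, N);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_bench() with a large repetition count from main() and comparing
builds from different compilers gives numbers comparable in spirit to the
`time` runs above, though not to the cycle counts in the original report.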


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (18 preceding siblings ...)
  2009-02-09  9:16 ` xuepeng dot guo at intel dot com
@ 2009-02-09 13:36 ` bonzini at gnu dot org
  2009-02-09 13:38 ` bonzini at gnu dot org
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2009-02-09 13:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #18 from bonzini at gnu dot org  2009-02-09 13:35 -------
Xuepeng, can you test with the loop as produced by my posted patch, that is:

.L11:
        movaps  (%rsi,%rax), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11

I don't have access to new enough chips.


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bonzini at gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (19 preceding siblings ...)
  2009-02-09 13:36 ` bonzini at gnu dot org
@ 2009-02-09 13:38 ` bonzini at gnu dot org
  2009-02-10 16:29 ` dwarak dot rajagopal at amd dot com
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2009-02-09 13:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #19 from bonzini at gnu dot org  2009-02-09 13:37 -------
Also, Dwarak, here the change is not from

    addps  (%rax, %rsi), %xmm1

to

    movaps  (%rax, %rsi), %xmm0
    addps  %xmm0, %xmm1

but rather from

    movaps  %xmm0, %xmm1
    addps  (%rax, %rsi), %xmm1

to the second snippet above.  Does this pessimize on AMD too?  I don't think
so, it should be 1 uop less, but I'd rather have confirmation.
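
The swap between the two forms is only legal because addps is commutative:
either the broadcast constant or the freshly loaded data can be the destination
operand. The sketch below illustrates that at the source level. It is not part
of the patch; the function names are made up, unaligned loads are used only to
keep the example free of alignment requirements, and the source operand order
does not by itself dictate which instruction sequence GCC emits.

```c
#include <assert.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: constant first, memory operand second, mirroring
   "movaps %xmm0, %xmm1; addps (mem), %xmm1". */
void add_const_first(float out[4], const float in[4], float f)
{
#if defined(__SSE__)
    _mm_storeu_ps(out, _mm_add_ps(_mm_set_ps1(f), _mm_loadu_ps(in)));
#else
    int i;
    for (i = 0; i < 4; i++)
        out[i] = f + in[i];
#endif
}

/* Hypothetical helper: loaded data first, constant second, mirroring
   "movaps (mem), %xmm1; addps %xmm0, %xmm1". */
void add_load_first(float out[4], const float in[4], float f)
{
#if defined(__SSE__)
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(in), _mm_set_ps1(f)));
#else
    int i;
    for (i = 0; i < 4; i++)
        out[i] = in[i] + f;
#endif
}

/* Returns 1 when both operand orders give bit-identical results.
   Intended for non-NaN inputs; NaN payloads may legitimately differ. */
int orders_agree(const float in[4], float f)
{
    float a[4], b[4];
    assert(in != NULL);
    add_const_first(a, in, f);
    add_load_first(b, in, f);
    return memcmp(a, b, sizeof a) == 0;
}
```

Since the results match, the choice between the two sequences is purely a
question of which one the hardware executes faster.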


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (20 preceding siblings ...)
  2009-02-09 13:38 ` bonzini at gnu dot org
@ 2009-02-10 16:29 ` dwarak dot rajagopal at amd dot com
  2009-02-10 16:39 ` bonzini at gnu dot org
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: dwarak dot rajagopal at amd dot com @ 2009-02-10 16:29 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #20 from dwarak dot rajagopal at amd dot com  2009-02-10 16:28 -------
Paolo,
(a)   movaps  (%rax, %rsi), %xmm0
      addps  %xmm0, %xmm1

(b)   movaps  %xmm0, %xmm1
      addps  (%rax, %rsi), %xmm1

Yes, case (a) is slightly better than case (b). It shouldn't matter much though
on amdfam10 (Shanghai) processors.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (21 preceding siblings ...)
  2009-02-10 16:29 ` dwarak dot rajagopal at amd dot com
@ 2009-02-10 16:39 ` bonzini at gnu dot org
  2009-02-11  7:37 ` xuepeng dot guo at intel dot com
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2009-02-10 16:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #21 from bonzini at gnu dot org  2009-02-10 16:39 -------
So my patch should be a uniform win.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (22 preceding siblings ...)
  2009-02-10 16:39 ` bonzini at gnu dot org
@ 2009-02-11  7:37 ` xuepeng dot guo at intel dot com
  2009-02-11  8:01 ` bonzini at gnu dot org
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: xuepeng dot guo at intel dot com @ 2009-02-11  7:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #22 from xuepeng dot guo at intel dot com  2009-02-11 07:37 -------
(In reply to comment #18)
> Xuepeng, can you test with the loop as produced by my posted patch, that is:
> .L11:
>         movaps  (%rsi,%rax), %xmm0
>         addps   %xmm1, %xmm0
>         movaps  %xmm0, (%rdi,%rax)
>         addq    $16, %rax
>         cmpq    %rdx, %rax
>         jne     .L11
> I don't have access to new enough chips.

Your patch improved the performance. My machine is an
"Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz". The results are:

[xguo2@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.990s
sys     0m0.000s
[xguo2@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.991s
sys     0m0.001s
[xguo2@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.989s
sys     0m0.002s
[xguo2@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.880s
user    0m1.879s
sys     0m0.001s
[xguo2@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.878s
user    0m1.878s
sys     0m0.000s
[xguo2@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.870s
user    0m1.869s
sys     0m0.002s
[xguo2@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.690s
sys     0m0.000s
[xguo2@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.689s
sys     0m0.002s
[xguo2@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.690s
sys     0m0.000s

The only difference is:

--- 44.s        2009-02-11 15:34:57.000000000 +0800
+++ 44p.s       2009-02-11 15:34:49.000000000 +0800
@@ -102,8 +102,8 @@ _Z7bench_1PfS_fj:
        .p2align 4,,10
        .p2align 3
 .L11:
-       movaps  %xmm0, %xmm1
-       addps   (%rsi,%rax), %xmm1
+       movaps  (%rsi,%rax), %xmm1
+       addps   %xmm0, %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
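
To put the timings above in relative terms, the reduction in wall-clock time
can be computed as (before - after) / before. The helper below is a small
sketch; the function names are invented here, and the constants are the "real"
times from the first run of each binary quoted above. By that measure the
patched 4.4 build takes roughly 10% less time than unpatched 4.4 and roughly
15% less than 4.2 on this machine.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage by which `after` is shorter than `before`. */
double percent_faster(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* Print the comparison for the "real" times reported in this message. */
void print_summary(void)
{
    const double gcc42 = 1.991, gcc44 = 1.880, gcc44p = 1.690;

    printf("unpatched 4.4 vs 4.2: %.1f%% less time\n",
           percent_faster(gcc42, gcc44));
    printf("patched 4.4 vs 4.4:   %.1f%% less time\n",
           percent_faster(gcc44, gcc44p));
    printf("patched 4.4 vs 4.2:   %.1f%% less time\n",
           percent_faster(gcc42, gcc44p));
}
```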


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (23 preceding siblings ...)
  2009-02-11  7:37 ` xuepeng dot guo at intel dot com
@ 2009-02-11  8:01 ` bonzini at gnu dot org
  2009-02-11  8:14 ` ubizjak at gmail dot com
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2009-02-11  8:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #23 from bonzini at gnu dot org  2009-02-11 08:01 -------
Subject: Re:  [4.4 Regression] performance regression of
 sse code from 4.2/4.3


> [xguo2@shgcc-9 38824]$ time ./gcc-42.out
> real    0m1.991s
> 
> [xguo2@shgcc-9 38824]$ time ./gcc-44.out
> real    0m1.880s
> 
> [xguo2@shgcc-9 38824]$ time ./gcc-44p.out
> real    0m1.690s

Even though you don't observe the reporter's slowdown from 4.2/4.3 to
unpatched 4.4, I guess this makes a good case for the patch.  Ok for trunk?

Paolo


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (24 preceding siblings ...)
  2009-02-11  8:01 ` bonzini at gnu dot org
@ 2009-02-11  8:14 ` ubizjak at gmail dot com
  2009-02-11  8:58 ` bonzini at gnu dot org
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2009-02-11  8:14 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #24 from ubizjak at gmail dot com  2009-02-11 08:14 -------
(In reply to comment #23)

> Even though you don't observe the reporter's slowdown from 4.2/4.3 to
> unpatched 4.4, I guess this makes a good case for the patch.  Ok for trunk?

OK with a ChangeLog ;)

BTW: Please watch benchmarks testers [1] for a couple of days...

[1] http://gcc.gnu.org/benchmarks/


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (25 preceding siblings ...)
  2009-02-11  8:14 ` ubizjak at gmail dot com
@ 2009-02-11  8:58 ` bonzini at gnu dot org
  2009-02-12 15:45 ` hjl at gcc dot gnu dot org
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2009-02-11  8:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #25 from bonzini at gnu dot org  2009-02-11 08:57 -------
patch committed (the changelog was in gcc-patches :-).


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (26 preceding siblings ...)
  2009-02-11  8:58 ` bonzini at gnu dot org
@ 2009-02-12 15:45 ` hjl at gcc dot gnu dot org
  2009-02-16  9:15 ` bonzini at gnu dot org
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: hjl at gcc dot gnu dot org @ 2009-02-12 15:45 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #26 from hjl at gcc dot gnu dot org  2009-02-12 15:45 -------
Subject: Bug 38824

Author: hjl
Date: Thu Feb 12 15:45:20 2009
New Revision: 144129

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144129
Log:
Mention PR target/38824 in ChangeLog entries.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (27 preceding siblings ...)
  2009-02-12 15:45 ` hjl at gcc dot gnu dot org
@ 2009-02-16  9:15 ` bonzini at gnu dot org
  2009-03-12 16:01 ` hjl dot tools at gmail dot com
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2009-02-16  9:15 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #27 from bonzini at gnu dot org  2009-02-16 09:14 -------
Added bugs corresponding to the patch fallout in case distros want to backport
it (it gave quite a nice boost and probably fixed PR21676 too)


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |39152, 39196
OtherBugsDependingO|                            |21676
              nThis|                            |
Bug 38824 depends on bug 39152, which changed state.

Bug 39152 Summary: [4.4 regression] Revision 144098 breaks 416.gamess in SPEC CPU 2006
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39152

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED

Bug 38824 depends on bug 39196, which changed state.

Bug 39196 Summary: [4.4 Regression] ICE in copyprop_hardreg_forward_1, at regrename.c:1603 during libjava compile
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39196

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (28 preceding siblings ...)
  2009-02-16  9:15 ` bonzini at gnu dot org
@ 2009-03-12 16:01 ` hjl dot tools at gmail dot com
  2009-03-12 16:08 ` hjl at gcc dot gnu dot org
  2009-03-12 20:22 ` hjl dot tools at gmail dot com
  31 siblings, 0 replies; 33+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-03-12 16:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #28 from hjl dot tools at gmail dot com  2009-03-12 16:00 -------
(In reply to comment #25)
> patch committed (the changelog was in gcc-patches :-).
> 

This patch caused:

http://gcc.gnu.org/ml/gcc/2009-03/msg00340.html


-- 

hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (29 preceding siblings ...)
  2009-03-12 16:01 ` hjl dot tools at gmail dot com
@ 2009-03-12 16:08 ` hjl at gcc dot gnu dot org
  2009-03-12 20:22 ` hjl dot tools at gmail dot com
  31 siblings, 0 replies; 33+ messages in thread
From: hjl at gcc dot gnu dot org @ 2009-03-12 16:08 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #29 from hjl at gcc dot gnu dot org  2009-03-12 16:08 -------
Subject: Bug 38824

Author: hjl
Date: Thu Mar 12 16:08:02 2009
New Revision: 144817

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144817
Log:
2009-03-12  H.J. Lu  <hongjiu.lu@intel.com>

        PR target/38824
        * config/i386/i386.md: Compare REGNO on the new peephole2
        patterns.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



* [Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
  2009-01-13 11:25 [Bug target/38824] New: [4.4 regression] performance regression of sse code from 4.2/4.3 tim at klingt dot org
                   ` (30 preceding siblings ...)
  2009-03-12 16:08 ` hjl at gcc dot gnu dot org
@ 2009-03-12 20:22 ` hjl dot tools at gmail dot com
  31 siblings, 0 replies; 33+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-03-12 20:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #30 from hjl dot tools at gmail dot com  2009-03-12 20:21 -------
Fixed.


-- 

hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824



