[Bug other/47167] New: Performance regression in numerical code

public inbox for gcc-bugs@sourceware.org

* [Bug other/47167] New: Performance regression in numerical code
@ 2011-01-04 14:45 martin@mpa-garching.mpg.de
  2011-01-05 14:46 ` [Bug other/47167] " martin@mpa-garching.mpg.de
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: martin@mpa-garching.mpg.de @ 2011-01-04 14:45 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

           Summary: Performance regression in numerical code
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: martin@mpa-garching.mpg.de


Created attachment 22897
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22897
test case

When compiling the attached testcase on a machine with a Core 2 Duo E8500 CPU
and 64-bit Linux using

gcc -O2 -fomit-frame-pointer testcase.i -lm

the results with gcc 4.5.1 are

Testing map analysis accuracy.
lmax=2047, 0 iterations, spin=0

Testing ECP grid (4096 rings, 4096 pixels/ring, 16777216 pixels)

iteration 0:
wall time for alm2map: 8.811477s
wall time for map2alm: 9.204556s
component 0: rms 1.390734e-13, maxerr 1.582512e-12

However, with current trunk one obtains

Testing map analysis accuracy.
lmax=2047, 0 iterations, spin=0

Testing ECP grid (4096 rings, 4096 pixels/ring, 16777216 pixels)

iteration 0:
wall time for alm2map: 9.518667s
wall time for map2alm: 9.780509s
component 0: rms 1.390734e-13, maxerr 1.582512e-12

The numerical results are identical, but the code generated by the more recent
compiler is noticeably slower: roughly 8% for alm2map and 6% for map2alm.
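
To put a number on the gap between the two runs above, the wall times can be turned into a relative slowdown. This is only a small sketch; the helper name slowdown_percent is made up for illustration and is not part of the test case.

```c
#include <assert.h>

/* Relative slowdown of `after_s` against `before_s`, in percent.
   Both arguments are wall times in seconds; `before_s` must be positive. */
double slowdown_percent(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return (after_s - before_s) / before_s * 100.0;
}
```

With the timings quoted above, slowdown_percent(8.811477, 9.518667) comes out at about 8.0 for alm2map and slowdown_percent(9.204556, 9.780509) at about 6.3 for map2alm.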

Reducing the test case is unfortunately not trivial; the computational hot
spots are located in pshtd_inner_loop() and Ylmgen_recalc_Ylm_sse2().

Please let me know if I can provide further information.


* [Bug other/47167] Performance regression in numerical code
From: martin@mpa-garching.mpg.de @ 2011-01-05 14:46 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #1 from Martin Reinecke <martin@mpa-garching.mpg.de> 2011-01-05 14:42:20 UTC ---
Created attachment 22904
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22904
shorter test case

More compact test case; the hot spot is marked with "CRITICAL LOOP".
Compile with "gcc -O2 -fomit-frame-pointer testcase2.c -lm" and
test using "time ./a.out".


* [Bug other/47167] Performance regression in numerical code
From: ubizjak at gmail dot com @ 2011-01-05 17:38 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #2 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 17:31:20 UTC ---
The only difference in the hot loop is the use of two registers in the store address:

4.5.1:

.L142:
    movapd    %xmm0, (%rcx)
    mulpd    %xmm6, %xmm2
    addq    $32, %rbx
    movapd    %xmm1, %xmm6
    mulpd    %xmm0, %xmm6
    movsd    (%rax), %xmm1
    movsd    8(%rax), %xmm3
    unpcklpd    %xmm1, %xmm1
    subpd    %xmm2, %xmm6
    unpcklpd    %xmm3, %xmm3
    mulpd    %xmm9, %xmm1
    mulpd    %xmm0, %xmm3
    movapd    %xmm6, 16(%rcx)
    addq    $32, %rcx
    movapd    %xmm1, %xmm0
    movsd    16(%rax), %xmm1
    mulpd    %xmm6, %xmm0
    unpcklpd    %xmm1, %xmm1
    movsd    24(%rax), %xmm2
    addq    $32, %rax
    cmpq    %rsi, %rbx
    unpcklpd    %xmm2, %xmm2
    subpd    %xmm3, %xmm0
    mulpd    %xmm9, %xmm1
    jne    .L142

4.6:

.L167:
    movapd    %xmm0, %xmm10
.L143:
    mulpd    %xmm2, %xmm6
    movapd    %xmm3, %xmm2
    movapd    %xmm10, (%rsi,%rcx)
    mulpd    %xmm10, %xmm2
    movsd    (%rdx), %xmm0
    movsd    8(%rdx), %xmm1
    subpd    %xmm6, %xmm2
    unpcklpd    %xmm0, %xmm0
    unpcklpd    %xmm1, %xmm1
    mulpd    %xmm11, %xmm0
    movapd    %xmm2, 16(%rsi,%rcx)
    mulpd    %xmm10, %xmm1
    addq    $32, %rcx
    mulpd    %xmm2, %xmm0
    movsd    16(%rdx), %xmm3
    movsd    24(%rdx), %xmm6
    addq    $32, %rdx
    cmpq    %rdi, %rcx
    unpcklpd    %xmm3, %xmm3
    unpcklpd    %xmm6, %xmm6
    subpd    %xmm1, %xmm0
    mulpd    %xmm11, %xmm3
    jne    .L167

Given the comment above ix86_address_cost:

/* Return cost of the memory address x.
   For i386, it is better to use a complex address than let gcc copy
   the address into a reg and make a new pseudo.  But not if the address
   requires to two regs - that would mean more pseudos with longer
   lifetimes.  */

this could be the reason for slowdown.


* [Bug other/47167] Performance regression in numerical code
From: ubizjak at gmail dot com @ 2011-01-05 19:50 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #3 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 19:30:49 UTC ---
> this could be the reason for slowdown.

Hm, not really.

But, by changing the generated assembly around loop entry:

$ diff -u testcase2.s testcase2_.s
--- testcase2.s    2011-01-05 20:21:01.492919294 +0100
+++ testcase2_.s    2011-01-05 20:22:23.616577277 +0100
@@ -1678,6 +1678,7 @@
     addq    %r15, %rdx
     addq    $1, %rdi
     salq    $5, %rdi
+    movapd    %xmm10, %xmm0
     jmp    .L143
     .p2align 4,,10
     .p2align 3
@@ -1687,7 +1688,7 @@
     mulpd    %xmm2, %xmm6
     movapd    %xmm3, %xmm2
     movapd    %xmm10, (%rsi,%rcx)
-    mulpd    %xmm10, %xmm2
+    mulpd    %xmm0, %xmm2
     movsd    (%rdx), %xmm0
     movsd    8(%rdx), %xmm1
     subpd    %xmm6, %xmm2

The slowdown is magically fixed:

$ gcc -lm testcase2_.s
$ time ./a.out

real    0m4.041s
user    0m4.034s
sys    0m0.003s

versus:

$ gcc -lm testcase2.s
$ time ./a.out

real    0m4.239s
user    0m4.234s
sys    0m0.001s

The important change is the switch from %xmm10 to %xmm0 in the mulpd instruction.
The behavior of the test is unchanged because of the existing "movapd    %xmm0,
%xmm10" at the top of the loop and the added "movapd    %xmm10, %xmm0" before
the loop.

All of this was measured on SnB (Sandy Bridge); the code was generated with -O2 only.

H.J., any ideas?


* [Bug other/47167] Performance regression in numerical code
From: ubizjak at gmail dot com @ 2011-01-05 20:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #4 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 19:48:58 UTC ---
Applying the same medicine to the original test gets us from:

wall time for map2alm: 6.908527s

to

wall time for map2alm: 6.703142s

where 4.5.1 wins with:

wall time for map2alm: 6.651740s


* [Bug other/47167] Performance regression in numerical code
From: hjl.tools at gmail dot com @ 2011-01-05 22:14 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> 2011-01-05 20:09:11 UTC ---
(In reply to comment #3)
> > this could be the reason for slowdown.
> 
....
> 
> $ gcc -lm testcase2.s
> $ time ./a.out
> 
> real    0m4.239s
> user    0m4.234s
> sys    0m0.001s
> 
> The important change is the change of %xmm10 -> %xmm0 in the mulpd instruction.
> The functionality of the test didn't change due to existing "movapd    %xmm0,
> %xmm10" at the top of the loop and added extra "movapd    %xmm10, %xmm0" before
> the loop.
> 
> This all happens on SnB, the code is generated with -O2 only.
> 
> H.J., any ideas?

The performance of some loops is very sensitive to code size.  This change

-    mulpd    %xmm10, %xmm2
+    mulpd    %xmm0, %xmm2

will impact loop size due to the extra REX prefix.
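
As a rough illustration of the size difference, the register-to-register form of MULPD (opcode 66 0F 59 /r) needs a REX prefix byte as soon as one operand is %xmm8 or higher. The sketch below encodes only that one form; the function name encode_mulpd_rr is made up for this example and comes from neither GCC nor binutils.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode "mulpd %xmm<src>, %xmm<dst>" (AT&T operand order), register form only.
   dst and src are XMM register numbers 0..15; out must hold at least 5 bytes.
   Returns the number of bytes written. */
size_t encode_mulpd_rr(unsigned dst, unsigned src, uint8_t out[5])
{
    assert(dst < 16 && src < 16);

    size_t n = 0;
    uint8_t rex = 0x40;

    out[n++] = 0x66;            /* mandatory prefix: packed double variant */
    if (dst >= 8)
        rex |= 0x04;            /* REX.R extends the ModRM.reg field */
    if (src >= 8)
        rex |= 0x01;            /* REX.B extends the ModRM.rm field */
    if (rex != 0x40)
        out[n++] = rex;         /* emitted only when an extended register is used */
    out[n++] = 0x0F;            /* two-byte opcode escape */
    out[n++] = 0x59;            /* MULPD */
    out[n++] = (uint8_t)(0xC0 | ((dst & 7u) << 3) | (src & 7u)); /* ModRM, mod=11 */
    return n;
}
```

Under this encoding, "mulpd %xmm10, %xmm2" is the five bytes 66 41 0F 59 D2, while "mulpd %xmm0, %xmm2" is the four bytes 66 0F 59 D0, so the edited loop body is one byte shorter at that instruction.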


* [Bug other/47167] Performance regression in numerical code
From: ubizjak at gmail dot com @ 2011-01-06  9:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #6 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-06 07:38:11 UTC ---
(In reply to comment #5)

> Some loop performance is very sensitive to code sizes.  This change
> 
> -    mulpd    %xmm10, %xmm2
> +    mulpd    %xmm0, %xmm2
> 
> will impact loop size due to the extra REX prefix.

Adding a nop (or several of them, FWIW) around the changed mulpd insn does not
affect performance, so IMO it is not the loop layout that matters in this case.


* [Bug other/47167] Performance regression in numerical code
From: martin@mpa-garching.mpg.de @ 2011-01-19 15:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #7 from Martin Reinecke <martin@mpa-garching.mpg.de> 2011-01-19 14:16:18 UTC ---
OK, I located the problematic commit, at least on the 4.5 branch: it's revision
number 167492 (fix for PR tree-optimization/46806).

Between revisions 167491 and 167492 the CPU time for testcase2.c jumps from
4.7s to 5.4s.

Do you think that anything can be done about this regression?


* [Bug other/47167] Performance regression in numerical code
From: rguenth at gcc dot gnu.org @ 2011-01-19 16:32 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #8 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-19 16:13:31 UTC ---
Can you check if the following patch solves your problem?

Index: tree-ssa-copyrename.c
===================================================================
--- tree-ssa-copyrename.c       (revision 168987)
+++ tree-ssa-copyrename.c       (working copy)
@@ -226,8 +226,11 @@ copy_rename_partition_coalesce (var_map
       ign2 = false;
     }

-  /* Don't coalesce if the two variables are not of the same type.  */
-  if (TREE_TYPE (root1) != TREE_TYPE (root2))
+  /* Don't coalesce if the two variables are not compatible .  */
+  if (!types_compatible_p (TREE_TYPE (root1), TREE_TYPE (root2))
+      || ((TREE_CODE (TREE_TYPE (root1)) == ENUMERAL_TYPE
+          || TREE_CODE (TREE_TYPE (root2)) == ENUMERAL_TYPE)
+         && TREE_TYPE (root1) != TREE_TYPE (root2)))
     {
       if (debug)
        fprintf (debug, " : Different types.  No coalesce.\n");


The GIMPLE differences caused by the patch do not explain the differences in
the generated code, though, so it might be just bad luck that the patch
regressed things for you.  I can see other unwanted differences though.
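
The condition in the patch can be modelled outside of GCC to see which pairs of types it lets through. What follows is a toy sketch, not GCC code: struct toy_type and toy_types_compatible are invented stand-ins, and the compatibility rule is a deliberate simplification of what types_compatible_p really checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a type node; only the fields this sketch needs. */
enum toy_kind { TOY_INTEGER, TOY_ENUMERAL, TOY_REAL };

struct toy_type {
    enum toy_kind kind;
    unsigned precision;     /* width in bits */
    bool is_unsigned;
};

/* Hypothetical simplification of types_compatible_p: integral kinds are
   compatible when precision and signedness match, reals when precision
   matches.  The real predicate has many more rules. */
bool toy_types_compatible(const struct toy_type *a, const struct toy_type *b)
{
    assert(a != NULL && b != NULL);
    if (a == b)
        return true;
    bool a_real = (a->kind == TOY_REAL);
    bool b_real = (b->kind == TOY_REAL);
    if (a_real != b_real)
        return false;
    if (a_real)
        return a->precision == b->precision;
    return a->precision == b->precision && a->is_unsigned == b->is_unsigned;
}

/* Models the check the patch removes: only identical type nodes coalesce. */
bool strict_may_coalesce(const struct toy_type *a, const struct toy_type *b)
{
    return a == b;
}

/* Models the check the patch adds: compatible types may coalesce, except
   that an enumeral type still requires the identical node on both sides. */
bool patched_may_coalesce(const struct toy_type *a, const struct toy_type *b)
{
    if (!toy_types_compatible(a, b))
        return false;
    if ((a->kind == TOY_ENUMERAL || b->kind == TOY_ENUMERAL) && a != b)
        return false;
    return true;
}
```

In this model two distinct but compatible 64-bit real type nodes are rejected by strict_may_coalesce and accepted by patched_may_coalesce, while an enumeral type paired with a same-width integer type is rejected by both.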


* [Bug other/47167] Performance regression in numerical code
From: martin@mpa-garching.mpg.de @ 2011-01-19 18:10 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #9 from Martin Reinecke <martin@mpa-garching.mpg.de> 2011-01-19 17:26:31 UTC ---
(In reply to comment #8)
> Can you check if the following patch solves your problem?

Yes, this patch gets performance back to normal on the 4.5 branch and on trunk.
Great!

> The differences in GIMPLE of the patch do not explain the code-differences
> though, so it might be just bad luck that the patch regressed things for
> you.  I can see other unwanted differences though.

I would of course be happy if the code generation were less "erratic", and
if the nice performance I'm seeing did not depend on my luck ;)
So if I can do anything to help optimize this kind of code more consistently,
please let me know! Of course, I'm more into numerics than into compiler
writing ...


* [Bug other/47167] Performance regression in numerical code
From: rguenth at gcc dot gnu.org @ 2011-01-20 10:41 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-20 10:33:18 UTC ---
Author: rguenth
Date: Thu Jan 20 10:33:15 2011
New Revision: 169050

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169050
Log:
2011-01-20  Richard Guenther  <rguenther@suse.de>

    PR tree-optimization/47167
    * tree-ssa-copyrename.c (copy_rename_partition_coalesce):
    Revert previous change, only avoid enumeral type changes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-copyrename.c


* [Bug other/47167] Performance regression in numerical code
From: rguenth at gcc dot gnu.org @ 2011-01-20 10:42 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #11 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-20 10:36:32 UTC ---
Author: rguenth
Date: Thu Jan 20 10:36:29 2011
New Revision: 169051

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169051
Log:
2011-01-20  Richard Guenther  <rguenther@suse.de>

    PR tree-optimization/47167
    * tree-ssa-copyrename.c (copy_rename_partition_coalesce):
    Revert previous change, only avoid enumeral type changes.

Modified:
    branches/gcc-4_5-branch/gcc/ChangeLog
    branches/gcc-4_5-branch/gcc/tree-ssa-copyrename.c


* [Bug other/47167] Performance regression in numerical code
From: rguenth at gcc dot gnu.org @ 2011-01-20 10:43 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |FIXED

--- Comment #12 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-20 10:36:50 UTC ---
Fixed.

