public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug other/47167] New: Performance regression in numerical code
@ 2011-01-04 14:45 martin@mpa-garching.mpg.de
2011-01-05 14:46 ` [Bug other/47167] " martin@mpa-garching.mpg.de
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: martin@mpa-garching.mpg.de @ 2011-01-04 14:45 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
Summary: Performance regression in numerical code
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: martin@mpa-garching.mpg.de
Created attachment 22897
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22897
test case
When compiling the attached testcase on a machine with a Core 2 Duo E8500 CPU
and 64bit Linux using
gcc -O2 -fomit-frame-pointer testcase.i -lm
the results with gcc 4.5.1 are
Testing map analysis accuracy.
lmax=2047, 0 iterations, spin=0
Testing ECP grid (4096 rings, 4096 pixels/ring, 16777216 pixels)
iteration 0:
wall time for alm2map: 8.811477s
wall time for map2alm: 9.204556s
component 0: rms 1.390734e-13, maxerr 1.582512e-12
However, with current trunk one obtains
Testing map analysis accuracy.
lmax=2047, 0 iterations, spin=0
Testing ECP grid (4096 rings, 4096 pixels/ring, 16777216 pixels)
iteration 0:
wall time for alm2map: 9.518667s
wall time for map2alm: 9.780509s
component 0: rms 1.390734e-13, maxerr 1.582512e-12
The numerical result is identical, but the code generated by the more recent
compiler is noticeably slower.
Reducing the test case is unfortunately not trivial; the computational hot
spots are located in pshtd_inner_loop() and Ylmgen_recalc_Ylm_sse2().
Please let me know if I can provide further information.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
@ 2011-01-05 14:46 ` martin@mpa-garching.mpg.de
2011-01-05 17:38 ` ubizjak at gmail dot com
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: martin@mpa-garching.mpg.de @ 2011-01-05 14:46 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #1 from Martin Reinecke <martin@mpa-garching.mpg.de> 2011-01-05 14:42:20 UTC ---
Created attachment 22904
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22904
shorter test case
More compact test case; the hot spot is marked with "CRITICAL LOOP".
Compile with "gcc -O2 -fomit-frame-pointer testcase2.c -lm" and
test using "time ./a.out".
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
2011-01-05 14:46 ` [Bug other/47167] " martin@mpa-garching.mpg.de
@ 2011-01-05 17:38 ` ubizjak at gmail dot com
2011-01-05 19:50 ` ubizjak at gmail dot com
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: ubizjak at gmail dot com @ 2011-01-05 17:38 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #2 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 17:31:20 UTC ---
The only difference in the hot loop is the usage of two regs in the address:
4.5.1:
.L142:
movapd %xmm0, (%rcx)
mulpd %xmm6, %xmm2
addq $32, %rbx
movapd %xmm1, %xmm6
mulpd %xmm0, %xmm6
movsd (%rax), %xmm1
movsd 8(%rax), %xmm3
unpcklpd %xmm1, %xmm1
subpd %xmm2, %xmm6
unpcklpd %xmm3, %xmm3
mulpd %xmm9, %xmm1
mulpd %xmm0, %xmm3
movapd %xmm6, 16(%rcx)
addq $32, %rcx
movapd %xmm1, %xmm0
movsd 16(%rax), %xmm1
mulpd %xmm6, %xmm0
unpcklpd %xmm1, %xmm1
movsd 24(%rax), %xmm2
addq $32, %rax
cmpq %rsi, %rbx
unpcklpd %xmm2, %xmm2
subpd %xmm3, %xmm0
mulpd %xmm9, %xmm1
jne .L142
4.6:
.L167:
movapd %xmm0, %xmm10
.L143:
mulpd %xmm2, %xmm6
movapd %xmm3, %xmm2
movapd %xmm10, (%rsi,%rcx)
mulpd %xmm10, %xmm2
movsd (%rdx), %xmm0
movsd 8(%rdx), %xmm1
subpd %xmm6, %xmm2
unpcklpd %xmm0, %xmm0
unpcklpd %xmm1, %xmm1
mulpd %xmm11, %xmm0
movapd %xmm2, 16(%rsi,%rcx)
mulpd %xmm10, %xmm1
addq $32, %rcx
mulpd %xmm2, %xmm0
movsd 16(%rdx), %xmm3
movsd 24(%rdx), %xmm6
addq $32, %rdx
cmpq %rdi, %rcx
unpcklpd %xmm3, %xmm3
unpcklpd %xmm6, %xmm6
subpd %xmm1, %xmm0
mulpd %xmm11, %xmm3
jne .L167
Given the comment above ix86_address_cost:
/* Return cost of the memory address x.
For i386, it is better to use a complex address than let gcc copy
the address into a reg and make a new pseudo. But not if the address
requires to two regs - that would mean more pseudos with longer
lifetimes. */
this could be the reason for slowdown.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
2011-01-05 14:46 ` [Bug other/47167] " martin@mpa-garching.mpg.de
2011-01-05 17:38 ` ubizjak at gmail dot com
@ 2011-01-05 19:50 ` ubizjak at gmail dot com
2011-01-05 20:09 ` ubizjak at gmail dot com
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: ubizjak at gmail dot com @ 2011-01-05 19:50 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #3 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 19:30:49 UTC ---
> this could be the reason for slowdown.
Hm, not really.
But, by changing the generated assembly around loop entry:
$ diff -u testcase2.s testcase2_.s
--- testcase2.s 2011-01-05 20:21:01.492919294 +0100
+++ testcase2_.s 2011-01-05 20:22:23.616577277 +0100
@@ -1678,6 +1678,7 @@
addq %r15, %rdx
addq $1, %rdi
salq $5, %rdi
+ movapd %xmm10, %xmm0
jmp .L143
.p2align 4,,10
.p2align 3
@@ -1687,7 +1688,7 @@
mulpd %xmm2, %xmm6
movapd %xmm3, %xmm2
movapd %xmm10, (%rsi,%rcx)
- mulpd %xmm10, %xmm2
+ mulpd %xmm0, %xmm2
movsd (%rdx), %xmm0
movsd 8(%rdx), %xmm1
subpd %xmm6, %xmm2
The slowdown is magically fixed:
$ gcc -lm testcase2_.s
$ time ./a.out
real 0m4.041s
user 0m4.034s
sys 0m0.003s
versus:
$ gcc -lm testcase2.s
$ time ./a.out
real 0m4.239s
user 0m4.234s
sys 0m0.001s
The important change is the change of %xmm10 -> %xmm0 in the mulpd instruction.
The functionality of the test didn't change due to existing "movapd %xmm0,
%xmm10" at the top of the loop and added extra "movapd %xmm10, %xmm0" before
the loop.
This all happens on SnB, the code is generated with -O2 only.
H.J., any ideas?
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (2 preceding siblings ...)
2011-01-05 19:50 ` ubizjak at gmail dot com
@ 2011-01-05 20:09 ` ubizjak at gmail dot com
2011-01-05 22:14 ` hjl.tools at gmail dot com
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: ubizjak at gmail dot com @ 2011-01-05 20:09 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #4 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 19:48:58 UTC ---
Applying the same medicine to original test gets us from:
wall time for map2alm: 6.908527s
to
wall time for map2alm: 6.703142s
where 4.5.1 wins with:
wall time for map2alm: 6.651740s
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (3 preceding siblings ...)
2011-01-05 20:09 ` ubizjak at gmail dot com
@ 2011-01-05 22:14 ` hjl.tools at gmail dot com
2011-01-06 9:33 ` ubizjak at gmail dot com
` (6 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: hjl.tools at gmail dot com @ 2011-01-05 22:14 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> 2011-01-05 20:09:11 UTC ---
(In reply to comment #3)
> > this could be the reason for slowdown.
>
....
>
> $ gcc -lm testcase2.s
> $ time ./a.out
>
> real 0m4.239s
> user 0m4.234s
> sys 0m0.001s
>
> The important change is the change of %xmm10 -> %xmm0 in the mulpd instruction.
> The functionality of the test didn't change due to existing "movapd %xmm0,
> %xmm10" at the top of the loop and added extra "movapd %xmm10, %xmm0" before
> the loop.
>
> This all happens on SnB, the code is generated with -O2 only.
>
> H.J., any ideas?
Some loop performance is very sensitive to code sizes. This change
- mulpd %xmm10, %xmm2
+ mulpd %xmm0, %xmm2
will impact loop size due to exta REX prefix.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (4 preceding siblings ...)
2011-01-05 22:14 ` hjl.tools at gmail dot com
@ 2011-01-06 9:33 ` ubizjak at gmail dot com
2011-01-19 15:37 ` martin@mpa-garching.mpg.de
` (5 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: ubizjak at gmail dot com @ 2011-01-06 9:33 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #6 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-06 07:38:11 UTC ---
(In reply to comment #5)
> Some loop performance is very sensitive to code sizes. This change
>
> - mulpd %xmm10, %xmm2
> + mulpd %xmm0, %xmm2
>
> will impact loop size due to exta REX prefix.
Adding nop (or several of them, FWIW) around changed mulpd insn does not affect
the performance, so IMO it is not the loop layout that matters in this case.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (5 preceding siblings ...)
2011-01-06 9:33 ` ubizjak at gmail dot com
@ 2011-01-19 15:37 ` martin@mpa-garching.mpg.de
2011-01-19 16:32 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: martin@mpa-garching.mpg.de @ 2011-01-19 15:37 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #7 from Martin Reinecke <martin@mpa-garching.mpg.de> 2011-01-19 14:16:18 UTC ---
OK, I located the problematic commit, at least on the 4.5 branch: it's revision
number 167492 (fix for PR tree-optimization/46806).
Between revisions 167491 and 167492 the CPU time for the testcase2.c jumps from
4.7s to 5.4s.
Do you think that anything can be done about this regression?
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (6 preceding siblings ...)
2011-01-19 15:37 ` martin@mpa-garching.mpg.de
@ 2011-01-19 16:32 ` rguenth at gcc dot gnu.org
2011-01-19 18:10 ` martin@mpa-garching.mpg.de
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-01-19 16:32 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rguenth at gcc dot gnu.org
--- Comment #8 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-19 16:13:31 UTC ---
Can you check if the following patch solves your problem?
Index: tree-ssa-copyrename.c
===================================================================
--- tree-ssa-copyrename.c (revision 168987)
+++ tree-ssa-copyrename.c (working copy)
@@ -226,8 +226,11 @@ copy_rename_partition_coalesce (var_map
ign2 = false;
}
- /* Don't coalesce if the two variables are not of the same type. */
- if (TREE_TYPE (root1) != TREE_TYPE (root2))
+ /* Don't coalesce if the two variables are not compatible . */
+ if (!types_compatible_p (TREE_TYPE (root1), TREE_TYPE (root2))
+ || ((TREE_CODE (TREE_TYPE (root1)) == ENUMERAL_TYPE
+ || TREE_CODE (TREE_TYPE (root2)) == ENUMERAL_TYPE)
+ && TREE_TYPE (root1) != TREE_TYPE (root2)))
{
if (debug)
fprintf (debug, " : Different types. No coalesce.\n");
The differences in GIMPLE of the patch do not explain the code-differences
though, so it might be just bad luck that the patch regressed things for
you. I can see other unwanted differences though.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (7 preceding siblings ...)
2011-01-19 16:32 ` rguenth at gcc dot gnu.org
@ 2011-01-19 18:10 ` martin@mpa-garching.mpg.de
2011-01-20 10:41 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: martin@mpa-garching.mpg.de @ 2011-01-19 18:10 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #9 from Martin Reinecke <martin@mpa-garching.mpg.de> 2011-01-19 17:26:31 UTC ---
(In reply to comment #8)
> Can you check if the following patch solves your problem?
Yes, this patch gets performance back to normal on the 4.5 branch and on trunk.
Great!
> The differences in GIMPLE of the patch do not explain the code-differences
> though, so it might be just bad luck that the patch regressed things for
> you. I can see other unwanted differences though.
I would of course be happy if the code generation would be less "erratic", and
if the nice performance I'm seeing does not depend on my luck ;)
So if I can do anything to help optimizing this kind of code more consistently,
please let me know! Of course, I'm more into numerics than into compiler
writing ...
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (8 preceding siblings ...)
2011-01-19 18:10 ` martin@mpa-garching.mpg.de
@ 2011-01-20 10:41 ` rguenth at gcc dot gnu.org
2011-01-20 10:42 ` rguenth at gcc dot gnu.org
2011-01-20 10:43 ` rguenth at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-01-20 10:41 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-20 10:33:18 UTC ---
Author: rguenth
Date: Thu Jan 20 10:33:15 2011
New Revision: 169050
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169050
Log:
2011-01-20 Richard Guenther <rguenther@suse.de>
PR tree-optimization/47167
* tree-ssa-copyrename.c (copy_rename_partition_coalesce):
Revert previous change, only avoid enumeral type changes.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-ssa-copyrename.c
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (9 preceding siblings ...)
2011-01-20 10:41 ` rguenth at gcc dot gnu.org
@ 2011-01-20 10:42 ` rguenth at gcc dot gnu.org
2011-01-20 10:43 ` rguenth at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-01-20 10:42 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #11 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-20 10:36:32 UTC ---
Author: rguenth
Date: Thu Jan 20 10:36:29 2011
New Revision: 169051
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169051
Log:
2011-01-20 Richard Guenther <rguenther@suse.de>
PR tree-optimization/47167
* tree-ssa-copyrename.c (copy_rename_partition_coalesce):
Revert previous change, only avoid enumeral type changes.
Modified:
branches/gcc-4_5-branch/gcc/ChangeLog
branches/gcc-4_5-branch/gcc/tree-ssa-copyrename.c
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug other/47167] Performance regression in numerical code
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
` (10 preceding siblings ...)
2011-01-20 10:42 ` rguenth at gcc dot gnu.org
@ 2011-01-20 10:43 ` rguenth at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-01-20 10:43 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |RESOLVED
Resolution| |FIXED
--- Comment #12 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-20 10:36:50 UTC ---
Fixed.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2011-01-20 10:37 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-04 14:45 [Bug other/47167] New: Performance regression in numerical code martin@mpa-garching.mpg.de
2011-01-05 14:46 ` [Bug other/47167] " martin@mpa-garching.mpg.de
2011-01-05 17:38 ` ubizjak at gmail dot com
2011-01-05 19:50 ` ubizjak at gmail dot com
2011-01-05 20:09 ` ubizjak at gmail dot com
2011-01-05 22:14 ` hjl.tools at gmail dot com
2011-01-06 9:33 ` ubizjak at gmail dot com
2011-01-19 15:37 ` martin@mpa-garching.mpg.de
2011-01-19 16:32 ` rguenth at gcc dot gnu.org
2011-01-19 18:10 ` martin@mpa-garching.mpg.de
2011-01-20 10:41 ` rguenth at gcc dot gnu.org
2011-01-20 10:42 ` rguenth at gcc dot gnu.org
2011-01-20 10:43 ` rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).