public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
@ 2012-03-02 7:03 M8R-ynb11d at mailinator dot com
2012-03-02 7:12 ` [Bug tree-optimization/52459] " M8R-ynb11d at mailinator dot com
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: M8R-ynb11d at mailinator dot com @ 2012-03-02 7:03 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459
Bug #: 52459
Summary: [x86] loop vectorization performance very bad (worse
than -O0) when using sse4.2 popcnt
Classification: Unclassified
Product: gcc
Version: 4.6.3
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: M8R-ynb11d@mailinator.com
Created attachment 26808
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26808
testcase
gcc 4.6.3 on x86_64-unknown-linux-gnu, running on Core i7 2600K (Sandy Bridge)
The attached testcase simply exercises the popcnt instruction over every
unsigned int and creates a histogram. But with -O2 -ftree-vectorize or with
-O3, the vectorizer adds two popcnt instructions per loop iteration, which
makes performance worse than the unoptimized version, and about 3x slower than
-Os.
Here are the timings and the resulting asm of the loop:
With -O0 -m32 -msse4.2: [7.40 seconds]
.L2:
        mov     eax, DWORD PTR [ebp-12]
        add     DWORD PTR [ebp-12], 1
        popcnt  eax, eax
        mov     edx, DWORD PTR [ebp-144+eax*4]
        add     edx, 1
        mov     DWORD PTR [ebp-144+eax*4], edx
        cmp     DWORD PTR [ebp-12], 0
        jne     .L2

With -O1 -m32 -msse4.2: [2.90 seconds]
.L2:
        lea     edx, [eax+1]
        popcnt  eax, eax
        add     DWORD PTR [esp+12+eax*4], 1
        mov     eax, edx
        test    edx, edx
        jne     .L2

With -O2 -m32 -msse4.2: [2.91 seconds]
.L5:
        popcnt  edx, eax
        mov     ecx, DWORD PTR [esp+12+edx*4]
        add     eax, 1
.L3:
        add     ecx, 1
        test    eax, eax
        mov     DWORD PTR [esp+12+edx*4], ecx
        jne     .L5

With -Os -m32 -msse4.2: [2.82 seconds]
.L2:
        popcnt  edx, eax
        inc     DWORD PTR [ebp-136+edx*4]
        inc     eax
        jne     .L2

With -O3 -m32 -msse4.2: [8.45 seconds]
.L5:
        popcnt  edx, eax
        mov     edx, DWORD PTR [esp+edx*4]
.L3:
        popcnt  ecx, eax
        add     edx, 1
        add     eax, 1
        mov     DWORD PTR [esp+ecx*4], edx
        jne     .L5

Things are about the same (relatively) with -m64 but somewhat slower, I'm
assuming due to the extra edx -> rdx sign extension step.
* [Bug tree-optimization/52459] [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
From: M8R-ynb11d at mailinator dot com @ 2012-03-02 7:12 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459
--- Comment #1 from M8R-ynb11d at mailinator dot com 2012-03-02 07:11:47 UTC ---
Similar (but much slower) results when not using SSE and using the libgcc
library version of __builtin_popcount:
-O0: 22.55 secs
-O1: 20.57 secs
-O2: 22.48 secs
-Os: 22.81 secs
-O3: 45.17 secs
* [Bug tree-optimization/52459] [x86] loop performance very bad (worse than -O0) when using sse4.2 popcnt
From: pinskia at gcc dot gnu.org @ 2012-03-02 7:20 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[x86] loop vectorization    |[x86] loop performance very
                   |performance very bad (worse |bad (worse than -O0) when
                   |than -O0) when using sse4.2 |using sse4.2 popcnt
                   |popcnt                      |
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> 2012-03-02 07:19:49 UTC ---
This has nothing to do with the vectorizer but rather PPRE.
* [Bug tree-optimization/52459] PPRE performs stupid inserts
From: rguenth at gcc dot gnu.org @ 2012-03-02 10:01 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459
Richard Guenther <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2012-03-02
         AssignedTo|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
            Summary|[x86] loop performance very |PPRE performs stupid
                   |bad (worse than -O0) when   |inserts
                   |using sse4.2 popcnt         |
     Ever Confirmed|0                           |1
      Known to fail|                            |4.7.0
--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-02 10:00:38 UTC ---
I will have a look.
* [Bug tree-optimization/52459] PRE performs stupid inserts
From: rguenth at gcc dot gnu.org @ 2012-03-21 10:45 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459
Richard Guenther <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu.org
            Summary|PPRE performs stupid        |PRE performs stupid inserts
                   |inserts                     |
--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-21 10:41:04 UTC ---
I have a patch. The issue is that we inhibit PHI insertion for the builtin
call at -O3 (when -ftree-vectorize is on), not partial-PRE. The stupid inserts
really happen because the inhibited PRE insertion leaves only part of the
dependent expressions available, while some uses still get eliminated, so
the PHI node that was inserted anyway is not removed ... which means
that limiting 'PHI insertion' is bogus - we should really limit what
we PHI translate, so that dependent expressions do not become partially
available. Not sure if we know at that point whether we'd need a PHI node,
but at least the edges on which we eventually want to restrict translation
are easy to spot - the latch edges.
* [Bug tree-optimization/52459] PRE performs stupid inserts
From: rguenth at gcc dot gnu.org @ 2012-03-22 7:33 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459
--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-22 07:29:39 UTC ---
Author: rguenth
Date: Thu Mar 22 07:29:30 2012
New Revision: 185676

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185676
Log:
2012-03-22  Richard Guenther  <rguenther@suse.de>

    PR tree-optimization/52459
    * tree-ssa-pre.c (inhibit_phi_insertion): Do not inhibit
    PHI insertion for calls.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c
* [Bug tree-optimization/52459] PRE performs stupid inserts
From: rguenth at gcc dot gnu.org @ 2012-03-22 7:41 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459
Richard Guenther <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED
--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-22 07:32:56 UTC ---
Fixed for 4.8.