public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
From: M8R-ynb11d at mailinator dot com @ 2012-03-02  7:03 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

             Bug #: 52459
           Summary: [x86] loop vectorization performance very bad (worse
                    than -O0) when using sse4.2 popcnt
    Classification: Unclassified
           Product: gcc
           Version: 4.6.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: M8R-ynb11d@mailinator.com

Created attachment 26808
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26808
testcase

gcc 4.6.3 on x86_64-unknown-linux-gnu, running on Core i7 2600K (Sandy Bridge)

The attached testcase simply exercises the popcnt instruction over every
unsigned int and creates a histogram. But with -O2 -ftree-vectorize or with
-O3, the vectorizer adds two popcnt instructions per loop iteration, which
makes performance worse than the unoptimized version, and about 3x slower
than -Os.
Here's the timings and the resulting asm of the loop:

With -O0 -m32 -msse4.2: [7.40 seconds]

    .L2:
        mov     eax, DWORD PTR [ebp-12]
        add     DWORD PTR [ebp-12], 1
        popcnt  eax, eax
        mov     edx, DWORD PTR [ebp-144+eax*4]
        add     edx, 1
        mov     DWORD PTR [ebp-144+eax*4], edx
        cmp     DWORD PTR [ebp-12], 0
        jne     .L2

With -O1 -m32 -msse4.2: [2.90 seconds]

    .L2:
        lea     edx, [eax+1]
        popcnt  eax, eax
        add     DWORD PTR [esp+12+eax*4], 1
        mov     eax, edx
        test    edx, edx
        jne     .L2

With -O2 -m32 -msse4.2: [2.91 seconds]

    .L5:
        popcnt  edx, eax
        mov     ecx, DWORD PTR [esp+12+edx*4]
        add     eax, 1
    .L3:
        add     ecx, 1
        test    eax, eax
        mov     DWORD PTR [esp+12+edx*4], ecx
        jne     .L5

With -Os -m32 -msse4.2: [2.82 seconds]

    .L2:
        popcnt  edx, eax
        inc     DWORD PTR [ebp-136+edx*4]
        inc     eax
        jne     .L2

With -O3 -m32 -msse4.2: [8.45 seconds]

    .L5:
        popcnt  edx, eax
        mov     edx, DWORD PTR [esp+edx*4]
    .L3:
        popcnt  ecx, eax
        add     edx, 1
        add     eax, 1
        mov     DWORD PTR [esp+ecx*4], edx
        jne     .L5

Things are about the same (relatively) with -m64 but somewhat slower, I'm
assuming due to the extra edx -> rdx sign extension step.
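Attachment 26808 is not reproduced in this thread, so the following is only a guess at what such a testcase might look like, reconstructed from the description and the listings above. The function name `popcount_histogram`, the `mask` parameter and the constant `HIST_BINS` are hypothetical; the 33-entry histogram assumes a 32-bit `unsigned int`. The `mask` parameter exists only so that the loop can be run over a short range as well as over every `unsigned int`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reconstruction; the real attachment may differ.
   A 32-bit value has between 0 and 32 set bits, hence 33 bins. */
enum { HIST_BINS = 33 };

/* Count how many values in [0, mask] have each possible popcount.
   `mask` must be of the form 2^k - 1.  With mask == 0xFFFFFFFFu the
   loop ends only when `i` wraps around to 0, which matches the
   increment-then-"jne" exit test seen in the listings in the report. */
void popcount_histogram(unsigned int mask, unsigned int hist[HIST_BINS])
{
    unsigned int i = 0;

    assert(hist != NULL);
    do {
        hist[__builtin_popcount(i)]++;
        i++;
    } while ((i & mask) != 0);
}
```

Calling `popcount_histogram(0xFFFFFFFFu, hist)` on a zeroed array and building with the flags quoted in the report (for example `-O3 -m32 -msse4.2`) should exercise the same kind of loop; whether it reproduces the exact timings depends on the compiler version and hardware.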
* [Bug tree-optimization/52459] [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
From: M8R-ynb11d at mailinator dot com @ 2012-03-02  7:12 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

--- Comment #1 from M8R-ynb11d at mailinator dot com 2012-03-02 07:11:47 UTC ---
Similar (but much slower) results when not using SSE and using the libgcc
library version of __builtin_popcount:

    -O0: 22.55 secs
    -O1: 20.57 secs
    -O2: 22.48 secs
    -Os: 22.81 secs
    -O3: 45.17 secs
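For readers unfamiliar with the non-SSE path: when the popcnt instruction is not enabled, `__builtin_popcount` has to be computed in software, which per the comment above means a libgcc call on each iteration. The sketch below shows one well-known software technique (a parallel bit count); it is an illustration only and is not claimed to be libgcc's actual code. The names `popcount32_sw` and `popcount32_sw_selfcheck` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative software popcount for 32-bit values. */
uint32_t popcount32_sw(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit partial sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit partial sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit partial sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* Cross-check the sketch against the compiler builtin on a small range. */
void popcount32_sw_selfcheck(void)
{
    for (uint32_t v = 0; v < 1024; v++)
        assert(popcount32_sw(v) == (uint32_t)__builtin_popcount(v));
}
```

Either way the software version costs many more instructions per call than a single popcnt, so computing it twice per iteration is correspondingly more expensive, which is in line with the -O3 figure roughly doubling above.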
* [Bug tree-optimization/52459] [x86] loop performance very bad (worse than -O0) when using sse4.2 popcnt
From: pinskia at gcc dot gnu.org @ 2012-03-02  7:20 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[x86] loop vectorization    |[x86] loop performance very
                   |performance very bad (worse |bad (worse than -O0) when
                   |than -O0) when using sse4.2 |using sse4.2 popcnt
                   |popcnt                      |

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> 2012-03-02 07:19:49 UTC ---
This has nothing to do with the vectorizer but rather PPRE.
* [Bug tree-optimization/52459] PPRE performs stupid inserts
From: rguenth at gcc dot gnu.org @ 2012-03-02 10:01 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2012-03-02
         AssignedTo|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
            Summary|[x86] loop performance very |PPRE performs stupid
                   |bad (worse than -O0) when   |inserts
                   |using sse4.2 popcnt         |
     Ever Confirmed|0                           |1
      Known to fail|                            |4.7.0

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-02 10:00:38 UTC ---
I will have a look.
* [Bug tree-optimization/52459] PRE performs stupid inserts
From: rguenth at gcc dot gnu.org @ 2012-03-21 10:45 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu.org
            Summary|PPRE performs stupid        |PRE performs stupid inserts
                   |inserts                     |

--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-21 10:41:04 UTC ---
I have a patch. The issue is that we inhibit PHI insertion for the builtin
call at -O3 (when -ftree-vectorize is on), not partial-PRE.

The stupid inserts are really caused by the inhibited PRE insertion: it makes
only part of the dependent expressions available, but some uses still get
eliminated, so the PHI node that is still inserted is not removed ... which
means that limiting 'PHI insertion' is bogus - we should really limit what we
PHI translate, so as not to cause dependent expressions to be partially
available. Not sure if we know at that point whether we'd need a PHI node,
but at least the edges where we eventually want to restrict translation are
easy to spot - the latch edges.
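To make the effect described in comment #4 concrete, here is a hand-written C approximation of the two loop shapes. It is a reading of the -O3 assembly quoted in the original report, not GCC's actual intermediate code: the histogram load for the next iteration is carried around the loop (the inserted PHI), but the popcount call itself is not, so it gets computed once for the load index and once for the store index. The names `histogram_plain`, `histogram_o3_shape`, `HIST_BINS` and the `count` parameter are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { HIST_BINS = 33 };  /* popcounts 0..32, assuming 32-bit unsigned int */

/* Straightforward loop: one popcount per iteration. */
void histogram_plain(unsigned int count, unsigned int hist[HIST_BINS])
{
    assert(hist != NULL);
    for (unsigned int i = 0; i < count; i++)
        hist[__builtin_popcount(i)]++;
}

/* Approximate source-level shape of the -O3 loop from the report:
   the loaded value is carried across the back edge, while the
   popcount is evaluated twice per iteration. */
void histogram_o3_shape(unsigned int count, unsigned int hist[HIST_BINS])
{
    unsigned int i = 0;
    unsigned int loaded;

    assert(hist != NULL);
    if (count == 0)
        return;

    loaded = hist[__builtin_popcount(i)];          /* load before the loop */
    for (;;) {
        hist[__builtin_popcount(i)] = loaded + 1;  /* popcount for the store */
        i++;
        if (i == count)
            break;
        loaded = hist[__builtin_popcount(i)];      /* popcount for the load */
    }
}
```

Both functions produce the same histogram; the second simply does redundant work on each iteration, which is consistent with the slowdown reported at -O3.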
* [Bug tree-optimization/52459] PRE performs stupid inserts
From: rguenth at gcc dot gnu.org @ 2012-03-22  7:33 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-22 07:29:39 UTC ---
Author: rguenth
Date: Thu Mar 22 07:29:30 2012
New Revision: 185676

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185676
Log:
2012-03-22  Richard Guenther  <rguenther@suse.de>

    PR tree-optimization/52459
    * tree-ssa-pre.c (inhibit_phi_insertion): Do not inhibit PHI insertion
    for calls.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c
* [Bug tree-optimization/52459] PRE performs stupid inserts
From: rguenth at gcc dot gnu.org @ 2012-03-22  7:41 UTC
To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-22 07:32:56 UTC ---
Fixed for 4.8.