public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
@ 2012-03-02  7:03 M8R-ynb11d at mailinator dot com
  2012-03-02  7:12 ` [Bug tree-optimization/52459] " M8R-ynb11d at mailinator dot com
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: M8R-ynb11d at mailinator dot com @ 2012-03-02  7:03 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

             Bug #: 52459
           Summary: [x86] loop vectorization performance very bad (worse
                    than -O0) when using sse4.2 popcnt
    Classification: Unclassified
           Product: gcc
           Version: 4.6.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: M8R-ynb11d@mailinator.com


Created attachment 26808
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26808
testcase

gcc 4.6.3 on x86_64-unknown-linux-gnu, running on Core i7 2600K (Sandy Bridge)

The attached testcase simply exercises the popcnt instruction over every
unsigned int and creates a histogram.  But with -O2 -ftree-vectorize or with
-O3, the vectorizer adds two popcnt instructions per loop iteration, which
makes performance worse than the unoptimized version, and about 3x slower than
-Os.
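
Attachment 26808 is not reproduced in this thread, so the following is only a
minimal sketch of what such a testcase might look like, assuming a 32-bit
unsigned int.  The names popcount_histogram and POPCOUNT_BINS and the
first/end parameters are hypothetical; they exist so the loop can be run over
a small range.  __builtin_popcount is the GCC builtin mentioned later in this
report.

```c
#include <assert.h>
#include <stddef.h>

/* A 32-bit value has between 0 and 32 set bits, hence 33 bins. */
enum { POPCOUNT_BINS = 33 };

/* Hypothetical reconstruction of the testcase, not the attached file.
   Builds a histogram of popcount(i) for i = first, first+1, ..., end-1.
   Unsigned arithmetic wraps, so first == end walks every unsigned int
   exactly once, which matches the "jne" loop exit in the listings. */
void popcount_histogram(unsigned first, unsigned end,
                        unsigned hist[POPCOUNT_BINS])
{
    assert(hist != NULL);

    for (size_t b = 0; b < POPCOUNT_BINS; b++)
        hist[b] = 0;

    unsigned i = first;
    do {
        hist[__builtin_popcount(i)]++;   /* one popcount per iteration */
    } while (++i != end);
}
```

Calling popcount_histogram(0, 0, hist) would cover the full range as described
above; a call such as popcount_histogram(0, 16, hist) is enough to check the
logic quickly.  Whether this sketch yields exactly the asm shown below depends
on the compiler version and flags.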

Here are the timings and the resulting asm of the loop:

With -O0 -m32 -msse4.2: [7.40 seconds]
.L2:
    mov    eax, DWORD PTR [ebp-12]
    add    DWORD PTR [ebp-12], 1
    popcnt    eax, eax
    mov    edx, DWORD PTR [ebp-144+eax*4]
    add    edx, 1
    mov    DWORD PTR [ebp-144+eax*4], edx
    cmp    DWORD PTR [ebp-12], 0
    jne    .L2


With -O1 -m32 -msse4.2: [2.90 seconds]
.L2:
    lea    edx, [eax+1]
    popcnt    eax, eax
    add    DWORD PTR [esp+12+eax*4], 1
    mov    eax, edx
    test    edx, edx
    jne    .L2


With -O2 -m32 -msse4.2: [2.91 seconds]
.L5:
    popcnt    edx, eax
    mov    ecx, DWORD PTR [esp+12+edx*4]
    add    eax, 1
.L3:
    add    ecx, 1
    test    eax, eax
    mov    DWORD PTR [esp+12+edx*4], ecx
    jne    .L5


With -Os -m32 -msse4.2: [2.82 seconds]
.L2:
    popcnt    edx, eax
    inc    DWORD PTR [ebp-136+edx*4]
    inc    eax
    jne    .L2


With -O3 -m32 -msse4.2: [8.45 seconds]
.L5:
    popcnt    edx, eax
    mov    edx, DWORD PTR [esp+edx*4]
.L3:
    popcnt    ecx, eax
    add    edx, 1
    add    eax, 1
    mov    DWORD PTR [esp+ecx*4], edx
    jne    .L5


Things are about the same (relatively) with -m64, but somewhat slower; I
assume this is due to the extra edx -> rdx sign-extension step.


* [Bug tree-optimization/52459] [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
  2012-03-02  7:03 [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt M8R-ynb11d at mailinator dot com
@ 2012-03-02  7:12 ` M8R-ynb11d at mailinator dot com
  2012-03-02  7:20 ` [Bug tree-optimization/52459] [x86] loop " pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: M8R-ynb11d at mailinator dot com @ 2012-03-02  7:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

--- Comment #1 from M8R-ynb11d at mailinator dot com 2012-03-02 07:11:47 UTC ---
Similar (but much slower) results when not using SSE and using the libgcc
library version of __builtin_popcount:

-O0: 22.55 secs
-O1: 20.57 secs
-O2: 22.48 secs
-Os: 22.81 secs
-O3: 45.17 secs


* [Bug tree-optimization/52459] [x86] loop performance very bad (worse than -O0) when using sse4.2 popcnt
  2012-03-02  7:03 [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt M8R-ynb11d at mailinator dot com
  2012-03-02  7:12 ` [Bug tree-optimization/52459] " M8R-ynb11d at mailinator dot com
@ 2012-03-02  7:20 ` pinskia at gcc dot gnu.org
  2012-03-02 10:01 ` [Bug tree-optimization/52459] PPRE performs stupid inserts rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2012-03-02  7:20 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[x86] loop vectorization    |[x86] loop performance very
                   |performance very bad (worse |bad (worse than -O0) when
                   |than -O0) when using sse4.2 |using sse4.2 popcnt
                   |popcnt                      |

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> 2012-03-02 07:19:49 UTC ---
This has nothing to do with the vectorizer; it is caused by PPRE (partial-PRE).


* [Bug tree-optimization/52459] PPRE performs stupid inserts
  2012-03-02  7:03 [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt M8R-ynb11d at mailinator dot com
  2012-03-02  7:12 ` [Bug tree-optimization/52459] " M8R-ynb11d at mailinator dot com
  2012-03-02  7:20 ` [Bug tree-optimization/52459] [x86] loop " pinskia at gcc dot gnu.org
@ 2012-03-02 10:01 ` rguenth at gcc dot gnu.org
  2012-03-21 10:45 ` [Bug tree-optimization/52459] PRE " rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-03-02 10:01 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2012-03-02
         AssignedTo|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
            Summary|[x86] loop performance very |PPRE performs stupid
                   |bad (worse than -O0) when   |inserts
                   |using sse4.2 popcnt         |
     Ever Confirmed|0                           |1
      Known to fail|                            |4.7.0

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-02 10:00:38 UTC ---
I will have a look.


* [Bug tree-optimization/52459] PRE performs stupid inserts
  2012-03-02  7:03 [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt M8R-ynb11d at mailinator dot com
                   ` (2 preceding siblings ...)
  2012-03-02 10:01 ` [Bug tree-optimization/52459] PPRE performs stupid inserts rguenth at gcc dot gnu.org
@ 2012-03-21 10:45 ` rguenth at gcc dot gnu.org
  2012-03-22  7:33 ` rguenth at gcc dot gnu.org
  2012-03-22  7:41 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-03-21 10:45 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu.org
            Summary|PPRE performs stupid        |PRE performs stupid inserts
                   |inserts                     |

--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-21 10:41:04 UTC ---
I have a patch.  The issue is that we inhibit PHI insertion for the builtin
call at -O3 (when -ftree-vectorize is on); it is not partial-PRE.  The stupid
inserts happen because the inhibited PRE insertion leaves only part of the
dependent expressions available, yet some uses still get eliminated, so the
PHI node that was inserted anyway is not removed.  That means limiting
'PHI insertion' is bogus: we should really limit what we PHI translate, so
that dependent expressions do not become partially available.  I am not sure
whether we know at that point if we'd need a PHI node, but at least the edges
on which we eventually want to restrict translation are easy to spot: the
latch edges.
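
To make the effect concrete, here is a hand-written C model of the loop shape
seen in the -O3 listing from the original report.  It is one plausible
source-level reading of that asm, not compiler output and not the GIMPLE that
PRE actually produced; the function name histogram_o3_shape and the first/end
parameters are hypothetical.  The load of the histogram bin is carried across
the back edge (as if through a PHI), while the popcount feeding it is computed
again for the store.

```c
#include <assert.h>
#include <stddef.h>

enum { BINS = 33 };   /* 0..32 set bits in a 32-bit value */

/* Model of the -O3 loop: same result as the simple histogram loop, but
   with two popcounts per iteration.  The caller must zero hist first.
   As in the report, first == end walks every unsigned int once. */
void histogram_o3_shape(unsigned first, unsigned end, unsigned hist[BINS])
{
    assert(hist != NULL);

    unsigned i = first;
    /* Load made available on the loop entry path. */
    unsigned loaded = hist[__builtin_popcount(i)];

    for (;;) {
        /* Popcount recomputed for the store address (label .L3). */
        unsigned bin = __builtin_popcount(i);
        hist[bin] = loaded + 1;

        if (++i == end)
            break;

        /* Same expression evaluated again on the latch path (label .L5)
           so that the loaded value is available in the next iteration. */
        loaded = hist[__builtin_popcount(i)];
    }
}
```

The model computes the same histogram as a loop with a single popcount per
iteration, which is why the extra popcount is pure overhead.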


* [Bug tree-optimization/52459] PRE performs stupid inserts
  2012-03-02  7:03 [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt M8R-ynb11d at mailinator dot com
                   ` (3 preceding siblings ...)
  2012-03-21 10:45 ` [Bug tree-optimization/52459] PRE " rguenth at gcc dot gnu.org
@ 2012-03-22  7:33 ` rguenth at gcc dot gnu.org
  2012-03-22  7:41 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-03-22  7:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-22 07:29:39 UTC ---
Author: rguenth
Date: Thu Mar 22 07:29:30 2012
New Revision: 185676

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185676
Log:
2012-03-22  Richard Guenther  <rguenther@suse.de>

    PR tree-optimization/52459
    * tree-ssa-pre.c (inhibit_phi_insertion): Do not inhibit
    PHI insertion for calls.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c


* [Bug tree-optimization/52459] PRE performs stupid inserts
  2012-03-02  7:03 [Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt M8R-ynb11d at mailinator dot com
                   ` (4 preceding siblings ...)
  2012-03-22  7:33 ` rguenth at gcc dot gnu.org
@ 2012-03-22  7:41 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-03-22  7:41 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-03-22 07:32:56 UTC ---
Fixed for 4.8.

