[Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics
@ 2015-03-20 12:32 linux at carewolf dot com
  2015-03-20 12:40 ` [Bug tree-optimization/65492] " linux at carewolf dot com
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-20 12:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

            Bug ID: 65492
           Summary: Bad optimization in -O3 on SSE intrinsics
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linux at carewolf dot com

After investigating a loop using SSE intrinsics that was significantly faster
in clang than in gcc, I discovered that gcc had the same performance as clang at
-O2, and only performed significantly worse at -O3.

Disabling all the switches the documentation lists as activated by -O3
(or enabling them for -O2) doesn't fully account for the difference, but the
switch -f(no-)tree-loop-vectorize accounts for roughly half of it.

I have attached the files I used to test it. Using gcc -O2 or clang -O2 or -O3,
it times in at 1.8s on my machine. Using g++ (4.9 or 5.0) -O3 it times in at
2.5s. Using -O3 -fno-tree-loop-vectorize it runs in 2.3s, and -O2
-ftree-vectorize at 2.25s.

Using callgrind, it seems the performance difference is mainly spent on
accessing integers in the vector union.
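
For readers without the attachments, the following is a minimal sketch of what "accessing integers in the vector union" usually means. It is not taken from the attached test case: the union type `vec4i` and the function `sum_lanes_after_add` are hypothetical names chosen for illustration, and the code assumes an x86 target with SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 / x86-64 only */

/* Hypothetical union: the type used in the attachment is not shown in
 * this thread. It overlays one SSE register with four 32-bit lanes. */
typedef union {
    __m128i v;
    int     i[4];
} vec4i;

/* Adds two 4-element vectors with SSE2, then reads every lane back as a
 * scalar through the union. Such scalar reads are the kind of access the
 * report says dominates the profile. */
int sum_lanes_after_add(const int a[4], const int b[4])
{
    vec4i x, y, r;
    x.v = _mm_loadu_si128((const __m128i *)a);  /* unaligned 128-bit load */
    y.v = _mm_loadu_si128((const __m128i *)b);
    r.v = _mm_add_epi32(x.v, y.v);              /* lane-wise 32-bit add */
    return r.i[0] + r.i[1] + r.i[2] + r.i[3];   /* scalar lane accesses */
}
```

Reading a lane through the union may force the value out of the vector register, which is one plausible reason such accesses show up in a profile.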



* [Bug tree-optimization/65492] Bad optimization in -O3 on SSE intrinsics
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
@ 2015-03-20 12:40 ` linux at carewolf dot com
  2015-03-20 12:41 ` linux at carewolf dot com
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-20 12:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #1 from Allan Jensen <linux at carewolf dot com> ---
Created attachment 35070
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35070&action=edit
main



* [Bug tree-optimization/65492] Bad optimization in -O3 on SSE intrinsics
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
  2015-03-20 12:40 ` [Bug tree-optimization/65492] " linux at carewolf dot com
@ 2015-03-20 12:41 ` linux at carewolf dot com
  2015-03-20 12:49 ` linux at carewolf dot com
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-20 12:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #2 from Allan Jensen <linux at carewolf dot com> ---
Created attachment 35071
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35071&action=edit
vector union test



* [Bug tree-optimization/65492] Bad optimization in -O3 on SSE intrinsics
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
  2015-03-20 12:40 ` [Bug tree-optimization/65492] " linux at carewolf dot com
  2015-03-20 12:41 ` linux at carewolf dot com
@ 2015-03-20 12:49 ` linux at carewolf dot com
  2015-03-20 14:31 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-20 12:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #3 from Allan Jensen <linux at carewolf dot com> ---
The -O3 regression seems to go back a long way, but has shrunk over
time.

With gcc 4.6 and older it runs at 3.1s with -O3, and still at 1.8s with -O2.



* [Bug tree-optimization/65492] Bad optimization in -O3 on SSE intrinsics
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (2 preceding siblings ...)
  2015-03-20 12:49 ` linux at carewolf dot com
@ 2015-03-20 14:31 ` rguenth at gcc dot gnu.org
  2015-03-20 14:35 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-03-20 14:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-03-20
            Version|unknown                     |5.0
     Ever confirmed|0                           |1

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Part of it is if-conversion enabled at -O3 (-fno-tree-loop-if-convert), but
then -O2 is still ~20% faster.

Confirmed.

Not sure if really caused by intrinsics.



* [Bug tree-optimization/65492] Bad optimization in -O3 on SSE intrinsics
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (3 preceding siblings ...)
  2015-03-20 14:31 ` rguenth at gcc dot gnu.org
@ 2015-03-20 14:35 ` rguenth at gcc dot gnu.org
  2015-03-20 14:59 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-03-20 14:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Unrolling of the inner loop accounts for the rest (both the conditional moves
with if-conversion applied and the branchy code without it seem to put too
heavy a load on the branch predictor(?) when inside another loop).
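
To make the two code shapes concrete, here is a small sketch of a clamped table lookup written once with branches and once with selects, which is roughly what if-conversion produces. It is an illustration under assumptions, not the attached test: the function names, the table and the clamping bounds are all hypothetical, and a compiler is free to generate identical code for both versions.

```c
#include <stddef.h>

/* Branchy form: the clamps are written as control flow, so the compiler
 * may emit conditional jumps for them. */
long sum_branchy(const int *idx, size_t n, const int *table, int table_size)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int k = idx[i];
        if (k < 0)
            k = 0;
        if (k >= table_size)
            k = table_size - 1;
        sum += table[k];
    }
    return sum;
}

/* If-converted form: the same clamps as selects, which are natural
 * candidates for conditional moves (cmov) on x86. */
long sum_select(const int *idx, size_t n, const int *table, int table_size)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int k = idx[i];
        k = (k < 0) ? 0 : k;
        k = (k >= table_size) ? table_size - 1 : k;
        sum += table[k];
    }
    return sum;
}
```

Both functions compute the same result; only the shape handed to the optimizer differs, so any timing difference has to be measured on the generated assembly rather than inferred from the source.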



* [Bug tree-optimization/65492] Bad optimization in -O3 on SSE intrinsics
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (4 preceding siblings ...)
  2015-03-20 14:35 ` rguenth at gcc dot gnu.org
@ 2015-03-20 14:59 ` rguenth at gcc dot gnu.org
  2015-03-20 20:40 ` [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling linux at carewolf dot com
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-03-20 14:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
--param max-peel-branches default of 32 seems to be quite high.  For this
loop we have two branches on the hot path and 4 times unrolling.

Honza - how did you arrive at the default of 32?  Shouldn't that depend
on the number of other stmts, and thus rather look at branch density?

Similarly late unrolling should take conditional stmts (COND_EXPR rhs_code)
into account?

Especially as we don't really estimate anything to become constant after
unrolling.
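
As a rough picture of the branch-density concern, the sketch below hand-unrolls a loop with two branches per iteration by a factor of four, giving eight branches in the unrolled body. This is only an illustration: the function names and the range check are hypothetical, and the code GCC actually emits for the test case will differ.

```c
#include <stddef.h>

/* Rolled loop: two data-dependent branches per iteration. */
unsigned out_of_range_rolled(const int *v, size_t n, int lo, int hi)
{
    unsigned count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] < lo) ++count;
        if (v[i] > hi) ++count;
    }
    return count;
}

/* Hand-unrolled by 4: the main loop body now holds eight such branches,
 * followed by a remainder loop for the last n % 4 elements. */
unsigned out_of_range_unrolled4(const int *v, size_t n, int lo, int hi)
{
    unsigned count = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        if (v[i]     < lo) ++count;
        if (v[i]     > hi) ++count;
        if (v[i + 1] < lo) ++count;
        if (v[i + 1] > hi) ++count;
        if (v[i + 2] < lo) ++count;
        if (v[i + 2] > hi) ++count;
        if (v[i + 3] < lo) ++count;
        if (v[i + 3] > hi) ++count;
    }
    for (; i < n; ++i) {  /* remainder iterations */
        if (v[i] < lo) ++count;
        if (v[i] > hi) ++count;
    }
    return count;
}
```

Nothing in the unrolled body becomes constant or foldable, so the unrolling only multiplies the number of branches, which is the situation the comment above questions.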



* [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (5 preceding siblings ...)
  2015-03-20 14:59 ` rguenth at gcc dot gnu.org
@ 2015-03-20 20:40 ` linux at carewolf dot com
  2015-03-21  2:09 ` linux at carewolf dot com
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-20 20:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #8 from Allan Jensen <linux at carewolf dot com> ---
You can remove the branches in the inner loop and still reproduce the issue.
There were no branches in the original code, I only added them to the reduced
case because I was using a smaller lookup table.

It appears that after removing the branches, the execution time with and without
-fno-tree-vectorize on -O3 is the same. So they also cause some issue, but are
not the main one.



* [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (6 preceding siblings ...)
  2015-03-20 20:40 ` [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling linux at carewolf dot com
@ 2015-03-21  2:09 ` linux at carewolf dot com
  2015-03-21 14:05 ` linux at carewolf dot com
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-21  2:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #9 from Allan Jensen <linux at carewolf dot com> ---
Looking at the assembler, it does indeed appear that the only difference is just
loop unrolling and if-conversion.

After testing on another machine (an old PhenomII as opposed to the
Sandybridge), I can report that disabling tree-loop-if-convert, directly or
indirectly via tree-loop-vectorize, at -O3 regains all of the speed difference to
-O2 on PhenomII.

My guess is that the small loop-unrolling is conflicting with the op-cache Intel
introduced in the SandyBridge and newer architectures, which speeds up small
tight loops. On architectures without an op-cache the loop-unrolling is probably
still slightly faster.

Unfortunately, using -mtune=sandybridge does not improve the situation, so
maybe there should be some architecture tuning on even trivial loop unrolling,
and possibly discussion on making it part of generic-x64 tuning.



* [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (7 preceding siblings ...)
  2015-03-21  2:09 ` linux at carewolf dot com
@ 2015-03-21 14:05 ` linux at carewolf dot com
  2015-03-24 13:08 ` linux at carewolf dot com
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-21 14:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #10 from Allan Jensen <linux at carewolf dot com> ---
Just to make things more complicated, I just tried the test on a Haswell, and
surprisingly disabling if-convert or tree-vectorize on -O3 has no effect on
performance, but activating tree-vectorize on -O2 does.

In conclusion: this test is slower at -O3 than at -O2 on all tested CPUs (Phenom,
SandyBridge and Haswell), but for different reasons.

On Phenom, it is slower due to if-convert, but not unroll (unrolled might even
be slightly faster, but only by a small amount).
On SandyBridge, it is slower due to both if-convert and unroll, and even slower
when both are active.
On Haswell, it is slower due to both if-convert and unroll, but if-convert on
top of unroll is no slower than unroll on its own.

In general it is probably safe to try to avoid or undo the if-convert. There
appear to be special if-conversions only performed when vectorization is
active. Presumably they are only used in that case because they are known to
likely be slower when the loop is not vectorized. In this case the
if-conversion is done, but the loop is not vectorized in the end, just slowing it
down (on non-Haswell).

The unroll issue could perhaps be handled by controlling some optimization
params with tuning profiles. Where is trivial unrolling like this even
performed?


* [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (8 preceding siblings ...)
  2015-03-21 14:05 ` linux at carewolf dot com
@ 2015-03-24 13:08 ` linux at carewolf dot com
  2015-03-31 11:25 ` linux at carewolf dot com
  2021-08-14 23:12 ` pinskia at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-24 13:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #11 from Allan Jensen <linux at carewolf dot com> ---
Issues with slow cmov have been seen in several bug reports:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53346
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54073
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



* [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (9 preceding siblings ...)
  2015-03-24 13:08 ` linux at carewolf dot com
@ 2015-03-31 11:25 ` linux at carewolf dot com
  2021-08-14 23:12 ` pinskia at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: linux at carewolf dot com @ 2015-03-31 11:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #12 from Allan Jensen <linux at carewolf dot com> ---
I have a very crude fix for this.

First though, according to comments in tree-if-conv.c and earlier bugs on the
issue, if-conversion is supposed to be conditional. It is performed in a piece
of conditional code only to be used if vectorized. For some reason this
version appears to be used anyway.

Secondly, if conditional move instructions are generally slower than
branches, shouldn't they be avoided during instruction selection? The crude
fix is simply placing a 'return false;' at the top of ix86_expand_int_movcc in
i386.c.

So this case somehow triggers a case where the if-conversion that is supposed
to only be used by vectorization gets used anyway, but more generally, i386
shouldn't be generating cmov instructions for conditional moves in the first
place for modern architectures (anything newer than core2 and bulldozer). At
least not without input from a profile run.


* [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling
  2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
                   ` (10 preceding siblings ...)
  2015-03-31 11:25 ` linux at carewolf dot com
@ 2021-08-14 23:12 ` pinskia at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-14 23:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |6.0
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
      Known to work|                            |6.1.0
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002

--- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is fixed in GCC 6; we now produce:

  _76 = MIN_EXPR <offset_99, 255>;
  offset_75 = MAX_EXPR <_76, 1>;
...

  _104 = offset_75 + -1;

Rather than:

  offset_30 = _52 <= 254 ? offset_67 : 1;
  prephitmp_119 = _52 <= 254 ? pretmp_118 : 0;
  _17 = offset_67 <= 255;
  offset_69 = _17 ? offset_30 : 255;
  prephitmp_109 = _17 ? prephitmp_119 : 254;

This was fixed by r6-528.  PR 66002 even describes almost the same issue.
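
The two GIMPLE shapes can be mirrored at source level as follows. This is a sketch under assumptions: the function names are hypothetical, and the meaning of `_52 <= 254` is assumed here to be a range check for `offset` in [1, 255], which the dump itself does not state.

```c
/* Chained selects, roughly the shape of the older GIMPLE.
 * Assumption: "_52 <= 254" tests whether offset lies in [1, 255]. */
int clamp_index_selects(int offset)
{
    int in_range = (offset >= 1 && offset <= 255);
    int clamped = in_range ? offset : 1;
    clamped = (offset <= 255) ? clamped : 255;
    return clamped - 1;
}

/* MIN/MAX form, matching the GIMPLE produced by GCC 6. */
int clamp_index_minmax(int offset)
{
    int t = offset < 255 ? offset : 255;  /* MIN_EXPR <offset, 255> */
    t = t > 1 ? t : 1;                    /* MAX_EXPR <t, 1> */
    return t - 1;                         /* offset + -1 */
}
```

Both functions clamp `offset` to [1, 255] and subtract one; the MIN/MAX form needs two operations where the select chain carries two parallel values through four selects.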


Thread overview: 13+ messages
2015-03-20 12:32 [Bug tree-optimization/65492] New: Bad optimization in -O3 on SSE intrinsics linux at carewolf dot com
2015-03-20 12:40 ` [Bug tree-optimization/65492] " linux at carewolf dot com
2015-03-20 12:41 ` linux at carewolf dot com
2015-03-20 12:49 ` linux at carewolf dot com
2015-03-20 14:31 ` rguenth at gcc dot gnu.org
2015-03-20 14:35 ` rguenth at gcc dot gnu.org
2015-03-20 14:59 ` rguenth at gcc dot gnu.org
2015-03-20 20:40 ` [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling linux at carewolf dot com
2015-03-21  2:09 ` linux at carewolf dot com
2015-03-21 14:05 ` linux at carewolf dot com
2015-03-24 13:08 ` linux at carewolf dot com
2015-03-31 11:25 ` linux at carewolf dot com
2021-08-14 23:12 ` pinskia at gcc dot gnu.org
