public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
@ 2013-02-13 20:21 arturomdn at gmail dot com
  2013-02-13 20:23 ` [Bug middle-end/56309] " arturomdn at gmail dot com
                   ` (32 more replies)
  0 siblings, 33 replies; 34+ messages in thread
From: arturomdn at gmail dot com @ 2013-02-13 20:21 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

             Bug #: 56309
           Summary: -O3 optimizer generates conditional moves instead of
                    compare and branch resulting in almost 2x slower code
    Classification: Unclassified
           Product: gcc
           Version: 4.7.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: arturomdn@gmail.com


Created attachment 29442
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29442
Self contained source file with parameter x passed by value (slow)

This bug report reflects the analysis of a question asked in stackoverflow

http://stackoverflow.com/questions/14805641/why-does-changing-const-ull-to-const-ull-in-function-parameter-result-in-pe/14819939#14819939

When an unsigned long long parameter to a function is passed by reference
instead of by value the result is a dramatic almost 2x improvement in speed
when compiled with -O3.  Given that the function is inlined this is unexpected.
 Upon closer inspection it was found that the code generated is quite
different, as if passing the parameter by value enables an optimization (use of
x86 conditional moves) that backfires, possibly by suffering an unexpected
stall in the processor.

Two files are attached

by-val-O3.ii
by-ref-O3.ii

They differ only in the way the unsigned long long parameter "x" is passed.

./by-ref-O3
Took 11.85 seconds total.

./by-ref-O3
Took 6.67 seconds total.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug middle-end/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
@ 2013-02-13 20:23 ` arturomdn at gmail dot com
  2013-02-13 20:26 ` [Bug target/56309] " pinskia at gcc dot gnu.org
                   ` (31 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: arturomdn at gmail dot com @ 2013-02-13 20:23 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #1 from arturomdn at gmail dot com 2013-02-13 20:22:51 UTC ---
Created attachment 29443
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29443
Self contained source file with parameter x passed by reference (fast)

This source file differs from by-val-O3.ii only in the way parameter "x" is
passed.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
  2013-02-13 20:23 ` [Bug middle-end/56309] " arturomdn at gmail dot com
@ 2013-02-13 20:26 ` pinskia at gcc dot gnu.org
  2013-02-13 20:29 ` arturomdn at gmail dot com
                   ` (30 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: pinskia at gcc dot gnu.org @ 2013-02-13 20:26 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |target

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> 2013-02-13 20:25:58 UTC ---
Which target is this on?  On some (most non x86 targets) conditional moves are
faster than compare and branch.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
  2013-02-13 20:23 ` [Bug middle-end/56309] " arturomdn at gmail dot com
  2013-02-13 20:26 ` [Bug target/56309] " pinskia at gcc dot gnu.org
@ 2013-02-13 20:29 ` arturomdn at gmail dot com
  2013-02-14  8:16 ` ubizjak at gmail dot com
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: arturomdn at gmail dot com @ 2013-02-13 20:29 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #3 from arturomdn at gmail dot com 2013-02-13 20:29:12 UTC ---
Intel Xeon X5570 @ 2.93GHz

(In reply to comment #2)
> Which target is this on?  On some (most non x86 targets) conditional moves are
> faster than compare and branch.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (2 preceding siblings ...)
  2013-02-13 20:29 ` arturomdn at gmail dot com
@ 2013-02-14  8:16 ` ubizjak at gmail dot com
  2013-02-14  8:22 ` ubizjak at gmail dot com
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-14  8:16 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2013-02-14
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |ubizjak at gmail dot com
     Ever Confirmed|0                           |1

--- Comment #4 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-14 08:16:26 UTC ---
(In reply to comment #0)

> When an unsigned long long parameter to a function is passed by reference
> instead of by value the result is a dramatic almost 2x improvement in speed
> when compiled with -O3.  Given that the function is inlined this is unexpected.
>  Upon closer inspection it was found that the code generated is quite
> different, as if passing the parameter by value enables an optimization (use of
> x86 conditional moves) that backfires, possibly by suffering an unexpected
> stall in the processor.

This is due to x86 processor deficiency (HW bug?) with cmove insns. Please see
explanations in PR53346 and PR54073.

There is no solution (yet?), compiler should probably detect this situation and
avoid conversion to cmove.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (3 preceding siblings ...)
  2013-02-14  8:16 ` ubizjak at gmail dot com
@ 2013-02-14  8:22 ` ubizjak at gmail dot com
  2013-02-14  8:28 ` ubizjak at gmail dot com
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-14  8:22 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #5 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-14 08:22:33 UTC ---
For reference, some quotes from Honza:

-- PR54073--
The decision on whether to use cmov or jmp was always tricky on x86
architectures. Cmov increase dependency chains, register pressure (both values
needs to be loaded in) and has long opcode. So jump sequence, if well
predicted, flows better through the out-of-order core. If badly predicted it
is, of course, a disaster. I think more modern CPUs solved the problems with
long latency of cmov, but the dependency chains are still there.

[...]

We should do something about rnflow. I will look into that.
The usual wisdom is that lacking profile feedback one should handle non-loop
branhces as inpredctable and loop branches as predictable, so all handled by
ifconvert fals to the first category. With profile feedback one can see branch
probability and if it is close to 0 or REG_BR_PROB_BASE tread the branch as
predictable. We handle this with predictable_edge_p parameter passed to
BRANCH_COST (that by itself is a gross, but for years we was not able to come
with something saner)
-- /PR54073 --

-- PR53046 --
Well, as I wrote to the other PR, the main problem of cmov is extension of
dependency chain.  For well predicted sequence with conditional jump there is
no update of rbs so the loop executes faster, because the
loads/stores/comparisons executes "in parallel". The load in the next iteration
can then happen speculatively before the condition from previous iteration is
resolved. With cmov in it, there is dependence on rbx for all the other
computations in the loop.

I guess there is no localy available information suggesting suggesting that the
particular branch is well predictable, at least without profile feedback (where
we won't disable the conversion anyway).

[...]
-- /PR53046 --


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (4 preceding siblings ...)
  2013-02-14  8:22 ` ubizjak at gmail dot com
@ 2013-02-14  8:28 ` ubizjak at gmail dot com
  2013-02-14 14:49 ` ysrumyan at gmail dot com
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-14  8:28 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #6 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-14 08:27:52 UTC ---
(In reply to comment #5)
> -- PR53046 --
> [...]
> -- /PR53046 --

Actually, PR53346.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (5 preceding siblings ...)
  2013-02-14  8:28 ` ubizjak at gmail dot com
@ 2013-02-14 14:49 ` ysrumyan at gmail dot com
  2013-02-14 15:53 ` arturomdn at gmail dot com
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ysrumyan at gmail dot com @ 2013-02-14 14:49 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Yuri Rumyantsev <ysrumyan at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #7 from Yuri Rumyantsev <ysrumyan at gmail dot com> 2013-02-14 14:49:31 UTC ---
My 2 cents.

We can enhance if conversion phase in gcc by analyzing conditional jump
probablity and if we have skewed hammock do not perform transformation. In
general it requires to use profiling but for this specific example we can use
so called "static profiler", e.g. in our example we have 
    if (tmp >= imax) 
where imax = 1 << 32, i.e. huge conastant. For such conditions we can assume
that this branch is likely not-taken and reject if-conversion of such hammock.

Other possiblity is implement support for __builtin_expect in gcc but it
requires source code changing too.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (6 preceding siblings ...)
  2013-02-14 14:49 ` ysrumyan at gmail dot com
@ 2013-02-14 15:53 ` arturomdn at gmail dot com
  2013-02-14 16:01 ` arturomdn at gmail dot com
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: arturomdn at gmail dot com @ 2013-02-14 15:53 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #8 from arturomdn at gmail dot com 2013-02-14 15:53:15 UTC ---
It is possible (just a guess) that the extra compare is causing an interlock in
the processor since the first cmov is issued speculatively and the condition
won't be confirmed until the first compare has executed.  Someone from Intel
could tell us exactly why the original sequence is so disastrous and suggest an
alternative that still uses cmov and is better than jmp.

I wonder if instead of emitting this sequence

   shr    $0x20,%rdi                                                            
   and    $0xffffffff,%ecx                                                      
   cmp    %r8,%rdx                                                              
   cmovbe %r11,%rdi                                                             
   add    $0x1,%rax                                                             
   cmp    %r8,%rdx                                                              
   cmovbe %rdx,%rcx                                                             

it would do this instead

   shr    $0x20,%rdi                                                            
   and    $0xffffffff,%ecx                                                      
   add    $0x1,%rax                                                             
   cmp    %r8,%rdx                                                              
   cmovbe %r11,%rdi                                                             
   cmovbe %rdx,%rcx


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (7 preceding siblings ...)
  2013-02-14 15:53 ` arturomdn at gmail dot com
@ 2013-02-14 16:01 ` arturomdn at gmail dot com
  2013-02-14 16:44 ` arturomdn at gmail dot com
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: arturomdn at gmail dot com @ 2013-02-14 16:01 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #9 from arturomdn at gmail dot com 2013-02-14 16:00:49 UTC ---
I found in the Intel optimization guide an example of this idiom of comparing
once and issuing two cmov back-to-back... so the problem isn't the two cmov,
but possibly introducing the 2nd compare that splits them.

not_equal:
   movzx eax, BYTE PTR[esi+edx]
   movzx edx, BYTE PTR[edi+edx]
   cmp eax, edx
   cmova eax, ONE
   cmovb eax, NEG_ONE
   jmp ret_tag

Taken from example 10-16 of
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (8 preceding siblings ...)
  2013-02-14 16:01 ` arturomdn at gmail dot com
@ 2013-02-14 16:44 ` arturomdn at gmail dot com
  2013-02-14 16:59 ` steven at gcc dot gnu.org
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: arturomdn at gmail dot com @ 2013-02-14 16:44 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #10 from arturomdn at gmail dot com 2013-02-14 16:43:23 UTC ---
Might be worth mentioning here what I said in the stackoverflow answer, that in
this particular case the entire conditional branch can be avoided because it is
redundant.

This code

    if (tmp >= imax) {
        carry = tmp >> numbits;    // <---- A
        tmp &= imax - 1;                // <---- B
    } else {
        carry = 0;                            // <---- C
    }

can be reduced to

    carry = tmp >> numbits;
    tmp &= imax - 1;

Proof:
    1) numbits is 32
    2) imax is 1ULL << 32 so lower 32 bits are zero
    3) imax - 1 is 0xFFFFFFFF (see 2)
    4) if tmp >= imax then tmp has bits set in upper 32 bits
    5) otherwise if tmp < imax then tmp does not have bits set in upper 32 bits
    6) statement "A" is equivalent to "C" when tmp < imax (because of 4)
    7) statement "B" is a NOP when tmp < imax (because of 3 and 5)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (9 preceding siblings ...)
  2013-02-14 16:44 ` arturomdn at gmail dot com
@ 2013-02-14 16:59 ` steven at gcc dot gnu.org
  2013-02-14 17:10 ` steven at gcc dot gnu.org
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: steven at gcc dot gnu.org @ 2013-02-14 16:59 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |steven at gcc dot gnu.org

--- Comment #11 from Steven Bosscher <steven at gcc dot gnu.org> 2013-02-14 16:59:04 UTC ---
(In reply to comment #8)
> I wonder if instead of emitting this sequence
> 
>    shr    $0x20,%rdi
>    and    $0xffffffff,%ecx
>    cmp    %r8,%rdx
>    cmovbe %r11,%rdi
>    add    $0x1,%rax
>    cmp    %r8,%rdx
>    cmovbe %rdx,%rcx
> 
> it would do this instead
> 
>    shr    $0x20,%rdi
>    and    $0xffffffff,%ecx
>    add    $0x1,%rax
>    cmp    %r8,%rdx
>    cmovbe %r11,%rdi
>    cmovbe %rdx,%rcx

GCC fails to do so because the flags are clobbered between the two
cmovs, preventing code motion to group the two cmovs:

  197: r116:DI={(gtu(flags:CC,0))?r125:DI:r233:DI}
  199: {r110:DI=r110:DI+0x1;clobber flags:CC;}
  201: flags:CC=cmp(r124:DI,r235:DI)
  202: r221:DI={(gtu(flags:CC,0))?r126:DI:r124:DI}

If you do this change manually in your code (compile with -S, "fix"
the .s file and assemble it), does that speed up your code?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (10 preceding siblings ...)
  2013-02-14 16:59 ` steven at gcc dot gnu.org
@ 2013-02-14 17:10 ` steven at gcc dot gnu.org
  2013-02-14 17:26 ` [Bug tree-optimization/56309] " steven at gcc dot gnu.org
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: steven at gcc dot gnu.org @ 2013-02-14 17:10 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #12 from Steven Bosscher <steven at gcc dot gnu.org> 2013-02-14 17:10:02 UTC ---
A bit more clear with insn 195 added:

  195: flags:CC=cmp(r124:DI,r235:DI)
  197: r116:DI={(gtu(flags:CC,0))?r125:DI:r233:DI}
  199: {r110:DI=r110:DI+0x1;clobber flags:CC;}
  201: flags:CC=cmp(r124:DI,r235:DI)
  202: r221:DI={(gtu(flags:CC,0))?r126:DI:r124:DI}

insns 195 and 201 compute the same condition but GCC can't eliminate
the comparison in insns 201 because insn 199 clobbers the flags (i.e.
destroys the result from insn 195).

As for speed, of course I can measure that myself:

--- by-val-O3.s.orig    2013-02-14 18:06:56.000000000 +0100
+++ by-val-O3.s 2013-02-14 18:07:23.000000000 +0100
@@ -357,9 +357,8 @@
        shrq    $32, %rdi
        cmpq    %r8, %rdx
        cmovbe  %r11, %rdi
-       addq    $1, %rax
-       cmpq    %r8, %rdx
        cmovbe  %rdx, %rcx
+       addq    $1, %rax
        cmpq    %rbp, %rax
        movq    %rcx, -8(%rsi,%rax,8)
        jne     .L50

unmodified: Took 14.31 seconds total.
modified:   Took 13.04 seconds total.

So re. comment #9: it's not the problem but it'd be a small improvement.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (11 preceding siblings ...)
  2013-02-14 17:10 ` steven at gcc dot gnu.org
@ 2013-02-14 17:26 ` steven at gcc dot gnu.org
  2013-02-14 17:31 ` arturomdn at gmail dot com
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: steven at gcc dot gnu.org @ 2013-02-14 17:26 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
          Component|target                      |tree-optimization

--- Comment #13 from Steven Bosscher <steven at gcc dot gnu.org> 2013-02-14 17:26:08 UTC ---
$ g++ --version
g++ (GCC) 4.7.2 ...
$ g++ -O3 -std=c++11 by-val-O3.ii ; ./a.out
Took 14.31 seconds total.
$ g++ -O3 -std=c++11 by-val-O3.ii -fvect-cost-model ; ./a.out
Took 14.31 seconds total.
$ g++ -O3 -std=c++11 by-val-O3.ii -fno-tree-vectorize ; ./a.out
Took 10.25 seconds total.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (12 preceding siblings ...)
  2013-02-14 17:26 ` [Bug tree-optimization/56309] " steven at gcc dot gnu.org
@ 2013-02-14 17:31 ` arturomdn at gmail dot com
  2013-02-14 17:34 ` ubizjak at gmail dot com
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: arturomdn at gmail dot com @ 2013-02-14 17:31 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #14 from arturomdn at gmail dot com 2013-02-14 17:30:54 UTC ---
I also did the experiment, with the same results... it got faster but not as
fast as the version with
conditional branch instead of conditional moves:

./by-ref-O3
Took 6.65 seconds total.

./by-val-O3 
Took 11.64 seconds total.

./by-val-fixed-O3 
Took 9.94 seconds total.

--- by-val-O3.s 2013-02-14 11:27:28.109856000 -0600
+++ by-val-fixed-O3.s   2013-02-14 11:11:07.312317000 -0600
@@ -679,13 +679,12 @@
        shrq    $32, %rdi
        .loc 4 25 0
        andl    $4294967295, %ecx
+       addq    $1, %rax
        cmpq    %r8, %rdx
        cmovbe  %r11, %rdi
 .LVL43:
        .loc 4 29 0
-       addq    $1, %rax
 .LVL44:
-       cmpq    %r8, %rdx
        cmovbe  %rdx, %rcx
 .LVL45:
        .loc 4 21 0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (13 preceding siblings ...)
  2013-02-14 17:31 ` arturomdn at gmail dot com
@ 2013-02-14 17:34 ` ubizjak at gmail dot com
  2013-02-14 17:43 ` arturomdn at gmail dot com
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-14 17:34 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #15 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-14 17:34:15 UTC ---
(In reply to comment #13)
> $ g++ --version
> g++ (GCC) 4.7.2 ...
> $ g++ -O3 -std=c++11 by-val-O3.ii ; ./a.out
> Took 14.31 seconds total.
> $ g++ -O3 -std=c++11 by-val-O3.ii -fvect-cost-model ; ./a.out
> Took 14.31 seconds total.
> $ g++ -O3 -std=c++11 by-val-O3.ii -fno-tree-vectorize ; ./a.out
> Took 10.25 seconds total.

This is red herring, see PR53346c11:

--q--
You are right. -ftree-vectorize implies -ftree-loop-if-convert and this option
makes all the difference!
--/q--

Please try wiht -ftree-vectorize -fno-tree-loop-if-convert.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (14 preceding siblings ...)
  2013-02-14 17:34 ` ubizjak at gmail dot com
@ 2013-02-14 17:43 ` arturomdn at gmail dot com
  2013-02-14 17:50 ` ubizjak at gmail dot com
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: arturomdn at gmail dot com @ 2013-02-14 17:43 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #16 from arturomdn at gmail dot com 2013-02-14 17:42:55 UTC ---
With -ftree-vectorize -fno-tree-loop-if-convert flags it generated this for the
loop in question:

.L39:
        movq    %rdi, %rdx
        addq    (%rsi,%rax,8), %rcx
        imulq   (%r9,%rax,8), %rdx
        addq    %rcx, %rdx
        xorl    %ecx, %ecx
        cmpq    %r10, %rdx
        jbe     .L38
        movq    %rdx, %rcx
        andl    $4294967295, %edx
        shrq    $32, %rcx
.L38:
        addq    $1, %rax
        cmpq    %r8, %rax
        movq    %rdx, -8(%rsi,%rax,8)
        jne     .L39

And it executed fast:

./by-val-O3-flags
Took 6.74 seconds total.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (15 preceding siblings ...)
  2013-02-14 17:43 ` arturomdn at gmail dot com
@ 2013-02-14 17:50 ` ubizjak at gmail dot com
  2013-02-14 18:23 ` ubizjak at gmail dot com
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-14 17:50 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #17 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-14 17:49:46 UTC ---
(In reply to comment #16)

> And it executed fast:
> 
> ./by-val-O3-flags
> Took 6.74 seconds total.

The solution to all these PRs is trivial: kill -ftree-loop-if-convert on x86,
until infrastructure to detect possible stalls is implemented. This decision
needs lots of benchmarking, though.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (16 preceding siblings ...)
  2013-02-14 17:50 ` ubizjak at gmail dot com
@ 2013-02-14 18:23 ` ubizjak at gmail dot com
  2013-02-14 18:35 ` [Bug target/56309] " ubizjak at gmail dot com
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-14 18:23 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #18 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-14 18:23:12 UTC ---
Following patch is a big hammer approach to the problem, intended only for
benchmarking

--cut here--
Index: common/config/i386/i386-common.c
===================================================================
--- common/config/i386/i386-common.c    (revision 195988)
+++ common/config/i386/i386-common.c    (working copy)
@@ -710,6 +710,8 @@
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
+    /* Turn off -ftree-loop-if-convert by default.  */
+    { OPT_LEVELS_ALL, OPT_ftree_loop_if_convert, NULL, 0 },

 #ifdef SUBTARGET_OPTIMIZATION_OPTIONS
     SUBTARGET_OPTIMIZATION_OPTIONS,
--cut here--

OTOH, it looks that on x86, the only problematic insns are scalar integer cmov
instructions. With some (machine dependant ?) pass inserted at the right spot
in the sequence of passes, appropriate cmovs can be decomposed back to jumps,
for some yet unknown definition of "appropriate".


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (17 preceding siblings ...)
  2013-02-14 18:23 ` ubizjak at gmail dot com
@ 2013-02-14 18:35 ` ubizjak at gmail dot com
  2013-02-14 22:18 ` steven at gcc dot gnu.org
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-14 18:35 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #19 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-14 18:35:05 UTC ---
Back to "target" component.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (18 preceding siblings ...)
  2013-02-14 18:35 ` [Bug target/56309] " ubizjak at gmail dot com
@ 2013-02-14 22:18 ` steven at gcc dot gnu.org
  2013-02-15  6:50 ` izamyatin at gmail dot com
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: steven at gcc dot gnu.org @ 2013-02-14 22:18 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #20 from Steven Bosscher <steven at gcc dot gnu.org> 2013-02-14 22:17:23 UTC ---
(In reply to comment #12)
> --- by-val-O3.s.orig    2013-02-14 18:06:56.000000000 +0100
> +++ by-val-O3.s 2013-02-14 18:07:23.000000000 +0100
> @@ -357,9 +357,8 @@
>         shrq    $32, %rdi
>         cmpq    %r8, %rdx
>         cmovbe  %r11, %rdi
> -       addq    $1, %rax
> -       cmpq    %r8, %rdx
>         cmovbe  %rdx, %rcx
> +       addq    $1, %rax
>         cmpq    %rbp, %rax
>         movq    %rcx, -8(%rsi,%rax,8)
>         jne     .L50
> 
> unmodified: Took 14.31 seconds total.
> modified:   Took 13.04 seconds total.
> 
> So re. comment #9: it's not the problem but it'd be a small improvement.

FWIW this comes from not eliminating the condition expression in
the conditional moves that ifcvt creates:

  tmp_97 = tmp_93 > 4294967295 ? tmp_95 : tmp_93;
  carry_105 = tmp_93 > 4294967295 ? carry_94 : 0;

I'm surprised this form is allowed at all, I'd expect we only allow
is_gimple_reg() for a COND_EXPR_COND in a RHS context.

Anyway -- separate problem.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (19 preceding siblings ...)
  2013-02-14 22:18 ` steven at gcc dot gnu.org
@ 2013-02-15  6:50 ` izamyatin at gmail dot com
  2013-02-15  7:15 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: izamyatin at gmail dot com @ 2013-02-15  6:50 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #21 from Igor Zamyatin <izamyatin at gmail dot com> 2013-02-15 06:49:53 UTC ---
(In reply to comment #18)
> Following patch is a big hammer approach to the problem, intended only for
> benchmarking
> 
> --cut here--
> Index: common/config/i386/i386-common.c
> ===================================================================
> --- common/config/i386/i386-common.c    (revision 195988)
> +++ common/config/i386/i386-common.c    (working copy)
> @@ -710,6 +710,8 @@
>      /* Turn off -fschedule-insns by default.  It tends to make the
>         problem with not enough registers even worse.  */
>      { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> +    /* Turn off -ftree-loop-if-convert by default.  */
> +    { OPT_LEVELS_ALL, OPT_ftree_loop_if_convert, NULL, 0 },
> 
>  #ifdef SUBTARGET_OPTIMIZATION_OPTIONS
>      SUBTARGET_OPTIMIZATION_OPTIONS,
> --cut here--
> 

But in this case we are giving up vectorization for some cases, aren't we?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (20 preceding siblings ...)
  2013-02-15  6:50 ` izamyatin at gmail dot com
@ 2013-02-15  7:15 ` jakub at gcc dot gnu.org
  2013-02-15  9:33 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: jakub at gcc dot gnu.org @ 2013-02-15  7:15 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #22 from Jakub Jelinek <jakub at gcc dot gnu.org> 2013-02-15 07:14:44 UTC ---
Yeah.  We need to either find another way how to present vectorizer with the
ifconverted statement (perhaps similar to how pattern matching works, I think
if-conversion only handles inner-most loops, so maybe just as an alternative
gimple_seq for the single bb of the loop body?), or we need some kind of
if-unconversion pass.
See the http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01659.html patch which
contains much more limited version of if-unconversion, unfortunately it is
going to be harder to handle more complex conditions efficiently.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (21 preceding siblings ...)
  2013-02-15  7:15 ` jakub at gcc dot gnu.org
@ 2013-02-15  9:33 ` rguenth at gcc dot gnu.org
  2013-02-15  9:36 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: rguenth at gcc dot gnu.org @ 2013-02-15  9:33 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> 2013-02-15 09:33:22 UTC ---
(In reply to comment #20)
> (In reply to comment #12)
> > --- by-val-O3.s.orig    2013-02-14 18:06:56.000000000 +0100
> > +++ by-val-O3.s 2013-02-14 18:07:23.000000000 +0100
> > @@ -357,9 +357,8 @@
> >         shrq    $32, %rdi
> >         cmpq    %r8, %rdx
> >         cmovbe  %r11, %rdi
> > -       addq    $1, %rax
> > -       cmpq    %r8, %rdx
> >         cmovbe  %rdx, %rcx
> > +       addq    $1, %rax
> >         cmpq    %rbp, %rax
> >         movq    %rcx, -8(%rsi,%rax,8)
> >         jne     .L50
> > 
> > unmodified: Took 14.31 seconds total.
> > modified:   Took 13.04 seconds total.
> > 
> > So re. comment #9: it's not the problem but it'd be a small improvement.
> 
> FWIW this comes from not eliminating the condition expression in
> the conditional moves that ifcvt creates:
> 
>   tmp_97 = tmp_93 > 4294967295 ? tmp_95 : tmp_93;
>   carry_105 = tmp_93 > 4294967295 ? carry_94 : 0;
> 
> I'm surprised this form is allowed at all, I'd expect we only allow
> is_gimple_reg() for a COND_EXPR_COND in a RHS context.

Yeah, it's on my list (even with partial patches available ...).
Note that then vectorizing a COND_EXPR is not different from being
able to vectorize a comparison statement.

   pred_2 = tmp_93 > 4294967295;
   tmp_97 = pred_2 ? tmp_95 : tmp_93;
   carry_105 = pred_2 ? carry_94 : 0;

this form, both vectorized and not vectorized has issues when doing
initial instruction selection during expand (with multiple uses of
pred_2 we don't TER it).  So it might be that we present combine and
other RTL optimizers with initial code they will not be able to handle
as well as what we do right now.

> Anyway -- separate problem.

Indeed.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (22 preceding siblings ...)
  2013-02-15  9:33 ` rguenth at gcc dot gnu.org
@ 2013-02-15  9:36 ` rguenth at gcc dot gnu.org
  2013-02-17  8:35 ` [Bug target/56309] conditional moves instead of compare and branch result " ubizjak at gmail dot com
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: rguenth at gcc dot gnu.org @ 2013-02-15  9:36 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #24 from Richard Biener <rguenth at gcc dot gnu.org> 2013-02-15 09:36:23 UTC ---
(In reply to comment #22)
> Yeah.  We need to either find another way how to present vectorizer with the
> ifconverted statement (perhaps similar to how pattern matching works, I think
> if-conversion only handles inner-most loops, so maybe just as an alternative
> gimple_seq for the single bb of the loop body?), or we need some kind of
> if-unconversion pass.
> See the http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01659.html patch which
> contains much more limited version of if-unconversion, unfortunately it is
> going to be harder to handle more complex conditions efficiently.

Well, ideally if-conversion would work as part of vectorization.
if-conversion is nothing else than pattern detection - but of course the
complication is that the vectorizer would need to be taught to vectorize
loops with more than a single basic-block (even though the vectorized
loop would again contain a single basic-block).


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (23 preceding siblings ...)
  2013-02-15  9:36 ` rguenth at gcc dot gnu.org
@ 2013-02-17  8:35 ` ubizjak at gmail dot com
  2013-02-17  8:38 ` ubizjak at gmail dot com
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-17  8:35 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|-O3 optimizer generates     |conditional moves instead
                   |conditional moves instead   |of compare and branch
                   |of compare and branch       |result in almost 2x slower
                   |resulting in almost 2x      |code
                   |slower code                 |

--- Comment #25 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-17 08:35:17 UTC ---
(In reply to comment #21)
> (In reply to comment #18)

> But in this case we are giving up vectorization for some cases, aren't we?

Yes, and please note that vectorized conditional move doesn't result in a cmov
instruction, but in a sequence of sse logic isns. So, as far as x86 targets are
concerned, this bug has nothing to do with a vectorizer.

Maybe we should decompose scalar cmoves into the same sequence?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (24 preceding siblings ...)
  2013-02-17  8:35 ` [Bug target/56309] conditional moves instead of compare and branch result " ubizjak at gmail dot com
@ 2013-02-17  8:38 ` ubizjak at gmail dot com
  2013-02-17 10:10 ` ubizjak at gmail dot com
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-17  8:38 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jake.stine at gmail dot com

--- Comment #26 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-17 08:38:24 UTC ---
Another analysis by Jake in PR54037:

I have done quite a bit of analysis on cmov performance across x86
architectures, so I will share here in case it helps:

Quick summary: Conditional moves on Intel Core/Xeon and AMD Bulldozer
architectures should probably be avoided "as a rule."

History: Conditional moves were beneficial for the Intel Pentium 4, and also
(but less-so) for AMD Athlon/Phenom chips.  In the AMD Athlon/Phenom case the
performance of cmov vs cmp+branch is determined more by the alignment of the
target of the branch, than by the prediction rate of the branch.  The
instruction decoders would incur penalties on certain types of unaligned branch
targets (when taken), or when decoding sequences of instructions that contained
multiple branches within a 16byte "fetch" window (taken or not).  cmov was
sometimes handy for avoiding those.

With regard to more current Intel Core and AMD Bulldozer/Bobcat architecture:

I have found that use of conditional moves (cmov) is only beneficial if the
branch that the move is replacing is badly mis-predicted.  In my tests, the
cmov only became clearly "optimal" when the branch was predicted correctly less
than 92% of the time, which is abysmal by modern branch predictor standards and
rarely occurs in practice.  Above 97% prediction rates, cmov is typically
slower than cmp+branch. Inside loops that contain branches with prediction
rates approaching 100% (as is the case presented by the OP), cmov becomes a
severe performance bottleneck.  This holds true for both Core and Bulldozer. 
Bulldozer has less efficient branching than the i7, but is also severely
bottlenecked by its limited fetch/decode.  Cmov requires executing more total
instructions, and that makes Bulldozer very unhappy.

Note that my tests involved relatively simple loops that did not suffer from
the added register pressure that cmov introduces.  In practice, the prognosis
for cmov being "optimal" is even worse than what I've observed in a controlled
environment.  Furthermore, to my knowledge the status of cmov vs. branch
performance on x86 will not be changing anytime soon.  cmov will continue to be
a liability well into the next couple architecture releases from Intel and AMD.
 Piledriver will have added fetch/decode resources but should also have a
smaller mispredict penalty, so its doubtful cmov will gain much advantages
there either.

Therefore I would recommend setting -fno-tree-loop-if-convert for all -march
matching Intel Core and AMD Bulldozer/Bobcat families.


There is one good use-case for cmov on x86:  Mis-predicted conditions inside of
loops.  Currently there's no way to force that behavior in situations where I,
the programmer, am fully aware that the condition is chaotic/random.  A builtin
cmov or condition hint would be nice.  For now I'm forced to address those
(fortunately infrequent) situations via inline asm.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (25 preceding siblings ...)
  2013-02-17  8:38 ` ubizjak at gmail dot com
@ 2013-02-17 10:10 ` ubizjak at gmail dot com
  2013-09-22  6:37 ` icegood1980 at gmail dot com
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-02-17 10:10 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #27 from Uros Bizjak <ubizjak at gmail dot com> 2013-02-17 10:10:16 UTC ---
Some additional analysis:

http://gcc.gnu.org/ml/gcc/2008-02/msg00601.html


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (26 preceding siblings ...)
  2013-02-17 10:10 ` ubizjak at gmail dot com
@ 2013-09-22  6:37 ` icegood1980 at gmail dot com
  2013-10-18  9:12 ` jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: icegood1980 at gmail dot com @ 2013-09-22  6:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

icegood1980 at gmail dot com <icegood1980 at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |icegood1980 at gmail dot com

--- Comment #28 from icegood1980 at gmail dot com <icegood1980 at gmail dot com> ---
hmmm, i also found that under x86
using of __builtin_expect makes code slower...


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (27 preceding siblings ...)
  2013-09-22  6:37 ` icegood1980 at gmail dot com
@ 2013-10-18  9:12 ` jakub at gcc dot gnu.org
  2013-10-18 10:06 ` ubizjak at gmail dot com
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: jakub at gcc dot gnu.org @ 2013-10-18  9:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #29 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
FYI, I've posted patchset to only do tree if-conversion for vectorization and
not for scalar basic blocks (and also for conditional load/store support in
vectorized loops using AVX vmaskmov* and AVX2 masked gathers) in
http://gcc.gnu.org/ml/gcc-patches/2013-10/msg01373.html
I'd appreciate any benchmark results for that.

On the by-ref-O3.ii testcase there is no change, but by-val-O3.ii seems to be
consistently faster, around 20% on AMD 8354 and 35% on Intel 2600.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (28 preceding siblings ...)
  2013-10-18  9:12 ` jakub at gcc dot gnu.org
@ 2013-10-18 10:06 ` ubizjak at gmail dot com
  2015-08-06  9:35 ` ysrumyan at gmail dot com
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: ubizjak at gmail dot com @ 2013-10-18 10:06 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #30 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Jakub Jelinek from comment #29)

> On the by-ref-O3.ii testcase there is no change, but by-val-O3.ii seems to
> be consistently faster, around 20% on AMD 8354 and 35% on Intel 2600.

According to detailed analysis in Comment 26, perhaps a new tuning flag or
target hook could be used for if-conversion of scalars. Also, I think that -Os
can still use cmoves due to smaller code size.
>From gcc-bugs-return-432112-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org Fri Oct 18 10:12:18 2013
Return-Path: <gcc-bugs-return-432112-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Delivered-To: listarch-gcc-bugs@gcc.gnu.org
Received: (qmail 25233 invoked by alias); 18 Oct 2013 10:12:17 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Delivered-To: mailing list gcc-bugs@gcc.gnu.org
Received: (qmail 25214 invoked by uid 48); 18 Oct 2013 10:12:15 -0000
From: "markus at trippelsdorf dot de" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug bootstrap/58780] Cannot build gcc-4.8.3 on system with binutils build by gcc-4.9
Date: Fri, 18 Oct 2013 10:12:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: bootstrap
X-Bugzilla-Version: 4.8.3
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: markus at trippelsdorf dot de
X-Bugzilla-Status: RESOLVED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_status resolution
Message-ID: <bug-58780-4-sGj2YLQENs@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-58780-4@http.gcc.gnu.org/bugzilla/>
References: <bug-58780-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2013-10/txt/msg01256.txt.bz2
Content-length: 480

http://gcc.gnu.org/bugzilla/show_bug.cgi?idX780

Markus Trippelsdorf <markus at trippelsdorf dot de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #2 from Markus Trippelsdorf <markus at trippelsdorf dot de> ---
I guess this issue is not easily fixable.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (29 preceding siblings ...)
  2013-10-18 10:06 ` ubizjak at gmail dot com
@ 2015-08-06  9:35 ` ysrumyan at gmail dot com
  2021-09-04 20:24 ` peter at cordes dot ca
  2021-09-04 20:26 ` peter at cordes dot ca
  32 siblings, 0 replies; 34+ messages in thread
From: ysrumyan at gmail dot com @ 2015-08-06  9:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #34 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Created attachment 36138
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36138&action=edit
simple reproducer

Use -O3 -std=c++14 options to compile and -fno-tree-loop-if-convert to get
better performance


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (30 preceding siblings ...)
  2015-08-06  9:35 ` ysrumyan at gmail dot com
@ 2021-09-04 20:24 ` peter at cordes dot ca
  2021-09-04 20:26 ` peter at cordes dot ca
  32 siblings, 0 replies; 34+ messages in thread
From: peter at cordes dot ca @ 2021-09-04 20:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #36 from Peter Cordes <peter at cordes dot ca> ---
Related:  a similar case of cmov being a worse choice, for a threshold
condition with an array input that happens to already be sorted:

https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2

GCC with -fprofile-generate / -fprofile-use does correctly decide to use
branches.

GCC7 and later (including current trunk) with -O3 -fno-tree-vectorize
de-optimizes by putting the CMOV on the critical path, instead of as part of
creating a zero/non-zero input for the ADD. PR82666.  If you do allow full -O3,
then vectorization is effective, though.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
  2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
                   ` (31 preceding siblings ...)
  2021-09-04 20:24 ` peter at cordes dot ca
@ 2021-09-04 20:26 ` peter at cordes dot ca
  32 siblings, 0 replies; 34+ messages in thread
From: peter at cordes dot ca @ 2021-09-04 20:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #37 from Peter Cordes <peter at cordes dot ca> ---
Correction, PR82666 is that the cmov on the critical path happens even at -O2
(with GCC7 and later).  Not just with -O3 -fno-tree-vectorize.

Anyway, that's related, but probably separate from choosing to do if-conversion
or not after inlining.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-09-04 20:26 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-13 20:21 [Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code arturomdn at gmail dot com
2013-02-13 20:23 ` [Bug middle-end/56309] " arturomdn at gmail dot com
2013-02-13 20:26 ` [Bug target/56309] " pinskia at gcc dot gnu.org
2013-02-13 20:29 ` arturomdn at gmail dot com
2013-02-14  8:16 ` ubizjak at gmail dot com
2013-02-14  8:22 ` ubizjak at gmail dot com
2013-02-14  8:28 ` ubizjak at gmail dot com
2013-02-14 14:49 ` ysrumyan at gmail dot com
2013-02-14 15:53 ` arturomdn at gmail dot com
2013-02-14 16:01 ` arturomdn at gmail dot com
2013-02-14 16:44 ` arturomdn at gmail dot com
2013-02-14 16:59 ` steven at gcc dot gnu.org
2013-02-14 17:10 ` steven at gcc dot gnu.org
2013-02-14 17:26 ` [Bug tree-optimization/56309] " steven at gcc dot gnu.org
2013-02-14 17:31 ` arturomdn at gmail dot com
2013-02-14 17:34 ` ubizjak at gmail dot com
2013-02-14 17:43 ` arturomdn at gmail dot com
2013-02-14 17:50 ` ubizjak at gmail dot com
2013-02-14 18:23 ` ubizjak at gmail dot com
2013-02-14 18:35 ` [Bug target/56309] " ubizjak at gmail dot com
2013-02-14 22:18 ` steven at gcc dot gnu.org
2013-02-15  6:50 ` izamyatin at gmail dot com
2013-02-15  7:15 ` jakub at gcc dot gnu.org
2013-02-15  9:33 ` rguenth at gcc dot gnu.org
2013-02-15  9:36 ` rguenth at gcc dot gnu.org
2013-02-17  8:35 ` [Bug target/56309] conditional moves instead of compare and branch result " ubizjak at gmail dot com
2013-02-17  8:38 ` ubizjak at gmail dot com
2013-02-17 10:10 ` ubizjak at gmail dot com
2013-09-22  6:37 ` icegood1980 at gmail dot com
2013-10-18  9:12 ` jakub at gcc dot gnu.org
2013-10-18 10:06 ` ubizjak at gmail dot com
2015-08-06  9:35 ` ysrumyan at gmail dot com
2021-09-04 20:24 ` peter at cordes dot ca
2021-09-04 20:26 ` peter at cordes dot ca

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).